分类图表#

Bokeh 提供了多种处理和可视化分类数据的方法。分类数据是指可以划分为不同的组或类别的数据，可以有或没有自然的顺序或数值。示例包括表示国家、产品名称或颜色的数据。与可能表示温度或距离等值的连续数据不同，分类数据是关于标签和分组的。

许多数据集包含连续数据和分类数据。例如，一个数据集包含不同国家/地区不同产品的在一段时间内的销售数量。

分类数据也可能每个类别有多个值。之前的章节，如条形图和散点图已经介绍了一些可视化每个类别具有单个值的分类数据的方法。

本章重点介绍更复杂的分类数据，其中每个类别都有一系列值，以及具有一个或多个分类变量的数据集。

一个分类变量#

带有抖动的分类散点图#

关于散点图的章节包含可视化每个类别具有单个值的数据的示例。

如果您的数据包含每个类别的多个值，则可以使用分类散点图可视化您的数据。如果您有每周不同日期的不同系列测量值，这将非常有用。

为了避免单个类别的众多散点重叠，请使用jitter()函数为每个点提供一个随机偏移。

下面的示例显示了 2012 年至 2016 年之间 GitHub 用户每次提交时间的散点图。它使用星期几作为类别来分组提交。默认情况下，此图会显示数千个点，每个点都重叠在每天的窄线上。jitter 函数使您可以区分这些点，从而生成有用的图

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show
from bokeh.sampledata.commits import data
from bokeh.transform import jitter

DAYS = ['Sun', 'Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon']

source = ColumnDataSource(data)

p = figure(width=800, height=300, y_range=DAYS, x_axis_type='datetime',
           title="Commits by Time of Day (US/Central) 2012-2016")

p.scatter(x='time', y=jitter('day', width=0.6, range=p.y_range), source=source, alpha=0.3)

p.xaxis.formatter.days = '%Hh'
p.x_range.range_padding = 0
p.ygrid.grid_line_color = None

show(p)

带有偏移的分类序列#

可视化分类数据的一个简单示例是使用条形图来表示每个类别的单个值。

但是，如果您想表示每个类别的有序数据序列，则可以使用分类偏移来定位每个类别值的字形。除了带有躲避的视觉偏移之外，分类偏移还提供了对类别“内部”定位的显式控制。

要显式地为分类位置提供偏移量，请在类别的末尾添加一个数值。例如：["Jan", 0.2] 给类别“Jan”一个 0.2 的偏移量。

对于多级类别，在现有列表的末尾添加值：["West", "Sales", -0,2]。Bokeh 将类别列表末尾的任何数值解释为偏移量。

以“条形图”章节中的水果示例为例，通过添加offsets列表来修改它

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']

offsets = [-0.5, -0.2, 0.0, 0.3, 0.1, 0.3]

# This results in [ ['Apples', -0.5], ['Pears', -0.2], ... ]
x = list(zip(fruits, offsets))

p.vbar(x=x, top=[5, 3, 4, 2, 4, 6], width=0.8)

这将使每个条形在水平方向上移动相应的偏移量。

下面是一个更复杂的山脊图示例。它使用分类偏移来指定每个类别的补丁坐标

import colorcet as cc
from numpy import linspace
from scipy.stats import gaussian_kde

from bokeh.models import ColumnDataSource, FixedTicker, PrintfTickFormatter
from bokeh.plotting import figure, show
from bokeh.sampledata.perceptions import probly


def ridge(category, data, scale=20):
    return list(zip([category]*len(data), scale*data))

cats = list(reversed(probly.keys()))

palette = [cc.rainbow[i*15] for i in range(17)]

x = linspace(-20, 110, 500)

source = ColumnDataSource(data=dict(x=x))

p = figure(y_range=cats, width=900, x_range=(-5, 105), toolbar_location=None)

for i, cat in enumerate(reversed(cats)):
    pdf = gaussian_kde(probly[cat])
    y = ridge(cat, pdf(x))
    source.add(y, cat)
    p.patch('x', cat, color=palette[i], alpha=0.6, line_color="black", source=source)

p.outline_line_color = None
p.background_fill_color = "#efefef"

p.xaxis.ticker = FixedTicker(ticks=list(range(0, 101, 10)))
p.xaxis.formatter = PrintfTickFormatter(format="%d%%")

p.ygrid.grid_line_color = None
p.xgrid.grid_line_color = "#dddddd"
p.xgrid.ticker = p.xaxis.ticker

p.axis.minor_tick_line_color = None
p.axis.major_tick_line_color = None
p.axis.axis_line_color = None

p.y_range.range_padding = 0.12

show(p)

坡度图#

坡度图是用于可视化两个或多个数据点之间相对变化的图表。例如，这对于可视化两个类别之间的差异或类别内变量随时间的变化非常有用。

在坡度图中，您将单个测量值可视化为排列成两列的点，并通过用线连接成对的点来指示配对。每条线的斜率突出显示了变化的大小和方向。

以下坡度图可视化了不同国家/地区每人二氧化碳排放量在几年或几十年内的相对变化。它使用Segment字形来绘制连接成对点的线

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show
from bokeh.sampledata.emissions import data

countries = [
    "Trinidad and Tobago",
    "Qatar",
    "United Arab Emirates",
    "Oman",
    "Bahrain",
    "Singapore",
    "Netherlands Antilles",
    "Kazakhstan",
    "Equatorial Guinea",
    "Kuwait",
]

df = data[data["country"].isin(countries)].reset_index(drop=True)
years = (df["year"] == 2000.0) | (df["year"] == 2010.0)

df = df[years].reset_index(drop=True)
df["year"] = df.year.astype(int)
df["year"] = df.year.astype(str)

# create separate columns for the different years
a = df[df["year"] == "2000"]
b = df[df["year"] == "2010"]
new_df = a.merge(b, on="country")

source = ColumnDataSource(new_df)

p = figure(x_range=("2000", "2010"), y_range=(0, 60), x_axis_location="above", y_axis_label="CO2 emissions (tons / person)")

p.scatter(x="year_x", y="emissions_x", source=source, size=7)

p.scatter(x="year_y", y="emissions_y", source=source, size=7)

p.segment(x0="year_x", y0="emissions_x", x1="year_y", y1="emissions_y", source=source, color="black")

p.text(x="year_y", y="emissions_y", text="country", source=source, x_offset=7, y_offset=8, text_font_size="12px")

p.xaxis.major_tick_line_color = None
p.xaxis.major_tick_out = 0
p.xaxis.axis_line_color = None

p.yaxis.minor_tick_out = 0
p.yaxis.major_tick_in = 0
p.yaxis.ticker = [0, 20, 40, 60]

p.grid.grid_line_color = None
p.outline_line_color = None

show(p)

两个或多个分类变量#

分类热图#

可以有与类别对关联的值。在这种情况下，将不同的颜色阴影应用于表示类别对的矩形将产生分类热图。这是一个具有两个分类轴的图表。

下图在其 x 轴上列出了 1948 年至 2016 年的年份，在其 y 轴上列出了月份。图表的每个矩形对应一个 (year, month) 对。矩形的颜色表示给定年份给定月份的失业率。

此示例使用linear_cmap()来映射绘图的颜色，因为失业率是一个连续变量。此图还使用construct_color_bar()在右侧提供可视化图例

from math import pi

import pandas as pd

from bokeh.models import BasicTicker, PrintfTickFormatter
from bokeh.plotting import figure, show
from bokeh.sampledata.unemployment1948 import data
from bokeh.transform import linear_cmap

data['Year'] = data['Year'].astype(str)
data = data.set_index('Year')
data.drop('Annual', axis=1, inplace=True)
data.columns.name = 'Month'

years = list(data.index)
months = list(reversed(data.columns))

# reshape to 1D array or rates with a month and year for each row.
df = pd.DataFrame(data.stack(), columns=['rate']).reset_index()

# this is the colormap from the original NYTimes plot
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]

TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom"

p = figure(title=f"US Unemployment ({years[0]} - {years[-1]})",
           x_range=years, y_range=months,
           x_axis_location="above", width=900, height=400,
           tools=TOOLS, toolbar_location='below',
           tooltips=[('date', '@Month @Year'), ('rate', '@rate%')])

p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "7px"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = pi / 3

r = p.rect(x="Year", y="Month", width=1, height=1, source=df,
           fill_color=linear_cmap("rate", colors, low=df.rate.min(), high=df.rate.max()),
           line_color=None)

p.add_layout(r.construct_color_bar(
    major_label_text_font_size="7px",
    ticker=BasicTicker(desired_num_ticks=len(colors)),
    formatter=PrintfTickFormatter(format="%d%%"),
    label_standoff=6,
    border_line_color=None,
    padding=5,
), 'right')

show(p)

以下元素周期表使用了本章中的几种技术

from bokeh.plotting import figure, show
from bokeh.sampledata.periodic_table import elements
from bokeh.transform import dodge, factor_cmap

periods = ["I", "II", "III", "IV", "V", "VI", "VII"]
groups = [str(x) for x in range(1, 19)]

df = elements.copy()
df["atomic mass"] = df["atomic mass"].astype(str)
df["group"] = df["group"].astype(str)
df["period"] = [periods[x-1] for x in df.period]
df = df[df.group != "-"]
df = df[df.symbol != "Lr"]
df = df[df.symbol != "Lu"]

cmap = {
    "alkali metal"         : "#a6cee3",
    "alkaline earth metal" : "#1f78b4",
    "metal"                : "#d93b43",
    "halogen"              : "#999d9a",
    "metalloid"            : "#e08d49",
    "noble gas"            : "#eaeaea",
    "nonmetal"             : "#f1d4Af",
    "transition metal"     : "#599d7A",
}

TOOLTIPS = [
    ("Name", "@name"),
    ("Atomic number", "@{atomic number}"),
    ("Atomic mass", "@{atomic mass}"),
    ("Type", "@metal"),
    ("CPK color", "$color[hex, swatch]:CPK"),
    ("Electronic configuration", "@{electronic configuration}"),
]

p = figure(title="Periodic Table (omitting LA and AC Series)", width=1000, height=450,
           x_range=groups, y_range=list(reversed(periods)),
           tools="hover", toolbar_location=None, tooltips=TOOLTIPS)

r = p.rect("group", "period", 0.95, 0.95, source=df, fill_alpha=0.6, legend_field="metal",
           color=factor_cmap('metal', palette=list(cmap.values()), factors=list(cmap.keys())))

text_props = dict(source=df, text_align="left", text_baseline="middle")

x = dodge("group", -0.4, range=p.x_range)

p.text(x=x, y="period", text="symbol", text_font_style="bold", **text_props)

p.text(x=x, y=dodge("period", 0.3, range=p.y_range), text="atomic number",
       text_font_size="11px", **text_props)

p.text(x=x, y=dodge("period", -0.35, range=p.y_range), text="name",
       text_font_size="7px", **text_props)

p.text(x=x, y=dodge("period", -0.2, range=p.y_range), text="atomic mass",
       text_font_size="7px", **text_props)

p.text(x=["3", "3"], y=["VI", "VII"], text=["LA", "AC"], text_align="center", text_baseline="middle")

p.outline_line_color = None
p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_standoff = 0
p.legend.orientation = "horizontal"
p.legend.location ="top_center"
p.hover.renderers = [r] # only hover element boxes

show(p)