Interactive plots with Bokeh
I’m most comfortable with Matplotlib. I think that’s obvious at this point. I’m also well aware that it has a (*ahem*) troubled reputation, despite its large footprint in the data world.
Much of that reputation is deserved, of course. It’s over 20 years old and the design shows it. It usually provides several puzzlingly different ways to accomplish the same goal and they’re all clunky.
One of its biggest problems is that the default style is unforgivably ugly. People see the output and get turned off. They realize how much effort it would take to produce an average-looking plot and decide it isn’t worth the trouble. I think that’s a shame because once you have a feel for the library, it offers probably the finest degree of control in customizing your work.

I didn’t intend for this blog to become exclusively the realm of Matplotlib, but I enjoy taking a library that most people dislike and creating (what I hope are) nice data visualizations.
That said, I think it’s a good time to step outside my comfort zone and use a different tool. Bokeh is a data viz library that helps you create clean, modern-looking plots that embed in a web page. You could easily spin up a dashboard and display stock prices, weather data, or anything else. One of Bokeh’s biggest draws is that, unlike Matplotlib, it looks good without putting any real effort into the style.
For me, its most exciting feature is that the plots can be interactive. Even default plots allow users to zoom and drag the window. You can push it further with sliders, spinners, and other built-in widgets, all backed up by fully customizable JavaScript.
Some of its methods are unintuitive, in my opinion, but it’s understandable because Bokeh is designed to translate Python code into JavaScript. It’s kind of like writing instructions for a robot to bake a cake rather than baking it yourself.
Bokeh gets bonus points for dumping all the code into a single HTML file. You could easily share your work with someone and they wouldn’t need to install anything. They could use the interactive features without a server running somewhere in the background.
This post will focus mostly on the visualization. I won’t spend too much time gathering data. Instead I’ll extend my previous post about county-level election results. In that post, we looked at population change and how it related to partisan preference in the 2024 election. Here, we’ll narrow the scope to counties in California, but we’ll add two new variables:
- The percentage of residents age 25+ with a bachelor’s degree or higher (S1501).
- Median annual earnings of residents age 16+ (S2001).
Both of these statistics are available from the US Census Bureau as part of the American Community Survey. Our primary focus will be on the relationship between education and vote preference. Population and median earnings will add a degree of interactivity to the plot.
1. Prepare the data.
I’m going to skip past merging the datasets. I already covered what’s essentially the same process in my previous post, so click the link for a more detailed explanation. Today, we’ll start with a cleaned, ready-to-go county-level CSV that holds all the relevant variables.
Load the dataset into a pandas DataFrame and filter it down to California rows. The set_option call stops pandas from wrapping wide DataFrames across several lines, so every column prints in one row on screen.
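import pandas as pd

pd.set_option("display.expand_frame_repr", False)

df = pd.read_csv("county_data.csv")
# .copy() gives us an independent DataFrame, so adding columns later
# doesn't trigger pandas' SettingWithCopyWarning
df = df[df['state'] == "California"].copy()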
The output of df.head() is shown below. County and state have their own columns, as do the three demographic variables. dem_margin is the percentage point difference between Democratic and Republican vote share. A 55-45 blue county would show up as 10.0.
                 county       state  population  bachelor_or_higher_pct  median_earnings  dem_margin
41       Alameda County  California     1649060               51.901837            64242   53.562556
42         Butte County  California      208334               29.851859            36717   -3.121991
43  Contra Costa County  California     1172607               46.191173            60735   37.946395
44     El Dorado County  California      192823               40.665969            55187  -11.981410
45        Fresno County  California     1024125               24.769337            40124   -4.385357
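To make the dem_margin definition concrete, here is a tiny sketch of the arithmetic. The function name vote_margin is mine, not something from the dataset, and it assumes both inputs are vote shares expressed in percent.

```python
def vote_margin(dem_share: float, rep_share: float) -> float:
    """Democratic margin in percentage points; negative means a Republican lead."""
    return dem_share - rep_share


# A 55-45 blue county
print(vote_margin(55.0, 45.0))   # 10.0

# A county where the Republican led by three points
print(vote_margin(47.5, 50.5))   # -3.0
```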
We have education and earnings data for 42 of California’s 58 counties. I’d love to have all 58 but unfortunately the American Community Survey doesn’t have universal coverage. The Census Bureau can’t collect enough responses in some counties to provide a reliable estimate. Still, those 42 counties account for approximately 99% of the state’s population, so it’s not a huge loss in terms of people.
I’m getting ahead of myself and we have to take care of something before moving on. The plan is to create a scatter plot of education and vote preference, and hovering over a marker will display that county’s population and median earnings. Since it will be presented to the user, we should clean up how figures are presented on screen. For example, earnings of 50000 can be displayed as $50,000. Similarly, let’s include commas in population numbers. Create new columns to hold display strings.
df.loc[:, 'population_display'] = df['population'].apply(lambda x: f"{x:,}")
df.loc[:, 'earnings_display'] = df['median_earnings'].apply(lambda x: f"${x:,}")
Now df.head() looks like this. I think the two columns on the right are much easier to read.
                 county       state  population  bachelor_or_higher_pct  median_earnings  dem_margin population_display earnings_display
41       Alameda County  California     1649060               51.901837            64242   53.562556          1,649,060          $64,242
42         Butte County  California      208334               29.851859            36717   -3.121991            208,334          $36,717
43  Contra Costa County  California     1172607               46.191173            60735   37.946395          1,172,607          $60,735
44     El Dorado County  California      192823               40.665969            55187  -11.981410            192,823          $55,187
45        Fresno County  California     1024125               24.769337            40124   -4.385357          1,024,125          $40,124
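The display strings come from Python’s format mini-language, where a comma in the format spec adds thousands separators. Here is a small standalone sketch of the same idea. The helper names format_count and format_dollars are hypothetical, and the rounding step is my own addition in case an earnings column ever arrives as floats, where the plain f-string would print a trailing decimal.

```python
def format_count(value: int) -> str:
    """Insert thousands separators, e.g. 208334 becomes '208,334'."""
    return f"{value:,}"


def format_dollars(value: float) -> str:
    """Format as whole dollars. round() returns an int, so no decimals are printed."""
    return f"${round(value):,}"


print(format_count(1649060))     # 1,649,060
print(format_dollars(64242))     # $64,242
print(format_dollars(64242.4))   # $64,242

# Without rounding, a float keeps its decimal part
print(f"${64242.0:,}")           # $64,242.0
```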
Now run a simple linear regression of dem_margin on bachelor_or_higher_pct. This will give us a “line of best fit” through the trend.
from scipy.stats import linregress

slope, intercept, r_value, p_value, std_err = linregress(df['bachelor_or_higher_pct'], df['dem_margin'])

x_reg = [df['bachelor_or_higher_pct'].min(), df['bachelor_or_higher_pct'].max()]
y_reg = [n * slope + intercept for n in x_reg]
R² measures how tightly data points fit the trend line. If you square r_value to get R², you’ll find it’s 0.63. That’s fairly high for a real-world relationship like this.
In fact, California has the 8th-highest R² of all the states in our dataset. That’s one reason the state was a good subject for this post. Nationwide, R² for all 830 counties is 0.35. So the relationship between education and vote is considerably stronger in California than in the US as a whole.
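If you’re curious what the regression is doing under the hood, here is a minimal pure-Python sketch of ordinary least squares for one predictor. It is not how SciPy implements linregress, just the textbook formulas. The function name simple_linregress and the four data points are made up for illustration and are not taken from the county dataset.

```python
import math


def simple_linregress(xs, ys):
    """Return (slope, intercept, r) for a least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Made-up illustration data: education percentage vs. vote margin
education = [20.0, 30.0, 40.0, 50.0]
margin = [-10.0, 5.0, 15.0, 40.0]

slope, intercept, r = simple_linregress(education, margin)

# Two endpoints are enough to draw a straight line
x_reg = [min(education), max(education)]
y_reg = [x * slope + intercept for x in x_reg]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R² = {r ** 2:.2f}")
```

Squaring r gives the same R² that shows up in the plot title later on.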
As a disclaimer, linear regressions don’t measure cause and effect. It’s always worth reminding yourself of that, especially in a political context. A causal link is possible, but you would have to do more work to show that x causes y.
2. Plot the data.
Now we can build a scatter plot.
As a Matplotlib user, I find a Bokeh figure object similar to a Matplotlib Axes. We can call figure.line or figure.scatter to plot the data.
There are differences, though. For one, we set the title and axis labels right away in the code below. Setting sizing_mode to “stretch_width” allows the figure to expand and fill its parent container. It’s probably not necessary in most cases but it helps the output display on this page.
from bokeh.plotting import figure

fig = figure(height=400,
             sizing_mode="stretch_width",
             title=f"California Counties | 2024 Presidential Vote | R² = {r_value ** 2:.2f}",
             x_axis_label="Bachelor's Degree or Higher",
             y_axis_label="Vote Margin")
A ColumnDataSource object is how we’ll pass data to the plotting methods. It’s a Bokeh data structure that works much like a Python dictionary. You could pass a plain dictionary and it would work just as well. Conveniently for us, it also accepts pandas DataFrames.
from bokeh.models import ColumnDataSource

data_source = ColumnDataSource(data=df)
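The “works much like a dictionary” part is easier to see with the data laid out by hand. The sketch below uses plain Python rather than Bokeh itself to show the column-oriented shape: one key per column, each holding a list of the same length. The helper rows_to_columns is hypothetical, and the numbers are rounded from the first few rows of the table above.

```python
# Row-oriented records, the way you might read them off the table
rows = [
    {"county": "Alameda County", "bachelor_or_higher_pct": 51.9, "dem_margin": 53.6},
    {"county": "Butte County", "bachelor_or_higher_pct": 29.9, "dem_margin": -3.1},
    {"county": "Fresno County", "bachelor_or_higher_pct": 24.8, "dem_margin": -4.4},
]


def rows_to_columns(records):
    """Pivot a list of row dicts into a dict of equal-length column lists."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


columns = rows_to_columns(rows)

print(columns["county"])
print(columns["dem_margin"])
```

A dictionary in this shape is the kind of thing you could hand to a ColumnDataSource in place of a DataFrame. It’s also the shape the slider callback has to rebuild later in this post.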
Now let’s call scatter on the figure we created a moment ago.
Our ColumnDataSource gets passed to the source parameter. Then declare which variables go along the horizontal and vertical axes and set the marker style.
It’s not strictly necessary but I want to save the scatter plot as a variable (scatter1) so we can enable hover tooltips on it later.
scatter1 = fig.scatter(source=data_source,
                       x="bachelor_or_higher_pct",
                       y="dem_margin",
                       size=12,
                       color="#7865BF",
                       alpha=0.75)
Now draw the regression line on the figure. We saved its points in the x_reg and y_reg lists.
This doesn’t need to be assigned a variable name because we don’t want tooltips enabled for the line. We only want to show county population and earnings when hovering over a marker.
fig.line(x=x_reg, y=y_reg, line_width=2, color="#333333")
Next, define the window. Bokeh doesn’t accept range types so we have to convert x-ticks into a Python list. Or use numpy.arange or similar.
The x variable is the percentage of residents with a bachelor’s degree or higher, so let’s label the ticks with a % symbol. Set major_label_overrides to a dictionary of {tick: label} pairs.
x_ticks = list(range(10, 70, 10))
fig.xaxis.ticker = x_ticks
fig.xaxis.major_label_overrides = {n: f"{n}%" for n in x_ticks}
fig.x_range.start = 8
fig.x_range.end = 63
The process is essentially the same for the y-axis. I wrote a function to convert vote margins into D-R partisanship. For example, -20 becomes R+20.
def get_ytick_labels(ticks):
    labels = []
    for tick in ticks:
        if tick > 0:
            labels.append(f"D+{tick}")
        elif tick < 0:
            labels.append(f"R+{abs(tick)}")
        else:
            labels.append("TIE")
    return dict(zip(ticks, labels))

y_ticks = list(range(-40, 80, 20))
fig.yaxis.ticker = y_ticks
fig.yaxis.major_label_overrides = get_ytick_labels(y_ticks)
fig.y_range.start = -50
fig.y_range.end = 70
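Here is the label function run on its own, without Bokeh, so you can see the dictionary it hands to major_label_overrides. The function body is the same as the one in this post, and the tick values are the same y_ticks list.

```python
def get_ytick_labels(ticks):
    """Map numeric vote margins to D+/R+ partisan labels."""
    labels = []
    for tick in ticks:
        if tick > 0:
            labels.append(f"D+{tick}")
        elif tick < 0:
            labels.append(f"R+{abs(tick)}")
        else:
            labels.append("TIE")
    return dict(zip(ticks, labels))


y_ticks = list(range(-40, 80, 20))
labels = get_ytick_labels(y_ticks)

for tick, label in labels.items():
    print(tick, label)
# -40 R+40
# -20 R+20
# 0 TIE
# 20 D+20
# 40 D+40
# 60 D+60
```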
To create hover tooltips, we pass a list of tuples to HoverTool. The first element of each tuple is the label shown on screen. The second element is @ followed by a column name from the data source.
This is why we saved the scatter instance as a variable. Only objects passed to renderers will have tooltips. The regression line won’t be included.
from bokeh.models import HoverTool

tool_tips = [("County", "@county"),
             ("Population", "@population_display"),
             ("Earnings", "@earnings_display")]

fig.add_tools(HoverTool(renderers=[scatter1], tooltips=tool_tips))
This is already more interactive than my usual Matplotlib plots but I think we can do better. It seems like a waste to stop at tooltips.
Let’s create a slider below the plot that sets minimum earnings. Only counties at or above the slider’s value will appear on the plot. So as you drag it to the right, counties with the lowest earnings will gradually disappear. You might be interested to see how the relationship looks for counties with above-average incomes.
start and end define the lowest and highest possible values of the slider. step is the distance between each value, e.g. with step=4 you could snap to 0, 4, 8, 12, etc. value is where the slider is located when the page loads. Let’s start it at the far left so all points are included. title and width define the widget’s appearance.
from bokeh.models import Slider

slider = Slider(start=30000,
                end=75000,
                step=1000,
                value=30000,
                title="Median Earnings (Minimum)",
                width=250)
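As a quick sanity check on start, end, and step, this sketch lists the positions a slider with those settings can land on. The helper slider_positions is hypothetical and only mimics the arithmetic. It assumes integer values and an end that sits a whole number of steps above start.

```python
def slider_positions(start: int, end: int, step: int) -> list[int]:
    """Every value reachable from start in increments of step, up to and including end."""
    return list(range(start, end + 1, step))


# The step=4 example from the text
print(slider_positions(0, 12, 4))        # [0, 4, 8, 12]

# The earnings slider used in this post
positions = slider_positions(30000, 75000, 1000)
print(len(positions))                    # 46
print(positions[0], positions[-1])       # 30000 75000
```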
Next is the complicated part (at least for me). We have to write a custom JavaScript function to handle the slider’s logic. Thankfully, there’s nothing too far out there. I’m far from a JavaScript expert and I managed to work through it.
A “callback” is a function that’s triggered when some action or event occurs, like a key press or a button click. In this case, we’ll create a callback function to run whenever the slider moves.
CustomJS is a little unintuitive because it tells Bokeh how to write a JavaScript function. It’s like meta-code. We have to create parameters and arguments at the same time.
First, save an extra copy of the ColumnDataSource. Every time the slider moves, it will filter values from the data source hooked up to the plot, so we’ll need an original copy to work from.
original_data_source = ColumnDataSource(data=df)
Now build the callback function. args creates parameters and defines the arguments that will be passed to them. We need the data_source that’s hooked up to the plot, the original_data_source that we’ll pull from, and the slider widget itself.
The code’s syntax is similar enough to Python. Every time the slider moves, we rebuild the whole data source from scratch. The function creates a JavaScript “dictionary” with an empty list for each column the plot and tooltips use. Then it steps through each row of the original, complete data source. If median earnings meet or exceed the slider’s value, that row’s values are appended to the lists. Finally, the dictionary is assigned back to data_source, which is hooked up to the plot.
It might look intimidating if you aren’t used to JavaScript, but take it slow and it will make sense. There’s nothing here that you haven’t done a million times in Python. Now that I have this working example, I’m confident that I’ll be able to create all kinds of custom logic for Bokeh widgets in the future. It boils down to filtering the data on every pass through the function and assigning it back to the ColumnDataSource.
from bokeh.models.callbacks import CustomJS

callback = CustomJS(args={"data_source": data_source,
                          "original_data_source": original_data_source,
                          "slider": slider},
                    code="""
    var original_data = original_data_source.data;
    var filtered_data = {bachelor_or_higher_pct: [],
                         dem_margin: [],
                         county: [],
                         population_display: [],
                         earnings_display: []};
    for (var i = 0; i < original_data['median_earnings'].length; i++) {
        if (original_data['median_earnings'][i] >= slider.value) {
            filtered_data.bachelor_or_higher_pct.push(original_data['bachelor_or_higher_pct'][i]);
            filtered_data.dem_margin.push(original_data['dem_margin'][i]);
            filtered_data.county.push(original_data['county'][i]);
            filtered_data.population_display.push(original_data['population_display'][i]);
            filtered_data.earnings_display.push(original_data['earnings_display'][i]);
        }
    }
    data_source.data = filtered_data;""")
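If the JavaScript is hard to follow, it may help to see the same filtering logic written in Python. This is only a sketch of what the callback does. It runs on a plain dictionary of lists, not on a real ColumnDataSource, and it plays no part in the Bokeh pipeline. The helper filter_columns is hypothetical, and the sample values are the first five rows of the table from earlier.

```python
def filter_columns(original, threshold_column, minimum, keep):
    """Rebuild a dict of column lists, keeping rows where threshold_column >= minimum."""
    filtered = {name: [] for name in keep}
    for i, value in enumerate(original[threshold_column]):
        if value >= minimum:
            for name in keep:
                filtered[name].append(original[name][i])
    return filtered


original_data = {
    "county": ["Alameda County", "Butte County", "Contra Costa County",
               "El Dorado County", "Fresno County"],
    "median_earnings": [64242, 36717, 60735, 55187, 40124],
    "earnings_display": ["$64,242", "$36,717", "$60,735", "$55,187", "$40,124"],
}

# Pretend the slider sits at 50,000
slider_value = 50000
filtered_data = filter_columns(original_data, "median_earnings", slider_value,
                               keep=["county", "earnings_display"])

print(filtered_data["county"])
# ['Alameda County', 'Contra Costa County', 'El Dorado County']
```

As in the JavaScript version, the original dictionary is never modified, which is why dragging the slider back to the left can bring counties back.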
Attach the custom JavaScript function to the slider. “value” represents the number currently selected on the slider.
slider.js_on_change("value", callback)
With that, it’s time to save the output. Bokeh has a few layout methods for arranging various elements. In this case, I think it makes sense to place the slider below the plot. That means fig and slider are arranged as a column.
Earlier we set the figure’s sizing_mode to “stretch_width” so it will fill the column, but we also need to set the column’s sizing_mode to fill the page. It took me a minute to understand this but I think it makes sense.
show() is what triggers the whole backend process. Bokeh builds the plot and saves it as an HTML file in the project folder.
from bokeh.plotting import show
from bokeh.layouts import column

show(column(fig, slider, sizing_mode="stretch_width"))
Normally the plot would open in a new browser window but I’ve embedded it in this page. Bokeh’s portability is one of its greatest strengths!
3. The output.
I tried to scale the plot to fit both desktop and mobile browsers but it ends up looking awkward in both. You get the idea.
The hover tooltips may not work on mobile devices. If you’re using a desktop browser, hover your mouse over a dot to see information about the county it represents.
Drag the slider to change minimum earnings. You can see that higher-income counties tended to vote for Harris. Earnings are only slightly less correlated with vote preference than our actual x-axis variable, education. That’s not surprising because education and income tend to go together.
In this model, earnings would be considered a confounding variable. It’s associated with both the x and y variables so we can’t draw conclusions about cause and effect. Maybe high incomes caused people to vote for Harris regardless of education. Or maybe there’s a hidden variable driving the correlations. That uncertainty is okay. It’s still helpful to know that education is strongly correlated with vote preference.
I get it if you aren’t super interested in California elections, but you can begin to imagine what’s possible with Bokeh interactive plots. Check out their example gallery, especially the Interaction tab. They show off several widgets like Slider that can be added to plots.
I’ve immediately become a big fan of the library. Chances are I’ll use it again in a blog post. I have an ancient-looking Google Charts dashboard that I put together nearly a decade ago. If Google ever shuts down the API, I think I’ll rebuild it with Bokeh.
Full code:
import pandas as pd
from scipy.stats import linregress
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool, Slider
from bokeh.models.callbacks import CustomJS
from bokeh.layouts import column


def get_ytick_labels(ticks):
    labels = []
    for tick in ticks:
        if tick > 0:
            labels.append(f"D+{tick}")
        elif tick < 0:
            labels.append(f"R+{abs(tick)}")
        else:
            labels.append("TIE")
    return dict(zip(ticks, labels))


pd.set_option("display.expand_frame_repr", False)

# Load the data and keep California only; .copy() avoids SettingWithCopyWarning
df = pd.read_csv("county_data.csv")
df = df[df['state'] == "California"].copy()

# Display strings for the hover tooltips
df.loc[:, 'population_display'] = df['population'].apply(lambda x: f"{x:,}")
df.loc[:, 'earnings_display'] = df['median_earnings'].apply(lambda x: f"${x:,}")

# Regression line
slope, intercept, r_value, p_value, std_err = linregress(df['bachelor_or_higher_pct'], df['dem_margin'])
x_reg = [df['bachelor_or_higher_pct'].min(), df['bachelor_or_higher_pct'].max()]
y_reg = [n * slope + intercept for n in x_reg]

fig = figure(height=400,
             sizing_mode="stretch_width",
             title=f"California Counties | 2024 Presidential Vote | R² = {r_value ** 2:.2f}",
             x_axis_label="Bachelor's Degree or Higher",
             y_axis_label="Vote Margin")

data_source = ColumnDataSource(data=df)

scatter1 = fig.scatter(source=data_source,
                       x="bachelor_or_higher_pct",
                       y="dem_margin",
                       size=12,
                       color="#7865BF",
                       alpha=0.75)

fig.line(x=x_reg, y=y_reg, line_width=2, color="#333333")

# Axis ticks, labels, and window
x_ticks = list(range(10, 70, 10))
fig.xaxis.ticker = x_ticks
fig.xaxis.major_label_overrides = {n: f"{n}%" for n in x_ticks}
fig.x_range.start = 8
fig.x_range.end = 63

y_ticks = list(range(-40, 80, 20))
fig.yaxis.ticker = y_ticks
fig.yaxis.major_label_overrides = get_ytick_labels(y_ticks)
fig.y_range.start = -50
fig.y_range.end = 70

# Hover tooltips on the scatter markers only
tool_tips = [("County", "@county"),
             ("Population", "@population_display"),
             ("Earnings", "@earnings_display")]
fig.add_tools(HoverTool(renderers=[scatter1], tooltips=tool_tips))

# Slider and its JavaScript callback
slider = Slider(start=30000,
                end=75000,
                step=1000,
                value=30000,
                title="Median Earnings (Minimum)",
                width=250)

original_data_source = ColumnDataSource(data=df)

callback = CustomJS(args={"data_source": data_source,
                          "original_data_source": original_data_source,
                          "slider": slider},
                    code="""
    var original_data = original_data_source.data;
    var filtered_data = {bachelor_or_higher_pct: [],
                         dem_margin: [],
                         county: [],
                         population_display: [],
                         earnings_display: []};
    for (var i = 0; i < original_data['median_earnings'].length; i++) {
        if (original_data['median_earnings'][i] >= slider.value) {
            filtered_data.bachelor_or_higher_pct.push(original_data['bachelor_or_higher_pct'][i]);
            filtered_data.dem_margin.push(original_data['dem_margin'][i]);
            filtered_data.county.push(original_data['county'][i]);
            filtered_data.population_display.push(original_data['population_display'][i]);
            filtered_data.earnings_display.push(original_data['earnings_display'][i]);
        }
    }
    data_source.data = filtered_data;""")

slider.js_on_change("value", callback)

show(column(fig, slider, sizing_mode="stretch_width"))