Interactive plots with Bokeh

August 7, 2025 · August 6, 2025 by Jeff Wollen

I’m most comfortable with Matplotlib. I think that’s obvious at this point. I’m also well aware that it has a (*ahem*) troubled reputation, despite its large footprint in the data world.

Much of that reputation is deserved, of course. It’s over 20 years old and the design shows it. It usually provides several puzzlingly different ways to accomplish the same goal and they’re all clunky.

One of its biggest problems is that the default style is unforgivably ugly. People see the output and get turned off. They realize how much effort it would take to produce an average-looking plot and decide it isn’t worth the trouble. I think that’s a shame because once you have a feel for the library, it offers probably the finest degree of control in customizing your work.

An example of Matplotlib’s default style. It would be right at home in a 1970s high school chemistry textbook.

I didn’t intend for this blog to become exclusively the realm of Matplotlib, but I enjoy taking a library that most people dislike and creating (what I hope are) nice data visualizations.

That said, I think it’s a good time to step outside my comfort zone and use a different tool. Bokeh is a data viz library that helps you create clean, modern-looking plots that embed in a web page. You could easily spin up a dashboard and display stock prices, weather data, or anything else. One of Bokeh’s biggest draws is that, unlike Matplotlib, it looks good without putting any real effort into the style.

For me, its most exciting feature is that the plots can be interactive. Even default plots allow users to zoom and drag the window. You can push it further with sliders, spinners, and other built-in widgets, all backed up by fully customizable JavaScript.

Some of its methods are unintuitive, in my opinion, but it’s understandable because Bokeh is designed to translate Python code into JavaScript. It’s kind of like writing instructions for a robot to bake a cake rather than baking it yourself.

Bokeh gets bonus points for dumping all the code into a single HTML file. You could easily share your work with someone and they wouldn’t need to install anything. They could use the interactive features without a server running somewhere in the background.

This post will focus mostly on the visualization. I won’t spend too much time gathering data. Instead I’ll extend my previous post about county-level election results. In that post, we looked at population change and how it related to partisan preference in the 2024 election. Here, we’ll narrow the scope to counties in California, but we’ll add two new variables:

The percentage of residents age 25+ with a bachelor’s degree or higher (S1501).
Median annual earnings of residents age 16+ (S2001).

Both of these statistics are available from the US Census Bureau as part of the American Community Survey. Our primary focus will be on the relationship between education and vote preference. Population and median earnings will add a degree of interactivity to the plot.

1. Prepare the data.

I’m going to skip past merging the datasets. I already covered what’s essentially the same process in my previous post, so click the link for a more detailed explanation. Today, we’ll start with a cleaned, ready-to-go county-level CSV that holds all the relevant variables.

Load the dataset into a pandas DataFrame and filter it down to California rows. The set_option call keeps wide DataFrames from wrapping across multiple lines, so all of the columns print side by side.

import pandas as pd

pd.set_option("display.expand_frame_repr", False)

df = pd.read_csv("county_data.csv")

# .copy() makes the filtered rows an independent DataFrame, so adding columns later is safe
df = df[df['state'] == "California"].copy()

df.head() is shown below. County and state have their own columns, as do the three demographic variables. dem_margin is the percentage point difference between Democratic and Republican vote share. A 55-45 blue county would show up as 10.0.
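To make the dem_margin definition concrete, here’s a minimal sketch of how a figure like that could be computed from raw vote counts. The function name and its inputs are hypothetical, and it assumes shares are taken out of all votes cast. The CSV used in this post already contains the finished column.

```python
def margin_from_votes(dem_votes, rep_votes, total_votes):
    """Return the Democratic margin in percentage points (hypothetical helper)."""
    dem_share = 100 * dem_votes / total_votes
    rep_share = 100 * rep_votes / total_votes
    return dem_share - rep_share


# A 55-45 county: positive means the Democrat led, negative means the Republican led.
blue_county = margin_from_votes(5500, 4500, 10000)
red_county = margin_from_votes(4500, 5500, 10000)
print(blue_county, red_county)
```

The sign carries the direction, which is what lets a single column drive both the y-axis and the D+/R+ labels later on.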

                 county       state  population  bachelor_or_higher_pct  median_earnings  dem_margin
41       Alameda County  California     1649060               51.901837            64242   53.562556
42         Butte County  California      208334               29.851859            36717   -3.121991
43  Contra Costa County  California     1172607               46.191173            60735   37.946395
44     El Dorado County  California      192823               40.665969            55187  -11.981410
45        Fresno County  California     1024125               24.769337            40124   -4.385357

We have education and earnings data for 42 of California’s 58 counties. I’d love to have all 58 but unfortunately the American Community Survey doesn’t have universal coverage. The Census Bureau can’t collect enough responses in some counties to provide a reliable estimate. Still, those 42 counties account for approximately 99% of the state’s population, so it’s not a huge loss in terms of people.

Before moving on, we have to take care of one thing. The plan is to create a scatter plot of education and vote preference, where hovering over a marker displays that county’s population and median earnings. Since those figures will be shown to the user, we should clean up how they appear on screen. For example, earnings of 50000 can be displayed as $50,000. Similarly, let’s include commas in population numbers. Create new columns to hold display strings.

df.loc[:, 'population_display'] = df['population'].apply(lambda x: f"{x:,}")

df.loc[:, 'earnings_display'] = df['median_earnings'].apply(lambda x: f"${x:,}")

Now df.head() looks like this. I think the two columns on the right are much easier to read.

                 county       state  population  bachelor_or_higher_pct  median_earnings  dem_margin population_display earnings_display
41       Alameda County  California     1649060               51.901837            64242   53.562556          1,649,060          $64,242
42         Butte County  California      208334               29.851859            36717   -3.121991            208,334          $36,717
43  Contra Costa County  California     1172607               46.191173            60735   37.946395          1,172,607          $60,735
44     El Dorado County  California      192823               40.665969            55187  -11.981410            192,823          $55,187
45        Fresno County  California     1024125               24.769337            40124   -4.385357          1,024,125          $40,124
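If the format strings inside those lambdas look cryptic, here’s a small standalone sketch using Alameda County’s values from the table above. The third line is an assumption for illustration only: the post doesn’t create a display string for the education percentage.

```python
population = 1649060
earnings = 64242
bachelor_pct = 51.901837

population_display = f"{population:,}"      # comma as the thousands separator
earnings_display = f"${earnings:,}"         # literal dollar sign, then the number
bachelor_display = f"{bachelor_pct:.1f}%"   # one decimal place (not used in this post)

print(population_display, earnings_display, bachelor_display)
```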

Now run a simple linear regression of dem_margin on bachelor_or_higher_pct. This will give us a “line of best fit” through the trend.

from scipy.stats import linregress

slope, intercept, r_value, p_value, std_err = linregress(df['bachelor_or_higher_pct'], df['dem_margin'])

x_reg = [df['bachelor_or_higher_pct'].min(), df['bachelor_or_higher_pct'].max()]
y_reg = [n * slope + intercept for n in x_reg]

R² measures how tightly data points fit the trend line. If you square r_value to get R², you’ll find it’s 0.63. It’s fairly high for a relationship like this in the real world.
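For anyone curious what sits behind that number, here’s a rough pure-Python sketch of a one-variable least squares fit and its R². The function names and the data are made up for illustration. This isn’t how scipy implements linregress, but for a single predictor it should agree with r_value squared.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def r_squared(xs, ys):
    """Share of the variance in ys that the fitted line accounts for."""
    slope, intercept = fit_line(xs, ys)
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up numbers, not the county data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, r_squared(xs, ys))
```

An R² of 1 would mean every point sits exactly on the line, and a value near 0 would mean the line explains almost none of the spread.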

In fact, California has the 8th-highest R² of all the states in our dataset. That’s one reason the state was a good subject for this post. Nationwide, R² for all 830 counties is 0.35. So the relationship between education and vote is considerably stronger in California than in the US as a whole.

As a disclaimer, a linear regression doesn’t measure cause and effect. That’s always worth remembering, especially in a political context. A causal link is possible, but you would have to do more work to show that x causes y.

2. Plot the data.

Now we can build a scatter plot.

As a Matplotlib user, I find a Bokeh figure object similar to a Matplotlib Axes. We can call figure.line or figure.scatter to plot the data.

There are many differences, like how we immediately set labels in the code below. Setting sizing_mode to “stretch_width” allows the figure to expand and fill its parent container. It’s probably not necessary in most cases but it will help to display the output on this page.

from bokeh.plotting import figure

fig = figure(height=400,
             sizing_mode="stretch_width",
             title=f"California Counties  |  2024 Presidential Vote  |  R² = {r_value ** 2:.2f}",
             x_axis_label="Bachelor's Degree or Higher",
             y_axis_label="Vote Margin")

A ColumnDataSource object is how we’ll pass data to the plotting methods. It’s a Bokeh data structure that works much like a Python dictionary. You could pass a dictionary type and it would work just as well. Conveniently for us, it accepts pandas DataFrames.

from bokeh.models import ColumnDataSource

data_source = ColumnDataSource(data=df)
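To see the dictionary-like behavior for yourself, here’s a tiny sketch that builds a ColumnDataSource from a plain dict instead of a DataFrame. The toy_source name and its values are made up for illustration.

```python
from bokeh.models import ColumnDataSource

# Made-up values; each key is a column name and each list holds that column's data.
toy_source = ColumnDataSource(data={
    "county": ["A County", "B County"],
    "dem_margin": [12.5, -3.0],
})

print(toy_source.column_names)
print(toy_source.data["dem_margin"])
```

Every column should be the same length, the same way every column in a DataFrame is.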

Now let’s call scatter on the figure we created a moment ago.

Our ColumnDataSource gets passed to the source parameter. Then declare which variables go along the horizontal and vertical axes and set marker style.

It’s not strictly necessary but I want to save the scatter plot as a variable (scatter1) and enable hover tooltips later.

scatter1 = fig.scatter(source=data_source,
                       x="bachelor_or_higher_pct",
                       y="dem_margin",
                       size=12,
                       color="#7865BF",
                       alpha=0.75)

Now draw the regression line on the figure. We saved its points in x_reg and y_reg lists.

This doesn’t need to be assigned a variable name because we don’t want tooltips enabled. We only want to show county population and earnings when hovering.

fig.line(x=x_reg,
         y=y_reg,
         line_width=2,
         color="#333333")

Next, define the window. Bokeh doesn’t accept range types, so we have to convert the x-ticks into a Python list (numpy.arange or similar would also work).

The x variable is the percentage of residents with a bachelor’s degree or higher, so let’s label the ticks as strings with a % symbol. Set major_label_overrides to a dictionary of {tick: label} pairs.

x_ticks = list(range(10, 70, 10))
fig.xaxis.ticker = x_ticks
fig.xaxis.major_label_overrides = {n: f"{n}%" for n in x_ticks}
fig.x_range.start = 8
fig.x_range.end = 63

The process is essentially the same for the y-axis. I wrote a function to convert vote margins into D-R partisanship. For example, -20 becomes R+20.

def get_ytick_labels(ticks):
    labels = []
    for tick in ticks:
        if tick > 0:
            labels.append(f"D+{tick}")
        elif tick < 0:
            labels.append(f"R+{abs(tick)}")
        else:
            labels.append("TIE")
    return dict(zip(ticks, labels))


y_ticks = list(range(-40, 80, 20))
fig.yaxis.ticker = y_ticks
fig.yaxis.major_label_overrides = get_ytick_labels(y_ticks)
fig.y_range.start = -50
fig.y_range.end = 70
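It can help to print the label dictionaries before handing them to Bokeh. Here’s a standalone sketch with no Bokeh required. The margin_label helper is a hypothetical one-tick variant of get_ytick_labels that applies the same rules.

```python
def margin_label(tick):
    """Turn a signed margin into a D+/R+ label (same rules as get_ytick_labels)."""
    if tick > 0:
        return f"D+{tick}"
    if tick < 0:
        return f"R+{abs(tick)}"
    return "TIE"


x_ticks = list(range(10, 70, 10))    # range stops before 70, so the last tick is 60
y_ticks = list(range(-40, 80, 20))   # range stops before 80, so the last tick is 60

x_labels = {n: f"{n}%" for n in x_ticks}
y_labels = {n: margin_label(n) for n in y_ticks}

print(x_labels)
print(y_labels)
```

The keys are the numeric tick positions and the values are the strings that get drawn in their place.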

To create hover tooltips, we pass a list of tuples to HoverTool. The first element of each tuple is the label shown on screen. The second element is an @ symbol followed by the name of a column in the data source.

This is why we saved the scatter instance as a variable. Only objects passed to renderers will have tooltips. The regression line won’t be included.

from bokeh.models import HoverTool

tool_tips = [("County", "@county"),
             ("Population", "@population_display"),
             ("Earnings", "@earnings_display")]
fig.add_tools(HoverTool(renderers=[scatter1], tooltips=tool_tips))

This is already more interactive than my usual Matplotlib plots but I think we can do better. It seems like a waste to stop at tooltips.

Let’s create a slider below the plot that can set minimum earnings. Only counties that meet or exceed the slider’s value will appear on the plot. So as you drag it to the right, counties with the lowest earnings will gradually disappear. You might be interested to see how the relationship looks for counties with above-average incomes.

start and end define the lowest- and highest-possible values of the slider. step is the distance between each value, e.g. with step=4 you could snap to 0, 4, 8, 12, etc. value is where the slider is located when the page loads. Let’s start it at the far left so all points are included. title and width define the widget’s appearance.

from bokeh.models import Slider

slider = Slider(start=30000,
                end=75000,
                step=1000,
                value=30000,
                title="Median Earnings (Minimum)",
                width=250)
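To get a feel for what those three numbers imply, here’s a quick sketch that lists every value the slider can land on. It assumes the distance from start to end is an exact multiple of step, which is true for the values above.

```python
start, end, step = 30000, 75000, 1000

# Every value the slider can land on, inclusive of both ends.
stops = list(range(start, end + 1, step))

print(len(stops), stops[0], stops[-1])
```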

Next is the complicated part (at least for me). We have to write a custom JavaScript function to handle the slider’s logic. Thankfully, there’s nothing too far out there. I’m far from a JavaScript expert and I managed to work through it.

A “callback” is a function that’s triggered when some action or event occurs, like a key press or a button click. In this case, we’ll create a callback function to run whenever the slider moves.

The CustomJS class is a little unintuitive because it tells Bokeh how to write a JavaScript function. It’s like meta-code. We have to create parameters and arguments at the same time.

First, save an extra copy of the ColumnDataSource. Every time the slider moves, it will filter values from the data source hooked up to the plot, so we’ll need an original copy to work from.

original_data_source = ColumnDataSource(data=df)

Now build the callback function. args creates parameters and defines the arguments that will be passed to them. We need the data_source that’s hooked up to the plot, the original_data_source that we’ll pull from, and the slider widget itself.

The code’s syntax is similar enough to Python. Every time the slider moves, we rebuild the whole data source from scratch. The function creates a JavaScript object (roughly a Python dictionary) whose values are empty arrays, one per column. Then it steps through each row of the original, complete data source. If median earnings are at least the slider’s value, that row’s values are pushed onto the arrays. Finally, the object is assigned back to data_source, which is hooked up to the plot.

It might look intimidating if you aren’t used to JavaScript, but take it slow and it will make sense. There’s nothing here that you haven’t done a million times in Python. Now that I have this working example, I’m confident that I’ll be able to create all kinds of custom logic for Bokeh widgets in the future. It boils down to filtering the data on every pass through the function and assigning it back to the ColumnDataSource.

from bokeh.models.callbacks import CustomJS

callback = CustomJS(args={"data_source": data_source,
                          "original_data_source": original_data_source,
                          "slider": slider},
                    code="""
    var original_data = original_data_source.data;

    var filtered_data = {bachelor_or_higher_pct: [], 
                         dem_margin: [],
                         county: [],
                         population_display: [],
                         earnings_display: []};

    for (var i = 0; i < original_data['median_earnings'].length; i++) {
        if (original_data['median_earnings'][i] >= slider.value) {
            filtered_data.bachelor_or_higher_pct.push(original_data['bachelor_or_higher_pct'][i]);
            filtered_data.dem_margin.push(original_data['dem_margin'][i]);
            filtered_data.county.push(original_data['county'][i]);
            filtered_data.population_display.push(original_data['population_display'][i]);
            filtered_data.earnings_display.push(original_data['earnings_display'][i]);
        }
    }
    
    data_source.data = filtered_data;""")
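If the JavaScript is hard to follow, here’s roughly the same filtering logic as a plain Python sketch. The filter_columns function is hypothetical and isn’t part of the plot. Unlike the callback, it copies every column rather than a hand-picked five. The three rows come from the df.head() output earlier in the post.

```python
def filter_columns(original, minimum, key="median_earnings"):
    """Rebuild a column-oriented dict, keeping rows where original[key] >= minimum."""
    filtered = {name: [] for name in original}
    for i, value in enumerate(original[key]):
        if value >= minimum:
            for name in original:
                filtered[name].append(original[name][i])
    return filtered


original = {
    "county": ["Alameda County", "Butte County", "Fresno County"],
    "median_earnings": [64242, 36717, 40124],
    "dem_margin": [53.562556, -3.121991, -4.385357],
}

high_earners = filter_columns(original, 40000)
everyone = filter_columns(original, 30000)  # sliding back to the left restores every row
print(high_earners["county"], len(everyone["county"]))
```

Because the function always reads from the untouched original, rows that were filtered out can come back. That’s the reason for keeping a second ColumnDataSource around.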

Attach the custom JavaScript function to the slider. “value” represents the number currently selected on the slider.

slider.js_on_change("value", callback)

With that, it’s time to save the output. Bokeh has a few layout methods for arranging various elements. In this case, I think it makes sense to place the slider below the plot. That means fig and slider are arranged as a column.

Earlier we set the figure’s sizing_mode to “stretch_width” so it will fill the column, but we also need to set the column’s sizing_mode to fill the page. It took me a minute to understand this but I think it makes sense.

show() is what triggers the whole backend process. Bokeh builds the plot and saves it as an HTML file in the project folder.

from bokeh.plotting import show
from bokeh.layouts import column

show(column(fig, slider, sizing_mode="stretch_width"))

Normally the plot would open in a new browser window but I’ve embedded it in this page. Bokeh’s portability is one of its greatest strengths!

3. The output.

I tried to scale the plot to fit both desktop and mobile browsers but it ends up looking awkward in both. You get the idea.

The hover tooltips may not work on mobile devices. If you’re using a desktop browser, hover your mouse over a dot to see information about the county it represents.

Drag the slider to change minimum earnings. You can see that higher-income counties tended to vote for Harris. Earnings are only slightly less correlated with vote preference than our actual x-axis variable, education. That’s not surprising because education and income tend to go together.

In this model, earnings would be considered a confounding variable. It’s associated with both the x and y variables so we can’t draw conclusions about cause and effect. Maybe high incomes caused people to vote for Harris regardless of education. Or maybe there’s a hidden variable driving the correlations. That uncertainty is okay. It’s still helpful to know that education is strongly correlated with vote preference.

I get it if you aren’t super interested in California elections, but you can begin to imagine what’s possible with Bokeh interactive plots. Check out their example gallery, especially the Interaction tab. They show off several widgets like Slider that can be added to plots.

I’ve immediately become a big fan of the library. Chances are I’ll use it again in a blog post. I have an ancient-looking Google Charts dashboard that I put together nearly a decade ago. If Google ever shuts down the API, I think I’ll rebuild it with Bokeh.

Download the data.

Full code:

import pandas as pd
from scipy.stats import linregress
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool, Slider
from bokeh.models.callbacks import CustomJS
from bokeh.layouts import column


def get_ytick_labels(ticks):
    labels = []
    for tick in ticks:
        if tick > 0:
            labels.append(f"D+{tick}")
        elif tick < 0:
            labels.append(f"R+{abs(tick)}")
        else:
            labels.append("TIE")
    return dict(zip(ticks, labels))


pd.set_option("display.expand_frame_repr", False)

df = pd.read_csv("county_data.csv")

df = df[df['state'] == "California"].copy()

df.loc[:, 'population_display'] = df['population'].apply(lambda x: f"{x:,}")

df.loc[:, 'earnings_display'] = df['median_earnings'].apply(lambda x: f"${x:,}")

slope, intercept, r_value, p_value, std_err = linregress(df['bachelor_or_higher_pct'], df['dem_margin'])

x_reg = [df['bachelor_or_higher_pct'].min(), df['bachelor_or_higher_pct'].max()]
y_reg = [n * slope + intercept for n in x_reg]

fig = figure(height=400,
             sizing_mode="stretch_width",
             title=f"California Counties  |  2024 Presidential Vote  |  R² = {r_value ** 2:.2f}",
             x_axis_label="Bachelor's Degree or Higher",
             y_axis_label="Vote Margin")

data_source = ColumnDataSource(data=df)

scatter1 = fig.scatter(source=data_source,
                       x="bachelor_or_higher_pct",
                       y="dem_margin",
                       size=12,
                       color="#7865BF",
                       alpha=0.75)

fig.line(x=x_reg,
         y=y_reg,
         line_width=2,
         color="#333333")

x_ticks = list(range(10, 70, 10))
fig.xaxis.ticker = x_ticks
fig.xaxis.major_label_overrides = {n: f"{n}%" for n in x_ticks}
fig.x_range.start = 8
fig.x_range.end = 63

y_ticks = list(range(-40, 80, 20))
fig.yaxis.ticker = y_ticks
fig.yaxis.major_label_overrides = get_ytick_labels(y_ticks)
fig.y_range.start = -50
fig.y_range.end = 70

tool_tips = [("County", "@county"),
             ("Population", "@population_display"),
             ("Earnings", "@earnings_display")]
fig.add_tools(HoverTool(renderers=[scatter1], tooltips=tool_tips))

slider = Slider(start=30000,
                end=75000,
                step=1000,
                value=30000,
                title="Median Earnings (Minimum)",
                width=250)

original_data_source = ColumnDataSource(data=df)

callback = CustomJS(args={"data_source": data_source,
                          "original_data_source": original_data_source,
                          "slider": slider},
                    code="""
    var original_data = original_data_source.data;

    var filtered_data = {bachelor_or_higher_pct: [], 
                         dem_margin: [],
                         county: [],
                         population_display: [],
                         earnings_display: []};

    for (var i = 0; i < original_data['median_earnings'].length; i++) {
        if (original_data['median_earnings'][i] >= slider.value) {
            filtered_data.bachelor_or_higher_pct.push(original_data['bachelor_or_higher_pct'][i]);
            filtered_data.dem_margin.push(original_data['dem_margin'][i]);
            filtered_data.county.push(original_data['county'][i]);
            filtered_data.population_display.push(original_data['population_display'][i]);
            filtered_data.earnings_display.push(original_data['earnings_display'][i]);
        }
    }
    
    data_source.data = filtered_data;""")

slider.js_on_change("value", callback)

show(column(fig, slider, sizing_mode="stretch_width"))
