UK Researchers
UK research networks with HoloViews+Bokeh+Datashader
Datashader makes it possible to plot very large datasets in a web browser, Bokeh makes those plots interactive, and HoloViews provides a convenient interface for building them. Here, let's use these three libraries to visualize an example dataset of roughly 600,000 collaborations between about 15,000 UK research institutions, previously laid out using a force-directed algorithm by Ian Calvert.
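(If you need to compute such a layout yourself rather than loading precomputed coordinates, datashader includes graph-layout utilities. The following is only a rough sketch applied to a toy random graph; it assumes datashader.layout.forceatlas2_layout accepts a nodes frame, with optional x/y columns, and an edges frame with source/target columns, and it is not the code or algorithm used to lay out this dataset.)

import numpy as np
import pandas as pd
from datashader.layout import forceatlas2_layout

# Toy graph, purely illustrative: 100 nodes and 500 random edges.
toy_nodes = pd.DataFrame({"id": np.arange(100)}).set_index("id")
toy_edges = pd.DataFrame({"source": np.random.randint(0, 100, 500),
                          "target": np.random.randint(0, 100, 500)})

# forceatlas2_layout returns the nodes with x,y coordinates added.
forceatlas2_layout(toy_nodes, toy_edges).head()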
First, we’ll import the packages we are using and set up some defaults.
import pandas as pd
import holoviews as hv
from holoviews import opts
from colorcet import fire
from datashader.bundling import directly_connect_edges, hammer_bundle
from holoviews.operation.datashader import datashade, dynspread
from holoviews.operation import decimate
from dask.distributed import Client

client = Client()                 # local Dask cluster, used by hammer_bundle below
hv.notebook_extension('bokeh', 'matplotlib')

decimate.max_samples = 20000      # default cap on points shown per decimated plot
dynspread.threshold = 0.01
datashade.cmap = fire[40:]        # skip the darkest colors so data stays visible on black
sz = dict(width=150, height=150)

opts.defaults(
    opts.RGB(width=400, height=400, xaxis=None, yaxis=None, show_grid=False, bgcolor="black"))
The files are stored in the efficient Parquet format:
r_nodes_df = pd.read_parquet("./data/graph/calvert_uk_research2017_nodes.snappy.parq")
r_edges_df = pd.read_parquet("./data/graph/calvert_uk_research2017_edges.snappy.parq")
r_nodes_df = r_nodes_df.set_index("id")
r_edges_df = r_edges_df.set_index("id")
r_nodes = hv.Points(r_nodes_df, label="Nodes")
r_edges = hv.Curve(r_edges_df, label="Edges")
len(r_nodes), len(r_edges)
(15001, 593915)
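It can help to confirm what these tables contain. The column names mentioned below (x and y for node positions, source and target for edges) are the ones datashader's bundling functions expect, and are assumed here rather than shown in the original output:

# Peek at the raw tables (assumed columns: x/y for nodes, source/target for edges).
print(r_nodes_df.head(3))
print(r_edges_df.head(3))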
We can render each collaboration as a single-line direct connection, but the result is a dense tangle:
%%time
r_direct = hv.Curve(directly_connect_edges(r_nodes.data, r_edges.data), label="Direct")
CPU times: user 3.5 s, sys: 328 ms, total: 3.83 s
Wall time: 3.84 s
dynspread(datashade(r_nodes, cmap=["cyan"])) + datashade(r_direct)
Detailed substructure of this graph becomes visible after bundling edges using a variant of Hurter, Ersoy, & Telea (ECV-2012), which takes several minutes even using multiple cores with Dask:
%%time
r_bundled = hv.Curve(hammer_bundle(r_nodes.data, r_edges.data),label="Bundled")
(Depending on your Dask version, this step may print UserWarnings about sending a large task graph to the scheduler and suggest scattering large objects ahead of time with client.scatter; they are harmless here and the computation completes normally.)
CPU times: user 44.3 s, sys: 3.8 s, total: 48.1 s
Wall time: 4min 31s
dynspread(datashade(r_nodes, cmap=["cyan"])) + datashade(r_bundled)
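The look of the bundled graph depends strongly on the bundling parameters. As a rough sketch (assuming hammer_bundle's initial_bandwidth and decay parameters, and using a random subsample of edges so each call finishes in a fraction of the time), you could compare a looser and a tighter bundling:

# Rough sketch, not part of the original workflow: compare two bundling settings
# on a 10,000-edge random subsample to keep run times manageable.
r_edges_sample = r_edges_df.sample(10000, random_state=0)

r_loose = hv.Curve(hammer_bundle(r_nodes.data, r_edges_sample,
                                 initial_bandwidth=0.30, decay=0.30), label="Loose")
r_tight = hv.Curve(hammer_bundle(r_nodes.data, r_edges_sample,
                                 initial_bandwidth=0.05, decay=0.70), label="Tight")

datashade(r_loose) + datashade(r_tight)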
Zooming into these plots reveals interesting patterns (if you are running a live Python server), but one immediately wants to know what the various groupings of nodes represent. Here we just have thousands of indistinguishable dots, so let's use hover information so that the viewer can at least see the identity of each node on inspection. (With a small number of nodes or a small number of categories you could instead color-code the dots using datashader's categorical color-coding support; a rough sketch of that approach follows below.)
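That categorical sketch would look roughly like the following; the 'sector' column is invented purely for illustration (no such column exists in this dataset), and the point is only to show the ds.count_cat mechanism you would use if each node did carry a category:

import datashader as ds
from colorcet import glasbey

# Hypothetical example: invent a categorical 'sector' column so the shaded
# points can be color-coded per category with a count_cat aggregator.
nodes_cat = r_nodes_df.copy()
nodes_cat["sector"] = pd.Categorical(
    ["even" if i % 2 == 0 else "odd" for i in range(len(nodes_cat))])

cat_points = hv.Points(nodes_cat, ["x", "y"], ["sector"])
datashade(cat_points, aggregator=ds.count_cat("sector"), color_key=glasbey)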
To provide that hover information, we'll first need to pull in something useful to hover over, so let's load the names of each institution in the researcher list and merge them with our existing layout data:
node_names = pd.read_csv("./data/institutions/calvert_uk_research2017_nodes.csv",
                         index_col="node_id", usecols=["node_id", "name"])
node_names = node_names.rename(columns={"name": "Institution"})
node_names
r_nodes_named = pd.merge(r_nodes.data, node_names, left_index=True, right_index=True)
r_nodes_named.tail()
| id | x | y | Institution |
| --- | --- | --- | --- |
| 33517 | -8832.56100 | 1903.04940 | The Asset Factor Limited |
| 33519 | -9448.65500 | 1292.72130 | Ingenia Limited |
| 33522 | -1256.02720 | 2628.33400 | United Therapeutics |
| 33523 | 45.72761 | -365.93396 | Max Fordham LLP |
| 33525 | -8857.48100 | 1426.97060 | First Greater Western Limited |
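Since pd.merge defaults to an inner join, any node whose id has no matching institution name would be silently dropped; a quick sanity check (not in the original workflow) is to compare the row counts:

# Compare row counts before and after the merge to spot dropped nodes.
print(len(r_nodes.data), len(node_names), len(r_nodes_named))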
We can now overlay a set of points on top of the datashaded edges, which will provide hover information for each node. Here, plotting the entire set of 15,000 nodes would be reasonably feasible, but to show how to work with larger datasets we wrap the hv.Points() call with decimate so that only a limited subset of the points is shown at any one time. If a node of interest is not visible at a particular zoom level, you can simply zoom in on that region; at some point the number of visible points will fall below the decimate limit and the point you want will be revealed.
edges  = datashade(r_bundled, width=900, height=650)
points = decimate(hv.Points(r_nodes_named), max_samples=10000)

(edges * points).opts(
    opts.Points(color="cyan", tools=["hover"], width=900, height=650),
    opts.Overlay(width=900, height=650))
If you click around and hover, you should see interesting groups of nodes, and you can then set up further interactive tools using HoloViews' stream support to reveal whatever aspects are relevant to your research interests or questions; one possible starting point is sketched below.
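As one such starting point (a minimal sketch only, not part of the original workflow; the Selection1D/box_select wiring here is just one assumption about how you might want to interact), you could list the institutions inside a box selection:

from holoviews import streams

# Minimal sketch: attach a Selection1D stream to an undecimated Points element,
# so box-selecting a group of nodes lists the matching institution names.
sel_points = hv.Points(r_nodes_named).opts(color="cyan", tools=["box_select", "hover"])
selection  = streams.Selection1D(source=sel_points)

def selected_institutions(index):
    # Selection1D indices refer to rows of the plotted data, in the same order
    # as r_nodes_named, so .iloc is the right lookup.
    subset = r_nodes_named.iloc[index] if index else r_nodes_named.iloc[:0]
    return hv.Table(subset[["Institution"]].reset_index())

edges * sel_points + hv.DynamicMap(selected_institutions, streams=[selection])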
As you can see, datashader lets you work with very large graph datasets. There are a number of decisions to make by trial and error, computationally expensive operations like edge bundling require some care, and interactive information will only be available for a limited subset of the data at any one time, due to the data-size limitations of current web browsers.