# Large Timeseries Data
Effectively representing temporal dynamics in large datasets requires selecting appropriate visualization techniques that ensure responsiveness while providing both a macroscopic view of overall trends and a microscopic view of fine details. This guide will explore various methods, such as WebGL Rendering, LTTB Downsampling, Datashader Rasterizing, and Minimap Contextualizing, each suited for different aspects of large timeseries data visualization. We predominantly demonstrate the use of hvPlot syntax, leveraging HoloViews for more complex requirements. Although hvPlot supports multiple backends, including Matplotlib and Plotly, our focus will be on Bokeh due to its advanced capabilities in handling large timeseries data.
## Getting the data
Here we have a DataFrame with 1.2 million rows containing standardized data from 5 different sensors.
```python
import pandas as pd

df = pd.read_parquet("https://datasets.holoviz.org/sensor/v1/data.parq")
df.sample(5)
```
|        | sensor | value     | time                |
|--------|--------|-----------|---------------------|
| 942209 | 3      | 0.134363  | 2023-06-16 18:14:00 |
| 309703 | 1      | -1.889788 | 2023-02-25 01:38:00 |
| 94043  | 0      | 0.369041  | 2023-03-09 06:46:00 |
| 958303 | 3      | 0.145548  | 2023-06-27 22:28:00 |
| 726204 | 3      | 0.225511  | 2023-01-02 05:16:00 |
```python
df0 = df[df.sensor == '0']
```
Let’s go ahead and plot this data using various approaches.
## WebGL Rendering
WebGL is a JavaScript API that allows rendering content in the browser using hardware acceleration from a Graphics Processing Unit (GPU). WebGL is standardized and available in all modern browsers.
### Canvas Rendering - Prior Default
Rendering Bokeh plots in hvPlot or HoloViews has evolved significantly. Prior to 2023, Bokeh’s custom HTML Canvas rendering was the default. This approach works well for datasets up to a few tens of thousands of points but struggles above 100K points, particularly in terms of zooming and panning speed. These days, if you want to utilize Bokeh’s Canvas rendering, use `import holoviews as hv; hv.renderer("bokeh").webgl = False` prior to creating your hvPlot or HoloViews object.
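For example, a minimal sketch using the `webgl` flag named above (run it before building any plots in the session):

```python
import holoviews as hv

# Fall back to Bokeh's HTML Canvas rendering instead of WebGL
hv.renderer("bokeh").webgl = False
```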
### WebGL Rendering - Current Default
Around mid-2023, the adoption of improved WebGL as the default for hvPlot and HoloViews allowed for smoother interactions with larger datasets by utilizing GPU acceleration. It’s important to note that WebGL performance can vary based on your machine’s specifications. For example, some Apple Mac models may not exhibit a marked improvement in WebGL performance over Canvas, due to their particular GPU hardware configuration.
```python
import holoviews as hv
import hvplot.pandas  # noqa

# Set notebook hvPlot/HoloViews default options
hv.opts.defaults(hv.opts.Curve(responsive=True))

df0.hvplot(x="time", y="value", autorange='y', title="WebGL", min_height=300)
```
Note: `autorange='y'` is demonstrated here for automatic y-axis scaling, a feature introduced in HoloViews 1.17 and hvPlot 0.9.0. You can omit that option if you prefer to set the y scaling manually using the zoom tool.
Alone, both Canvas and WebGL rendering have a common limitation: they transfer the entire dataset from the server to the browser. This can be a significant bottleneck, especially for remote server setups or datasets larger than a million points. To address this, we’ll explore other techniques like LTTB Downsampling, which focus on delivering only the necessary data for the current view. These methods offer more scalable solutions for interacting with large timeseries data, as we’ll see in the following sections.
## LTTB Downsampling
### The Challenge with Simple Downsampling
A straightforward approach to handling large datasets might involve plotting only every *n*th datapoint, or an equally small random subset, as drawn here with `df.sample`:
df0.hvplot(x="time", y="value", color= '#003366', label = "All the data") *\
df0.sample(500).hvplot(x="time", y="value", alpha=0.8, color='#FF6600', min_height=300,
label="Decimation", title="Decimation: Don't do this!")
However, this method, known as decimation or arbitrarily strided sampling, can lead to aliasing, where the resulting plot misrepresents the actual data by missing crucial peaks, troughs, or slopes. For instance, significant variations visible in the WebGL plot of the previous section might be entirely absent in a decimated plot, making this approach generally inadvisable for accurate data representation.
### The LTTB Solution
To address this, a more sophisticated method like the Largest Triangle Three Buckets (LTTB) algorithm can be employed. LTTB allows data points not contributing significantly to the visible shape to be dropped, reducing the amount of data to send to the browser but preserving the appearance (and particularly the envelope, i.e. highest and lowest values in a region).
In hvPlot, adding `downsample=True` will enable the LTTB algorithm, which will automatically choose an appropriate number of samples for the current plot:
df0.hvplot(x="time", y="value", color='#003366', label = "All the data") *\
df0.hvplot(x="time", y="value", color='#00B3B3', label="LTTB", title="LTTB",
min_height=300, alpha=.8, downsample=True)
The LTTB plot will closely resemble the WebGL plot in appearance, but in general it renders much more quickly, especially when browsing locally while the computation runs on a remote server.
Note: LTTB dynamically depends on Python, and therefore the plot won’t update as you zoom in on our website. If you are running this notebook locally with a live Python process, the plot will automatically update with additional detail as you zoom in.
With LTTB, it is now practical to include all of the different sensors in a single plot without slowdown:
df.hvplot(x="time", y="value", downsample=True, by='sensor', min_height=300, title="LTTB By Sensor")
This makes LTTB an ideal default method for exploring timeseries datasets, particularly when the dataset size is unknown or too large for standard WebGL rendering.
### Enhanced Downsampling Options
Starting in HoloViews version 1.19.0, integration with the tsdownsample library introduces enhanced downsampling functionality with the following methods, which are accepted as inputs to `downsample` in hvPlot:

- `lttb`: Implements the Largest Triangle Three Buckets (LTTB) algorithm, optimizing the selection of points to retain the visual shape of the data.
- `minmax`: For each segment of the data, this method retains the minimum and maximum values, ensuring that peaks and troughs are preserved.
- `minmax-lttb`: A hybrid approach that combines the minmax strategy with LTTB.
- `m4`: A multi-step process that leverages the min, max, first, and last values for each time segment.
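For example, a minimal sketch of selecting one of these methods by name (assuming HoloViews >= 1.19.0 and the optional tsdownsample package are installed):

```python
# 'minmax-lttb' preserves peaks and troughs while keeping the LTTB visual shape
df0.hvplot(x="time", y="value", downsample='minmax-lttb', min_height=300,
           title="MinMax-LTTB")
```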
## Datashader Rasterizing
```python
from holoviews.operation.resample import ResampleOperation2D

# Set the default width and height (in pixels) of the rasterized images
ResampleOperation2D.width = 1200
ResampleOperation2D.height = 500
```
### Principles of Datashader
While WebGL and LTTB both send individual data points to the web browser, Datashader rasterizing offers a fundamentally different approach to visualizing large datasets. Datashader operates by generating a fixed-size 2D binned array tailored to your screen’s resolution during each zoom or pan event. In this array, each bin aggregates data points from its corresponding location, effectively creating a 2D histogram. So, instead of transmitting the entire dataset, only this optimized array is sent to the web browser, thereby displaying all relevant data at the current zoom level and facilitating the visualization of the largest datasets.
❗ A couple of important details: ❗

- As with LTTB downsampling, Datashader rasterization dynamically depends on Python and, therefore, won’t update as you zoom in on our website. If you are running this notebook locally with a live Python process, the plot will automatically update with additional detail as you zoom in.
- Setting `line_width` to be greater than `0` activates anti-aliasing, smoothing the visual representation of lines that might otherwise look too pixelated.
### Single Line Example
Activating Datashader rasterization for a single large timeseries curve in hvPlot is as simple as setting `rasterize=True`!
Note: When plotting a single curve, the default behavior is to flatten the count in each pixel to better match the appearance of plotting a line without Datashader rasterization (see the relevant PR for details). If you want to restore these pixel count aggregations, just import Datashader (`import datashader as ds`) and pass a count aggregator with self-intersection activated to hvPlot (`aggregator=ds.count(self_intersect=True)`).
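For instance, a minimal sketch of that variant, using only the pieces named in the note above:

```python
import datashader as ds

# Aggregate raw per-pixel counts, letting the curve intersect itself
df0.hvplot(x="time", y="value", rasterize=True, min_height=300,
           aggregator=ds.count(self_intersect=True))
```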
df0.hvplot(x="time", y="value", rasterize=True, cnorm='eq_hist', padding=(0, 0.1),
min_height=300, autorange='y', title="Datashader Rasterize", colorbar=False, line_width=2)
### Rasterize Conditionally
Alternatively, it’s possible to activate `rasterize` conditionally with `resample_when`. When the number of individual data points present in the current zoom range is below the provided threshold, the raw plot is displayed; otherwise the `rasterize`, `datashade`, or `downsample` operation is applied.
df0.hvplot(x="time", y="value", rasterize=True, resample_when=1000, cnorm='eq_hist', padding=(0, 0.1),
min_height=300, autorange='y', title="Datashader Rasterize", colorbar=False, line_width=2)
### Multiple Categories Example
For data with a line for each of several “categories” (sensors, in this case), Datashader can assign a different color to each of the sensor categories. The resulting image then blends these colors where data overlaps, providing visual cues for areas with high category intersection. This is particularly useful for datasets with multiple data series:
df.hvplot(x="time", y="value", rasterize=True, hover=True, padding=(0, 0.1), min_height=300,
by='sensor', title="Datashader Rasterize Categories", line_width=2, colorbar=False, cmap='glasbey')