Large Timeseries Data#

Effectively representing temporal dynamics in large datasets requires selecting appropriate visualization techniques that ensure responsiveness while providing both a macroscopic view of overall trends and a microscopic view of fine details. This guide will explore various methods, such as WebGL Rendering, LTTB Downsampling, Datashader Rasterizing, and Minimap Contextualizing, each suited for different aspects of large timeseries data visualization. We predominantly demonstrate the use of hvPlot syntax, leveraging HoloViews for more complex requirements. Although hvPlot supports multiple backends, including Matplotlib and Plotly, our focus will be on Bokeh due to its advanced capabilities in handling large timeseries data.

Getting the data#

Here we have a DataFrame with 1.2 million rows containing standardized data from 5 different sensors.

import pandas as pd

df = pd.read_parquet("https://datasets.holoviz.org/sensor/v1/data.parq")
df.sample(5)
         sensor     value                 time
823708        3  0.286705  2023-03-11 19:24:00
732368        3  0.269841  2023-01-06 12:00:00
841569        3  0.262078  2023-04-06 09:14:00
395037        1 -0.166718  2023-05-08 22:24:00
1041004       4  0.239751  2023-02-27 04:30:00

df0 = df[df.sensor=='0']

Let’s go ahead and plot this data using various approaches.

WebGL Rendering#

WebGL is a JavaScript API that allows rendering content in the browser using hardware acceleration from a Graphics Processing Unit (GPU). WebGL is standardized and available in all modern browsers.

Canvas Rendering - Prior Default#

Rendering Bokeh plots in hvPlot or HoloViews has evolved significantly. Prior to 2023, Bokeh’s custom HTML Canvas rendering was the default. This approach works well for datasets up to a few tens of thousands of points but struggles above 100K points, particularly in terms of zooming and panning speed. These days, if you want to utilize Bokeh’s Canvas rendering, use import holoviews as hv; hv.renderer("bokeh").webgl = False prior to creating your hvPlot or HoloViews object.

WebGL Rendering - Current Default#

Around mid-2023, the adoption of improved WebGL as the default for hvPlot and HoloViews allowed for smoother interactions with larger datasets by utilizing GPU-acceleration. It’s important to note that WebGL performance can vary based on your machine’s specifications. For example, some Apple Mac models may not exhibit a marked improvement in WebGL performance over Canvas due to GPU hardware configuration.

import holoviews as hv
import hvplot.pandas  # noqa

# Set notebook hvPlot/HoloViews default options
hv.opts.defaults(hv.opts.Curve(responsive=True))

df0.hvplot(x="time", y="value", autorange='y', title="WebGL", min_height=300)

Note: autorange='y' is demonstrated here for automatic y-axis scaling, a feature from HoloViews 1.17 and hvPlot 0.9.0. You can omit that option if you prefer to set the y scaling manually using the zoom tool.
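As a rough illustration of what automatic y-axis scaling means, the sketch below recomputes the y-limits from only those points that fall inside the visible x-range. The function name autorange_y and its padding parameter are hypothetical and exist only for this example; this is not how HoloViews or Bokeh implement the feature internally.

```python
def autorange_y(xs, ys, x_start, x_end, padding=0.0):
    """Return (y_min, y_max) of the points whose x lies in [x_start, x_end]."""
    visible = [y for x, y in zip(xs, ys) if x_start <= x <= x_end]
    if not visible:
        return None  # nothing in view, so leave the axis unchanged
    lo, hi = min(visible), max(visible)
    pad = (hi - lo) * padding  # optional fractional margin around the data
    return lo - pad, hi + pad


# Synthetic example: the outlier at x=5 lies outside the view and is ignored.
xs = [0, 1, 2, 3, 4, 5]
ys = [0.0, 5.0, 1.0, -2.0, 3.0, 100.0]
y_limits = autorange_y(xs, ys, x_start=1, x_end=4)
print(y_limits)  # (-2.0, 5.0)
```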

Alone, both Canvas and WebGL rendering have a common limitation: they transfer the entire dataset from the server to the browser. This can be a significant bottleneck, especially for remote server setups or datasets larger than a million points. To address this, we’ll explore other techniques like LTTB Downsampling, which focus on delivering only the necessary data for the current view. These methods offer more scalable solutions for interacting with large timeseries data, as we’ll see in the following sections.

LTTB Downsampling#

The Challenge with Simple Downsampling#

A straightforward approach to handling large datasets might involve plotting only a subset of the datapoints, for example every nth point or a random sample drawn with a method like df.sample:

df0.hvplot(x="time", y="value", color= '#003366', label = "All the data") *\
df0.sample(500).hvplot(x="time", y="value", alpha=0.8, color='#FF6600', min_height=300,
                       label="Decimation", title="Decimation: Don't do this!")

However, this method, known as decimation or arbitrarily strided sampling, can lead to aliasing, where the resulting plot misrepresents the actual data by missing crucial peaks, troughs, or slopes. For instance, significant variations visible in the WebGL plot of the previous section might be entirely absent in a decimated plot, making this approach generally inadvisable for accurate data representation.
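To make the aliasing risk concrete, here is a minimal sketch on a synthetic signal (not the sensor data). The spike position and the stride are arbitrary values chosen for illustration. A strided subsample steps right over a one-sample spike:

```python
import math

# Synthetic signal: a slow sine wave with a single one-sample spike.
n = 10_000
signal = [math.sin(2 * math.pi * i / n) for i in range(n)]
spike_index = 5_003  # arbitrary position that is not a multiple of the stride
signal[spike_index] = 10.0

# Strided decimation: keep every 20th sample (500 points in total).
stride = 20
decimated = signal[::stride]

print(len(decimated))         # 500
print(max(signal))            # 10.0, the spike is present in the full data
print(max(decimated) <= 1.0)  # True, the spike was skipped entirely
```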

The LTTB Solution#

To address this, a more sophisticated method like the Largest Triangle Three Buckets (LTTB) algorithm can be employed. LTTB allows data points not contributing significantly to the visible shape to be dropped, reducing the amount of data to send to the browser but preserving the appearance (and particularly the envelope, i.e. highest and lowest values in a region).
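The idea behind LTTB can be sketched in plain Python. This is an illustrative implementation for numeric x values, not the code used by HoloViews or tsdownsample, and the function name lttb is our own. The first and last points are always kept; the interior points are split into buckets, and from each bucket the point that forms the largest triangle with the previously kept point and the average of the next bucket is retained.

```python
import math


def lttb(xs, ys, n_out):
    """Downsample (xs, ys) to n_out points using Largest Triangle Three Buckets."""
    n = len(xs)
    if n_out >= n or n_out < 3:
        return list(xs), list(ys)  # nothing to do

    interior = n - 2     # points between the fixed first and last point
    buckets = n_out - 2  # one output point is chosen per bucket
    out_x, out_y = [xs[0]], [ys[0]]
    a = 0                # index of the most recently selected point

    for i in range(buckets):
        # Current bucket covers indices [start, end).
        start = (i * interior) // buckets + 1
        end = ((i + 1) * interior) // buckets + 1
        # The next bucket [end, next_end) is summarized by its average point;
        # for the final bucket this is just the last data point.
        next_end = min(((i + 2) * interior) // buckets + 1, n)
        avg_x = sum(xs[end:next_end]) / (next_end - end)
        avg_y = sum(ys[end:next_end]) / (next_end - end)

        # Keep the point forming the largest triangle with the previously
        # selected point and the next bucket's average.
        best, best_area = start, -1.0
        for j in range(start, end):
            area = abs((xs[a] - avg_x) * (ys[j] - ys[a])
                       - (xs[a] - xs[j]) * (avg_y - ys[a])) / 2
            if area > best_area:
                best, best_area = j, area
        out_x.append(xs[best])
        out_y.append(ys[best])
        a = best

    out_x.append(xs[-1])
    out_y.append(ys[-1])
    return out_x, out_y


# Synthetic data: a slow sine wave with a one-sample spike.
n = 10_000
xs = list(range(n))
ys = [math.sin(2 * math.pi * i / n) for i in range(n)]
ys[5_003] = 10.0

ds_x, ds_y = lttb(xs, ys, 100)
print(len(ds_x), max(ds_y))
```

Unlike strided sampling, the spike survives here because it produces by far the largest triangle in its bucket. Timestamps would need to be converted to numbers before being passed to a function like this.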

In hvPlot, adding downsample=True will enable the LTTB algorithm, which will automatically choose an appropriate number of samples for the current plot:

df0.hvplot(x="time", y="value", color='#003366', label = "All the data") *\
df0.hvplot(x="time", y="value", color='#00B3B3', label="LTTB", title="LTTB",
           min_height=300, alpha=.8, downsample=True)

The LTTB plot will closely resemble the WebGL plot in appearance, but in general, it is rendered much more quickly (especially for local browsing of remote computation).

Note: LTTB dynamically depends on Python and therefore won’t update as you zoom in on our website. If you are locally running this notebook with a live Python process, the plot will automatically update with additional detail as you zoom in.

With LTTB, it is now practical to include all of the different sensors in a single plot without slowdown:

df.hvplot(x="time", y="value", downsample=True, by='sensor', min_height=300, title="LTTB By Sensor")

This makes LTTB an ideal default method for exploring timeseries datasets, particularly when the dataset size is unknown or too large for standard WebGL rendering.

Enhanced Downsampling Options#

Starting in HoloViews version 1.19.0, integration with the tsdownsample library introduces enhanced downsampling functionality with the following methods, which will be accepted as inputs to downsample in hvPlot:

  • lttb: Implements the Largest Triangle Three Buckets (LTTB) algorithm, optimizing the selection of points to retain the visual shape of the data.

  • minmax: For each segment of the data, this method retains the minimum and maximum values, ensuring that peaks and troughs are preserved.

  • minmax-lttb: A hybrid approach that combines the minmax strategy with LTTB.

  • m4: A multi-step process that leverages the min, max, first, and last values for each time segment.
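To give a feel for the bucket-based methods above, the following is a minimal sketch of the minmax idea, with an optional flag that also keeps the first and last sample of each bucket in the spirit of m4. The function bucket_extremes and its keep_ends flag are hypothetical names for this illustration; the tsdownsample library has its own optimized implementations, which may differ in detail.

```python
def bucket_extremes(ys, n_buckets, keep_ends=False):
    """Return sorted indices of the min and max sample in each equal-width bucket.

    With keep_ends=True the first and last index of each bucket are kept too.
    """
    n = len(ys)
    keep = set()
    for b in range(n_buckets):
        start = (b * n) // n_buckets
        end = ((b + 1) * n) // n_buckets
        if start == end:
            continue  # empty bucket (more buckets than samples)
        idx = range(start, end)
        keep.add(min(idx, key=ys.__getitem__))
        keep.add(max(idx, key=ys.__getitem__))
        if keep_ends:
            keep.update((start, end - 1))
    return sorted(keep)


ys = [0, 5, 1, -3, 2, 2, 9, -1]
minmax_idx = bucket_extremes(ys, n_buckets=2)
m4_like_idx = bucket_extremes(ys, n_buckets=2, keep_ends=True)
print(minmax_idx)   # [1, 3, 6, 7]
print(m4_like_idx)  # [0, 1, 3, 4, 6, 7]
```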

Datashader Rasterizing#

# Cell hidden on the website (hide-cell in tags)
from holoviews.operation.resample import ResampleOperation2D
ResampleOperation2D.width=1200
ResampleOperation2D.height=500

Principles of Datashader#

While WebGL and LTTB both send individual data points to the web browser, Datashader rasterizing offers a fundamentally different approach to visualizing large datasets. Datashader operates by generating a fixed-size 2D binned array tailored to your screen’s resolution during each zoom or pan event. In this array, each bin aggregates data points from its corresponding location, effectively creating a 2D histogram. So, instead of transmitting the entire dataset, only this optimized array is sent to the web browser, thereby displaying all relevant data at the current zoom level and facilitating the visualization of the largest datasets.
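The fixed-size binning can be approximated with NumPy alone. The sketch below is a simplification: it bins individual points with numpy.histogram2d, whereas Datashader rasterizes the line segments between points. The helper name rasterize_points and the synthetic random-walk data are assumptions made for the example. What it shows is that the array sent to the browser has the same shape regardless of how many points go in.

```python
import numpy as np


def rasterize_points(x, y, width=1200, height=500):
    """Aggregate points into a fixed (width, height) grid of counts."""
    counts, _, _ = np.histogram2d(x, y, bins=(width, height))
    return counts


rng = np.random.default_rng(0)
shapes, totals = [], []
for n in (10_000, 1_000_000):
    x = np.linspace(0.0, 1.0, n)
    y = np.cumsum(rng.standard_normal(n))  # synthetic random walk
    counts = rasterize_points(x, y)
    shapes.append(counts.shape)
    totals.append(int(counts.sum()))
    print(n, counts.shape, int(counts.sum()))
```

Every input point lands in exactly one bin, so the counts sum to the number of points while the output stays at 1200 x 500 cells.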

❗ A couple of important details: ❗

  1. As with LTTB downsampling, Datashader rasterization dynamically depends on Python and, therefore, won’t update as you zoom in on our website. If you are locally running this notebook with a live Python process, the plot will automatically update with additional detail as you zoom in.

  2. Setting line_width to be greater than 0 activates anti-aliasing, smoothing the visual representation of lines that might otherwise look too pixelated.

Single Line Example#

Activating Datashader rasterization for a single large timeseries curve in hvPlot is as simple as setting rasterize=True!

Note: When plotting a single curve, the default behavior is to flatten the count in each pixel to better match the appearance of plotting a line without Datashader rasterization (see the relevant PR for details). To restore these pixel count aggregations, import Datashader (import datashader as ds) and pass hvPlot a count aggregator with self-intersection enabled (aggregator=ds.count(self_intersect=True)).

df0.hvplot(x="time", y="value", rasterize=True, cnorm='eq_hist', padding=(0, 0.1),
           min_height=300, autorange='y', title="Datashader Rasterize", colorbar=False, line_width=2)

Rasterize Conditionally#

Alternatively, it’s possible to activate rasterize conditionally with resample_when.

When the number of individual data points present in the current zoom range is below the provided threshold, the raw plot is displayed; otherwise the rasterize, datashade, or downsample operation is applied.
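The decision described above amounts to counting the points in the current view and comparing against the threshold. A minimal sketch of that logic follows; the function choose_rendering is a hypothetical helper, and the actual HoloViews implementation may handle boundaries differently.

```python
from bisect import bisect_left, bisect_right


def choose_rendering(sorted_times, view_start, view_end, threshold):
    """Return "raw" if fewer than `threshold` points are in view, else "resampled"."""
    visible = bisect_right(sorted_times, view_end) - bisect_left(sorted_times, view_start)
    return "raw" if visible < threshold else "resampled"


times = list(range(10_000))  # synthetic, evenly spaced, sorted timestamps

zoomed_out = choose_rendering(times, 0, 9_999, threshold=1000)
zoomed_in = choose_rendering(times, 100, 599, threshold=1000)
print(zoomed_out)  # resampled (10,000 points in view)
print(zoomed_in)   # raw (500 points in view)
```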

df0.hvplot(x="time", y="value", rasterize=True, resample_when=1000, cnorm='eq_hist', padding=(0, 0.1),
           min_height=300, autorange='y', title="Datashader Rasterize", colorbar=False, line_width=2)

Multiple Categories Example#

For data with a line for each of several “categories” (sensors, in this case), Datashader can assign a different color to each of the sensor categories. The resulting image then blends these colors where data overlaps, providing visual cues for areas with high category intersection. This is particularly useful for datasets with multiple data series:

df.hvplot(x="time", y="value", rasterize=True, hover=True, padding=(0, 0.1), min_height=300,
          by='sensor', title="Datashader Rasterize Categories", line_width=2, colorbar=False, cmap='glasbey')

When you’re zoomed out, Datashader’s effectiveness is apparent. The image it creates reveals the overall data distribution and patterns, with color and intensity showing areas of higher data concentration - where lines cross through the same pixel. Datashader rendering can therefore provide a good overview of the full shape of a long timeseries, helping you understand how the signal varies even when the variations involved are smaller than the pixels on the screen.
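Conceptually, the blending of category colors can be thought of as a count-weighted average per pixel. The sketch below uses a tiny hand-made count array and two arbitrary colors; it illustrates the idea only and is not Datashader's actual shading code.

```python
import numpy as np

# Per-category counts for a tiny 1x3 "image": shape (rows, cols, categories).
counts = np.array([[[4, 0], [2, 2], [0, 0]]], dtype=float)
colors = np.array([[255, 0, 0], [0, 0, 255]], dtype=float)  # red, blue

# Fraction of each pixel's count that belongs to each category
# (pixels with no data keep a weight of zero).
total = counts.sum(axis=-1, keepdims=True)
weights = np.divide(counts, total, out=np.zeros_like(counts), where=total > 0)

# Weighted mix of the category colors for every pixel.
rgb = weights @ colors
print(rgb[0])
```

The first pixel contains only the first category and stays pure red, the second contains both categories equally and becomes an even mix, and the empty third pixel stays black.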

Multiple Lines Per Category Example#

Plotting hundreds or thousands of overlapping timeseries snippets relative to a set of events is important in domains like finance, sensor monitoring, and neuroscience. In neuroscience, for example, this approach is used to reveal distinct patterns across action potential waveforms from different neurons. Let’s load a dataset of neural waveforms:

waves = pd.read_parquet("https://datasets.holoviz.org/waveform/v1/waveforms.parq")
print(len(waves))
waves.head(2)
509949
   Time  Amplitude  Waveform  Neuron
0  0.25  -6.363322         0      15
1  0.26  -6.921164         0      15

This dataset contains numerous neural waveform snippets. To grasp its structure, we examine the length of each waveform and count of waveforms per neuron:

first_waveform = waves[(waves['Neuron'] == waves['Neuron'].unique()[0]) & (waves['Waveform'] == 0)]
print(f'Number of samples per waveform: {len(first_waveform)}')
waves.groupby('Neuron')['Waveform'].nunique().reset_index().rename(columns={'Waveform': '# Waveforms'})
Number of samples per waveform: 187
   Neuron  # Waveforms
0      15          492
1      17         1092
2       8         1143

With a substantial number of waveforms and multiple categories (neurons), the density of data can make it difficult to accurately visualize patterns in the data. We can utilize hvPlot and Datashader, but there is currently one caveat: each waveform must be distinctly separated in the dataframe with a row containing NaN to effectively separate one waveform from another and still color by neuron with Datashader. This ensures each waveform is treated as an individual entity, avoiding misleading connections between the end of one waveform and the start of the next. Below, we can see one of these NaN rows at the end of the first waveform.

first_waveform.tail(3)
     Time  Amplitude  Waveform  Neuron
184  2.09   0.334572         0      15
185  2.10  -0.038350         0      15
186   NaN        NaN         0      15

Note: Work is planned to avoid having to prepare your dataset with NaN-separators. Stay tuned!
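If your own data does not yet contain these separators, one possible way to add them with pandas is sketched below. The helper add_nan_separators and the tiny synthetic DataFrame are assumptions for illustration; the column names mirror the waveform dataset above. After each (Neuron, Waveform) group, it appends a copy of the group's last row with the plotted columns set to NaN.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in: two waveforms of three samples each, no separators.
raw = pd.DataFrame({
    "Time": [0.0, 0.1, 0.2, 0.0, 0.1, 0.2],
    "Amplitude": [1.0, 2.0, 1.5, -1.0, -2.0, -1.5],
    "Waveform": [0, 0, 0, 1, 1, 1],
    "Neuron": [15, 15, 15, 15, 15, 15],
})


def add_nan_separators(df, group_cols=("Neuron", "Waveform"),
                       value_cols=("Time", "Amplitude")):
    """Append one NaN row after each group so lines are not joined across groups."""
    pieces = []
    for _, group in df.groupby(list(group_cols), sort=False):
        sep = group.iloc[[-1]].copy()   # copy the last row to keep the group labels
        sep[list(value_cols)] = np.nan  # blank out the plotted columns
        pieces.extend([group, sep])
    return pd.concat(pieces, ignore_index=True)


separated = add_nan_separators(raw)
print(separated)
```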

With the NaN-separators already in place, all we need to do is specify that hvPlot should color by neuron and apply Datashader shading with datashade=True:

waves.hvplot.line('Time', 'Amplitude', by='Neuron', hover=True, datashade=True,
                  xlabel='Time (ms)', ylabel='Amplitude (µV)', min_height=300,
                  title="Datashade Multiple Lines Per Category", line_width=1)

Datashader’s approach, while comprehensive for large timeseries data, focuses on the entire dataset’s view at a specific resolution. To explore data across different timescales, particularly when dealing with years of data but focusing on shorter intervals like a day or an hour, the next “minimap” approach offers an effective solution.

Minimap Contextualizing#

Minimap Overview#

Minimap introduces a way to visualize and navigate through extensive time ranges in your dataset. It allows you to maintain awareness of the larger context while focusing on a specific, smaller time range. This technique is particularly useful when dealing with timeseries data that span long durations but require detailed study of shorter intervals.

Implementing Minimap#

To create a minimap, we use the HoloViews RangeToolLink, which links a main plot to a smaller overview plot. The smaller minimap plot provides a fixed, broad view of the data, and the main plot can be used for detailed examination. Note that we also apply Datashader rasterization (rasterize=True) to both the main plot and the minimap to limit the data sent to the browser.

from holoviews.plotting.links import RangeToolLink

plot = df0.hvplot(x="time", y="value", rasterize=True, color='darkblue', line_width=2,
                  min_height=300, colorbar=False, ylim=(-9, 3), # optional: set initial y-range
                  xlim=(pd.Timestamp("2023-03-10"), pd.Timestamp("2023-04-10")), # optional: set initial x-range
                  ).opts(
    backend_opts={
        "x_range.bounds": (df0.time.min(), df0.time.max()), # optional: limit max viewable x-extent to data
        "y_range.bounds": (df0.value.min()-1, df0.value.max()+1), # optional: limit max viewable y-extent to data
    }
)

minimap = df0.hvplot(x="time", y="value", height=150, padding=(0, 0.1), rasterize=True,
                     color='darkblue', colorbar=False, line_width=2).opts(toolbar='disable')

link = RangeToolLink(minimap, plot, axes=["x", "y"])

(plot + minimap).opts(shared_axes=False).cols(1)

In this setup, you can interact with the minimap by dragging the grey selection box. The main plot above will update to reflect the selected range, allowing you to explore extensive datasets while focusing on specific segments.

Here, we also demonstrate the use of backend_opts to configure properties of the Bokeh plotting library that are not yet exposed as HoloViews/hvPlot options. By setting hard outer limits on the plot’s panning/zooming, we ensure that the view remains within the data’s range, enhancing the user experience.

Future Improvements#

As we look to the future, our roadmap includes several exciting enhancements. A significant focus is to enrich Datashader inspections by incorporating rich hover tooltips for Datashader images. This addition will greatly enhance the data exploration experience, allowing users to access detailed information more intuitively.

Additionally, we are working towards a more streamlined process for plotting multiple overlapping lines. Our goal is to evolve the current approach, eliminating the need for inserting NaN rows as separators in the data structure. This improvement will simplify data preparation, making the visualization of complex timeseries more accessible and user-friendly.

This web page was generated from a Jupyter notebook and not all interactivity will work on this website.