Non-geographical Analysis#
Most of the datashader examples use geographic data, because it is so easily interpreted, but datashading can help you explore any data dimensions. Here let’s start by plotting trip_distance versus fare_amount for the 12-million-point NYC taxi dataset from nyc_taxi.ipynb.
import hvplot.dask # noqa
from holoviews import opts
opts.defaults(
    opts.Scatter(width=800, height=500, color='blue'),
    opts.RGB(width=800, height=500),
    opts.Curve(width=800))
Load NYC Taxi data#
These data have been transformed from the original database to a parquet file. It should take about 5 seconds to load (compared to 10-20 seconds when stored in the inefficient CSV file format).
import dask.dataframe as dd
usecols = ['trip_distance','fare_amount','tip_amount','passenger_count']
%%time
df = dd.read_parquet('data/nyc_taxi_wide.parq', engine='fastparquet')[usecols].persist()
CPU times: user 456 ms, sys: 123 ms, total: 579 ms
Wall time: 579 ms
df.tail()
|  | trip_distance | fare_amount | tip_amount | passenger_count |
|---|---|---|---|---|
| 11842089 | 1.0 | 5.5 | 1.25 | 2 |
| 11842090 | 0.8 | 6.0 | 2.00 | 2 |
| 11842091 | 3.4 | 13.5 | 0.00 | 1 |
| 11842092 | 1.3 | 10.5 | 2.25 | 1 |
| 11842093 | 0.7 | 5.5 | 0.00 | 1 |
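The introduction above quotes a 12-million-point dataset; as a quick sanity check (a minimal sketch reusing the same persisted df), you can count the rows directly:
# Row count of the persisted Dask DataFrame; len() triggers the computation
print(len(df))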
1,000 points reveal the expected linear relationship#
samples = df.sample(frac=1e-4)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=5)
10,000 points show more detailed, systematic patterns in fares and distances#
Perhaps there are different metering options, along with granularity in how times and fares are counted; in any case, the distances and fares do not uniformly populate any region of this space:
samples = df.sample(frac=1e-3)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=1)
Datashader reveals additional detail, especially when zooming in#
You can now see that there are a lot of points below the linear boundary, representing long trips for very little cost (presumably GPS errors?).
df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, cnorm='eq_hist', dynspread=True,
threshold=1, max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))
Here we’re using a histogram-equalized color mapping function (cnorm='eq_hist') to reveal density differences across this space. If we use the default linear mapping instead, we can mainly see that there are a lot of values near the origin, while all the rest are colored with the same minimum color (light blue by default):
df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, dynspread=True, threshold=1,
max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))
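Between those two extremes, a logarithmic color normalization (cnorm='log', also used for the passenger-count plot below) is another option worth trying; a minimal sketch of the same plot with that setting:
# Same scatter as above, but with logarithmic color normalization instead of eq_hist
df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, cnorm='log', dynspread=True,
                  threshold=1, max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))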
Fares are discretized to the nearest 50 cents, making patterns less visible, but there is both an upward trend in tips as fares increase (as expected) and, more surprisingly, a large number of tips higher than the fare itself:
df.hvplot.scatter('tip_amount', 'fare_amount', rasterize=True, cnorm='eq_hist', dynspread=True,
threshold=1, max_px=1, xlabel='Tip, $', ylabel='Fare, $', xlim=(0,25), ylim=(0,20))
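To quantify how often that happens, a quick sketch (reusing the same df) computes the fraction of trips whose recorded tip exceeds the fare:
# Fraction of all trips where the tip is larger than the fare itself
(df.tip_amount > df.fare_amount).mean().compute()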
Interestingly, tips go down when the number of passengers is greater than 1:
df.hvplot.scatter('passenger_count', 'tip_amount', rasterize=True, cnorm='log', x_sampling=0.5,
y_sampling=0.5, xlabel='Passengers', ylabel='Tip, $', xlim=(-0.5, 6.5), ylim=(0, 60))
Here we’ve reduced the resolution along the x axis (x_sampling=0.5) so that, instead of isolated points for this inherently discrete data, you can see more-visible horizontal line segments.
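The same relationship can also be checked numerically rather than visually; a minimal aggregation sketch (mean tip per passenger count, using the same df):
# Mean tip grouped by passenger count; .compute() runs the Dask aggregation
df.groupby('passenger_count').tip_amount.mean().compute()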
The above plots use the hvPlot library, which builds Bokeh, Plotly, and Matplotlib plots from high-level specifications.
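Switching backends is a one-line change; a minimal sketch (assuming the Matplotlib backend is installed):
import hvplot
import hvplot.dask  # noqa  (registers the .hvplot accessor on Dask objects)
hvplot.extension('matplotlib')  # or 'plotly'; the default backend is 'bokeh'
df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, cnorm='eq_hist',
                  xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))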