Loading data with sources¶
Sources load data into your dashboard from files, databases, or APIs.
Source fundamentals¶
Every Lumen dashboard needs at least one source. Sources define where data comes from and how to access it.
Basic source syntax¶
sources:
source_name: # Choose any name
type: source_type # file, duckdb, intake, rest, etc.
# Additional parameters depend on type
Available source types¶
| Type | Purpose | Best for |
|---|---|---|
| `file` | CSV, Excel, Parquet, JSON files | Local or remote file data |
| `duckdb` | DuckDB SQL queries | SQL-based data access |
| `intake` | Intake catalog entries | Data catalogs |
| `rest` | REST API endpoints | API data |
| `live` | Live website status checks | Website monitoring |
This guide focuses on file sources, the most common type. See Lumen's source reference for other types.
File sources¶
FileSource loads data from files in various formats:
- CSV - Comma-separated values
- Excel - XLSX and XLS files
- Parquet - Columnar storage format
- JSON - JavaScript Object Notation
Local files¶
Load files from your local filesystem:
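A minimal sketch, assuming a hypothetical CSV file at data/sales.csv relative to the YAML specification:

sources:
  local_data:
    type: file
    tables:
      sales: data/sales.csv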
File paths can be relative (to your YAML file) or absolute.
Remote files¶
Load files from URLs:
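For example, pointing at the penguins dataset URL used later in this guide:

sources:
  remote_data:
    type: file
    tables:
      penguins: https://datasets.holoviz.org/penguins/v1/penguins.csv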
Remote files load over HTTP/HTTPS. Lumen downloads them when needed.
Multiple tables¶
Sources can provide multiple tables:
sources:
sales_data:
type: file
tables:
customers: data/customers.csv
orders: data/orders.csv
products: data/products.csv
Reference each table independently in pipelines or layouts:
pipelines:
customer_pipeline:
source: sales_data
table: customers # Uses the customers table
order_pipeline:
source: sales_data
table: orders # Uses the orders table
File format options¶
Pass format-specific options using the kwargs parameter:
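A brief sketch, assuming a hypothetical Excel file report.xlsx with a sheet named Summary:

sources:
  excel_data:
    type: file
    tables:
      report: report.xlsx
    kwargs:
      sheet_name: Summary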
These kwargs pass directly to pandas read functions (read_csv, read_excel, etc.).
Caching data¶
Caching saves remote data locally to speed up dashboard loading. This is especially useful for large files.
Basic caching¶
Enable caching with the cache_dir parameter:
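A minimal sketch, reusing the penguins dataset URL from later in this guide and a hypothetical cache directory:

sources:
  cached_data:
    type: file
    cache_dir: cache
    tables:
      penguins: https://datasets.holoviz.org/penguins/v1/penguins.csv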
First load timing
Initial load may take several minutes for large files. Subsequent loads are much faster using the cached version.
How caching works¶
- First load: Lumen downloads the file and saves it to cache_dir as Parquet
- Subsequent loads: Lumen reads from the local cache instead of downloading again
- Cache structure: Lumen creates the cache directory if it doesn't exist
Cache files are stored as:
- {cache_dir}/{table_name}.parq - Parquet data file
- {cache_dir}/{table_name}.json - Schema metadata
Cache strategies¶
Sources support different caching strategies via cache_per_query:
| Strategy | Behavior | Best for |
|---|---|---|
| `cache_per_query: false` | Cache entire table | Small to medium datasets |
| `cache_per_query: true` | Cache each filtered query separately | Large datasets with many filters |
Example with per-query caching:
sources:
large_db:
type: duckdb
uri: large_database.db
cache_dir: cache
cache_per_query: true # Cache filtered results separately
This is useful when users filter data in many different ways—each unique filter combination caches separately.
Pre-caching¶
Pre-populate caches with expected queries to eliminate initial loading delays.
Pre-cache configurations can be defined in two forms:
Form 1: Cross-product of values
sources:
weather_data:
type: file
cache_dir: cache
cache_per_query: true
tables:
weather: weather.csv
pre_cache:
filters:
region: [north, south, east, west]
year: [2020, 2021, 2022, 2023]
This creates 16 cache entries (4 regions × 4 years).
Form 2: Explicit combinations
sources:
weather_data:
type: file
cache_dir: cache
cache_per_query: true
tables:
weather: weather.csv
pre_cache:
- filters:
region: north
year: 2023
- filters:
region: south
year: 2023
This creates only 2 specific cache entries.
To populate caches, initialize a pipeline and trigger loading:
from lumen.pipeline import Pipeline
pipeline = Pipeline.from_spec({
"source": {
"type": "file",
"cache_dir": "cache",
"cache_per_query": True,
"tables": {"weather": "weather.csv"},
"pre_cache": {
"filters": {
"region": ["north", "south"],
"year": [2022, 2023]
}
}
}
})
# Trigger cache population
pipeline.populate_cache()
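Since pre-caching exists to remove that first-load delay, a natural pattern is to run a snippet like this once, before serving the dashboard, so the configured cache entries are already on disk when users arrive.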
Source parameters reference¶
Common parameters¶
These parameters work with all source types:
| Parameter | Type | Purpose |
|---|---|---|
| `type` | string | Source type (required) |
| `tables` | dict | Mapping of table names to file paths/URLs |
| `cache_dir` | string | Directory for caching data |
| `cache_per_query` | boolean | Cache strategy (table vs query) |
| `kwargs` | dict | Format-specific options |
FileSource-specific parameters¶
| Parameter | Type | Purpose | Default |
|---|---|---|---|
| `tables` | dict | Table name → file path mapping | Required |
| `root` | string | Root directory for relative paths | Current directory |
| `kwargs` | dict | Pandas read function parameters | `{}` |
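The root parameter does not appear in the other examples in this guide; a minimal sketch, assuming a hypothetical data/ directory containing the CSV files:

sources:
  local_data:
    type: file
    root: data
    tables:
      customers: customers.csv
      orders: orders.csv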
Common patterns¶
Simple local file¶
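A minimal sketch, assuming a hypothetical data.csv next to the YAML file:

sources:
  local_data:
    type: file
    tables:
      data: data.csv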
Multiple remote files with caching¶
sources:
datasets:
type: file
cache_dir: .cache
tables:
penguins: https://datasets.holoviz.org/penguins/v1/penguins.csv
iris: https://datasets.holoviz.org/iris/v1/iris.csv
stocks: https://datasets.holoviz.org/stocks/v1/stocks.csv
CSV with custom parsing¶
sources:
custom_csv:
type: file
tables:
data: data.csv
kwargs:
sep: "|" # Pipe-separated
parse_dates: [date] # Parse date column
dtype:
id: str # Force ID as string
value: float
Large file with aggressive caching¶
sources:
big_data:
type: file
cache_dir: cache
cache_per_query: true
tables:
large: https://example.com/huge_dataset.parq
kwargs:
engine: fastparquet
pre_cache:
filters:
category: [A, B, C]
year: [2020, 2021, 2022]
Next steps¶
Now that you can load data:
- Pipelines guide - Filter and transform your data
- Views guide - Visualize your data
- Variables guide - Make sources dynamic with variables