# Data Sources

Connect Lumen to files, databases, or data warehouses.
## Quick start

Point Lumen at CSV, Parquet, or JSON files, loaded from local paths or URLs:

```python
import lumen.ai as lmai

ui = lmai.ExplorerUI(data=['penguins.csv', 'earthquakes.parquet'])
ui.servable()
```
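Save this to a script (e.g. `app.py`) and launch it with `panel serve app.py`; calling `servable()` marks the UI so Panel's server can display it.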
## Supported sources

| Source | Use it for |
|---|---|
| Files | CSV, Parquet, JSON (local or URL) |
| DuckDB | Local SQL queries on files |
| Snowflake | Cloud data warehouse |
| BigQuery | Google's data warehouse |
| PostgreSQL | PostgreSQL via SQLAlchemy |
| MySQL | MySQL via SQLAlchemy |
| SQLite | SQLite via SQLAlchemy |
| Oracle | Oracle via SQLAlchemy |
| MSSQL | Microsoft SQL Server via SQLAlchemy |
| Intake | Data catalogs |
## Database connections

### Snowflake

```python
import lumen.ai as lmai
from lumen.sources.snowflake import SnowflakeSource

source = SnowflakeSource(
    account='your-account',
    database='your-database',
    authenticator='externalbrowser',  # SSO
)

ui = lmai.ExplorerUI(data=source)
ui.servable()
```
Authentication options:

- `authenticator='externalbrowser'` - SSO (recommended)
- `authenticator='snowflake'` - username/password (needs `password=`; see the sketch below)
- `authenticator='oauth'` - OAuth token (needs `token=`)
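For username/password authentication, a minimal sketch (assuming the source accepts a `user` parameter alongside `password`, mirroring the Snowflake connector; the environment variable name is illustrative):

```python
import os

from lumen.sources.snowflake import SnowflakeSource

# Username/password login; `user` mirrors the Snowflake connector's
# parameter name (an assumption) and the password comes from the
# environment rather than source code.
source = SnowflakeSource(
    account='your-account',
    database='your-database',
    authenticator='snowflake',
    user='your-username',
    password=os.environ['SNOWFLAKE_PASSWORD'],
)
```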
Select specific tables:

```python
source = SnowflakeSource(
    account='your-account',
    database='your-database',
    tables=['CUSTOMERS', 'ORDERS']
)
```
### BigQuery

```python
import lumen.ai as lmai
from lumen.sources.bigquery import BigQuerySource

source = BigQuerySource(
    project_id='your-project-id',
    tables=['dataset.table1', 'dataset.table2']
)

ui = lmai.ExplorerUI(data=source)
ui.servable()
```
Authentication uses your Google Cloud application default credentials. Log in with the `gcloud` CLI (`gcloud auth application-default login`), or point the `GOOGLE_APPLICATION_CREDENTIALS` environment variable at a service account key file.
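For example, to select a service account key from Python before creating the source (the key path is illustrative):

```python
import os

# Google client libraries pick up credentials from this variable;
# the path is illustrative.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account.json'
```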
### PostgreSQL

```python
import lumen.ai as lmai
from lumen.sources.sqlalchemy import SQLAlchemySource

source = SQLAlchemySource(
    url='postgresql://user:password@localhost:5432/database'
)

ui = lmai.ExplorerUI(data=source)
ui.servable()
```
Or use individual parameters:

```python
source = SQLAlchemySource(
    drivername='postgresql+psycopg2',
    username='user',
    password='password',
    host='localhost',
    port=5432,
    database='mydb'
)
```
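Avoid hard-coding credentials where you can; a minimal sketch that reads the password from an environment variable (`PG_PASSWORD` is an illustrative name):

```python
import os

from lumen.sources.sqlalchemy import SQLAlchemySource

# Read the password from the environment instead of committing it;
# PG_PASSWORD is an illustrative variable name.
source = SQLAlchemySource(
    drivername='postgresql+psycopg2',
    username='user',
    password=os.environ['PG_PASSWORD'],
    host='localhost',
    port=5432,
    database='mydb'
)
```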
### MySQL

```python
from lumen.sources.sqlalchemy import SQLAlchemySource

source = SQLAlchemySource(
    url='mysql+pymysql://user:password@localhost:3306/database'
)
```
### SQLite

```python
from lumen.sources.sqlalchemy import SQLAlchemySource

source = SQLAlchemySource(url='sqlite:///data.db')
```
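If you don't have a database on hand, you can seed one from a DataFrame first (a minimal sketch using pandas and the standard-library `sqlite3` module; the file and table names are illustrative):

```python
import sqlite3

import pandas as pd

# Write a small table into a local SQLite file so Lumen has data to query.
df = pd.DataFrame({'species': ['Adelie', 'Gentoo'], 'count': [152, 124]})
with sqlite3.connect('data.db') as conn:
    df.to_sql('penguins', conn, if_exists='replace', index=False)
```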
## Advanced file handling

### DuckDB for SQL on files

Run SQL directly on CSV/Parquet files:

```python
from lumen.sources.duckdb import DuckDBSource

source = DuckDBSource(
    tables={
        'penguins': 'penguins.csv',
        'quakes': "read_csv('https://earthquake.usgs.gov/data.csv')",
    }
)
```
Load remote files:

```python
source = DuckDBSource(
    tables=['https://datasets.holoviz.org/penguins/v1/penguins.csv'],
    initializers=[
        'INSTALL httpfs;',  # required for HTTP/S3 access
        'LOAD httpfs;'
    ]
)
```
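S3 paths work the same way; a sketch assuming an illustrative bucket and DuckDB's `httpfs` S3 settings:

```python
from lumen.sources.duckdb import DuckDBSource

# The bucket, key, and region are illustrative; credentials follow
# DuckDB's httpfs configuration (e.g. SET s3_access_key_id=...).
source = DuckDBSource(
    tables=['s3://my-bucket/events.parquet'],
    initializers=[
        'INSTALL httpfs;',
        'LOAD httpfs;',
        "SET s3_region='us-east-1';",
    ]
)
```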
### Multiple sources

```python
import lumen.ai as lmai
from lumen.sources.duckdb import DuckDBSource
from lumen.sources.snowflake import SnowflakeSource

snowflake = SnowflakeSource(account='...', database='...')
local = DuckDBSource(tables=['local.csv'])

ui = lmai.ExplorerUI(data=[snowflake, local])
ui.servable()
```
### Custom table names

```python
source = DuckDBSource(
    tables={
        'customers': 'customer_data.csv',  # query as 'customers', not 'customer_data.csv'
        'orders': 'order_history.parquet',
    }
)
```
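To sanity-check the mapping before launching the UI, continue the example above and read a table directly (a sketch assuming Lumen's `Source.get_tables`/`Source.get` methods, which list table names and return a table as a DataFrame):

```python
# List the registered names and read one table back as a DataFrame.
print(source.get_tables())    # expected: ['customers', 'orders']
df = source.get('customers')  # resolves to customer_data.csv
print(df.head())
```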
## Troubleshooting

- **"Table not found"** - Table names are case-sensitive; check the exact names.
- **"Connection failed"** - Verify credentials and network access.
- **"File not found"** - Use absolute paths or URLs; relative paths resolve against the directory you run the command from.
- **Slow queries** - DuckDB on local files is fast; slowness usually comes from the remote database or network, not Lumen.
## Best practices

- **Start with files** for development; move to databases for production.
- **Use URLs** for shared datasets that don't change often.
- **Limit tables** when possible - fewer tables mean faster planning and lower LLM costs.
- **Name tables clearly** - meaningful names beat generic file names.