# Vector Stores
Store and search text using semantic similarity.
## Quick start

In-memory vector store:

```python
from lumen.ai.vector_store import NumpyVectorStore

vector_store = NumpyVectorStore()
await vector_store.add_file('documentation.pdf')
results = await vector_store.query('authentication setup', top_k=3)
```

Persistent storage with DuckDB:

```python
from lumen.ai.vector_store import DuckDBVectorStore

vector_store = DuckDBVectorStore(uri='embeddings.db')
await vector_store.add_file('documentation.pdf')
```
See Embeddings for configuring how text is converted to vectors.
## Store types

### NumpyVectorStore

In-memory storage using NumPy arrays.
- ✅ Fast, simple
- ⚠️ Data lost on restart
- Best for: Development, testing, small datasets
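Conceptually, an in-memory store keeps every embedding in an array and compares the query vector against all of them at search time. A toy sketch of the idea (not Lumen's implementation; 3-dimensional lists stand in for real embeddings):

```python
import math

class TinyVectorStore:
    """Toy in-memory store: everything lives in Python lists."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        self.texts.append(text)
        self.vectors.append(vector)

    def query(self, vector, top_k=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = math.sqrt(sum(x * x for x in a))
            norm_b = math.sqrt(sum(x * x for x in b))
            return dot / (norm_a * norm_b)

        # Score every stored vector and keep the closest matches
        scored = sorted(
            ((cosine(vector, v), t) for v, t in zip(self.vectors, self.texts)),
            reverse=True,
        )
        return scored[:top_k]

store = TinyVectorStore()
store.add('auth guide', [1.0, 0.0, 0.0])
store.add('billing faq', [0.0, 1.0, 0.0])
best_score, best_text = store.query([0.9, 0.1, 0.0], top_k=1)[0]
print(best_text)  # 'auth guide' is the closest match
```

Because every query scans all vectors, this brute-force approach is fast for small collections but degrades as the store grows, which is why persistent backends add an index.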
### DuckDBVectorStore

Persistent storage with HNSW indexing.

```python
from lumen.ai.vector_store import DuckDBVectorStore

vector_store = DuckDBVectorStore(uri='embeddings.db')
```
- ✅ Persists on disk
- ✅ Scales to millions of documents
- ✅ Fast similarity search
- Best for: Production, large datasets
## Adding documents

### Add files

```python
await vector_store.add_file('documentation.pdf')
await vector_store.add_file('guide.md')
await vector_store.add_file('https://example.com/page')  # URLs work too
```
### Add directories

```python
await vector_store.add_directory(
    'docs/',
    pattern='*.md',                   # Only markdown files
    exclude_patterns=['**/draft/*'],  # Skip drafts
    max_concurrent=10                 # Process 10 files at once
)
```
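The pattern and exclude behavior can be pictured as glob matching over the directory's files. A rough standard-library sketch (the exact matching rules Lumen applies may differ):

```python
import fnmatch
import posixpath

def select_files(paths, pattern, exclude_patterns=()):
    """Keep paths whose basename matches `pattern` and whose full path
    matches none of `exclude_patterns`. Note fnmatch's `*` also crosses
    `/`, so '**/draft/*' excludes any path inside a draft directory."""
    return [
        p for p in paths
        if fnmatch.fnmatch(posixpath.basename(p), pattern)
        and not any(fnmatch.fnmatch(p, ex) for ex in exclude_patterns)
    ]

paths = ['docs/intro.md', 'docs/draft/todo.md', 'docs/data.csv']
print(select_files(paths, '*.md', ['**/draft/*']))  # ['docs/intro.md']
```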
### Add text

```python
await vector_store.add([
    {
        'text': 'Lumen is a data exploration framework.',
        'metadata': {'source': 'intro', 'category': 'overview'}
    }
])
```
## Searching

### Semantic search

```python
results = await vector_store.query(
    'How do I authenticate users?',
    top_k=5,        # Return the top 5 results
    threshold=0.3   # Minimum similarity score
)

for result in results:
    print(f"{result['similarity']:.2f}: {result['text']}")
```
Similarity is powered by embeddings - see Embeddings - Providers for quality options.
### Filter by metadata

```python
results = await vector_store.query(
    'authentication',
    filters={'category': 'security', 'version': '2.0'}
)
```
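Metadata filtering can be thought of as an exact-match pass over each candidate's metadata before results are ranked. A hypothetical sketch of that check:

```python
def matches(metadata, filters):
    """True when every filter key is present with exactly that value."""
    return all(metadata.get(k) == v for k, v in filters.items())

docs = [
    {'text': 'OAuth2 setup', 'metadata': {'category': 'security', 'version': '2.0'}},
    {'text': 'Legacy auth', 'metadata': {'category': 'security', 'version': '1.0'}},
    {'text': 'Color themes', 'metadata': {'category': 'ui', 'version': '2.0'}},
]

filters = {'category': 'security', 'version': '2.0'}
hits = [d for d in docs if matches(d['metadata'], filters)]
print([d['text'] for d in hits])  # ['OAuth2 setup']
```

All filter keys must match: a document missing a key, or holding a different value, is dropped.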
### Exact metadata lookup
### Upsert (prevent duplicates)

```python
# First call - adds the new item
await vector_store.upsert([
    {'text': 'Hello world', 'metadata': {'source': 'greeting'}}
])

# Second call - skips it (already exists)
await vector_store.upsert([
    {'text': 'Hello world', 'metadata': {'source': 'greeting'}}
])
```

Use `upsert()` when reprocessing documents that may not have changed.
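One common way to make upserts idempotent, sketched here with stand-in names rather than Lumen's actual duplicate-detection logic, is to key each item by a hash of its text and metadata:

```python
import hashlib
import json

index = {}  # key → item; stands in for the store's contents

def item_key(item):
    """Deterministic key derived from the item's text and metadata."""
    payload = json.dumps(
        {'text': item['text'], 'metadata': item.get('metadata', {})},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def upsert(items):
    """Add only items whose content has not been seen before."""
    added = 0
    for item in items:
        key = item_key(item)
        if key not in index:
            index[key] = item
            added += 1
    return added

item = {'text': 'Hello world', 'metadata': {'source': 'greeting'}}
print(upsert([item]))  # 1: new item added
print(upsert([item]))  # 0: identical content, skipped
```

The hash is stable across runs, so reprocessing the same document re-derives the same keys and skips the unchanged items.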
## Management

```python
# Delete by ID
vector_store.delete([1, 2, 3])

# Clear everything
vector_store.clear()

# Count documents
num_docs = len(vector_store)
```
## Contextual augmentation (situate)

Add context descriptions to chunks:

```python
vector_store = DuckDBVectorStore(
    situate=True,  # Generate context for each chunk
)

await vector_store.add_file('long_document.pdf')
```
Each chunk gets context like:
"This section discusses OAuth2 authentication. It follows the introduction and references token refresh mechanisms."
When to use:
- ✅ Long technical documents, books, research papers
- ✅ Documents with forward/backward references
- ❌ Short documents, FAQs, independent chunks
Requires an LLM to generate context - see LLM Providers for configuration.
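Under the hood, contextual augmentation amounts to asking an LLM to describe each chunk's place in its document and prepending that description before the chunk is embedded. A rough sketch with a stand-in for the LLM call (`describe_chunk` and `situate_chunks` are hypothetical names, not Lumen APIs):

```python
def describe_chunk(document_title, index, total):
    """Stand-in for an LLM call that situates a chunk within its document."""
    return f"Chunk {index + 1} of {total} from '{document_title}'."

def situate_chunks(document_title, chunks):
    """Prepend each chunk's generated context before it is embedded."""
    return [
        f"{describe_chunk(document_title, i, len(chunks))} {chunk}"
        for i, chunk in enumerate(chunks)
    ]

chunks = ['OAuth2 uses bearer tokens.', 'Tokens expire after one hour.']
situated = situate_chunks('Auth Guide', chunks)
print(situated[0])
```

Because the context travels with the chunk into the embedding, a query like "token expiry in the auth guide" can match a chunk that never mentions the document it came from.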
## Integration with Lumen AI

### Document search

```python
import lumen.ai as lmai
from lumen.ai.vector_store import DuckDBVectorStore

vector_store = DuckDBVectorStore(uri='docs.db')
await vector_store.add_directory('documentation/')

ui = lmai.ExplorerUI(
    data='penguins.csv',
    vector_store=vector_store
)
ui.servable()
```
Users can now ask questions about uploaded documents. See Tools - DocumentLookup for how this works.
### Table discovery

```python
from lumen.ai.tools import IterativeTableLookup

tool = IterativeTableLookup(
    vector_store=vector_store,
    tables=['customers', 'orders', 'products']
)
```
See Tools - IterativeTableLookup for details.
## Configuration

### Custom embeddings

```python
from lumen.ai.embeddings import HuggingFaceEmbeddings
from lumen.ai.vector_store import DuckDBVectorStore

vector_store = DuckDBVectorStore(
    embeddings=HuggingFaceEmbeddings(model="BAAI/bge-small-en-v1.5")
)
```
See Embeddings - Providers for all embedding options.
### Read-only mode

### Chunk size
See Embeddings - Chunk size for chunking strategies.
## Best practices

Choose the right store:

- Development → `NumpyVectorStore`
- Production → `DuckDBVectorStore`

Optimize the similarity threshold:

- Exploratory → `threshold=0.3`
- Precise → `threshold=0.5`
- Very strict → `threshold=0.7`

Use upsert for idempotency:

- Reprocessing existing content → `upsert()`
- Adding new content → `add()`
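The threshold is simply a cutoff applied to similarity scores before results are returned, so raising it trades recall for precision. A small illustration with made-up scores:

```python
results = [
    {'text': 'OAuth2 setup', 'similarity': 0.72},
    {'text': 'Token refresh', 'similarity': 0.41},
    {'text': 'Color themes', 'similarity': 0.12},
]

def apply_threshold(results, threshold):
    """Drop results scoring below the similarity cutoff."""
    return [r for r in results if r['similarity'] >= threshold]

print(len(apply_threshold(results, 0.3)))  # 2: exploratory keeps more
print(len(apply_threshold(results, 0.7)))  # 1: very strict keeps only close matches
```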
## See also
- Embeddings - Configure how text is converted to vectors
- Tools - Built-in tools that use vector stores
- Agents - Agents that leverage document search
- LLM Providers - Required for situate feature