# Vector Stores
Store and search text using semantic similarity.
## Quick start

In-memory vector store:

```python
from lumen.ai.vector_store import NumpyVectorStore

vector_store = NumpyVectorStore()
await vector_store.add_file('documentation.pdf')
results = await vector_store.query('authentication setup', top_k=3)
```

Persistent storage with DuckDB:

```python
from lumen.ai.vector_store import DuckDBVectorStore

vector_store = DuckDBVectorStore(uri='embeddings.db')
await vector_store.add_file('documentation.pdf')
```
See Embeddings for configuring how text is converted to vectors.
## Store types

### NumpyVectorStore

In-memory storage using NumPy arrays.
- ✅ Fast, simple
- ⚠️ Data lost on restart
- Best for: Development, testing, small datasets
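Conceptually, an in-memory store keeps every embedding in an array and compares the query vector against all of them at search time. A toy sketch of the idea (not Lumen's implementation; 3-dimensional lists stand in for real embeddings):

```python
import math

class TinyVectorStore:
    """Toy in-memory store: everything lives in Python lists."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        self.texts.append(text)
        self.vectors.append(vector)

    def query(self, vector, top_k=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = math.sqrt(sum(x * x for x in a))
            norm_b = math.sqrt(sum(x * x for x in b))
            return dot / (norm_a * norm_b)

        # Score every stored vector and keep the closest matches
        scored = sorted(
            ((cosine(vector, v), t) for v, t in zip(self.vectors, self.texts)),
            reverse=True,
        )
        return scored[:top_k]

store = TinyVectorStore()
store.add('auth guide', [1.0, 0.0, 0.0])
store.add('billing faq', [0.0, 1.0, 0.0])
best_score, best_text = store.query([0.9, 0.1, 0.0], top_k=1)[0]
print(best_text)  # 'auth guide' is the closest match
```

Because every query scans all vectors, this brute-force approach is fast for small collections but degrades as the store grows, which is why persistent backends add an index.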
### DuckDBVectorStore

Persistent storage with HNSW indexing.

```python
from lumen.ai.vector_store import DuckDBVectorStore

vector_store = DuckDBVectorStore(uri='embeddings.db')
```
- ✅ Persists on disk
- ✅ Scales to millions of documents
- ✅ Fast similarity search
- Best for: Production, large datasets
## Adding documents

### Add files

```python
await vector_store.add_file('documentation.pdf')
await vector_store.add_file('guide.md')
await vector_store.add_file('https://example.com/page')  # URLs work too
```
### Add directories

```python
await vector_store.add_directory(
    'docs/',
    pattern='*.md',                   # Only markdown files
    exclude_patterns=['**/draft/*'],  # Skip drafts
    max_concurrent=10                 # Process 10 files at once
)
```
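The pattern and exclude behavior can be pictured as glob matching over the directory's files. A rough standard-library sketch (the exact matching rules Lumen applies may differ):

```python
import fnmatch
import posixpath

def select_files(paths, pattern, exclude_patterns=()):
    """Keep paths whose basename matches `pattern` and whose full path
    matches none of `exclude_patterns`. Note fnmatch's `*` also crosses
    `/`, so '**/draft/*' excludes any path inside a draft directory."""
    return [
        p for p in paths
        if fnmatch.fnmatch(posixpath.basename(p), pattern)
        and not any(fnmatch.fnmatch(p, ex) for ex in exclude_patterns)
    ]

paths = ['docs/intro.md', 'docs/draft/todo.md', 'docs/data.csv']
print(select_files(paths, '*.md', ['**/draft/*']))  # ['docs/intro.md']
```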
### Add text

```python
await vector_store.add([
    {
        'text': 'Lumen is a data exploration framework.',
        'metadata': {'source': 'intro', 'category': 'overview'}
    }
])
```
## Searching

### Semantic search

```python
results = await vector_store.query(
    'How do I authenticate users?',
    top_k=5,        # Return the top 5 results
    threshold=0.3   # Minimum similarity score
)

for result in results:
    print(f"{result['similarity']:.2f}: {result['text']}")
```
Similarity is powered by embeddings - see Embeddings - Providers for quality options.
### Filter by metadata

```python
results = await vector_store.query(
    'authentication',
    filters={'category': 'security', 'version': '2.0'}
)
```
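Metadata filtering can be thought of as an exact-match pass over each candidate's metadata before results are ranked. A hypothetical sketch of that check:

```python
def matches(metadata, filters):
    """True when every filter key is present with exactly that value."""
    return all(metadata.get(k) == v for k, v in filters.items())

docs = [
    {'text': 'OAuth2 setup', 'metadata': {'category': 'security', 'version': '2.0'}},
    {'text': 'Legacy auth', 'metadata': {'category': 'security', 'version': '1.0'}},
    {'text': 'Color themes', 'metadata': {'category': 'ui', 'version': '2.0'}},
]

filters = {'category': 'security', 'version': '2.0'}
hits = [d for d in docs if matches(d['metadata'], filters)]
print([d['text'] for d in hits])  # ['OAuth2 setup']
```

All filter keys must match: a document missing a key, or holding a different value, is dropped.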
### Exact metadata lookup
### Upsert (prevent duplicates)

```python
# First call - adds the new item
await vector_store.upsert([
    {'text': 'Hello world', 'metadata': {'source': 'greeting'}}
])

# Second call - skips it (already exists)
await vector_store.upsert([
    {'text': 'Hello world', 'metadata': {'source': 'greeting'}}
])
```

Use `upsert()` when reprocessing documents that may not have changed.
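One common way to make upserts idempotent, sketched here with stand-in names rather than Lumen's actual duplicate-detection logic, is to key each item by a hash of its text and metadata:

```python
import hashlib
import json

index = {}  # key → item; stands in for the store's contents

def item_key(item):
    """Deterministic key derived from the item's text and metadata."""
    payload = json.dumps(
        {'text': item['text'], 'metadata': item.get('metadata', {})},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def upsert(items):
    """Add only items whose content has not been seen before."""
    added = 0
    for item in items:
        key = item_key(item)
        if key not in index:
            index[key] = item
            added += 1
    return added

item = {'text': 'Hello world', 'metadata': {'source': 'greeting'}}
print(upsert([item]))  # 1: new item added
print(upsert([item]))  # 0: identical content, skipped
```

The hash is stable across runs, so reprocessing the same document re-derives the same keys and skips the unchanged items.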
## Management

```python
# Delete by ID
vector_store.delete([1, 2, 3])

# Clear everything
vector_store.clear()

# Count documents
num_docs = len(vector_store)
```
## Contextual augmentation (situate)

Add context descriptions to chunks:

```python
vector_store = DuckDBVectorStore(
    situate=True,  # Generate context for each chunk
)

await vector_store.add_file('long_document.pdf')
```
Each chunk gets context like:
"This section discusses OAuth2 authentication. It follows the introduction and references token refresh mechanisms."
When to use:
- ✅ Long technical documents, books, research papers
- ✅ Documents with forward/backward references
- ❌ Short documents, FAQs, independent chunks
Requires an LLM to generate context - see LLM Providers for configuration.
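Under the hood, contextual augmentation amounts to asking an LLM to describe each chunk's place in its document and prepending that description before the chunk is embedded. A rough sketch with a stand-in for the LLM call (`describe_chunk` and `situate_chunks` are hypothetical names, not Lumen APIs):

```python
def describe_chunk(document_title, index, total):
    """Stand-in for an LLM call that situates a chunk within its document."""
    return f"Chunk {index + 1} of {total} from '{document_title}'."

def situate_chunks(document_title, chunks):
    """Prepend each chunk's generated context before it is embedded."""
    return [
        f"{describe_chunk(document_title, i, len(chunks))} {chunk}"
        for i, chunk in enumerate(chunks)
    ]

chunks = ['OAuth2 uses bearer tokens.', 'Tokens expire after one hour.']
situated = situate_chunks('Auth Guide', chunks)
print(situated[0])
```

Because the context travels with the chunk into the embedding, a query like "token expiry in the auth guide" can match a chunk that never mentions the document it came from.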
## Integration with Lumen AI

### Document search

```python
import lumen.ai as lmai
from lumen.ai.vector_store import DuckDBVectorStore

vector_store = DuckDBVectorStore(uri='docs.db')
await vector_store.add_directory('documentation/')

ui = lmai.ExplorerUI(
    data='penguins.csv',
    vector_store=vector_store
)
ui.servable()
```
Users can now ask questions about uploaded documents. See Tools - DocumentLookup for how this works.
### Table discovery

```python
from lumen.ai.tools import IterativeTableLookup

tool = IterativeTableLookup(
    vector_store=vector_store,
    tables=['customers', 'orders', 'products']
)
```
See Tools - IterativeTableLookup for details.
## Configuration

### Custom embeddings

```python
from lumen.ai.embeddings import HuggingFaceEmbeddings
from lumen.ai.vector_store import DuckDBVectorStore

vector_store = DuckDBVectorStore(
    embeddings=HuggingFaceEmbeddings(model="BAAI/bge-small-en-v1.5")
)
```
See Embeddings - Providers for all embedding options.
### Read-only mode

### Chunk size
See Embeddings - Chunk size for chunking strategies.
## Best practices

Choose the right store:

- Development → `NumpyVectorStore`
- Production → `DuckDBVectorStore`

Optimize the similarity threshold:

- Exploratory → `threshold=0.3`
- Precise → `threshold=0.5`
- Very strict → `threshold=0.7`

Use upsert for idempotency:

- Reprocessing existing content → `upsert()`
- Adding new content → `add()`
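The threshold is simply a cutoff applied to similarity scores before results are returned, so raising it trades recall for precision. A small illustration with made-up scores:

```python
results = [
    {'text': 'OAuth2 setup', 'similarity': 0.72},
    {'text': 'Token refresh', 'similarity': 0.41},
    {'text': 'Color themes', 'similarity': 0.12},
]

def apply_threshold(results, threshold):
    """Drop results scoring below the similarity cutoff."""
    return [r for r in results if r['similarity'] >= threshold]

print(len(apply_threshold(results, 0.3)))  # 2: exploratory keeps more
print(len(apply_threshold(results, 0.7)))  # 1: very strict keeps only close matches
```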
## See also
- Embeddings - Configure how text is converted to vectors
- Tools - Built-in tools that use vector stores
- Agents - Agents that leverage document search
- LLM Providers - Required for situate feature