Dense Vectors
Dense vectors (embeddings) are numerical representations of text that capture semantic meaning.
How It Works
Text is transformed into a high-dimensional vector:
"The quick brown fox" → [0.12, -0.45, 0.78, ..., 0.23] // 1024 dimensionsSimilar texts produce similar vectors, enabling semantic search.
BGE-M3: Our Default Model
LH42 uses BGE-M3, which stands for:
- BAAI General Embedding
- Multi-lingual (100+ languages; see the sketch after this list)
- Multi-functional (dense, sparse, and multi-vector retrieval)
- Multi-granularity (sentence to document)
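To try BGE-M3 outside LH42, the upstream FlagEmbedding library publishes the model; a minimal sketch of its multilingual behavior (the `FlagEmbedding` package and model download are assumptions, not LH42 dependencies):

```python
# pip install FlagEmbedding
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3")  # downloads the weights on first use

# The same sentence in English and French should land close together.
out = model.encode(["The weather is nice today", "Il fait beau aujourd'hui"])
a, b = out["dense_vecs"]
print(a.shape)                                                 # (1024,)
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # high, e.g. > 0.8
```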
Vector Dimensions
```python
# BGE-M3 produces 1024-dimensional vectors
embedding = client.embed("Hello world")
print(len(embedding))  # 1024
```
Similarity Metrics
We use cosine similarity to compare vectors:
```
similarity = (A · B) / (||A|| × ||B||)
```
Range: -1 (opposite) to 1 (identical).
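A quick worked check of that range with toy 2-dimensional vectors (a self-contained NumPy sketch; `cosine` is an illustrative helper, not an SDK call):

```python
import numpy as np

def cosine(a, b):
    """(A · B) / (||A|| × ||B||), always in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 0], [1, 0]))   #  1.0: identical direction
print(cosine([1, 0], [0, 1]))   #  0.0: unrelated (orthogonal)
print(cosine([1, 0], [-1, 0]))  # -1.0: opposite direction
```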
Batch Embedding
For efficiency, embed multiple texts at once:
```python
texts = ["Document 1", "Document 2", "Document 3"]
embeddings = client.embed_batch(texts)
```
Custom Models
Bring your own embedding model:
```python
client = LakehouseClient(
    api_key="...",
    embedding_model="your-custom-model",
)
```
Best Practices
- Chunk appropriately - 256-512 tokens per chunk
- Include context - Add titles and metadata to chunks
- Normalize vectors - Ensures consistent similarity scores
- Cache embeddings - Avoid re-computing for the same text (both practices are sketched below)
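A minimal sketch of the last two practices, assuming the `client.embed()` call shown above; `embed_cached` is a hypothetical helper, not part of the LH42 SDK:

```python
import hashlib
import numpy as np

_cache: dict[str, np.ndarray] = {}

def embed_cached(client, text: str) -> np.ndarray:
    """Embed text once, normalize to unit length, and cache by content hash."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        vec = np.asarray(client.embed(text), dtype=np.float32)
        _cache[key] = vec / np.linalg.norm(vec)  # unit vectors: dot product == cosine
    return _cache[key]
```

With unit-length vectors, comparing two embeddings reduces to a plain dot product, which keeps similarity scores consistent across queries.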