When we designed the storage layer for Lakehouse42, we faced a critical architectural decision: use a proprietary database optimized for AI workloads, or build on an open table format. We chose Apache Iceberg, and that decision has become one of our strongest differentiators. This post explains why.
The Problem with Proprietary Knowledge Stores
Vector databases like Pinecone, Weaviate, and Qdrant -- the current generation of AI-native knowledge stores -- are impressive pieces of engineering. They deliver excellent query performance for their specific use case. But they share a set of structural limitations that become painful at enterprise scale:
Vendor lock-in -- Your embeddings, metadata, and indexes are stored in a proprietary format. Migrating to a different vendor means re-ingesting and re-embedding your entire corpus. For an enterprise with millions of documents, this is a months-long project.
Single query pattern -- Vector databases are optimized for approximate nearest neighbor search. If you also need full-text search, you add Elasticsearch. If you need analytics, you add a data warehouse. Each additional query pattern means another system, another data copy, and another integration to maintain.
Limited governance -- Most vector databases offer basic API key authentication and namespace-level isolation. They do not provide row-level security, column-level encryption, audit logging, or data lineage tracking. For regulated industries, this is a non-starter.
Opaque storage -- You cannot inspect, audit, or independently query the data stored in a proprietary vector database. If the vendor goes down or changes pricing, your data is effectively held hostage.
What Apache Iceberg Provides
Apache Iceberg is an open table format originally developed at Netflix and now an Apache Software Foundation top-level project. It defines how tabular data is stored in files (Parquet, ORC, Avro) on object storage (S3, GCS, Azure Blob, R2) with rich metadata for efficient query planning.
Open and Portable
Iceberg tables are just Parquet files on object storage with a metadata layer. Any engine that speaks Iceberg can read and write your data: Apache Spark, Apache Flink, Trino, DuckDB, Snowflake, Databricks, BigQuery, and dozens more.
This means your data outlives any single vendor or engine: the same files can serve ingestion, search, analytics, and independent ad hoc inspection without copies, exports, or re-embedding.
Schema Evolution
Real-world schemas change. You might add new metadata fields, change embedding dimensions (when you upgrade models), or add new entity types to your knowledge graph. Iceberg handles this gracefully: because columns are tracked by ID rather than by name or position, you can add, drop, rename, or reorder them without rewriting existing data files.
This is particularly important for knowledge platforms where the schema evolves as you add new extraction capabilities and data types.
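Iceberg's field-ID-based column resolution is what makes these additive changes safe. A minimal sketch of that idea, with hypothetical schema and helper names (the real format stores field IDs in Parquet metadata, not Python dicts):

```python
# Toy model of Iceberg-style schema evolution: columns are resolved by
# field ID, so files written under an older schema remain readable after
# a column is added. Schemas and row shapes here are illustrative.

OLD_SCHEMA = {1: "doc_id", 2: "content"}                    # field_id -> name
NEW_SCHEMA = {1: "doc_id", 2: "content", 3: "department"}   # column 3 added later

def read_row(raw, file_schema, table_schema):
    """Resolve columns by field ID; fields absent from old files read as None."""
    by_id = {fid: raw.get(name) for fid, name in file_schema.items()}
    return {name: by_id.get(fid) for fid, name in table_schema.items()}

# A row from a file written before the "department" column existed:
old_file_row = {"doc_id": "d1", "content": "hello"}
row = read_row(old_file_row, OLD_SCHEMA, NEW_SCHEMA)
```

Because the old file is never rewritten, the new column simply reads as null for historical rows.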
Time Travel and Versioning
Every write to an Iceberg table creates a new snapshot. Previous snapshots remain accessible, enabling rollback after a bad write, reproducible queries pinned to a specific snapshot, and audits of exactly what the data looked like at any point in the past.
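As a toy model of snapshot reads and rollback (the real mechanism tracks manifest files on object storage; this sketch just keeps row lists in memory):

```python
from dataclasses import dataclass, field

@dataclass
class SnapshotTable:
    """Append-only snapshot log mimicking Iceberg time travel (illustrative)."""
    snapshots: list = field(default_factory=list)

    def commit(self, rows):
        """Each write produces a new snapshot; prior ones stay readable."""
        base = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(base + list(rows))
        return len(self.snapshots) - 1  # snapshot id

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or time-travel to a specific one."""
        sid = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

    def rollback(self, snapshot_id):
        """Restore the table to an earlier snapshot after a bad write."""
        del self.snapshots[snapshot_id + 1:]

t = SnapshotTable()
s0 = t.commit([{"id": 1}])
s1 = t.commit([{"id": 2}])
```

Reading `t.read(s0)` after the second commit still returns only the first row, and `t.rollback(s0)` undoes the second write entirely.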
Partition Evolution
As your data grows, you need to change how it is partitioned for query efficiency. Iceberg supports partition evolution -- changing the partitioning scheme without rewriting existing data. Old data keeps its original partitioning; new data uses the new scheme. Queries transparently span both.
For knowledge management, this means you can start with a simple partition-by-organization scheme and later add date-based partitioning or content-type partitioning as query patterns evolve.
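The mechanism can be sketched as follows: each data file records the partition spec it was written under, and scan planning prunes files per spec, so old and new layouts coexist in one query. File paths, org names, and the month-based spec below are illustrative:

```python
# Toy partition evolution: files written under the original org-only spec
# (spec 0) and the evolved org+month spec (spec 1) coexist; a scan prunes
# both by the shared organization field without rewriting old data.

files = [
    {"spec_id": 0, "partition": ("acme",),           "path": "f0.parquet"},
    {"spec_id": 1, "partition": ("acme", "2024-05"), "path": "f1.parquet"},
    {"spec_id": 1, "partition": ("globex", "2024-05"), "path": "f2.parquet"},
]

def plan_scan(files, org):
    """Keep only files whose partition tuple can match the org predicate."""
    return [f["path"] for f in files if f["partition"][0] == org]

acme_files = plan_scan(files, "acme")  # transparently spans both specs
```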
Fine-Grained Access Control
With Apache Polaris (an open REST catalog for Iceberg), you get access control enforced down to the storage layer: role-based grants on catalogs, namespaces, and tables, with credential vending so each engine receives only short-lived, scoped credentials for the files it is allowed to touch.
This level of governance is table stakes for enterprise deployments and simply not available with most vector databases.
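As a toy illustration of table-level grants (this models the general idea of catalog RBAC, not the actual Apache Polaris data model; principals and table names are hypothetical):

```python
# Minimal sketch of role-based table grants: a principal may perform an
# action on a table only if an explicit grant exists. Illustrative only.

GRANTS = {
    ("analyst", "prod.chunks"):  {"SELECT"},
    ("pipeline", "prod.chunks"): {"SELECT", "INSERT"},
}

def authorize(principal, table, action):
    """Allow an action only when the (principal, table) grant includes it."""
    return action in GRANTS.get((principal, table), set())
```

The important property is default-deny: an ungranted principal gets nothing, rather than everything minus a blocklist.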
How Lakehouse42 Uses Iceberg
Our Iceberg schema has five core tables:
documents
The source-of-truth for all ingested content. Each row represents a single document with its metadata, processing status, and storage location. Partitioned by organization and ingestion date.
chunks
Document chunks with their embeddings stored as columns. The dense embedding is a 1024-element float array; the sparse embedding is stored as two parallel arrays (indices and values). This co-location means a single table scan can filter by metadata, score by BM25 (via the content column), and rank by vector similarity (via the embedding columns).
entities
Extracted entities with their types, properties, and source document references. Partitioned by organization and entity type. This table forms the node set of our knowledge graph.
relationships
Entity-to-entity relationships with typed edges and confidence scores. Combined with the entities table, this enables graph traversal queries without a separate graph database.
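A graph traversal over these two tables reduces to a breadth-first walk over relationship rows. A hedged sketch, with illustrative entity IDs, edge types, and confidence values:

```python
from collections import deque

# Toy traversal over relationship rows (src, dst, type, confidence) of the
# kind described above, without a separate graph database. Data is illustrative.

EDGES = [
    {"src": "acme", "dst": "kim",    "type": "EMPLOYS",       "confidence": 0.9},
    {"src": "kim",  "dst": "proj-x", "type": "LEADS",         "confidence": 0.8},
    {"src": "acme", "dst": "globex", "type": "PARTNERS_WITH", "confidence": 0.4},
]

def neighbors_within(start, hops, min_confidence=0.5):
    """BFS over relationship rows, keeping only edges above a confidence floor."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for e in EDGES:
            if e["src"] == node and e["confidence"] >= min_confidence and e["dst"] not in seen:
                seen.add(e["dst"])
                frontier.append((e["dst"], depth + 1))
    return seen - {start}
```

With the confidence floor at 0.5, the low-confidence partnership edge is pruned and a two-hop walk from `acme` reaches only `kim` and `proj-x`.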
audit_log
Every access, modification, and processing event. Append-only, partitioned by date, with 90-day retention by default (configurable per organization). Essential for compliance.
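Date partitioning makes retention a cheap metadata operation: whole partitions are dropped rather than individual rows rewritten. A toy sweep, with day indexes and event shapes that are illustrative only:

```python
# Sketch of retention for a date-partitioned, append-only audit log.
# The 90-day default mirrors the text; partitions keyed by day index.

def expire_partitions(partitions, today, retention_days=90):
    """Drop whole date partitions older than the retention window."""
    return {day: events for day, events in partitions.items()
            if (today - day) < retention_days}

log = {1: ["read doc d1"], 50: ["update doc d2"], 120: ["read doc d3"]}
kept = expire_partitions(log, today=130)
```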
Embeddings in Tables, Not in a Vector Database
One of our most consequential design decisions was storing embeddings as columns within the chunks table rather than in a separate vector database. This seems counterintuitive -- vector databases are optimized for nearest neighbor search, while Iceberg is optimized for analytical queries. But the benefits of co-location outweigh the performance trade-off:
Atomic updates -- When a document is re-processed, its chunks, embeddings, and metadata update atomically within a single Iceberg transaction. No consistency issues between separate systems.
Unified filtering -- A query like "find chunks similar to X, written by author Y, from department Z, in the last 30 days" is a single scan with predicate pushdown. No post-filtering of vector results against a separate metadata store.
Cost efficiency -- One storage system instead of two. No data duplication between a vector database and a metadata store.
Simpler operations -- One system to back up, monitor, scale, and debug. Operational simplicity compounds over time.
The performance gap is bridged by our tiered architecture: ClickHouse serves hot queries with sub-100ms latency using materialized vector indexes, while Iceberg handles the cold tail with DuckDB's in-process query engine at 1-10 second latency. For most workloads, 90%+ of queries hit the hot tier.
Migration from Proprietary Systems
If you are currently using a vector database and considering migration, the path is straightforward: export your documents and metadata, re-ingest them into Iceberg tables, and validate query parity against the old system before cutting over.
For enterprises with large corpora (millions of documents), we offer a parallel migration mode where both systems run simultaneously while the new system catches up.
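Parallel mode lends itself to a simple cutover check: sample queries against both systems and compare result overlap. A hedged sketch; the overlap metric and the 0.95 threshold are illustrative, not Lakehouse42's actual acceptance criteria:

```python
# Toy cutover check for a parallel migration: the new system is ready when
# its top-k results sufficiently overlap the legacy system's on sampled queries.

def overlap(old_results, new_results, k=10):
    """Fraction of the legacy top-k also returned in the new system's top-k."""
    old_top, new_top = set(old_results[:k]), set(new_results[:k])
    return len(old_top & new_top) / max(len(old_top), 1)

def ready_to_cut_over(samples, threshold=0.95):
    """samples: list of (old_results, new_results) pairs from live queries."""
    return all(overlap(old, new) >= threshold for old, new in samples)
```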
The Broader Trend
Iceberg adoption is accelerating across the data industry. Snowflake, Databricks, Google BigQuery, and AWS all support Iceberg natively. The AI/ML ecosystem is following suit -- embedding stores, feature stores, and model registries are increasingly building on open table formats.
By choosing Iceberg now, you align with this industry direction. Your knowledge management infrastructure becomes part of your broader data platform rather than an isolated silo.
Conclusion
The choice of storage format is the most important architectural decision in a knowledge management platform. It determines your portability, governance capabilities, query flexibility, and long-term cost structure.
Apache Iceberg gives us open formats, schema evolution, time travel, fine-grained access control, and compatibility with the broader data ecosystem. These are not nice-to-haves -- they are requirements for any enterprise platform that expects to operate for years, not months.
At Lakehouse42, Iceberg is the foundation that everything else is built on. It is the reason we can offer true multi-tenancy, zero lock-in, and the flexibility to support query patterns that do not exist yet.
Interested in how Iceberg fits into your data architecture? Schedule a technical deep-dive with our engineering team. We will walk through the storage architecture, show you how to query your data independently of Lakehouse42, and discuss migration strategies for your specific environment.