When we designed the storage layer for Lakehouse42, we faced a critical architectural decision: use a proprietary database optimized for AI workloads, or build on an open table format. We chose Apache Iceberg, and that decision has become one of our strongest differentiators. This post explains why.
The Problem with Proprietary Knowledge Stores
Vector databases like Pinecone, Weaviate, and Qdrant -- the current generation of AI-native knowledge stores -- are impressive pieces of engineering. They deliver excellent query performance for their specific use case. But they share a set of structural limitations that become painful at enterprise scale:
Vendor lock-in -- Your embeddings, metadata, and indexes are stored in a proprietary format. Migrating to a different vendor means re-ingesting and re-embedding your entire corpus. For an enterprise with millions of documents, this is a months-long project.
Single query pattern -- Vector databases are optimized for approximate nearest neighbor search. If you also need full-text search, you add Elasticsearch. If you need analytics, you add a data warehouse. Each additional query pattern means another system, another data copy, and another integration to maintain.
Limited governance -- Most vector databases offer basic API key authentication and namespace-level isolation. They do not provide row-level security, column-level encryption, audit logging, or data lineage tracking. For regulated industries, this is a non-starter.
Opaque storage -- You cannot inspect, audit, or independently query the data stored in a proprietary vector database. If the vendor goes down or changes pricing, your data is effectively held hostage.
What Apache Iceberg Provides
Apache Iceberg is an open table format originally developed at Netflix and now an Apache Software Foundation top-level project. It defines how tabular data is stored in files (Parquet, ORC, Avro) on object storage (S3, GCS, Azure Blob, R2) with rich metadata for efficient query planning.
Open and Portable
Iceberg tables are just Parquet files on object storage with a metadata layer. Any engine that speaks Iceberg can read and write your data: Apache Spark, Apache Flink, Trino, DuckDB, Snowflake, Databricks, BigQuery, and dozens more.
This means your data outlives any single vendor or engine: the same files can serve ingestion, search, analytics, and independent ad hoc inspection without copies, exports, or re-embedding.
Schema Evolution
Real-world schemas change. You might add new metadata fields, change embedding dimensions (when you upgrade models), or add new entity types to your knowledge graph. Iceberg handles this gracefully: because columns are tracked by ID rather than by name or position, you can add, drop, rename, or reorder them without rewriting existing data files.
This is particularly important for knowledge platforms where the schema evolves as you add new extraction capabilities and data types.
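Iceberg's field-ID-based column resolution is what makes these additive changes safe. A minimal sketch of that idea, with hypothetical schema and helper names (the real format stores field IDs in Parquet metadata, not Python dicts):

```python
# Toy model of Iceberg-style schema evolution: columns are resolved by
# field ID, so files written under an older schema remain readable after
# a column is added. Schemas and row shapes here are illustrative.

OLD_SCHEMA = {1: "doc_id", 2: "content"}                    # field_id -> name
NEW_SCHEMA = {1: "doc_id", 2: "content", 3: "department"}   # column 3 added later

def read_row(raw, file_schema, table_schema):
    """Resolve columns by field ID; fields absent from old files read as None."""
    by_id = {fid: raw.get(name) for fid, name in file_schema.items()}
    return {name: by_id.get(fid) for fid, name in table_schema.items()}

# A row from a file written before the "department" column existed:
old_file_row = {"doc_id": "d1", "content": "hello"}
row = read_row(old_file_row, OLD_SCHEMA, NEW_SCHEMA)
```

Because the old file is never rewritten, the new column simply reads as null for historical rows.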
Time Travel and Versioning
Every write to an Iceberg table creates a new snapshot. Previous snapshots remain accessible, enabling rollback after a bad write, reproducible queries pinned to a specific snapshot, and audits of exactly what the data looked like at any point in the past.
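As a toy model of snapshot reads and rollback (the real mechanism tracks manifest files on object storage; this sketch just keeps row lists in memory):

```python
from dataclasses import dataclass, field

@dataclass
class SnapshotTable:
    """Append-only snapshot log mimicking Iceberg time travel (illustrative)."""
    snapshots: list = field(default_factory=list)

    def commit(self, rows):
        """Each write produces a new snapshot; prior ones stay readable."""
        base = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(base + list(rows))
        return len(self.snapshots) - 1  # snapshot id

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or time-travel to a specific one."""
        sid = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

    def rollback(self, snapshot_id):
        """Restore the table to an earlier snapshot after a bad write."""
        del self.snapshots[snapshot_id + 1:]

t = SnapshotTable()
s0 = t.commit([{"id": 1}])
s1 = t.commit([{"id": 2}])
```

Reading `t.read(s0)` after the second commit still returns only the first row, and `t.rollback(s0)` undoes the second write entirely.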
Partition Evolution
As your data grows, you need to change how it is partitioned for query efficiency. Iceberg supports partition evolution -- changing the partitioning scheme without rewriting existing data. Old data keeps its original partitioning; new data uses the new scheme. Queries transparently span both.
For knowledge management, this means you can start with a simple partition-by-organization scheme and later add date-based partitioning or content-type partitioning as query patterns evolve.
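The mechanism can be sketched as follows: each data file records the partition spec it was written under, and scan planning prunes files per spec, so old and new layouts coexist in one query. File paths, org names, and the month-based spec below are illustrative:

```python
# Toy partition evolution: files written under the original org-only spec
# (spec 0) and the evolved org+month spec (spec 1) coexist; a scan prunes
# both by the shared organization field without rewriting old data.

files = [
    {"spec_id": 0, "partition": ("acme",),           "path": "f0.parquet"},
    {"spec_id": 1, "partition": ("acme", "2024-05"), "path": "f1.parquet"},
    {"spec_id": 1, "partition": ("globex", "2024-05"), "path": "f2.parquet"},
]

def plan_scan(files, org):
    """Keep only files whose partition tuple can match the org predicate."""
    return [f["path"] for f in files if f["partition"][0] == org]

acme_files = plan_scan(files, "acme")  # transparently spans both specs
```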
Fine-Grained Access Control
With Apache Polaris (an open REST catalog for Iceberg), you get access control enforced down to the storage layer: role-based grants on catalogs, namespaces, and tables, with credential vending so each engine receives only short-lived, scoped credentials for the files it is allowed to touch.
This level of governance is table stakes for enterprise deployments and simply not available with most vector databases.
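As a toy illustration of table-level grants (this models the general idea of catalog RBAC, not the actual Apache Polaris data model; principals and table names are hypothetical):

```python
# Minimal sketch of role-based table grants: a principal may perform an
# action on a table only if an explicit grant exists. Illustrative only.

GRANTS = {
    ("analyst", "prod.chunks"):  {"SELECT"},
    ("pipeline", "prod.chunks"): {"SELECT", "INSERT"},
}

def authorize(principal, table, action):
    """Allow an action only when the (principal, table) grant includes it."""
    return action in GRANTS.get((principal, table), set())
```

The important property is default-deny: an ungranted principal gets nothing, rather than everything minus a blocklist.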
How Lakehouse42 Uses Iceberg
Our Iceberg schema has five core tables:
documents
The source-of-truth for all ingested content. Each row represents a single document with its metadata, processing status, and storage location. Partitioned by organization and ingestion date.
chunks
Document chunks with their embeddings stored as columns. The dense embedding is a 1024-element float array; the sparse embedding is stored as two parallel arrays (indices and values). This co-location means a single table scan can filter by metadata, score by BM25 (via the content column), and rank by vector similarity (via the embedding columns).
entities
Extracted entities with their types, properties, and source document references. Partitioned by organization and entity type. This table forms the node set of our knowledge graph.
relationships
Entity-to-entity relationships with typed edges and confidence scores. Combined with the entities table, this enables graph traversal queries without a separate graph database.
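A graph traversal over these two tables reduces to a breadth-first walk over relationship rows. A hedged sketch, with illustrative entity IDs, edge types, and confidence values:

```python
from collections import deque

# Toy traversal over relationship rows (src, dst, type, confidence) of the
# kind described above, without a separate graph database. Data is illustrative.

EDGES = [
    {"src": "acme", "dst": "kim",    "type": "EMPLOYS",       "confidence": 0.9},
    {"src": "kim",  "dst": "proj-x", "type": "LEADS",         "confidence": 0.8},
    {"src": "acme", "dst": "globex", "type": "PARTNERS_WITH", "confidence": 0.4},
]

def neighbors_within(start, hops, min_confidence=0.5):
    """BFS over relationship rows, keeping only edges above a confidence floor."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for e in EDGES:
            if e["src"] == node and e["confidence"] >= min_confidence and e["dst"] not in seen:
                seen.add(e["dst"])
                frontier.append((e["dst"], depth + 1))
    return seen - {start}
```

With the confidence floor at 0.5, the low-confidence partnership edge is pruned and a two-hop walk from `acme` reaches only `kim` and `proj-x`.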
audit_log
Every access, modification, and processing event. Append-only, partitioned by date, with 90-day retention by default (configurable per organization). Essential for compliance.
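Date partitioning makes retention a cheap metadata operation: whole partitions are dropped rather than individual rows rewritten. A toy sweep, with day indexes and event shapes that are illustrative only:

```python
# Sketch of retention for a date-partitioned, append-only audit log.
# The 90-day default mirrors the text; partitions keyed by day index.

def expire_partitions(partitions, today, retention_days=90):
    """Drop whole date partitions older than the retention window."""
    return {day: events for day, events in partitions.items()
            if (today - day) < retention_days}

log = {1: ["read doc d1"], 50: ["update doc d2"], 120: ["read doc d3"]}
kept = expire_partitions(log, today=130)
```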
Embeddings in Tables, Not in a Vector Database
One of our most consequential design decisions was storing embeddings as columns within the chunks table rather than in a separate vector database. This seems counterintuitive -- vector databases are optimized for nearest neighbor search, while Iceberg is optimized for analytical queries. But the benefits of co-location outweigh the performance trade-off:
Atomic updates -- When a document is re-processed, its chunks, embeddings, and metadata update atomically within a single Iceberg transaction. No consistency issues between separate systems.
Unified filtering -- A query like "find chunks similar to X, written by author Y, from department Z, in the last 30 days" is a single scan with predicate pushdown. No post-filtering of vector results against a separate metadata store.
Cost efficiency -- One storage system instead of two. No data duplication between a vector database and a metadata store.
Simpler operations -- One system to back up, monitor, scale, and debug. Operational simplicity compounds over time.
The performance gap is bridged by our tiered architecture: ClickHouse serves hot queries with sub-100ms latency using materialized vector indexes, while Iceberg handles the cold tail with DuckDB's in-process query engine at 1-10 second latency. For most workloads, 90%+ of queries hit the hot tier.
Migration from Proprietary Systems
If you are currently using a vector database and considering migration, the path is straightforward: export your documents and metadata, re-ingest them into Iceberg tables, and validate query parity against the old system before cutting over.
For enterprises with large corpora (millions of documents), we offer a parallel migration mode where both systems run simultaneously while the new system catches up.
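Parallel mode lends itself to a simple cutover check: sample queries against both systems and compare result overlap. A hedged sketch; the overlap metric and the 0.95 threshold are illustrative, not Lakehouse42's actual acceptance criteria:

```python
# Toy cutover check for a parallel migration: the new system is ready when
# its top-k results sufficiently overlap the legacy system's on sampled queries.

def overlap(old_results, new_results, k=10):
    """Fraction of the legacy top-k also returned in the new system's top-k."""
    old_top, new_top = set(old_results[:k]), set(new_results[:k])
    return len(old_top & new_top) / max(len(old_top), 1)

def ready_to_cut_over(samples, threshold=0.95):
    """samples: list of (old_results, new_results) pairs from live queries."""
    return all(overlap(old, new) >= threshold for old, new in samples)
```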
The Broader Trend
Iceberg adoption is accelerating across the data industry. Snowflake, Databricks, Google BigQuery, and AWS all support Iceberg natively. The AI/ML ecosystem is following suit -- embedding stores, feature stores, and model registries are increasingly building on open table formats.
By choosing Iceberg now, you align with this industry direction. Your knowledge management infrastructure becomes part of your broader data platform rather than an isolated silo.
Conclusion
The choice of storage format is the most important architectural decision in a knowledge management platform. It determines your portability, governance capabilities, query flexibility, and long-term cost structure.
Apache Iceberg gives us open formats, schema evolution, time travel, fine-grained access control, and compatibility with the broader data ecosystem. These are not nice-to-haves -- they are requirements for any enterprise platform that expects to operate for years, not months.
At Lakehouse42, Iceberg is the foundation that everything else is built on. It is the reason we can offer true multi-tenancy, zero lock-in, and the flexibility to support query patterns that do not exist yet.
Interested in how Iceberg fits into your data architecture? Schedule a technical deep-dive with our engineering team. We will walk through the storage architecture, show you how to query your data independently of Lakehouse42, and discuss migration strategies for your specific environment.