Document Sources Overview

LH42's connector framework enables you to sync documents from external sources into your unified knowledge base. This guide explains how connectors work and the architecture behind them.

How Connectors Work

The connector system follows a three-stage pipeline:

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  OAuth/Auth     │ ──▶ │  Sync Engine     │ ──▶ │  Iceberg        │
│  (credentials)  │     │  (fetch content) │     │  (storage)      │
└─────────────────┘     └──────────────────┘     └─────────────────┘

1. Authentication: Connect via OAuth 2.0 or API keys
2. Sync Engine: Fetch and transform content from the source
3. Storage: Index content into Apache Iceberg with embeddings

Available Connectors

Connector	Auth Type	Sync Type	Status
Google Drive	OAuth 2.0	Incremental	Available
Notion	OAuth 2.0	Incremental	Available
Confluence	OAuth 2.0	Incremental	Available
SharePoint	OAuth 2.0	Incremental	Available
Dropbox	OAuth 2.0	Incremental	Available
Box	OAuth 2.0	Incremental	Available
Slack	OAuth 2.0	Webhook	Available
GitHub	OAuth 2.0	Webhook	Available
Jira	OAuth 2.0	Incremental	Available

Authentication Flow

All OAuth-based connectors follow the same secure flow:

1. User clicks "Connect" in Settings → Integrations
2. Redirect to provider's OAuth consent screen
3. User grants permissions
4. Callback exchanges code for tokens
5. Tokens encrypted and stored securely
6. Connection ready for sync
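
Steps 2 and 4 of the flow above can be sketched with the standard library alone. This is a minimal illustration of the state-token check, not LH42's actual implementation: the endpoint, client ID, redirect URI, and the `session` dict are all hypothetical placeholders.

```python
import secrets
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical values; real endpoints and client IDs come from the provider.
AUTHORIZE_URL = "https://provider.example.com/oauth/authorize"
CLIENT_ID = "demo-client-id"
REDIRECT_URI = "https://app.example.com/oauth/callback"


def build_authorize_url(session: dict, scopes: list) -> str:
    """Step 2: build the consent-screen URL and remember the state token."""
    state = secrets.token_urlsafe(32)
    session["oauth_state"] = state
    query = urlencode({
        "response_type": "code",
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": " ".join(scopes),
        "state": state,
    })
    return f"{AUTHORIZE_URL}?{query}"


def handle_callback(session: dict, callback_url: str) -> str:
    """Step 4, first half: verify the state, then return the authorization code."""
    params = parse_qs(urlparse(callback_url).query)
    returned_state = params.get("state", [""])[0]
    expected_state = session.pop("oauth_state", "")  # single use
    matches = secrets.compare_digest(
        returned_state.encode(), expected_state.encode()
    )
    if not expected_state or not matches:
        raise ValueError("state mismatch: possible CSRF, aborting")
    return params["code"][0]
```

The returned code would then be exchanged for tokens at the provider's token endpoint, which is omitted here because it requires a network call and provider-specific parameters.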

OAuth Security

State tokens: CSRF protection on all OAuth flows
Encrypted storage: Fernet symmetric encryption for credentials
Token refresh: Automatic refresh before expiration
Scoped access: Minimal permissions requested
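
"Automatic refresh before expiration" usually means refreshing once a token enters a safety margin ahead of its expiry time. A minimal sketch of that check follows; the `StoredToken` shape and the 300-second margin are assumptions for illustration, not documented LH42 values.

```python
import time
from dataclasses import dataclass
from typing import Optional

REFRESH_MARGIN_SECONDS = 300  # assumed margin, not a documented LH42 setting


@dataclass
class StoredToken:
    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp at which the access token expires


def needs_refresh(token: StoredToken, now: Optional[float] = None) -> bool:
    """Return True once the token is inside the margin or already expired."""
    current = time.time() if now is None else now
    return current >= token.expires_at - REFRESH_MARGIN_SECONDS
```

A sync worker could call `needs_refresh` before each request and swap in a fresh token first, so a long-running sync is less likely to fail mid-run on an expired credential.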

Sync Engine

The sync engine retrieves content in one of three modes: full, incremental, or webhook-based.

Full Sync

First-time sync fetches all accessible content:

python

# Triggered on initial connection
client.connectors.sync(connector_id, mode="full")

Incremental Sync

Subsequent syncs only fetch changed content:

python

# Uses cursor/timestamp to detect changes
client.connectors.sync(connector_id, mode="incremental")
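
To illustrate the cursor idea, the sketch below selects only items modified after the last stored cursor and returns the cursor for the next run. The `SourceItem` type and the use of an integer timestamp are assumptions; each provider exposes its own change token or delta API.

```python
from dataclasses import dataclass


@dataclass
class SourceItem:
    item_id: str
    modified_at: int  # e.g. a Unix timestamp reported by the source


def select_changed(items: list, cursor: int) -> tuple:
    """Return the items modified after `cursor` and the cursor for the next run."""
    changed = [item for item in items if item.modified_at > cursor]
    # Keep the old cursor when nothing changed so it never moves backwards.
    next_cursor = max((item.modified_at for item in changed), default=cursor)
    return changed, next_cursor
```

One caveat of timestamp cursors: an item written with the same timestamp as the stored cursor after the previous run finished would be skipped by the strict `>` comparison, which is one reason providers often offer opaque change tokens instead.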

Webhook-Based Sync

Some connectors support real-time updates:

python

# Configure webhook for instant updates
client.connectors.configure_webhook(connector_id)

Content Processing Pipeline

Synced documents pass through four stages in order:

1. Content Extraction: Parse PDFs, Office docs, HTML, etc.
2. Chunking: Split into segments of 256-512 tokens
3. Embedding: Generate BGE-M3 dense + sparse vectors
4. Storage: Write to Iceberg tables with full metadata
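
The chunking stage can be pictured as fixed-size windowing over a token list. The sketch below caps chunks at 512 tokens. It assumes the text has already been tokenized and does not enforce the 256-token lower bound, so the final chunk may be shorter; the real pipeline's tokenizer and boundary rules are not specified in this guide.

```python
def chunk_tokens(tokens: list, max_tokens: int = 512) -> list:
    """Split a token list into consecutive chunks of at most `max_tokens`."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        tokens[start:start + max_tokens]
        for start in range(0, len(tokens), max_tokens)
    ]
```

For a quick experiment, `chunk_tokens(text.split())` treats whitespace-separated words as tokens, which only approximates what a model tokenizer would produce.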

Multi-Tenancy

Connectors respect organization boundaries:

Each organization has isolated Iceberg namespaces
Credentials stored per-organization
Sync state tracked per-connector per-organization
Documents tagged with source_connector for filtering
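
As a toy model of these boundaries, the sketch below keeps one document list per organization and filters on the `source_connector` tag. The `TenantStore` class is hypothetical and stands in for per-organization Iceberg namespaces. It shows only the isolation and filtering logic, not how LH42 stores data.

```python
from collections import defaultdict
from typing import Optional


class TenantStore:
    """In-memory stand-in for per-organization namespaces (illustrative only)."""

    def __init__(self) -> None:
        self._namespaces = defaultdict(list)

    def add(self, org_id: str, document: dict) -> None:
        self._namespaces[org_id].append(document)

    def query(self, org_id: str, source_connector: Optional[str] = None) -> list:
        # Only the caller's own namespace is ever read.
        documents = self._namespaces.get(org_id, [])
        if source_connector is None:
            return list(documents)
        return [d for d in documents if d.get("source_connector") == source_connector]
```

Because every read is keyed by `org_id`, a query for one organization cannot return another organization's documents, even when both use the same connector type.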

Sync Scheduling

Configure automatic sync intervals:

python

# Sync every 4 hours
client.connectors.update(connector_id, {
    "sync_schedule": "0 */4 * * *"  # Cron expression
})
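
To check what a schedule such as `0 */4 * * *` means, the helper below expands the minute and hour fields into the times of day at which a sync would start. It is a deliberately small sketch: it understands only `*`, `*/n`, and single numbers, and it requires the day, month, and weekday fields to be `*`. It is not the parser LH42 uses.

```python
def expand_cron_field(field: str, low: int, high: int) -> list:
    """Expand one cron field limited to `*`, `*/n`, or a single number."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(low, high + 1, step))
    value = int(field)
    if not low <= value <= high:
        raise ValueError(f"{value} is outside {low}-{high}")
    return [value]


def daily_run_times(expression: str) -> list:
    """List HH:MM start times for a five-field expression that runs every day."""
    fields = expression.split()
    if len(fields) != 5 or fields[2:] != ["*", "*", "*"]:
        raise ValueError("only daily schedules of the form 'M H * * *' are supported")
    minutes = expand_cron_field(fields[0], 0, 59)
    hours = expand_cron_field(fields[1], 0, 23)
    return [f"{hour:02d}:{minute:02d}" for hour in hours for minute in minutes]


print(daily_run_times("0 */4 * * *"))
# ['00:00', '04:00', '08:00', '12:00', '16:00', '20:00']
```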

Monitoring Sync Status

Track sync progress and health:

python

status = client.connectors.get_sync_status(connector_id)

print(f"Last sync: {status.last_sync}")
print(f"Items synced: {status.items_count}")
print(f"Status: {status.status}")  # pending, in_progress, completed, failed
print(f"Errors: {status.error_count}")
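
A common pattern on top of the status call is to poll until the sync reaches a terminal state. The helper below is a sketch under assumptions: it takes any zero-argument callable that returns one of the status strings listed above, and the poll interval and timeout defaults are arbitrary.

```python
import time

TERMINAL_STATES = {"completed", "failed"}


def wait_for_sync(get_status, poll_seconds=5.0, timeout_seconds=600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status()` until it returns a terminal state or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"sync still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

With the client shown above, the callable could be `lambda: client.connectors.get_sync_status(connector_id).status`. The `sleep` and `clock` parameters exist only so the loop can be tested without waiting.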

Sync History

View historical sync operations:

python

history = client.connectors.get_sync_history(connector_id, limit=10)

for entry in history:
    print(f"{entry.started_at}: {entry.status}")
    print(f"  Items: {entry.items_synced} synced, {entry.items_created} new")
    if entry.errors:
        print(f"  Errors: {entry.errors}")
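
History entries can also feed a simple health check. The sketch below counts how many of the most recent syncs failed in a row. It assumes the statuses are passed newest first, which this guide does not confirm for `get_sync_history`.

```python
def consecutive_failures(statuses: list) -> int:
    """Count leading 'failed' entries in a newest-first list of sync statuses."""
    count = 0
    for status in statuses:
        if status != "failed":
            break
        count += 1
    return count
```

For example, `consecutive_failures([entry.status for entry in history])` could trigger an alert once it reaches a threshold such as 3, rather than paging on every isolated failure.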

Error Handling

The sync engine responds to each class of failure differently:

Rate limiting: Exponential backoff with jitter
Transient failures: Automatic retry (up to 3 attempts)
Partial failures: Continue sync, log failed items
Token expiry: Automatic refresh and retry

Best Practices

Prefer incremental syncs: Full syncs are resource-intensive, so reserve them for the initial connection
Set reasonable schedules: Every 4-6 hours for most use cases
Monitor sync health: Set up alerts for failed syncs
Use webhooks: When available, for near-real-time updates
Filter content: Sync only relevant folders/spaces

Next Steps

Google Drive - Connect Google Drive
Notion - Connect Notion workspaces
Confluence - Connect Atlassian Confluence
SharePoint - Connect Microsoft SharePoint
Dropbox - Connect Dropbox
Box - Connect Box
Custom Connectors - Build your own