Document Sources Overview
LH42's connector framework enables you to sync documents from external sources into your unified knowledge base. This guide explains how connectors work and the architecture behind them.
How Connectors Work
The connector system follows a three-stage pipeline:
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   OAuth/Auth    │ ──▶ │   Sync Engine    │ ──▶ │     Iceberg     │
│  (credentials)  │     │ (fetch content)  │     │    (storage)    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
- Authentication: Connect via OAuth 2.0 or API keys
- Sync Engine: Fetch and transform content from the source
- Storage: Index content into Apache Iceberg with embeddings
Available Connectors
| Connector | Auth Type | Sync Type | Status |
|---|---|---|---|
| Google Drive | OAuth 2.0 | Incremental | Available |
| Notion | OAuth 2.0 | Incremental | Available |
| Confluence | OAuth 2.0 | Incremental | Available |
| SharePoint | OAuth 2.0 | Incremental | Available |
| Dropbox | OAuth 2.0 | Incremental | Available |
| Box | OAuth 2.0 | Incremental | Available |
| Slack | OAuth 2.0 | Webhook | Available |
| GitHub | OAuth 2.0 | Webhook | Available |
| Jira | OAuth 2.0 | Incremental | Available |
Authentication Flow
All OAuth-based connectors follow the same secure flow:
1. User clicks "Connect" in Settings → Integrations
2. Redirect to provider's OAuth consent screen
3. User grants permissions
4. Callback exchanges code for tokens
5. Tokens encrypted and stored securely
6. Connection ready for sync
OAuth Security
- State tokens: CSRF protection on all OAuth flows
- Encrypted storage: Fernet symmetric encryption for credentials
- Token refresh: Automatic refresh before expiration
- Scoped access: Minimal permissions requested
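The state-token check listed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not LH42's actual implementation: the in-memory store and the function names `begin_oauth` and `verify_callback` are hypothetical.

```python
import hmac
import secrets

# Hypothetical in-memory store mapping a session ID to its pending OAuth state.
# A real deployment would more likely use a server-side session or cache with an expiry.
_pending_states: dict[str, str] = {}


def begin_oauth(session_id: str) -> str:
    """Create an unguessable state token and remember it for this session."""
    state = secrets.token_urlsafe(32)
    _pending_states[session_id] = state
    return state  # sent to the provider as the `state` query parameter


def verify_callback(session_id: str, returned_state: str) -> bool:
    """Accept the callback only if the state matches; each state is single-use."""
    expected = _pending_states.pop(session_id, None)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(expected.encode(), returned_state.encode())
```

Removing the stored state on first use means a replayed callback is rejected even when it carries a valid token.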
Sync Engine
The sync engine retrieves content in one of three modes: full, incremental, or webhook-based.
Full Sync
First-time sync fetches all accessible content:
# Triggered on initial connection
client.connectors.sync(connector_id, mode="full")
Incremental Sync
Subsequent syncs only fetch changed content:
# Uses cursor/timestamp to detect changes
client.connectors.sync(connector_id, mode="incremental")
Webhook-Based Sync
Some connectors support real-time updates:
# Configure webhook for instant updates
client.connectors.configure_webhook(connector_id)
Content Processing Pipeline
When documents are synced, they go through:
- Content Extraction: Parse PDFs, Office docs, HTML, etc.
- Chunking: Split into optimal segments (256-512 tokens)
- Embedding: Generate BGE-M3 dense + sparse vectors
- Storage: Write to Iceberg tables with full metadata
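The chunking step can be illustrated with a small sketch. It assumes the text has already been tokenized into a list; the function name `chunk_tokens` and the rebalancing rule for a short final chunk are illustrative choices, not a description of LH42's actual chunker. Only the 256-512 token range comes from this guide.

```python
def chunk_tokens(
    tokens: list[str], max_tokens: int = 512, min_tokens: int = 256
) -> list[list[str]]:
    """Split tokens into consecutive chunks of at most max_tokens.

    If the final chunk is shorter than min_tokens, it is rebalanced with the
    previous chunk so both halves land inside the target range. This holds
    when 2 * min_tokens <= max_tokens, as with the defaults.
    """
    chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    if len(chunks) >= 2 and len(chunks[-1]) < min_tokens:
        tail = chunks[-2] + chunks[-1]
        mid = len(tail) // 2
        chunks[-2:] = [tail[:mid], tail[mid:]]
    return chunks


# Whitespace splitting stands in for a real tokenizer here (an assumption).
sample_tokens = ("lorem ipsum " * 300).split()  # 600 tokens
sample_chunks = chunk_tokens(sample_tokens)
```

A document shorter than `min_tokens` is kept as a single chunk, since there is nothing to rebalance it with.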
Multi-Tenancy
Connectors respect organization boundaries:
- Each organization has isolated Iceberg namespaces
- Credentials stored per-organization
- Sync state tracked per-connector per-organization
- Documents tagged with source_connector for filtering
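One way to picture per-connector, per-organization sync state is a store keyed by the pair (organization, connector). The sketch below is hypothetical: the class names, the `cursor` field, and the in-memory dictionary are assumptions for illustration, and only the `source_connector` tag name comes from this guide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SyncState:
    cursor: Optional[str] = None  # opaque change marker returned by the source
    items_synced: int = 0


@dataclass
class SyncStateStore:
    # Keyed by (organization_id, connector_id), so tenants never share state.
    _states: dict[tuple[str, str], SyncState] = field(default_factory=dict)

    def get(self, org_id: str, connector_id: str) -> SyncState:
        return self._states.setdefault((org_id, connector_id), SyncState())

    def advance(self, org_id: str, connector_id: str, new_cursor: str, count: int) -> None:
        """Record the cursor reached by a sync so the next run resumes from it."""
        state = self.get(org_id, connector_id)
        state.cursor = new_cursor
        state.items_synced += count


def tag_document(doc: dict, connector_id: str) -> dict:
    """Return a copy of the document tagged with its originating connector."""
    return {**doc, "source_connector": connector_id}
```

Because the organization is part of the key, two organizations using the same connector ID still keep separate cursors.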
Sync Scheduling
Configure automatic sync intervals:
# Sync every 4 hours
client.connectors.update(connector_id, {
    "sync_schedule": "0 */4 * * *"  # Cron expression
})
Monitoring Sync Status
Track sync progress and health:
status = client.connectors.get_sync_status(connector_id)
print(f"Last sync: {status.last_sync}")
print(f"Items synced: {status.items_count}")
print(f"Status: {status.status}") # pending, in_progress, completed, failed
print(f"Errors: {status.error_count}")
Sync History
View historical sync operations:
history = client.connectors.get_sync_history(connector_id, limit=10)
for entry in history:
    print(f"{entry.started_at}: {entry.status}")
    print(f" Items: {entry.items_synced} synced, {entry.items_created} new")
    if entry.errors:
        print(f" Errors: {entry.errors}")
Error Handling
The sync engine responds to each class of failure as follows:
- Rate limiting: Exponential backoff with jitter
- Transient failures: Automatic retry (up to 3 attempts)
- Partial failures: Continue sync, log failed items
- Token expiry: Automatic refresh and retry
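The retry behavior described above (exponential backoff with jitter, up to 3 attempts) can be sketched as follows. This is an illustration rather than the engine's actual code: `TransientError`, `with_retries`, and the delay parameters are hypothetical names and values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 429/503)."""


def with_retries(operation, max_attempts=3, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep):
    """Call operation(), retrying transient failures with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # "Full jitter": wait a random time up to the exponential ceiling,
            # so many clients retrying at once do not synchronize.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so tests can replace the real delay with a recorder. Non-transient exceptions propagate immediately without a retry.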
Best Practices
- Prefer incremental syncs: After the initial full sync, rely on incremental syncs, since full syncs are resource-intensive
- Set reasonable schedules: Every 4-6 hours for most use cases
- Monitor sync health: Set up alerts for failed syncs
- Use webhooks: When available, for near-real-time updates
- Filter content: Sync only relevant folders/spaces
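As a rough sketch of the "monitor sync health" practice, an alert check could be built on the status fields shown under Monitoring Sync Status. The `SyncStatus` dataclass below is a stand-in for whatever `get_sync_status` returns, and `needs_alert` with its `max_errors` threshold is a hypothetical helper, not part of the client.

```python
from dataclasses import dataclass


@dataclass
class SyncStatus:
    # Stand-in for the status object; field names follow this guide's example.
    status: str       # pending, in_progress, completed, failed
    error_count: int


def needs_alert(status: SyncStatus, max_errors: int = 0) -> bool:
    """Flag a sync that failed outright or logged more errors than tolerated."""
    return status.status == "failed" or status.error_count > max_errors
```

A scheduled job could call this for each connector and forward any positive result to the alerting channel of your choice.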
Next Steps
- Google Drive - Connect Google Drive
- Notion - Connect Notion workspaces
- Confluence - Connect Atlassian Confluence
- SharePoint - Connect Microsoft SharePoint
- Dropbox - Connect Dropbox
- Box - Connect Box
- Custom Connectors - Build your own