Under the hood

Architecture & internals

How all three layers work inside: the store, concurrency, branch selection, extraction, and the offline boundary.

Deep-dive companion to the README: how the core sync and the knowledge layer work under the hood.

Core sync internals#

Architecture#

The tool is built as a modular Python CLI application with the following components:

Configuration System#

The tool uses a hierarchical configuration system. Sources are listed from lowest to highest precedence:

  1. Default Values: Built-in defaults for all settings

  2. Configuration Files (using Python's configparser), each overriding the one before it:
     - Global config: ~/.contextlake.ini in home directory
     - Local config: .contextlake.ini in current directory
     - Custom config: Specified via --config CLI argument

  3. CLI Arguments: Override all other settings

Configuration loading flow: load_config() merges each layer over the one before it, so the most specific source wins.

Configuration precedence: built-in defaults, then the global ~/.contextlake.ini, then the local ./.contextlake.ini, then a --config custom file, then CLI flags; each layer overrides the one before it.
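The layering described above can be sketched with configparser, which lets later files override earlier ones. This is a minimal illustration, not the tool's actual load_config(): the [sync] section name, the timeout key, the default values, and the home/cwd parameters are assumptions made for the example.

```python
import configparser
from pathlib import Path

# Hypothetical built-in defaults; the real keys and values may differ.
DEFAULTS = {"max_workers": "4", "timeout": "300"}


def load_config(custom=None, cli_overrides=None, home=None, cwd=None):
    """Merge defaults < global file < local file < custom file < CLI flags."""
    home = Path(home) if home else Path.home()
    cwd = Path(cwd) if cwd else Path.cwd()

    parser = configparser.ConfigParser()
    parser.read_dict({"sync": DEFAULTS})  # lowest-precedence layer

    # read() skips files that do not exist; later files override earlier ones.
    layers = [home / ".contextlake.ini", cwd / ".contextlake.ini"]
    if custom:
        layers.append(Path(custom))
    parser.read(layers)

    settings = dict(parser["sync"])
    # CLI flags win, but only the ones the user actually passed.
    for key, value in (cli_overrides or {}).items():
        if value is not None:
            settings[key] = str(value)
    return settings
```

Passing None for a CLI value stands in for "flag not given", so an unset flag never masks a value from a config file.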

Core modules#

The sync core is plain Python (stdlib only). Its functions group by responsibility; each command in the Usage guide maps onto one group:

| Responsibility | What it does |
| --- | --- |
| Config | Load and merge INI files (local / global / custom), expand ~, resolve cache paths. |
| Discover | Fetch accessible GitLab projects (via glab, filtered by group prefix, archived dropped) and cache them; enumerate local .git repos. |
| Clone | Clone missing repos concurrently (ThreadPoolExecutor, max_workers), creating namespace parents, with per-op timeouts. |
| Update | Fetch + fast-forward each repo's current branch concurrently, handling detached HEAD. |
| Branches | Rank each repo's branches by git rev-list --count and switch to the most active (subject to branch safety). |
| Verify / status | Compare local vs GitLab, detect nested .git, report missing / extra / synced. |
| CLI | main() loads config, parses args (CLI overrides config), and dispatches to a command handler. |

Data flow#

Sync data flow: a contextlake command is parsed and dispatched, then fetches the accessible projects via glab, caches them, scans the workspace for local .git repos, compares GitLab vs local, runs git operations (clone/fetch/pull/switch), and logs a per-repo report.
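The scan-and-compare steps in this flow can be illustrated with a short sketch. The function names find_local_repos and compare are hypothetical, and the scan stops at the first .git it meets on each path, so unlike the real verify step it does not report nested repositories.

```python
import os
from pathlib import Path


def find_local_repos(workspace):
    """Return workspace-relative POSIX paths of directories holding a .git entry."""
    workspace = Path(workspace)
    repos = set()
    for root, dirs, files in os.walk(workspace):
        # .git is a directory in a normal clone and a file in a worktree/submodule.
        if ".git" in dirs or ".git" in files:
            repos.add(Path(root).relative_to(workspace).as_posix())
            dirs[:] = []  # prune: do not descend into a repository
    return repos


def compare(remote_paths, local_paths):
    """Classify repos as missing locally, extra locally, or present on both sides."""
    remote, local = set(remote_paths), set(local_paths)
    return {
        "missing": sorted(remote - local),
        "extra": sorted(local - remote),
        "synced": sorted(remote & local),
    }
```

The three buckets correspond to the missing / extra / synced report; a clone step would then act only on the "missing" list.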

Concurrency Model#

The tool uses Python's ThreadPoolExecutor for concurrent operations. Each worker operates independently with its own timeout.
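A minimal sketch of this model, assuming each job is a single subprocess command: the pool bounds how many commands run at once, and the timeout passed to subprocess.run applies to each command separately. The names run_with_timeout and run_all are illustrative, not the tool's own functions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_timeout(name, cmd, timeout):
    """Run one command; return (name, ok, detail) instead of raising on timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return name, False, f"timed out after {timeout}s"
    return name, proc.returncode == 0, proc.stderr.strip()


def run_all(jobs, max_workers=4, timeout=300):
    """jobs maps a repo name to its command; results are collected as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_with_timeout, name, cmd, timeout)
            for name, cmd in jobs.items()
        ]
        for future in as_completed(futures):
            name, ok, detail = future.result()
            results[name] = (ok, detail)
    return results
```

Threads suit this workload because each worker spends its time waiting on a git subprocess rather than running Python code, so a slow repository delays only its own slot.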

Error Handling#

The tool implements comprehensive error handling:

  1. Timeout Handling: All subprocess calls have explicit timeouts
  2. Exception Catching: All functions catch and report exceptions
  3. Status Reporting: Operations return status tuples for tracking
  4. Graceful Degradation: Failed operations don't stop the entire process
  5. Detailed Logging: All errors are logged with context
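The pattern in the list above can be sketched as a wrapper that turns any failure into a status tuple, so one bad repository never aborts the run. The tuple shape (name, status, detail) and the names guarded and summarize are assumptions for illustration; the tool's real tuples may carry different fields.

```python
import logging
import subprocess

log = logging.getLogger("contextlake.sketch")


def guarded(name, operation):
    """Run operation(); convert any failure into a (name, status, detail) tuple."""
    try:
        operation()
    except subprocess.TimeoutExpired as exc:
        log.error("%s: timed out after %ss", name, exc.timeout)
        return name, "timeout", str(exc)
    except Exception as exc:  # report and continue; one repo must not stop the run
        log.error("%s: %s", name, exc)
        return name, "error", str(exc)
    return name, "ok", ""


def summarize(results):
    """Count how many operations ended in each status."""
    counts = {}
    for _name, status, _detail in results:
        counts[status] = counts.get(status, 0) + 1
    return counts
```

Returning tuples instead of raising keeps the caller's loop simple: it collects every result, then prints a summary at the end.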

Cache Management#

The tool uses two cache files:

  1. /tmp/gitlab_projects.json
  2. /tmp/gitlab_projects.txt

The cache is refreshed by running the fetch command or sync (which includes fetch).
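As a rough sketch of how two such files could be kept in step, the example below writes the full project records as JSON and a plain list of paths as text. The file contents are an assumption: the source does not state what each file holds, and path_with_namespace is simply the field GitLab's API uses for a project's full path.

```python
import json
from pathlib import Path


def write_cache(projects, cache_dir="/tmp"):
    """Write project records as JSON plus a one-path-per-line text file."""
    cache_dir = Path(cache_dir)
    json_path = cache_dir / "gitlab_projects.json"
    txt_path = cache_dir / "gitlab_projects.txt"
    json_path.write_text(json.dumps(projects, indent=2), encoding="utf-8")
    paths = sorted(p["path_with_namespace"] for p in projects)
    txt_path.write_text("\n".join(paths) + "\n", encoding="utf-8")
    return json_path, txt_path


def read_cached_paths(cache_dir="/tmp"):
    """Read the text cache back into a list of project paths."""
    txt_path = Path(cache_dir) / "gitlab_projects.txt"
    return txt_path.read_text(encoding="utf-8").splitlines()
```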

Branch Selection Algorithm#

To identify the most active branch the tool:

  1. Fetches all branches with git for-each-ref (collecting each branch's last commit date)
  2. Calculates commit count per branch via git rev-list --count origin/branch
  3. Scores and ranks branches according to branch_strategy
  4. Switches to the top branch (and pulls) if it differs from the current one

The branch_strategy setting controls step 3.

Branch switching is skipped entirely for repositories checked out on a working branch (see Branch Safety).
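The scoring and switching decision (steps 3 and 4) can be sketched as a pure function over data already collected from git. The strategy names "commits" and "recent" are hypothetical stand-ins, since the actual branch_strategy values are not listed here, and the input tuples are assumed to come from git for-each-ref and git rev-list --count.

```python
from datetime import datetime


def rank_branches(branches, strategy="commits"):
    """Rank (name, commit_count, last_commit) tuples, most active first.

    "commits" and "recent" are illustrative strategy names, not the tool's own.
    """
    if strategy == "commits":
        key = lambda b: (b[1], b[2])  # commit count, ties broken by recency
    elif strategy == "recent":
        key = lambda b: (b[2], b[1])  # recency, ties broken by commit count
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [b[0] for b in sorted(branches, key=key, reverse=True)]


def pick_branch(branches, current, strategy="commits"):
    """Return the branch to switch to, or None when no switch is needed."""
    ranked = rank_branches(branches, strategy)
    if not ranked or ranked[0] == current:
        return None
    return ranked[0]


branches = [
    ("main", 120, datetime(2024, 1, 1)),
    ("develop", 300, datetime(2023, 6, 1)),
    ("feature/x", 5, datetime(2024, 3, 1)),
]
```

Keeping the ranking free of git calls makes it easy to test; a caller would apply the branch-safety check before acting on the result of pick_branch.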

Directory Structure Mapping#

The tool maintains GitLab's exact directory structure:

GitLab Path: your-gitlab-group/backend/services/api-gateway
Local Path:  backend/services/api-gateway

GitLab Path: your-gitlab-group/backend/pricing/quote-engine
Local Path:  backend/pricing/quote-engine

GitLab Path: your-gitlab-group/frontend/platform/ui-toolkit
Local Path:  frontend/platform/ui-toolkit

The your-gitlab-group/ prefix is stripped when creating local paths.
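The mapping shown above amounts to removing the leading group component from the GitLab path. A minimal sketch, with local_path_for as a hypothetical helper name and the choice to leave non-matching paths untouched being an assumption of this example:

```python
from pathlib import PurePosixPath


def local_path_for(gitlab_path, group_prefix):
    """Strip the group prefix from a GitLab project path."""
    path = PurePosixPath(gitlab_path)
    prefix = PurePosixPath(group_prefix)
    try:
        return str(path.relative_to(prefix))
    except ValueError:
        # Not under the group: returned unchanged (a choice made for this sketch).
        return str(path)
```

Comparing whole path components, rather than using a string prefix test, avoids stripping part of a sibling group whose name merely starts with the same characters.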

Performance Characteristics#

Typical Performance Metrics (based on a large workspace of several hundred repositories):

Performance Optimization Tips:

  1. Run fetch less frequently if the repository list doesn't change often
  2. Use update for frequent syncs (faster than a full sync)
  3. Run branches only when branch management is needed
  4. Adjust the ThreadPoolExecutor worker count (max_workers) based on system resources

Security Considerations#

  1. Authentication: Uses glab authentication (tokens, SSH keys, etc.)
  2. HTTPS Cloning: Default cloning method uses HTTPS for better compatibility
  3. No Credential Storage: Does not store credentials; relies on glab auth
  4. Local Operations: Git operations run through the local git client against your configured remotes; the only GitLab API calls are those made by the initial fetch
  5. File Permissions: Respects existing file permissions; creates directories with default umask

Extension Points#

The tool can be extended by:

  1. Adding New Commands: Add a new function and register it in the main() command dispatch
  2. Custom Branch Selection: Modify switch_repository_branch() algorithm
  3. Additional Verification: Add checks in verify_structure()
  4. Custom Output Formats: Modify logging functions
  5. Integration Hooks: Add pre/post operation hooks
  6. Configuration File: Add support for .contextlake.yml config
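To illustrate the first extension point, here is one way a command dispatch could be organized so that adding a command means writing a handler and registering it. This is a sketch under assumptions: the COMMANDS table, the handler signatures, and the --workspace flag are hypothetical, while --config is the flag mentioned earlier in this document.

```python
import argparse


def cmd_status(args):
    """Placeholder handler; a real one would compare local repos with GitLab."""
    return f"status for {args.workspace}"


# Hypothetical registry: command name -> handler function.
COMMANDS = {"status": cmd_status}


def build_parser():
    parser = argparse.ArgumentParser(prog="contextlake")
    parser.add_argument("--config", help="path to a custom config file")
    parser.add_argument("--workspace", default=".")  # illustrative flag
    sub = parser.add_subparsers(dest="command", required=True)
    for name in COMMANDS:
        sub.add_parser(name)
    return parser


def main(argv=None):
    """Parse arguments and dispatch to the registered handler."""
    args = build_parser().parse_args(argv)
    return COMMANDS[args.command](args)
```

With this shape, a new command is one function plus one registry entry, for example COMMANDS["stats"] = cmd_stats, and main() needs no further changes.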

Dependencies#

Git Integration#

To avoid committing sensitive configuration to version control, add the configuration file to your .gitignore:

# Add to .gitignore
.contextlake.ini
# The global ~/.contextlake.ini lives outside the repository and needs no entry

For team usage, consider including a sample configuration file:

# Add .contextlake.ini.example to git
cp .contextlake.ini .contextlake.ini.example
git add .contextlake.ini.example

# Update .gitignore
echo ".contextlake.ini" >> .gitignore

Team members can then:

cp .contextlake.ini.example .contextlake.ini
# Edit with their personal settings

Knowledge-layer architecture#

The optional contextlake.kb subsystem (the [kb] extra) layers a knowledge graph over the mirrored repos. Its pieces:

Storage & invariants#

Everything contextlake generates lives under one store directory (default ~/.contextlake/kb, store_dir in kb.toml), never scattered into your home, your cwd, or your repos. Two invariants make this safe by construction, each locked by a test:

Under the store: index.sqlite (graph + FTS), graph/ (per-repo JSON shards), history/<repo>/ (bitemporal snapshots), graphs/ (rendered visualizations), wiki/ (LLM pages), embeddings.sqlite (vectors). The one deliberate carve-out is steering files (AGENTS.md, .mcp.json, skills), which an IDE must find at the workspace root, so steer --out writes them to the target you point it at (never inside a synced repo). Full detail: storage.md.
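One way to picture the "everything under one store directory" rule is a path helper that refuses to resolve anything outside the store. This is only a sketch and is not taken from the project's own invariants or tests; store_path is a hypothetical name, and the default mirrors the ~/.contextlake/kb location given above.

```python
from pathlib import Path

DEFAULT_STORE = Path("~/.contextlake/kb")


def store_path(*parts, store_dir=DEFAULT_STORE):
    """Resolve a path inside the store, rejecting anything that escapes it."""
    root = Path(store_dir).expanduser().resolve()
    target = root.joinpath(*parts).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"{target} escapes the store {root}")
    return target
```

For example, store_path("history", "repo-a") resolves to a directory under the store, while store_path("..", "elsewhere") raises ValueError.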

Next steps