Changelog
Release history for contextlake.
All notable changes to contextlake will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]#
[2.10.0] - 2026-06-28#
Added#
- Incremental
embed.embednow re-embeds only repos whose indexed HEAD has moved since they were last embedded (tracked per-repo in the vector store), so a scheduled embed over a large fleet stays cheap, likeindexalready is.--forcere-embeds everything; a partial--limitrun never updates the gate. .contextlakeignore, drop one at a repo's root to exclude your own paths from indexing (one glob per line;*.lockignores by name anywhere,vendor/prunes a directory). A small, dependency-free subset of gitignore syntax; ignored files are counted and reported, never silently dropped.
Changed#
- Colorful output now reaches
statusandfetch.statusprints a right-aligned, glyph-coded summary (✓synchronized,⚠missing/extra), andfetchstyles its header and final count, matching the existing coloured per-repo output ofclone/update/branches. Still plain andNO_COLOR-friendly when not a TTY.
[2.9.1] - 2026-06-26#
Changed#
- README overhaul (this also fixes the instruction shown on PyPI): corrected the primary install
to
pip install contextlake(the oldpip install .only works from a clone), led with the value prop, a real graph screenshot, the Pebble mascot, and a branded "How it works" architecture diagram, and tightened the prose. Images are committed PNG/JPG with absolute URLs so the README renders identically on GitHub and PyPI (no SVG-only assets). Removed em-dashes across the prose docs.
[2.9.0] - 2026-06-26#
Added#
- Graph readability overhaul, the dense-graph pain points are fixed. Three long-standing
complaints addressed in the shared visualizer (
graph --serve,--site, and every embedded graph): - Zoom floor, "fit" no longer shrinks a big graph into unreadable specks. A clamp keeps any fit at or above a readable zoom (≥0.45); below that it snaps to the floor and re-centres, so you always land somewhere scannable instead of scrolling in 5–10 times.
- Level-of-detail labels, dense graphs no longer pile their text into an illegible smear. Below a readable zoom only the higher-degree hubs keep their labels (degree-gated by zoom tier); hovering or selecting any node always reveals its label, and search/highlight are unaffected.
- Semantic cluster zoom (namespace overview), zoom into a region and the on-screen namespace clusters expand into their repos; zoom back out and they collapse. A hysteresis gap prevents flapping, and the zoom path never re-frames, so it can't feed back on itself.
- Minimap, a custom radar (bottom-right, no new dependency) showing every visible node; click or drag to recentre the main view. Tracks filters and cluster expand/collapse live.
- On-canvas legend key, the node legend now shows each kind's actual glyph (the same icon the node paints), plus a collapsible key for edge-confidence line styles and per-language repo lettermarks, so the iconography is self-explanatory. All still offline/self-contained.
Changed#
- Captured docstrings + signatures now feed the wiki and
get_repo_brief.repo_brief's top symbols carry theirdoc+signature, so the LLM-wiki is synthesized from real docstrings (not just symbol names) andget_repo_briefreturns them per symbol, closing the capture→consume loop for the doc/signature feature (richer, better-grounded wikis and repo anatomy). build_vector_storeandSqliteStore.searchno longer fall back silently. A sqlite-vec load failure now warns that search dropped to brute force; a searchOperationalErroris logged (DEBUG for an expected malformed-FTS query, WARNING for a real DB problem) instead of always returning[].- Deduplicated HTTP/util helpers (
_ollama_reachable,_post_json,_chunks), previously copied across the llm/ and embeddings/ providers and the connector, into one stdlib-onlykb/_util. No behaviour change.
Fixed#
- Safety gate now fails closed on an indeterminate git state.
has_uncommitted_changesand the branch/HEAD reads in the sync core swallowed errors and returned a permissive default, so a failed, timed-out, or non-repo git call read as "clean / safe to modify" or "no change", silently mis-driving the destructive update/stash/merge they guard. They now check return codes + add timeouts and treat any unknown state as unsafe;_rev_parseand_collect_branch_inforaise on a git failure instead of returning an empty string that misreads the update. bootstrapandembed/wiki/connectnow exit non-zero on failure.bootstrapignored every stage's result and always reported success; the three commands returned0even when every repo or source in a non-empty work set failed (embedder/LLM/connector unreachable → zero output, CI green on a broken knowledge layer).bootstrapnow propagates stage failures (and hard-aborts if the foundational index stage fails); the commands return non-zero on total failure.
Security#
.dockerignorenow excludes the gitignored local config/secret files (.gitlab_sync.ini,.contextlake.ini,.contextlake.kb.toml,.genericity-denylist) so a localdocker build .can't bake them into an image. The published image is unaffected (built from a clean checkout).
[2.8.0] - 2026-06-26#
Added#
- Definitions now capture their docstring + signature (on node
attrs:doc,signature), surfaced through the MCPNodeOut(get_node/find_definition/ neighbors etc. now returndoc+signature), so an agent gets a function's purpose and parameters in one call. This is also the additive groundwork for body-aware embeddings, thenode_text()change that would feed bodies to the embedder stays gated on the eval harness (quality measured, not assumed). Best-effort and multi-language: signatures across py/js/ts/c#, and docstrings from Python first-statement strings, JSDoc (/** */), and C# XML (///) leading doc-comments (plain comments are ignored).
[2.7.0] - 2026-06-26#
Added#
- MCP:
repo_event_flow(repo, direction, limit), repo→repo event flow (who publishes events that whom consumes), from the topic two-hop (publishes_event ⨝ consumes_event). Completes the cross-repo flow trio alongsiderepo_dependencies(package) andrepo_flow(HTTP); the SQL already existed (used by the overview) but had no dedicated tool. - MCP:
get_readme(repo), the repo's own README read straight from its local clone (offline). Ground truth (the maintainers' words), distinct from the advisory synthesizedget_wikiprose. - MCP:
get_repo_brief(repo), a repo's "anatomy" from its indexed graph: node/edge counts, kind + language breakdown, top symbols by connectivity, packages, and a file sample. - MCP:
list_repos(include_stats), the repo fleet with per-repo branch, indexed head, last-index time, and node count, the dashboard's repository list. - MCP:
get_repo_links(repo), a repo's cross-links to Jira / Confluence / Figma / GitLab (url, title, status), grouped by relation. Populated byconnect; served offline afterward. - MCP:
graph_health(), knowledge-graph health as data (stale repos + dangling edges, with a sample) for the dashboard's health panel;lint's logic is now a reusablelint_result().
[2.6.0] - 2026-06-26#
Security#
- Genericity guard hardened, the leak-detector no longer leaks. The org-token denylist used to be
hardcoded in the test file (shipping real org identifiers in the published package). It now lives
outside the repo, supplied via the
CONTEXTLAKE_GENERICITY_DENYLISTenv var or a git-ignored.genericity-denylistfile (CI uses a secret), so no real token is ever committed. The scan also now covers every git-tracked file (not a fixed list), and an always-on structural check rejects any non-allowlisted email address even when no denylist is configured. - Removed deployment-scale figures from docs. Genericized specific fleet counts (the example
statusoutput, the overview-feature notes) to illustrative values, so nothing in the published repo is tied to any particular deployment's repository count. - Test-locked the offline boundary (INV-2). A new test blocks all outbound sockets and asserts the
core commands (
index/query/graph/lint/embed) still run, whileconnectdegrades rather than fails, proving contextlake is safe in air-gapped/egress-restricted environments, with enrichment the single opt-in online step. Documented indocs/storage.md.
Added#
evalnow scores any retriever and reports a cost dimension. Retrievers are built by factories (make_fts_retriever/make_semantic_retriever/make_hybrid_retriever) that close over their deps, so semantic and hybrid are scorable, not just FTS (the old fixed call site couldn't pass a vector store + embedder). The harness now also reports estimated tokens per query and precision per 1k tokens, making "route to the cheapest sufficient source" measurable, andeval --retriever fts|semantic|hybridselects which to score. Ships a seed golden set atexamples/fixtures/golden-queries.json.
[2.5.1] - 2026-06-26#
Fixed#
- README logo now renders on PyPI. The header glyph used a repo-relative
src, which PyPI can't resolve (it doesn't host the repo files), so it showed as a broken image on the project page. Pointed it at the absoluteraw.githubusercontent.comURL (correctimage/svg+xmlcontent-type, verified through PyPI's ownreadme_renderer). Badges were already absolute.
Changed#
- Docs reconciled with the shipped MCP surface.
docs/knowledge-layer.mdnow lists the cross-repo tools (repo_dependencies,repo_flow,blast_radius,get_wiki) alongside the existing graph tools, and the README command table documentseval(the golden-query retrieval-quality harness).
[2.5.0] - 2026-06-26#
Added#
[kb-full]one-step install for local semantic search,pip install "contextlake[kb-full]"pulls the knowledge layer + the built-in CPU embedder (kb-local) + the sqlite-vec ANN backend (kb-vec) together, soindex → embed → semantic searchjust works with no Ollama and no API key.- Repo nodes show their primary language, the fleet's tech stack at a glance. In the overview,
each repo node now carries a lettermark (
PY,JS,TS,C#, …) for its dominant language (a single GROUP-BY over data the parser already records), so an architecture map reads its stack without clicking in. Trademark-free white-on-navy lettermarks, inlined offline; unknown languages keep the generic repo glyph. - Architectural edges are now labelled, flows read like a C4 diagram. Dependency / flow edges
(
depends_on,calls_http,exposes,flow,publishes,publishes_event,consumes_event) carry an autorotated label of the relation plus its context where meaningful (depends_on · requests,calls_http · /v1/orders, the event topic). Structural edges (calls/contains/imports) stay unlabelled so the hundreds of them don't bury the diagram in text. - Graph nodes now carry type glyphs, the first step toward architecture diagrams. Every node is
painted with a Lucide-style icon for its kind (file, class, function, package, repo, HTTP endpoint,
event topic, …) so a graph reads by type at a glance instead of by colour alone. Glyphs are inlined
as percent-encoded SVG
data:URIs (no CDN, no sprite fetch, the page stays a single offline file), and each glyph's stroke colour is chosen per node fill at build time (white on the darkreponode, dark on the lightmodulenode) so it never washes out. Flow nodes (endpoint/topic) joined the palette + legend. --sitenow renders the LLM-wiki as cross-linked pages. Each repo with a generated wiki gets awiki-<slug>.html(the index links it, the page links back to the graph), rendered by a tiny dependency-free Markdown→HTML converter (HTML-escaped, the wiki is untrusted LLM output), carrying the same fresh/stale badge asget_wiki. Stays fully offline, zero new deps.- MCP:
get_wiki(repo), serve the LLM-wiki to agents (with a staleness signal). The generated wiki was written to<store>/wiki/but read by nothing; now an agent can fetch a repo's wiki prose (sanitised Markdown), explicitly labelled advisory (verify against cited sources; never outranks EXTRACTED facts) and carryingstale, true when the wiki'shead_commitdiffers from the repo's current indexed head, so prose describing changed code is never cited as current. - MCP:
blast_radius(node_id, hops), "what could break if I change this". Bounded transitive reverse reach over incomingcalls+depends_onedges (configurable), breadth-first, capped byhopsandlimit. Each hit carries its hop distance, the relation, and confidence (EXTRACTED-first,truncatedwhen capped), an impact slice for agents, made correct by the AMBIGUOUS-edge change below so the hottest symbols aren't missed.
Changed#
embed's "disabled" message is now actionable. Instead of the dead-end "Embeddings are disabled", it names the exact next step, installcontextlake[kb-full](when the embedder is missing) and/or set[embeddings] enabled = true, and notes the one-time ~30 MB model download, so the post-bootstrap"Build semantic vectors" stage no longer silently goes nowhere.- Documented and test-locked the no-pollution invariant (INV-1).
docs/storage.mdnow states that every generated artifact lives under the store (~/.contextlake/kbby default) and never inside a synced repo working tree, andtests/kb/test_no_repo_pollution.pyenforces it by driving the generating commands over a temp two-repo mirror and asserting each repo tree is byte-identical. doctornow probes ANN (sqlite-vec) availability. When embeddings are enabled it reports whether the native sqlite-vec KNN index actually loads in this environment, or whether semantic search will fall back to brute-force cosine, so the silent fallback (a known offline/corporate-env failure mode) is visible before you embed, not after.- Wiki generation is now incremental (skip-if-unchanged).
contextlake wikiskips the (expensive) LLM call for any repo whose existing page was already generated from its current head commit, so a no-op fleet re-run drops from O(repos × LLM calls) to ~0.--forceregenerates regardless; the summary reports how many were skipped. - Ambiguous calls are no longer silently dropped. When a call name resolves to 2–6 candidate
definitions, indexing now emits an
AMBIGUOUScallsedge to each candidate (de-duplicated, self-calls excluded) instead of discarding the call, so the hottest symbols aren't lost and blast-radius isn't undercounted. Names matching more than the cap are too generic to be signal and are still skipped. AMBIGUOUS edges render dotted in the visualizer.
Added#
contextlake eval --golden FILE.json, a retrieval-quality harness. Score a labelledquery → expected-nodesset against the index and get precision@k / recall@k / MRR / hit-rate (aggregate + per-query), over any retriever (FTS today; semantic/hybrid pluggable). Makes retrieval changes (embed-bodies, reranking, a futureaskrouter) falsifiable instead of vibes. Stdlib-only; the golden set is plain JSON,matchby node id or name.- Event/messaging flow extraction (Kafka/MSK, SNS, EventBridge). Indexing now detects, per file,
the message topics a repo publishes to and consumes from (literal topics in Kafka
producer/
@KafkaListener/subscribe, EventBridgeDetailType, SNS), asINFERREDedges to a sharedtopicnode. A two-hop join (publishes_event ⨝ consumes_event) yields directionalpublisher --flow--> consumerrepo edges, the direction an event travels, shown in the fleet overview alongside HTTPflowand structuraldepends_on. High-precision (literal topics only); config-variable topics are an honest undercount, never a false link. Re-runindexto populate.
[2.4.0] - 2026-06-25#
Added#
- MCP: repo-level architecture tools
repo_dependencies/repo_flow. Surface the cross-repo wedge to AI agents:repo_dependencies(repo, direction)returns the package two-hop (dependent → publisher, weighted),repo_flow(repo, direction)returns the HTTP endpoint two-hop (caller → exposer, weighted), both INFERRED, weight-ranked, with "undercount, verify" guidance. Previously these edges fed only the visualizer. contextlake graph --site DIR, a cross-linked offline graph site. Emitsindex.html+overview.html+ onerepo-<slug>.htmlper repo with a parsed graph, sharing a singlecytoscape.min.js/app.css/app.js(referenced, not inlined, so the folder stays small). Overview repo nodes link to their repo page (and the inspector gains an "Open this repo's graph →" button); every page has an Index/Overview nav. Fully offline. Scope it with--repos PATTERN(comma-separated glob/substring) to build pages for only a subset of repos.contextlake graph --overview --servenow serves the whole site live, rendering each repo page on demand from the store instead of materialising the fleet up front, so online serving never inlines hundreds of MB. Shared assets are served once (browser-cached);/neighborskeeps click-to-expand inside a repo view.- HTTP/REST flow extraction (the first true cross-repo flow signal). Indexing now detects, per
file, the HTTP endpoints a repo exposes (ASP.NET / Express / FastAPI·Flask routes) and calls
(HttpClient / axios·fetch / requests·httpx), as
INFERREDedges to a sharedendpointnode keyed by a normalised path. A two-hop join (exposes ⨝ calls_http) yields directionalcaller --flow--> exposerrepo edges, which the fleet overview now renders alongside structuraldepends_on(distinct colour, aggregated per namespace). Path matching is deliberately conservative (host/query stripped, params →{}, trivially-generic paths dropped) so unrelated repos don't falsely link. Re-runindex/bootstrapto populate. Event/messaging flow (SNS/SQS/EventBridge/Kafka) is the next slice.
Changed#
- MCP: result budgeting on
get_neighbors/find_callers/find_dependents. They now take alimit(default 50), order EXTRACTED-first, and return{..., total, truncated}instead of an unbounded list, so a hub node can't silently blow up an agent's context, and a clipped result announces itself. - Generated graphs now default to a dedicated
<store>/graphs/directory instead of the current working directory,graphHTML output and--siteland next to the knowledge base, not wherever the command happened to run. Pass--output/--site DIRto override.
[2.3.0] - 2026-06-24#
Added#
- Two interlocking overview views, a
Namespacemindmap and aDependenciesgraph. The fleet overview now has a mode toggle over one graph. Namespace (default) collapses the whole repo fleet into its top-level GitLab namespaces (sized by repo count), with aggregated, weight-labelled namespace→namespace dependency edges; tapping a namespace expands its repos in place as a compact mindmap branch (the rest dims to spotlight it) and tapping again collapses, every repo stays placed and searchable. Dependencies lays the connected repos out as readable hub-and-spoke clusters. Both modes share selection, search, and the inspector. - Inspector lists a node's relationships, each neighbour clickable to navigate to it (in-view hop-to-hop). Tapping a node/edge reframes the canvas onto the selection so it stays legible.
Changed#
- Graph visualizer reworked into an enterprise app shell. The floating translucent cards are
replaced by a real layout, a top bar (brand, mode, search), a collapsible left sidebar (view
controls + Nodes/Relationships legends with live counts), the graph filling the centre, a slide-in
right inspector, and a status bar, on a CSS grid with a tokenised design system. Adds a dark
mode (Deepwater theme; re-skins the canvas, not just the chrome), icon-button controls, empty/
loading states, keyboard shortcuts (
/search,ffit,ttheme,Escclear), and focus-visible rings. Still one self-contained offline HTML, zero new dependencies. - Fleet overview now shows real cross-repo dependencies. Repointed from the raw cross-repo
importsjoin (≈4,800 import-star artifacts from fleet-widemodulenodes) to the package two-hop (publishes ⨝ depends_on), 217 trustworthy, manifest-deriveddepends_onedges, markedINFERRED(a deliberate, honest undercount). Repos are labelled by short name (the full path moves to the inspector + search) so nodes are distinguishable. - Graph-visualizer CSS/JS extracted into
static/app.css+static/app.js(inlined at emit time like the vendored cytoscape), so the source is lint/node --check-able. Output is still one self-contained offline HTML.
Fixed#
- Truncation is now visible in the UI. A bounded subgraph that was clipped used to read as complete; a persistent status-bar banner now says "showing N of M, truncated" (honest counts only).
- Overview readability. Isolated/no-dependency repos, typically the bulk of a large fleet, no longer scatter the connected map into an unreadable speck, they're hidden by default behind a toggle (and revealed by search), and the layout frames the meaningful core. Expanding a namespace no longer triggers a disorienting global re-layout (scoped, position-stable).
- Canvas now reflows/reframes correctly when the inspector or sidebar opens (was leaving the old
zoom/pan). Dark-mode faded opacity and
prefers-reduced-motiongating for JS animations.
[2.2.0] - 2026-06-23#
Added#
-
Post-sync repo audit (
contextlake audit, also auto-runs aftersync/bootstrap). Scans every local clone and reports which repos are effectively empty, empty (no commits / no files), readme-only (just a template README), or boilerplate (only meta files like LICENSE/.gitignore) , plus age/activity: each repo's creation date (GitLabcreated_at, captured during fetch; falls back to the first git commit) and last commit date (from the local clone). Prints an aggregate summary (counts, oldest/newest, how many stale >1y/>2y, repos with no commits) and writes a full per-repo report as JSON + CSV (--report PATH, default<cache_dir>/repo_audit.json). The scan is parallel, read-only, and works offline;--no-auditskips the automatic run. Zero new dependencies. -
contextlake graph, visualize the knowledge graph. Extracts a bounded subgraph (the full graph is far too large to draw) and renders it to an interactive, offline-first HTML page (vendored cytoscape.js, inlined, no network needed;--cdnfor a small online file), or todot/mermaid/json. Seed from a symbol (--node/--name+--kind/--search), a single repo (--repo), or the whole fleet (--overview= repos-as-nodes with aggregated cross-repo edges, the architecture map). Scoping knobs--hops/--max-nodes/--max-fanout/--relation/--directionkeep hub nodes from exploding (truncation is always logged). The HTML is a full mini-explorer: nodes coloured by kind and sized by degree; edge labels hidden until a node is selected; clickable edges with an inspector (relation, a confidence trust indicator, the sourcefile:lineprovenance with copy, context and weight), edges are coloured by relation, styled by confidence, and sized by weight, with a relationship legend that filters by relation; a node search box, a detail panel (kind / repo / qualified-name / file:line), a clickable legend that filters by kind, hover tooltips, a switchable layout (cose/concentric/breadthfirst/circle/grid, default via--layout), and a toolbar (fit / reset / save-PNG), all wrapped in the contextlake brand (inlined lake glyph, wordmark, palette, frosted material cards).--openlaunches the browser;--serveruns a local UI with click-to-expand. Adds zero required Python dependencies. -
Resilient project enumeration behind slow/corporate DNS (e.g. Zscaler). When
GITLAB_TOKEN(aread_apitoken) is set,fetch/sync/bootstrapenumerate a group's projects via contextlake's own GitLab REST client instead of theglabCLI. TheglabCLI imposes a short Go dial timeout that a multi-second corporate DNS lookup trips on every call; the native client uses the system resolver's more generous budget, so enumeration completes whereglabfails. Without a token it transparently falls back toglab(its own auth). Configurable viagitlab_token_env,gitlab_host, andnetwork_timeout; the per-page fetch now retries with backoff on transient errors. Additionally, childgitoperations get a widened per-process DNS budget (RES_OPTIONS=timeout:15 attempts:3, root-free, tunable viadns_timeout/dns_attempts, and skipped if you already setRES_OPTIONS) so slow lookups don't surface asi/o timeout.
[2.1.6] - 2026-06-23#
Fixed#
- Quadratic indexing slowdown at scale (the real fix for "indexing got slower the more repos I
had"). Each node was refreshed in the full-text index with a per-row
DELETE FROM node_fts WHERE node_id = ?; because the FTS5 table has no index onnode_id, every one of those scanned the entire, ever-growing global FTS table, so persisting a repo cost O(repo_nodes × total_store_nodes) and the 600th repo took minutes. Now done with one set-based delete + batchedexecutemanyinserts. Re-indexing a repo into a 23k-node store dropped from 6.5s to 0.11s (≈59×) and is now flat regardless of store size; the FTS contents are byte-for-byte identical.
Added#
- Parallel repository indexing.
contextlake index --workspace(andbootstrap) now parse repositories across worker processes (CPU-bound work), persisting to SQLite serially from the parent. Defaults tocpu_count - 1(capped at 8); tune with[kb] index_workers(set1to force serial). Uses thespawnstart method on every platform for identical behaviour on Linux, macOS and Windows, and falls back to serial automatically if a worker pool cannot start. With the quadratic fix above in place, a full warm re-index of a 33-repo subtree dropped from ~8.8s (serial) to ~3.1s (8 workers, ≈2.9×); the parse speedup grows with both repo count and core count.
Changed#
- Indexing skips generated/derived files and oversized blobs (configurable, logged). The code
graph no longer indexes machine-generated files (
*.designer.cs,*.min.js,AssemblyInfo.cs,@generated/<auto-generated>headers, …) or code files larger thanmax_file_bytes(5 MB default), derived noise that bloats the graph and slows legacy monorepos. Both are reported (no silent gaps) and tunable via[kb] skip_generated/[kb] max_file_bytes. The source the generated files derive from is still indexed, so there's no knowledge loss. On a real 3,230-file legacy repo this dropped ~26% of files / 4k generated nodes (22.5s → 16.6s).
[2.1.5] - 2026-06-23#
Added#
- Built-in, zero-config CPU models for the knowledge base, no Ollama and no API key.
The embeddings and wiki tiers now accept
provider = "auto"(the new default), which uses a reachable local Ollama, else an in-process built-in model, else skips. The built-in embedder ships two engines, model2vec (potion-base-8M, ~30MB, default;pip install "contextlake[kb-local]") and fastembed (ONNXbge-small;[kb-fastembed]), and the built-in wiki LLM runs a smallQwen2.5-0.5B-InstructGGUF viallama-cpp-python([llm-local]). Models auto-download once to~/.contextlake/modelson first use (honoringREQUESTS_CA_BUNDLE/SSL_CERT_FILEbehind a TLS proxy) and load lazily.doctorreports model presence. A new guard refuses to mix embedder models/dimensions in one vector store. - Container image on GitHub Container Registry (
ghcr.io/sayak-sarkar/contextlake), published by the release workflow. It bundles the[kb]+ built-in model extras and pre-downloaded models, sodocker run … contextlake bootstrapworks with zero config / offline.
[2.1.4] - 2026-06-22#
Changed#
bootstrap's "knowledge layer not installed" message is now actionable. It prints the exact Python interpreter in use and flags the common cause, running the bare./contextlake.py(system Python) while the[kb]extra was installed into a virtualenv, with the precise install command for that interpreter and the venv alternative (./.venv/bin/contextlake bootstrap).
[2.1.3] - 2026-06-22#
Changed#
- Sync is far more resilient to flaky networks and moved branches.
updateandbranchesnow retry transient proxy/network drops (e.g.unexpected eof,connection reset) with backoff instead of failing on the first hiccup. Pulls are fast-forward only: a branch that has diverged from origin is reported as a cleanDiverged …, skipped (manual reconcile)(the tool never merges or rebases, and git's multi-line "divergent branches" hint no longer leaks into the output), and a deleted upstream branch is reported asUpstream branch deletedinstead of a fatal error. Net effect: transient blips self-heal, and the remaining "errors" are real and few.
[2.1.2] - 2026-06-22#
Added#
- The release workflow now also publishes a GitHub Release on each
vX.Y.Ztag, with notes pulled from this changelog and the built sdist + wheel attached.
Changed#
- Adopt the SPDX
license = "MIT"form (PEP 639) and drop the deprecatedLicense ::classifier, silences the setuptools deprecation warnings emitted during the build. Building from source now needssetuptools >= 77.
[2.1.1] - 2026-06-22#
Added#
- Maintainer release runbook at
docs/releasing.md(versioning → tag → build → publish to PyPI, with first-token and TLS-proxy troubleshooting) and areleaseextra (pip install -e ".[release]") bundlingbuild+twine. - Automated PyPI publishing via
.github/workflows/release.yml: pushing avX.Y.Ztag verifies the tag matches the package version, runs lint + core tests, builds, and publishes using PyPI Trusted Publishing (OIDC), no stored API token.
[2.1.0] - 2026-06-22#
Added#
- Cleaner terminal output: the timestamp moves to the right edge. On an interactive
terminal each line now shows the message on the left with a dim
HH:MM:SSclock flushed to the right edge, re-flowed to the live terminal width and dropped automatically when a line is too long to fit (never wraps or misaligns). Alignment is ANSI- and wide-character aware, so it lines up uniformly across terminals. Piped/redirected output and the rotating log file keep the full[YYYY-MM-DD HH:MM:SS]prefix unchanged, so the audit trail is untouched.
Changed#
- Branch name alone no longer causes an
updateto be skipped. A repo with a clean working tree is now fetched and fast-forwarded on whatever branch it is checked out on, feature branches included. The only thing that blocks anupdateis a dirty working tree (uncommitted/unstaged/untracked changes), which is still skipped (or stashed with--auto-stash).protect_working_branchesnow applies only to thebranchescommand, where it keeps a repo from being switched off a non-safe branch. Previously a clean repo on any branch outsidesafe_brancheswas skipped outright.
[2.0.1] - 2026-06-22#
Changed#
- Clearer config-not-found warning. When
gitlab_groupis still the placeholder, the warning now lists the exact files searched (absolute paths, with[found]/[absent]) and notes that local.contextlake.iniis read from the current directory, so a config placed next to the example in the repo but run from elsewhere is no longer a silent miss.
[2.0.0] - 2026-06-22#
Changed#
- Renamed the project
gitlab-sync→contextlake. The tool grew from a GitLab mirror into a local context layer for AI tools, and the name now reflects that. This is a rename only, no behavior changes. - The command, Python package, and PyPI project are now
contextlake(contextlake <command>,python -m contextlake,python3 contextlake.py). - A deprecated
gitlab-synccommand alias is kept so existing installs and scripts keep working; it will be removed in a future major release. - Existing config keeps working. The former
~/.gitlab_sync.ini/.gitlab_sync.ini(and the[gitlab_sync]section) and the~/.gitlab-sync/knowledge store are still read; new installs use~/.contextlake.iniand~/.contextlake/. An already-built index at~/.gitlab-sync/kbis reused as-is, no re-index needed. - The MCP server is now named
contextlake-kb, andsteerwritescontextlakeinto the files it generates (.mcp.json,AGENTS.md, …).
Note#
- The GitHub repository and CI-badge URLs point at
.../contextlake; they resolve once the repository is renamed on GitHub (the old URL auto-redirects).
[1.18.1] - 2026-06-22#
Changed#
- Confirmed the mascot's name, Pebble the otter, in
BRANDING.mdand the mascot spec.
[1.18.0] - 2026-06-22#
Added#
- Brand identity,
contextlake. ABRANDING.mdguide establishes the project's name, voice, color palette (cool lake teals + a warm "spark" of fresh context), open-source typography, logo, and otter mascot. Hand-authored SVG assets live indocs/branding/(glyph.svg,wordmark.svg) alongside a mascot spec (mascot.md). The name says what the tool does, a local lake of real context for your AI, and stays source-agnostic so the brand survives growth beyond GitLab. This is the brand kit only; the package/command rename is a separate, later step.
[1.17.1] - 2026-06-22#
Changed#
- The genericity guard (no-org-data check) now scans the whole published surface
,
docs/,examples/,.github/, and every top-level doc, not justsrc/and a handful of root files, with a regression test pinningdocs/coverage. (tests/stays excluded: the guard itself contains the denylist tokens by design.)
[1.17.0] - 2026-06-22#
Changed#
- Documentation refactored for readability. The README is now a lean ~180-line
landing page (down from ~1,300); detailed command, configuration, branch-safety,
and scheduling docs live in
docs/usage.md, and the knowledge layer indocs/knowledge-layer.md. Standardized examples on thegitlab-synccommand, clarified thestatusoutput (what "Missing"/"Extra" mean), and removed the repetitive install/security prose.
[1.16.0] - 2026-06-22#
Added#
- A "Commands at a glance" reference table in the README covering all 17
commands, and
docs/internals.md, a deep-dive on the core-sync internals plus a new knowledge-layer architecture section.
Changed#
- Slimmed the README (~1,320 → ~860 lines): the deep Technical Documentation /
architecture moved into
docs/internals.md, and the inline version history now points toCHANGELOG.md. Fleshed out thelintanddoctordocs. (Docs only, no code or layout changes, which already follow standard src-layout conventions.)
[1.15.0] - 2026-06-22#
Added#
- QUICKSTART.md, a short install →
bootstrap→ wire-your-editor guide.
Changed#
steernow enhances existing files instead of skipping them: an existingAGENTS.md/CLAUDE.md/.windsurfrules/.kiro/steeringkeeps the user's content and gets a clearly-delimited managed block appended (only that block is refreshed on re-runs);.mcp.jsonis merged; a same-named skill file is kept; custom layers like.devin/are never touched. Nothing the user wrote is deleted.
[1.14.0] - 2026-06-22#
Added#
- GitLab knowledge connector: links each repo to its open merge requests and
issues (read through the authenticated
glab), on the same connector seam as Atlassian/Figma. Configure with[[sources]] type = "gitlab"(optionalgroup); it needs no association rules. The command runner is injectable, so the mapping is unit-tested without GitLab. - Scheduling recipe:
bootstrapis incremental and branch-safe, so it doubles as a refresh job, documented cron + systemd-timer examples (examples/gitlab-sync.service,examples/gitlab-sync.timer) keep the mirror and knowledge layer always-fresh without disturbing in-progress work.
[1.13.0] - 2026-06-22#
Added#
- Agent skills/workflows library:
steernow also installs a built-in, generic library of operating skills (investigate-root-cause, plan-before-coding, surgical-change, review-before-landing, ship-safely, use-knowledge-graph) into the workspace in the formats local tools read, Claude Code skills (.claude/skills/) and Windsurf workflows (.windsurf/workflows/), so even a small-context model has a strong operating playbook. Managed/idempotent like the other steering files.
[1.12.0] - 2026-06-22#
Added#
bootstrapcommand, one-command turnkey setup that chains mirror → index → connect → embed → wiki → steer, skipping unconfigured/disabled stages and never aborting on a single stage's failure. Takes--kb-config(separate from the sync INI) and--no-sync/--no-embed/--no-wiki/--no-connecttoggles, so a teammate goes from nothing to a fully-wired workspace in one step.
[1.11.0] - 2026-06-22#
Added#
- Steering-layer generation (
steercommand): writes workspace-specific steering files so local AI tools pick up the knowledge graph natively,AGENTS.md(overview + knowledge tools + guardrails), a thinCLAUDE.mdthat imports it,.windsurfrules,.kiro/steering/, and a merged.mcp.jsonentry for the MCP server. Content is grounded in the indexed repos/languages/ dependencies; it only overwrites files it manages (or with--force).
[1.10.0] - 2026-06-21#
Added#
- OpenAI-compatible providers for the embeddings and wiki tiers: set
provider = "openai"to use any OpenAI-compatible API, a hosted key or a local server (LM Studio, Jan, llama.cpp, vLLM), as an alternative to local Ollama. The API key is read from an env var named byapi_key_env(never stored in config); servers that need no key work with it unset. - MCP integration docs: a README section showing how to use
gitlab-sync serveas an MCP server from Claude Code and Windsurf/Devin (the graph tools need no model; only semantic search needs embeddings).
[1.9.1] - 2026-06-21#
Fixed#
serveover the stdio transport wrote human-facing log lines to stdout, which is the MCP JSON-RPC channel, corrupting the protocol stream (clients saw spurious parse errors). On stdio, logs now go to stderr.
Changed#
index --workspaceis quieter by default: the per-repo "parsed/resolved" detail is now debug-level (show it with-v), leaving the clean per-repo progress bar.- Added a
ROADMAP.mdlisting future good-to-haves.
[1.9.0] - 2026-06-21#
Added#
- Curated wiki tier (
wikicommand): a pluggable, local-first LLM client (Ollama) synthesizes a provenance-stamped Markdown page per repo, grounded strictly in graph facts, and an LLM verification council (accuracy / completeness / clarity reviewers + a chairman threshold) gates what gets written. Off unless[llm] enabled = true. index --watch(--interval): keep re-indexing the workspace incrementally on an interval (Ctrl-C to stop) for a long-running refresh.- Bi-temporal queries: each indexed shard is snapshotted by commit, and
query --repo R --as-of <commit>searches repoRas it was at a previously indexed commit (time-travel) without a schema overhaul.
[1.8.0] - 2026-06-21#
Added#
- Incremental workspace indexing:
index --workspacenow re-indexes only the repos whose git HEAD moved since their last index (skipping unchanged ones), with--forceto rebuild everything. Paired with cron this gives scheduled incremental refresh. lintcommand for the knowledge layer: reports graph-health issues, repos gone stale (HEAD moved since index) and dangling edges (an endpoint node missing from the store).- Colorful CLI: status glyphs, coloured per-repo lines, and a progress bar for
the sync and knowledge-layer commands. Honors
NO_COLOR/FORCE_COLORand falls back to plain text off a TTY (pipes, cron, and logs stay clean). No new dependencies.
[1.7.0] - 2026-06-21#
Added#
- Hybrid retrieval (
hybrid_searchMCP tool): seeds Personalized PageRank with the embedding hits and propagates relevance across the graph (HippoRAG-style), so structurally-related nodes (callers, dependents) surface even when their text does not match the query. PPR runs over a BFS-bounded subgraph to stay tractable. - Optional sqlite-vec ANN backend for the vector store, selectable via
[embeddings] vector_backend(auto|sqlite-vec|brute).autouses sqlite-vec when thegitlab-sync[kb-vec]extra is installed and falls back to the exact pure-Python cosine scan otherwise, same interface either way.
[1.6.0] - 2026-06-21#
Added#
- Semantic-search tier (optional, local-first): a pluggable embeddings
provider (
Embedderinterface + config-driven factory; a stdlib-only Ollama provider ships first), a local SQLite-backed vector store with cosine search, anembedcommand that vectorizes indexed nodes, and asemantic_searchMCP tool exposed byservewhen embeddings are enabled. Off by default;doctorreports embeddings status.
[1.5.0] - 2026-06-21#
Added#
- Figma knowledge connector: links repos to the design files they reference,
classifying
figma.comURLs (file/design/proto/board) to a stable file key and taking the human file name from the URL slug. When a Figma MCP is configured each design is additionally checked for reachability (best-effort, never required). Runs alongside Atlassian sources underconnect. Connector-agnostic helpers were extracted to a shared module so new connectors stay small.
Fixed#
link_scrapeassociation rules expressed as apatternslist (as in the example config) were silently ignored; both a singularpatternand apatternslist are now honored.
[1.4.0] - 2026-06-21#
Adds an optional knowledge layer (gitlab_sync.kb, the [kb] extra,
Python ≥ 3.10) that turns the mirrored repositories into a queryable knowledge
graph served to AI agents over MCP. The core sync tool is unchanged and the extra
is entirely opt-in. Everything is generic and config-driven, no
organization-specific data lives in the package.
Added#
- Knowledge-graph store and CLI:
index,query,serve, anddoctorcommands backed by a SQLite + FTS5 cross-repo index with per-repo JSON shards. Every node/edge is provenance-stamped (source file + verified date) and confidence-tagged (EXTRACTED/INFERRED/AMBIGUOUS). - Code graph via tree-sitter for Python, JavaScript, TypeScript/TSX, and C#:
files, classes, functions/methods, interfaces, imports, containment, and an
intra-repo call graph (the parser registry is pluggable).
index --workspaceindexes every git repository under a directory. - Cross-repo dependency graph from
pyproject.toml,package.json, and*.csprojmanifests through shared package nodes. - MCP server (stdio or streamable-http) exposing
search_code,find_definition,find_callers,find_dependents,get_neighbors,shortest_path, andgraph_stats, plus akb://statsresource. All output is sanitized before it reaches an agent. - Knowledge connectors (
connect): an Atlassian connector links each repo to the Jira issues and Confluence pages it references. Candidate issue keys (from branch/commit names) are confirmed and enriched against live sites with a single batched JQL call (unverified false-positives are dropped); Atlassian URLs in docs are classified into issue/page links. One or more sites are supported, each independently authenticated over MCP. Output is stored in an isolated graph partition so code re-indexing never disturbs external links. - Config (
examples/kb.toml.example→~/.gitlab-sync/kb.toml): store location, languages, knowledge sources, and association rules, all organization-specific facts live here, never in the package. - CI now runs a separate knowledge-layer job (Python 3.10-3.13) alongside the core job, including a genericity guard that fails the build if organization data appears in the source.
[1.3.0] - 2026-06-21#
This release stabilizes the core and makes the tool installable. It repairs several regressions introduced by the earlier modularization and fixes a critical configuration bug.
Fixed#
- Critical: repositories were keyed by their full
<group>/...path while local clones mirror the tree below the group, so every repo was misreported as missing-and-extra and a sync would clone duplicates into a bogus<group>/subtree. Paths are now mapped to their group-relative local form (the full path is retained forglabauthentication). - Critical: a
~(or$VAR) in a config-filework_dir/cache_dirwas treated literally, so the tool operated on a non-existent path and saw zero local repositories. Path values are now expanded. - Critical: boolean config settings (
protect_working_branches,require_clean_workspace,clean_corrupted,adaptive_workers,auto_stash) were silently overridden by CLI defaults on every run, which disabled branch protection and the clean-workspace requirement by default. Flags now default to "unset" so config-file values are honoured. --configwas accepted but ignored; the explicit config path is now loaded.- Config precedence corrected to: explicit
--config> local > global > defaults (previously global silently overrode local). AdaptiveWorkerPoolraisedAttributeError/ never actually resized the pool; it now initializes correctly and parallelism adapts to the live error rate.- Retry/backoff existed but was never wired in; clone now retries transient failures (network/timeout) and fails fast on DNS/TLS.
updatereported failedgit pull(conflicts, auth, network) as "Already up to date"; it now distinguishes updated / unchanged / error by comparing HEAD before and after.loadsilently discarded a list-shaped JSON cache; it is now normalized.fetchused a malformedglabinvocation; it now calls the GitLab API with a URL-encoded group path and correct pagination, and restores thepath|ssh|http|default_branch|archivedtext cache.verifyrecovers nested-repository (repo-inside-repo) detection.- Corrupted (non-git) target directories are detected and re-cloned again
(honouring
--clean-corrupted); cloning prefersglabfor authentication.
Added#
- Installable package with a
gitlab-syncconsole entry point,python -m gitlab_sync, and the barepython3 gitlab_sync.pyscript (src layout). --dry-runto preview clone/update/branch actions without changing anything.- Logging via the standard library with
-v/--verbose,-q/--quiet, and--log-file(rotating audit log). clone_method(auto|glab|git) andbranch_strategy(commits|recency|hybrid) configuration; the most-active-branch heuristic is now recency-aware.--versionflag.- A pytest test suite (68 tests) with fakes for
git/glab, and GitHub Actions CI running ruff + pytest on Python 3.9-3.14.
Changed#
- Code modularized into a
gitlab_syncpackage:cli.py,core.py,config.py,safety.py,logging_setup.py.
[1.2.0] - 2026-06-16#
Added#
- Branch safety checks to protect working branches from sync conflicts
- Workspace protection requiring clean workspace before operations
- Automatic stashing support for uncommitted changes
- Configurable safe branches list
- CLI arguments for branch safety control:
- --protect-working-branches / --no-protect-working-branches
- --safe-branches
- --require-clean-workspace / --no-require-clean-workspace
- --auto-stash / --no-auto-stash
- Enhanced error classification for better retry strategies
- Adaptive worker pool for dynamic parallelism
- Comprehensive branch safety documentation in README
Changed#
- Updated README with branch safety section including scenarios and examples
[1.1.0] - 2026-05-24#
Added#
- INI-based configuration file support
- Local and global config file support
- CLI arguments now override config file settings
- Improved security with externalized configuration
- Tilde expansion for home directory paths
- Configurable timeouts and worker counts
- Exponential backoff retry mechanism
- Adaptive worker pool for dynamic parallelism
- Enhanced error classification for better retry strategies
Changed#
- Removed all hardcoded company/personal identifiers
- Configuration files can be excluded from version control
[1.0.0] - 2026-05-10#
Added#
- Full synchronization pipeline
- Branch management with automatic active branch detection
- Structure verification
- Concurrent processing with ThreadPoolExecutor
- Error handling and timeout management
- Timestamped logging
