2026-05-20

cc-canary: a Local Drift Detector for Claude Code Sessions

For Claude Code users, this repository is interesting because it turns a vague feeling (“my agent seems to be getting worse”) into something measurable. I think that’s a genuinely useful direction: instead of relying on vibes, it mines the session logs Claude Code already writes locally and surfaces drift, regressions, and behavioral shifts over time.

Key Points

cc-canary is a pair of installable Agent Skills for Claude Code.
It reads Claude Code session JSONL logs from ~/.claude/projects/ and analyzes them for drift in your own work.
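To make the log-reading step concrete, here is a minimal sketch of walking a directory tree of JSONL session files. This is not code from the repository: iter_records is a hypothetical helper of mine, and it assumes only that each non-empty line holds one JSON object, skipping anything that fails to parse.

```python
import json
from pathlib import Path


def iter_records(root):
    """Yield (path, record) for every parseable JSON line under root."""
    # Sort the files so repeated runs visit them in a stable order.
    for path in sorted(Path(root).expanduser().rglob("*.jsonl")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    yield path, json.loads(line)
                except json.JSONDecodeError:
                    # Skip truncated or partially written lines.
                    continue
```

Calling iter_records("~/.claude/projects") would then stream every record from every session file without loading whole files into memory.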
The project emphasizes privacy: no network calls, no account, no telemetry, no background daemon.
It is explicitly marked 0.x / pre-alpha, so the metric set and output format may still change.
Two skills are provided:
- cc-canary → outputs a forensic markdown report
- cc-canary-html → outputs the same report as a dark-theme HTML dashboard that auto-opens in a browser when possible
Default analysis window is 60d, with support for 7d, 14d, 30d, 60d, 90d, and 180d.
Reports include:
- a verdict such as HOLDING, SUSPECTED REGRESSION, CONFIRMED REGRESSION, or INCONCLUSIVE
- headline metrics with pre/post comparisons and colored band verdicts
- weekly trend bars for cost, read:edit ratio, reasoning loops, and tokens per turn
- cross-version comparisons controlling for task mix
- an auto-detected inflection date
- appendices with more behavioral signals
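I don’t know how cc-canary actually picks its inflection date, but the general idea can be sketched with a very simple change-point heuristic: try every split of a metric series and keep the one where the mean before and the mean after differ most. The function name find_inflection, the min_segment parameter, and the method itself are my assumptions for illustration only.

```python
from statistics import mean


def find_inflection(values, min_segment=2):
    """Return (index, gap) for the split that maximizes the mean shift.

    index is the position of the first "after" value, or None when the
    series is too short or completely flat.
    """
    best_index, best_gap = None, 0.0
    # Require at least min_segment points on each side of the split.
    for i in range(min_segment, len(values) - min_segment + 1):
        gap = abs(mean(values[i:]) - mean(values[:i]))
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index, best_gap
```

With weekly values such as [1, 1, 1, 1, 5, 5, 5, 5], the best split is at index 4, where the series jumps. A real detector would also need some test of whether the gap is large relative to noise, which this sketch leaves out.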
Install both skills with npx skills add delta-hq/cc-canary, or install only one of the two.
The script is Python-based, uses only the standard library, and requires python3 >= 3.8.
It deduplicates assistant messages using (message.id, requestId), because Claude Code can write the same message into multiple JSONLs when sessions are resumed or branched.
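The deduplication rule is easy to picture in code. The sketch below is my own illustration, not the repository’s implementation: it assumes each record is a dict whose "type" is "assistant", with the message ID nested under "message" and a top-level "requestId". Only the (message.id, requestId) key comes from the project description.

```python
def dedupe_assistant_messages(records):
    """Keep the first occurrence of each (message.id, requestId) pair."""
    seen = set()
    unique = []
    for rec in records:
        # Assumed field name and value for assistant turns.
        if rec.get("type") != "assistant":
            continue
        key = ((rec.get("message") or {}).get("id"), rec.get("requestId"))
        if None in key:
            # Without a full key we cannot prove it is a duplicate, so keep it.
            unique.append(rec)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique
```

Without a step like this, a resumed or branched session would count the same assistant turn twice and inflate every per-turn metric.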
It computes metrics such as:
- read:edit ratio
- write share of mutations
- reasoning loops
- frustration rate
- thinking redaction rate
- mean thinking length
- API turns per user turn
- tokens per user turn
It also tracks extra signals like premature stopping, self-admitted errors, user interrupts, hour-of-day thinking depth, word-frequency shifts, and thinking-visibility transitions.
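As a rough illustration of how two of these metrics could be derived from tool-call counts, here is a minimal sketch. The tool-name groupings and both formulas (reads per mutation, and writes as a fraction of all mutations) are my assumptions; the project may define them differently.

```python
from collections import Counter

# Assumed groupings of tool names, for illustration only.
READ_TOOLS = {"Read", "Grep", "Glob"}
EDIT_TOOLS = {"Edit", "MultiEdit"}
WRITE_TOOLS = {"Write"}


def mutation_metrics(tool_names):
    """Compute read:edit ratio and write share from a list of tool names."""
    counts = Counter(tool_names)
    reads = sum(counts[t] for t in READ_TOOLS)
    edits = sum(counts[t] for t in EDIT_TOOLS)
    writes = sum(counts[t] for t in WRITE_TOOLS)
    mutations = edits + writes
    if mutations == 0:
        # Avoid dividing by zero when nothing was changed.
        return {"read_edit_ratio": None, "write_share": None}
    return {
        "read_edit_ratio": reads / mutations,
        "write_share": writes / mutations,
    }
```

A falling read:edit ratio would suggest the agent is changing files after looking at less context, and a rising write share would suggest it is replacing whole files rather than making targeted edits.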
Report generation is split into two phases: first a skeleton is pre-rendered with the tables and charts already filled in, then Claude writes the narrative into a limited set of marked slots.
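The appeal of this split is that the model can only touch the marked slots, never the numbers. A minimal sketch of the second phase follows, assuming a hypothetical slot marker of the form <!-- slot:name -->; the real marker syntax and the fill_slots helper are not taken from the repository.

```python
import re

# Hypothetical marker syntax: <!-- slot:summary -->
SLOT = re.compile(r"<!-- slot:(\w+) -->")


def fill_slots(skeleton, narrative):
    """Replace known slot markers with narrative text, leaving the rest as-is."""
    def replace(match):
        name = match.group(1)
        # Unknown slots stay visible so a missing narrative is easy to spot.
        return narrative.get(name, match.group(0))

    # A function replacement avoids backslash escapes being interpreted.
    return SLOT.sub(replace, skeleton)
```

Because the pre-rendered tables sit outside the markers, nothing the narrative step produces can alter a computed value, and any narrative key without a matching slot is simply ignored.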
The repository is MIT licensed and currently shows a small amount of community activity.

My Take

What strikes me is how much this looks like an observability tool for agent behavior, not just a helper script. That feels important: once you start using Claude Code seriously, the hard problem is often not “can it do the task?” but “is it quietly getting less careful, more verbose, more brittle, or more shortcut-happy over time?” I think this repo is trying to answer that in a way that’s concrete enough to act on.

I also like the privacy stance. Local-only analysis of logs already on disk is exactly the sort of thing I’d prefer for developer workflow telemetry. If you’re going to inspect agent behavior, I’d much rather do it without shipping sensitive session data to another service. The fact that it works from existing Claude Code JSONL files makes it feel practical instead of aspirational.

That said, I’m a little skeptical of the more elaborate metric stack. Some of these signals (“reasoning loops,” “thinking redaction rate,” “mean thinking length,” and so on) may be useful, but I think there’s a risk of fitting a narrative to noisy proxies. A drift dashboard can be helpful, but it can also make people feel like they have a scientific handle on model quality when they really just have an instrument panel full of heuristics. I’d be curious whether the composite health score and inflection detection stay meaningful across very different project styles.

The most compelling part, to me, is the combination of hard counts and forensic output. A markdown or HTML report that’s ready to paste into an issue or gist is actually useful for debugging a real Claude Code workflow problem. If I were using this, I’d try it on a few weeks of sessions, then compare the output against my own intuition: did the model start over-editing files, looping more, or acting less deliberate after a model switch or workflow change?

Overall, this is a thoughtful, developer-centric attempt to make Claude Code behavior inspectable. It’s early, and some of the metrics may prove noisy, but the direction is strong: local, auditable, session-level analysis for spotting drift before it becomes a productivity tax.