For Claude Code users, this repository is interesting because it turns a vague feeling (“my agent seems to be getting worse”) into something measurable. I think that’s a genuinely useful direction: instead of relying on vibes, it mines the session logs Claude Code already writes locally and tries to surface drift, regressions, and behavioral shifts over time.
cc-canary is a pair of installable Agent Skills for Claude Code.

- It reads the session logs Claude Code already writes under ~/.claude/projects/ and analyzes them for drift in your own work.
- Status is 0.x / pre-alpha, so the metric set and output format may still change.
- cc-canary → outputs a forensic markdown report
- cc-canary-html → outputs the same report as a dark-theme HTML dashboard that auto-opens in a browser when possible
- The default analysis window is 60d, with support for 7d, 14d, 30d, 60d, 90d, and 180d.
- The verdict is one of HOLDING, SUSPECTED REGRESSION, CONFIRMED REGRESSION, or INCONCLUSIVE.
- Install with npx skills add delta-hq/cc-canary, or you can install only one of the two skills.
- It requires python3 >= 3.8.
- Records are deduplicated by (message.id, requestId), because Claude Code can write the same message into multiple JSONLs when sessions are resumed or branched.

What strikes me is how much this looks like an observability tool for agent behavior, not just a helper script. That feels important: once you start using Claude Code seriously, the hard problem is often not “can it do the task?” but “is it quietly getting less careful, more verbose, more brittle, or more shortcut-happy over time?” I think this repo is trying to answer that in a way that’s concrete enough to act on.
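To make the deduplication by (message.id, requestId) concrete, here is a minimal sketch of how I would approach it myself. It is not cc-canary's actual code. The record layout is an assumption on my part: I am guessing that `id` sits inside a nested `message` object and that `requestId` is a top-level field. The helper names `iter_records` and `dedupe` are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List, Set, Tuple


def iter_records(paths: Iterable[Path]) -> Iterator[dict]:
    """Yield one JSON object per non-empty line, skipping malformed lines."""
    for path in paths:
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate partially written lines
                if isinstance(rec, dict):
                    yield rec


def dedupe(records: Iterable[dict]) -> List[dict]:
    """Keep the first record seen for each (message.id, requestId) pair.

    ASSUMPTION: "message" is a nested object holding "id", and "requestId"
    is a top-level key. Records missing either part are kept untouched,
    since there is no key to deduplicate them on.
    """
    seen: Set[Tuple[str, str]] = set()
    unique: List[dict] = []
    for rec in records:
        message = rec.get("message")
        msg_id = message.get("id") if isinstance(message, dict) else None
        req_id = rec.get("requestId")
        if msg_id is None or req_id is None:
            unique.append(rec)
            continue
        key = (msg_id, req_id)
        if key in seen:
            continue  # same message copied into another session file
        seen.add(key)
        unique.append(rec)
    return unique
```

Under the same assumptions, and guessing that the logs use a `.jsonl` extension, a call might look like `dedupe(iter_records((Path.home() / ".claude" / "projects").rglob("*.jsonl")))`. Without a step like this, a resumed or branched session would count the same message more than once and inflate any per-message metric.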
I also like the privacy stance. Local-only analysis of logs already on disk is exactly the sort of thing I’d prefer for developer workflow telemetry. If you’re going to inspect agent behavior, I’d much rather do it without shipping sensitive session data to another service. The fact that it works from existing Claude Code JSONL files makes it feel practical instead of aspirational.
That said, I’m a little skeptical of the more elaborate metric stack. Some of these signals (“reasoning loops,” “thinking redaction rate,” “mean thinking length,” and so on) may be useful, but I think there’s a risk of fitting a narrative to noisy proxies. A drift dashboard can be helpful, but it can also make people feel like they have a scientific handle on model quality when they really just have an instrument panel full of heuristics. I’d be curious whether the composite health score and inflection detection stay meaningful across very different project styles.
The most compelling part, to me, is the combination of hard counts and forensic output. A markdown or HTML report that’s ready to paste into an issue or gist is actually useful for debugging a real Claude Code workflow problem. If I were using this, I’d try it on a few weeks of sessions, then compare the output against my own intuition: did the model start over-editing files, looping more, or acting less deliberate after a model switch or workflow change?
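As a rough sketch of that sanity check, the snippet below compares a daily metric before and after a known change date. Everything here is hypothetical: the numbers are made up, the function names are mine, and this is not how cc-canary computes its verdicts or detects inflection points. It is only the kind of back-of-the-envelope comparison I would do by hand.

```python
from datetime import date
from statistics import mean
from typing import Dict, List, Optional, Tuple


def split_by_date(
    daily: Dict[date, float], change_day: date
) -> Tuple[List[float], List[float]]:
    """Split a daily metric into values before and on/after change_day."""
    ordered = sorted(daily.items())
    before = [value for day, value in ordered if day < change_day]
    after = [value for day, value in ordered if day >= change_day]
    return before, after


def relative_shift(before: List[float], after: List[float]) -> Optional[float]:
    """Return (mean(after) - mean(before)) / mean(before).

    Returns None when either side is empty or the baseline mean is zero,
    because the ratio is undefined in those cases.
    """
    if not before or not after:
        return None
    baseline = mean(before)
    if baseline == 0:
        return None
    return (mean(after) - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical "file edits per session" averages, one value per day.
    edits_per_session = {
        date(2025, 1, 1): 2.0,
        date(2025, 1, 2): 2.0,
        date(2025, 1, 3): 2.0,
        date(2025, 1, 4): 3.0,  # assumed day of the model switch
        date(2025, 1, 5): 3.0,
    }
    before, after = split_by_date(edits_per_session, date(2025, 1, 4))
    print(relative_shift(before, after))  # 0.5, i.e. 50% more edits per session
```

With only a handful of days on each side, a shift like this could easily be noise or a change in the kind of work I was doing, which is exactly why I’d treat it as a prompt to go read the sessions rather than as proof of a regression.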
Overall, this is a thoughtful, developer-centric attempt to make Claude Code behavior inspectable. It’s early, and some of the metrics may prove noisy, but the direction is strong: local, auditable, session-level analysis for spotting drift before it becomes a productivity tax.
Reference: GitHub - delta-hq/cc-canary