git-secret-scanner — Three-Layer Secret Detection in Git History

A Python tool that walks git commit history hunting for leaked credentials, layering regex pattern matching, Shannon entropy analysis, and LLM-powered context verification, each layer targeting the false positives the previous one leaves behind.

1. Why another scanner

Regex-only scanners work but generate a lot of noise — documentation examples, test fixtures, placeholder strings that match a known format structurally but aren't real credentials. The result is a report you have to manually triage, which quickly becomes tedious on any repo with a non-trivial commit history.

I built git-secret-scanner around a three-layer pipeline to deal with this. Regex catches everything with a known format. Shannon entropy scores each match so that obviously low-randomness strings can be flagged before spending an LLM call on them. The LLM does a final context-aware pass over whatever survives, with full access to the commit message, file path, and surrounding diff lines. The full source is on GitHub: ahossu/git-secret-scanner.

2. Layer 1 — Regex

The first pass runs 100+ patterns against every added line in each commit diff. The patterns cover the formats that actually show up in real leaks:

  • Cloud credentials: AWS access keys (AKIA[0-9A-Z]{16}) and session tokens, GCP service account key blocks, Azure storage keys and SAS tokens, DigitalOcean and other cloud provider tokens
  • API keys and tokens: GitHub PATs (both classic and fine-grained formats), Slack tokens and webhook URLs, Stripe publishable and secret keys, Twilio, SendGrid, Mailgun, and 80+ other services
  • Database credentials: connection strings with embedded passwords for PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch — anything that follows the ://user:pass@host pattern or equivalent
  • Private keys: RSA, EC, PGP, and SSH private key blocks identified by their PEM headers, plus PKCS8 and OpenVPN static key formats

Regex is fast and has high recall — it catches everything with a well-known format. It's also the noisiest layer: a documentation example or a test config file can look structurally identical to a real credential.
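A minimal sketch of how such a first pass works. The pattern subset and the scan_line helper below are illustrative, not the tool's actual API; the real tool ships 100+ patterns:

```python
import re

# Illustrative subset of layer-1 patterns (the real tool ships 100+).
PATTERNS = {
    "AWS Access Key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GitHub PAT (classic)": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "Connection String": re.compile(r"\b\w+://[^/\s:]+:[^@\s]+@\S+"),
    "PEM Private Key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}

def scan_line(line):
    """Return (pattern_name, matched_text) pairs for one added diff line."""
    return [(name, m.group(0))
            for name, rx in PATTERNS.items()
            for m in rx.finditer(line)]

# Example: a connection string with an embedded password on an added line.
hits = scan_line('+var url = "postgres://admin:hunter2@db.internal:5432/app";')
```

In the real pipeline each match is then packaged with its commit hash, file path, and diff context before moving on to the entropy layer.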

3. Layer 2 — Shannon entropy

For each regex match, the tool computes the Shannon entropy of the matched string:

H = -Σ p(c) · log₂(p(c))

where p(c) is the relative frequency of character c in the string. Real credentials are designed to be high-entropy: a 32-character hexadecimal string tops out at H = log₂(16) = 4.0 bits when all 16 characters appear equally often, and most real API keys and tokens, drawn from larger base62 or base64 alphabets, score higher still. Placeholder strings like YOUR_API_KEY_HERE or example-secret-value are structurally long but have low character diversity and correspondingly low entropy.
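The score is cheap to compute. A minimal sketch of the entropy function (the helper name is mine, not necessarily the tool's):

```python
from collections import Counter
from math import log2

def shannon_entropy(s: str) -> float:
    """H = -sum(p(c) * log2(p(c))) over the characters of s, in bits."""
    n = len(s)
    return -sum((k / n) * log2(k / n) for k in Counter(s).values())

# A placeholder scores visibly lower than a random-looking hex key:
# shannon_entropy("YOUR_API_KEY_HERE")                 -> ~3.29 bits
# shannon_entropy("8f3a9c1e4b7d2f60a5e8c3b1d9f47a02") -> ~3.95 bits
```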

The entropy score is combined with a check for explicit test-data indicators — file paths like test_fixtures.py or docs/examples/, variable names containing "example" or "placeholder", commit messages that mention test data. Together these populate the likely_false_positive field on each finding before the LLM pass.

This is a cheap pre-filter: strings that are clearly low-randomness or sitting in test files can be flagged without spending an inference call on them. The LLM still sees the full finding and can override the flag when the surrounding context makes a string suspicious despite the heuristics.
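Put together, the pre-filter amounts to a single predicate. The hint lists and the 3.5-bit threshold below are illustrative values, not the tool's exact configuration:

```python
# Illustrative hint lists; the real tool's indicator set may differ.
TEST_PATH_HINTS = ("test", "fixture", "example", "docs/")
PLACEHOLDER_HINTS = ("example", "placeholder", "your_", "xxx", "dummy")

def likely_false_positive(matched_text: str, file_path: str,
                          entropy: float, threshold: float = 3.5) -> bool:
    """Flag low-randomness strings and matches sitting in test/demo files."""
    if entropy < threshold:
        return True
    if any(hint in file_path.lower() for hint in TEST_PATH_HINTS):
        return True
    return any(hint in matched_text.lower() for hint in PLACEHOLDER_HINTS)
```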

4. Layer 3 — LLM verification

Anything that survives the first two layers goes to an LLM for a final context-aware pass. The model gets the matched string plus its full commit context: the commit message, author, date, file path, the specific diff line, and a few surrounding lines from the unified diff. It returns three fields:

  • is_real — boolean verdict
  • confidence — float 0–1
  • reasoning — plain-text explanation of the decision

This is the layer that handles cases entropy alone can't distinguish. A high-entropy string in a file committed with the message "add unit tests" is likely a test value. A low-entropy password in a file called postgres_model.js committed with "chore: add postgres connection information" is almost certainly real. The LLM has the context to tell the difference; entropy doesn't.

The reasoning field is included in the output so you can audit each decision instead of treating the model's verdict as a black box. The LLM backend is any Hugging Face Inference Endpoint serving an OpenAI-compatible /v1/ API, configured at runtime via --base-url and --hf-token.
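The request itself travels over the standard OpenAI-style chat completions API, so the interesting parts are the prompt assembly and the response parsing. A sketch under the assumption that the model is asked to reply in JSON; the exact prompt wording is mine, not the tool's:

```python
import json

def build_verification_prompt(finding: dict) -> str:
    """Pack the match and its commit context into one layer-3 prompt.
    The wording here is an assumption, not the tool's actual prompt."""
    return (
        "Decide whether this string found in a git diff is a real leaked credential.\n"
        f"Matched text: {finding['matched_text']}\n"
        f"File path: {finding['file_path']}\n"
        f"Commit message: {finding['commit_message']}\n"
        f"Diff context:\n{finding['context_snippet']}\n"
        'Reply with JSON only: {"is_real": bool, "confidence": float, "reasoning": str}'
    )

def parse_verdict(raw_reply: str) -> dict:
    """Extract the three output fields from the model's JSON reply."""
    verdict = json.loads(raw_reply)
    return {key: verdict[key] for key in ("is_real", "confidence", "reasoning")}
```

At runtime the prompt would go through any OpenAI-compatible client pointed at the endpoint from --base-url and authenticated with --hf-token.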

5. Output and usage

The scanner writes a JSON file with a scan_metadata block (timestamp, total findings, high-confidence count) and a findings array. Each entry includes everything you need to act on the finding:

{
  "type": "Database Password",
  "matched_text": "pass=\"sup3rstr0ngpass1ForGG\"",
  "line_number": 6,
  "entropy": 3.70,
  "likely_false_positive": false,
  "detection_method": "regex",
  "llm_analysis": {
    "is_real": true,
    "confidence": 0.95,
    "reasoning": "Password in a database connection file, committed with a message that explicitly describes adding production connection info."
  },
  "commit_hash": "d95287b420366311433f4610b94a2c0844f4dce3",
  "commit_message": "chore: add postgres connection information",
  "commit_author": "Henri Hubert",
  "commit_date": "2021-01-12 17:45:52+01:00",
  "file_path": "postgres_model.js",
  "diff_line": "+var pg_pass=\"sup3rstr0ngpass1ForGG\";",
  "context_snippet": "@@ -0,0 +1,7 @@\n+var pg_port=1212;\n+..."
}

Basic usage — both remote URLs and local paths work for --repo:

python scan.py \
    --repo https://github.com/user/repo.git \
    --base-url https://your-endpoint.huggingface.cloud/v1/ \
    --hf-token hf_yourtoken \
    --n 50 \
    --out report.json

--n controls how many commits back to scan. The terminal output shows a running count per commit and a summary line at the end. The JSON report contains only the high-confidence findings that passed all three layers.

The tool was tested against GitGuardian's public sample_secrets repository, which is specifically maintained as a reference target for secret scanner validation. All 6 high-confidence findings in that repo were correctly identified.