llm-decompile-cleaner — Decompiler Output Cleanup with llm4decompile-22b-v2
A post-processor for Ghidra, IDA, and Binary Ninja C dumps that splits the output by function
and pipes each one through llm4decompile-22b-v2 — a model trained specifically
on decompiler output — then deduplicates prototypes and normalizes whitespace.
1. The problem with decompiler output
When working through a binary — whether for RE research, a CTF, or auditing a native library — you often export anywhere from a handful to several dozen functions as C pseudocode from Ghidra, IDA, or Binary Ninja. The output is readable, but it's consistently noisy:
- Stack variables come out as local_8 and local_10, and unnamed functions as sub_140001234.
- Forward declarations get duplicated when the decompiler emits a prototype at the top of the dump and then the full definition below.
- Formatting varies enough between tools that a batch export looks visually inconsistent throughout.
Cleaning this by hand for a few functions is fine. Cleaning 40 functions from a native library before you can actually reason about the code is tedious enough that I wanted a way to automate the renaming and cleanup pass. The full source is on GitHub: ahossu/llm-decompile-cleaner.
2. Why llm4decompile-22b-v2
The obvious approach is to pipe the decompiler output through a general-purpose LLM. It works, but general models tend to over-rename: they'll confidently replace local_8 with something like buffer based on surrounding context, and sometimes that guess is wrong. A misnamed variable in RE output is worse than a generic one because it actively points you in the wrong direction.
llm4decompile-22b-v2 is a model trained specifically on decompiled code rather than general source code. It understands the naming patterns that decompilers produce — sub_XXXXXXXX for functions without symbols, local_N for stack-allocated variables, param_N for parameters — and works within those conventions rather than trying to impose source-code naming on top of them. The 22b-v2 checkpoint is the largest and most capable in the series. It's served through a HuggingFace Inference Endpoint configured in config.py.
3. How it works
The input is a raw C dump from any decompiler. The only requirement is that each function block starts with a comment marker — the default format is /* ---- Function: foo ---- */, which is what the bundled IDA and Ghidra export scripts produce. The regex _SPLIT_RE controls the exact pattern and can be adjusted for other formats.
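To make the splitting step concrete, here is a minimal sketch of what the marker-based split can look like. The regex and function below are illustrative, written against the default /* ---- Function: foo ---- */ format; the actual _SPLIT_RE in the repo may differ in detail.

```python
import re

# Illustrative version of _SPLIT_RE, matching the default marker format:
#   /* ---- Function: foo ---- */
_SPLIT_RE = re.compile(r"/\* -{4} Function: (?P<name>\S+) -{4} \*/")

def split_functions(dump: str) -> list[tuple[str, str]]:
    """Split a raw decompiler dump into (function name, body) pairs."""
    markers = list(_SPLIT_RE.finditer(dump))
    blocks = []
    for i, m in enumerate(markers):
        start = m.end()
        # Each block runs until the next marker (or end of file).
        end = markers[i + 1].start() if i + 1 < len(markers) else len(dump)
        blocks.append((m.group("name"), dump[start:end].strip()))
    return blocks
```

Adjusting for another marker format is just a matter of swapping the pattern, as long as it captures the function name in a named group.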
Once split, each function is sent as a separate inference request. Sending functions individually rather than dumping the whole file into a single prompt avoids hitting the model's context window limit and gives cleaner results per function — the model can focus on one function at a time without 50 unrelated ones in the context.
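A sketch of what a single-function request can look like, using only the standard library. The endpoint URL and token below are placeholders (the real values live in config.py), and the response shape assumes the usual text-generation endpoint format of [{"generated_text": ...}] — check your endpoint's actual schema.

```python
import json
import urllib.request

# Placeholders: the real endpoint URL and token are read from config.py.
ENDPOINT_URL = "https://example.endpoints.huggingface.cloud"
HF_TOKEN = "hf_xxx"

def build_request(func_src: str) -> urllib.request.Request:
    """Build one POST request carrying a single function's source."""
    payload = json.dumps({"inputs": func_src}).encode()
    return urllib.request.Request(
        ENDPOINT_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "application/json",
        },
    )

def clean_function(func_src: str) -> str:
    """Send one function to the endpoint and return the model's output."""
    req = build_request(func_src)
    with urllib.request.urlopen(req, timeout=120) as resp:
        out = json.loads(resp.read())
    # Typical text-generation endpoints respond with
    # [{"generated_text": "..."}]; adjust if yours differs.
    return out[0]["generated_text"]
```

Because each request carries exactly one function, a failure or truncation affects only that function rather than the whole batch.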
After the model responds, a post-processing pass runs before the output is written:
- Duplicate prototype removal. Decompiler dumps often include a forward declaration at the top of the file and then the full function definition further down. After LLM processing the prototype can reappear in both places — the pass deduplicates them.
- Whitespace normalization. Inconsistent indentation and redundant blank lines are cleaned up to make the final output consistent regardless of which decompiler produced the input.
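The two cleanup steps above can be sketched in a few lines. This is an illustrative version, not the repo's actual implementation: the prototype regex here only handles simple one-line declarations, and the real pass may be more thorough.

```python
import re

# A line that looks like a simple one-line C prototype, e.g. "int foo(int a);"
_PROTO_RE = re.compile(r"^[\w\*\s]+\([^)]*\);\s*$")

def postprocess(src: str) -> str:
    """Drop repeated prototype lines and collapse runs of blank lines."""
    seen: set[str] = set()
    out: list[str] = []
    prev_blank = False
    for line in src.splitlines():
        stripped = line.rstrip()
        if _PROTO_RE.match(stripped):
            key = re.sub(r"\s+", " ", stripped)
            if key in seen:
                continue  # duplicate prototype: drop it
            seen.add(key)
        if stripped == "":
            if prev_blank:
                continue  # collapse consecutive blank lines into one
            prev_blank = True
        else:
            prev_blank = False
        out.append(stripped)
    return "\n".join(out) + "\n"
```

Running the pass on model output line by line like this also takes care of trailing whitespace, which decompilers and models both tend to leave behind.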
The --keep-intermediate flag saves the raw model response before this pass. Useful if the post-processing strips something it shouldn't, or if you want to diff raw model output against the cleaned version to understand what the pass is doing.
# after setting your endpoint and token in config.py:
python llm4decompile_clean.py -i raw_funcs.c -o clean_funcs.c
# keep the raw model output alongside the cleaned file
python llm4decompile_clean.py -i raw_funcs.c -o clean_funcs.c --keep-intermediate