Clean your fine-tuning data.
Every defect that actually hurts training.
Schema-validate, deduplicate, score every row on four axes, rewrite bad responses without dropping a single fact, and audit your DPO pairs for label errors — all before your trainer sees them.
Fine-tuning runs fail because of the data, not the trainer. Duplicates bias the model. Bad rows teach it to be sycophantic. A single refusal in your chosen column teaches it to refuse. The model is the compression of your data — clean the data before you compress it.
Nine pillars. Every one is opinionated and defensible.
Each feature below exists because it stops a specific failure mode. We cite the literature so you can check our math.
Know your data
Schema validation catches the format defects that silently kill training jobs. We parse into a canonical row model so every downstream check sees the same shape.
Auto-detect format
JSONL, JSON, CSV, ShareGPT. We normalize to a CanonicalRow model so messages, instruction/output, and prompt/completion all route through the same validation path.
Real tokenizer
Token counts come from tiktoken cl100k_base, not a char/4 heuristic. Length-budget warnings are computed against your target profile's real context window.
Structural checks
Invalid roles, system-not-first, invisible unicode, BOM artifacts, ragged CSV columns, truncated responses, placeholder answers, empty sides. 40+ codes.
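A minimal sketch of the token-budget check, assuming the tiktoken package. The context-window figure, reserve headroom, and function names are illustrative, not the tool's internal API.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_length(text: str) -> int:
    """Count tokens with the real cl100k_base tokenizer, not len(text) / 4."""
    return len(enc.encode(text))

def over_budget(row_text: str, context_window: int = 8192, reserve: int = 1024) -> bool:
    """Flag rows that leave less than `reserve` tokens of headroom in the target window."""
    return token_length(row_text) > context_window - reserve

print(token_length("Explain how MinHash LSH finds near-duplicate rows."))
```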
Kill duplicates — three layers
Duplicates bias training toward whatever repeats. We catch them at three levels: identical, near-identical, and semantic.
Exact
SHA-256 over the canonical lowercased text. O(n).
Near-duplicate
MinHash + LSH (128 permutations, 3-word shingles), O(n) at 100k+ rows. Verified by RapidFuzz ratio on the LSH candidate pairs.
Semantic
Sentence-transformer cosine similarity catches paraphrase duplicates MinHash misses. Union-find clusters so you see which rows collapse together.
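A compact sketch of the first two layers, assuming the datasketch and rapidfuzz packages. Permutation count and shingle size follow the description above; the 0.8 LSH threshold, 90-point RapidFuzz cutoff, and example rows are illustrative.

```python
import hashlib
from datasketch import MinHash, MinHashLSH
from rapidfuzz import fuzz

def exact_key(text: str) -> str:
    # Layer 1: SHA-256 over canonicalized (lowercased, whitespace-collapsed) text.
    return hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()

def minhash(text: str, num_perm: int = 128) -> MinHash:
    # Layer 2: MinHash over 3-word shingles.
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

rows = ["Reset your password from the account page.",
        "Reset your password from the accounts page.",
        "Submit a ticket to rotate your API key."]
lsh = MinHashLSH(threshold=0.8, num_perm=128)
sketches = {i: minhash(t) for i, t in enumerate(rows)}
for i, m in sketches.items():
    lsh.insert(f"row-{i}", m)

# LSH candidate pairs are verified with a RapidFuzz ratio before anything is flagged.
for i in range(len(rows)):
    for key in lsh.query(sketches[i]):
        j = int(key.split("-")[1])
        if j > i and fuzz.ratio(rows[i], rows[j]) >= 90:
            print(f"near-duplicate: row {i} ~ row {j}")
```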
Score every row on four axes
A single quality number lies. Four explicit axes tell you why a row scored the way it did.
Factual Grounding
Density of verifiable claims — names, numbers, IDs, mechanisms, steps — relative to what the instruction asks for.
Instruction Fit
Does the response actually answer the instruction? B = 1 triggers a gated override that forces the aggregate score to 1: a row that answers the wrong question is fatal (sketched at the end of this section).
Coherence
Consistency, completeness, no truncation, no repetition or filler. Owns all filler-penalization so axis D doesn't double-count.
Training Signal
Penalizes only meta-commentary and AI-tells (“as an AI”, “Certainly!”). Domain jargon is not punished.
Calibration anchors
Seven calibration exemplars including mid-tier 6/10 and 7/10 anchors. Reduces judge-mean drift between runs.
Parse-retry
One retry on malformed JSON recovers the 1–3 % of rows that would otherwise silently drop under concurrency.
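A hypothetical sketch of how the four axes combine. The axes are scored 1–5 as described above; the mapping onto a 1–10 aggregate is an assumption here, but the override is the point: an Instruction Fit (B) of 1 pins the aggregate at 1 no matter how the other axes scored.

```python
from dataclasses import dataclass

@dataclass
class AxisScores:
    grounding: int        # A: factual grounding, 1-5
    instruction_fit: int  # B: answers the instruction, 1-5
    coherence: int        # C: consistency / completeness, 1-5
    signal: int           # D: training signal, no AI-tells, 1-5

def aggregate(s: AxisScores) -> int:
    if s.instruction_fit == 1:
        return 1  # gated override: answering the wrong question is fatal
    mean = (s.grounding + s.instruction_fit + s.coherence + s.signal) / 4
    return round(mean * 2)  # illustrative mapping onto a 1-10 scale

print(aggregate(AxisScores(5, 1, 5, 5)))  # -> 1, despite three perfect axes
print(aggregate(AxisScores(4, 4, 5, 3)))  # -> 8
```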
Fix rows without inventing facts
Most rewrite tools drop facts and hallucinate new ones. Ours doesn't. When a rewrite fails the guard, the model gets an explicit diff of what it dropped — not a vague “try again.”
Hard-ID guard
ALLCAPS acronyms and tokens with digits (APT29, T1071.001, CVE-2024-1234, DD Form 1380) must survive verbatim. Zero tolerance.
Number, URL, quote preservation
Every number (with comma-normalization), every URL, every quoted string — exact.
Soft-noun survival
Proper-noun preservation at a tunable 60 % threshold. Stopword list excludes days, months, demonyms so the guard doesn't over-reject on glue words.
Hallucination detector
The rewrite may not introduce any number or hard-ID that wasn't in the original. Violations are rejected.
Fact-diff retry
Up to two retries, each with an explicit list of which facts were dropped and which hallucinations to remove. Vague “you dropped facts” produces near-zero lift; specific diffs work. Cites Madaan et al. Self-Refine, 2023.
Keep-original fallback
If the rewrite would require inventing content, the original is kept and you get the reason why.
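A minimal sketch of the hard-ID guard and hallucination check, in the spirit of the description above. The regex is an illustrative approximation of "ALLCAPS acronyms and tokens with digits," not the production pattern set.

```python
import re

HARD_ID = re.compile(r"\b(?:[A-Z]{2,}[A-Z0-9.\-]*|[A-Za-z]*\d[\w.\-]*)\b")

def hard_ids(text: str) -> set[str]:
    return set(HARD_ID.findall(text))

def guard(original: str, rewrite: str) -> dict:
    before, after = hard_ids(original), hard_ids(rewrite)
    return {
        "dropped": sorted(before - after),       # facts the rewrite lost
        "hallucinated": sorted(after - before),  # IDs the rewrite invented
    }

diff = guard(
    "APT29 used T1071.001; see CVE-2024-1234.",
    "The group used application-layer C2; see CVE-2024-9999.",
)
print(diff)  # this diff is what gets fed back verbatim into the fact-diff retry
```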
DPO & ORPO pairs get their own audit
Row-level scanning is blind to preference-pair defects. Swap-invariant polarity, embedding-margin detection, prompt-conflict finding, and multi-turn role preservation — every DPO failure mode we know about has a check.
Swap-invariant polarity
Score each pair twice with sides flipped, average, flag disagreement > 1 as low-confidence. Mitigates the position bias Zheng et al. 2023 documented.
Embedding-margin detector
Chosen and rejected with cosine similarity ≥ 0.92 produce a vanishing DPO gradient. Flagged as near_identical_semantic. Cites Rafailov et al. 2023.
Prompt-conflict detector
Same prompt with flipped (chosen, rejected) direction across rows is a gradient-conflict footgun. UltraFeedback-style detection.
Length-gaming share
When chosen ≥ 2.5× longer than rejected in > 50 % of pairs, the dataset teaches “longer = better” — a known reward-hack (Singhal et al. 2024).
Identity & duplicate pairs
Identity pairs (chosen == rejected) contribute zero loss. Duplicate pairs come in exact (SHA-1 over full text) and near-duplicate (MinHash over 5-gram shingles).
Multi-turn role preservation
TRL-format pairs keep [user] / [assistant] boundaries intact through dedup, polarity, and margin scoring. Turn structure is not collapsed.
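A sketch of three of the pair-level checks, assuming the sentence-transformers package. The 0.92 cosine and 2.5x length thresholds follow the description above; the embedding model and function layout are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def audit_pair(chosen: str, rejected: str) -> list[str]:
    flags = []
    if chosen.strip() == rejected.strip():
        return ["identity_pair"]                 # contributes zero DPO loss
    c, r = model.encode([chosen, rejected], normalize_embeddings=True)
    if float(np.dot(c, r)) >= 0.92:
        flags.append("near_identical_semantic")  # vanishing DPO gradient
    if len(chosen) >= 2.5 * len(rejected):
        flags.append("length_gamed")             # counts toward the >50% share check
    return flags

print(audit_pair("Use a prepared statement to avoid SQL injection.",
                 "Just concatenate the user input into the query string."))
```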
See the structure of your data
Coverage holes, over-represented topics, and missing diversity are invisible in a row-by-row view.
Topic clusters
HDBSCAN over UMAP-reduced sentence-transformer embeddings. Variable-density clusters, outliers marked. Graceful K-means fallback.
Diverse subset selection
Farthest-point / facility-location sampling when you want the most diverse N rows out of a larger set.
Teach a concept
Give positive and negative exemplars; every row gets scored against the concept you care about. Useful for finding gaps.
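A sketch of the topic-cluster pass, assuming sentence-transformers, umap-learn, and hdbscan are installed. Parameters are illustrative defaults rather than the tool's tuned values, and the pass needs at least a few dozen rows to produce meaningful clusters.

```python
import hdbscan
import umap
from sentence_transformers import SentenceTransformer

def cluster_topics(texts: list[str]) -> list[int]:
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts, normalize_embeddings=True)
    reduced = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(emb)
    # HDBSCAN handles variable-density clusters; label -1 marks outlier rows.
    return hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced).tolist()
```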
Safety & privacy pass
Honest labels. Regex is called regex. Heuristics are called heuristics. The ML pass is called ML.
PII detection
Ten regex types — email, phone, SSN, credit card, IBAN, API key, password pattern, address, DOB, public IP. Plus a capitalized-name heuristic flagged as “not NER-grade” so you know what it is.
ML toxicity
Detoxify (BERT, unbiased-small) with 0.80 / 0.55 thresholds. Skips above 200 rows with a visible warning — no silent passes.
Prompt-injection patterns
Critical-severity flag for rows that look like training-time injection.
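Two of the ten PII regexes, roughly in the shape described above; the tool's exact patterns differ. Regex is labeled regex, nothing more.

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_hits(text: str) -> dict[str, list[str]]:
    return {name: p.findall(text) for name, p in PII_PATTERNS.items() if p.search(text)}

print(pii_hits("Contact jane.doe@example.com, SSN 123-45-6789."))
```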
Autofix — opt-in, not magical
Twenty-three autofix operations. Every one is explicit about what it changes. You approve the list.
Format repairs
Fix smart quotes, BOM, encoding. Trim whitespace. Normalize roles. Convert JSON arrays to JSONL. Convert ShareGPT to messages.
Row removal
Exact, near, and duplicate-output dedup. Empty-row, short-row, placeholder, refusal, unfinished-response removal.
Content strip
GPT slop openers and closers (“Certainly!”, “I hope this helps!”), repetition collapse, HTML stripping, markdown stripping, optional PII redaction.
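A sketch of three of the autofix operations: BOM strip, smart-quote normalization, and slop-opener/closer removal. The slop list here is a tiny illustrative subset of what the tool strips.

```python
import re

SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
SLOP_OPENERS = re.compile(r"^(?:Certainly!|Sure!|Great question!)\s*", re.IGNORECASE)
SLOP_CLOSERS = re.compile(r"\s*(?:I hope this helps!|Let me know if you have any questions\.)\s*$",
                          re.IGNORECASE)

def autofix(text: str) -> str:
    text = text.lstrip("\ufeff")         # BOM artifact
    text = text.translate(SMART_QUOTES)  # smart quotes -> ASCII
    text = SLOP_OPENERS.sub("", text)
    text = SLOP_CLOSERS.sub("", text)
    return text.strip()

print(autofix("\ufeffCertainly! Use \u201cgit bisect\u201d to find the bad commit. I hope this helps!"))
# -> 'Use "git bisect" to find the bad commit.'
```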
Export to what your trainer expects
Seven target profiles. Rows you flagged during review are filtered. Critical issues are dropped by default.
Chat formats
OpenAI Chat, Azure OpenAI Chat, HuggingFace Messages. Proper system/user/assistant role tagging.
Instruction formats
ModelBrew FT (instruction/output), Alpaca, prompt_completion.
Text format
HuggingFace Text for causal-LM / plain-text continuation training.
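A sketch of two of the seven targets from a simplified canonical row. Field names follow the public OpenAI chat and Alpaca conventions; the real canonical row model carries more metadata.

```python
def to_openai_chat(row: dict, system: str = "You are a helpful assistant.") -> dict:
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": row["instruction"]},
        {"role": "assistant", "content": row["output"]},
    ]}

def to_alpaca(row: dict) -> dict:
    return {"instruction": row["instruction"], "input": row.get("input", ""),
            "output": row["output"]}

row = {"instruction": "Summarize the incident report.",
       "output": "Two services degraded for 14 minutes; root cause was a bad config push."}
print(to_openai_chat(row))
print(to_alpaca(row))
```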
Production safety beyond PII
PII catches names. Production catches attacks, agentic shape errors, and operational-security leaks. Three detector families that an SFT-only checker won't have.
Jailbreak detection — 8 categories
Prompt-injection, role-bypass, system-prompt extraction, encoding tricks, persona-jailbreak, security-researcher framing, instruction-hierarchy attack, multi-turn jailbreak. Critical-severity flag on any hit.
Tool-call validation
Validates OpenAI tool_calls and Anthropic tool_use shapes against your tool schema. missing_required_arg and wrong_arg_type are critical, unknown_arg is a warning. Built for shipping agentic fine-tunes.
Military / OPSEC redaction
Six detector codes — classification markings (UNCLASSIFIED//FOUO style), MGRS grid coordinates, EDIPI numbers, DTG timestamps, lat/long coordinates, internal network references. Same severity weight as critical PII.
Industry-specific typed PII
Nine industry-specific identifier detectors layered on top of the standard PII pass — medical (MRN, DEA, ICD-10, NPI), financial (CUSIP, SWIFT/BIC, ABA routing), legal (bar number, Bates number). Categorized so your downstream policy can branch by industry.
Hardcoded-secret block
Eleven secret patterns including AWS access keys, GitHub PATs, Slack tokens, OpenAI sk- keys, URL-embedded credentials. A row leaking a credential is forced to score 1 / 10 regardless of surface quality.
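Three of the eleven secret patterns as a sketch, using the widely published key prefixes (AWS access key ID, GitHub classic PAT, OpenAI sk- key). The tool's full rule set is broader; any hit forces the row's score to 1/10.

```python
import re

SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "openai_key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b"),
}

def leaked_secrets(text: str) -> list[str]:
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]

print(leaked_secrets("export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE"))  # ['aws_access_key_id']
```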
Label quality — two paths, not one
Single-judge label scoring is one signal. We give you three. Cross-check, disagree, vote — and tag the ones that don't agree.
Judge cross-check
The original LLM-judge label-error path: the model is asked whether it would flip the label and to explain why. Disagreement with the original label produces likely_mislabeled.
Judge-disagreement v2
Independent path: when the AI judge score and the heuristic quality_score disagree by ≥ 4 points, the row is flagged. Catches Cleanlab-style label errors without their certifier dependency.
Weak supervision — 11 labeling functions
Pattern-based labelers (refusal, sycophancy, off-topic, low-density, code, list-only, very-short, very-long, etc.) vote per row. Abstain-aware aggregation: a row with 2 flags + 9 abstains is treated very differently from 2 flags + 9 OKs.
14-dimension rubric (opt-in)
Phase 4a research rubric — instruction adherence, factuality, coherence, helpfulness, conciseness, safety, plus eight more named axes. Enable via CLEANER_JUDGE_RUBRIC=14dim when the 4-axis rubric is too coarse for your use case.
G-Eval scorer (opt-in)
Reference-grounded G-Eval pass when you upload gold answers. Logprob-weighted per-row faithfulness scoring (when the provider supports it). Aligns with the 2024 G-Eval prompt research.
Pre-flight cost estimator
Before a single Gemini or Anthropic call goes out, you see the projected token spend, latency band, and credit cost. No surprise bills on a 100,000-row scan.
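A sketch of two of the eleven labeling functions and the abstain-aware vote described under "Weak supervision" above. The function bodies and the promotion rule (flag when flags outnumber OKs among non-abstaining voters) are illustrative, not the tool's aggregation.

```python
OK, FLAG, ABSTAIN = "ok", "flag", "abstain"

def lf_refusal(row: dict) -> str:
    out = row["output"].lower()
    return FLAG if out.startswith(("i can't", "i cannot", "i'm sorry")) else OK

def lf_very_short(row: dict) -> str:
    words = len(row["output"].split())
    if words == 0:
        return ABSTAIN  # empty rows are a different validator's problem
    return FLAG if words < 5 else OK

LABELING_FUNCTIONS = [lf_refusal, lf_very_short]

def vote(row: dict) -> dict:
    votes = [lf(row) for lf in LABELING_FUNCTIONS]
    decided = [v for v in votes if v != ABSTAIN]
    # 2 flags + 9 abstains is a strong signal; 2 flags + 9 OKs is not.
    flagged = bool(decided) and decided.count(FLAG) > decided.count(OK)
    return {"votes": votes, "flagged": flagged}

print(vote({"output": "I cannot assist."}))  # {'votes': ['flag', 'flag'], 'flagged': True}
```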
Model-aware fit and gap-filling
A dataset that trains TinyLlama beautifully will overflow Phi-mini and underfit Llama-70B. Score for the model you're actually targeting.
Per-model profiles
TinyLlama, Mistral-7B, Saul-7B, Qwen3-8B, Gemma-2-9B, Llama-3.1-8B — each with its real context window and chat-template expectations. Context-overflow penalty is calibrated per model, not a generic 4K cap.
Whole-word keyword match
Profile-specific keyword detectors use word-boundary regex — “math” no longer hits inside “mathematics” or “aftermath.” Removed a real false-positive class on multi-domain corpora.
Coverage gap detection
Reads the topic-cluster output and surfaces under-represented sub-topics. The first half of synthetic gap-filling: it tells you precisely where your dataset is thin before you generate anything.
Synthetic generation prompts (beta)
Builds Evol-Instruct-style prompts to fill detected gaps. The generation step itself is scaffolded; today you can hand the prompts to your own model. Full inline generation rolls out next.
Novelty vs baseline
Diff your dataset against a reference baseline to see which rows are genuinely new vs recycled. Useful when iterating on a corpus across versions.
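A sketch of the whole-word keyword match and the per-model context check. The two-entry profile table is an illustration; the real profiles also carry chat-template expectations.

```python
import re

PROFILES = {
    "TinyLlama": {"context_window": 2048},
    "Llama-3.1-8B": {"context_window": 131072},
}

def keyword_hit(keyword: str, text: str) -> bool:
    # Word-boundary match: "math" no longer fires inside "aftermath" or "mathematics".
    return re.search(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE) is not None

def overflows(token_count: int, profile: str) -> bool:
    return token_count > PROFILES[profile]["context_window"]

print(keyword_hit("math", "the aftermath of the flood"))  # False
print(keyword_hit("math", "a math olympiad problem"))     # True
print(overflows(4096, "TinyLlama"), overflows(4096, "Llama-3.1-8B"))  # True False
```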
Proven at 100,000 rows
Real benchmark numbers from tools/bench_cleaner.py. The same code path your upload goes through runs end-to-end at this scale on a single worker.
OASST1 — 100,000 rows
390 s end-to-end, 256 rows / sec, peak RSS 1.4 GB. Parse 0.76 s, normalize 2.49 s, validate 385 s, summary 1.64 s.
Military corpus — 100,000 rows
475 s end-to-end, 211 rows / sec, peak RSS 863 MB. OPSEC + jailbreak + judge-rubric all on, no skip-on-cap.
Single-process, no GPU
The 100k benchmark runs on one CPU worker. Add workers and the rate scales linearly. No tricky distributed-state setup.
Transparent by design.
No black boxes. The rubric, the retry protocol, the thresholds, and the literature we draw on are documented.
What you see in every scan
- Per-row axis breakdown. Every judge score shows A / B / C / D on a 1–5 scale plus a one-sentence rationale.
- Before / after diff for rewrites. Every rewritten row keeps metadata.original_output plus a list of rewrite_changes.
- Flag codes with row indices. 40+ validator codes. Every flag points back to the row that triggered it.
- Rewrite-same-model warning. If you re-score rewritten rows, they get tagged rewrite_same_model so the UI can warn about self-preference inflation (Panickssery et al. 2024).
- Silent-skip signals. ML-toxicity cap, judge-parse failures, and provider-unavailable all surface explicitly rather than passing silently.
Literature we cite in the code
- Zheng et al. 2023 — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Position bias in pairwise LLM judges; informs our swap-invariant polarity scoring.
- Madaan et al. 2023 — Self-Refine: Iterative Refinement with Self-Feedback. Specifies that vague retry feedback produces near-zero lift; informs our fact-diff retry design.
- Singhal et al. 2024 — A Long Way to Go: Investigating Length Correlations in RLHF. Length is one axis of reward-hack; informs length-gaming share detection.
- Rafailov et al. 2023 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Gradient structure implies near-identical pairs vanish; informs the embedding-margin detector.
- Panickssery et al. 2024 — LLM Evaluators Recognize and Favor Their Own Generations. Informs the rewrite-same-model warning.
Open your dataset in the Optimizer.
Drag your JSONL into the browser. No signup. Your file isn't persisted beyond your scan session.