Free · No signup · Runs in your browser

Clean your fine-tuning data.
Every defect that actually hurts training.

Schema-validate, deduplicate, score every row on four axes, rewrite bad responses without dropping a single fact, and audit your DPO pairs for label errors — all before your trainer sees them.

SFT + DPO + ORPO · JSONL · JSON · CSV · ShareGPT · Up to 50 MB
Why this page exists

Fine-tuning runs fail because of the data, not the trainer. Duplicates bias it. Bad rows teach it to be sycophantic. A single refusal in your chosen column teaches it to refuse. The model is the compression of your data — clean the data before you compress it.

A principle popularized by Andrej Karpathy. Every feature on this page implements a specific form of it.

Thirteen pillars. Every one is opinionated and defensible.

Each feature below exists because it stops a specific failure mode. We cite the literature so you can check our math.

01 /

Know your data

Schema validation catches the format defects that silently kill training jobs. We parse into a canonical row model so every downstream check sees the same shape.

Auto-detect format

JSONL, JSON, CSV, ShareGPT. We normalize to a CanonicalRow model so messages, instruction/output, and prompt/completion all route through the same validation path.

Real tokenizer

Token counts come from tiktoken cl100k_base, not a char/4 heuristic. Length-budget warnings are computed against your target profile's real context window.

Structural checks

Invalid roles, system-not-first, invisible unicode, BOM artifacts, ragged CSV columns, truncated responses, placeholder answers, empty sides. 40+ codes.

02 /

Kill duplicates — three layers

Duplicates bias training toward whatever repeats. We catch them at three levels: identical, near-identical, and semantic.

Exact

SHA-256 over the canonical lowercased text. O(n).
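The exact layer can be sketched in a few lines. This is an illustrative stand-in, not the tool's actual API; the function name and return shape are hypothetical.

```python
# Hypothetical sketch of the exact-dedup layer: hash the canonical
# lowercased text and keep only the first row per digest.
import hashlib

def exact_dedup(rows):
    seen, kept, dropped = set(), [], []
    for i, text in enumerate(rows):
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            dropped.append(i)        # exact duplicate of an earlier row
        else:
            seen.add(digest)
            kept.append(text)
    return kept, dropped

kept, dropped = exact_dedup(["Hello World", "hello world  ", "bye"])
# "hello world  " collapses onto "Hello World" after canonicalization
```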

Near-duplicate

MinHash + LSH (128 permutations, 3-word shingles), O(n) at 100k+ rows. Verified by RapidFuzz ratio on the LSH candidate pairs.
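For intuition, here is a minimal MinHash sketch built from scratch: 128 hash permutations over 3-word shingles, with signature agreement estimating Jaccard similarity. The production path uses a library-grade implementation with LSH banding on top, which this sketch omits for brevity.

```python
# Minimal MinHash (illustrative): seeded md5 stands in for a family of
# hash permutations; the fraction of agreeing signature slots estimates
# the Jaccard similarity of the two shingle sets.
import hashlib

NUM_PERM = 128

def shingles(text, k=3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(sh):
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh))
    return sig

def est_jaccard(a, b):
    return sum(x == y for x, y in zip(a, b)) / NUM_PERM

a = minhash(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash(shingles("the quick brown fox jumps over a lazy dog"))  # near-dup
c = minhash(shingles("completely different sentence about something else entirely"))
```

Changing one word moves the estimated Jaccard to roughly 0.4 here, while the unrelated sentence estimates near zero; LSH banding just makes finding such high-agreement pairs O(n) instead of O(n²).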

Semantic

Sentence-transformer cosine similarity catches paraphrase duplicates MinHash misses. Union-find clusters so you see which rows collapse together.
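The union-find step can be sketched as follows: flagged duplicate pairs are merged so transitively linked rows collapse into one cluster. Function names are illustrative, not the tool's actual API.

```python
# Union-find over flagged duplicate pairs. Rows linked through any
# chain of pairs end up in the same cluster.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def cluster(n, pairs):
    parent = list(range(n))
    for a, b in pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]

clusters = cluster(5, [(0, 1), (1, 2), (3, 4)])
# rows 0-1-2 collapse together even though (0, 2) was never flagged directly
```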

03 /

Score every row on four axes

A single quality number lies. Four explicit axes tell you why a row scored the way it did.

Factual Grounding

Density of verifiable claims — names, numbers, IDs, mechanisms, steps — relative to what the instruction asks for.

Instruction Fit

Does the response actually answer the instruction? A 1 on this axis (B) triggers a gated override that forces the aggregate score to 1: a row that answers the wrong question is fatal.

Coherence

Consistency, completeness, no truncation, no repetition or filler. Owns all filler-penalization so axis D doesn't double-count.

Training Signal

Penalizes only meta-commentary and AI-tells (“as an AI”, “Certainly!”). Domain jargon is not punished.
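The gate can be illustrated as below. The weights and the mapping from the four 1–5 axes onto an aggregate are assumptions for the sketch, not the published rubric; only the override behavior mirrors the description above.

```python
# Illustrative gated aggregate: axes A-D each score 1-5. A 1 on axis B
# (Instruction Fit) overrides everything else, because a response to
# the wrong question is worthless as training data. The equal-weight
# mean and 1-10 mapping here are assumptions, not the real weights.
def aggregate(a, b, c, d):
    if b == 1:                              # gated override
        return 1
    return round((a + b + c + d) / 4 * 2)   # map mean of 1-5 onto 1-10

assert aggregate(5, 1, 5, 5) == 1           # perfect prose, wrong question
assert aggregate(4, 4, 4, 4) == 8
```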

Calibration anchors

Seven calibration exemplars including mid-tier 6/10 and 7/10 anchors. Reduces judge-mean drift between runs.

Parse-retry

One retry on malformed JSON recovers the 1–3 % of rows that silently drop under concurrency.

04 /

Fix rows without inventing facts

Most rewrite tools drop facts and hallucinate new ones. Ours doesn't. When a rewrite fails the guard, the model gets an explicit diff of what it dropped — not a vague “try again.”

Hard-ID guard

ALLCAPS acronyms and tokens with digits (APT29, T1071.001, CVE-2024-1234, DD Form 1380) must survive verbatim. Zero tolerance.

Number, URL, quote preservation

Every number (with comma-normalization), every URL, every quoted string — exact.

Soft-noun survival

Proper-noun preservation at a tunable 60 % threshold. The stopword list excludes days, months, and demonyms so the guard doesn't over-reject on glue words.

Hallucination detector

The rewrite may not introduce any number or hard-ID that wasn't in the original. Violations are rejected.

Fact-diff retry

Up to two retries, each with an explicit list of which facts were dropped and which hallucinations to remove. Vague “you dropped facts” produces near-zero lift; specific diffs work. Cites Madaan et al. Self-Refine, 2023.

Keep-original fallback

If the rewrite would require inventing content, the original is kept and you get the reason why.
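The guard logic above can be sketched as a bidirectional diff over extracted hard IDs. The regex here is a simplified stand-in for the production patterns, and the function names are hypothetical.

```python
# Sketch of the hard-ID guard: extract ALLCAPS acronyms and tokens
# containing digits, then diff original vs rewrite in both directions.
# Anything in "dropped" feeds the fact-diff retry; anything in
# "hallucinated" rejects the rewrite outright.
import re

HARD_ID = re.compile(r"\b(?:[A-Z]{2,}[\w.-]*|\w*\d[\w.-]*)\b")

def hard_ids(text):
    return set(HARD_ID.findall(text))

def guard(original, rewrite):
    orig, new = hard_ids(original), hard_ids(rewrite)
    return {
        "dropped": sorted(orig - new),       # facts the rewrite lost
        "hallucinated": sorted(new - orig),  # IDs the rewrite invented
    }

diff = guard("APT29 used T1071.001 in 2023.",
             "The group used T1071.001 recently.")
# diff["dropped"] names APT29 and 2023; the retry prompt gets that list
```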

05 /

DPO & ORPO pairs get their own audit

Row-level scanning is blind to preference-pair defects. Swap-invariant polarity, embedding-margin detection, prompt-conflict finding, and multi-turn role preservation: every DPO failure mode we know about has a check.

Swap-invariant polarity

Score each pair twice with sides flipped, average, flag disagreement > 1 as low-confidence. Mitigates the position bias Zheng et al. 2023 documented.
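The mechanic looks like this sketch. The real system calls an LLM judge; here `judge` is a deterministic stub with a hypothetical +1 first-position bias baked in, to show why the swap-and-average matters.

```python
# Swap-invariant polarity sketch. The stub judge scores the answer in
# slot one on a 1-10 scale and favors slot one by a point, mimicking
# the position bias the real check mitigates.
def judge(prompt, first, second):
    quality = 9 if "Paris" in first else 3    # stub: "knows" the right answer
    return min(10, quality + 1)               # hypothetical slot-one bias

def polarity_score(prompt, chosen, rejected, threshold=1):
    fwd = judge(prompt, chosen, rejected)          # chosen in slot one
    rev = 10 - judge(prompt, rejected, chosen)     # chosen in slot two, inverted
    avg = (fwd + rev) / 2
    low_confidence = abs(fwd - rev) > threshold
    return avg, low_confidence

avg, low_conf = polarity_score("Capital of France?", "Paris.", "London.")
# the two passes disagree by 4 points, so the pair is flagged low-confidence
```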

Embedding-margin detector

Chosen and rejected with cosine similarity ≥ 0.92 produce a vanishing DPO gradient. Flagged as near_identical_semantic. Cites Rafailov et al. 2023.
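The margin check itself is just a cosine threshold over precomputed embeddings. The vectors below are toy stand-ins for sentence-transformer output.

```python
# Pairs whose chosen and rejected embeddings are nearly parallel carry
# almost no preference signal, so they are flagged rather than trained on.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def near_identical_semantic(chosen_emb, rejected_emb, threshold=0.92):
    return cosine(chosen_emb, rejected_emb) >= threshold
```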

Prompt-conflict detector

Same prompt with flipped (chosen, rejected) direction across rows is a gradient-conflict footgun. UltraFeedback-style detection.

Length-gaming share

When chosen ≥ 2.5× longer than rejected in > 50 % of pairs, the dataset teaches “longer = better” — a known reward-hack (Singhal et al. 2024).
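As a sketch, the share is just the fraction of pairs exceeding the length ratio; the function name is illustrative.

```python
# Length-gaming share: fraction of pairs where chosen is at least 2.5x
# the rejected length. Above 50%, the dataset as a whole is teaching
# "longer = better" rather than "this content is better".
def length_gaming_share(pairs, ratio=2.5):
    gamed = sum(1 for chosen, rejected in pairs
                if len(chosen) >= ratio * max(len(rejected), 1))
    return gamed / max(len(pairs), 1)

pairs = [("a" * 500, "b" * 100), ("a" * 120, "b" * 100), ("a" * 300, "b" * 100)]
share = length_gaming_share(pairs)   # 2 of 3 pairs exceed the 2.5x ratio
```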

Identity & duplicate pairs

Identity pairs (chosen == rejected) contribute zero loss. Duplicate pairs come in exact (SHA-1 over full text) and near-duplicate (MinHash over 5-gram shingles).

Multi-turn role preservation

TRL-format pairs keep [user] / [assistant] boundaries intact through dedup, polarity, and margin scoring. Turn structure is not collapsed.

06 /

See the structure of your data

Coverage holes, over-represented topics, and missing diversity are invisible in a row-by-row view.

Topic clusters

HDBSCAN over UMAP-reduced sentence-transformer embeddings. Variable-density clusters, outliers marked. Graceful K-means fallback.

Diverse subset selection

Farthest-point / facility-location sampling when you want the most diverse N rows out of a larger set.
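Farthest-point sampling is simple to state: greedily pick the row whose embedding is farthest from everything already selected. A pure-Python sketch over toy 2-D points standing in for sentence vectors:

```python
# Greedy farthest-point sampling. Each iteration adds the point with
# the largest distance to its nearest already-selected point, then
# updates those nearest distances.
import math

def farthest_point_sample(points, n):
    selected = [0]                          # seed with the first row
    dist = [math.dist(p, points[0]) for p in points]
    while len(selected) < n:
        nxt = max(range(len(points)), key=lambda i: dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return selected

pts = [(0, 0), (0.1, 0), (10, 10), (5, 5)]
# the near-duplicate of the seed, (0.1, 0), is picked last
```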

Teach a concept

Give positive and negative exemplars; every row gets scored against the concept you care about. Useful for finding gaps.

07 /

Safety & privacy pass

Honest labels. Regex is called regex. Heuristics are called heuristics. The ML pass is called ML.

PII detection

Ten regex types — email, phone, SSN, credit card, IBAN, API key, password pattern, address, DOB, public IP. Plus a capitalized-name heuristic flagged as “not NER-grade” so you know what it is.
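Two of the regex types, simplified for illustration; the production patterns are stricter (for instance, Luhn-checking card numbers), and the dictionary shape here is hypothetical.

```python
# Regex PII pass, honestly labeled as regex: each named pattern returns
# its matches so flags can point back at the exact offending strings.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text):
    return {name: pat.findall(text)
            for name, pat in PII_PATTERNS.items() if pat.search(text)}

hits = detect_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```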

ML toxicity

Detoxify (BERT, unbiased-small) with 0.80 / 0.55 thresholds. Skips above 200 rows with a visible warning — no silent passes.

Prompt-injection patterns

Critical-severity flag for rows that look like training-time injection.

08 /

Autofix — opt-in, not magical

Twenty-three autofix operations. Every one is explicit about what it changes. You approve the list.

Format repairs

Fix smart quotes, BOM, encoding. Trim whitespace. Normalize roles. Convert JSON arrays to JSONL. Convert ShareGPT to messages.

Row removal

Exact, near, and duplicate-output dedup. Empty-row, short-row, placeholder, refusal, unfinished-response removal.

Content strip

GPT slop openers and closers (“Certainly!”, “I hope this helps!”), repetition collapse, HTML stripping, markdown stripping, optional PII redaction.
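The slop-opener strip is the simplest of these to picture. The three openers below are representative; the production list is longer, and the function name is illustrative.

```python
# Strip boilerplate openers that teach the model sycophancy. Only the
# leading occurrence is removed; legitimate mid-response uses survive.
import re

SLOP_OPENERS = re.compile(r"^(Certainly!|Sure!|Great question!)\s*",
                          re.IGNORECASE)

def strip_slop_opener(response):
    return SLOP_OPENERS.sub("", response, count=1)

clean = strip_slop_opener("Certainly! The capital of France is Paris.")
```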

09 /

Export to what your trainer expects

Seven target profiles. Rows you flagged during review are filtered. Critical issues are dropped by default.

Chat formats

OpenAI Chat, Azure OpenAI Chat, HuggingFace Messages. Proper system/user/assistant role tagging.

Instruction formats

ModelBrew FT (instruction/output), Alpaca, prompt_completion.

Text format

HuggingFace Text for causal-LM / plain-text continuation training.

10 /

Production safety beyond PII

PII catches names. Production catches attacks, agentic shape errors, and operational-security leaks: detector families an SFT-only checker won't have.

Jailbreak detection — 8 categories

Prompt-injection, role-bypass, system-prompt extraction, encoding tricks, persona-jailbreak, security-researcher framing, instruction-hierarchy attack, multi-turn jailbreak. Critical-severity flag on any hit.

Tool-call validation

Validates OpenAI tool_calls and Anthropic tool_use shapes against your tool schema. missing_required_arg and wrong_arg_type are critical, unknown_arg is a warning. Built for shipping agentic fine-tunes.
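The validation shape can be sketched against a JSON-Schema-style tool spec. Severity mapping mirrors the description above; the function and issue-tuple format are illustrative, not the tool's actual API.

```python
# Validate a tool call's arguments against a schema fragment:
# missing required args and wrong types are critical, unknown args warn.
def validate_tool_call(args, schema):
    issues = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            issues.append(("critical", f"missing_required_arg:{name}"))
    for name, value in args.items():
        if name not in props:
            issues.append(("warning", f"unknown_arg:{name}"))
        elif props[name]["type"] == "integer" and not isinstance(value, int):
            issues.append(("critical", f"wrong_arg_type:{name}"))
    return issues

schema = {"required": ["city"],
          "properties": {"city": {"type": "string"},
                         "days": {"type": "integer"}}}
issues = validate_tool_call({"days": "three", "units": "C"}, schema)
```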

Military / OPSEC redaction

Six detector codes — classification markings (UNCLASSIFIED//FOUO style), MGRS grid coordinates, EDIPI numbers, DTG timestamps, lat/long coordinates, internal network references. Same severity weight as critical PII.

Industry-specific typed PII

Nine industry-specific identifier detectors layered on top of the standard PII pass — medical (MRN, DEA, ICD-10, NPI), financial (CUSIP, SWIFT/BIC, ABA routing), legal (bar number, Bates number). Categorized so your downstream policy can branch by industry.

Hardcoded-secret block

Eleven secret patterns including AWS access keys, GitHub PATs, Slack tokens, OpenAI sk- keys, URL-embedded credentials. A row leaking a credential is forced to score 1 / 10 regardless of surface quality.
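Three of the credential patterns, simplified for illustration. The key strings in the test are synthetic; any hit in the real pipeline forces the row's score to 1 regardless of surface quality.

```python
# Hardcoded-secret scan: named regexes for a few credential families.
# Patterns here are approximations of the real formats (AWS access keys
# start with AKIA + 16 chars; GitHub PATs with ghp_; OpenAI keys with sk-).
import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat":     re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "openai_key":     re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b"),
}

def leaks_secret(text):
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(text)]
```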


11 /

Label quality — two paths, not one

Single-judge label scoring is one signal. We give you three. Cross-check, disagree, vote — and tag the ones that don't agree.

Judge cross-check

The original LLM-judge label-error path: model is asked to flip the label and explain. Disagreement with the original label produces likely_mislabeled.

Judge-disagreement v2

Independent path: when the AI judge score and the heuristic quality_score disagree by ≥ 4 points, the row is flagged. Catches Cleanlab-style label errors without their certifier dependency.

Weak supervision — 11 labeling functions

Pattern-based labelers (refusal, sycophancy, off-topic, low-density, code, list-only, very-short, very-long, etc.) vote per row. Abstain-aware aggregation: a row with 2 flags + 9 abstains is treated very differently from 2 flags + 9 OKs.
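Abstain-aware aggregation is the point worth sketching: confidence is computed over non-abstaining voters only, so 2 flags out of 2 votes reads very differently from 2 flags out of 11. The encoding below is illustrative.

```python
# Votes per row: "flag", "ok", or None (abstain). Confidence is the
# flag share among labeling functions that actually voted.
def aggregate_votes(votes):
    voted = [v for v in votes if v is not None]
    if not voted:
        return 0.0                          # no function had an opinion
    return sum(v == "flag" for v in voted) / len(voted)

sparse = aggregate_votes(["flag", "flag"] + [None] * 9)   # 2 flags, 9 abstains
dense  = aggregate_votes(["flag", "flag"] + ["ok"] * 9)   # 2 flags, 9 OKs
# sparse reads as unanimous-among-voters; dense as a weak minority signal
```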

14-dimension rubric (opt-in)

Phase 4a research rubric — instruction adherence, factuality, coherence, helpfulness, conciseness, safety, plus eight more named axes. Enable via CLEANER_JUDGE_RUBRIC=14dim when the 4-axis is too coarse for your use case.

G-Eval scorer (opt-in)

Reference-grounded G-Eval pass when you upload gold answers. Logprob-weighted per-row faithfulness scoring (when the provider supports it). Aligns with the 2024 G-Eval prompt research.

Pre-flight cost estimator

Before a single Gemini or Anthropic call goes out, you see the projected token spend, latency band, and credit cost. No surprise bills on a 100,000-row scan.

12 /

Model-aware fit and gap-filling

A dataset that trains TinyLlama beautifully will overflow Phi-mini and underfit Llama-70B. Score for the model you're actually targeting.

Per-model profiles

TinyLlama, Mistral-7B, Saul-7B, Qwen3-8B, Gemma-2-9B, Llama-3.1-8B — each with its real context window and chat-template expectations. Context-overflow penalty is calibrated per model, not a generic 4K cap.

Whole-word keyword match

Profile-specific keyword detectors use word-boundary regex — “math” no longer hits inside “mathematics” or “aftermath.” Removed a real false-positive class on multi-domain corpora.
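The fix is one character class:

```python
# Word-boundary matching: "math" as a whole word no longer fires
# inside "mathematics" or "aftermath".
import re

naive = re.compile(r"math")
bounded = re.compile(r"\bmath\b")

assert naive.search("the aftermath of the storm")       # false positive
assert not bounded.search("the aftermath of the storm")  # fixed
assert bounded.search("a math word problem")             # still fires
```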

Coverage gap detection

Reads the topic-cluster output and surfaces under-represented sub-topics. The first half of synthetic gap-filling: it tells you precisely where your dataset is thin before you generate anything.

Synthetic generation prompts (beta)

Builds Evol-Instruct-style prompts to fill detected gaps. The generation step itself is scaffolded; today you can hand the prompts to your own model. Full inline generation rolls out next.

Novelty vs baseline

Diff your dataset against a reference baseline to see which rows are genuinely new vs recycled. Useful when iterating on a corpus across versions.

13 /

Proven at 100,000 rows

Real benchmark numbers from tools/bench_cleaner.py. The same code your upload runs through, end-to-end at this scale on a single worker.

OASST1 — 100,000 rows

390 s end-to-end, 256 rows / sec, peak RSS 1.4 GB. Parse 0.76 s, normalize 2.49 s, validate 385 s, summary 1.64 s.

Military corpus — 100,000 rows

475 s end-to-end, 211 rows / sec, peak RSS 863 MB. OPSEC + jailbreak + judge-rubric all on, no skip-on-cap.

Single-process, no GPU

The 100k benchmark runs on one CPU worker. Add workers and the rate scales linearly. No tricky distributed-state setup.

Transparent by design.

No black boxes. The rubric, the retry protocol, the thresholds, and the literature we draw on are documented.

What you see in every scan

  • Per-row axis breakdown. Every judge score shows A / B / C / D on a 1–5 scale plus a one-sentence rationale.
  • Before / after diff for rewrites. Every rewritten row keeps metadata.original_output plus a list of rewrite_changes.
  • Flag codes with row indices. 40+ validator codes. Every flag points back to the row that triggered it.
  • Rewrite-same-model warning. If you re-score rewritten rows, they get tagged rewrite_same_model so the UI can warn about self-preference inflation (Panickssery et al. 2024).
  • Silent-skip signals. ML-toxicity cap, judge-parse failures, and provider-unavailable all surface explicitly rather than passing silently.

Literature we cite in the code

  • Zheng et al. 2023, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Position bias in pairwise LLM judges; informs our swap-invariant polarity scoring.
  • Madaan et al. 2023, Self-Refine: Iterative Refinement with Self-Feedback. Shows that vague retry feedback produces near-zero lift; informs our fact-diff retry design.
  • Singhal et al. 2024, A Long Way to Go: Investigating Length Correlations in RLHF. Length is one axis of reward-hack; informs length-gaming share detection.
  • Rafailov et al. 2023, Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Gradient structure implies near-identical pairs vanish; informs the embedding-margin detector.
  • Panickssery et al. 2024, LLM Evaluators Recognize and Favor Their Own Generations. Informs the rewrite-same-model warning.

Open your dataset in the Optimizer.

Drag your JSONL into the browser. No signup. Your file isn't persisted beyond your scan session.