01 · Clean — Free, no signup
AI-Powered Dataset Cleaning for Fine-tuning
Clean your fine-tuning dataset before training: 60+ validator codes, AI-judge scoring with score-floor-gated rewrite, structural pair audits plus a judge-based polarity sample, tool-call validation, and jailbreak, military-OPSEC, and industry-specific PII detection, all in your browser.
How It Works
Three steps to a fine-tune-ready dataset.
Upload, scan, fix — in minutes, not days. Every flag points back to a row index so you can review or auto-fix in one click.
Step 1
📤
Upload your data
JSONL, CSV, or JSON — up to 50 MB. Conversations, instruction pairs, DPO/ORPO preferences.
Step 2
🔍
Scan with 90+ rules
Format, dedup, GPT-slop, refusals, mislabels, PII, jailbreaks. Optional AI judge + score-floor-gated rewrite.
Step 3
✅
Fix and export
One-click auto-fix, manual review, or download a clean JSONL. Ready to fine-tune anywhere.
What it catches
Six layers of validation. One pass.
Most cleaners do regex dedup and call it a day. ModelBrew runs row-level validators, pair-level structural audits, AI-judged content review, and adversarial-content detection — in a single scan.
🔎
60+ validator codes
Format, schema, length, dedup (exact + near + semantic), encoding, GPT-slop, refusals, repetition, mislabel detection. Every flag points back to a row index.
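To make the exact and near dedup layers concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not ModelBrew's implementation: the function names, the word-trigram shingling, and the 0.8 Jaccard threshold are all hypothetical choices, and semantic (embedding-based) dedup is left out.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, n: int = 3) -> set:
    """Word n-grams used for the Jaccard comparison (n=3 is an assumption)."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_duplicates(rows, near_threshold=0.8):
    """Return (row_index, kind, first_seen_index) flags for duplicate rows."""
    flags = []
    seen_hashes = {}   # digest -> index of first row with that digest
    kept = []          # (index, shingle set) for rows that were not flagged
    for i, text in enumerate(rows):
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            flags.append((i, "exact", seen_hashes[digest]))
            continue
        seen_hashes[digest] = i
        current = shingles(text)
        match = None
        # Pairwise comparison is O(n^2); fine for a sketch, not for 100k rows.
        for j, other in kept:
            if len(current & other) / len(current | other) >= near_threshold:
                match = j
                break
        if match is not None:
            flags.append((i, "near", match))
        else:
            kept.append((i, current))
    return flags


rows = [
    "Summarize the quarterly report in three short bullet points please",
    "summarize  the quarterly report in three short bullet points please",
    "Summarize the quarterly report in three short bullet points please now",
    "Explain gradient descent in one paragraph",
]
print(find_duplicates(rows))  # [(1, 'exact', 0), (2, 'near', 0)]
```

Each flag carries the row index of the duplicate and the index of the row it matched, which is what makes one-click review possible.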
⚖️
AI judge + rewrite
Four-axis judge with calibration exemplars; optional 14-dim and G-Eval rubrics. Rewriter preserves every number, URL, named entity, and acronym — verified by a fact-diff before the row ships.
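A fact-diff of this kind can be sketched as follows. This is a hypothetical illustration, not the product's rewriter check: the regexes and function names are assumptions, and named-entity comparison is omitted because it would normally need an NER model rather than a regex.

```python
import re

URL_RE = re.compile(r"https?://[^\s)>\]]+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def extract_facts(text: str) -> set:
    """Collect URLs, numbers, and all-caps acronyms found in the text."""
    urls = {u.rstrip(".,;:") for u in URL_RE.findall(text)}
    # Strip URLs first so digits or capitals inside them are not counted twice.
    remainder = URL_RE.sub(" ", text)
    numbers = set(NUMBER_RE.findall(remainder))
    acronyms = set(ACRONYM_RE.findall(remainder))
    return urls | numbers | acronyms


def fact_diff(original: str, rewrite: str) -> dict:
    """Report facts that the rewrite dropped or introduced."""
    before, after = extract_facts(original), extract_facts(rewrite)
    return {"missing": sorted(before - after), "added": sorted(after - before)}


original = "The FDA approved the 2.5 mg dose in 2019; see https://example.com/label."
faithful = "In 2019 the FDA cleared a 2.5 mg dose (https://example.com/label)."
distorted = "In 2019 the FDA cleared a 25 mg dose."

print(fact_diff(original, faithful))   # {'missing': [], 'added': []}
print(fact_diff(original, distorted))
```

A rewrite would ship only when both lists are empty; the distorted example fails because the dose changed from 2.5 to 25 and the URL disappeared.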
🔁
DPO / ORPO structural audit
Eight structural defect codes — identity pairs, near-duplicate chosen, both-refusals, both-too-short, extreme length bias, sycophantic chosen, refusal-as-chosen, missing prompt. The pair-level checks row-level scanning misses.
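Several of these pair-level checks can be approximated with simple rules. The sketch below assumes rows shaped as `prompt` / `chosen` / `rejected` dicts; the code names, thresholds, and refusal markers are hypothetical, and the near-duplicate and sycophancy checks are omitted because they need similarity scoring or a judge model.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def is_refusal(text: str) -> bool:
    """Crude heuristic: a refusal phrase near the start of the response."""
    head = text.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)


def audit_pair(row: dict, min_chars: int = 20, max_ratio: float = 5.0) -> list:
    """Return structural defect codes for one preference pair."""
    prompt = (row.get("prompt") or "").strip()
    chosen = (row.get("chosen") or "").strip()
    rejected = (row.get("rejected") or "").strip()
    codes = []
    if not prompt:
        codes.append("MISSING_PROMPT")
    if chosen == rejected:
        codes.append("IDENTITY_PAIR")
    if len(chosen) < min_chars and len(rejected) < min_chars:
        codes.append("BOTH_TOO_SHORT")
    shorter, longer = sorted((len(chosen), len(rejected)))
    if shorter > 0 and longer / shorter > max_ratio:
        codes.append("EXTREME_LENGTH_BIAS")
    chosen_refuses, rejected_refuses = is_refusal(chosen), is_refusal(rejected)
    if chosen_refuses and rejected_refuses:
        codes.append("BOTH_REFUSALS")
    elif chosen_refuses:
        codes.append("REFUSAL_AS_CHOSEN")
    return codes


inverted = {
    "prompt": "Explain HTTP caching.",
    "chosen": "I'm sorry, but I can't help with that request.",
    "rejected": "HTTP caching stores responses so repeat requests "
                "can be served without contacting the origin.",
}
print(audit_pair(inverted))  # ['REFUSAL_AS_CHOSEN']
```

None of these defects is visible when `chosen` and `rejected` are scanned as independent rows, which is why the audit has to run on the pair.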
🛠️
Tool-call validation
OpenAI tool_calls and Anthropic tool_use shape detection. Missing-required-arg and wrong-arg-type are critical; unknown-arg is a warning. Built for shipping agentic fine-tunes.
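The critical-versus-warning split can be illustrated with a small validator. This is a sketch under assumptions, not the product's code: it reads the OpenAI shape (`function.arguments` as a JSON string) and the Anthropic shape (`type: "tool_use"` with an `input` object), checks only top-level arguments against a JSON Schema style `properties` / `required` definition, and uses hypothetical function names.

```python
import json

JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def extract_args(call: dict) -> dict:
    """Return the arguments of an OpenAI- or Anthropic-style tool call."""
    if call.get("type") == "tool_use":             # Anthropic content block
        return call.get("input", {})
    return json.loads(call["function"]["arguments"])  # OpenAI tool_calls entry


def validate_call(call: dict, schema: dict) -> list:
    """Return (severity, code, argument_name) findings for one call."""
    args = extract_args(call)
    props = schema.get("properties", {})
    findings = []
    for required in schema.get("required", []):
        if required not in args:
            findings.append(("critical", "missing-required-arg", required))
    for key, value in args.items():
        if key not in props:
            findings.append(("warning", "unknown-arg", key))
            continue
        declared = props[key].get("type")
        expected = JSON_TYPES.get(declared)
        if expected is None:
            continue  # no type declared, or a type this sketch does not handle
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and declared in ("integer", "number")
        if bool_as_number or not isinstance(value, expected):
            findings.append(("critical", "wrong-arg-type", key))
    return findings


schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
    "required": ["city"],
}
openai_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_forecast",
                 "arguments": '{"days": "3", "units": "metric"}'},
}
anthropic_call = {"type": "tool_use", "id": "toolu_1", "name": "get_forecast",
                  "input": {"city": "Oslo", "days": 3}}

for finding in validate_call(openai_call, schema):
    print(finding)
```

The OpenAI-style example yields two critical findings (missing `city`, string-typed `days`) and one warning (`units`), while the Anthropic-style example passes cleanly.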
🛡️
Jailbreak · OPSEC · typed PII
Eight jailbreak categories, including prompt injection, role bypass, system extraction, and encoding attacks. Six military OPSEC codes (MGRS, EDIPI, classification markings, DTG, lat/long, network refs). Nine industry-specific PII detectors (medical: MRN, DEA, ICD-10, NPI; financial: CUSIP, SWIFT, ABA; legal: bar number, Bates) on top of the standard 10-type regex PII pass.
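Typed detectors differ from plain regex passes in that they can verify structure, not just shape. As one hedged example, ABA routing numbers carry a public checksum (weights 3, 7, 1 repeated across the nine digits, sum divisible by 10), so a detector can discard most random nine-digit strings. The function names below are hypothetical and this is not the product's detector.

```python
import re

# Exactly nine digits, not embedded in a longer digit run.
NINE_DIGITS = re.compile(r"(?<!\d)\d{9}(?!\d)")


def aba_checksum_ok(candidate: str) -> bool:
    """Apply the 3-7-1 weighted checksum used by ABA routing numbers."""
    d = [int(ch) for ch in candidate]
    total = (3 * (d[0] + d[3] + d[6])
             + 7 * (d[1] + d[4] + d[7])
             + (d[2] + d[5] + d[8]))
    return total % 10 == 0


def find_aba_numbers(text: str) -> list:
    """Return nine-digit substrings that pass the ABA checksum."""
    return [m.group() for m in NINE_DIGITS.finditer(text)
            if aba_checksum_ok(m.group())]


sample = "Wire to routing 021000021, ref 123456789."
print(find_aba_numbers(sample))  # ['021000021']
```

The checksum alone still admits false positives (roughly one in ten random nine-digit strings passes), so a production detector would likely add context cues such as nearby words like "routing".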
🚀
Proven at 100,000 rows
250 rows / sec on a single worker, peak RSS under 1.5 GB. Measured on end-to-end scans of two 100k-row corpora: OASST1 and a military corpus. Real benchmark, not a marketing number.
Use Cases
Built for teams shipping into regulated and high-stakes domains.
The Optimizer wasn't built for general internet text — it was built for the data that fine-tunes get cancelled over. Defense, healthcare, finance, legal, agentic tool use.
Healthcare
Clinical NLP
Detect MRN, DEA, ICD-10, NPI leakage. Flag sycophantic chosen pairs in clinical preference data. Catch hallucinated drug interactions before training.
Defense / Gov
OPSEC + Classification
MGRS coordinates, EDIPI numbers, classification markings, DTG, lat/long, network references. Six dedicated military OPSEC codes — beyond what a regex PII scrubber will ever catch.
Finance
Cross-Asset Datasets
CUSIP, SWIFT, ABA detection. Catch numeric distortions in rewrites — the rewriter never silently changes a price, ticker, or amount.
Legal
Multi-Practice Corpora
Bar number and Bates number detection. Identity-pair audit for DPO datasets. Jailbreak detection for AI-assisted drafting tools.
Agentic LLMs
Tool-Call Hygiene
Validate tool_calls shapes before they hit a $1k training run. Critical-vs-warning split keeps the noise floor low.
ML Teams
Pre-flight before fine-tuning
Plug the Optimizer into your CI before any training run. Score, rewrite, dedupe, then ship to ModelBrew (or anywhere) with confidence.