Slayer v3 — Data Pipeline Audit (2026-06-12)
Observations
- Verdict — SFT mix (
slayer-data/v3/train_v3.jsonl, 2,239 ex) is close to plan and passes the contamination checks that were actually run. But there are 3 critical gaps to close before training, and the CPT phase is at ~6% of pilot scale with no trainer. What’s good: - Composition: distill 45.6% + bielik 16.7% (= 62.3% distillation) / human_pl 15.4% / EN 22.3% ≈ plan targets. - 0 records with 8-gram overlap vsstyle/holdout_v1.jsonl; 0 KLEJ-pattern sources;FORBIDDEN_SOURCESfail-fast works. - 0 duplicate prompts; 0 EN/PL cross-language leakage; dash rate 1.8% (low). - Bielik funnel works as designed: 10,304 raw → 9,769 judged → 4,091 verified → 374 sampled. Judge (qwen3.5-122b, open) flaggedfakty: powaznein 50.6% of raw Bielik-11B outputs (hallucinated facts). Never widen this faucet without re-judging. - Provenance compliant where implemented: teacher deepseek-v4-pro (SFT distill), judge open Qwen3.5, Opus verdicts purged. ↗ - CRITICAL (fix before any v3 training run) — 1. Decon coverage holes (
bench/decon_audit.py:28-44): -knowledge/probe_v1.jsonlabsent from eval sources entirely. - LLMzSzŁ entry is the 200-item local sample (llmzszl_R_eval.jsonl), not the fullamu-cai/llmzszl-datasettest pull thatbench_llmzszl_likelihood.py:34actually evaluates on. -GEN_DEFAULTglob omitsslayer-data/external/*.jsonl(Bielik layer never re-audited by--all). -results/decon_audit.jsonis overwritten per run; no evidence the finaltrain_v3.jsonlwas ever audited. - Fix: add the missing sources, include external/, append-don’t-overwrite the report, and make a hit-free decon pass on the assembled mix a mandatory gate (non-zero exit blocks training). 2. 200-char atom cap drops 21% of test atoms from every inline substring guard.make_test_atoms.pywrites all lengths, but consumers filter20 <= len <= 200(build_v3_sources.py:37,build_v3_mix.py:83,gen_distill_pl.py:97): 3,747/17,707 atoms (the long PSC/belebele passages, the likeliest leaks) are silently excluded. Proof it matters: the 8-gram decon still found 1,274 hits inentigraph_zpe.jsonlafter the inline guard passed. Fix: drop the cap or replace atom-substring guards with the shingle index fromdecon_audit.py. 3. ~46% of the mix (main distill layer) was never judge-filtered. Plan requires teacher gen → open-judge filter (V3_DATA_PLAN.md:43);gen_distill_pl.pyhas no judge step andbuild_v3_mix.py:30consumes the decon-only.cleanfile. The Bielik 50.6%-hallucination result shows exactly why unjudged teacher output is risky. Fix: run averify_external_sft.py-style judge pass overdistill_pl.clean.jsonlbefore mixing. ↗ - MAJOR — 4. Unreproducible artifacts: no tracked script generates
external/bielik_distill_10k.jsonl(raw records carry no teacher metadata) norstyle_pl_sft_v3_openjudge_disjoint.jsonl(the holdout-disjoint step). Reconstruct or document both in DATA_LINEAGE.md. 5. Mix shares computed pre-filter (build_v3_mix.py:91-96): distill loses ~18% to dash/contam drops after targets are set → realized 45.6% vs 50% target. Also CLI default 0.50 contradicts the file’s own “~60%” docstring. Compute targets after filtering; validate shares sum to ≤1.0. 6. Silent degradation: missing layer file → “pomijam” + exit 0, can write an empty/skewed mix (build_v3_mix.py:53-55); failed HF download inmake_test_atoms.py:50-51still writes a partial atoms file with exit 0, silently weakening every downstream dedup guarantee. 7. Label honesty:layer=human_plincludes 124qwen_raw_teacher_rewriterecords (5.5% of mix, exact copies of the style-SFT file). True human PL (Aya+OASST) is 9.8%, not 15%. Relabel (e.g.style) or disclose in the model card. ↗ - CPT track (Phase 1) — not ready — - Volume: ~124M tokens total available (~20M clean EntiGraph synthetic + ~104M raw wiki/ZPE) vs 2-3B pilot target → ~4-6%. At observed gen throughput (~8k tok/s) the v2.2-era 455M synthetic target ≈ 16h of API time. Feasible, not done. - No CPT trainer / mix builder / replay plan exists.
sft_style_qlora.pyis chat-SFT with prompt masking, unusable for next-token CPT. Replay 20-40% general data is the plan’s own “most common failure mode” and nothing implements it. -probe_v1.excluded_hashes.txtenforced nowhere (knowledge_probe.py:9-11,127writes it, zero consumers). The knowledge probe will be contaminated by construction unless wired into the CPT builder. - Teacher drift: corpora are 99.7-100% deepseek-v4-flash, docs claim v4-pro (entigraph_augment.py:10). Per-rowgen_modellineage is intact; fix the docs/claims and decide if flash quality is acceptable for CPT bulk. - Probe golds are unverified flash output (no independent verifier, unlikepolknowledge.py); n≈36/stratum is noisy. -entigraph_hops.py:171-175resume reprocesses already-done paths (same seeded RNG) → duplicate docs, double spend. - Wiki dump pinned to 20231101 (harvest_wiki_sources.py:31), 2.5y stale; can’t serve the “współczesność” probe domain. - Nothing enforces.clean-only inputs downstream; training on non-.cleanEntiGraph files = contamination. ↗ - MINOR — - Tulu3 reservoir sampling biased (
build_v3_sources.py:158-163: wrong index base + non-uniform replacement). Seeded, so reproducible, just not uniform. -gen_distill_pl.py:129-130rewrites the whole output file every round; crash mid-write loses the run. -GEN_MODEL/VERIFY_JUDGEenv overrides have no provenance allowlist (Anthropic/OpenAI possible by accident). - Silentexcept: pass/continueingen_distill_pl.py:107,entigraph_augment.py:149-150,entigraph_hops.py:196-197; decon passes unparseable lines into.cleanunscanned (decon_audit.py:113-115). -distill_nerunderrepresented in mix (25 used vs 127 available); check if intentional. - One truncated tulu3 math response found; sweep tulu3 records for mid-sentence endings. -pipeline/daily.yamlmix paths stale (slayer-data/human_pl/,slayer-data/en/don’t exist); a daily run would silently train on 2 of 5 layers. -.cleansteps strip records, never strip dashes from kept records; train_v3’s low dash rate is sampling luck, not enforcement. ↗ - Recommended order of work — 1. Close decon holes + make the gate mandatory (CRITICAL 1, 2) — cheap, protects everything else. 2. Judge pass over
distill_pl(CRITICAL 3) — directly improves the 46% of the mix most at risk. 3. Fix mix-share math + fail-loud on missing layers (MAJOR 5, 6), rebuildtrain_v3.jsonl, run the mandatory decon gate, relabel the style layer (MAJOR 7). 4. Then SFT can train. For CPT: wire the probe exclusion list +.clean-only enforcement into a new CPT mix builder, scale EntiGraph ~20×, write the packing/next-token trainer with replay, before spending on H100s. ↗
Referenced by
- EntiGraph (mentions)
- QLoRA (mentions)
- ZPE (mentions)
- NER (mentions)
- CPT (mentions)
- LLMzSzŁ (mentions)
- EVAL (mentions)
- INCLUDE (mentions)
- belebele (mentions)
- training (mentions)
- SFT (mentions)
- PolKnowledge (mentions)
- OASST (mentions)
- DeepSeek (mentions)
- API (mentions)
- KLEJ (mentions)
- PSC (mentions)
- OpenAI (mentions)
Local graph
Slayer v3 — Data Pipeline Audit (2026-06-12)
- ← mentions EntiGraph
- ← mentions QLoRA
- ← mentions ZPE
- ← mentions NER
- ← mentions CPT
- ← mentions LLMzSzŁ
- ← mentions EVAL
- ← mentions INCLUDE
- ← mentions belebele
- ← mentions training
- ← mentions SFT
- ← mentions PolKnowledge
- ← mentions OASST
- ← mentions DeepSeek
- ← mentions API
- ← mentions KLEJ
- ← mentions PSC
- ← mentions OpenAI