Research collection · LLM systems

LLM Training Knowledge Base

A cleaned, public-facing synthesis of handwritten notes on LLM training. This page turns scattered raw ideas into a structured map covering post-training, evaluation, reasoning, instruction following, tool use, multilingual quality, long-context behavior, data curation, and serving-time reliability.

Topics: Post-training · Evaluation · Reasoning · Function calling · Data quality · Serving
Source basis: five handwritten note images. This page is a structured synthesis, not a literal transcript.

Overview

The source notes point toward a practical view of LLM development: strong product behavior is often shaped more by post-training and evaluation discipline than by raw pretraining scale alone. The important question is not just whether the model knows more, but whether it is consistently useful, controllable, benchmark-honest, tool-capable, and robust under real inference conditions.

This knowledge base organizes the notes around capability buckets so they can be reused as a training or post-training checklist rather than staying trapped as unstructured scratch paper.

Post-training priorities

The notes strongly suggest that post-training should be treated as a multi-objective optimization problem. A useful model needs reasoning quality, instruction following, response quality, tool correctness, and broad capability retention at the same time.

Primary goals

  • Improve reasoning on hard tasks
  • Preserve or improve instruction following
  • Support accurate function calling and tool use
  • Maintain strong conversational output quality
  • Avoid multilingual collapse

Common failure modes

  • Benchmark chasing that hurts real-world usability
  • RL gains that degrade style, formatting, or latency
  • Tool-call regressions after reasoning-focused tuning
  • Template/runtime mismatches at deployment time
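One lightweight guard against these failure modes is a per-bucket regression gate that compares evaluation scores before and after each post-training stage and flags any bucket that dropped beyond a tolerance. A minimal sketch; the bucket names, scores, and tolerance below are illustrative, not taken from the notes:

```python
# Sketch of a per-bucket regression gate for post-training runs.
# Bucket names, scores, and the max_drop tolerance are illustrative.

def regression_gate(before: dict, after: dict, max_drop: float = 0.02) -> list:
    """Return (bucket, old, new) tuples whose score dropped more than max_drop."""
    regressions = []
    for bucket, old_score in before.items():
        new_score = after.get(bucket, 0.0)
        if old_score - new_score > max_drop:
            regressions.append((bucket, old_score, new_score))
    return regressions

before = {"reasoning": 0.61, "tools": 0.88, "multilingual": 0.74, "chat": 0.79}
after  = {"reasoning": 0.70, "tools": 0.81, "multilingual": 0.73, "chat": 0.80}

# Reasoning improved, but tool use regressed past tolerance and is flagged.
print(regression_gate(before, after))  # [('tools', 0.88, 0.81)]
```

Running this kind of gate on every candidate checkpoint makes "RL gains that degrade something else" visible immediately instead of after deployment.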

Evaluation stack

Several benchmark names or fragments appear in the notes, including arena-style comparisons, hard knowledge or reasoning evaluations, math-heavy tasks, and pass@k-style evaluation for small sample sets. The clear takeaway is that no single eval should dominate the training narrative.

  • Arena-Hard or related pairwise preference evaluations for broad quality comparison
  • GPQA / hard reasoning-style benchmarks for difficult scientific or knowledge-heavy tasks
  • MATH or similar structured reasoning evaluations
  • Pass@k framing for stochastic or small-sample task settings
Best practice: split metrics by capability bucket instead of collapsing everything into one composite score.
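For the pass@k framing, the commonly used unbiased estimator (draw n samples per task, count c correct, report the probability that a budget of k samples contains at least one success) can be computed directly:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c correct, budget k.
    Equals 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than the budget: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 20 samples per task, 5 correct:
print(pass_at_k(20, 5, 1))   # 0.25
print(pass_at_k(20, 5, 10))  # ~0.98
```

Naively reporting c/n per task and averaging is biased for k > 1, which is exactly why the estimator matters on small sample sets.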

Reasoning and RL

The notes appear to connect reasoning gains with reinforcement-style post-training, verifier- or reward-guided optimization, and DeepSeek-R1-style distinctions between pure-RL "zero" training and a supervised cold start.

  • Cold-start supervised data can stabilize later RL
  • Pure RL-style reasoning gains may produce awkward or unstable behavior if left unchecked
  • Reward design matters: optimizing only final correctness can create brittle outputs
  • Reasoning quality should be measured alongside answer readability and utility
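To make the reward-design point concrete, a hypothetical shaped reward might combine final correctness with readability terms rather than optimizing correctness alone. Everything here (the weights, the repetition check, the length bonus) is an illustrative assumption, not the notes' actual reward design:

```python
import re

def composite_reward(answer: str, target: str) -> float:
    """Hypothetical shaped reward: final-answer correctness plus
    readability terms. All weights are illustrative assumptions."""
    lines = answer.strip().splitlines() or [""]
    reward = 1.0 if lines[-1].endswith(target) else 0.0
    # Penalize degenerate repetition, a known artifact of unchecked pure RL.
    if re.search(r"\b(\w+)(\s+\1){4,}\b", answer):
        reward -= 0.5
    # Small bonus for a bounded-length, readable response.
    if len(answer) < 2000:
        reward += 0.1
    return reward

print(composite_reward("Work through the steps.\nAnswer: 42", "42"))  # 1.1
print(composite_reward("yes yes yes yes yes", "no"))                  # -0.4
```

The point is not these particular terms but the shape: a reward that only checks the final answer leaves the optimizer free to produce brittle or unreadable traces.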

Instruction following and response quality

A strong model should not only solve hard tasks but also obey constraints, maintain structure, produce the expected format, and communicate cleanly. The notes seem to explicitly treat instruction following as a core target, which is the right framing for assistant-style systems.

  • Follow explicit constraints without drifting into over-refusal
  • Preserve formatting, schema, and style requirements
  • Improve naturalness of responses without sacrificing precision
  • Track response quality separately from benchmark accuracy
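Constraint adherence of this kind is cheap to check programmatically. A minimal sketch of a per-response checker; the constraint set (JSON validity, required keys, word limit) is an illustrative assumption:

```python
import json

def check_format_constraints(response: str, require_json: bool = False,
                             required_keys: tuple = (), max_words=None) -> dict:
    """Hypothetical per-response constraint checker; the constraint
    set is illustrative, not a fixed standard."""
    results = {}
    if require_json:
        try:
            obj = json.loads(response)
            results["valid_json"] = True
            results["has_keys"] = (isinstance(obj, dict)
                                   and all(k in obj for k in required_keys))
        except json.JSONDecodeError:
            results["valid_json"] = False
            results["has_keys"] = False
    if max_words is not None:
        results["within_length"] = len(response.split()) <= max_words
    return results

print(check_format_constraints('{"answer": "Paris"}', require_json=True,
                               required_keys=("answer",)))
# {'valid_json': True, 'has_keys': True}
```

Aggregating these checks over an eval set gives an instruction-following score that is tracked separately from task accuracy, matching the split above.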

Tool use and function calling

Tool use appears as an explicit category in the notes. That matters because function calling often breaks when teams optimize only for free-form reasoning or preference wins.

  • Train structured tool selection and argument formatting directly
  • Evaluate exact schema adherence, not just rough intent
  • Test multi-turn tool use under ambiguity and partial context
  • Protect tool reliability through later post-training stages
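Exact schema adherence can be scored with a strict validator that rejects wrong names, missing required arguments, wrong types, and invented arguments. A self-contained sketch; the tool spec and its format are hypothetical (real deployments would validate against the runtime's own schema format, e.g. JSON Schema):

```python
import json

# Illustrative tool spec; a real stack would use its own schema format.
TOOL_SPEC = {
    "name": "get_weather",
    "required": {"city": str},
    "optional": {"unit": str},
}

def validate_tool_call(raw_call: str, spec: dict = TOOL_SPEC) -> bool:
    """Exact-adherence check: correct name, required args present with
    correct types, and no hallucinated arguments."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or call.get("name") != spec["name"]:
        return False
    args = call.get("arguments", {})
    allowed = {**spec["required"], **spec["optional"]}
    if any(k not in allowed for k in args):
        return False  # invented argument name
    return all(k in args and isinstance(args[k], t)
               for k, t in spec["required"].items())

print(validate_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
# True
print(validate_tool_call('{"name": "get_weather", "arguments": {"location": "Oslo"}}'))
# False: "location" is not in the schema
```

Scoring on this binary signal, rather than "roughly the right intent", is what catches the silent tool-call regressions described above.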

Multilingual quality

Multilingual behavior is likely called out because post-training can easily overfit to English-centric assistant behavior. If multilingual quality matters, it needs explicit representation in both the data mixture and the eval suite.

  • Keep multilingual supervised examples in the post-training mixture
  • Evaluate across representative languages, not just translated prompts
  • Check whether preference signals are implicitly biased toward English outputs
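The last bullet can be audited directly by slicing pairwise preference results by response language and comparing win rates. A minimal sketch; the record format and data are illustrative:

```python
from collections import defaultdict

# Each record: (response_language, model_won) from pairwise preference evals.
# Field layout and values are illustrative.
preferences = [
    ("en", True), ("en", True), ("en", False),
    ("de", False), ("de", True),
    ("ja", False), ("ja", False),
]

def win_rate_by_language(records):
    """Per-language win rate; a large gap suggests the preference
    signal is implicitly rewarding English-style outputs."""
    wins, totals = defaultdict(int), defaultdict(int)
    for lang, won in records:
        totals[lang] += 1
        wins[lang] += int(won)
    return {lang: wins[lang] / totals[lang] for lang in totals}

print(win_rate_by_language(preferences))
# e.g. English clearly ahead of German and Japanese in this toy sample
```

A persistent per-language gap is a signal to rebalance the preference data or reward model, not just the supervised mixture.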

Long-context and recall

The notes also point to long-context performance, likely around retrieval fidelity, instruction persistence, and quality degradation in longer windows. Maximum context length alone is not a meaningful success metric.

  • Measure retrieval accuracy inside long contexts
  • Check instruction persistence after large distractor spans
  • Separate recall tasks from true synthesis across long documents
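Retrieval accuracy inside long contexts is typically probed with needle-in-a-haystack prompts: a known fact is planted at a controlled depth inside distractor text, and the model is scored on recovering it. A construction sketch; all names and the filler text are illustrative:

```python
import random

def build_needle_prompt(needle: str, filler_sentences: list,
                        total_sentences: int, depth: float, seed: int = 0) -> str:
    """Place a 'needle' fact at a relative depth inside distractor text;
    depth=0.0 is the start of the context, 1.0 the end."""
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(total_sentences)]
    position = int(depth * len(haystack))
    haystack.insert(position, needle)
    question = "What is the magic number mentioned in the text?"
    return " ".join(haystack) + "\n\n" + question

filler = ["The committee reviewed the quarterly report.",
          "Rainfall was slightly above the seasonal average."]
prompt = build_needle_prompt("The magic number is 7481.", filler,
                             total_sentences=200, depth=0.5)
# Score retrieval as: does "7481" appear in the model's answer?
```

Sweeping `depth` and `total_sentences` produces the familiar accuracy-by-position grid, which separates genuine long-range recall from lucky near-end placement.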

Data mixing, decontamination, and smaller sets

Some of the most operationally important notes concern the training mixture itself: decontamination, smaller but high-quality datasets, supervised warm starts, and preserving the right balance across capability areas.

  • Decontaminate aggressively for benchmark integrity
  • Track mixture weights across reasoning, tools, multilingual, and chat data
  • Use smaller curated datasets carefully and evaluate them with the right sampling logic
  • Prefer explicit dataset taxonomies over a single undifferentiated blob
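A common decontamination baseline is word-level n-gram overlap between training examples and benchmark items. A minimal sketch; n = 8 is a commonly used choice, and real pipelines add text normalization and fuzzier matching on top:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a lowercased text, for overlap checks."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_example: str, benchmark_items: list, n: int = 8) -> bool:
    """Flag a training example that shares any n-gram with a benchmark item.
    The choice of n is illustrative; production pipelines also normalize text."""
    train_grams = ngrams(train_example, n)
    return any(train_grams & ngrams(item, n) for item in benchmark_items)

bench = ["what is the capital of france and when was it founded exactly"]
dirty = "q: what is the capital of france and when was it founded exactly a: paris"
clean = "summarize the history of paris in two sentences"
print(is_contaminated(dirty, bench))  # True
print(is_contaminated(clean, bench))  # False
```

Running this over each candidate dataset before it enters the mixture is what keeps "aggressive decontamination" an operational step rather than an aspiration.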

Serving-time details

The notes mention chat templates and vLLM compatibility, which is exactly the kind of implementation detail that causes silent quality regressions. Training-time and inference-time formatting assumptions need to match.

  • Align training and serving chat templates
  • Validate tool formatting in the actual runtime stack
  • Document BOS/EOS and special-token assumptions
  • Test the model in the target inference engine before calling a run successful
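One way to enforce the template-alignment bullet is a parity check: render the same conversation with the training-time formatter and the serving-time template and require byte-identical output. A self-contained sketch with illustrative token strings and template; in practice the serving render would come from the actual runtime stack (e.g. the chat template configured in vLLM), not be duplicated in test code:

```python
# Template-parity sketch: both renderers below are illustrative stand-ins
# for the real training formatter and serving-side chat template.

BOS, EOS = "<s>", "</s>"

def render_training(messages):
    """Training-time formatter (illustrative)."""
    out = BOS
    for m in messages:
        out += f"<|{m['role']}|>\n{m['content']}{EOS}"
    return out

def render_serving(messages):
    """Serving-time render; in practice produced by the inference engine's
    configured chat template, duplicated here only for the sketch."""
    return BOS + "".join(f"<|{m['role']}|>\n{m['content']}{EOS}" for m in messages)

conversation = [{"role": "user", "content": "hi"},
                {"role": "assistant", "content": "hello"}]

assert render_training(conversation) == render_serving(conversation), \
    "chat template mismatch between training and serving"
print("templates match byte-for-byte")
```

Byte-level comparison matters because the usual culprits (a doubled BOS, a missing trailing newline, a whitespace difference around role tags) are invisible in casual inspection but shift every token position the model sees.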

Suggested roadmap

  1. Define capability buckets: reasoning, chat, tools, multilingual, long-context, safety.
  2. Build the evaluation suite first so every training run is interpretable.
  3. Create a supervised post-training seed with high-quality formatting and instruction-following behavior.
  4. Add reasoning optimization while monitoring regressions outside reasoning benchmarks.
  5. Decontaminate and rebalance the data mixture whenever gains look narrow or suspicious.
  6. Validate the model using the exact chat template and inference stack intended for deployment.

In short: don’t optimize for a single impressive score. Optimize for a reliable system.