Shekkizh Knowledge Base
Home · Contents
Research collection · LLM systems

LLM Training Knowledge Base

A cleaned, public-facing synthesis of handwritten notes on LLM training. This page turns scattered raw ideas into a structured map covering post-training, evaluation, reasoning, instruction following, tool use, multilingual quality, long-context behavior, data curation, and serving-time reliability.

Post-training Evaluation Reasoning Function calling Data quality Serving

Structured synthesis rather than a literal transcript of the source notes.

Contents

Overview

Overview

The source notes point toward a practical view of LLM development: strong product behavior is often shaped more by post-training and evaluation discipline than by raw pretraining scale alone.

The important question is not just whether the model knows more, but whether it is consistently useful, controllable, benchmark-honest, tool-capable, and robust under real inference conditions.

Post-training

Post-training priorities

Post-training should be treated as a multi-objective optimization problem. A useful model needs reasoning quality, instruction following, response quality, tool correctness, and broad capability retention at the same time.

Primary goals

  • Improve reasoning on hard tasks
  • Preserve or improve instruction following
  • Support accurate function calling and tool use
  • Maintain strong conversational output quality
  • Avoid multilingual collapse

Common failure modes

  • Benchmark chasing that hurts real-world usability
  • RL gains that degrade style or latency
  • Tool-call regressions after reasoning-focused tuning
  • Template/runtime mismatches at deployment time
Evaluation

Evaluation stack

No single eval should dominate the training story. Different benchmarks test different slices of model behavior.

  • Arena-style evaluations for broad preference-based quality comparison
  • Hard reasoning or knowledge evaluations such as GPQA-like tasks
  • Math-oriented evaluations for structured reasoning behavior
  • Pass@k framing when stochastic sampling matters
Best practice: track metrics by capability bucket instead of collapsing everything into one number.
Reasoning

Reasoning and RL

Reasoning gains are often connected to reinforcement-style post-training, verifier or reward-guided optimization, and the distinction between a pure RL route and a supervised cold start.

  • Cold-start supervised data can stabilize later RL.
  • Pure RL gains can produce unstable or awkward behavior if left unchecked.
  • Reward design matters as much as the optimizer.
  • Reasoning should be measured alongside readability and utility.
Instruction following

Instruction following and response quality

A strong model should solve hard tasks while still obeying constraints, preserving structure, and communicating clearly.

  • Follow explicit constraints without drifting into over-refusal.
  • Preserve formatting, schema, and style requirements.
  • Improve naturalness without sacrificing precision.
  • Track response quality separately from benchmark accuracy.
Tools

Tool use and function calling

Tool use should be evaluated explicitly, because free-form reasoning improvements do not automatically preserve structured calling behavior.

  • Train tool selection and argument formatting directly.
  • Evaluate exact schema adherence, not just intent.
  • Test multi-turn tool use under ambiguity.
  • Protect tool reliability through later post-training stages.
Multilingual

Multilingual quality

Multilingual performance often degrades if post-training becomes too English-centric. If multilingual quality matters, it needs explicit representation in both the training mixture and the eval suite.

Long context

Long-context and recall

Long-context quality should be measured through retrieval fidelity, instruction persistence, and degradation under distractors rather than through context length claims alone.

Data

Data mixing, decontamination, and smaller sets

  • Decontaminate aggressively for benchmark integrity.
  • Track mixture weights across reasoning, tools, multilingual, and chat data.
  • Use smaller curated datasets carefully and evaluate them with the right sampling logic.
  • Prefer explicit dataset taxonomies over a single undifferentiated blob.
Serving

Serving-time details

Chat templates and runtime compatibility matter more than teams often admit. Training-time and inference-time assumptions need to align.

  • Align training and serving chat templates.
  • Validate tool formatting in the actual runtime stack.
  • Document BOS/EOS and special-token assumptions.
  • Test in the target inference engine before calling a run successful.
Roadmap

Suggested roadmap

  1. Define capability buckets.
  2. Build the evaluation suite first.
  3. Create a supervised post-training seed.
  4. Add reasoning optimization while watching regressions elsewhere.
  5. Rebalance the mixture whenever gains look narrow or suspicious.
  6. Validate the exact chat template and runtime intended for deployment.

In short: do not optimize for one impressive score. Optimize for a reliable system.