Shekkizh Knowledge Base
NVIDIA LLM Systems

Nemotron Knowledge Base

A structured map of the NVIDIA Nemotron training ecosystem: what each repository does, how the pieces fit together, and how to translate the stack into a smaller, practical single-node 8-GPU learning setup.

Nemotron · Megatron-Bridge · Megatron-LM · NeMo Run · NeMo Curator · Single-node training

Written as a public-facing reference page for understanding the training stack.

01

Big picture

The Nemotron ecosystem is best understood as a layered LLM training stack. Different repositories handle different pieces of the problem: data preparation, pretraining, post-training, evaluation, orchestration, and low-level distributed systems behavior.

The useful mental model is to separate blueprint from execution. One repo shows how the system is structured, another makes a compact run practical, and another explains the machinery underneath.

02

What Nemotron is

Nemotron is NVIDIA’s open-model and open-recipe ecosystem for advanced language and multimodal systems. It exposes not only weights and model families, but also the surrounding workflow that turns raw datasets into trained and aligned models.

That makes it useful both as a reference system for how modern LLM training is staged and as a teaching scaffold for learning how real training pipelines are assembled.

03

Repository map

Nemotron

The top-level recipe blueprint.

Megatron-Bridge

The most practical pretraining entrypoint.

Megatron-LM

The lower-level distributed and data internals reference.

NeMo Run

Experiment orchestration and execution abstraction.

NeMo Curator

Data curation, filtering, and deduplication.

AutoModel

A parallel Hugging Face-native path for training workflows.

04

Visual stack diagram

The stack becomes clearer when drawn as a flow from data to training to alignment, with supporting systems around it.

  • NeMo Curator: data curation
  • Nemotron: recipe blueprint
  • Megatron-Bridge: training engine
  • NeMo Run: orchestration
  • Megatron-LM: parallelism + datasets

05

The role of the Nemotron repository

Nemotron is the recipe hub. It shows how training stages fit together, what artifacts move between them, and how NVIDIA turns LLM development into a staged process.

  • Stage 0: pretraining
  • Stage 1: supervised fine-tuning
  • Stage 2: RL or preference alignment
  • Stage 3: evaluation

It is the best place to learn the system’s shape, even if it is not the simplest place to run a first compact experiment.
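The staged structure can be pictured as a small data model in which each stage consumes artifacts and produces one for the next stage. This is only an illustrative sketch: the stage names come from the list above, while every artifact name (`base_checkpoint`, `sft_dataset`, and so on) is hypothetical and not taken from the Nemotron repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    index: int
    name: str
    consumes: tuple  # artifact names this stage reads
    produces: str    # artifact name this stage writes


# Stage names follow the list above; every artifact name is hypothetical.
PIPELINE = [
    Stage(0, "pretraining", ("tokenized_corpus",), "base_checkpoint"),
    Stage(1, "supervised fine-tuning", ("base_checkpoint", "sft_dataset"), "sft_checkpoint"),
    Stage(2, "RL or preference alignment", ("sft_checkpoint", "preference_data"), "aligned_checkpoint"),
    Stage(3, "evaluation", ("aligned_checkpoint", "benchmark_suite"), "eval_report"),
]

# Artifacts assumed to exist before stage 0 (datasets and benchmarks).
EXTERNAL_INPUTS = {"tokenized_corpus", "sft_dataset", "preference_data", "benchmark_suite"}


def missing_inputs(pipeline, external):
    """Return (stage name, artifact) pairs that nothing earlier provides."""
    available = set(external)
    missing = []
    for stage in pipeline:
        for artifact in stage.consumes:
            if artifact not in available:
                missing.append((stage.name, artifact))
        available.add(stage.produces)
    return missing


print(missing_inputs(PIPELINE, EXTERNAL_INPUTS))  # [] when every handoff is satisfied
```

Dropping stage 0 from the list makes the check report that supervised fine-tuning has no base checkpoint to start from, which is the kind of dependency the recipe hub makes visible.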

06

The role of Megatron-Bridge

Megatron-Bridge is the operational center for a smaller educational run. It provides the actual training loop, recipe configs, tokenizer integration, and the bridge between model definitions and Megatron execution.

For a single-node 8-GPU setup, this is the cleanest place to begin because it supports straightforward torchrun-style workflows and smaller recipe examples.

If the goal is to understand training rather than reproduce a frontier model exactly, Megatron-Bridge is the best first execution layer.
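As a rough sketch of what a torchrun-style single-node launch looks like, the helper below assembles the command line for 8 GPUs. The `--standalone` and `--nproc_per_node` options are standard torchrun flags; the script name `pretrain_small.py` is a placeholder and not an actual Megatron-Bridge entrypoint, so substitute the recipe script from the repository you are using.

```python
import shlex


def torchrun_command(script, script_args=(), gpus_per_node=8):
    """Build a single-node torchrun invocation as an argument list."""
    return [
        "torchrun",
        "--standalone",                       # single-node rendezvous, no external coordinator
        f"--nproc_per_node={gpus_per_node}",  # one worker process per GPU
        script,
        *script_args,
    ]


# "pretrain_small.py" is a hypothetical placeholder script name.
cmd = torchrun_command("pretrain_small.py")
print(shlex.join(cmd))
# torchrun --standalone --nproc_per_node=8 pretrain_small.py
```

Building the command as a list keeps it easy to pass to `subprocess.run` later without shell-quoting problems.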

07

The role of Megatron-LM

Megatron-LM and Megatron Core expose the lower-level mechanics: tensor parallelism, pipeline parallelism, indexed datasets, sample construction, and training-time distributed behavior.

It is the repository to read when the question becomes “what exactly is happening under the hood?” rather than “what should I launch first?”
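To make tensor parallelism concrete, here is a toy, pure-Python illustration of a column-parallel linear layer: the weight matrix is split by columns across simulated ranks, each rank multiplies by its own shard, and the partial outputs are concatenated. This is not Megatron Core's implementation. In a real run the shards live on different GPUs and the concatenation is a collective operation. All function names here are made up for the example.

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split a weight matrix into `parts` equal column shards, one per rank."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly across ranks"
    width = cols // parts
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(parts)]


def column_parallel_matmul(x, w, tp_size):
    """Each simulated rank multiplies by its shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, tp_size)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]              # activations: 2 tokens, hidden size 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # weights: 2 x 4

# The sharded computation matches the unsharded one.
assert column_parallel_matmul(x, w, tp_size=2) == matmul(x, w)
print(matmul(x, w))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```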

08

Run, Curator, and AutoModel

NeMo Run

Best used after the basic training loop is already understood. It helps with clean experiment orchestration and repeatability.

NeMo Curator

Essential for serious data preparation, but better treated as a later extension than a day-one dependency.
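Before adopting a full curation pipeline, the core idea of filtering and exact deduplication can be sketched in a few lines. The snippet below does not use NeMo Curator's API. It is a plain-Python stand-in, and the normalization rule and the `min_chars` threshold are arbitrary assumptions chosen for illustration.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def exact_dedup(documents, min_chars=20):
    """Drop very short documents, then keep the first copy of each distinct text."""
    seen = set()
    kept = []
    for doc in documents:
        norm = normalize(doc)
        if len(norm) < min_chars:
            continue  # length filter
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = [
    "Megatron splits large models across GPUs.",
    "megatron  splits large models across GPUs.",  # duplicate after normalization
    "too short",                                   # removed by the length filter
    "A clean corpus makes loss curves easier to read.",
]
print(len(exact_dedup(docs)))  # 2
```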

AutoModel

A strong parallel option for Hugging Face-centric workflows, but less central than Megatron-Bridge for a Nemotron-oriented learning path.

09

The nano3 recipe is the right conceptual reference

The Nemotron nano3 recipe is especially valuable because it reveals the full shape of a modern open-model pipeline while staying more approachable than the largest frontier recipes.

  • It clearly separates pretraining, SFT, and RL-style stages.
  • It documents both data-preparation and training entrypoints.
  • It includes smaller or testing-oriented variants that point toward reduced-complexity runs.

10

Why not start with full Nemotron Nano

Starting with the full Nano recipe means absorbing sparse MoE behavior, hybrid architectural ideas, and larger-scale assumptions all at once. That tends to obscure the fundamental training lessons.

For learning, it is better to start with a smaller dense model and reintroduce specialized complexity only after the basic pipeline is clear.

11

The best small-model path

The cleanest path is to use a small dense decoder-only model in Megatron-Bridge while borrowing the tokenizer and the conceptual staging from Nemotron.

  1. Study the Nemotron recipe structure.
  2. Implement the first run in Megatron-Bridge.
  3. Read Megatron-LM whenever the distributed mechanics become fuzzy.
  4. Return to the full Nemotron recipe only after the smaller run feels legible.

12

Model-size guidance

  • ~1B dense — the best first complete run
  • ~3B dense — the best second run if memory allows
  • ~8B dense — reasonable after the pipeline is stable

This sequence usually teaches more than trying to compress a sparse production recipe into a first experiment.
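A back-of-the-envelope estimate helps explain why the sizes are ordered this way. The sketch below assumes mixed-precision training with Adam at 16 bytes per parameter (bf16 weights and gradients, plus fp32 master weights and two fp32 optimizer moments). That breakdown is an assumption, not something stated by the recipes, and the estimate ignores activations, buffers, and framework overhead.

```python
# Assumed bytes per parameter for mixed-precision Adam training.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def training_state_gib(num_params):
    """Total weight, gradient, and optimizer state in GiB, before any sharding."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 2**30


for label, n in [("~1B", 1e9), ("~3B", 3e9), ("~8B", 8e9)]:
    print(f"{label}: about {training_state_gib(n):.0f} GiB of model and optimizer state")
```

Under these assumptions the totals come out to roughly 15, 45, and 119 GiB. How that state is divided across the 8 GPUs depends on the parallelism layout and optimizer sharding, which is why the larger sizes are better left until the pipeline is stable.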

13

Data and tokenizer plan

Reusing the Nemotron Nano tokenizer is a practical way to remove one variable from the early experiments. That keeps attention on the training loop itself.

For the first passes, data should be deliberately simple: a clean, modest public corpus rather than a full web-scale curation pipeline. That makes loss curves and batching behavior much easier to reason about.

14

Parallelism choices on 8 GPUs

  • TP=2, PP=1, DP=4: a strong default starting point
  • TP=4, PP=1, DP=2: useful when the model no longer fits comfortably at TP=2 and per-GPU memory pressure grows
  • Pipeline parallelism later: add PP > 1 only when memory demands require it

The point is clarity, not maximum complexity.
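The layouts above all satisfy the same constraint: the product of tensor, pipeline, and data parallel sizes must equal the number of GPUs. A minimal sketch of that arithmetic follows. The helper name is made up for this example, and the TP=2, PP=2 row is included only to show what a later pipeline-parallel layout would imply.

```python
def data_parallel_size(world_size, tp, pp):
    """DP is whatever remains after tensor and pipeline parallelism are chosen."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(f"TP*PP={model_parallel} does not divide {world_size} GPUs")
    return world_size // model_parallel


WORLD_SIZE = 8  # single node, 8 GPUs
for tp, pp in [(2, 1), (4, 1), (2, 2)]:
    dp = data_parallel_size(WORLD_SIZE, tp, pp)
    print(f"TP={tp}, PP={pp}, DP={dp}")
# TP=2, PP=1, DP=4
# TP=4, PP=1, DP=2
# TP=2, PP=2, DP=2
```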

15

Practical roadmap

  1. Read the Nemotron nano3 pretraining stage.
  2. Start from a Megatron-Bridge quickstart pretraining example.
  3. Run a smoke test on mock data.
  4. Swap in the Nemotron tokenizer and a small real corpus.
  5. Train a short run and inspect memory, throughput, and loss behavior.
  6. Change one variable at a time.
  7. Only then revisit Nemotron-specific sparse or hybrid features.
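The "change one variable at a time" step can be made mechanical by generating each experiment from a fixed baseline. The sketch below is one way to do that. The baseline values and the alternatives are hypothetical placeholders rather than recommended settings, and the key names are not tied to any particular config schema.

```python
# Hypothetical baseline; replace with the values from your first working run.
BASELINE = {"tp": 2, "pp": 1, "micro_batch_size": 1, "seq_length": 2048, "lr": 3e-4}

# Hypothetical alternatives to try, one key at a time.
ALTERNATIVES = {"tp": [4], "micro_batch_size": [2, 4], "seq_length": [4096]}


def one_at_a_time(baseline, alternatives):
    """Yield (changed key, config) pairs that differ from the baseline in exactly one key."""
    for key, values in alternatives.items():
        for value in values:
            if value != baseline[key]:
                yield key, {**baseline, key: value}


runs = list(one_at_a_time(BASELINE, ALTERNATIVES))
for key, config in runs:
    print(f"vary {key}: {config}")
```

Because every generated config differs from the baseline in a single key, any change in memory, throughput, or loss behavior can be attributed to that key.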

16

Final recommendation

The most effective way to learn the Nemotron ecosystem is to separate blueprint from execution. Use Nemotron to understand the full training story, use Megatron-Bridge to run the first real 8-GPU experiment, and use Megatron-LM to understand the machinery underneath.

That path turns a complex NVIDIA stack into something that is both readable and runnable.