This guide shows you how to run Group Relative Policy Optimization (GRPO)
training with torchforge using the
Qwen3 1.7B model.
Experimental status and tested version
- torchforge is currently experimental. Expect potential bugs, incomplete features, and API changes.
- These instructions were tested with torchforge commit
8bd8d5d3c793ca6e2930b471b8ada67ce2458784.
Prerequisites
- Access to a SUNK cluster with GPU nodes
- At least one H100 node for GRPO training (the batch script in this guide requests 8 GPUs)
- 10GB available disk space
- A GitHub access token
- A Weights & Biases API key
- Conda
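Before starting, it can help to confirm the disk space and Conda prerequisites from the login node. The following is a minimal sketch, not part of torchforge: `has_free_space` is a hypothetical helper, and checking `$HOME` is an assumption, because the guide does not say which filesystem needs the 10GB.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least MIN_GB free.
has_free_space() {
  local dir="$1" min_gb="$2" avail_kb
  # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
  # column 4 of the second line is the available space.
  avail_kb=$(df -Pk "$dir" 2>/dev/null | awk 'NR==2 {print $4}')
  [ -n "$avail_kb" ] && [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Assumption: the repository and environment will live under $HOME.
if has_free_space "$HOME" 10; then
  echo "Disk space: at least 10GB free in $HOME"
else
  echo "Disk space: less than 10GB free in $HOME"
fi

# The guide uses /opt/conda/bin/conda; also accept a conda already on PATH.
if [ -x /opt/conda/bin/conda ] || command -v conda >/dev/null 2>&1; then
  echo "Conda: found"
else
  echo "Conda: not found"
fi
```

If the disk check fails, point `has_free_space` at whichever directory you plan to clone into.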
Initialize GitHub and Weights & Biases credentials
On a Slurm login node, run the following commands:
-
Export your GitHub token:
export GH_TOKEN=your_github_token_here
echo "export GH_TOKEN=$GH_TOKEN" >> ~/.bashrc
-
Export your Weights & Biases API key:
Get your Weights & Biases API key from wandb.ai (User Settings → API keys).
export WANDB_API_KEY=your_wandb_api_key_here
echo "export WANDB_API_KEY=$WANDB_API_KEY" >> ~/.bashrc
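To confirm that both credentials are set in the current shell before you continue, a small check like the following can be used. This is a sketch: `missing_vars` is a hypothetical helper, not something torchforge or Slurm provides.

```shell
# Hypothetical helper: print the name of each listed variable that is unset or empty,
# one name per line.
missing_vars() {
  local name value
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    # eval is acceptable here because the names are fixed strings, not user input.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Join the missing names onto one line for the message.
missing=$(missing_vars GH_TOKEN WANDB_API_KEY | tr '\n' ' ')
if [ -n "$missing" ]; then
  echo "Missing credentials: $missing"
else
  echo "Both credentials are set."
fi
```

Run the check again in a new login session to confirm that the lines appended to `~/.bashrc` take effect.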
Install torchforge
To install torchforge, run the following commands:
-
Initialize conda:
/opt/conda/bin/conda init bash
source ~/.bashrc
-
Create a conda environment:
conda create -n torchforge --yes python=3.10 pip
You should see output similar to the following:
Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
To activate this environment, use
$ conda activate torchforge
To deactivate an active environment, use
$ conda deactivate
-
Activate torchforge:
conda activate torchforge
-
Clone the repository:
git clone https://github.com/meta-pytorch/torchforge.git
cd torchforge
git checkout 8bd8d5d3c793ca6e2930b471b8ada67ce2458784
-
Run the installation script:
The installation script takes 5-15 minutes to complete.
When it finishes, you should see output similar to the following.
Although the output suggests re-activating the conda environment, you do not need to:
[INFO] Installation completed successfully!
[INFO] Re-activate the conda environment to make the changes take effect:
[INFO] conda deactivate && conda activate torchforge
-
Verify the installation:
python -c "import forge; print('torchforge imported successfully')"
You should see output similar to the following:
torchforge imported successfully
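Because these instructions were tested against one specific commit, you may also want to confirm that your checkout matches it. The sketch below uses `git rev-parse HEAD`. `check_commit` is a hypothetical helper, and the `$HOME/torchforge` path is an assumption based on the `cd ~/torchforge` line in the batch script.

```shell
# Commit hash this guide was tested with.
EXPECTED_COMMIT=8bd8d5d3c793ca6e2930b471b8ada67ce2458784

# Hypothetical helper: compare the HEAD commit of a git checkout with an expected hash.
# Returns 0 on a match, 1 on a mismatch, 2 if the directory is not a git checkout.
check_commit() {
  local repo_dir="$1" expected="$2" actual
  actual=$(git -C "$repo_dir" rev-parse HEAD 2>/dev/null) || {
    echo "not a git checkout: $repo_dir"
    return 2
  }
  if [ "$actual" = "$expected" ]; then
    echo "match"
  else
    echo "mismatch: HEAD is $actual"
    return 1
  fi
}

# Assumption: the repository was cloned into your home directory.
check_commit "$HOME/torchforge" "$EXPECTED_COMMIT" || true
```

A mismatch is not necessarily an error, but behavior and configuration file contents may differ from what this guide describes.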
Run GRPO training
To run GRPO training, complete the following steps:
-
From the torchforge repository root, edit the training configuration to reduce the number of steps for testing:
sed -i 's/steps: 1000000/steps: 10/' apps/grpo/qwen3_1_7b.yaml
-
Create a Slurm batch script named torchforge-training.sbatch:
cat << 'EOF' > torchforge-training.sbatch
#!/usr/bin/env bash
#SBATCH --job-name=torchforge-grpo-training
#SBATCH --output=output_torchforge.log
#SBATCH --error=error_torchforge.log
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=96
#SBATCH --gpus=8
#SBATCH --time=01:00:00
#SBATCH --mem=0
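# Load conda into this non-interactive shell, then activate the environment.
eval "$(/opt/conda/bin/conda shell.bash hook)"
conda activate torchforge
# Assumes the repository was cloned into your home directory.
cd ~/torchforge
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml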
EOF
-
Submit the job:
sbatch torchforge-training.sbatch
-
Monitor logs:
tail -f output_torchforge.log
Once training is underway, you should see logs similar to the following:
Warning: setting HYPERACTOR_CODEC_MAX_FRAME_LENGTH since this needs to be set to enable large RPC calls via Monarch
INFO 10-21 03:23:23 [__init__.py:235] Automatically detected platform cuda.
Launcher not provided, remote allocations will not work.
Spawning actor DatasetActor
Spawning service Generator
Spawning actor RLTrainer
Spawning actor ReplayBuffer
Spawning actor ComputeAdvantages
Spawning service ReferenceModel
Spawning service RewardActor
... skipped for brevity ...
WandbBackend: Logged 75 metrics at global_step 4
=== [global_reduce] - METRICS STEP 4 ===
buffer/add/count_episodes_added: 16.0
buffer/evict/avg_policy_age: 1.0
buffer/evict/max_policy_age: 1.0
buffer/evict/sum_episodes_evicted: 16.0
buffer/sample/avg_data_utilization: 1.0
buffer/sample/count_sample_requests: 1.0
buffer_perf/sample/total_duration_avg_s: 0.001130327582359314
buffer_perf/sample/total_duration_max_s: 0.001130327582359314
dataset/sample/avg_sample_len: 411.0
dataset/sample/count_samples_generated: 2.0
generator/generate/avg_tokens_generated: 491.375
generator/generate/count_requests: 2.0
generator/generate/count_sequences_completed: 16.0
generator/generate/sum_tokens_generated: 7862.0
generator/update_weights/count_weight_updates: 1.0
generator_perf/_fetch_weights/total_duration_avg_s: 1.6851447527296841
generator_perf/_fetch_weights/total_duration_max_s: 1.6851447527296841
generator_perf/generate/generate/duration_avg_s: 3.1858496093750004
generator_perf/generate/generate/duration_max_s: 4.32051513671875
generator_perf/generate/process_inputs/duration_avg_s: 0.0009907840192317963
generator_perf/generate/process_inputs/duration_max_s: 0.0012066240310668946
generator_perf/generate/total_duration_avg_s: 3.18701276139915
generator_perf/generate/total_duration_max_s: 4.32185558475554
generator_perf/update_weights/avg_pending_requests: 0.0
generator_perf/update_weights/max_pending_requests: -inf
generator_perf/waiting_for_fetch_weights/total_duration_avg_s: 1.685305270832032
generator_perf/waiting_for_fetch_weights/total_duration_max_s: 1.685305270832032
generator_worker_perf/update_weights_from_shared_memory/total_duration_avg_s: 0.6395906782709062
generator_worker_perf/update_weights_from_shared_memory/total_duration_max_s: 0.6395906782709062
reference_perf/forward/avg_sequence_length: 1024.0
reference_perf/forward/compute_logprobs/duration_avg_s: 0.0001311039086431265
reference_perf/forward/compute_logprobs/duration_max_s: 0.0001311260275542736
reference_perf/forward/count_forward_passes: 2.0
reference_perf/forward/forward/duration_avg_s: 0.012596684508025646
reference_perf/forward/forward/duration_max_s: 0.012622143141925335
reference_perf/forward/garbage_collection/duration_avg_s: 0.00033104955218732357
reference_perf/forward/garbage_collection/duration_max_s: 0.00034371716901659966
reference_perf/forward/memory_delta_end_start_avg_gb: 2.31842041015625
reference_perf/forward/memory_peak_max_gb: 11.931992053985596
reference_perf/forward/to_device/duration_avg_s: 7.077353075146675e-05
reference_perf/forward/to_device/duration_max_s: 7.580406963825226e-05
reference_perf/forward/total_duration_avg_s: 0.01313200336880982
reference_perf/forward/total_duration_max_s: 0.013139839749783278
reward/evaluate_response/avg_MathReward_reward: 0.38125
reward/evaluate_response/avg_ThinkingReward_reward: 0.5
reward/evaluate_response/avg_total_reward: 0.440625
reward/evaluate_response/count_MathReward_calls: 16.0
reward/evaluate_response/count_ThinkingReward_calls: 16.0
reward/evaluate_response/std_MathReward_reward: 0.417161164899131
reward/evaluate_response/std_ThinkingReward_reward: 0.3872983346207417
reward/evaluate_response/sum_MathReward_reward: 6.1
reward/evaluate_response/sum_ThinkingReward_reward: 8.0
rl_trainer/avg_grpo_loss: 0.042646802961826324
rl_trainer/count_training_steps: 1.0
rl_trainer/learning_rate: 0.001
rl_trainer_perf/push_weights/flatten_state_dict/duration_avg_s: 0.0004946370609104633
rl_trainer_perf/push_weights/flatten_state_dict/duration_max_s: 0.0004946370609104633
rl_trainer_perf/push_weights/memory_delta_end_start_avg_gb: 0.0
rl_trainer_perf/push_weights/memory_peak_max_gb: 11.417872905731201
rl_trainer_perf/push_weights/to_hf/duration_avg_s: 0.0004782499745488167
rl_trainer_perf/push_weights/to_hf/duration_max_s: 0.0004782499745488167
rl_trainer_perf/push_weights/total_duration_avg_s: 3.146173640154302
rl_trainer_perf/push_weights/total_duration_max_s: 3.146173640154302
rl_trainer_perf/push_weights/ts_save/duration_avg_s: 3.1451978851109743
rl_trainer_perf/push_weights/ts_save/duration_max_s: 3.1451978851109743
rl_trainer_perf/step/forward_backward/duration_avg_s: 0.34571166802197695
rl_trainer_perf/step/forward_backward/duration_max_s: 0.34571166802197695
rl_trainer_perf/step/memory_delta_end_start_avg_gb: 0.00022172927856445312
rl_trainer_perf/step/memory_peak_max_gb: 31.65072727203369
rl_trainer_perf/step/optimizer_step/duration_avg_s: 0.009929596912115812
rl_trainer_perf/step/optimizer_step/duration_max_s: 0.009929596912115812
rl_trainer_perf/step/save_checkpoint/duration_avg_s: 0.10817016521468759
rl_trainer_perf/step/save_checkpoint/duration_max_s: 0.10817016521468759
rl_trainer_perf/step/total_duration_avg_s: 0.46381334587931633
rl_trainer_perf/step/total_duration_max_s: 0.46381334587931633
==============================
... skipped for brevity ...
[0] [0] [ReferenceModel-0/1] 2025-10-21 03:32:06 INFO [GC] Performing periodic GC collection took 0.00 seconds
Reached training limit (10 steps). Exiting continuous_training loop.
Shutting down...
Shutting down metric logger...
WandbBackend global_reduce: Finished run
Shutting down provisioner..
Shutting down 3 service(s) and 4 actor(s)...
Health loop stopped gracefully.
Error stopping proc_mesh for replica 0: _fetcher_procs
Health loop stopped gracefully.
Health loop stopped gracefully.
Shutdown completed successfully
View training metrics in your Weights & Biases dashboard at wandb.ai.
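To check a single metric from the terminal instead of the dashboard, you can pull it out of the log file with awk. The following is a sketch that assumes the log keeps the layout shown in the sample output above (a `METRICS STEP <n>` header followed by `name: value` lines). `extract_metric` is a hypothetical helper, not a torchforge command.

```shell
# Hypothetical helper: print "step value" pairs for one metric from a training log.
extract_metric() {
  local metric="$1" log_file="$2"
  awk -v metric="$metric" '
    # Remember the step number from headers such as
    # "=== [global_reduce] - METRICS STEP 4 ===".
    /METRICS STEP/ {
      for (i = 1; i < NF; i++) if ($i == "STEP") step = $(i + 1)
    }
    # Metric lines look like "name: value"; match the name exactly.
    $1 == (metric ":") { print step, $2 }
  ' "$log_file"
}

# Log file name taken from the --output directive in the batch script.
LOG_FILE=output_torchforge.log
if [ -f "$LOG_FILE" ]; then
  extract_metric reward/evaluate_response/avg_total_reward "$LOG_FILE"
fi
```

Each output line pairs a step number with the metric value at that step, which is enough to see whether the average reward is moving during a short test run.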
Additional resources