Run torchforge on SUNK
This guide shows you how to run Grouped Relative Policy Optimization (GRPO) training with torchforge using the Qwen3 1.7B model.
torchforge is currently experimental. Expect potential bugs, incomplete features, and API changes.

These instructions were tested with torchforge commit 8bd8d5d3c793ca6e2930b471b8ada67ce2458784.
Prerequisites
- Access to a SUNK cluster with GPU nodes
- Minimum one H100 node for GRPO training
- Minimum 10 GB of available disk space
- A GitHub access token
- A Weights & Biases API key
- Conda (this guide assumes an installation at /opt/conda)
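Before you begin, you can sanity-check the cluster from a Slurm login node. The following is a minimal sketch; partition and GPU resource (GRES) names vary by cluster, so treat the output as informational:

$ sinfo -o "%P %G %D"
$ df -h ~

The first command lists each partition with its generic resources (GPUs) and node count; the second shows the free space in your home directory.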
Initialize GitHub and Weights & Biases credentials
On a Slurm login node, run the following commands:
1. Export your GitHub token:

Example:
$ export GH_TOKEN=your_github_token_here
$ echo "export GH_TOKEN=$GH_TOKEN" >> ~/.bashrc

2. Export your Weights & Biases API key. Get your API key from wandb.ai (User Settings → API keys).

Example:
$ export WANDB_API_KEY=your_wandb_api_key_here
$ echo "export WANDB_API_KEY=$WANDB_API_KEY" >> ~/.bashrc
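To confirm the GitHub token works before you use it, you can query the GitHub API. This is an optional check and assumes outbound HTTPS access from the login node:

$ curl -s -H "Authorization: token $GH_TOKEN" https://api.github.com/user | grep '"login"'

A valid token returns a JSON response containing your GitHub username.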
Install torchforge
To install torchforge, run the following commands:
1. Initialize conda:

Example:
$ /opt/conda/bin/conda init bash
$ source ~/.bashrc

2. Create a conda environment:

Example:
$ conda create -n torchforge --yes python=3.10 pip

You should see output similar to the following:

Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

To activate this environment, use

    $ conda activate torchforge

To deactivate an active environment, use

    $ conda deactivate
3. Activate the torchforge environment:

Example:
$ conda activate torchforge

4. Clone the repository and check out the tested commit:

Example:
$ git clone https://github.com/meta-pytorch/torchforge.git
$ cd torchforge
$ git checkout 8bd8d5d3c793ca6e2930b471b8ada67ce2458784
5. Run the installation script:

Note that the installation script takes 5-15 minutes to complete.

Example:
$ ./scripts/install.sh

You should see output similar to the following. Despite the final message, you do not need to re-activate the conda environment:

[INFO] Installation completed successfully!
[INFO] Re-activate the conda environment to make the changes take effect:
[INFO] conda deactivate && conda activate torchforge
6. Verify the installation:

Example:
$ python -c "import forge; print('torchforge imported successfully')"

You should see output similar to the following:

torchforge imported successfully
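Optionally, you can also check that the PyTorch build pulled in by torchforge can see a GPU. This is a minimal sketch that assumes your cluster permits short interactive GPU jobs via srun:

$ srun --gpus=1 --time=00:05:00 python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

On a healthy GPU node, this prints True followed by the number of visible GPUs.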
Run GRPO training
To run GRPO training, complete the following steps:
1. Edit the training configuration to reduce steps for testing:

Example:
$ sed -i 's/steps: 1000000/steps: 10/' apps/grpo/qwen3_1_7b.yaml
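To confirm the edit took effect before submitting the job, you can grep the config (an optional check):

$ grep -n 'steps:' apps/grpo/qwen3_1_7b.yaml

The value changed by the sed command should now read steps: 10.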
2. Create a Slurm batch script named torchforge-training.sbatch:

Example:
$ cat << 'EOF' > torchforge-training.sbatch
#!/usr/bin/env bash
#SBATCH --job-name=torchforge-grpo-training
#SBATCH --output=output_torchforge.log
#SBATCH --error=error_torchforge.log
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=96
#SBATCH --gpus=8
#SBATCH --time=01:00:00
#SBATCH --mem=0

eval "$(/opt/conda/bin/conda shell.bash hook)"
conda activate torchforge

cd ~/torchforge
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
EOF
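The directives above request a full node: 8 GPUs, 96 CPU cores, and (via --mem=0) all of the node's memory. Depending on your cluster's Slurm configuration, you may also need to target a GPU partition explicitly; the partition name here is a placeholder, not a value from this guide:

#SBATCH --partition=<your-gpu-partition>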
3. Submit the job:

Example:
$ sbatch torchforge-training.sbatch
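While the job is pending or running, you can check its state with standard Slurm commands. Replace <job_id> with the ID that sbatch prints:

$ squeue -u $USER
$ sacct -j <job_id> --format=JobID,JobName,State,Elapsed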
4. Monitor the logs:

Example:
$ tail -f output_torchforge.log

Eventually, you'll see logs similar to the following:

Warning: setting HYPERACTOR_CODEC_MAX_FRAME_LENGTH since this needs to be set to enable large RPC calls via Monarch
INFO 10-21 03:23:23 [__init__.py:235] Automatically detected platform cuda.
Launcher not provided, remote allocations will not work.
Spawning actor DatasetActor
Spawning service Generator
Spawning actor RLTrainer
Spawning actor ReplayBuffer
Spawning actor ComputeAdvantages
Spawning service ReferenceModel
Spawning service RewardActor
... skipped for brevity ...
WandbBackend: Logged 75 metrics at global_step 4
=== [global_reduce] - METRICS STEP 4 ===
buffer/add/count_episodes_added: 16.0
buffer/evict/avg_policy_age: 1.0
buffer/evict/max_policy_age: 1.0
buffer/evict/sum_episodes_evicted: 16.0
buffer/sample/avg_data_utilization: 1.0
buffer/sample/count_sample_requests: 1.0
buffer_perf/sample/total_duration_avg_s: 0.001130327582359314
buffer_perf/sample/total_duration_max_s: 0.001130327582359314
dataset/sample/avg_sample_len: 411.0
dataset/sample/count_samples_generated: 2.0
generator/generate/avg_tokens_generated: 491.375
generator/generate/count_requests: 2.0
generator/generate/count_sequences_completed: 16.0
generator/generate/sum_tokens_generated: 7862.0
generator/update_weights/count_weight_updates: 1.0
generator_perf/_fetch_weights/total_duration_avg_s: 1.6851447527296841
generator_perf/_fetch_weights/total_duration_max_s: 1.6851447527296841
generator_perf/generate/generate/duration_avg_s: 3.1858496093750004
generator_perf/generate/generate/duration_max_s: 4.32051513671875
generator_perf/generate/process_inputs/duration_avg_s: 0.0009907840192317963
generator_perf/generate/process_inputs/duration_max_s: 0.0012066240310668946
generator_perf/generate/total_duration_avg_s: 3.18701276139915
generator_perf/generate/total_duration_max_s: 4.32185558475554
generator_perf/update_weights/avg_pending_requests: 0.0
generator_perf/update_weights/max_pending_requests: -inf
generator_perf/waiting_for_fetch_weights/total_duration_avg_s: 1.685305270832032
generator_perf/waiting_for_fetch_weights/total_duration_max_s: 1.685305270832032
generator_worker_perf/update_weights_from_shared_memory/total_duration_avg_s: 0.6395906782709062
generator_worker_perf/update_weights_from_shared_memory/total_duration_max_s: 0.6395906782709062
reference_perf/forward/avg_sequence_length: 1024.0
reference_perf/forward/compute_logprobs/duration_avg_s: 0.0001311039086431265
reference_perf/forward/compute_logprobs/duration_max_s: 0.0001311260275542736
reference_perf/forward/count_forward_passes: 2.0
reference_perf/forward/forward/duration_avg_s: 0.012596684508025646
reference_perf/forward/forward/duration_max_s: 0.012622143141925335
reference_perf/forward/garbage_collection/duration_avg_s: 0.00033104955218732357
reference_perf/forward/garbage_collection/duration_max_s: 0.00034371716901659966
reference_perf/forward/memory_delta_end_start_avg_gb: 2.31842041015625
reference_perf/forward/memory_peak_max_gb: 11.931992053985596
reference_perf/forward/to_device/duration_avg_s: 7.077353075146675e-05
reference_perf/forward/to_device/duration_max_s: 7.580406963825226e-05
reference_perf/forward/total_duration_avg_s: 0.01313200336880982
reference_perf/forward/total_duration_max_s: 0.013139839749783278
reward/evaluate_response/avg_MathReward_reward: 0.38125
reward/evaluate_response/avg_ThinkingReward_reward: 0.5
reward/evaluate_response/avg_total_reward: 0.440625
reward/evaluate_response/count_MathReward_calls: 16.0
reward/evaluate_response/count_ThinkingReward_calls: 16.0
reward/evaluate_response/std_MathReward_reward: 0.417161164899131
reward/evaluate_response/std_ThinkingReward_reward: 0.3872983346207417
reward/evaluate_response/sum_MathReward_reward: 6.1
reward/evaluate_response/sum_ThinkingReward_reward: 8.0
rl_trainer/avg_grpo_loss: 0.042646802961826324
rl_trainer/count_training_steps: 1.0
rl_trainer/learning_rate: 0.001
rl_trainer_perf/push_weights/flatten_state_dict/duration_avg_s: 0.0004946370609104633
rl_trainer_perf/push_weights/flatten_state_dict/duration_max_s: 0.0004946370609104633
rl_trainer_perf/push_weights/memory_delta_end_start_avg_gb: 0.0
rl_trainer_perf/push_weights/memory_peak_max_gb: 11.417872905731201
rl_trainer_perf/push_weights/to_hf/duration_avg_s: 0.0004782499745488167
rl_trainer_perf/push_weights/to_hf/duration_max_s: 0.0004782499745488167
rl_trainer_perf/push_weights/total_duration_avg_s: 3.146173640154302
rl_trainer_perf/push_weights/total_duration_max_s: 3.146173640154302
rl_trainer_perf/push_weights/ts_save/duration_avg_s: 3.1451978851109743
rl_trainer_perf/push_weights/ts_save/duration_max_s: 3.1451978851109743
rl_trainer_perf/step/forward_backward/duration_avg_s: 0.34571166802197695
rl_trainer_perf/step/forward_backward/duration_max_s: 0.34571166802197695
rl_trainer_perf/step/memory_delta_end_start_avg_gb: 0.00022172927856445312
rl_trainer_perf/step/memory_peak_max_gb: 31.65072727203369
rl_trainer_perf/step/optimizer_step/duration_avg_s: 0.009929596912115812
rl_trainer_perf/step/optimizer_step/duration_max_s: 0.009929596912115812
rl_trainer_perf/step/save_checkpoint/duration_avg_s: 0.10817016521468759
rl_trainer_perf/step/save_checkpoint/duration_max_s: 0.10817016521468759
rl_trainer_perf/step/total_duration_avg_s: 0.46381334587931633
rl_trainer_perf/step/total_duration_max_s: 0.46381334587931633
==============================
... skipped for brevity ...
[0] [0] [ReferenceModel-0/1] 2025-10-21 03:32:06 INFO [GC] Performing periodic GC collection took 0.00 seconds
Reached training limit (10 steps). Exiting continuous_training loop.
Shutting down...
Shutting down metric logger...
WandbBackend global_reduce: Finished run
Shutting down provisioner..
Shutting down 3 service(s) and 4 actor(s)...
Health loop stopped gracefully.
Error stopping proc_mesh for replica 0: _fetcher_procs
Health loop stopped gracefully.
Health loop stopped gracefully.
Shutdown completed successfully
View training metrics in your Weights & Biases dashboard at wandb.ai.