How to use CoreWeave sandboxes for RL training workflows where models execute tool calls in isolated environments.
Training code agents with RL requires executing tool calls (bash commands, file operations) in isolated environments. Untrusted model-generated code might modify the host filesystem, hit the network, or produce non-deterministic results. Sandboxes give you isolated, ephemeral environments where tool calls run without affecting the host or other rollouts.
In a training loop, the model generates actions (tool calls), the sandbox executes them, and observations flow back to the model. The sandbox persists across tool calls within an episode, so file changes and installed packages carry over between steps. Reward comes from the final sandbox state (e.g., tests passing) or trajectory quality.
The tagging and listing APIs make it practical to clean up and monitor training runs with thousands of sandboxes.
Prerequisites
Set your CWSANDBOX_API_KEY to a CoreWeave API Access Token:
export CWSANDBOX_API_KEY="your-access-token"
Install the Python SDK:
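The package name below assumes the SDK is published under its import name, cwsandbox:

# Assuming the SDK is published under its import name, cwsandbox:
uv pip install cwsandbox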
Core pattern
The basic setup: an agent loop runs on your training infrastructure, and tool calls execute in a sandbox.
import cwsandbox
from cwsandbox import Sandbox

def run_agent_episode(model, task: dict, sandbox: Sandbox) -> tuple[list, float]:
    """Run one agent episode, returning trajectory and reward."""
    messages = [{"role": "user", "content": task["prompt"]}]
    for step in range(task.get("max_steps", 10)):
        # Model generates next action
        response = model.generate(messages)
        messages.append({"role": "assistant", "content": response})
        tool_calls = parse_tool_calls(response)
        if not tool_calls:
            break
        # Execute tool calls in sandbox
        for tool in tool_calls:
            if tool.name == "bash":
                result = sandbox.exec(
                    ["bash", "-c", tool.command],
                    timeout_seconds=30.0,
                ).result()
                observation = f"exit={result.returncode}\n{result.stdout}{result.stderr}"
            elif tool.name == "read_file":
                content = sandbox.read_file(tool.path).result()
                observation = content.decode()
            elif tool.name == "write_file":
                sandbox.write_file(tool.path, tool.content.encode()).result()
                observation = "File written successfully"
            messages.append({"role": "tool", "name": tool.name, "content": observation})
    # Compute reward from final sandbox state
    test_result = sandbox.exec(task["test_command"]).result()
    reward = 1.0 if test_result.returncode == 0 else 0.0
    return messages, reward
The sandbox persists across tool calls within an episode, so file changes accumulate as the agent works.
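The loop above assumes a parse_tool_calls helper. A minimal sketch, assuming the model emits each tool call as a JSON object inside <tool> tags (the tag format and Tool fields are illustrative, not part of the cwsandbox API):

import json
import re
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    command: str = ""
    path: str = ""
    content: str = ""

def parse_tool_calls(response: str) -> list[Tool]:
    """Parse <tool>{"name": ..., ...}</tool> blocks from a model response."""
    calls = []
    for block in re.findall(r"<tool>(.*?)</tool>", response, re.DOTALL):
        try:
            calls.append(Tool(**json.loads(block)))
        except (json.JSONDecodeError, TypeError):
            continue  # Malformed tool calls yield no observation
    return calls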
Training step with parallel episodes
Process a batch of tasks with one sandbox per episode:
def training_step(model, batch: list[dict], session) -> list[float]:
    """Run agent episodes for a batch of tasks."""
    # Create and pre-start all sandboxes in parallel
    sandboxes = [session.sandbox() for _ in batch]
    refs = [sb.start() for sb in sandboxes]
    [r.result() for r in refs]  # Wait for all backends to accept
    trajectories = []
    rewards = []
    for task, sandbox in zip(batch, sandboxes):
        trajectory, reward = run_agent_episode(model, task, sandbox)
        trajectories.append(trajectory)
        rewards.append(reward)
        sandbox.stop()  # Non-blocking cleanup
    # trajectories and rewards go to policy update
    return rewards
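A minimal driver sketch for this step, assuming model and an iterable of task batches exist (the SandboxDefaults values are illustrative):

import cwsandbox
from cwsandbox import SandboxDefaults

defaults = SandboxDefaults(container_image="python:3.11", tags=("rl-training",))
with cwsandbox.Session(defaults=defaults) as session:
    for step, batch in enumerate(batches):
        rewards = training_step(model, batch, session)
        print(f"step {step}: mean reward = {sum(rewards) / len(rewards):.2f}")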
Tags let you filter and find sandboxes created by your training jobs. Include metadata that helps identify sandboxes when debugging or cleaning up:
import os
from cwsandbox import SandboxDefaults

def make_defaults(model_name: str) -> SandboxDefaults:
    return SandboxDefaults(
        container_image="python:3.11",
        tags=(
            f"wandb-run:{os.environ.get('WANDB_RUN_ID', 'local')}",
            f"slurm-job:{os.environ.get('SLURM_JOB_ID', 'interactive')}",
            f"model:{model_name}",
            "rl-training",
        ),
    )
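For example, to build a session from these defaults (the model name is illustrative):

import cwsandbox

session = cwsandbox.Session(defaults=make_defaults("qwen3-8b"))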
Useful metadata to include in tags:
| Tag pattern | Purpose |
|---|---|
| wandb-run:{id} | W&B run ID (from WANDB_RUN_ID env var) for filtering by training run |
| slurm-job:{id} | Slurm job ID (from SLURM_JOB_ID env var) for cluster job tracking |
| model:{name} | Model name or checkpoint for multi-model experiments |
| env:{name} | Environment (dev, staging, prod) for resource management |
Sandbox tags become Kubernetes pod labels, which the CoreWeave observability platform uses for filtering and dashboards.
Try it: reward_function.py
A minimal integration: compute code execution rewards with parallel sandbox execution.
What it does:
- Executes a set of toy code completions (arithmetic, string operations, syntax errors, runtime errors)
- Creates one sandbox per completion for isolation
- Computes binary rewards: 1.0 for successful execution, 0.0 for failure
- Shows progress as results arrive (faster executions complete first)
How it uses CoreWeave sandboxes:
The example uses cwsandbox.wait() to process results as they complete:
# Create sandboxes and execute all completions in parallel
processes = [
    session.sandbox().exec(
        ["python", "-c", code],
        timeout_seconds=EXECUTION_TIMEOUT_SECONDS,
    )
    for code in completions
]

# Collect results as they complete
pending = list(processes)
while pending:
    [process], pending = cwsandbox.wait(pending, num_returns=1)
    result = process.result()
    reward = 1.0 if result.returncode == 0 else 0.0
Run it:
uv run examples/rl_training/reward_function.py
No additional dependencies required. No GPU needed.
Expected output:
Results arrive as executions complete, so faster problems finish first:
RL Training Reward Function Example (job: cc92116b)
============================================================
Evaluating 5 completions...
Progress (results arrive as executions complete):
------------------------------------------------------------
[1/5] Problem 1 (string-ops): PASS
[2/5] Problem 3 (syntax-error): FAIL
[3/5] Problem 2 (delayed-error): FAIL
[4/5] Problem 0 (slow-sum): PASS
[5/5] Problem 4 (slow-list): PASS
------------------------------------------------------------
Final summary (original order):
------------------------------------------------------------
Problem 0 (slow-sum): reward=1.0 [PASS] OK
Problem 1 (string-ops): reward=1.0 [PASS] OK
Problem 2 (delayed-error): reward=0.0 [FAIL] OK
Problem 3 (syntax-error): reward=0.0 [FAIL] OK
Problem 4 (slow-list): reward=1.0 [PASS] OK
------------------------------------------------------------
Total reward: 3.0/5
Pass rate: 3/5 (60%)
TRL GRPOTrainer integration
TRL uses a reward function interface where completions map directly to rewards. The agent generates a completion, and the reward function executes it in a sandbox.
The standard pattern uses <answer> XML tags for code extraction (matching the format used in GRPO math examples with \boxed{}):
import cwsandbox
from cwsandbox import SandboxDefaults

session = cwsandbox.Session(defaults=SandboxDefaults(
    container_image="python:3.11",
    tags=("trl-grpo",),
))

def extract_xml_answer(text: str) -> str:
    if "<answer>" not in text:
        return ""
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def reward_fn(completions, **kwargs) -> list[float]:
    # Conversational datasets pass message dicts, standard datasets pass strings
    texts = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    codes = [extract_xml_answer(t) for t in texts]
    code_indices = [(i, code) for i, code in enumerate(codes) if code]
    processes = [
        (i, session.sandbox().exec(
            ["python", "-c", code],
            timeout_seconds=30.0,
        ))
        for i, code in code_indices
    ]
    rewards = [0.0] * len(codes)
    for i, process in processes:
        try:
            rewards[i] = 1.0 if process.result().returncode == 0 else 0.0
        except Exception:
            pass
    return rewards
This pattern works for training models to generate correct code in a single turn.
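Wiring reward_fn into TRL is then straightforward; a minimal sketch, assuming a train_dataset with a prompt column (the output directory is illustrative):

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_fn,  # Sandbox-backed reward function from above
    args=GRPOConfig(output_dir="grpo-sandbox"),
    train_dataset=train_dataset,
)
trainer.train()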
Try it: trl_grpo_integration.py
Uses CoreWeave sandboxes with TRL’s GRPOTrainer for code execution rewards.
What it does:
- Loads a small model (Qwen/Qwen2.5-0.5B-Instruct)
- Creates a toy dataset of simple coding problems
- Trains the model using GRPO with sandbox-based reward computation
- Runs 10 training steps to demonstrate the integration
How it uses CoreWeave sandboxes:
The reward function extracts code from <answer> tags (the standard GRPO pattern), creates sandboxes in parallel through a Session, executes each completion, and returns binary rewards:
def extract_xml_answer(text: str) -> str:
    """Extract answer from XML-style <answer> tags."""
    if "<answer>" not in text:
        return ""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def code_execution_reward(completions, **kwargs) -> list[float]:
    # Conversational datasets pass message dicts, standard datasets pass strings
    texts = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    codes = [extract_xml_answer(t) for t in texts]
    # Create sandboxes and execute non-empty code in parallel
    processes = [
        (i, session.sandbox().exec(
            ["python", "-c", code],
            timeout_seconds=EXECUTION_TIMEOUT_SECONDS,
        ))
        for i, code in enumerate(codes) if code
    ]
    # Collect rewards, defaulting to 0.0
    rewards = [0.0] * len(codes)
    for i, process in processes:
        try:
            result = process.result()
            rewards[i] = 1.0 if result.returncode == 0 else 0.0
        except Exception:
            rewards[i] = 0.0
    return rewards
The prompts use a system message instructing the model to format code with <answer> tags:
SYSTEM_PROMPT = """You solve coding problems by writing Python code.
Put your code inside <answer> tags like this: <answer>print("hello")</answer>
Only include the code, no explanations."""
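Each dataset row pairs this system message with a problem statement. A sketch of the conversational format TRL expects (the problem text is illustrative):

from datasets import Dataset

train_dataset = Dataset.from_list([
    {"prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Print the sum of 2 and 3."},
    ]},
])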
The Session tracks sandboxes and cleans them up when it closes.
Run it:
uv pip install trl==0.27.1 transformers==5.0.0 datasets==4.5.0 torch==2.10.0
uv run examples/rl_training/trl_grpo_integration.py
GPU is recommended for reasonable performance. Without one, training works but is slow.
Expected output:
TRL GRPO Integration Example (job: def67890)
============================================================
Loading model: Qwen/Qwen2.5-0.5B-Instruct
Creating toy dataset...
Dataset size: 5 problems
Setting up GRPOTrainer...
Starting training (10 steps)...
------------------------------------------------------------
[CWSandbox] Reward call 1: 2 sandboxes, 0/2 passed
[CWSandbox] Reward call 2: 1 sandboxes, 0/1 passed, 1 skipped (no code)
[CWSandbox] Reward call 3: 0 sandboxes, 0/0 passed, 2 skipped (no code)
[CWSandbox] Reward call 4: 2 sandboxes, 0/2 passed
...
[CWSandbox] Reward call 8: 2 sandboxes, 1/2 passed
[CWSandbox] Reward call 9: 2 sandboxes, 1/2 passed
[CWSandbox] Reward call 10: 1 sandboxes, 0/1 passed, 1 skipped (no code)
[training logs]
------------------------------------------------------------
Training completed successfully!
Understanding the output:
The number of sandboxes varies per step because we only create sandboxes when extract_xml_answer() finds extractable code in the model’s completion. When the model generates text without the expected <answer>...</answer> tags, that completion is skipped and receives a reward of 0.0.
- 2 sandboxes, 0/2 passed - Model generated 2 code blocks, both failed execution
- 1 sandboxes, 0/1 passed, 1 skipped (no code) - Model generated 1 code block (failed) and 1 text-only completion
- 0 sandboxes, 0/0 passed, 2 skipped (no code) - Model generated no extractable code in either completion
This is expected with a small, untrained model. As training progresses, you should see fewer skipped completions and more passes.
Error handling in agent episodes
Sandbox operations can fail (timeouts, missing files, sandbox termination). Return observations that help the agent understand what went wrong:
from cwsandbox import SandboxTimeoutError, SandboxFileError

def execute_tool(sandbox, tool) -> str:
    """Execute a tool call, returning an observation string."""
    try:
        if tool.name == "bash":
            result = sandbox.exec(
                ["bash", "-c", tool.command],
                timeout_seconds=30.0,
            ).result()
            return f"exit={result.returncode}\n{result.stdout}{result.stderr}"
        elif tool.name == "read_file":
            content = sandbox.read_file(tool.path).result()
            return content.decode()
        elif tool.name == "write_file":
            sandbox.write_file(tool.path, tool.content.encode()).result()
            return "File written successfully"
        return f"Error: unknown tool {tool.name}"
    except SandboxTimeoutError:
        return "Error: command timed out after 30 seconds"
    except SandboxFileError as e:
        return f"Error: {e}"
    except Exception as e:
        return f"Error: {type(e).__name__}: {e}"
For reward computation, catch exceptions and return a fallback reward instead of propagating to the training loop.
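For example, a defensive wrapper (the helper name is ours, not part of the SDK):

def safe_reward(process, fallback: float = 0.0) -> float:
    """Resolve an exec result to a reward without crashing the training loop."""
    try:
        result = process.result()
        return 1.0 if result.returncode == 0 else 0.0
    except Exception:
        return fallback  # Timeouts and transport failures become a 0.0 reward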
W&B metrics integration
When using W&B (Weights & Biases) for training, cwsandbox Sessions log sandbox usage metrics to your active wandb run automatically.
Auto-detection
If WANDB_API_KEY is set and a wandb run is active (wandb.run exists), metrics logging is enabled automatically:
import wandb
from cwsandbox import Session, SandboxDefaults

wandb.init(project="my-rl-training")

# Metrics logging enabled automatically
with Session(defaults) as session:
    for step in range(num_steps):
        sandbox = session.sandbox()
        # Exec results are automatically tracked - no manual calls needed
        result = sandbox.exec(["python", "-c", code]).result()
        # Log metrics at this training step for correlation
        session.log_metrics(step=step)
# Final metrics logged on session close
Explicit control
Control metrics reporting with the report_to parameter:
# Explicit opt-in (creates reporter; metrics logged when wandb.run exists)
session = Session(defaults, report_to=["wandb"])
# Disable reporting (even if wandb run exists)
session = Session(defaults, report_to=[])
# Auto-detect (default behavior)
session = Session(defaults, report_to=None)
Metrics
Execution metrics are tracked automatically when exec() completes:
| Metric | Description |
|---|---|
| cwsandbox/sandboxes_created | Total sandboxes created via session |
| cwsandbox/executions | Total exec() calls |
| cwsandbox/exec_completed_ok | Completed executions (returncode=0) |
| cwsandbox/exec_completed_nonzero | Completed executions (returncode!=0) |
| cwsandbox/exec_failures | Failed executions (timeouts, transport failures) |
| cwsandbox/exec_completion_rate | Fraction of exec() calls that completed with returncode=0 |
| cwsandbox/exec_failure_rate | Fraction of exec() calls that failed to complete |
| cwsandbox/startup_count | Number of sandbox startup times recorded |
| cwsandbox/avg_startup_seconds | Average sandbox startup time |
| cwsandbox/min_startup_seconds | Minimum sandbox startup time |
| cwsandbox/max_startup_seconds | Maximum sandbox startup time |
Tracking is automatic: just call exec() on any sandbox associated with a session. Call session.log_metrics(step=N) to log at specific training steps:
def training_step(session, model, batch, step: int) -> list[float]:
    rewards = []
    for task in batch:
        sandbox = session.sandbox()
        # Metrics tracked automatically on exec() completion
        result = sandbox.exec(["python", "-c", task["code"]]).result()
        reward = 1.0 if result.returncode == 0 else 0.0
        rewards.append(reward)
        sandbox.stop()
    # Log metrics at this training step for correlation
    session.log_metrics(step=step)
    return rewards
You can also access per-sandbox statistics via the exec_stats property:
sandbox = session.sandbox()
result = sandbox.exec(["echo", "hello"]).result()
print(sandbox.exec_stats) # {"exec_count": 1, "completed_ok": 1, "completed_nonzero": 0, "failures": 0}
Per-sandbox exec metrics
Sessions with W&B integration also track per-sandbox metrics:
| Metric | Description |
|---|---|
| cwsandbox/avg_execs_per_sandbox | Average exec() calls per sandbox (useful for “tool calls per rollout”) |
| cwsandbox/min_execs_per_sandbox | Minimum exec() calls in any sandbox |
| cwsandbox/max_execs_per_sandbox | Maximum exec() calls in any sandbox |
What these tell you about agent behavior:
- High avg_execs_per_sandbox may indicate verbose agents that make many tool calls per episode
- Large variance (max-min) may indicate inconsistent rollout behavior across episodes
- Trends over training steps show how agent behavior evolves as the policy improves
Example dashboard usage:
- Plot avg_execs_per_sandbox vs training step to see tool usage trends over training
- Alert if max_execs_per_sandbox exceeds a threshold (runaway agent making excessive tool calls)
- Compare min/max spread to detect episodes where agents get stuck in loops vs complete quickly
By default, log_metrics() resets the counters after logging. Set reset=False to keep accumulating:
session.log_metrics(step=step, reset=False) # Keep accumulating
Metrics are also logged automatically when the session closes, so you get final summary metrics even without explicit logging.
Monitoring and debugging
Counting active sandboxes
Monitor sandbox usage during training:
from cwsandbox import Sandbox, SandboxStatus

def count_active_sandboxes(run_id: str) -> dict:
    sandboxes = Sandbox.list(
        tags=[f"wandb-run:{run_id}"],
        status=[SandboxStatus.RUNNING, SandboxStatus.PENDING],
    ).result()
    return {
        "running": sum(1 for s in sandboxes if s.status == SandboxStatus.RUNNING),
        "pending": sum(1 for s in sandboxes if s.status == SandboxStatus.PENDING),
        "total": len(sandboxes),
    }
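The same listing API supports cleanup after a run; a sketch, assuming listed sandboxes expose the same stop() method as session-created ones:

def stop_leftover_sandboxes(run_id: str) -> int:
    """Stop any sandboxes a training run left behind."""
    leftovers = Sandbox.list(
        tags=[f"wandb-run:{run_id}"],
        status=[SandboxStatus.RUNNING, SandboxStatus.PENDING],
    ).result()
    for sb in leftovers:
        sb.stop()  # Non-blocking, as in the training loop above
    return len(leftovers)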
Logging execution details
Capture execution details for debugging reward computation:
import logging

logger = logging.getLogger(__name__)

# Assumes `session` is created at module level
def logged_reward(completion: str, step: int) -> float:
    code = extract_xml_answer(completion)  # Extract code from <answer> tags
    sandbox = session.sandbox()
    sandbox.wait()
    result = sandbox.exec(
        ["python", "-c", code],
        timeout_seconds=30.0,
    ).result()
    logger.debug(
        "Reward computation",
        extra={
            "step": step,
            "sandbox_id": sandbox.sandbox_id,
            "returncode": result.returncode,
            "stdout_len": len(result.stdout),
            "stderr_len": len(result.stderr),
        },
    )
    sandbox.stop()
    return 1.0 if result.returncode == 0 else 0.0
Multi-step rollouts with ART
The TRL example above uses sandboxes for single-shot execution: one sandbox per completion, execute once, return a reward. This works for training models to generate correct code in one attempt.
Stateful multi-step rollouts are different: the agent takes multiple actions within a single sandbox, and the sandbox maintains state between actions. The agent can write a file, run it, see the error, edit the file, and try again - all within the same sandbox.
The examples/rl_training/art/ directory demonstrates this pattern on the MBPP benchmark. When a solution fails, the agent receives error feedback and can iterate on its approach.
Overview
ART (Agent Reinforcement Trainer) is an open-source RL framework by OpenPipe for training multi-step agents using GRPO. This example integrates CoreWeave sandboxes with ART:
- Uses the art package (openpipe-art) for trajectory collection and training
- Supports two backends: LocalBackend (requires GPU) or TinkerBackend (no GPU)
- Executes code via tool calling in a CoreWeave sandbox
- Computes binary rewards based on MBPP test case results
Training approach: GRPO with distillation
This example uses distillation with reinforcement learning: a stronger model generates demonstrations, and a smaller model learns to replicate the successful ones.
Two models are involved:
- Inference model (--model, default: gpt-5.1-codex-mini): Generates trajectories during rollouts. This model makes tool calls, sees sandbox results, iterates on errors, and submits solutions. It does not get trained.
- Base model (--base-model, default: Qwen/Qwen3-8B): The model being trained. It receives the trajectories generated by the inference model and learns from them via GRPO (Group Relative Policy Optimization).
How it works:
- The inference model generates multiple trajectories per problem, each with tool calls executed in a CoreWeave sandbox
- Each trajectory receives a binary reward: 1.0 if tests pass, 0.0 otherwise
- Trajectories for the same problem form a group - GRPO compares trajectories within each group
- The base model (Qwen3-8B) is trained to prefer higher-reward trajectories over lower-reward ones
After training, you deploy Qwen3-8B with the same tool definitions. It will have learned to make similar tool calls by imitating the successful trajectories from the inference model.
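At inference time that is an ordinary OpenAI-compatible chat call against your deployment; a sketch, assuming your serving endpoint's URL and a problem_prompt string:

from openai import OpenAI

client = OpenAI(base_url="http://your-deployment:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": problem_prompt}],
    tools=ROLLOUT_TOOLS,  # Same tool definitions used during training
)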
Prerequisites
| Mode | Requirements |
|---|---|
| Dry run | CPU only |
| TinkerBackend | CPU only (training via API) |
| LocalBackend | GPU required |
Environment variables:
export CWSANDBOX_API_KEY="your-access-token" # CoreWeave API Access Token
export OPENAI_API_KEY="your-openai-key"
export ART_TINKER_API_KEY="your-tinker-key" # required for --backend=tinker
export WANDB_API_KEY="your-wandb-key" # optional, for logging
Installation
uv pip install -r examples/rl_training/art/requirements.txt
This installs:
openpipe-art==0.5.7 # ART framework
openai==2.15.0 # LLM inference
datasets==4.5.0 # MBPP loading
wandb==0.24.0 # Optional logging
For LocalBackend with GPU support, also install:
uv pip install "openpipe-art[backend]==0.5.7"
Running the example
# Dry run - validate setup without training
uv run examples/rl_training/art/train.py --dry-run
# Train with TinkerBackend (no local GPU required)
uv run examples/rl_training/art/train.py --backend tinker --num-problems 10
# Train with LocalBackend (requires GPU)
uv run examples/rl_training/art/train.py --backend local --num-problems 10
Expected output:
ART Training with CWSandbox
========================================
Backend: tinker
Model: gpt-5.1-codex-mini
Base model: Qwen/Qwen3-8B
Problems: 10
Steps: 5
Trajectories per problem: 2
Project: cwsandbox-mbpp
Run name: train-001
Loading MBPP problems...
Loaded 10 problems
Creating tinker backend...
Creating trainable model...
Registering model with backend...
Starting training...
=== Step 1 ===
Collecting trajectories for 10 problems...
step 1: 100%|██████████| 10/10 [00:45<00:00]
Collected 20 trajectories, avg reward: 0.35
Training...
Training complete: step=1, metrics={'loss': 0.42}
...
Configuration options
| Flag | Default | Description |
|---|---|---|
| --backend | local | Training backend: local (GPU) or tinker (no GPU) |
| --model | gpt-5.1-codex-mini | Model for inference |
| --base-model | Qwen/Qwen3-8B | Base model for training |
| --num-problems | 10 | Number of MBPP problems |
| --num-steps | 5 | Training steps |
| --trajectories-per-problem | 2 | Trajectories collected per problem per step |
| --base-url | None | OpenAI-compatible API base URL |
| --project | cwsandbox-mbpp | W&B project name |
| --run-name | train-001 | Training run name |
| --learning-rate | 1e-5 | Learning rate |
| --dry-run | false | Validate setup without training |
Architecture
art/
├── train.py # Training loop, CLI, and ART backend setup
├── rollout.py # Multi-step sandbox execution, builds Trajectory
├── tools.py # Tool schemas for execute_code and submit_solution
└── __init__.py
Key ART imports:
import art
from art.local import LocalBackend
from art.tinker import TinkerBackend
# Create trainable model
model = art.TrainableModel(
    name="train-001",
    project="cwsandbox-mbpp",
    base_model="Qwen/Qwen3-8B",
)

# Collect trajectories
groups = await art.gather_trajectory_groups(
    (collect_trajectories(problem) for problem in problems),
    pbar_desc="collecting",
)
# Train
result = await backend.train(model, groups, learning_rate=1e-5)
Rollout returns art.Trajectory:
from openai.types.chat import ChatCompletionToolParam
trajectory = art.Trajectory(
    messages_and_choices=messages_and_choices,  # Conversation history
    tools=ROLLOUT_TOOLS,  # Tool definitions
    reward=1.0 if passed else 0.0,  # Binary reward
    metadata={"task_id": problem.task_id},
)
return trajectory.finish()
Tool-calling pattern:
The rollout uses OpenAI-compatible tool calling with two tools:
- execute_code: Test code in sandbox, returns stdout/stderr
- submit_solution: Final submission, runs all test cases
ROLLOUT_TOOLS: list[ChatCompletionToolParam] = [
    {
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code in an isolated sandbox...",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
    # submit_solution tool...
]
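A sketch of how a rollout might route these tool calls into the sandbox (the function and variable names here are illustrative, not the example's actual code):

import json

def dispatch_tool_call(sandbox, tool_call) -> str:
    """Execute an OpenAI-style tool call in the sandbox, returning the observation."""
    args = json.loads(tool_call.function.arguments)
    result = sandbox.exec(
        ["python", "-c", args["code"]],
        timeout_seconds=30.0,
    ).result()
    return f"exit={result.returncode}\n{result.stdout}{result.stderr}"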
Sandboxes are tagged for tracking:
sandbox_defaults = SandboxDefaults(
    container_image="python:3.11",
    tags=("art-training", args.project, args.run_name),
)
Data flow
The training pipeline has several components:
- Local machine: The training script (train.py) runs on your machine or training server. It loads problems from the MBPP dataset and orchestrates the training loop.
- OpenAI API (inference): During trajectory collection, the rollout code calls the OpenAI API (or compatible endpoint) to generate model responses. The model receives tool definitions and returns tool calls that the rollout executes.
- CoreWeave sandbox (code execution): Each rollout uses a single CoreWeave sandbox that persists across all tool calls. When the model calls execute_code or submit_solution, the code runs in that sandbox. This means file changes and state accumulate as the agent iterates - it can write a file, run it, see an error, and fix it. The sandbox provides isolation so untrusted model-generated code cannot affect the host. Results (stdout, stderr, exit code) flow back to the rollout.
- Trajectory collection: The rollout accumulates the conversation history (messages and tool results) along with the final reward into an art.Trajectory object. Multiple trajectories for the same problem form an art.TrajectoryGroup.
- Training backend: The collected trajectory groups are sent to the training backend. With LocalBackend, training happens on your local GPU. With TinkerBackend, trajectories are uploaded to Thinking Machines’ Tinker service, which handles training remotely - no local GPU required.