- You nudge behavior with context — the weights never change
- Quality plateaus; the hardest tasks stay out of reach
- Long prompts and big models tax every single call
- You rent behavior you can't keep, version, or own
An API to train models that are yours.
LoRA-based fine-tuning & RL on open-source models, from 35B to 1T parameters. Train, sample, and serve with one small Python client.
Stop prompting. Start training.
Prompting steers a model you don't own and can't improve. River lets you train open models into ones that are truly yours — and serve them like any other endpoint.
- RL on your own rewards turns an open model into a reliable agent
- Weights actually update, so it learns what prompting never could
- A small tuned model can match a far larger one on your task, at lower cost and latency
- The checkpoint is yours: keep it, iterate, and serve it in production
LoRA fine-tuning
Low-rank adapters across every weight matrix — including each MoE expert.
RL, the hard parts handled
Fast weight transfer, sampling-training consistency, and elastic compute.
Drop-in serving
OpenAI-compatible endpoint on any checkpoint — swap the base URL and ship.
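As a rough illustration of what "swap the base URL" means in practice, the sketch below builds a standard OpenAI-style chat completion request using only the Python standard library. The base URL, API key, and checkpoint name are placeholders, not real River values, and the assumption that a checkpoint is addressed through the `model` field is ours. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical values: substitute the base URL, key, and checkpoint name
# that your account actually provides.
BASE_URL = "https://api.example.com/v1"
API_KEY = "YOUR_API_KEY"
MODEL_ID = "my-checkpoint"


def build_chat_request(base_url, api_key, model, user_message):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(BASE_URL, API_KEY, MODEL_ID, "What is 17 * 23?")
# To actually call the endpoint:
#   with urllib.request.urlopen(request) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```

Because only the URL and model name change, existing client code written against an OpenAI-style endpoint should need little more than a configuration update.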
Train on a range of open models.
State-of-the-art open-source models, available at several context lengths. More are added regularly.
Pay per token. No GPU-hour math.
Billing is metered on tokens for both inference and training, so cost tracks the work you actually do.
| Model | Context | Prompt | Completion | Training |
|---|---|---|---|---|
| Qwen3.6-35B-A3B-FP8 | 262k | $0.33 / 1M | $0.82 / 1M | $1 / 1M |
| Qwen3.5-397B-A17B-FP8 | 65k | $1.66 / 1M | $4.15 / 1M | $5 / 1M |
| Qwen3.5-397B-A17B-FP8 | 262k | $3.32 / 1M | $8.30 / 1M | $10 / 1M |
| Kimi-K2.6-NVFP4 | 32k | $1.22 / 1M | $3.06 / 1M | $3.67 / 1M |
| Kimi-K2.6-NVFP4 | 131k | $4.28 / 1M | $10.70 / 1M | $12.84 / 1M |
| GLM-5.1-NVFP4 | 32k | $2.45 / 1M | $6.12 / 1M | $7.34 / 1M |
| GLM-5.1-NVFP4 | 131k | $8.56 / 1M | $21.40 / 1M | $25.68 / 1M |
Prices in USD per 1M tokens · Preview rates, subject to change
Checkpoint storage billed at $0.10 / GB / month · Longer context lengths are priced on request
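Since billing is per token, a run's cost can be estimated with simple arithmetic. The sketch below copies three rows of the table above into a lookup. The function name `estimate_cost` is hypothetical, and the linear proration of storage by month is our assumption rather than a stated billing rule.

```python
# USD per 1M tokens, copied from the table above: (prompt, completion, training)
RATES = {
    ("Qwen3.6-35B-A3B-FP8", "262k"): (0.33, 0.82, 1.00),
    ("Kimi-K2.6-NVFP4", "32k"): (1.22, 3.06, 3.67),
    ("GLM-5.1-NVFP4", "32k"): (2.45, 6.12, 7.34),
}
STORAGE_USD_PER_GB_MONTH = 0.10


def estimate_cost(model, context, prompt_tokens=0, completion_tokens=0,
                  training_tokens=0, storage_gb=0.0, months=0.0):
    """Estimate USD cost from token counts (storage proration is an assumption)."""
    prompt_rate, completion_rate, training_rate = RATES[(model, context)]
    token_cost = (
        prompt_tokens * prompt_rate
        + completion_tokens * completion_rate
        + training_tokens * training_rate
    ) / 1_000_000
    return token_cost + storage_gb * months * STORAGE_USD_PER_GB_MONTH


cost = estimate_cost(
    "Qwen3.6-35B-A3B-FP8", "262k",
    completion_tokens=500_000_000,
    training_tokens=250_000_000,
)
print(f"${cost:,.2f}")
```

With the token counts quoted for the larger RL run further down this page (about 500M completion and 250M training tokens), this prints $660.00. Prompt tokens are left out because the page does not state them, so the true total would be somewhat higher.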
From toy to production-scale training.
Training is a tiny Python loop: sample, score, update. The examples below are simplified to show the shape of the API.
A toy example: a minimal GRPO loop on GSM8K, training Kimi-K2.6-NVFP4. Reward starts low because Kimi tends to think for longer than the 128-token budget, then climbs above 0.9 reasonably quickly as it learns to answer within the limit.
import river_client as river
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL = "nvidia/Kimi-K2.6-NVFP4"
BATCH = 256        # prompts per step
GROUP = 4          # samples per prompt
MAX_TOKENS = 128
LORA_RANK = 16
LR = 4e-5          # learning rate

client = river.Client(api_key="...")
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
gsm8k = load_dataset("openai/gsm8k", "main", split="train")  # math word problems

def reward(text, answer):
    # 1.0 if the \boxed{...} answer matches the GSM8K ground truth, else 0.0
    return float(extract_boxed(text) == extract_gsm8k_answer(answer))

def make_prompt(question):
    ...  # render the question with the chat template (+ "...answer inside \boxed{}" suffix)

with client.session() as session:
    model = session.create_model(
        base_model=MODEL,
        lora=river.LoraConfig(
            rank=LORA_RANK,
            train_attn=True,     # attention projections
            train_mlp=True,      # MLP / MoE experts
            train_unembed=True,  # unembedding (lm_head)
        ),
    )

    for step in range(len(gsm8k) // BATCH):
        rows = gsm8k.select(range(step * BATCH, (step + 1) * BATCH))
        prompts = [make_prompt(q) for q in rows["question"]]

        # 1. Sample a group of candidate answers per prompt
        groups = model.sample(
            prompts=prompts, num_samples=GROUP, max_tokens=MAX_TOKENS, stop=["<|im_end|>"],
        )

        # 2. Score, then center rewards into GRPO advantages (no std division)
        train_data = []
        for prompt, samples, answer in zip(prompts, groups, rows["answer"]):
            rewards = [reward(s.text, answer) for s in samples]
            baseline = sum(rewards) / len(rewards)
            if all(r == baseline for r in rewards):
                continue  # no signal, skip this group
            ptoks = tok.encode(prompt, add_special_tokens=False)
            for s, r in zip(samples, rewards):
                adv = r - baseline
                full_ids = ptoks + s.tokens
                train_data.append({
                    "input_ids": full_ids,
                    "attention_mask": [1] * len(full_ids),
                    "old_logprobs": [0.0] * (len(ptoks) - 1) + s.logprobs + [0.0],
                    "advantages": [0.0] * (len(ptoks) - 1) + [adv] * len(s.tokens) + [0.0],
                })

        # 3. One policy-gradient update
        if train_data:
            model.forward_backward(data=train_data, loss_fn="importance_sampling")
            model.optim_step(lr=LR)

    model.save_weights("final")
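The `reward` function in the toy loop relies on two helpers, `extract_boxed` and `extract_gsm8k_answer`, that the page leaves out. The sketch below shows one plausible way to write them. It is not River's implementation. It relies on GSM8K reference solutions ending in `#### <number>`, and it compares answers as plain strings after stripping thousands separators. Nested braces inside `\boxed{...}` and numerically equal but differently written answers (such as `72.0` versus `72`) are not handled.

```python
import re


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip().replace(",", "") if matches else None


def extract_gsm8k_answer(answer):
    """GSM8K solutions end with '#### <number>'; return that number as a string."""
    return answer.split("####")[-1].strip().replace(",", "")


def reward(text, answer):
    # 1.0 only when a boxed answer exists and matches the reference exactly
    predicted = extract_boxed(text)
    return float(predicted is not None and predicted == extract_gsm8k_answer(answer))
```

Taking the last boxed expression rather than the first is a design choice: models often box intermediate values before the final answer.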
An example of a larger RL run on the Polaris-53K math dataset (filtered to difficulty ≤ 3/8), with batch-normalized advantages and the cispo loss, following the ScaleRL recipe (Khatri et al., 2025).
The base model can already solve some of these problems, but before fine-tuning its completions run long and rarely finish within the 4,096-token budget, which is why reward stays near zero for roughly the first 40 steps. For under $1,000 (roughly 500M completion tokens and 250M training tokens), you end up with a model that solves the same math much faster and cheaper.
import numpy as np
import river_client as river
from collections import deque
from transformers import AutoTokenizer

MODEL = "Qwen/Qwen3.6-35B-A3B-FP8"  # 35B MoE, FP8
LORA_RANK = 8
LEARNING_RATE = 1e-4
WARMUP_STEPS = 100   # linear LR warmup
PIPELINE_K = 1       # PipelineRL depth (1 = on-policy)
GROUP = 16           # generations per prompt
BATCH = 48           # prompts per step
MAX_TOKENS = 4096
EPS_MAX = 6.0        # CISPO clip

client = river.Client(api_key="...")
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

# Raw string: in a normal string literal, "\b" would be read as a backspace.
SUFFIX = r" Think step by step, then put your final answer inside \boxed{}."

def render(problem):
    messages = [{"role": "user", "content": problem + SUFFIX}]
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def grade(response, answer):
    ...  # your task reward, e.g. a symbolic math checker -> 0.0 / 1.0

def advantages(group_rewards):
    # ScaleRL: center each group by its own mean (the GRPO baseline), but
    # normalize by the BATCH std, not the per-group std that standard GRPO
    # uses. This drops GRPO's difficulty/length bias from per-group scaling.
    centered = [[r - np.mean(g) for r in g] for g in group_rewards]
    std = max(float(np.std([a for g in centered for a in g])), 1e-8)
    return [[a / std for a in g] for g in centered]

def build_cispo_data(prompts, groups, advs):
    ...  # pack prompt + sample tokens with old_logprobs and advantages

def lr_at(train_step):
    # linear warmup over the first WARMUP_STEPS optimizer steps, then hold
    if train_step < WARMUP_STEPS:
        return LEARNING_RATE * (train_step + 1) / WARMUP_STEPS
    return LEARNING_RATE

with client.session() as session:
    model = session.create_model(
        base_model=MODEL,
        lora=river.LoraConfig(
            rank=LORA_RANK,
            train_attn=True,     # attention projections
            train_mlp=True,      # MLP / MoE experts
            train_unembed=True,  # unembedding (lm_head)
        ),
    )
    pipeline = deque()  # holds the last PIPELINE_K batches
    train_step = 0      # optimizer steps; LR warmup keys on this

    for step in range(200):
        # next_batch is your own data loader (Polaris-53K, difficulty <= 3/8)
        problems, answers = next_batch(BATCH)
        prompts = [render(p) for p in problems]

        # 1. Sample from an inference checkpoint of the current weights
        ckpt = model.save_weights(f"sample_{step}", mode="inference")
        groups = session.sample(prompts=prompts, base_model=MODEL,
                                checkpoint=ckpt, num_samples=GROUP,
                                max_tokens=MAX_TOKENS, stop=["<|im_end|>"])

        # 2. Grade, normalize advantages, drop zero-variance groups, build CISPO data
        rewards = [[grade(s.text, a) for s in g] for g, a in zip(groups, answers)]
        advs = advantages(rewards)
        # A group where every sample earned the same reward has all-zero
        # advantages -> no learning signal. Skip it (ScaleRL).
        keep = [any(a != 0.0 for a in g) for g in advs]
        data = build_cispo_data(
            [p for p, k in zip(prompts, keep) if k],
            [g for g, k in zip(groups, keep) if k],
            [a for a, k in zip(advs, keep) if k],
        )
        pipeline.append(data)

        # 3. Train once PIPELINE_K batches are queued, with a warmed-up LR
        if len(pipeline) >= PIPELINE_K:
            model.forward_backward(data=pipeline.popleft(), loss_fn="cispo", eps_max=EPS_MAX)
            model.optim_step(
                lr=lr_at(train_step), grad_clip_norm=1.0,
                beta1=0.9, beta2=0.95, eps=1e-15, weight_decay=0.01,  # ScaleRL Adam
            )
            train_step += 1

    model.save_weights("final")
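The `build_cispo_data` helper is elided in the example above. One plausible shape for it, mirroring the row layout that the GSM8K loop builds by hand, is sketched below. This rests on two assumptions: that the `cispo` loss consumes the same four fields as `importance_sampling`, which the page does not confirm, and that samples expose `.tokens` and `.logprobs` as in the toy example. The extra `encode` parameter is ours, added so the sketch can run without a real tokenizer.

```python
from types import SimpleNamespace


def build_cispo_data(prompts, groups, advs, encode):
    """Pack one training row per sample, in the layout the GSM8K example uses."""
    rows = []
    for prompt, samples, sample_advs in zip(prompts, groups, advs):
        ptoks = encode(prompt)
        pad = [0.0] * (len(ptoks) - 1)  # no loss on prompt positions
        for s, adv in zip(samples, sample_advs):
            full_ids = ptoks + list(s.tokens)
            rows.append({
                "input_ids": full_ids,
                "attention_mask": [1] * len(full_ids),
                "old_logprobs": pad + list(s.logprobs) + [0.0],
                "advantages": pad + [adv] * len(s.tokens) + [0.0],
            })
    return rows


def fake_encode(text):
    # Stand-in tokenizer for an offline check: one "token" per character.
    return [ord(c) for c in text]


sample = SimpleNamespace(tokens=[7, 8, 9], logprobs=[-0.1, -0.2, -0.3])
rows = build_cispo_data(["ab"], [[sample]], [[1.5]], fake_encode)
```

In the training loop, `encode` could be something like `lambda p: tok.encode(p, add_special_tokens=False)`, matching how the toy example tokenizes prompts.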
Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal. The Art of Scaling Reinforcement Learning Compute for LLMs. arXiv:2510.13786 (2025).