API — River

01 / What it does

Stop prompting. Start training.

Prompting steers a model you don't own and can't improve. River lets you train open models into ones that are truly yours — and serve them like any other endpoint.

Prompting & few-shot: the model stays generic

You nudge behavior with context — the weights never change
Quality plateaus; the hardest tasks stay out of reach
Long prompts and big models tax every single call
You rent behavior you can't keep, version, or own

Training with River: the model becomes yours

RL on your own rewards turns an open model into a reliable agent
Weights actually update, so it learns what prompting never could
A small tuned model can match a giant on your task, at far lower cost and latency
The checkpoint is yours: keep it, iterate, and serve it in production

Under the hood

LoRA fine-tuning

Low-rank adapters across every weight matrix — including each MoE expert.

RL, the hard parts handled

Fast weight transfer, sampling-training consistency, and elastic compute.

Drop-in serving

OpenAI-compatible endpoint on any checkpoint — swap the base URL and ship.
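
As a rough sketch of what "swap the base URL" could look like, the snippet below builds a standard OpenAI-style chat-completions request using only the Python standard library. The base URL `https://river.example/v1`, the checkpoint name `my-run/final`, and the helper `build_chat_request` are placeholders invented for illustration, not values from the River docs. The `/chat/completions` path and the `Authorization: Bearer` header are the usual OpenAI conventions, which River is assumed here to follow.

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, messages, max_tokens=128):
    """Build (but do not send) an OpenAI-style chat-completions request."""
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder values: substitute the base URL and checkpoint name from your account.
req = build_chat_request(
    base_url="https://river.example/v1",
    api_key="...",
    model="my-run/final",
    messages=[{"role": "user", "content": "What is 12 * 7?"}],
)
# To send it: urllib.request.urlopen(req), then json.load() the response.
```

Because only the base URL and model name change, an existing OpenAI client configured with a different base URL should work the same way, assuming the endpoint is compatible as described.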

02 / Models

Train on a range of open models.

State-of-the-art open-source models, available at several context lengths. More are added regularly.

Qwen3.6 · MoE

Qwen3.6 35B

262k

Qwen3.5 · MoE

Qwen3.5 397B

262k

Kimi · MoE

Kimi K2.6

32k 262k

GLM · MoE

GLM 5.2

32k 262k

03 / Pricing

Pay per token. No GPU-hour math.

Billing is metered on tokens for both inference and training, so cost tracks the work you actually do.

Model	Context	Prompt	Cached	Completion	Training
Qwen3.5-9B	262k	$0.66 / 1M	$0.132 / 1M	$1.99 / 1M	$1.46 / 1M
Qwen3.6-35B-A3B-FP8	262k	$0.33 / 1M	$0.066 / 1M	$0.82 / 1M	$1.00 / 1M
Qwen3.5-122B-A10B-FP8	262k	$1.00 / 1M	$0.200 / 1M	$3.00 / 1M	$4.00 / 1M
Qwen3.5-397B-A17B-FP8	262k	$3.32 / 1M	$0.664 / 1M	$8.30 / 1M	$10.00 / 1M
Kimi-K2.6-NVFP4	32k	$1.22 / 1M	$0.244 / 1M	$3.06 / 1M	$3.67 / 1M
Kimi-K2.6-NVFP4-262k	262k	$4.28 / 1M	$0.856 / 1M	$10.70 / 1M	$12.84 / 1M
GLM-5.2-NVFP4	32k	$1.46 / 1M	$0.292 / 1M	$3.67 / 1M	$4.40 / 1M
GLM-5.2-NVFP4-262k	262k	$5.14 / 1M	$1.028 / 1M	$12.84 / 1M	$15.41 / 1M

Prices in USD per 1M tokens · Preview rates, subject to change · Cached prompt tokens billed at 20% of the prompt rate
Checkpoint storage billed at $0.10 / GB / month · Longer context lengths are priced on request — [email protected]
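
To make the per-token billing concrete, here is a small estimator sketch. The rates are copied from two rows of the table above; the function name `estimate_cost` and the dictionary layout are illustrative only and are not part of River's API. The token counts in the usage line are example inputs.

```python
# USD per 1M tokens, copied from the pricing table (preview rates, subject to change).
PRICES = {
    "Qwen3.6-35B-A3B-FP8": {"prompt": 0.33, "cached": 0.066, "completion": 0.82, "training": 1.00},
    "Kimi-K2.6-NVFP4": {"prompt": 1.22, "cached": 0.244, "completion": 3.06, "training": 3.67},
}

def estimate_cost(model, prompt=0, cached=0, completion=0, training=0):
    """Return the estimated USD cost for the given token counts."""
    rates = PRICES[model]
    counts = {"prompt": prompt, "cached": cached, "completion": completion, "training": training}
    return sum(rates[kind] * n / 1_000_000 for kind, n in counts.items())

# Example inputs: 500M completion tokens and 250M training tokens.
cost = estimate_cost("Qwen3.6-35B-A3B-FP8", completion=500_000_000, training=250_000_000)
print(f"${cost:,.2f}")
```

At the Qwen3.6-35B-A3B-FP8 rates this works out to about $660 for completion and training tokens (500 × $0.82 + 250 × $1.00); prompt tokens and checkpoint storage are billed on top.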

04 / Examples

From toy to production-scale training.

Training is a tiny Python loop: sample, score, update. The examples below are simplified to show the shape of the API.

A toy example: a minimal GRPO loop on GSM8K, training Kimi-K2.6-NVFP4. Reward starts low because Kimi tends to think for longer than the 128-token budget, then climbs above 0.9 within the 29-step run (one pass over the training split at 256 prompts per step) as it learns to answer within the limit.

rl_loop.py

import river_client as river
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL      = "nvidia/Kimi-K2.6-NVFP4"
BATCH      = 256          # prompts per step
GROUP      = 4            # samples per prompt
MAX_TOKENS = 128
LORA_RANK  = 16
LR         = 4e-5         # learning rate

client = river.Client(api_key="...")
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
gsm8k = load_dataset("openai/gsm8k", "main", split="train")   # math word problems

def reward(text, answer):
    # 1.0 if the \boxed{...} answer matches the GSM8K ground truth, else 0.0
    return float(extract_boxed(text) == extract_gsm8k_answer(answer))

def make_prompt(question):
    ...   # render the question with the chat template (+ "...answer inside \boxed{}" suffix)

with client.session() as session:
    model = session.create_model(
        base_model=MODEL,
        lora=river.LoraConfig(
            rank=LORA_RANK,
            train_attn=True,     # attention projections
            train_mlp=True,      # MLP / MoE experts
            train_unembed=True,  # unembedding (lm_head)
        ),
    )

    for step in range(len(gsm8k) // BATCH):
        rows = gsm8k.select(range(step * BATCH, (step + 1) * BATCH))
        prompts = [make_prompt(q) for q in rows["question"]]

        # 1. Sample a group of candidate answers per prompt
        groups = model.sample(
            prompts=prompts, num_samples=GROUP, max_tokens=MAX_TOKENS, stop=["<|im_end|>"],
        )

        # 2. Score, then center rewards into GRPO advantages (no std division)
        train_data = []
        for prompt, samples, answer in zip(prompts, groups, rows["answer"]):
            rewards = [reward(s.text, answer) for s in samples]
            baseline = sum(rewards) / len(rewards)
            if all(r == baseline for r in rewards):
                continue  # no signal — skip this group

            ptoks = tok.encode(prompt, add_special_tokens=False)
            for s, r in zip(samples, rewards):
                adv = r - baseline
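                full_ids = ptoks + s.tokens
                # Arrays below align with full_ids: entry i scores the prediction of
                # token i + 1, so prompt positions and the final slot are zero-padded.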
                train_data.append({
                    "input_ids": full_ids,
                    "attention_mask": [1] * len(full_ids),
                    "old_logprobs": [0.0] * (len(ptoks) - 1) + s.logprobs + [0.0],
                    "advantages": [0.0] * (len(ptoks) - 1) + [adv] * len(s.tokens) + [0.0],
                })

        # 3. One policy-gradient update
        if train_data:
            model.forward_backward(data=train_data, loss_fn="importance_sampling")
            model.optim_step(lr=LR)

    model.save_weights("final")

Figure: reward/mean over 29 steps (rl_loop.py · Kimi-K2.6-NVFP4 · GSM8K · GRPO)
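
The reward function in rl_loop.py calls `extract_boxed` and `extract_gsm8k_answer` without defining them. One possible way to write them is sketched below; these bodies are an assumption, not the implementation used for the run above. The sketch relies on the fact that GSM8K reference solutions end with a line of the form `#### <number>`, and it strips thousands separators so that `1,200` and `1200` compare equal.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close brace for \boxed{
                return "".join(out).strip().replace(",", "")
        out.append(ch)
        i += 1
    return None  # unbalanced braces, e.g. the completion was cut off

def extract_gsm8k_answer(answer):
    """GSM8K solutions end with '#### <number>'; return that number as a string."""
    return answer.split("####")[-1].strip().replace(",", "")
```

A completion that runs out of tokens before writing `\boxed{...}` yields `None`, which never equals the ground truth, so it scores 0.0. That matches the low starting reward described above.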

A larger example: an RL run on the Polaris-53K math dataset (filtered to difficulty ≤ 3/8), with batch-normalized advantages and the CISPO loss, following the ScaleRL recipe (Khatri et al., 2025). The base model can already solve some of these problems, but before fine-tuning its completions run long and rarely finish within the 4,096-token budget, which is why reward stays near zero for the first ~40 steps below. For under $1,000 (roughly 500M completion tokens and 250M training tokens), you end up with a model that solves the same problems in far fewer tokens, which makes it faster and cheaper to serve.

scalerl.py

import numpy as np
import river_client as river
from collections import deque
from transformers import AutoTokenizer

MODEL         = "Qwen/Qwen3.6-35B-A3B-FP8"   # 35B MoE, FP8
LORA_RANK     = 8
LEARNING_RATE = 1e-4
WARMUP_STEPS  = 100          # linear LR warmup
PIPELINE_K    = 1            # PipelineRL depth (1 = on-policy)
GROUP         = 16           # generations per prompt
BATCH         = 48           # prompts per step
MAX_TOKENS    = 4096
EPS_MAX       = 6.0          # CISPO clip

client = river.Client(api_key="...")
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

# Raw string: in a normal string literal "\b" is a backspace, which would corrupt "\boxed".
SUFFIX = r" Think step by step, then put your final answer inside \boxed{}."

def render(problem):
    messages = [{"role": "user", "content": problem + SUFFIX}]
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def grade(response, answer):
    ...   # your task reward, e.g. a symbolic math checker -> 0.0 / 1.0

def advantages(group_rewards):
    # ScaleRL: center each group by its own mean (the GRPO baseline), but
    # normalize by the BATCH std — not the per-group std that standard GRPO
    # uses. This drops GRPO's difficulty/length bias from per-group scaling.
    centered = [[r - np.mean(g) for r in g] for g in group_rewards]
    std = max(float(np.std([a for g in centered for a in g])), 1e-8)
    return [[a / std for a in g] for g in centered]

def build_cispo_data(prompts, groups, advs):
    ...   # pack prompt + sample tokens with old_logprobs and advantages

def lr_at(train_step):
    # linear warmup over the first WARMUP_STEPS optimizer steps, then hold
    if train_step < WARMUP_STEPS:
        return LEARNING_RATE * (train_step + 1) / WARMUP_STEPS
    return LEARNING_RATE

with client.session() as session:
    model = session.create_model(
        base_model=MODEL,
        lora=river.LoraConfig(
            rank=LORA_RANK,
            train_attn=True,     # attention projections
            train_mlp=True,      # MLP / MoE experts
            train_unembed=True,  # unembedding (lm_head)
        ),
    )

    pipeline = deque()                  # holds the last PIPELINE_K batches
    train_step = 0                      # optimizer steps; LR warmup keys on this

    for step in range(200):
        problems, answers = next_batch(BATCH)   # your data loader (not shown); Polaris-53K, difficulty <= 3/8
        prompts = [render(p) for p in problems]

        # 1. Sample from an inference checkpoint of the current weights
        ckpt = model.save_weights(f"sample_{step}", mode="inference")
        groups = session.sample(prompts=prompts, base_model=MODEL,
                                checkpoint=ckpt, num_samples=GROUP,
                                max_tokens=MAX_TOKENS, stop=["<|im_end|>"])

        # 2. Grade, normalize advantages, drop zero-variance groups, build CISPO data
        rewards = [[grade(s.text, a) for s in g] for g, a in zip(groups, answers)]
        advs = advantages(rewards)

        # A group where every sample earned the same reward has all-zero
        # advantages -> no learning signal. Skip it (ScaleRL).
        keep = [any(a != 0.0 for a in g) for g in advs]
        data = build_cispo_data(
            [p for p, k in zip(prompts, keep) if k],
            [g for g, k in zip(groups,  keep) if k],
            [a for a, k in zip(advs,    keep) if k],
        )
        if any(keep):   # queue only batches that still carry learning signal
            pipeline.append(data)

        # 3. Train once PIPELINE_K batches are queued, with a warmed-up LR
        if len(pipeline) >= PIPELINE_K:
            model.forward_backward(data=pipeline.popleft(), loss_fn="cispo", eps_max=EPS_MAX)
            model.optim_step(
                lr=lr_at(train_step), grad_clip_norm=1.0,
                beta1=0.9, beta2=0.95, eps=1e-15, weight_decay=0.01,  # ScaleRL Adam
            )
            train_step += 1

    model.save_weights("final")

Figure: reward/mean (raw and EMA) over the first 200 steps of a <$1,000 run (scalerl.py · Qwen3.6 35B · LoRA rank 8 · lr 1e-4 · CISPO)
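
For readers unfamiliar with CISPO, the sketch below shows the general shape of the objective as it is usually described: each token's log-probability is weighted by its advantage times a truncated importance ratio, and that weight is treated as a constant (no gradient flows through it). This is an illustration under stated assumptions, not River's server-side `loss_fn="cispo"` implementation. The function names are hypothetical, the plain token average is a simplification, and only the `eps_max=6.0` default is taken from scalerl.py.

```python
import math

def cispo_token_weights(new_logprobs, old_logprobs, advantages, eps_max=6.0):
    """Per-token coefficients min(ratio, eps_max) * advantage (held constant in the loss)."""
    weights = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)        # pi_new / pi_old for this token
        weights.append(min(ratio, eps_max) * adv)
    return weights

def cispo_loss(new_logprobs, old_logprobs, advantages, eps_max=6.0):
    """Scalar loss: negative weighted mean of the current log-probabilities."""
    weights = cispo_token_weights(new_logprobs, old_logprobs, advantages, eps_max)
    return -sum(w * lp for w, lp in zip(weights, new_logprobs)) / len(new_logprobs)

# On-policy tokens have ratio 1, so the weight is just the advantage;
# a token whose probability grew by e^3 is capped at eps_max.
weights = cispo_token_weights([-1.0, -0.5], [-1.0, -3.5], [2.0, 2.0])
loss = cispo_loss([-1.0, -0.5], [-1.0, -3.5], [2.0, 2.0])
```

In an autograd framework the weight would be wrapped in a stop-gradient (for example `.detach()` in PyTorch) so that only the log-probability term is differentiated.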

Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal. The Art of Scaling Reinforcement Learning Compute for LLMs. arXiv:2510.13786 (2025).

05 / Documentation

Everything you need to get started.

For setup guides, API references, and examples, explore the River documentation.

Explore the docs

Ready to train something that's yours?

Sign up · Email us