API · v0.1 preview

An API to train models
that are yours.

LoRA-based fine-tuning & RL on open-source models, from 35B to 1T parameters. Train, sample, and serve with one small Python client.

01 / What it does

Stop prompting. Start training.

Prompting steers a model you don't own and can't improve. River lets you train open models into ones that are truly yours — and serve them like any other endpoint.

Prompting & few-shot: the model stays generic
  • You nudge behavior with context — the weights never change
  • Quality plateaus; the hardest tasks stay out of reach
  • Long prompts and big models tax every single call
  • You rent behavior you can't keep, version, or own
Training with River: the model becomes yours
  • RL on your own rewards turns an open model into a reliable agent
  • Weights actually update, so it learns what prompting never could
  • A small tuned model can match a far larger one on your task, at lower cost and latency
  • The checkpoint is yours: keep it, iterate, and serve it in production
Under the hood

LoRA fine-tuning

Low-rank adapters across every weight matrix — including each MoE expert.
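As a rough illustration of what a low-rank adapter does, here is a generic LoRA sketch in plain Python. It is not River's implementation: the function names are illustrative, and the alpha / rank scaling follows the common LoRA convention, which is an assumption here. The frozen weight W is left untouched and a trainable product of two small matrices, B and A, is added on top.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W (d_out x d_in) is frozen; only A (rank x d_in) and B (d_out x rank) train.
    The alpha / rank scaling is the usual LoRA convention (assumed, not confirmed).
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters one adapter adds to a d_out x d_in matrix."""
    return rank * (d_in + d_out)


# A 4096 x 4096 projection at rank 16 trains under 1% of the full matrix.
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, 16)
fraction = adapter / full
```

With B initialized to zeros, the adapter contributes nothing and the model starts out identical to the base model; training then moves only A and B.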

RL, the hard parts handled

Fast weight transfer, sampling-training consistency, and elastic compute.

Drop-in serving

OpenAI-compatible endpoint on any checkpoint — swap the base URL and ship.
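Because the endpoint is OpenAI-compatible, calling a checkpoint should look like any other chat-completions request with a different base URL. The sketch below builds such a request with the standard library only and does not send it. The base URL, API key, and checkpoint name are placeholders, and passing the checkpoint in the `model` field is an assumption rather than documented behavior.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"   # placeholder: your endpoint's base URL
API_KEY = "..."                           # placeholder
CHECKPOINT = "my-checkpoint"              # placeholder: hypothetical checkpoint name


def build_chat_request(base_url, api_key, model, user_message):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(BASE_URL, API_KEY, CHECKPOINT, "What is 12 * 7?")
# To send it: urllib.request.urlopen(request), then JSON-decode the response body.
```

An existing OpenAI client library can be pointed at the same base URL instead of hand-building the request.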

02 / Models

Train on a range of open models.

State-of-the-art open-source models, available at several context lengths. More are added regularly.

Qwen3.6 35B (Qwen3.6 · MoE): 262k context
Qwen3.5 397B (Qwen3.5 · MoE): 65k, 262k context
Kimi K2.6 (Kimi · MoE): 32k, 131k context
GLM 5.1 (GLM · MoE): 32k, 131k context

03 / Pricing

Pay per token. No GPU-hour math.

Billing is metered on tokens for both inference and training, so cost tracks the work you actually do.

Model                  Context  Prompt       Completion   Training
Qwen3.6-35B-A3B-FP8    262k     $0.33 / 1M   $0.82 / 1M   $1 / 1M
Qwen3.5-397B-A17B-FP8  65k      $1.66 / 1M   $4.15 / 1M   $5 / 1M
Qwen3.5-397B-A17B-FP8  262k     $3.32 / 1M   $8.30 / 1M   $10 / 1M
Kimi-K2.6-NVFP4        32k      $1.22 / 1M   $3.06 / 1M   $3.67 / 1M
Kimi-K2.6-NVFP4        131k     $4.28 / 1M   $10.70 / 1M  $12.84 / 1M
GLM-5.1-NVFP4          32k      $2.45 / 1M   $6.12 / 1M   $7.34 / 1M
GLM-5.1-NVFP4          131k     $8.56 / 1M   $21.40 / 1M  $25.68 / 1M

Prices in USD per 1M tokens · Preview rates, subject to change
Checkpoint storage billed at $0.10 / GB / month · Longer context lengths are priced on request — [email protected]
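
To make the per-token pricing concrete, here is a small estimator using the preview rates from the table above. The function and dictionary names are illustrative, the token counts in the example are made up, and checkpoint storage is not included. How an RL run's sampling and training tokens map onto the three columns is not spelled out here, so treat the split as an assumption.

```python
# USD per 1M tokens, copied from the preview pricing table:
# (prompt, completion, training)
RATES = {
    ("Qwen3.6-35B-A3B-FP8", "262k"): (0.33, 0.82, 1.00),
    ("Qwen3.5-397B-A17B-FP8", "65k"): (1.66, 4.15, 5.00),
    ("Qwen3.5-397B-A17B-FP8", "262k"): (3.32, 8.30, 10.00),
    ("Kimi-K2.6-NVFP4", "32k"): (1.22, 3.06, 3.67),
    ("Kimi-K2.6-NVFP4", "131k"): (4.28, 10.70, 12.84),
    ("GLM-5.1-NVFP4", "32k"): (2.45, 6.12, 7.34),
    ("GLM-5.1-NVFP4", "131k"): (8.56, 21.40, 25.68),
}


def estimate_cost(model, context, prompt_tokens=0, completion_tokens=0, training_tokens=0):
    """Return the estimated token cost in USD, rounded to cents."""
    prompt_rate, completion_rate, training_rate = RATES[(model, context)]
    total = (
        prompt_tokens * prompt_rate
        + completion_tokens * completion_rate
        + training_tokens * training_rate
    ) / 1_000_000
    return round(total, 2)


cost = estimate_cost(
    "Qwen3.6-35B-A3B-FP8",
    "262k",
    prompt_tokens=10_000_000,
    completion_tokens=2_000_000,
    training_tokens=5_000_000,
)
```

For this example the three terms are $3.30, $1.64, and $5.00, for a total of $9.94.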

04 / Examples

From toy to production-scale training.

Training is a tiny Python loop: sample, score, update. The examples below are simplified to show the shape of the API.

A toy example: a minimal GRPO loop on GSM8K, training Kimi-K2.6-NVFP4. Reward starts low because Kimi tends to think past the 128-token budget, then climbs above 0.9 over the 29-step run as it learns to answer within the limit.

rl_loop.py
import river_client as river
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL      = "nvidia/Kimi-K2.6-NVFP4"
BATCH      = 256          # prompts per step
GROUP      = 4            # samples per prompt
MAX_TOKENS = 128
LORA_RANK  = 16
LR         = 4e-5         # learning rate

client = river.Client(api_key="...")
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
gsm8k = load_dataset("openai/gsm8k", "main", split="train")   # math word problems

def reward(text, answer):
    # 1.0 if the \boxed{...} answer matches the GSM8K ground truth, else 0.0
    return float(extract_boxed(text) == extract_gsm8k_answer(answer))

def make_prompt(question):
    ...   # render the question with the chat template (+ "...answer inside \boxed{}" suffix)

with client.session() as session:
    model = session.create_model(
        base_model=MODEL,
        lora=river.LoraConfig(
            rank=LORA_RANK,
            train_attn=True,     # attention projections
            train_mlp=True,      # MLP / MoE experts
            train_unembed=True,  # unembedding (lm_head)
        ),
    )

    for step in range(len(gsm8k) // BATCH):
        rows = gsm8k.select(range(step * BATCH, (step + 1) * BATCH))
        prompts = [make_prompt(q) for q in rows["question"]]

        # 1. Sample a group of candidate answers per prompt
        groups = model.sample(
            prompts=prompts, num_samples=GROUP, max_tokens=MAX_TOKENS, stop=["<|im_end|>"],
        )

        # 2. Score, then center rewards into GRPO advantages (no std division)
        train_data = []
        for prompt, samples, answer in zip(prompts, groups, rows["answer"]):
            rewards = [reward(s.text, answer) for s in samples]
            baseline = sum(rewards) / len(rewards)
            if all(r == baseline for r in rewards):
                continue  # no signal — skip this group

            ptoks = tok.encode(prompt, add_special_tokens=False)
            for s, r in zip(samples, rewards):
                adv = r - baseline
                full_ids = ptoks + s.tokens
                train_data.append({
                    "input_ids": full_ids,
                    "attention_mask": [1] * len(full_ids),
                    "old_logprobs": [0.0] * (len(ptoks) - 1) + s.logprobs + [0.0],
                    "advantages": [0.0] * (len(ptoks) - 1) + [adv] * len(s.tokens) + [0.0],
                })

        # 3. One policy-gradient update
        if train_data:
            model.forward_backward(data=train_data, loss_fn="importance_sampling")
            model.optim_step(lr=LR)

    model.save_weights("final")
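
The listing leaves `extract_boxed` and `extract_gsm8k_answer` undefined. One possible way to fill them in is sketched below. It relies on two facts: GSM8K reference solutions end with a line of the form `#### <number>`, and the prompt asks the model to put its answer inside `\boxed{...}`. The normalization step (dropping commas, dollar signs, and whitespace) is an assumption, and the regex does not handle nested braces.

```python
import re


def _normalize(value):
    """Strip whitespace, thousands separators, and dollar signs for comparison."""
    return value.strip().replace(",", "").replace("$", "")


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return _normalize(matches[-1]) if matches else None


def extract_gsm8k_answer(answer):
    """Return the final answer that follows '####' in a GSM8K solution."""
    return _normalize(answer.split("####")[-1])


def reward(text, answer):
    """1.0 if the boxed answer matches the ground truth, else 0.0."""
    boxed = extract_boxed(text)
    return float(boxed is not None and boxed == extract_gsm8k_answer(answer))
```

A completion that is cut off at the token limit before reaching `\boxed{...}` scores 0.0 here, which is consistent with reward starting low under a 128-token budget.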
[Figure: mean reward per step (reward/mean) over 29 steps · rl_loop.py · Kimi-K2.6-NVFP4 · GSM8K · GRPO]
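
Step 2 of rl_loop.py can be pulled out into two small pure functions, which makes the bookkeeping easier to check in isolation. This is a sketch of the same logic with hypothetical function names. Rewards are centered on the group mean without dividing by the standard deviation. The per-token lists appear to follow the usual next-token convention, where the value for a sampled token sits at the position that predicts it; that reading is inferred from the listing's padding, not stated by it.

```python
def center_rewards(rewards):
    """Return mean-centered advantages, or None when every reward is identical."""
    if len(set(rewards)) <= 1:
        return None  # no learning signal in this group
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def build_example(prompt_ids, sample_ids, sample_logprobs, advantage):
    """Lay out one training row the way the listing does.

    Assumes a non-empty prompt. Prompt positions get zeros, each sampled
    token's value sits one position to the left of the token itself, and
    the final position is padded with zero.
    """
    full_ids = prompt_ids + sample_ids
    pad = [0.0] * (len(prompt_ids) - 1)
    return {
        "input_ids": full_ids,
        "attention_mask": [1] * len(full_ids),
        "old_logprobs": pad + sample_logprobs + [0.0],
        "advantages": pad + [advantage] * len(sample_ids) + [0.0],
    }


# One correct sample out of four: the winner gets a positive advantage.
advantages = center_rewards([1.0, 0.0, 0.0, 0.0])
row = build_example([11, 12, 13], [21, 22], [-0.5, -1.5], advantages[0])
```

Every list in `row` has the same length as `input_ids`, so the row can be checked with a simple length assertion before it is sent for training.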

Ready to train something that's yours?