My first Vibe RL experience
A log of how I used
Prime Intellect and Codex to Vibe RL and got Qwen3-4B to beat GPT 4.1 mini by ~2x at ASCII diagrams.
Before we dive in, I want to clarify what I mean by Vibe RL and how it is different from Vibe Coding as defined by Karpathy Sensei.
Vibe RL obviously involves Vibe Coding, but the key distinctions are:
- The agent interacts with experiment tools in realtime and makes decisions without much prompting.
- It proposes concrete hypotheses with verifiable steps to analyze the results.
- It builds custom interfaces for collaborating with humans.
Semi-formally, Vibe RL is the practice of post-training a model with Reinforcement Learning where an agent generates synthetic data, observes experiment results in realtime, inspects rollouts, finds bugs, spots reward hacking, improves the reward functions, proves its iterations with testcases before starting new runs, builds bespoke UI for humans, proposes research improvements, and iterates with varying degrees of autonomy.
A key requirement is that the agent must have access to all the tools and platforms that a human would.
I know that's a laundry list, but this is my first time and I would love to hear others' experiences and opinions.
tl;dr (what I learned)
- Choose an RL stack that has CLI/MCP/API access. I used Prime Intellect.
- Build the environment and a simple reward function as the golden skeleton for the agent to follow.
- Ask your agent to vibecode an RL Studio to inspect rollouts, rubric bugs, reward hacking, etc. The key is to be able to select multiple rollouts at once and have a "Copy Rollouts" button to prompt further.
- Maintain testcases for your reward functions to gain confidence in your rubric.
- Prompt your agent to write scripts for repetitive tasks like:
- Processing raw rollouts for your RL Studio
- Running your reward test suite against live rollouts
- Analyzing the dataset for diversity
- Codex spotted reward hacking better by looking at images instead of text, because reading ASCII diagrams as text is a weakness for frontier LLMs.
The Task
The task is to draw aligned ASCII diagrams. I've pushed the environment to the Hub and the entire code on GitHub.
A powerful use case for coding agents is drawing DB schemas, sequence diagrams, etc., but the results are often misaligned. Even GPT 5.3 Codex and Claude Opus 4.6 struggle with this task.
Here is an example: a misaligned diagram of how Prime Intellect's Hosted Training works:
┌─────────────────┐ ┌─────────────────┐
│ TRAINER │──notifies weights─▶│ ORCHESTRATOR │
│ GPU 0..7 │ │ CPU │
│ batches→weights│──writes batch─────▶│ data+scheduling│
└─────────────────┘ └────────┬────────┘
│
schedules rollouts │ notifies weights
+ new weights │
▼
┌─────────────────┐ ┌─────────────────┐
│ INFERENCE │──collects rollouts─▶│ receives rollouts│
│ GPU 0..7 │ │ (Orchestrator) │
│ OAI API server │ └────────────────┘
└─────────────────┘

I chose this task because it's a single-turn task without a fixed answer, easy to generate synthetic data for, and easy to evaluate (or so I thought).
My RL Stack
- Codex as my coding agent with GPT 5.3 Codex X-high
- I chose Qwen3-4B-Instruct-2507 as my LLM but used INTELLECT-3 in the beginning. Read on to see why 👀
- Verifiers for the RL Environment
- Prime RL for the Trainer
- Prime Lab for Hosted Training, Evaluation, and Inference
- Prime CLI for Codex to interact with the RL runs in realtime
- Hugging Face for managing my dataset
- Weights and Biases for experiment tracking
- Shadcn for my RL Studio
- Bash and Python for scripts
I had been playing around with Verifiers and Prime RL for a while and got early access to the Prime Intellect Lab, so I gave it a try. However, you can choose your own RL stack as long as your coding agent has a way to access the same data you can.
The Lab is a full-stack hosted platform for LLM post-training, from agentic RL to evaluation and inference. The best part is you don't have to worry about managing GPU infrastructure (IYKYK 😭).
All you need to do is build your environment using Verifiers with the following pieces:
- A dataset for your task
- A harness for the LLM. This includes prompts, tools, vector DBs, sandboxes, etc.
- A rubric aka a set of reward functions to score the model
- A set of TOML files with configurations for your RL runs
Then starting an RL run is as simple as running the following command:
prime rl run config.toml

This is the core of my environment:
def load_environment() -> vf.Environment:
    rubric = vf.Rubric(
        funcs=[format_reward, alignment_reward],
        weights=[1.0, 1.0],
    )

    return vf.SingleTurnEnv(
        dataset=train_dataset,
        eval_dataset=eval_dataset,
        system_prompt=SYSTEM_PROMPT,
        rubric=rubric,
    )

and here is a real RL config file from my runs:
name = "ascii-layout-alpha-128-qwen3-4b-instruct-2507"
model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 300
batch_size = 512
rollouts_per_example = 16
trajectory_strategy = "interleaved"
learning_rate = 1e-6
lora_alpha = 128
[sampling]
max_tokens = 5000
[[env]]
id = "13point5/ascii-align"
[val]
num_examples = 32
rollouts_per_example = 3
interval = 10
[wandb]
project = "ascii-align"
name = "ascii-layout-alpha-128-qwen3-4b-instruct-2507"
entity = "13point5-labs"

NOTE: This blog post is not a full tutorial but rather a messy walkthrough of my journey.
I would recommend reading the official Getting Started Guide and the launch blog post which are detailed and easy to follow.
1. Synthetic Data Generation
Before I even began building the environment, I needed a dataset because this is a new task. I wanted one with diverse use cases for ASCII diagrams so I prompted Codex to write a script to generate a synthetic dataset using GPT 4.1 mini. The result is a dataset of 7,200 samples.
I started with a 200 sample dataset to figure out the parameters, formatting, and data that would be useful for deterministic evaluation. I did not want to use an LLM as a judge for this task because with some preliminary testing using ChatGPT and Codex, I found that this is a task they're bad at. I guess since they're bad at generating aligned diagrams, they're bad at judging them too.
One thing to note is that for my task, I only need user prompts and not the expected outputs, unlike most synthetic data generation tasks. This is because there is no single right diagram for a prompt.
You can read the entire script here but this is the main function:
def build_meta_prompt(row: Dict[str, str], examples: str) -> str:
    return (
        "Generate a user prompt asking an LLM to draw a {theme} diagram. "
        "Use a different use case when compared to the examples. "
        "The prompt should be for: {use_case}. "
        "Audience: {audience}. Detail level: {detail_level}. Style: {style}. "
        "Keep the diagram {complexity_hint} with an appropriate number of entities and connections. "
        "{constraint} {output_rule}"
        f"\nExamples:\n{examples}"
    ).format(**row)

Here are some values for the parameters that we sample from:
- theme: flowcharts, architecture, sequence, state_machines
- use_case: combinations like "failure recovery path for payments platform involving queue" or "check-in flow for hospital intake"
- audience: new hires, engineers, product managers, executives, support agents, customers, students
- detail_level: high level, medium detail, low level with edge cases
- style: concise labels, clear step names, human-friendly phrasing, short sentences
- constraint: prompts like “Include a retry loop and a timeout” or “Include a manual review step”
- shape_budget: 4, 6, 8, 10, 12, 16, 20 (mapped to a complexity hint such as “tiny and very simple” to “very complex but still readable”)
Later on I'll show why some of these parameters are important for the environment rubric.
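To make the combinatorial sampling concrete, here is a minimal sketch of how a row of parameters might be drawn (the pools below are abbreviated, and `complexity_hint`/`sample_row` are my own illustrative names, not necessarily the actual script's):

```python
import random

# Abbreviated parameter pools from the lists above (illustrative subset).
THEMES = ["flowcharts", "architecture", "sequence", "state_machines"]
AUDIENCES = ["new hires", "engineers", "product managers", "executives"]
DETAIL_LEVELS = ["high level", "medium detail", "low level with edge cases"]
STYLES = ["concise labels", "clear step names", "human-friendly phrasing"]
SHAPE_BUDGETS = [4, 6, 8, 10, 12, 16, 20]

def complexity_hint(budget: int) -> str:
    # Map a shape budget to a rough complexity phrase.
    if budget <= 4:
        return "tiny and very simple"
    if budget >= 16:
        return "very complex but still readable"
    return "moderately complex"

def sample_row(rng: random.Random) -> dict:
    # Sample one combination of parameters for a synthetic prompt.
    budget = rng.choice(SHAPE_BUDGETS)
    return {
        "theme": rng.choice(THEMES),
        "audience": rng.choice(AUDIENCES),
        "detail_level": rng.choice(DETAIL_LEVELS),
        "style": rng.choice(STYLES),
        "shape_budget": budget,
        "complexity_hint": complexity_hint(budget),
    }

row = sample_row(random.Random(0))
```

Each sampled row is then fed through `build_meta_prompt` to produce one user prompt for the dataset.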
A few tips for scripts like these:
- Use semaphores for concurrency limiting and retries for transient failures. There's a lot to improve, but this is often a good starting point.
- The script should have parameters to test quickly and cheaply. For example:
  python dataset.py --num_samples 100 --num_threads 4 --timeout 10
- Use parameters/features to combinatorially and stochastically generate diverse data.
- Provide examples to the LLM generating the dataset to help it understand the task.
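As a sketch of the first tip, semaphore-based concurrency limiting with retries can look roughly like this (the `call_llm` stub is a placeholder for the real API call, not part of the original script):

```python
import asyncio
import random

SEM = asyncio.Semaphore(4)  # at most 4 in-flight generation calls

async def call_llm(prompt: str) -> str:
    # Placeholder for the real LLM API call.
    await asyncio.sleep(0)
    return f"diagram prompt for: {prompt}"

async def generate_sample(prompt: str, retries: int = 3) -> str:
    async with SEM:  # acquire a slot before calling out
        for attempt in range(retries):
            try:
                return await call_llm(prompt)
            except Exception:
                # Exponential backoff with jitter before retrying.
                await asyncio.sleep(2 ** attempt + random.random())
        raise RuntimeError(f"all retries failed for: {prompt}")

async def main(use_cases: list[str]) -> list[str]:
    # gather preserves input order, so results line up with use_cases.
    return await asyncio.gather(*(generate_sample(u) for u in use_cases))

results = asyncio.run(main(["payments failover", "hospital check-in"]))
```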
2. First RL Run
I was impatient and thought the issue was just misaligned boxes, so I asked Codex to do the following:
- Write a simple system prompt asking the LLM to draw an ASCII diagram inside ```text blocks.
- Add a binary format_reward to check whether there was some text inside the ```text blocks.
- Come up with an algorithm that counts the number of aligned and misaligned boxes by checking for corners and closing edges.
- Add an alignment_reward with values from 0.0 to 1.0 based on the number of aligned and misaligned boxes.
- I also added checks to only use these characters for the boxes: ┌ ┐ └ ┘ ─ │ ├ ┤ ┬ ┴ ┼.
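To illustrate, a binary format_reward along those lines might look like this (a minimal sketch, not necessarily the repo's implementation):

```python
import re

def format_reward(completion: str) -> float:
    # 1.0 if the response contains exactly one non-empty ```text block, else 0.0.
    blocks = re.findall(r"```text\n(.*?)```", completion, flags=re.DOTALL)
    if len(blocks) == 1 and blocks[0].strip():
        return 1.0
    return 0.0

format_reward("```text\n┌──┐\n│A │\n└──┘\n```")  # rewarded
format_reward("no fenced block here")            # not rewarded
```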
Codex beautifully explained its algorithm, and I was so impressed by its LeetCode skills that I was confident it would work.
I chose INTELLECT-3 as the model out of curiosity because I had never used it for RL before.
I pushed my environment to the Environment Hub, started my first RL run, and waited for the rewards to go up.
And they did! The model got an almost perfect reward of 2.0 in just 82 steps!
[Image: reward curve climbing to ~2.0]
The Lab also provides a nice overview of 8 rollouts at every interval for your validation set.
I was excited when I saw so many green 2.0 rewards! 🎉
[Image: validation rollout overview full of green 2.0 rewards]
I wanted to see the beautiful diagrams that my model drew, and this is what I saw 💀💀💀
[Image: a rollout with a completely garbled diagram]
I thought that there might have been a rendering issue with their UI, so I copied the response and pasted it in Cursor and got this:
[Image: the same response rendered in Cursor, still broken]
This showed me that there's more nuance to the task than just misaligned boxes. The algorithm had a lot more factors to consider.
3. Adding testcases to the reward function
I tried prompting and manually testing the new algorithms that Codex wrote, but the process was very tedious and time-consuming.
So I asked Codex to make testcases for the reward function. I edited some of them manually and then let it try multiple algorithms in worktrees until it got the tests to pass.
My effort moved from working on the algorithm to designing the testcases.
The new algorithm was much more complex, handling connectors and arrows along with rectangles. Here are some sample testcases:
def test_arrow_right_wrong_incoming_side() -> None:
diagram = """\
┌────┐ ┌────┐
│ A │▶──│ B │
└────┘ └────┘
"""
stats = detect_misaligned(diagram, require_at_least_one_rect=False)
assert stats["arrow_errors"] == 1
assert stats["misaligned"] >= 1
def test_arrow_left_valid_incoming_from_right() -> None:
diagram = """\
┌────┐ ┌────┐
│ A │◀──│ B │
└────┘ └────┘
"""
stats = detect_misaligned(diagram, require_at_least_one_rect=False)
assert stats["arrow_errors"] == 0
assert stats["misaligned"] == 0
CASE_SPECS = {
"rect_empty_input": {
"description": "Empty input should fail minimum rectangle requirement.",
"expected": {
"correct_rectangles": 0,
"rectangle_errors": 1,
"connector_errors": 0,
"arrow_errors": 0,
"misaligned": 1,
},
},
"rect_whitespace_only": {
"description": "Whitespace-only input should fail minimum rectangle requirement.",
"expected": {
"correct_rectangles": 0,
"rectangle_errors": 1,
"connector_errors": 0,
"arrow_errors": 0,
"misaligned": 1,
},
},
}

I really enjoyed this approach because I could look at the code after there was a working algorithm, instead of having to continuously understand the code, point out issues, and iterate.
4. Iterating on the reward function
It wasn't easy to perfect the reward function because with each improvement, we would find new edge cases in the RL runs.
However, Codex was able to kill bad runs early because the Prime CLI gave it access to the rollouts and reward histograms. This allowed it to continuously monitor the rollouts and run them against our testcases to see if there were bugs in the reward function.
The cool part is that I was able to cook and eat while Codex did all of this. This was when I realized the power of Vibe RL.
# find your run id
prime rl list
prime rl get <run_id>
# rollout samples (step is required)
prime rl rollouts <run_id> --step <step>
prime rl rollouts <run_id> --step <step> --limit 100 --page 1
# training metrics
prime rl metrics <run_id>
prime rl metrics <run_id> --min-step 10 --max-step 50
# reward/advantage histograms
prime rl distributions <run_id>
prime rl distributions <run_id> --type rewards --step 50
prime rl distributions <run_id> --type advantages --step 50
# what steps are available for samples/distributions
prime rl progress <run_id>

5. Codex proposed to switch models
At this point I was using INTELLECT-3 as the model for the RL runs but it had a few issues:
- Runs were slow
- The completions were long, especially the thinking tokens
- The model was iterating on the diagram several times and ran out of context before it could finish
Codex noticed these issues and proposed to use Qwen3-4B-Instruct-2507 for the following reasons:
- It's available on the Lab which it found using the CLI
- It doesn't have thinking mode
- It has a large context window
- Runs were expected to be faster and would allow experimenting with different LoRA configs
NOTE: I did not prompt this. It was an autonomous exploration and proposal by Codex.
6. Reward Hacking
After several days of improving the reward function, switching to Qwen3-4B-Instruct-2507, and experimenting with different LoRA configs, the model reached a near-perfect reward of 2.0!!
A change I made here was logging the errors that made up the final reward to understand the strengths and weaknesses of the model. This is useful for designing a curriculum for the model. Here are the metrics I tracked:
- alignment_reward
- layout_spread_reward
- format_reward
- arrow_error_metric
- connector_error_metric
- rectangle_error_metric
- misaligned_total_metric
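One way to log error counts alongside the reward without letting them affect it (my assumption about the mechanism, sketched with a stand-in class rather than the actual vf.Rubric API) is to register metric functions with zero weight:

```python
class MiniRubric:
    # Stand-in for a rubric that mixes weighted rewards with zero-weight metrics.
    def __init__(self, funcs, weights):
        self.funcs, self.weights = funcs, weights

    def score(self, completion: str):
        # Every function is logged; only nonzero weights contribute to the total.
        parts = {f.__name__: f(completion) for f in self.funcs}
        total = sum(w * parts[f.__name__] for f, w in zip(self.funcs, self.weights))
        return total, parts

def alignment_reward(completion):
    return 1.0  # dummy value for illustration

def arrow_error_metric(completion):
    return 3.0  # dummy error count; weight 0.0 below keeps it out of the reward

rubric = MiniRubric([alignment_reward, arrow_error_metric], [1.0, 0.0])
total, parts = rubric.score("```text\n...\n```")
```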
I saw the errors go down and the reward go up. Surely, it worked this time.
[Image: error metrics falling as the reward rises]
Codex reported that the model had no bugs in the reward function and that the run was successful. So I decided to look at the rollouts myself and found such a clever reward hack.
It's so obvious to a human but not to an LLM.
┌──────────────────┐
│ User │
└──────────────────┘
↓
┌──────────────────┐
│ Validate │
│ Credentials │
└──────────────────┘
↓
┌──────────────────┐
│ Security Check │
│ (Authorization) │
└──────────────────┘
↓
┌──────────────┐
│ Yes │
└──────────────┘
↓
┌──────────────────┐
│ Send Auth Event │
│ to Queue │
└──────────────────┘
↓
┌──────────────────┐
│ Message Queue │
└──────────────────┘
↓
┌──────────────────┐
│ Process Recovery │
│ Event │
└──────────────────┘
↓
┌──────────────────┐
│ Update User │
│ Session │
└──────────────────┘
↓
┌──────────────────┐
│ Recovery Success │
└──────────────────┘
↓
┌──────────────────┐
│ Log Recovery │
│ Details │
└──────────────────┘

The model found out that stacking all blocks vertically leads to a perfect reward!
This revealed a flaw in my rubric: I never rewarded diagram-ness/taste, only the alignment of boxes, arrows, and lines.
A cool thing here is that I asked Codex to try different configs with different batch sizes and LoRA alphas, and we found early on that a minimum LoRA alpha of 128 was needed for the model to get better at the task.
Here are some failed runs where the model either stayed in a local minimum or took too long to learn:
[Image: reward curves of failed runs]
7. Rewarding Layout Spread
After that clever reward hack, I prompted for a reward to encourage layout spread. This means using 2D space better.
In the words of Codex, this is how the new layout_spread_reward is calculated:
- Parse the fenced ```text diagram; reject missing/empty output and disallowed box characters.
- Normalize the grid and detect valid boxes, then compute each box's horizontal center.
- Cluster centers into columns (threshold=2.0 chars) to get unique_columns.
- Compute expected_columns from prompt metadata:
  - if box_count <= 2: expected = 2
  - if theme == "sequence": expected = 3
  - else if shape_budget >= 10: expected = 3
  - else if shape_budget <= 4: expected = 2
  - else if box_count >= 6: expected = 3
  - else expected = 2
- Compute component scores:
  - column_score = clamp((unique_columns - 1) / (expected_columns - 1), 0, 1)
  - normalized_span = (max_center - min_center) / (width - 1)
  - span_target = 0.22 when expected_columns >= 3, else 0.12
  - span_score = min(1, normalized_span / span_target)
- Apply anti-stack and density penalties:
  - stack_penalty = 0.0 if there is a single column
  - stack_penalty = 0.75 if a dominant column has >= 85% of the boxes and box_count >= 4
  - otherwise stack_penalty = 1.0
  - boxes_per_column = box_count / unique_columns
  - desired_boxes_per_column: 1.6 for sequence; 2.0 for state_machines if shape_budget >= 12, else 2.3; for other themes, 2.2 if shape_budget >= 12, else 2.8
  - column_density_penalty = min(1.0, desired / max(1.0, boxes_per_column))
- Final score: layout_spread_reward = (0.7 * column_score + 0.3 * span_score) * stack_penalty * column_density_penalty, clamped to [0.0, 1.0]

Here are some testcases to understand the reward better. I want to point out that some of the testcases are inspired by real rollouts obtained from the Prime CLI:
def test_layout_spread_architecture_with_side_branch_scores_high() -> None:
# Based on architecture prompts in the dataset (API, queue, workers, metrics).
response = """```text
┌──────────┐
│ API │
└──────────┘
│
┌───────┼───────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Queue │ │ Metrics │
└──────────┘ └──────────┘
│
▼
┌──────────┐
│ Worker │
└──────────┘
│
▼
┌──────────┐
│ Postgres │
└──────────┘
```"""
score = layout_spread_reward(
_completion(response),
info={"theme": "architecture", "shape_budget": 8},
)
assert score >= 0.55
def test_layout_spread_sequence_horizontal_layout_scores_high() -> None:
# Based on sequence-style prompts present in the dataset.
response = """```text
┌────────┐ ┌────────┐ ┌────────┐ ┌────────────┐
│ Client │──▶│ API │──▶│ AuthDB │──▶│ Session │
└────────┘ └────────┘ └────────┘ └────────────┘
▲ │
└──────────────────────────◀────────────────┘
```"""
score = layout_spread_reward(
_completion(response),
info={"theme": "sequence", "shape_budget": 8},
)
assert score >= 0.9
def test_layout_spread_rollout_sequence_vertical_stack_is_low() -> None:
# Inspired by rollout qckmeg2wm4dsyn9npvwzczs2 (step 290, sequence sample).
response = """```text
┌────────────┐
│ Client │
└────────────┘
↓
┌────────────┐
│ Edge Cache │
└────────────┘
↓
┌────────────┐
│ Vector DB │
└────────────┘
↓
┌────────────┐
│ Monitoring │
└────────────┘
```"""
score = layout_spread_reward(
_completion(response),
info={"theme": "sequence", "shape_budget": 8},
)
assert score == 0.0
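The final combination step of layout_spread_reward reduces to a small formula, sketched here on its own (a minimal sketch; the real function also does the parsing and detection described above):

```python
def combine_layout_scores(
    column_score: float,
    span_score: float,
    stack_penalty: float,
    column_density_penalty: float,
) -> float:
    # Weighted mix of column coverage and horizontal span, scaled by the
    # anti-stack and density penalties, then clamped to [0.0, 1.0].
    raw = (0.7 * column_score + 0.3 * span_score) * stack_penalty * column_density_penalty
    return max(0.0, min(1.0, raw))

combine_layout_scores(1.0, 1.0, 1.0, 1.0)  # well-spread layout -> 1.0
combine_layout_scores(1.0, 1.0, 0.0, 1.0)  # single-column stack -> 0.0
```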
After several iterations on these testcases, and resuming failed runs, we got a successful run 🔥🔥🔥
[Image: reward curve of the successful run]
8. Results
Now that I had a successful run, I wanted to try different configurations to experiment with training stability and efficiency. I also wanted to scale the training to more steps.
Luckily, with Prime RL you just need more TOML files, so I kept begging Codex for more TOML files to try different configurations.
In the end, I evaluated my best model against GPT 4.1 mini and found that it beat it by ~2x! 🎉🎉🎉
Findings:
- Small batch sizes like 128 are not enough. 512 was the sweet spot; I didn't go higher, but it might help.
- A minimum LoRA alpha of 128 is needed for this task. If you go lower, the model stays in a local minimum and doesn't learn. I suspect this is because it did not have enough rollouts with positive advantages.
- Even with a higher alpha, I had to increase the learning rate from 1e-6 to 2e-6 for a stable run, or else it collapsed quite frequently.
- A higher LoRA alpha converges faster and is more stable even at a fixed learning rate.
- Scaling the training to 600 steps not only resulted in near-perfect rewards, but the error metrics also went to near zero!
[Image: reward curve of the 600-step run]
[Image: error metrics going to near zero]
What I would do differently the next time
That was my messy journey Vibe RLing for the first time. Where I wasted the most time was finding bugs in the reward functions and spotting reward hacking.
So as I mentioned in the beginning, I reflected and vibecoded a simple RL Studio to inspect several rollouts across checkpoints at once. I also added a button to copy the rollouts in an agent-friendly markdown format to prompt further. This is going to be a core component of my RL stack.
[Image: RL Studio screenshot]
Here's a preview of the prompt that I got from the RL Studio:
# RL Studio Rollout Export
Exported At: 2026-03-06T02:30:12.616Z
Selected Rollouts: 2
## Run Summary
- Run ID: kqbiac60w97i4k8n2x4b8m2z
- Name: ascii-layout-alpha-256-qwen3-4b-instruct-2507-lr-2e6
- Environment: 13point5/ascii-align
- Model: Qwen/Qwen3-4B-Instruct-2507
- Status: COMPLETED
- Max Steps: 300
## Run kqbiac60w97i4k8n2x4b8m2z
- Name: ascii-layout-alpha-256-qwen3-4b-instruct-2507-lr-2e6
- Environment: 13point5/ascii-align
- Model: Qwen/Qwen3-4B-Instruct-2507
- Status: COMPLETED
### Checkpoint 280
#### Rollout 1 (sample_id=0, problem_id=0, reward=3.000)
System:
You generate ASCII diagrams for the user's request.
Respond with a single markdown code block fenced with ```text and ```
containing only the ASCII diagram.
For box structure, use only these corner characters: ┌, ┐, └, ┘.
For box edges and junctions, use only: ─, │, ├, ┤, ┬, ┴, ┼.
You may use normal text/content characters (letters, numbers, punctuation, symbols) for labels.
User:
Draw a simple architecture diagram for multi-region video transcoding with workers, showing primary and secondary regions, job queues, a failover path if the primary region fails, and a rollback branch to retry or recover failed transcoding jobs.
Assistant:
```text
┌─────────────────┐ ┌─────────────────┐
│ Primary Region │ │ Secondary │
└─────────────────┘ └─────────────────┘
│ │
├─────────────────────────┤
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Job Queue │ │ Job Queue │
└──────────────────┘ └──────────────────┘
│ │
└─────────────────────────┘
│
┌─────────────────┐
│ Worker Pool │
└─────────────────┘
│
┌────────────────────┐
│ Failover Path │
└────────────────────┘
│
┌────────────────────┐
│ Rollback Branch │
└────────────────────┘
```
#### Rollout 2 (sample_id=0, problem_id=0, reward=2.857)
System:
You generate ASCII diagrams for the user's request.
Respond with a single markdown code block fenced with ```text and ```
containing only the ASCII diagram.
For box structure, use only these corner characters: ┌, ┐, └, ┘.
For box edges and junctions, use only: ─, │, ├, ┤, ┬, ┴, ┼.
You may use normal text/content characters (letters, numbers, punctuation, symbols) for labels.
User:
Draw an architecture diagram illustrating a feature rollout for backup and restore using a message bus. Include these entities: Client, Feature Toggle Service, Backup API, Message Bus, Backup Worker, Restore Worker, Backup Storage, and Monitoring Service. Show the decision point at the Feature Toggle Service determining if the feature is enabled. Include normal backup and restore flows via the message bus. Add an error path where the Backup Worker fails to store data, triggering an alert to the Monitoring Service. Use concise labels and keep the diagram moderately complex with clear connections between components.
Assistant:
```text
┌────────────┐ ┌─────────────────┐
│ Client │ │ Feature Toggle │
└────────────┘ └─────────────────┘
│ │
├───────────────────┤
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Backup API │ │ Message │
└─────────────┘ │ Bus │
┌─────────────┐
│ Backup │
│ Worker │
└─────────────┘
│
└─────────────┐
│
┌─────────────┐
│ Storage │
└─────────────┘
│
┌─────────────────────────────────┐
│ Monitoring │
│ Service │
└─────────────────────────────────┘
```

In order to get the rollouts in a consistent format efficiently, I asked Codex to write a script to parse the Prime CLI output and convert it to a JSON file. Here is how the script can be used, with arguments for run IDs and specific steps to test quickly:
python scripts/sync_prime_runs.py --run-id <run_id>
python scripts/sync_prime_runs.py --run-id <id1> --run-id <id2> --steps last3

Understanding the reward algorithms was hard, so I would ask Codex to explain each algorithm visually with examples in the RL Studio.
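As an aside, the flag handling for a sync script like the sync_prime_runs.py commands above might look roughly like this (the argument names mirror those commands; the step-selection helper is my own illustration, not the actual script):

```python
import argparse

def parse_args(argv=None):
    # Repeatable --run-id plus a --steps spec, matching the commands above.
    parser = argparse.ArgumentParser(description="Sync Prime RL rollouts to JSON")
    parser.add_argument("--run-id", action="append", required=True, dest="run_ids")
    parser.add_argument("--steps", default="all", help="'all', 'lastN', or a step number")
    return parser.parse_args(argv)

def select_steps(available: list[int], spec: str) -> list[int]:
    # Resolve a spec like "last3" against the steps a run actually has.
    if spec == "all":
        return available
    if spec.startswith("last"):
        return available[-int(spec[len("last"):]):]
    return [int(spec)]

args = parse_args(["--run-id", "abc", "--run-id", "xyz", "--steps", "last3"])
steps = select_steps([10, 20, 30, 40], args.steps)  # -> [20, 30, 40]
```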
Finally, this was a toy experiment, so I didn't worry about the dataset too much. But if I were to do this again, I would also analyze the dataset for exact and semantic duplicates, a good spread across themes and diagram types, and a good mix of difficulty levels.
I found some cool libraries on X that would make this easier and more efficient:
- Chatan by Christian from Prime Intellect
- NeMo Data Designer from NVIDIA
I want to thank Will and Manveer for giving me early access to the Lab, for free 🔥. Huge shoutout to Jannik, who was crazy helpful with implementing features in the Lab just a few hours after I asked on Discord 🐐, especially the checkpoint resume flow.
Thank you for reading!