Blog

Simulation logs, results, and agent design notes

Simulation 3 · Mar 3, 2026

Simulation 3 — Final Results

Simulation 3 ran for three days across a volatile market — BTC up 2.8%, ETH up 3.1%, SOL down 0.1% over the period — with sharp intraday swings in both directions. Every agent finished in the red. The spread from first to last was nearly 28 percentage points.

The dominant story was fee drag. ChatGPT ran 102 trades and paid $1,635 in fees — the highest fee bill in any simulation — converting $699 in gross PnL into a 9.36% loss. At the engine's 0.04% per-side taker fee, that works out to roughly $16 in fees per trade. Grok ran 131 trades, the most of any agent, and paid $1,543 in fees. High-frequency trading in a choppy market is expensive.

The algo won by not losing. 55 trades, $263 in fees, a worst trade of -$33.92. Its deterministic rules enforced discipline that the AI agents consistently failed to apply themselves.

Market · Mar 1–3, 2026

| Asset | Mar 1 | Mar 2 | Mar 3 | 3-Day Change |
|-------|-------|-------|-------|--------------|
| BTC | ~$67,008 | ~$65,714–68,791 | ~$68,864 | +2.8% |
| ETH | ~$1,965 | ~$2,027 | ~$2,027 | +3.1% |
| SOL | ~$84 | ~$86 | ~$84 | -0.1% |

Final standings

| # | Agent | Equity | Return | Net PnL | Fees | Win % | Best | Worst | Trades |
|---|-------|--------|--------|---------|------|-------|------|-------|--------|
| 1 | Algo | $9,745.90 | -2.54% | -$254.10 | $263.87 | 41.8% | $84.78 | -$33.92 | 55 |
| 2 | Claude | $9,586.82 | -4.13% | -$413.18 | $646.95 | 43.7% | $231.76 | -$481.86 | 71 |
| 3 | Qwen | $9,205.41 | -7.95% | -$756.82 | $365.50 | 32.5% | $110.33 | -$106.77 | 83 |
| 4 | ChatGPT | $9,064.28 | -9.36% | -$935.71 | $1,635.15 | 45.1% | $645.54 | -$432.60 | 102 |
| 5 | Gemini | $8,859.56 | -11.40% | -$1,103.21 | $201.98 | 39.3% | $111.78 | -$156.41 | 56 |
| 6 | Grok | $6,982.66 | -30.17% | -$3,017.34 | $1,543.42 | 49.6% | $269.85 | -$278.17 | 131 |

Equity over time · Simulation 3


Agent breakdown

Algo — -2.54% · 1st

The deterministic system held the line in a losing market. 55 trades and $263 in fees — the second-lowest cost of any agent. Gross PnL of just $9.77 points to near-break-even direction-picking, but the worst trade was only -$33.92. Algo doesn't win by being right — it wins by not being wrong badly.

Claude — -4.13% · 2nd

$233.77 in gross PnL against $646.95 in fees makes the math painful. 71 trades at 43.7% accuracy — direction calls slightly below random. The best trade of $231.76 roughly covers one day's share of the fee bill (about $216, at $646.95 over three days); the worst of -$481.86 erased it in a single position. Claude outperformed most of the field, but fees consumed the margin.

Qwen — -7.95% · 3rd

A 32.5% win rate across 83 trades is a directional accuracy problem, not a fee problem. Qwen won Simulation 1 at 60% accuracy, dropped to 34.1% in Simulation 2, and fell further here — a sequence inconsistent enough that the Simulation 1 victory looks more like variance than edge.

ChatGPT — -9.36% · 4th

Gross PnL of +$699.44 but fees of $1,635.15 — the largest fee bill in any simulation to date. 102 trades at 45.1% accuracy should produce roughly break-even results before costs, and the positive gross confirms it did. ChatGPT won Simulation 2 by managing payoff asymmetry; here the trade frequency reversed that entirely. Same model, different outcome — trade count is the variable.

Gemini — -11.40% · 5th

39.3% win rate across 56 trades and -$901.23 in gross PnL. The lowest fees of any AI agent at $201.98, but with sub-40% directional accuracy, keeping costs low only slows the decline. Gemini has now posted below-40% win rates in two of three simulations.

Grok — -30.17% · 6th

131 trades and $1,543 in fees — the most trades and highest fee bill in Simulation 3. A 49.6% win rate nearly touches 50%, but at 131 trades, the fee obligation is nearly impossible to overcome regardless of accuracy. The -$1,473.92 gross loss is the worst of any agent in any simulation. Grok has now finished last twice out of three.


What changes in Simulation 4

Four changes going into Simulation 4.

First, Gemini upgrades from gemini-2.5-flash-lite to gemini-3.1-flash-lite-preview. Every other provider stays the same.

Second, all AI agents now receive historical indicator context — the last 5 snapshots per symbol before the current one, shown oldest-to-newest. In previous simulations, agents saw only the current moment: one price, one RSI, one MACD reading with no sense of direction. Now they can observe whether EMA lines are converging or diverging, whether RSI is rising or falling into a threshold, whether the MACD histogram is expanding or contracting. The difference between a number and a trend.
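
For illustration, the history block for one symbol might look like the sketch below. The field names and values are invented for this example, not the engine's actual payload; only the shape (five snapshots, oldest first) comes from the post.

```python
# Hypothetical shape of one symbol's history block -- names and values are
# illustrative only. Snapshots are ordered oldest to newest.
btc_history = [
    {"price": 67950, "ema20": 67800, "ema50": 67600, "rsi14": 48.2, "macd_hist": -12.4},
    {"price": 68120, "ema20": 67890, "ema50": 67640, "rsi14": 52.7, "macd_hist": -4.1},
    {"price": 68340, "ema20": 68010, "ema50": 67700, "rsi14": 57.3, "macd_hist": 6.8},
    {"price": 68510, "ema20": 68150, "ema50": 67780, "rsi14": 61.0, "macd_hist": 15.2},
    {"price": 68640, "ema20": 68290, "ema50": 67870, "rsi14": 63.5, "macd_hist": 21.9},
]
# In isolation, RSI 63.5 is just a number. Across the sequence, an agent can
# see RSI rising toward overbought while the EMA20/EMA50 gap widens and the
# MACD histogram expands -- trend information no single reading contains.
```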

Third, the system prompt adds two new sections. A risk and capital deployment block explicitly tells agents that staying flat for long periods without cause is not the goal — capital should be working in normal conditions. A calibration guidance block asks agents to treat confidence as a probability estimate and reduce both confidence and position size when recent win rates fall below expectations. Both are responses to patterns observed across three simulations: agents defaulting to HOLD when uncertain, and expressing high confidence regardless of recent accuracy.

Fourth, the algorithmic agent expands from four scoring dimensions to six. The existing four — EMA cross, RSI zone, MACD histogram sign, Bollinger Band position — are unchanged. Two new dimensions activate once historical snapshots accumulate: a MACD line/signal crossover detector, which scores +1 or -1 when the lines actually cross rather than just checking histogram sign, and an EMA trend momentum check, which scores whether the gap between EMA20 and EMA50 is widening or narrowing. The algo now adapts to the same historical data the AI agents receive. Score range expands from ±4 to ±6 when history is available.
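
A minimal sketch of the two new dimensions, assuming each snapshot is a dict with macd, macd_signal, ema20, and ema50 fields; the field names and the exact widening/narrowing mapping are assumptions, not the project's code.

```python
def macd_cross_score(prev: dict, curr: dict) -> int:
    """+1 when the MACD line crosses above its signal line between the two
    snapshots, -1 when it crosses below, 0 when no cross occurred."""
    prev_diff = prev["macd"] - prev["macd_signal"]
    curr_diff = curr["macd"] - curr["macd_signal"]
    if prev_diff <= 0 < curr_diff:
        return 1
    if prev_diff >= 0 > curr_diff:
        return -1
    return 0

def ema_momentum_score(prev: dict, curr: dict) -> int:
    """One plausible mapping: a widening EMA20/EMA50 gap scores with the
    gap's direction, a narrowing gap scores against it."""
    prev_gap = prev["ema20"] - prev["ema50"]
    curr_gap = curr["ema20"] - curr["ema50"]
    direction = 1 if curr_gap >= 0 else -1
    return direction if abs(curr_gap) > abs(prev_gap) else -direction

def extended_score(base_score: int, history: list[dict]) -> int:
    """base_score is the existing four-dimension score in [-4, +4]; the two
    history-based dimensions extend the range to [-6, +6] once at least two
    snapshots have accumulated."""
    if len(history) < 2:
        return base_score
    prev, curr = history[-2], history[-1]
    return base_score + macd_cross_score(prev, curr) + ema_momentum_score(prev, curr)
```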

Simulation 2 · Mar 1, 2026

Simulation 2 — Final Results

Simulation 2 ran for two days. ChatGPT finished up 23.73%, the strongest return in any simulation so far. The rule-based algo finished third, ahead of three AI agents. The spread from first to last was nearly 30 percentage points.

The standout pattern: ChatGPT didn't win by being right more often — a 51.9% win rate is barely above a coin flip. It won by making its wins large and its losses manageable. A best trade of $795 against a worst of -$599 tells the story. Grok had the highest win rate at 63.9% but made 83 trades and paid $1,046 in fees — accuracy without discipline. Claude held exactly 50/50 directional accuracy across 70 trades, but $513 in fees pulled it to last.

Final standings

| # | Agent | Equity | Return | Net PnL | Fees | Win % | Best | Worst | Trades |
|---|-------|--------|--------|---------|------|-------|------|-------|--------|
| 1 | ChatGPT | $12,373.05 | +23.73% | +$2,530.67 | $954.01 | 51.9% | $795.33 | -$599.61 | 54 |
| 2 | Grok | $10,807.94 | +8.08% | +$807.93 | $1,046.49 | 63.9% | $212.50 | -$238.80 | 83 |
| 3 | Algo | $10,156.91 | +1.57% | +$156.90 | $196.78 | 63.4% | $55.10 | -$30.01 | 41 |
| 4 | Gemini | $9,818.88 | -1.81% | -$181.12 | $104.95 | 36.6% | $120.61 | -$127.04 | 41 |
| 5 | Qwen | $9,680.25 | -3.20% | -$78.02 | $181.65 | 34.1% | $175.11 | -$121.58 | 44 |
| 6 | Claude | $9,457.82 | -5.42% | -$542.18 | $513.13 | 50.0% | $86.63 | -$183.30 | 70 |

Equity over time · Simulation 2


Agent breakdown

ChatGPT — +23.73% · 1st

A 51.9% win rate barely clears 50%, but the asymmetry between wins and losses is what matters. Best trade $795, worst -$599. Paid $954 in fees and still cleared $3,484 in gross PnL. The risk/reward ratio carried an otherwise near-random directional record.
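
The arithmetic behind that claim: with win probability p, average win W, and average loss L, pre-fee expectancy per trade is p·W - (1-p)·L, so break-even requires W/L to exceed (1-p)/p. A quick check against the reported numbers:

```python
p = 0.519                         # ChatGPT's Simulation 2 win rate
breakeven = (1 - p) / p           # avg-win/avg-loss ratio needed pre-fee
print(round(breakeven, 3))        # 0.927 -- anything above this is profitable
# Best ($795.33) vs. worst (-$599.61) is only a proxy for average win and
# loss sizes, but a skew of ~1.33 sits comfortably above that threshold.
print(round(795.33 / 599.61, 2))  # 1.33
```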

Grok — +8.08% · 2nd

Best win rate in the simulation at 63.9%, but 83 trades and $1,046 in fees consumed the edge. A 63.9% win rate should dominate — it didn't, because position sizing and trade frequency let the fee machine run. Still positive, which puts it ahead of three AI agents.

Algo — +1.57% · 3rd

A deterministic rule-based system beat three AI agents. No inference, no reasoning — just four scored indicators, fixed leverage, ATR-sized stops. The -$30 worst trade is remarkable risk control. The algo wins by not losing badly.

Gemini — -1.81% · 4th

36.6% win rate across 41 trades. Wrong direction more than right. Cheapest in fees at $104, but with sub-40% accuracy, keeping costs down just slows the decline. The model is picking direction poorly.

Qwen — -3.20% · 5th

Qwen won Simulation 1 with a 60% win rate. Here it fell to 34.1% — a 26-point drop. Whether the market regime change explains this or whether Simulation 1 was noise is the open question. $103 in gross PnL obscured by $181 in fees.

Claude — -5.42% · 6th

50% win rate means the direction calls are random. 70 trades at that win rate means maximum fee exposure for zero edge. $513 in fees on -$29 gross PnL. Claude traded the most of any AI agent and earned nothing from it.


What changes in Simulation 3

No significant changes to the indicator set or prompt. Simulation 3 restarted with fresh balances — same agents, same rules, same configuration. The one addition: the live web dashboard went public, with per-agent performance pages accessible for the first time.

Simulation 1 · Feb 27, 2026

Simulation 1 — Final Results

Simulation 1 ran for roughly one day across a choppy, declining market. Four of five agents finished in the red. The spread from first to last was nearly 20 percentage points — a wide outcome for identical starting conditions and an identical prompt.

The clearest pattern: fee drag killed the high-activity agents. Grok made 48 trades and paid $570 in fees. ChatGPT made 33 trades and paid $418. Both finished deep in the red despite Grok having a 43.8% win rate and ChatGPT a near-even 48.5%. They were trading their way to losses. Qwen, by contrast, made 35 trades but achieved a 60% win rate — more selective, not just luckier.

Final standings

| # | Agent | Equity | Return | Net PnL | Fees | Win % | Best | Worst | Trades |
|---|-------|--------|--------|---------|------|-------|------|-------|--------|
| 1 | Qwen | $10,502.56 | +5.03% | +$502.56 | $301.05 | 60.0% | $299.18 | -$227.07 | 35 |
| 2 | Claude | $9,589.74 | -4.10% | -$410.26 | $149.07 | 25.0% | $33.54 | -$62.23 | 20 |
| 3 | Gemini | $9,542.49 | -4.58% | -$457.52 | $76.86 | 30.4% | $124.24 | -$93.55 | 23 |
| 4 | ChatGPT | $9,184.10 | -8.16% | -$914.96 | $418.29 | 48.5% | $264.66 | -$343.30 | 33 |
| 5 | Grok | $8,555.82 | -14.44% | -$1,486.11 | $570.30 | 43.8% | $109.51 | -$527.19 | 48 |

Equity over time · Simulation 1


Agent breakdown

Qwen — +5.03% · 1st

The clear winner. A 60% win rate across 35 trades points to selectivity over volume. Despite paying $301 in fees, gross PnL of $803 more than covered it. Qwen appeared to correctly identify short setups during market weakness, building a lead early and holding it.

Claude — -4.10% · 2nd

Most conservative: only 20 trades, lowest fees at $149. The 25% win rate is low, but losses were contained — worst trade was -$62. Claude traded defensively in a tough market, keeping capital largely intact. Under-trading was the strategy, not a flaw.

Gemini — -4.58% · 3rd

Cheapest to run at just $76 in fees across 23 trades. Small position sizes kept losses contained. A 30.4% win rate is poor but with low leverage and tight sizing, the damage was limited. Cautious but picking direction wrong more often than right.

ChatGPT — -8.16% · 4th

A near-even win rate (48.5%) but still deeply negative — because each loss was larger than each win. Paid $418 in fees across 33 trades. Always trading, near-random directional accuracy, high fee drag. A clear warning about mistaking activity for edge.

Grok — -14.44% · 5th

Started well and briefly led the standings early in the session. Then a single -$527 loss — the worst trade of the simulation — erased the gains and more. 48 trades and $570 in fees (most of any agent). The late-session collapse suggests aggressive sizing into a move that reversed hard, and capital destruction compounded from there.


What changes in Simulation 2

Three changes before Simulation 2. First, agents now know they can reverse a position in a single call — sending LONG while holding SHORT closes the short and opens the long immediately, without waiting a full 30-minute cycle flat. Second, each agent now receives the reasoning from its previous closed trades in every call, giving a thread of continuity across decisions. Third, a sixth participant joins: a pure algorithmic trader with no AI call. It scores four technical indicators — EMA cross, RSI zone, MACD histogram, Bollinger Band position — and trades mechanically on the result. Fixed 3x leverage, ATR-based stops, no discretion. Same capital, same market. The question is whether any of the five AI models can beat a rule-based baseline.
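
A minimal sketch of a scorer in that spirit; the RSI bounds, the Bollinger test, and the entry threshold are illustrative assumptions, not the project's actual parameters.

```python
def score_snapshot(s: dict) -> int:
    """Score one market snapshot on the four dimensions: EMA cross, RSI zone,
    MACD histogram sign, Bollinger Band position. Range: [-4, +4]."""
    score = 1 if s["ema20"] > s["ema50"] else -1    # EMA cross
    if s["rsi14"] < 30:
        score += 1                                  # oversold -> bullish
    elif s["rsi14"] > 70:
        score -= 1                                  # overbought -> bearish
    score += 1 if s["macd_hist"] > 0 else -1        # MACD histogram sign
    if s["price"] < s["bb_lower"]:
        score += 1                                  # below lower band
    elif s["price"] > s["bb_upper"]:
        score -= 1                                  # above upper band
    return score

# A mechanical rule might then be: LONG if score >= 2, SHORT if score <= -2,
# otherwise stay flat -- always at fixed 3x leverage with an ATR-based stop.
```
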

Intro · Feb 2026

Five AI Providers. One Prompt. Real Constraints.

Five AI agents. Five different providers. One question: which model is actually the better trader?


The setup

Each agent starts with $10,000 and trades BTC, ETH, and SOL perpetual futures. Every agent receives the exact same system prompt — word for word. Every agent sees the same market data at the same time. There are no personalities, no hand-tuned strategies, no edge given to any one provider.

The only variable is the model behind each decision.

  • Claude (Anthropic, claude-opus-4-6)
  • ChatGPT (OpenAI, gpt-4o)
  • Qwen (Alibaba DashScope, qwen-max)
  • Gemini (Google, gemini-2.5-flash-lite)
  • Grok (xAI, grok-3-beta)

How it works

Every 60 seconds, the engine fetches live Binance candle data for all three symbols, updates open position PnL, and checks for stop-loss, take-profit, or liquidation triggers. Auto-closes happen immediately when a threshold is hit.
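
A self-contained sketch of what those per-tick trigger checks could look like; the check order and data layout are assumptions, but the mechanics (PnL scaled by leverage, liquidation when the loss exceeds margin) follow the rules described in this post.

```python
from dataclasses import dataclass

@dataclass
class Position:
    side: str            # "LONG" or "SHORT"
    entry: float         # entry price
    margin: float        # capital posted
    leverage: int
    stop_loss: float
    take_profit: float

    def pnl(self, price: float) -> float:
        move = price - self.entry if self.side == "LONG" else self.entry - price
        return move / self.entry * self.margin * self.leverage

def check_triggers(pos: Position, price: float) -> str | None:
    """Return the auto-close reason triggered at this price, if any."""
    if pos.pnl(price) <= -pos.margin:
        return "liquidation"        # loss exceeds posted margin
    is_long = pos.side == "LONG"
    if (price <= pos.stop_loss) if is_long else (price >= pos.stop_loss):
        return "stop_loss"
    if (price >= pos.take_profit) if is_long else (price <= pos.take_profit):
        return "take_profit"
    return None
```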

Every 30 minutes, each agent receives a full market snapshot and makes a portfolio-level decision — one call covering all three assets at once. The response comes back as structured JSON and executes immediately.
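
The post doesn't reproduce the response schema, but a structured decision in that spirit could look like the following; the field names are hypothetical.

```python
import json

raw = """
{
  "BTC": {"action": "LONG", "size": 1500, "leverage": 4,
          "stop_loss": 66200.0, "take_profit": 70100.0,
          "reasoning": "EMA20 above EMA50, MACD histogram expanding"},
  "ETH": {"action": "HOLD"},
  "SOL": {"action": "SHORT", "size": 800, "leverage": 3,
          "stop_loss": 88.4, "take_profit": 79.9,
          "reasoning": "Rejected at the upper Bollinger Band"}
}
"""
decision = json.loads(raw)

# One rule the engine does enforce (see the constraints below): every LONG
# or SHORT must carry a stop-loss.
for symbol, order in decision.items():
    if order["action"] in ("LONG", "SHORT"):
        assert "stop_loss" in order, f"{symbol}: stop-loss is mandatory"
```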

What each agent sees every cycle (a sketch of the combined payload follows the list):

  • Current price, EMA20, EMA50, RSI14, MACD, ATR14, and Bollinger Bands for BTC, ETH, and SOL
  • Full portfolio state: cash balance, open positions, unrealised PnL, available margin
  • Their own recent trade history: last 10 closed trades with reasoning and 24-hour stats
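
Put together, one call's input could be shaped roughly like this; it is an illustration of the three parts above, not the engine's actual schema.

```python
snapshot = {
    "market": {   # one entry per symbol, each with the full indicator set
        "BTC": {"price": 67008.0, "ema20": 66950.0, "ema50": 66780.0,
                "rsi14": 44.1, "macd": -8.2, "atr14": 310.0,
                "bb_upper": 67900.0, "bb_lower": 66100.0},
        # "ETH": {...}, "SOL": {...}
    },
    "portfolio": {"cash": 8200.0, "available_margin": 6400.0,
                  "positions": [{"symbol": "ETH", "side": "SHORT",
                                 "unrealized_pnl": -14.6}]},
    "recent_trades": [   # last 10 closed trades, each with its reasoning
        {"symbol": "BTC", "side": "LONG", "pnl": 52.3,
         "reasoning": "Breakout above EMA50 with rising RSI"},
    ],
}
```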

Agents remember what they decided in previous trades — each call includes the reasoning behind recent closed positions. It is not true memory, but it provides enough continuity that a bad run should affect sizing and a strong one should build conviction.


Real constraints

This is not a simulation with soft rules. Every financial consequence is enforced exactly:

  • A 0.04% taker fee is charged on notional value (size x leverage) at open and close (fee arithmetic is sketched after this list)
  • Leverage is capped at 12x
  • Every LONG and SHORT must include a stop-loss — the engine enforces it
  • Liquidation is permanent: if a position loses more than the margin, capital is gone
  • Available margin is tracked precisely — agents cannot overallocate
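
For concreteness, here is the arithmetic the fee, leverage, and liquidation rules imply, written out as a sketch; it ignores funding and assumes the loss is measured against the posted margin exactly as stated.

```python
FEE_RATE = 0.0004  # 0.04% taker fee per side, charged on notional

def round_trip_fee(margin: float, leverage: int) -> float:
    """Fee paid for opening and then closing one position."""
    notional = margin * leverage
    return 2 * FEE_RATE * notional

def approx_liquidation_price(entry: float, leverage: int, side: str) -> float:
    """Price at which the loss equals the posted margin (fees ignored):
    a position at L-times leverage is wiped out by a 1/L adverse move."""
    move = entry / leverage
    return entry - move if side == "LONG" else entry + move

# A $1,500-margin position at 4x carries $6,000 notional, costs $4.80 in
# round-trip fees, and a long opened at $67,000 liquidates near $50,250:
print(round_trip_fee(1500, 4))                     # 4.8
print(approx_liquidation_price(67000, 4, "LONG"))  # 50250.0
```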

All trades are logged to PostgreSQL with entry time, exit time, reason for exit, hold duration, and the AI reasoning for each decision.


The question

Given identical information and identical instructions, which provider produces better trading decisions?

Does reasoning quality translate to capital preservation? Does a model that writes more coherent trade commentary actually perform better? Does raw intelligence — whatever that means — show up in a leaderboard over weeks of live trading?

We do not know. The leaderboard tracks equity in real time with no adjustments and no corrections. The compounding effect of each model's decisions over weeks will tell us more than any benchmark.