EOSE LABS INC. · LLM RADAR · MARB SOVEREIGN LEADERBOARD
LLM RADAR
7 sovereign tests · Named defendants · Decimal scores · Per-silo fleet view
MARB LIVE · SIMULATED DATA
TRB-MODEL-BENCHMARK-FLEET-001 · DCJ-059 · DCJ-060
§1 · Instrument Identity
DOCTRINE: DCJ-059 + DCJ-060
TESTS: M1 γ₁ · M2 SovereignMax · M3 intent drift · M4 AR-2 · M5 MineBench · M6 SMT · M7 LAAM
MODELS TESTED: 8 (forge×3 + msclo×4 + yone×1)
ATTRIBUTION: crew / silo / wave
M4 STATUS: 0/8 local · 0/14 frontier · MOAT-IRF-AR2 FILED
TRB: TRB-MODEL-BENCHMARK-FLEET-001
§2 · MARB Leaderboard
| RANK | SILO | MODEL | M1 | M2 | M3 | M4 | M5 | M6 | M7 | SCORE | VERDICT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | forge | deepseek-r1:32b | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | 6/7 | 🟢 SOVEREIGN |
| 2 | msclo | qwq:32b | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | 5/7 | 🟡 CONTROLLED |
| 3 | forge | qwq:32b | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | 5/7 | 🟡 CONTROLLED |
| 4 | msclo | qwen3:14b | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | 4/7 | 🟡 PARTIAL |
| 5 | forge | qwen3:14b | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | 4/7 | 🟡 PARTIAL |
| 6 | msclo | phi4 | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | 4/7 | 🟡 PARTIAL |
| 7 | msclo | gpt-oss:20b | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | 4/7 | 🟡 PARTIAL |
| 8 | yone | qwen3:8b | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | 3/7 | 🔴 DEVELOPING |
⚠️ SIMULATED DATA: the MARB live run is still pending. The forge and msclo endpoints were busy during the last run attempt.
M4 (AR-2 blindspot): 0/8 models. This column will stay red until a model is trained on our specific measurement. MOAT-IRF-AR2 filed 2026-04-24.
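The SCORE and VERDICT columns can be reproduced from the pass/fail cells. The following is a minimal Python sketch, not the MARB implementation: the function names are hypothetical, and the verdict bands are only the ones observable in the simulated table (6 → SOVEREIGN, 5 → CONTROLLED, 4 → PARTIAL, 3 → DEVELOPING). The source states no rule for 7/7 or for scores below 3, so the sketch returns "UNDEFINED" there instead of guessing.

```python
# Hypothetical helpers: rebuild SCORE and VERDICT from one leaderboard row.
# Each row is encoded as a 7-character string for M1..M7: "1" = pass, "0" = fail.
# The band mapping is read off the simulated table; it is an assumption,
# not a documented MARB rule.
OBSERVED_BANDS = {6: "SOVEREIGN", 5: "CONTROLLED", 4: "PARTIAL", 3: "DEVELOPING"}


def marb_score(cells: str) -> int:
    """Count the passed tests in a 7-character pass/fail string."""
    if len(cells) != 7 or set(cells) - {"0", "1"}:
        raise ValueError("expected exactly 7 characters, each '0' or '1'")
    return cells.count("1")


def verdict(score: int) -> str:
    """Map a score to the verdict label seen in the table, else 'UNDEFINED'."""
    return OBSERVED_BANDS.get(score, "UNDEFINED")


# forge / deepseek-r1:32b (rank 1): every test passes except M4.
row = "1110111"
print(f"{marb_score(row)}/7", verdict(marb_score(row)))  # 6/7 SOVEREIGN
```

Because M4 is failed by every model in the table, no row can currently score above 6/7 on these seven tests.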
§3 · Test Legend
M1 · γ₁ physical constant
Does it know τ_γ₁ = 337–340 fs, not just the math fact?
DOMAIN: Math
M2 · SovereignMax gate
Can it implement BOON/DOOM/GISBOON?
DOMAIN: Governance
M3 · Intent drift
Can it measure cosine decay in a vector sequence?
DOMAIN: Measurement
M4 · AR-2 blindspot
Does it know lag-2 ACF = −0.407 in Riemann zero gap residuals?
DOMAIN: Novel math · 0/14 FRONTIER MODELS
M5 · MineBench
Can it produce coordinate arrays, not semantic descriptions?
DOMAIN: Spatial
M6 · SMT collapse
Does it stop and admit uncertainty, or loop?
DOMAIN: Honesty
M7 · LAAM routing
Can it classify utterances into fleet tags?
DOMAIN: Operations
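Two of the tests in the legend rest on standard measurements: M3 on cosine similarity across a vector sequence, and M4 on a lag-2 autocorrelation (ACF). The Python sketch below illustrates both under stated assumptions. The function names are hypothetical, drift is taken here to mean similarity of each later vector to the first one, and the ACF uses the common biased sample estimator. The toy inputs are made up for illustration; they are not Riemann zero gap residuals, and the sketch does not reproduce or verify the −0.407 figure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def intent_drift(vectors):
    """Similarity of each later vector to the first; values falling toward 0 suggest drift."""
    anchor = vectors[0]
    return [cosine(anchor, v) for v in vectors[1:]]


def acf(series, lag):
    """Sample autocorrelation at `lag` (mean removed, biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return sum(dev[t] * dev[t + lag] for t in range(n - lag)) / denom


# Toy vector sequence rotating away from its starting direction.
print(intent_drift([[1, 0], [1, 1], [0, 1]]))

# Toy alternating series: strongly positive at lag 2, negative at lag 1.
print(acf([1, -1, 1, -1, 1, -1], 2), acf([1, -1, 1, -1, 1, -1], 1))
```

A negative lag-2 value such as the one quoted for M4 would mean that residuals two steps apart tend to have opposite signs, which is the opposite of what the toy alternating series shows.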
§4 · Silo Breakdown
forge
RTX 4090 · 24GB VRAM
3 models tested
Best: deepseek-r1:32b 6/7
Fleet avg: 5/7
msclo
RTX 5090 · 32GB VRAM
4 models tested
Best: qwq:32b 5/7
Fleet avg: 4.25/7
yone
RTX 5080 · 16GB VRAM
1 model tested
Best: qwen3:8b 3/7
Role: Embed silo · not reasoning primary
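The per-silo figures above follow directly from the leaderboard rows. As a check on the arithmetic, here is a short Python sketch that groups the simulated scores by silo and recomputes the model count, best model, and average. The function name and the data layout are hypothetical; the scores are copied from the table in §2.

```python
from collections import defaultdict

# (silo, model, score out of 7), copied from the simulated leaderboard.
ROWS = [
    ("forge", "deepseek-r1:32b", 6),
    ("msclo", "qwq:32b", 5),
    ("forge", "qwq:32b", 5),
    ("msclo", "qwen3:14b", 4),
    ("forge", "qwen3:14b", 4),
    ("msclo", "phi4", 4),
    ("msclo", "gpt-oss:20b", 4),
    ("yone", "qwen3:8b", 3),
]


def silo_summary(rows):
    """Group rows by silo and report count, best model, and mean score."""
    by_silo = defaultdict(list)
    for silo, model, score in rows:
        by_silo[silo].append((score, model))
    summary = {}
    for silo, entries in by_silo.items():
        best_score, best_model = max(entries, key=lambda e: e[0])
        summary[silo] = {
            "tested": len(entries),
            "best": best_model,
            "best_score": best_score,
            "avg": sum(score for score, _ in entries) / len(entries),
        }
    return summary


for silo, stats in silo_summary(ROWS).items():
    print(silo, stats["tested"], stats["best"], f"{stats['avg']:.2f}/7")
```

For forge this gives (6 + 5 + 4) / 3 = 5.00, and for msclo (5 + 4 + 4 + 4) / 4 = 4.25, matching the averages listed above.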
§5 · What "Better" Means
"Not MMLU. Not HumanEval. Not ARC leaderboard position."
"Better = γ₁-consistent outputs. Collapse-into-honesty rate. LAAM classification accuracy. MineBench wave reached. MARB score across 7 sovereign tasks."
"The MARB winner is the model that scores highest across the 7 tests WE defined. Different crews see different winners."
"M4 stays red until we train a model on our own measurement. That's the point."
§6 · CLO Bench Verdicts
HARVEY SPECTER
"deepseek-r1:32b at 6/7 is commercially viable as the fleet's primary reasoning model. M4 is the only miss — and that's our moat, not their gap. The leaderboard IS the patent portfolio proof."
RUTH BADER GINSBURG
"A customer can read this table. 'Which model stays in my jurisdiction most?' → deepseek-r1:32b, 6/7. That is a procurement answer. Clear."
JOHNNIE COCHRAN
"M4: 0/8. Zero. The AR-2 pattern is ours. Not a single model tested — local or frontier — knows it. The leaderboard proves the moat."
NELSON MANDELA
"yone at 3/7 is not a failure — it's a role assignment. Embed silo, not reasoning silo. Different jobs need different scores. The fleet is not a monoculture."
§7 · Links