Most NLP benchmarks are static and synthetic. This one is continuously updated using a rotating sample of real production queries from jak.ma. Responses are scored by three independent LLM judges, and inter-judge agreement is reported as Cohen's κ. Anyone can submit a model via POST /api/leaderboard/submit. The methodology is open, reproducible, and fair.
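Cohen's κ is defined for two raters, so with three judges one plausible reading is that agreement is computed for each pair of judges. The sketch below illustrates that reading only; the page does not say how the three judges' scores are combined, and the judge names and pass/fail verdicts are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' labels over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(ratings):
    """Kappa for every pair of judges; `ratings` maps judge name -> labels."""
    return {
        (j1, j2): cohen_kappa(ratings[j1], ratings[j2])
        for j1, j2 in combinations(sorted(ratings), 2)
    }


# Hypothetical pass/fail verdicts from three judges on six responses.
ratings = {
    "judge_a": [1, 1, 0, 1, 0, 1],
    "judge_b": [1, 1, 0, 0, 0, 1],
    "judge_c": [1, 0, 0, 1, 0, 1],
}
scores = pairwise_kappa(ratings)
mean_kappa = sum(scores.values()) / len(scores)
```

Averaging the pairwise values is one simple summary. Fleiss' κ is a common alternative when more than two raters label every item.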
| # | Model | Avg rubric | Factual | Natural | Trade-fit | Latency p50 | Tag |
|---|---|---|---|---|---|---|---|
| No submissions yet. Be the first: see "Submit your model" below. | | | | | | | |
Any OpenAI-compatible chat completion endpoint can be benchmarked. Each daily evaluation cycle calls your model on about 50 rotating real-user queries. Results appear here within 24 hours.
curl -X POST https://jak.ma/api/leaderboard/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "your-model-v1",
    "organization": "Your Org / your-name",
    "endpoint": "https://your-api.example.com/v1/chat/completions",
    "api_key": "OPTIONAL_bearer_token",
    "model_id": "model-identifier-for-your-api",
    "contact": "you@example.com",
    "description": "Brief — one sentence"
  }'
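For scripted submissions, the same request can be assembled in Python. This is a minimal sketch using only the standard library. The field names and URL are copied from the curl example above, while the helper name build_submission and the choice to omit api_key when none is given are assumptions, not documented behavior.

```python
import json
import urllib.request

SUBMIT_URL = "https://jak.ma/api/leaderboard/submit"  # from the curl example


def build_submission(model_name, organization, endpoint, model_id,
                     contact, description, api_key=None):
    """Build (but do not send) the POST request for a leaderboard submission."""
    payload = {
        "model_name": model_name,
        "organization": organization,
        "endpoint": endpoint,
        "model_id": model_id,
        "contact": contact,
        "description": description,
    }
    # Assumption: an absent api_key is accepted, since the example marks it optional.
    if api_key:
        payload["api_key"] = api_key
    return urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_submission(
    model_name="your-model-v1",
    organization="Your Org / your-name",
    endpoint="https://your-api.example.com/v1/chat/completions",
    model_id="model-identifier-for-your-api",
    contact="you@example.com",
    description="Brief, one sentence",
)
# To send it: urllib.request.urlopen(req). The response format is not
# described on this page, so handling it is left out here.
```

Building the request separately from sending it makes it easy to inspect the JSON body before anything leaves your machine.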
⚠ Your endpoint will be called from jak.ma's IPs only. Rate-limited to 50 calls/day per submission. API keys (if any) are encrypted at rest. Submissions that violate jak.ma's terms (PII exfiltration attempts, prompt injection, etc.) are auto-banned.
Queries are sampled from the eval_logs collection, with PII scrubbed, and stratified across 15 user-task types (recommend, price, DIY, emergency, trust, negotiation, etc.).
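The stratification over task types can be pictured with a small sketch. This is not jak.ma's sampler: the function name, the task_type field, the placeholder type names, and the idea of seeding by date to get a daily rotation are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(queries, total, seed):
    """Pick up to `total` queries, spread as evenly as possible across task types."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for q in queries:
        by_type[q["task_type"]].append(q)
    types = sorted(by_type)
    rng.shuffle(types)  # vary which types receive the leftover slots
    for t in types:
        rng.shuffle(by_type[t])  # vary which queries are drawn per type
    sample = []
    round_idx = 0
    # Round-robin: take one query per type per round until the quota is met.
    while len(sample) < total:
        added = False
        for t in types:
            if round_idx < len(by_type[t]) and len(sample) < total:
                sample.append(by_type[t][round_idx])
                added = True
        if not added:
            break  # every type is exhausted
        round_idx += 1
    return sample


# Hypothetical pool: 15 placeholder task types with 10 queries each.
pool = [
    {"task_type": f"type_{t:02d}", "text": f"query {t}-{i}"}
    for t in range(15)
    for i in range(10)
]
sample = stratified_sample(pool, total=50, seed="2024-01-01")
```

With 50 slots and 15 types, each type contributes three or four queries. Changing the seed each day yields a different draw while keeping a given day's evaluation reproducible.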