jak.ma Leaderboard

Live public benchmark · Moroccan-Darija home-services AI · daily-refreshed

The first live, public benchmark for production Moroccan-Darija AI (v0.1)

Most NLP benchmarks are static and synthetic. This one is updated continuously from a rotating sample of real production queries from jak.ma, scored by three independent LLM judges, with inter-judge agreement (Cohen's κ) reported. Anyone can submit a model via POST /api/leaderboard/submit. The methodology is open, reproducible, and fair.

Top models (last 7 days)

#	Model	Avg rubric	Factual	Natural	Trade-fit	Latency p50	Tag
No submissions yet. Be the first — see "Submit your model" below.

Per-dimension score across top 5 models

Avg score over time (7-day window)

Submit your model

Any OpenAI-compatible chat completion endpoint can be benchmarked. Each daily evaluation cycle calls your model with about 50 rotating real-user queries. Results appear here within 24 hours.

curl -X POST https://jak.ma/api/leaderboard/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model_name":   "your-model-v1",
    "organization": "Your Org / your-name",
    "endpoint":     "https://your-api.example.com/v1/chat/completions",
    "api_key":      "OPTIONAL_bearer_token",
    "model_id":     "model-identifier-for-your-api",
    "contact":      "you@example.com",
    "description":  "Brief — one sentence"
  }'

⚠ Your endpoint will be called from jak.ma's IPs only. Calls are rate-limited to 50 per day per submission. API keys (if any) are encrypted at rest. Submissions that violate jak.ma's terms (PII exfiltration attempts, prompt injection, etc.) are banned automatically.
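For submitters who prefer a script to curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper name build_submission_request is hypothetical, the field names are copied from the curl example above, and leaving out "api_key" when no token is needed is an assumption based on that field being marked optional. The response format is not documented here, so the sketch only builds the request and leaves the actual send commented out.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
SUBMIT_URL = "https://jak.ma/api/leaderboard/submit"


def build_submission_request(model_name, organization, endpoint, model_id,
                             contact, description, api_key=None):
    """Build (but do not send) the POST request for a leaderboard submission.

    Hypothetical helper: field names mirror the curl example.
    """
    payload = {
        "model_name": model_name,
        "organization": organization,
        "endpoint": endpoint,
        "model_id": model_id,
        "contact": contact,
        "description": description,
    }
    # Assumption: the optional bearer token can simply be omitted.
    if api_key:
        payload["api_key"] = api_key
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        SUBMIT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_submission_request(
        model_name="your-model-v1",
        organization="Your Org / your-name",
        endpoint="https://your-api.example.com/v1/chat/completions",
        model_id="model-identifier-for-your-api",
        contact="you@example.com",
        description="Brief, one sentence",
    )
    print(req.get_method(), req.full_url)
    # Uncomment to actually submit:
    # with urllib.request.urlopen(req, timeout=30) as resp:
    #     print(resp.status, resp.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the JSON body before anything leaves your machine.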

Methodology

Eval queries: a rotating sample of ~50 real production queries per day, drawn from jak.ma's eval_logs collection, with PII scrubbed and stratified across 15 user-task types (recommend, price, DIY, emergency, trust, negotiation, etc.).
Judges: three independent LLMs (Grok-3, GPT-4o-mini, Claude-3.5-Haiku) each score every response from 0 to 5 on five rubric dimensions (factuality, naturalness, trade-fit, price-fairness, geographic). Cohen's κ between judges is reported per dimension.
Final score: the mean across judges. Per-task breakdowns are also published.
Latency: elapsed time from request to first token, measured server-side.
Reproducibility: the full eval methodology and adversarial probes are available at github.com/Samielakkad/jak-ma-eval-suite.
Cited in: Lexicon-Grounded Production Chat for Moroccan Darija: A Verifier-First Architecture, WANLP 2026 (submission).
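The judge-averaging and agreement steps described above can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the leaderboard's actual scoring code. Cohen's κ is defined for two raters, so the sketch assumes agreement is computed for each pair of judges. It also uses the unweighted form, which treats the 0-5 scores as plain categories; the source does not say whether a weighted variant is used. The judge names and scores are made up for the example.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters' scores on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items given identical scores.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal score frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both raters gave one identical score to every item; kappa is
        # formally 0/0 here, and this sketch reports it as 1.0.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def pairwise_kappa(scores_by_judge):
    """Kappa for every pair of judges (assumed pairwise reporting)."""
    return {
        (j1, j2): cohen_kappa(scores_by_judge[j1], scores_by_judge[j2])
        for j1, j2 in combinations(sorted(scores_by_judge), 2)
    }


def judge_average(scores_by_judge):
    """Per-response score averaged across judges."""
    columns = zip(*scores_by_judge.values())
    return [sum(col) / len(col) for col in columns]


if __name__ == "__main__":
    # Hypothetical scores for one rubric dimension on four responses.
    factuality = {
        "judge_a": [5, 4, 3, 5],
        "judge_b": [5, 4, 2, 5],
        "judge_c": [4, 4, 3, 5],
    }
    print(judge_average(factuality))
    for pair, kappa in pairwise_kappa(factuality).items():
        print(pair, round(kappa, 3))
```

With three judges this gives three κ values per dimension. A single multi-rater statistic such as Fleiss' κ would be an alternative design, but that is not what the methodology text names.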