Arabic Excellence
MX4 Atlas is engineered from the ground up for the MENA region. Our models feature native Arabic tokenization, dialect support, and culturally aligned reinforcement learning.
Why Native Matters
Most "Arabic-capable" LLMs are English-first models that treat Arabic as a second-class citizen. They rely on translation layers or inefficient tokenization that fragments Arabic words, leading to:
- Higher Latency: More tokens per word means slower generation.
- Increased Cost: You pay per token; inefficient tokenization can more than double your bill.
- Cultural Hallucinations: Direct translation misses idiom, context, and nuance.
The Tokenization Advantage
We rebuilt the tokenizer vocabulary, adding 20,000+ native Arabic tokens. This allows our models to process whole words and phrases rather than fragmented characters.
Token Efficiency Comparison
| Model | Tokens per Arabic word |
|---|---|
| Standard Llama 3 / GPT-4 | ~4.2 |
| MX4 Atlas | ~1.6 |
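You can reproduce the fragmentation side of this comparison with an off-the-shelf tokenizer. The sketch below uses tiktoken's GPT-4 encoding to measure tokens per Arabic word; since the MX4 Atlas tokenizer isn't bundled here, its figure is quoted from the table above rather than computed.

```python
# Sketch: measure how an English-first BPE tokenizer fragments Arabic.
# Requires: pip install tiktoken
import tiktoken

# "Saudi Arabia is the largest oil producer in the world"
text = "المملكة العربية السعودية هي أكبر منتج نفط في العالم"
words = len(text.split())

enc = tiktoken.encoding_for_model("gpt-4")  # standard English-first vocabulary
n_tokens = len(enc.encode(text))

print(f"{n_tokens} tokens for {words} words "
      f"-> {n_tokens / words:.1f} tokens/word")
# Compare with MX4 Atlas's reported ~1.6 tokens per Arabic word.
```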
Dialect Support
Modern Standard Arabic (MSA) is the language of business, but your users speak dialects. MX4 Atlas is fine-tuned on more than 750 million dialect-specific documents (see the per-dialect breakdown below).
| Dialect | Status | Coverage |
|---|---|---|
| Gulf (Khaleeji) | Production | Saudi Arabia, UAE, Kuwait, Qatar |
| Levantine (Shami) | Production | Jordan, Lebanon, Palestine, Syria |
| Egyptian (Masri) | Production | Egypt |
| North African (Maghrebi) | Beta | Morocco, Algeria, Tunisia |
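If you route requests by user locale, a simple lookup table is often enough. This is an illustrative sketch: the dialect identifiers and country mapping below are assumptions for demonstration, not official MX4 Atlas names, so check the model catalog for the exact values.

```python
# Sketch: pick a dialect variant from an ISO 3166 country code.
# Dialect names here are illustrative assumptions, not official identifiers.
DIALECT_BY_COUNTRY = {
    "SA": "gulf", "AE": "gulf", "KW": "gulf", "QA": "gulf",
    "JO": "levantine", "LB": "levantine", "PS": "levantine", "SY": "levantine",
    "EG": "egyptian",
    "MA": "maghrebi", "DZ": "maghrebi", "TN": "maghrebi",  # beta
}

def pick_dialect(country_code: str) -> str:
    """Fall back to Modern Standard Arabic for unmapped locales."""
    return DIALECT_BY_COUNTRY.get(country_code.upper(), "msa")

print(pick_dialect("ae"))  # gulf
print(pick_dialect("US"))  # msa
```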
Performance by Dialect
Accuracy and latency metrics across the production Arabic dialects and MSA:
| Dialect | Accuracy | Avg Latency | Training Data |
|---|---|---|---|
| Gulf (Khaleeji) | 94.2% | 45-65ms | 250M documents |
| Levantine (Shami) | 92.8% | 48-70ms | 180M documents |
| Egyptian (Masri) | 91.5% | 50-72ms | 320M documents |
| Modern Standard (MSA) | 96.1% | 40-55ms | 1.2B documents |
Benchmarks
We evaluate our models on standard Arabic benchmarks (OpenArabic, ALDi) and our own proprietary sovereign evaluation set.
| Model | Arabic MMLU | Dialect Understanding | Safety & Alignment |
|---|---|---|---|
| MX4 Atlas Core | 78.4% | 82.1% | High |
| GPT-4o | 76.2% | 68.5% | Medium |
| Jais 30B | 72.1% | 74.2% | High |
Fine-Tuning by Dialect
Recommendations for fine-tuning on specific Arabic dialects (a dataset-preparation sketch follows the list):
Gulf Dialect (Khaleeji)
Recommended: 200-500 examples. Best for financial, e-commerce, and customer service applications in the KSA/UAE region.
Levantine Dialect (Shami)
Recommended: 150-400 examples. Excellent for social media, community platforms, and user-generated content.
Egyptian Dialect (Masri)
Recommended: 300-600 examples. Perfect for entertainment, media, and high-volume consumer applications.
Modern Standard Arabic (MSA)
Recommended: 100-300 examples. Ideal for formal documents, legal, and governmental use cases.
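A common way to assemble a fine-tuning set in the size ranges above is one JSON object per example in a JSONL file. The chat-style "messages" schema below is an assumption borrowed from widely used fine-tuning formats, so confirm the exact field names against the MX4 fine-tuning documentation.

```python
# Sketch: write dialect-tagged fine-tuning examples to a JSONL file.
# The "messages" schema is an assumed convention, not a confirmed MX4 format.
import json

examples = [
    {
        "dialect": "gulf",
        "messages": [
            # "What's the best way to track my order?"
            {"role": "user", "content": "وش أفضل طريقة أتابع فيها طلبي؟"},
            # "You can track your order from the orders page."
            {"role": "assistant", "content": "تقدر تتابع طلبك من صفحة الطلبات."},
        ],
    },
    # ... aim for 200-500 examples for Gulf, per the guidance above
]

with open("gulf_finetune.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```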
Cost Efficiency
Native Arabic tokenization dramatically reduces costs compared to English-first tokenizers that fragment Arabic words:
```python
# Cost comparison for a sample Arabic sentence
# ("Saudi Arabia is the largest oil producer in the world")
text = "المملكة العربية السعودية هي أكبر منتج نفط في العالم"

# Standard models (~4.2 tokens per word)
standard_tokens = len(text.split()) * 4.2  # 9 words ≈ 38 tokens

# MX4 Atlas (~1.6 tokens per word)
atlas_tokens = len(text.split()) * 1.6  # 9 words ≈ 14 tokens

# Cost saving with Atlas at $0.15 per 1M tokens
standard_cost = (standard_tokens / 1_000_000) * 0.15
atlas_cost = (atlas_tokens / 1_000_000) * 0.15

savings = ((standard_cost - atlas_cost) / standard_cost) * 100
print(f"Cost reduction: {savings:.1f}% with Atlas")  # ~62% savings
```
Ready to build?
Start building with the most capable Arabic AI infrastructure today.