Arabic Excellence
MX4 Atlas is engineered from the ground up for the MENA region. Our models feature native Arabic tokenization, dialect support, and culturally aligned reinforcement learning.
Why Native Matters
Most "Arabic-capable" LLMs are English-first models that treat Arabic as a second-class citizen. They rely on translation layers or inefficient tokenization that fragments Arabic words, leading to:
- Higher Latency: More tokens per word means slower generation.
- Increased Cost: You pay per token; inefficient tokenization can more than double your bill.
- Cultural Hallucinations: Direct translation misses idiom, context, and nuance.
The Tokenization Advantage
We rebuilt the tokenizer vocabulary, adding 20,000+ native Arabic tokens. This allows our models to process whole words and phrases rather than fragmented characters.
Token Efficiency Comparison
| Model | Tokens per Arabic word |
|---|---|
| Standard Llama 3 / GPT-4 | ~4.2 |
| MX4 Atlas | ~1.6 |
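You can reproduce the fragmentation side of this comparison with an off-the-shelf tokenizer. The sketch below uses tiktoken's GPT-4 encoding to measure tokens per Arabic word; since the MX4 Atlas tokenizer isn't bundled here, its figure is quoted from the table above rather than computed.

```python
# Sketch: measure how an English-first BPE tokenizer fragments Arabic.
# Requires: pip install tiktoken
import tiktoken

# "Saudi Arabia is the largest oil producer in the world"
text = "المملكة العربية السعودية هي أكبر منتج نفط في العالم"
words = len(text.split())

enc = tiktoken.encoding_for_model("gpt-4")  # standard English-first vocabulary
n_tokens = len(enc.encode(text))

print(f"{n_tokens} tokens for {words} words "
      f"-> {n_tokens / words:.1f} tokens/word")
# Compare with MX4 Atlas's reported ~1.6 tokens per Arabic word.
```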
Dialect Support
Modern Standard Arabic (MSA) is the language of business, but your users speak dialects. MX4 Atlas is fine-tuned on more than 750 million dialect-specific documents (see the per-dialect breakdown below).
| Dialect | Status | Coverage |
|---|---|---|
| Gulf (Khaleeji) | Production | Saudi Arabia, UAE, Kuwait, Qatar |
| Levantine (Shami) | Production | Jordan, Lebanon, Palestine, Syria |
| Egyptian (Masri) | Production | Egypt |
| North African (Maghrebi) | Beta | Morocco, Algeria, Tunisia |
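If you route requests by user locale, a simple lookup table is often enough. This is an illustrative sketch: the dialect identifiers and country mapping below are assumptions for demonstration, not official MX4 Atlas names, so check the model catalog for the exact values.

```python
# Sketch: pick a dialect variant from an ISO 3166 country code.
# Dialect names here are illustrative assumptions, not official identifiers.
DIALECT_BY_COUNTRY = {
    "SA": "gulf", "AE": "gulf", "KW": "gulf", "QA": "gulf",
    "JO": "levantine", "LB": "levantine", "PS": "levantine", "SY": "levantine",
    "EG": "egyptian",
    "MA": "maghrebi", "DZ": "maghrebi", "TN": "maghrebi",  # beta
}

def pick_dialect(country_code: str) -> str:
    """Fall back to Modern Standard Arabic for unmapped locales."""
    return DIALECT_BY_COUNTRY.get(country_code.upper(), "msa")

print(pick_dialect("ae"))  # gulf
print(pick_dialect("US"))  # msa
```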
Performance by Dialect
Accuracy and latency metrics across the production Arabic dialects and MSA:
| Dialect | Accuracy | Avg Latency | Training Data |
|---|---|---|---|
| Gulf (Khaleeji) | 94.2% | 45-65ms | 250M documents |
| Levantine (Shami) | 92.8% | 48-70ms | 180M documents |
| Egyptian (Masri) | 91.5% | 50-72ms | 320M documents |
| Modern Standard (MSA) | 96.1% | 40-55ms | 1.2B documents |
Benchmarks
We evaluate our models on standard Arabic benchmarks (OpenArabic, ALDi) and our own proprietary sovereign evaluation set.
| Model | Arabic MMLU | Dialect Understanding | Safety & Alignment |
|---|---|---|---|
| MX4 Atlas Core | 78.4% | 82.1% | High |
| GPT-4o | 76.2% | 68.5% | Medium |
| Jais 30B | 72.1% | 74.2% | High |
Fine-Tuning by Dialect
Recommendations for fine-tuning on specific Arabic dialects (a dataset-preparation sketch follows the list):
Gulf Dialect (Khaleeji)
Recommended: 200-500 examples. Best for financial, e-commerce, and customer service applications in the KSA/UAE region.
Levantine Dialect (Shami)
Recommended: 150-400 examples. Excellent for social media, community platforms, and user-generated content.
Egyptian Dialect (Masri)
Recommended: 300-600 examples. Perfect for entertainment, media, and high-volume consumer applications.
Modern Standard Arabic (MSA)
Recommended: 100-300 examples. Ideal for formal documents, legal, and governmental use cases.
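A common way to assemble a fine-tuning set in the size ranges above is one JSON object per example in a JSONL file. The chat-style "messages" schema below is an assumption borrowed from widely used fine-tuning formats, so confirm the exact field names against the MX4 fine-tuning documentation.

```python
# Sketch: write dialect-tagged fine-tuning examples to a JSONL file.
# The "messages" schema is an assumed convention, not a confirmed MX4 format.
import json

examples = [
    {
        "dialect": "gulf",
        "messages": [
            # "What's the best way to track my order?"
            {"role": "user", "content": "وش أفضل طريقة أتابع فيها طلبي؟"},
            # "You can track your order from the orders page."
            {"role": "assistant", "content": "تقدر تتابع طلبك من صفحة الطلبات."},
        ],
    },
    # ... aim for 200-500 examples for Gulf, per the guidance above
]

with open("gulf_finetune.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```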
Cost Efficiency
Native Arabic tokenization dramatically reduces costs compared to English-first tokenizers that fragment Arabic words:
```python
# Cost comparison for a sample Arabic sentence
# ("Saudi Arabia is the largest oil producer in the world")
text = "المملكة العربية السعودية هي أكبر منتج نفط في العالم"

# Standard models (~4.2 tokens per word)
standard_tokens = len(text.split()) * 4.2  # 9 words ≈ 38 tokens

# MX4 Atlas (~1.6 tokens per word)
atlas_tokens = len(text.split()) * 1.6  # 9 words ≈ 14 tokens

# Cost saving with Atlas at $0.15 per 1M tokens
standard_cost = (standard_tokens / 1_000_000) * 0.15
atlas_cost = (atlas_tokens / 1_000_000) * 0.15

savings = ((standard_cost - atlas_cost) / standard_cost) * 100
print(f"Cost reduction: {savings:.1f}% with Atlas")  # ~62% savings
```
Ready to build?
Start building with the most capable Arabic AI infrastructure today.