We are now part of the NVIDIA Inception Program.
Arabic NLP

Native Arabic intelligence. Built, not translated.

Most "Arabic" AI is just an English model forced to translate. MX4 Atlas is different. We rebuild open-source foundations with native Arabic tokenization and cultural alignment, delivering the MENA region's most capable and compliant LLMs.

The MX4 Methodology

From Generalist to Specialist

Generic models treat Arabic as a second-class citizen. We rebuild them from the token level up.

01. Foundation

Open Source Base

We start with world-class open-weights models (Llama 3, Mistral) as our cognitive engine.

  • 7B-70B Parameters
  • English Fluency
  • Reasoning Core

02. Adaptation

Vocabulary Expansion

We reconstruct the tokenizer, adding 20,000+ native Arabic tokens to reduce fragmentation.

  • 2.6x Token Efficiency
  • Native Script Support
  • Dialect Coverage
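
As a rough illustration of why a larger Arabic vocabulary reduces fragmentation, the sketch below uses a toy greedy longest-match tokenizer. It is not the MX4 Atlas tokenizer: the function names, the character-only base vocabulary, and the two added tokens are all assumptions made for the example.

```python
def greedy_tokenize(word, vocab):
    """Split one word by greedy longest-match against `vocab`.

    Falls back to a single character when no longer piece is in the vocabulary.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def tokenize(text, vocab):
    """Tokenize whitespace-separated words independently."""
    return [tok for word in text.split() for tok in greedy_tokenize(word, vocab)]


# "Artificial intelligence", the example phrase used on this page.
phrase = "الذكاء الاصطناعي"

# Toy base vocabulary (assumption): single characters only, standing in for
# a tokenizer with poor Arabic coverage.
base_vocab = set(phrase.replace(" ", ""))

# Toy expansion (assumption): add whole-word tokens. A real expansion would
# mine frequent words and subwords from a large corpus.
expanded_vocab = base_vocab | set(phrase.split())

before = tokenize(phrase, base_vocab)
after = tokenize(phrase, expanded_vocab)
print(len(before), "tokens before expansion")
print(len(after), "tokens after expansion")
```

In a Hugging Face Transformers workflow, the analogous step is usually `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. Whether MX4 Atlas is built that way is not stated here.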

03. Knowledge

Continued Pre-training

We continue pre-training on 100 billion tokens of high-quality Arabic data (Modern Standard Arabic and dialects).

  • Regional History
  • Legal Frameworks
  • Cultural Nuance

04. Alignment

Cultural Fine-Tuning

We apply instruction tuning and RLHF designed specifically for MENA cultural and ethical values.

  • Sovereign-ready
  • Safety tuning
  • Regional Values

Performance Metrics

Sovereign, Yet Superior

MX4 Atlas outperforms standard open-source models on Arabic tasks and rivals proprietary cloud models.

Arabic Reasoning
MMLU (Arabic Translated)

  • MX4 Atlas: 68.4%
  • GPT-4o: 72.0%
  • Llama 3 Base: 52.1%

Approaching GPT-4o performance at 1/10th the inference cost.

Cultural Alignment
Regional Context Accuracy

  • MX4 Atlas: 94.7%
  • GPT-4o: 72.0%
  • Llama 3 Base: 68.2%

Native understanding of MENA idioms, laws, and customs.

Token Efficiency
Tokens per Word

  • MX4 Atlas: 1.6
  • GPT-4o: 2.8
  • Llama 3 Base: 4.2

About 2.6x fewer tokens per word than Llama 3 Base (4.2 vs. 1.6), for faster generation and lower cost.
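
The arithmetic behind the token-efficiency comparison can be checked with a few lines of Python. The tokens-per-word figures are the ones quoted on this page. Treating generation time and cost as proportional to token count is a simplifying assumption of this sketch, not a measured result.

```python
# Tokens per Arabic word, as quoted on this page.
TOKENS_PER_WORD = {
    "MX4 Atlas": 1.6,
    "GPT-4o": 2.8,
    "Llama 3 Base": 4.2,
}


def token_ratio(baseline, improved):
    """How many times more tokens the baseline needs for the same text."""
    return baseline / improved


def token_saving(baseline, improved):
    """Fraction of tokens saved relative to the baseline."""
    return 1.0 - improved / baseline


atlas = TOKENS_PER_WORD["MX4 Atlas"]
for name, tpw in TOKENS_PER_WORD.items():
    if name == "MX4 Atlas":
        continue
    ratio = token_ratio(tpw, atlas)
    saving = token_saving(tpw, atlas)
    print(f"{name}: {ratio:.2f}x the tokens of MX4 Atlas "
          f"({saving:.0%} fewer tokens with MX4 Atlas)")
```

Under that linear assumption, the ratio against Llama 3 Base is 4.2 / 1.6, which is where the roughly 2.6x figure comes from.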

Linguistic Diversity

One Model, Many Voices

The Arab world is not a monolith. MX4 Atlas is the first foundation model trained on a balanced corpus of Modern Standard Arabic and regional dialects.

From formal government decrees in MSA to customer service chatbots in Saudi dialect, we cover the full spectrum of communication.

  • Modern Standard Arabic [MSA]: Pan-Arab
  • Gulf (Khaleeji) [GLF]: Saudi, UAE, Kuwait
  • Levantine [LEV]: Jordan, Lebanon, Syria
  • Egyptian [EGY]: Egypt
  • Maghrebi [MAG]: Morocco, Algeria

MENA Coverage: 22 Nations, 400M+ Speakers
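
This page does not say how the corpus is balanced. One common approach in multilingual training is to smooth the sampling distribution with an exponent, so that smaller varieties are drawn more often than their raw share. The sketch below illustrates that idea only. The document counts are made up, and the `alpha` value is an arbitrary choice for the example.

```python
import random
from collections import Counter

# Hypothetical document counts per variety. These are NOT MX4 figures.
CORPUS_SIZES = {
    "MSA": 1_000_000,
    "GLF": 120_000,
    "LEV": 90_000,
    "EGY": 200_000,
    "MAG": 40_000,
}


def sampling_weights(sizes, alpha=0.5):
    """Return sampling probabilities proportional to size ** alpha.

    alpha = 1.0 reproduces the raw proportions; alpha = 0.0 is uniform.
    Values in between up-weight the smaller varieties.
    """
    scaled = {name: count ** alpha for name, count in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


weights = sampling_weights(CORPUS_SIZES, alpha=0.5)

# Draw a sample of variety labels according to the smoothed weights.
rng = random.Random(0)
draws = rng.choices(list(weights), weights=list(weights.values()), k=10_000)
counts = Counter(draws)

raw_total = sum(CORPUS_SIZES.values())
for name in CORPUS_SIZES:
    raw_share = CORPUS_SIZES[name] / raw_total
    print(f"{name}: raw share {raw_share:.1%}, "
          f"smoothed weight {weights[name]:.1%}, drawn {counts[name]}")
```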

Challenge

Why standard models fail

Standard models (like GPT-4 or Llama base) chop Arabic words into many small, meaningless fragments. This increases cost, latency, and hallucination rates.

The MX4 Solution: We expanded the vocabulary by 20,000+ native tokens. Our models "see" whole Arabic words, not just letters.

Standard Model: 4.2 tokens per Arabic word
MX4 Atlas: 1.6 tokens per Arabic word

Standard Llama 3 (Fragmentation): الذكاء الاصطناعي
vs.
MX4 Atlas (Native Understanding): الذكاء الاصطناعي
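
The "tokens per Arabic word" figure can be measured for any tokenizer by dividing token count by word count. The helper below is a minimal sketch of that metric. The two stand-in tokenizers mark the extremes (one token per word, and one token per UTF-8 byte) and are assumptions for illustration. Neither is the Llama 3 or MX4 Atlas tokenizer.

```python
def tokens_per_word(tokenize, text):
    """Average number of tokens per whitespace-separated word.

    `tokenize` is any callable that maps a string to a list of tokens.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def word_level(text):
    """Best case: every word is a single token."""
    return text.split()


def utf8_byte_level(text):
    """Worst case: every UTF-8 byte of every word is its own token.

    Arabic letters take two bytes each in UTF-8, so a byte-level tokenizer
    with no learned Arabic merges fragments words heavily.
    """
    return [byte for word in text.split() for byte in word.encode("utf-8")]


sample = "الذكاء الاصطناعي"
print("word-level:", tokens_per_word(word_level, sample))
print("byte-level:", tokens_per_word(utf8_byte_level, sample))
```

To measure a real model, pass its own tokenization function as `tokenize` and run it over a representative Arabic corpus rather than a single phrase.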

Open Source

Powered by open source

We don't reinvent the wheel; we reinforce it. By building upon the world's best open-weights models—Meta's Llama 3, Mistral, and others—we focus our energy on the last mile: Cultural Alignment and Sovereign Deployment.

Deploy Arabic-First AI

Bring sovereign Arabic intelligence on-prem in weeks, not months.

Talk to MX4 Atlas specialists to scope a dialect-focused deployment, benchmarks, and data residency plan.