Native Arabic intelligence. Built, not translated.
Most "Arabic" AI is just English models forced to translate. MX4 Atlas is different. We rebuild open-source foundations with native Arabic tokenization and cultural alignment, delivering the MENA region's most capable and compliant LLMs.
The MX4 Methodology
From Generalist to Specialist
Generic models treat Arabic as a second-class citizen. We rebuild them from the token level up.
Open Source Base
We start with world-class open-weights models (Llama 3, Mistral) as our cognitive engine.
- 7B-70B Parameters
- English Fluency
- Reasoning Core
Vocabulary Expansion
We reconstruct the tokenizer, adding 20,000+ native Arabic tokens to reduce fragmentation (see the sketch after this list).
- +250% Efficiency
- Native Script Support
- Dialect Coverage
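As a rough illustration of how this step can work, the sketch below uses the Hugging Face transformers library to add new tokens to a base tokenizer and grow the model's embedding matrix to match. The model name and token list are placeholders, not MX4's production pipeline.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load an open-weights base model and its original tokenizer.
# "meta-llama/Meta-Llama-3-8B" is an illustrative choice, not MX4's actual base.
base = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical list of native Arabic tokens mined from a corpus
# (whole words and frequent subwords; in practice 20,000+ entries).
arabic_tokens = ["مدرسة", "الحكومة", "يستطيع", "العربية"]

# Add only tokens the base vocabulary does not already contain,
# then resize the embedding matrix to the new vocabulary size.
num_added = tokenizer.add_tokens(arabic_tokens)
model.resize_token_embeddings(len(tokenizer))

print(f"Added {num_added} new tokens; vocab size is now {len(tokenizer)}")
```

Newly added embeddings start out untrained, which is one reason the continued pre-training stage that follows is essential.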
Continued Pre-training
We inject 100 billion tokens of high-quality Arabic data spanning Modern Standard Arabic and regional dialects (see the sketch after this list).
- Regional History
- Legal Frameworks
- Cultural Nuance
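For a concrete picture of this stage, here is a minimal continued pre-training sketch using a standard causal language modeling objective with the Hugging Face Trainer. The corpus file, base checkpoint, and hyperparameters are illustrative assumptions, not MX4's actual configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"  # illustrative base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS for padding if no pad token is set
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical Arabic corpus file; in practice this is the curated
# 100B-token mix of Modern Standard Arabic and dialect data.
corpus = load_dataset("text", data_files={"train": "arabic_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="atlas-cpt",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # Standard causal LM objective: no masking, predict the next token.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```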
Cultural Fine-Tuning
Instruction tuning and RLHF designed specifically for MENA cultural and ethical values (see the sketch after this list).
- Sovereign-ready
- Safety tuning
- Regional Values
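A minimal sketch of the supervised instruction-tuning half of this stage, assuming the TRL library's SFTTrainer; the dataset file and checkpoint name are hypothetical. The preference-alignment step that follows (a reward model or RLHF-style optimization) is omitted here.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical Arabic instruction dataset: each record has a "text" field
# containing a fully formatted prompt/response pair.
dataset = load_dataset("json", data_files="arabic_instructions.jsonl")["train"]

trainer = SFTTrainer(
    model="mx4/atlas-cpt",  # hypothetical checkpoint from the continued pre-training stage
    train_dataset=dataset,
    args=SFTConfig(output_dir="atlas-sft"),
)
trainer.train()
```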
Performance Metrics
Sovereign, Yet Superior
MX4 Atlas outperforms standard open-source models on Arabic tasks and rivals proprietary cloud models.
Linguistic Diversity
One Model, Many Voices
The Arab world is not a monolith. MX4 Atlas is the first foundation model trained on a balanced corpus of Modern Standard Arabic and regional dialects.
From formal government decrees in MSA to customer service chatbots in Saudi dialect, we cover the full spectrum of communication.
Challenge
Why standard models fail
Standard models (like GPT-4 or Llama base) chop Arabic words into many small, meaningless fragments. This increases cost, latency, and hallucination rates.
The MX4 Solution: We expanded the vocabulary by 20,000+ native tokens. Our models "see" whole Arabic words, not just letters (see the comparison sketch below).
- Standard Model: 4.2 tokens per Arabic word
- MX4 Atlas: 1.6 tokens per Arabic word
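Figures like these can be reproduced with a simple measurement: tokenize a sample and divide the token count by the word count. The sketch below assumes Hugging Face tokenizers; the second model id is a hypothetical placeholder for the expanded-vocabulary tokenizer.

```python
from transformers import AutoTokenizer

# Illustrative model ids; substitute the two tokenizers being compared.
tokenizers = {
    "standard base": AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B"),
    "expanded vocab": AutoTokenizer.from_pretrained("mx4/atlas"),  # hypothetical id
}

# Short Arabic sample sentence for the comparison.
sample = "تعمل الوزارة على تطوير الخدمات الرقمية للمواطنين"
num_words = len(sample.split())

for name, tok in tokenizers.items():
    num_tokens = len(tok.tokenize(sample))
    print(f"{name}: {num_tokens / num_words:.1f} tokens per word")
```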
Open Source
Powered by open source
We don't reinvent the wheel; we reinforce it. By building upon the world's best open-weights models—Meta's Llama 3, Mistral, and others—we focus our energy on the last mile: Cultural Alignment and Sovereign Deployment.
Deploy Arabic-First AI
Bring sovereign Arabic intelligence on-prem in weeks, not months.
Talk to MX4 Atlas specialists to scope a dialect-focused deployment, benchmarking, and a data-residency plan.