January 28, 2026 | 15 min read | Research

Scaling RAG for Arabic: Challenges and Solutions

Retrieval-Augmented Generation (RAG) behaves differently with morphologically rich languages. We benchmark standard embeddings vs. our new Atlas-Embed-v2.

MX4 Team
Sovereign AI

Retrieval-Augmented Generation (RAG) is the gold standard for enterprise AI, allowing models to "read" your internal documents before answering. However, standard English-centric embedding models often fail to capture the semantic nuances of Arabic text.

The Morphology Problem

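In English, "bank" and "banking" are distinct words, but they share a visible stem, so their surface forms stay close. In Arabic, a single root (like k-t-b, associated with writing) can spawn dozens of forms (kitab "book", maktaba "library", kataba "he wrote", yaktubu "he writes") in which vowels and affixes are interleaved with the root consonants. Standard vector models often treat these forms as unrelated concepts because of the surface-level differences.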
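To make the surface-form gap concrete, here is a minimal sketch in plain Python. It assumes simplified Latin transliterations and uses character-trigram overlap as a rough stand-in for surface similarity; the helper names (`char_ngrams`, `jaccard`, `contains_root`) are illustrative and not part of any Atlas API.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with '#'."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 = disjoint, 1.0 = identical)."""
    return len(a & b) / len(a | b)


def contains_root(word, root):
    """True if the root consonants appear in the word in order (not contiguously)."""
    remaining = iter(word)
    return all(consonant in remaining for consonant in root)


# Simplified transliterations of forms derived from the root k-t-b.
forms = ["kitab", "maktaba", "kataba", "yaktubu"]

shares_root = {form: contains_root(form, "ktb") for form in forms}
english_overlap = jaccard(char_ngrams("bank"), char_ngrams("banking"))
arabic_overlap = jaccard(char_ngrams("kitab"), char_ngrams("maktaba"))
```

Every form contains the root consonants in order, yet the trigram overlap between "kitab" and "maktaba" is about 0.09, compared with about 0.38 for "bank" and "banking". A model that leans on surface subword overlap has far less to work with in the Arabic case.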

Figure: Arabic Morphology vs. Vector Space

Atlas-Embed-v2 Architecture

To solve this, we developed Atlas-Embed-v2, a contrastive learning model pre-trained on 10 billion tokens of high-quality Arabic text. It uses a novel Stem-Aware Pooling mechanism so that queries match documents by meaning rather than by surface form.
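For readers unfamiliar with contrastive training, the following is a minimal sketch of one widely used objective, an InfoNCE-style loss with in-batch negatives. It is an assumption for illustration only: the exact loss behind Atlas-Embed-v2 and the internals of Stem-Aware Pooling are not described here, and the function names and temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, docs, temperature=0.05):
    """Mean InfoNCE loss where docs[i] is the positive for queries[i].

    Every other document in the batch acts as a negative for that query.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in docs]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy batch: each query is aligned with the document at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each query sits closest to its own document and high when it sits closer to another document in the batch, which is the pressure that pulls related forms together during training.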

Benchmark Results

In our internal runs of the Arabic retrieval tasks from MTEB (Massive Text Embedding Benchmark), Atlas-Embed-v2 achieved a 14% improvement in NDCG@10 over multilingual-e5-large.
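As a reminder of what NDCG@10 measures, here is a minimal sketch of the metric in its linear-gain form (relevance divided by a log2 rank discount). The function names are illustrative, and evaluation toolkits may use a different gain function, so treat this as a reference for the idea rather than a reproduction of our evaluation harness.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: labels of the retrieved documents, in ranked order.
    all_relevances: labels of every judged document for the query, used to
    build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # No relevant documents exist for this query.
    return dcg_at_k(ranked_relevances, k) / ideal


# One relevant document, retrieved at rank 1 versus rank 2.
top_hit = ndcg_at_k([1, 0, 0], [1])
second_hit = ndcg_at_k([0, 1, 0], [1])
```

Moving the single relevant document from rank 1 to rank 2 drops the score from 1.0 to about 0.63, which shows how strongly the metric rewards placing the right document first.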
rag_pipeline.py (Python)
def retrieve_documents(query, k=5):
    """Return the top-k documents for an Arabic query.

    Assumes `atlas` and `vector_db` are client objects initialized
    elsewhere in the module.
    """
    # 1. Stem-aware query expansion
    expanded_query = atlas.morphology.expand(query)

    # 2. Generate embeddings using Atlas-Embed-v2
    query_vec = atlas.embed(expanded_query, model="atlas-v2-large")

    # 3. Hybrid search (dense + sparse), restricted by sovereignty level
    results = vector_db.search(
        vectors=query_vec,
        filter={"sovereignty_level": "Level-4"},  # Strict filtering
        top_k=k,
    )
    return results

This pipeline ensures that when a user asks about "regulations" (اللوائح), the system correctly retrieves documents containing "regulatory" (التنظيمية) and "legislative" (التشريعية) contexts, which is critical for legal and government use cases.
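The hybrid step combines a dense ranking with a sparse (keyword) ranking. One common way to merge the two lists is reciprocal rank fusion (RRF), sketched below as an assumption about how such a merge could work; the function name, the document IDs, and the constant `k=60` are illustrative and do not describe the internals of `vector_db.search`.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with ranks starting at 1. Ties are broken by document ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense retriever and a sparse retriever.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that both retrievers rank highly ("b" and "a" here) rise to the top, while documents found by only one retriever are kept but ranked lower. RRF needs only ranks, not raw scores, so it avoids having to calibrate dense similarity against keyword scores.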