Retrieval-Augmented Generation (RAG) is the gold standard for enterprise AI, allowing models to "read" your internal documents before answering. However, standard English-centric embedding models often fail to capture the semantic nuances of Arabic text.
The Morphology Problem
In English, "bank" and "banking" are distinct but closely related. In Arabic, a single triliteral root (like k-t-b, relating to writing) can spawn dozens of variations, such as kitab (book), maktaba (library), kataba (he wrote), and yaktubu (he writes), that standard vector models often treat as unrelated concepts because of their surface-level differences.
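A toy illustration of the gap: character-level overlap is a rough stand-in for what a subword tokenizer "sees" before any semantic modeling. (The `surface_similarity` helper below is illustrative only, not part of any Atlas API.)

```python
from difflib import SequenceMatcher

def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity, a rough proxy for the
    surface-form overlap a subword tokenizer works from."""
    return SequenceMatcher(None, a, b).ratio()

# English inflections share a long common prefix...
print(surface_similarity("bank", "banking"))     # high overlap

# ...but Arabic derivations of the root k-t-b interleave
# vowels and affixes around the root consonants.
print(surface_similarity("kitab", "maktaba"))    # lower overlap
print(surface_similarity("kataba", "yaktubu"))   # lower still
```

The related English pair scores markedly higher than either Arabic pair, even though all three pairs are equally related in meaning.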
Atlas-Embed-v2 Architecture
To solve this, we developed Atlas-Embed-v2, a contrastive-learning model pre-trained on 10 billion tokens of high-quality Arabic text. Its novel Stem-Aware Pooling mechanism ensures that queries match documents at the level of shared roots rather than surface forms.
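The post does not spell out how Stem-Aware Pooling works internally. One plausible reading, sketched below with hypothetical names (`stem_aware_pool` and the stem labels are assumptions, not the Atlas API), is that token embeddings sharing a stem are first averaged into one per-stem vector, and the sentence embedding is then the mean of those per-stem vectors:

```python
import numpy as np

def stem_aware_pool(token_vecs: np.ndarray, stems: list[str]) -> np.ndarray:
    """Average token vectors that share a stem, then mean-pool the
    per-stem vectors into a single sentence embedding.

    token_vecs: (num_tokens, dim) array of token embeddings.
    stems: one stem label per token (output of a hypothetical stemmer).
    """
    groups: dict[str, list[np.ndarray]] = {}
    for vec, stem in zip(token_vecs, stems):
        groups.setdefault(stem, []).append(vec)
    # one consolidated vector per stem, so a root repeated across many
    # surface forms does not dominate the sentence embedding
    stem_vecs = np.stack([np.mean(g, axis=0) for g in groups.values()])
    return stem_vecs.mean(axis=0)

# toy example: 4 tokens, the first two sharing the root k-t-b
vecs = np.random.rand(4, 8)
pooled = stem_aware_pool(vecs, ["ktb", "ktb", "qnn", "shr"])
print(pooled.shape)  # (8,)
```

The design intent this sketch captures is that inflectional variants of one root collapse into a single contribution, which is exactly the behavior the morphology problem above calls for.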
The Retrieval Pipeline
def retrieve_documents(query, k=5):
    # 1. Stem-aware query expansion
    expanded_query = atlas.morphology.expand(query)

    # 2. Generate embeddings using Atlas-Embed-v2
    query_vec = atlas.embed(expanded_query, model="atlas-v2-large")

    # 3. Hybrid search (dense + sparse)
    results = vector_db.search(
        vectors=query_vec,
        filter={"sovereignty_level": "Level-4"},  # strict filtering
        top_k=k,
    )
    return results

This pipeline ensures that when a user asks about "regulations" (اللوائح), the system correctly retrieves documents containing "regulatory" (التنظيمية) and "legislative" (التشريعية) contexts, which is critical for legal and government use cases.
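The hybrid search step above leaves the score-fusion strategy unspecified. A common choice (an assumption here; the post does not say which method Atlas uses) is reciprocal rank fusion, which merges the dense and sparse result lists by rank alone:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs: each document
    scores sum(1 / (k + rank)) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical result lists from the dense and sparse retrievers
dense  = ["doc_regulations", "doc_sports", "doc_legislative"]
sparse = ["doc_legislative", "doc_regulations", "doc_finance"]
print(reciprocal_rank_fusion([dense, sparse])[:2])
# → ['doc_regulations', 'doc_legislative']
```

Rank-based fusion is attractive here because dense cosine scores and sparse BM25 scores live on incompatible scales; merging by rank sidesteps score normalization entirely.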