Embeddings
Get a vector representation of a given input — optimized for Arabic morphology with dialect-aware tokenization.
Available Models
| Model | Notes | Best For |
|---|---|---|
| mx4-embed-v1 | Balanced quality and cost for Arabic semantic search. | General purpose retrieval |
| mx4-embed-v1-large | Higher recall for long or technical corpora. | High-precision retrieval, classification |
Model IDs and capacities can vary by deployment. Check Atlas Studio for the exact list in your environment.
Request Body
`input` (string or array of strings, required): The text(s) to embed. Pass a single string, or an array of strings for batch processing (limits depend on your plan).
`model` (string, required): ID of the model to use (for example: 'mx4-embed-v1' or 'mx4-embed-v1-large').
`encoding_format` (string, optional): Output format: 'float' (default) or 'base64'. Use base64 for bandwidth-sensitive applications.
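When you request base64 output, each embedding arrives as a base64 string instead of a float array. A minimal decoding sketch, assuming the vector is packed as little-endian float32 (the convention used by OpenAI-compatible endpoints):

```python
import base64
import numpy as np

def decode_embedding(b64: str) -> np.ndarray:
    # Assumes little-endian float32 packing, as in OpenAI-compatible
    # embedding APIs; verify against your deployment.
    return np.frombuffer(base64.b64decode(b64), dtype="<f4")
```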
Request
```bash
curl https://api.mx4.ai/v1/embeddings \
  -H "Authorization: Bearer $MX4_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["السيادة على البيانات مطلب أساسي", "Data sovereignty is essential"],
    "model": "mx4-embed-v1"
  }'
```
Response
```json
{
  "object": "list",
  "model": "mx4-embed-v1",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0123, -0.0456, 0.0789, "...float array"]
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": [0.0234, -0.0567, 0.0891, "...float array"]
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "total_tokens": 18
  }
}
```
Arabic Embedding Notes
- Root-Aware Tokenization: MX4 embeddings use an Arabic-aware tokenizer that preserves morphological roots. The word "كتبوا" (they wrote) remains close to "كتاب" (book) and "مكتبة" (library) in vector space.
- Dialect Coverage: Trained on MSA and major dialect families to improve cross-dialect retrieval.
- Mixed-Language Support: Handles Arabic-English code-switching common in Gulf business contexts. Embedding a mixed sentence like "نحتاج meeting بعد الظهر" (we need a meeting in the afternoon) produces coherent vectors.
- Diacritics Handling: Embeddings are stable with or without tashkeel (diacritical marks). وَلَد and ولد (both "boy") produce near-identical vectors. A quick sanity check for the root-awareness and diacritics claims appears after this list.
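You can verify these properties against your own deployment by comparing cosine similarity for related word forms. A minimal sketch; the `embed` and `cosine` helpers are illustrative, not part of the SDK:

```python
import numpy as np
import openai

client = openai.OpenAI(api_key="mx4-sk-...", base_url="https://api.mx4.ai/v1")

def embed(texts):
    resp = client.embeddings.create(model="mx4-embed-v1", input=texts)
    return [np.array(d.embedding) for d in resp.data]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "they wrote" / "book" share a root; وَلَد / ولد differ only in diacritics.
katabu, kitab, walad_tashkeel, walad = embed(["كتبوا", "كتاب", "وَلَد", "ولد"])
print(f"root pair similarity:       {cosine(katabu, kitab):.3f}")
print(f"diacritics pair similarity: {cosine(walad_tashkeel, walad):.3f}")
```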
Use Cases
Semantic Search
Find documents matching a query by comparing embedding vectors. Ideal for Arabic knowledge bases where keyword matching fails due to morphological complexity.
RAG Retrieval
Power retrieval-augmented generation by embedding document chunks and retrieving the top-k most relevant passages for context injection.
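A minimal retrieval sketch: embed the chunks in one batched call, embed the query, and keep the top-k chunks for context injection. The `top_k_chunks` helper is illustrative:

```python
import numpy as np
import openai

client = openai.OpenAI(api_key="mx4-sk-...", base_url="https://api.mx4.ai/v1")

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # One call for the query, one batched call for the chunks.
    q_resp = client.embeddings.create(model="mx4-embed-v1", input=query)
    q = np.array(q_resp.data[0].embedding)
    c_resp = client.embeddings.create(model="mx4-embed-v1", input=chunks)
    # Dot product acts as cosine similarity because the vectors are L2-normalized.
    scores = [float(np.dot(q, d.embedding)) for d in c_resp.data]
    # Highest-scoring chunks first, ready for context injection into a prompt.
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```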
Duplicate Detection
Identify duplicate or near-duplicate documents using cosine similarity — across MSA and dialect variants of the same content.
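A sketch of pairwise duplicate detection; the 0.92 threshold is a placeholder to tune on your own corpus:

```python
import numpy as np

def find_duplicates(embeddings: list[list[float]], threshold: float = 0.92):
    # Pairwise dot products; with L2-normalized vectors this is cosine similarity.
    vecs = np.array(embeddings)
    sims = vecs @ vecs.T
    pairs = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs
```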
Classification
Use embeddings as features for downstream classifiers — sentiment analysis, topic categorization, intent detection in Arabic customer support.
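A minimal sketch of intent classification with scikit-learn on top of the embeddings endpoint; the training examples, labels, and prediction are illustrative:

```python
import numpy as np
import openai
from sklearn.linear_model import LogisticRegression

client = openai.OpenAI(api_key="mx4-sk-...", base_url="https://api.mx4.ai/v1")

# Tiny illustrative training set: Arabic support messages with intent labels.
texts = [
    "أريد إلغاء اشتراكي",            # I want to cancel my subscription
    "كيف أعيد تعيين كلمة المرور؟",   # How do I reset my password?
    "فاتورتي غير صحيحة",             # My invoice is incorrect
    "نسيت كلمة السر",                # I forgot my password
]
labels = ["cancellation", "account", "billing", "account"]

resp = client.embeddings.create(model="mx4-embed-v1", input=texts)
X = np.array([d.embedding for d in resp.data])

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# "Help me recover my account" — likely classified as "account" given this data.
new = client.embeddings.create(
    model="mx4-embed-v1", input="ساعدوني في استرجاع حسابي"
).data[0].embedding
print(clf.predict([new]))
```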
Best Practices
✓ Batch Requests
Batch multiple texts in a single request to improve throughput and reduce overhead.
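A simple batching helper, sketched with a placeholder `batch_size` since actual limits depend on your plan:

```python
def embed_in_batches(client, texts: list[str], batch_size: int = 64) -> list[list[float]]:
    # batch_size is illustrative; check your plan's per-request limit.
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = client.embeddings.create(
            model="mx4-embed-v1",
            input=texts[i:i + batch_size],
        )
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```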
✓ Cache Embeddings
Store embeddings in a vector database (Qdrant, Weaviate, pgvector) to avoid recomputing. Caching reduces latency and cost.
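A minimal file-based cache keyed on model and text; in production you would persist vectors in one of the vector databases named above:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".embedding_cache")  # illustrative local cache
CACHE_DIR.mkdir(exist_ok=True)

def cached_embedding(client, text: str, model: str = "mx4-embed-v1") -> list[float]:
    # Key on model + text so a model change invalidates stale vectors.
    key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    vec = client.embeddings.create(model=model, input=text).data[0].embedding
    path.write_text(json.dumps(vec))
    return vec
```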
✓ Normalize Before Comparison
Normalize embeddings before comparison to make cosine similarity or dot product consistent at scale.
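A defensive normalization helper; it is a no-op for vectors that are already unit-length, and it guarantees dot product equals cosine similarity when mixing vectors from different sources:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    # L2 normalization; guard against the zero vector.
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```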
✓ Chunk Arabic Text Carefully
Arabic text often expands after tokenization. Start with moderate chunk sizes and tune based on retrieval quality. Use sentence boundaries instead of fixed character counts.
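A sentence-boundary chunker sketch; the `max_chars` value is a starting point to tune against retrieval quality, not a recommendation:

```python
import re

def chunk_sentences(text: str, max_chars: int = 800) -> list[str]:
    # Split on Arabic (؟) and Latin (. ! ?) sentence-ending punctuation,
    # then pack whole sentences into chunks up to max_chars.
    sentences = re.split(r"(?<=[.!?؟])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```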
Example: Semantic Search with Python
```python
import openai
import numpy as np

# MX4 is OpenAI-compatible — use the standard SDK
client = openai.OpenAI(
    api_key="mx4-sk-...",
    base_url="https://api.mx4.ai/v1"
)

# Embed a query in Arabic
query = "ما هي السيادة على البيانات؟"  # What is data sovereignty?
query_embedding = client.embeddings.create(
    model="mx4-embed-v1",
    input=query
).data[0].embedding

# Embed a document corpus (batch for efficiency)
documents = [
    "السيادة على البيانات تعني بقاء البيانات في البلد الذي تم جمعها فيه.",
    "عاصمة فرنسا هي باريس.",
    "التشفير يحمي البيانات أثناء النقل وفي حالة السكون.",
    "يتطلب نظام حماية البيانات الشخصية (PDPL) معالجة البيانات محلياً.",
]

doc_response = client.embeddings.create(
    model="mx4-embed-v1",
    input=documents
)
doc_embeddings = [d.embedding for d in doc_response.data]

# Dot product = cosine similarity (MX4 embeddings are L2-normalized)
similarities = [np.dot(query_embedding, doc_emb) for doc_emb in doc_embeddings]
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

for idx, score in ranked[:3]:
    print(f"[{score:.3f}] {documents[idx]}")
```
Example: Batch Embeddings with Node.js
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MX4_API_KEY,
  baseURL: "https://api.mx4.ai/v1",
});

async function embedDocuments(texts) {
  const response = await client.embeddings.create({
    model: "mx4-embed-v1",
    input: texts, // Batch size depends on your plan
  });

  return response.data.map((item) => ({
    index: item.index,
    embedding: item.embedding, // vector array
  }));
}

// Usage
const docs = [
  "برنامج التحول الرقمي في المملكة العربية السعودية",
  "Digital public services framework in Saudi Arabia",
];

const embeddings = await embedDocuments(docs);
console.log(`Embedded ${embeddings.length} documents (${embeddings[0].embedding.length} dimensions)`);
```