Embeddings
Get a vector representation of a given input — optimized for Arabic morphology with dialect-aware tokenization.
Available Models
| Model | Dimensions | Max Tokens | Arabic MTEB | Best For |
|---|---|---|---|---|
| mx4-embed-v1 | 1536 | 8,192 | 74.2% | General purpose, semantic search |
| mx4-embed-v1-large | 3072 | 8,192 | 78.6% | High-precision retrieval, classification |
Arabic MTEB scores measured on our internal benchmark suite covering MSA, Gulf, Levantine, and Egyptian dialects.
Request Body
- `input` (string or array, required): The text(s) to embed. Pass a single string or an array of up to 50 strings for batch processing.
- `model` (string, required): ID of the model to use, either `mx4-embed-v1` (1536-dim) or `mx4-embed-v1-large` (3072-dim).
- `encoding_format` (string, optional): Output format, either `float` (default) or `base64`. Use `base64` for bandwidth-sensitive applications.
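If you request `base64` output, decode the payload back into floats client-side. A minimal sketch, assuming MX4 follows the OpenAI convention of base64-encoding the raw little-endian float32 buffer (an assumption; verify against an actual response):

```python
import base64
import numpy as np

def decode_embedding(b64_payload: str) -> np.ndarray:
    """Decode a base64 embedding payload into a float array.

    Assumes the payload is a raw little-endian float32 buffer,
    as in the OpenAI API; confirm this holds for MX4.
    """
    raw = base64.b64decode(b64_payload)
    return np.frombuffer(raw, dtype="<f4")

# decode_embedding(response.data[0].embedding).shape -> (1536,) for mx4-embed-v1
```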
Request
```bash
curl https://api.mx4.ai/v1/embeddings \
  -H "Authorization: Bearer $MX4_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["السيادة على البيانات مطلب أساسي", "Data sovereignty is essential"],
    "model": "mx4-embed-v1"
  }'
```
Response
```json
{
  "object": "list",
  "model": "mx4-embed-v1",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0123, -0.0456, 0.0789, "...1536 floats"]
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": [0.0234, -0.0567, 0.0891, "...1536 floats"]
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "total_tokens": 18
  }
}
```
Arabic Embedding Notes
- **Root-Aware Tokenization:** MX4 embeddings use a custom Arabic tokenizer that preserves morphological roots. The word "كتبوا" (they wrote) shares vector proximity with "كتاب" (book) and "مكتبة" (library) — something standard BPE tokenizers miss entirely (see the sketch after this list).
- **Dialect Coverage:** Trained on MSA, Gulf, Egyptian, Levantine, and Maghrebi corpora. Cross-dialect retrieval accuracy is ~12% higher than multilingual-e5-large.
- **Mixed-Language Support:** Handles Arabic-English code-switching common in Gulf business contexts. Embedding a mixed sentence like "نحتاج meeting بعد الظهر" (we need a meeting this afternoon) produces coherent vectors.
- **Diacritics Handling:** Embeddings are stable with or without tashkeel (diacritical marks). وَلَد and ولد (both "boy") produce near-identical vectors.
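These properties are easy to sanity-check. A minimal sketch that embeds morphological relatives of the root ك-ت-ب, an unrelated word, and a diacritized/undiacritized pair (the commented expectations are illustrative, not measured output):

```python
import numpy as np
import openai

client = openai.OpenAI(api_key="mx4-sk-...", base_url="https://api.mx4.ai/v1")

words = ["كتبوا", "كتاب", "مكتبة", "سيارة", "وَلَد", "ولد"]
vecs = [d.embedding for d in client.embeddings.create(
    model="mx4-embed-v1", input=words
).data]

def sim(a, b):
    return float(np.dot(a, b))  # vectors are L2-normalized, so dot = cosine

print(sim(vecs[0], vecs[1]))  # كتبوا vs كتاب: expect high (shared root)
print(sim(vecs[0], vecs[3]))  # كتبوا vs سيارة: expect low (unrelated)
print(sim(vecs[4], vecs[5]))  # وَلَد vs ولد: expect near 1.0 (diacritics-stable)
```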
Use Cases
Semantic Search
Find documents matching a query by comparing embedding vectors. Ideal for Arabic knowledge bases where keyword matching fails due to morphological complexity.
RAG Retrieval
Power retrieval-augmented generation by embedding document chunks and retrieving the top-k most relevant passages for context injection.
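The injection step itself is a few lines. A sketch reusing `query`, `documents`, and `ranked` from the semantic-search example further down; the chat model name here is a placeholder, not something these docs define:

```python
# Concatenate the top-3 retrieved passages as grounding context
top_passages = "\n".join(documents[idx] for idx, _ in ranked[:3])

completion = client.chat.completions.create(
    model="mx4-chat",  # placeholder model name; substitute your generation model
    messages=[
        # System prompt (Arabic): "Answer based only on the following context:"
        {"role": "system", "content": "أجب اعتماداً على السياق التالي فقط:\n" + top_passages},
        {"role": "user", "content": query},
    ],
)
print(completion.choices[0].message.content)
```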
Duplicate Detection
Identify duplicate or near-duplicate documents using cosine similarity — across MSA and dialect variants of the same content.
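A minimal sketch of pairwise duplicate detection over a small corpus; the 0.95 threshold is an illustrative assumption to tune on your own data:

```python
import numpy as np

def find_duplicates(embeddings, threshold=0.95):
    """Return (i, j, score) for pairs above the similarity threshold.

    Embeddings are L2-normalized, so the matrix product gives
    pairwise cosine similarities directly.
    """
    vecs = np.array(embeddings)
    sims = vecs @ vecs.T
    return [
        (i, j, float(sims[i, j]))
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
        if sims[i, j] >= threshold
    ]
```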
Classification
Use embeddings as features for downstream classifiers — sentiment analysis, topic categorization, intent detection in Arabic customer support.
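A toy sketch of the pattern with scikit-learn; it assumes the `client` from the Python example further down, and a real classifier would need far more than two training examples:

```python
from sklearn.linear_model import LogisticRegression

train_texts = [
    "الخدمة كانت ممتازة والتوصيل سريع",  # "Service was excellent, delivery fast" (positive)
    "تجربة سيئة ولن أكرر الطلب",  # "Bad experience, won't order again" (negative)
]
train_labels = ["positive", "negative"]

# Embeddings become the feature vectors
X = [d.embedding for d in client.embeddings.create(
    model="mx4-embed-v1", input=train_texts
).data]

clf = LogisticRegression(max_iter=1000)
clf.fit(X, train_labels)

new_vec = client.embeddings.create(
    model="mx4-embed-v1", input="المنتج رائع جداً"  # "The product is great"
).data[0].embedding
print(clf.predict([new_vec])[0])
```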
Best Practices
✓ Batch Requests
Send up to 50 texts in a single request for optimal throughput. Batching reduces per-text latency by ~60% compared to individual calls.
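For corpora larger than the 50-text limit, split into batches client-side. A sketch, assuming the `client` from the Python example further down:

```python
def embed_all(texts, batch_size=50):
    """Embed an arbitrarily long list by splitting it into API-sized
    batches (50 is the documented per-request input limit)."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model="mx4-embed-v1", input=batch)
        vectors.extend(d.embedding for d in response.data)
    return vectors
```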
✓ Cache Embeddings
Store embeddings in a vector database (Qdrant, Weaviate, pgvector) to avoid recomputing. Embeddings are deterministic — the same input always produces the same output.
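Because outputs are deterministic, a content hash (keyed by model name, since the two models produce different vectors) makes a safe cache key. A minimal in-memory sketch; in production, point this at your vector database instead:

```python
import hashlib

_cache: dict[str, list[float]] = {}

def cached_embedding(text: str, model: str = "mx4-embed-v1") -> list[float]:
    """Content-addressed cache: same input and model, same vector."""
    key = hashlib.sha256(f"{model}:{text}".encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = client.embeddings.create(
            model=model, input=text
        ).data[0].embedding
    return _cache[key]
```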
✓ Normalize Before Comparison
MX4 embeddings are already L2-normalized, so cosine similarity equals dot product. Use dot product for faster comparisons at scale.
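You can verify the normalization claim yourself, and defensively re-normalize vectors from other sources before mixing them into the same index:

```python
import numpy as np

vec = np.array(query_embedding)  # from the semantic-search example below
print(np.linalg.norm(vec))       # expect a value very close to 1.0

# For vectors of unknown provenance, normalize before treating
# dot product as cosine similarity
unit = vec / np.linalg.norm(vec)
```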
✓ Chunk Arabic Text Carefully
Arabic text typically yields more tokens than comparable English after tokenization. Aim for 256–512 token chunks for retrieval, and split on sentence boundaries rather than fixed character counts.
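A greedy sentence-boundary chunker as a starting point. The character budget stands in for a token budget (the chars-per-token ratio is an assumption), so count with the model's tokenizer when you need exact limits:

```python
import re

def chunk_arabic(text: str, max_chars: int = 1000) -> list[str]:
    """Split on Arabic and Latin sentence punctuation, then greedily
    pack sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?؟؛])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```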
Example: Semantic Search with Python
```python
import openai
import numpy as np

# MX4 is OpenAI-compatible — use the standard SDK
client = openai.OpenAI(
    api_key="mx4-sk-...",
    base_url="https://api.mx4.ai/v1"
)

# Embed a query in Arabic
query = "ما هي السيادة على البيانات؟"  # What is data sovereignty?
query_embedding = client.embeddings.create(
    model="mx4-embed-v1",
    input=query
).data[0].embedding

# Embed a document corpus (batch for efficiency)
documents = [
    "السيادة على البيانات تعني بقاء البيانات في البلد الذي تم جمعها فيه.",
    "عاصمة فرنسا هي باريس.",
    "التشفير يحمي البيانات أثناء النقل وفي حالة السكون.",
    "يتطلب نظام حماية البيانات الشخصية (PDPL) معالجة البيانات محلياً.",
]

doc_response = client.embeddings.create(
    model="mx4-embed-v1",
    input=documents
)
doc_embeddings = [d.embedding for d in doc_response.data]

# Dot product = cosine similarity (MX4 embeddings are L2-normalized)
similarities = [np.dot(query_embedding, doc_emb) for doc_emb in doc_embeddings]
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

for idx, score in ranked[:3]:
    print(f"[{score:.3f}] {documents[idx]}")
```
Example: Batch Embeddings with Node.js
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MX4_API_KEY,
  baseURL: "https://api.mx4.ai/v1",
});

async function embedDocuments(texts) {
  const response = await client.embeddings.create({
    model: "mx4-embed-v1",
    input: texts, // Up to 50 texts per batch
  });

  return response.data.map((item) => ({
    index: item.index,
    embedding: item.embedding, // 1536-dim float array
  }));
}

// Usage
const docs = [
  "نظام الحوكمة الرقمية في المملكة العربية السعودية",
  "Digital governance framework in Saudi Arabia",
];

const embeddings = await embedDocuments(docs);
console.log(`Embedded ${embeddings.length} documents (${embeddings[0].embedding.length} dimensions)`);
```