
Atlas Serve

High-performance inference engine optimized for Arabic.

Last updated on February 2, 2026

Overview

Atlas Serve is MX4's proprietary high-performance inference engine, optimized for Arabic language processing and MENA region requirements. It combines PagedAttention, continuous batching, and Arabic-aware tokenization to deliver high throughput while maintaining security and data sovereignty.

- 4x higher throughput than standard pipelines
- 30% token reduction for Arabic text
- <50ms average response time

Key Technologies

PagedAttention

Advanced attention mechanism that optimizes memory usage and enables efficient batching of requests with different sequence lengths.

Benefit: Up to 4x improvement in throughput for variable-length sequences, common in Arabic text processing.
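
The core idea can be sketched in a few lines. The following is an illustration of paged KV-cache bookkeeping, not Atlas Serve's actual internals: each sequence is assigned fixed-size cache blocks on demand, so memory grows with the tokens actually generated instead of being preallocated for the maximum sequence length.

paged_kv_cache_sketch.py

BLOCK_SIZE = 16  # tokens stored per KV-cache block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Record one new token, allocating a fresh block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]  # physical block holding this token

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]))  # 3 blocks for 40 tokens, not a max-length slab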

Continuous Batching

Dynamic batching that continuously groups incoming requests for optimal GPU utilization, reducing latency and increasing throughput.

Benefit: Minimizes idle GPU time and maximizes inference efficiency across varying request patterns.
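
The scheduling policy can be illustrated with a toy loop (stand-ins only; real decode steps and admission control are far more involved): finished sequences free their batch slots immediately, and queued requests join at the next step rather than waiting for the whole batch to drain.

continuous_batching_sketch.py

import collections
import random

queue = collections.deque(f"req-{i}" for i in range(8))  # waiting requests
active = {}                                              # req_id -> tokens left to generate
MAX_BATCH = 4

steps = 0
while queue or active:
    # Admit new requests into any free batch slots.
    while queue and len(active) < MAX_BATCH:
        active[queue.popleft()] = random.randint(2, 6)
    # One decode step for every active sequence.
    for req_id in list(active):
        active[req_id] -= 1
        if active[req_id] == 0:  # finished: its slot frees up this step
            del active[req_id]
    steps += 1

print(f"served 8 requests in {steps} decode steps")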

Arabic-Optimized Tokenization

Custom tokenizer trained specifically on Arabic text corpora, understanding morphological patterns and reducing token fragmentation.

Benefit: 30% reduction in token count for Arabic text, lowering costs and improving context window utilization.
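
To estimate this saving on your own corpus, count tokens under both tokenizers. The harness below is only a sketch: the two encode callables are stand-ins so it runs, since the API for loading the real tokenizers is not documented here.

token_reduction_sketch.py

def reduction(text, baseline_encode, optimized_encode):
    """Fraction of tokens saved by the optimized tokenizer on this text."""
    base = len(baseline_encode(text))
    return 1 - len(optimized_encode(text)) / base

# Stand-in tokenizers so the sketch is runnable; substitute your baseline
# tokenizer and the Atlas tokenizer in practice.
byte_level = lambda s: list(s.encode("utf-8"))
word_level = lambda s: s.split()

sample = "اكتب مقالة عن الذكاء الاصطناعي"  # "Write an article about artificial intelligence"
print(f"{reduction(sample, byte_level, word_level):.0%} fewer tokens")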

Performance Benchmarks

Arabic Language Tasks

Task                  Atlas Serve        Standard Pipeline   Improvement
Question Answering    1,250 tokens/sec   320 tokens/sec      3.9x
Text Generation       980 tokens/sec     250 tokens/sec      3.9x
Summarization         1,180 tokens/sec   310 tokens/sec      3.8x

Optimization Strategies

Request Batching

Group multiple requests together to maximize GPU utilization and reduce per-request overhead.
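
On the client side, this can be approximated with a micro-batcher (an assumed helper, not an MX4 API): hold requests for a short window or until the batch is full, then submit them together.

micro_batch_sketch.py

import time

def micro_batch(submit, requests, window_s=0.01, max_batch=32):
    """Group incoming requests and hand each full (or timed-out) batch to submit()."""
    batch, deadline = [], time.monotonic() + window_s
    for req in requests:
        batch.append(req)
        if len(batch) >= max_batch or time.monotonic() >= deadline:
            submit(batch)
            batch, deadline = [], time.monotonic() + window_s
    if batch:
        submit(batch)  # flush the final partial batch

micro_batch(lambda b: print(f"submitting batch of {len(b)} requests"),
            [f"prompt-{i}" for i in range(70)])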

KV Cache Sharing

Share key-value caches across similar requests to reduce memory and computation overhead.
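
The common case is a shared system prompt. The sketch below (illustrative bookkeeping, not Atlas Serve internals) keys one cached KV entry by the shared prefix, so only each request's suffix needs fresh attention computation.

kv_sharing_sketch.py

prefix_cache = {}  # shared prefix text -> id of its cached KV entry

def kv_entry_for(prompt, shared_prefix):
    if prompt.startswith(shared_prefix):
        entry = prefix_cache.setdefault(shared_prefix, f"kv-{len(prefix_cache)}")
        return entry, prompt[len(shared_prefix):]  # only the suffix is new work
    return None, prompt                            # no shared prefix: full compute

SYSTEM = "أنت مساعد مفيد. "  # "You are a helpful assistant."
for question in ("ما هي العاصمة؟", "اشرح التعلم الآلي"):  # "What is the capital?", "Explain machine learning"
    print(kv_entry_for(SYSTEM + question, SYSTEM))        # both reuse the same kv-0 entry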

Prompt Caching

Cache precomputed embeddings and attention states for repeated prompts to improve latency.
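
Conceptually this is a cache with a time-to-live, matching the prompt_cache_ttl option in the configuration below. A minimal sketch, where the cached "state" is a placeholder for precomputed embeddings and attention states:

prompt_cache_sketch.py

import time

class PromptCache:
    def __init__(self, ttl_s=3600):
        self.ttl_s = ttl_s
        self.entries = {}  # prompt -> (cached_state, stored_at)

    def get(self, prompt):
        hit = self.entries.get(prompt)
        if hit and time.monotonic() - hit[1] < self.ttl_s:
            return hit[0]  # reuse the precomputed state
        return None        # missing or expired: recompute and put() again

    def put(self, prompt, state):
        self.entries[prompt] = (state, time.monotonic())

cache = PromptCache(ttl_s=3600)
cache.put("shared system prompt", "precomputed-states")
print(cache.get("shared system prompt"))  # hit until the TTL elapses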

Speculative Decoding

Use a smaller draft model to propose the next few tokens, then verify them with the larger model, keeping the tokens both agree on for faster generation.
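
The draft-and-verify loop looks roughly like this (toy stand-ins for both models): the draft proposes k tokens and the large model keeps the longest prefix it agrees with, so several tokens can be accepted per large-model pass.

speculative_decoding_sketch.py

def draft_model(context, k=4):
    """Cheap model: propose k candidate next tokens."""
    return [f"t{len(context) + i}" for i in range(k)]

def large_model_agrees(context, token):
    """Toy verifier standing in for the large model's own prediction."""
    return not token.endswith("3")

def speculative_step(context):
    accepted = []
    for token in draft_model(context):
        if large_model_agrees(context + accepted, token):
            accepted.append(token)
        else:
            break  # first rejection: the large model supplies the next token instead
    return accepted

print(speculative_step(["t0"]))  # ['t1', 't2'] before the toy rejection at t3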

Configuration

Atlas Serve can be configured for different deployment scenarios and performance requirements.

atlas-serve-config.yaml

# Atlas Serve Configuration
serve:
  model: mx4-atlas-core
  optimization:
    paged_attention: true
    continuous_batching: true
    max_batch_size: 32
    max_sequence_length: 4096

  arabic_optimization:
    custom_tokenizer: true
    morphological_awareness: true

  caching:
    kv_cache_sharing: true
    prompt_cache_ttl: 3600

  performance:
    target_latency: 50ms
    throughput_priority: high

  security:
    enclave_mode: true
    activity_journal: true

Integration & Deployment

Atlas Serve integrates seamlessly with Atlas Runtime for secure, high-performance inference.

atlas_serve_integration.py

from mx4 import AtlasServe

# Initialize Atlas Serve with configuration
serve = AtlasServe(
    model="mx4-atlas-core",
    config="atlas-serve-config.yaml",
    gpu_memory_utilization=0.9  # Maximize GPU usage
)

# High-performance inference with streaming
response = serve.generate(
    prompt="اكتب مقالة عن الذكاء الاصطناعي",  # "Write an article about artificial intelligence"
    max_tokens=500,
    temperature=0.7,
    stream=True
)

for chunk in response:
    print(chunk.text, end="", flush=True)

print(f"\nTokens/sec: {response.throughput}")
print(f"Latency: {response.latency_ms}ms")

Scaling & Load Balancing

Deploy multiple Atlas Serve instances for high availability and load distribution:

atlas-serve-cluster.yaml

# Multi-instance load balancing configuration
load_balancer:
  algorithm: least_connections
  health_check_interval: 10s

instances:
  - name: serve-1
    model: mx4-atlas-core
    gpus: [0, 1]
    max_batch_size: 32

  - name: serve-2
    model: mx4-atlas-core
    gpus: [2, 3]
    max_batch_size: 32

  - name: serve-lite
    model: mx4-atlas-lite
    gpus: [4]
    max_batch_size: 64

routing:
  default: serve-1
  fallback: serve-lite
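
The least_connections policy above routes each request to the instance with the fewest in-flight requests. A minimal client-side illustration of the algorithm (the load balancer itself is the authoritative implementation):

least_connections_sketch.py

in_flight = {"serve-1": 0, "serve-2": 0, "serve-lite": 0}

def pick_instance():
    """Choose the instance currently handling the fewest requests."""
    return min(in_flight, key=in_flight.get)

for i in range(5):
    target = pick_instance()
    in_flight[target] += 1  # request dispatched; decrement on completion
    print(f"request {i} -> {target}")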

Note: Atlas Serve is optimized for Arabic language tasks. For general-purpose English tasks, consider using standard inference engines. See Atlas Runtime documentation for deployment options.