Atlas Serve
Inference engine optimized for Arabic workloads on sovereign infrastructure.
Overview
Atlas Serve is MX4's inference engine for Arabic language processing and MENA region requirements. It focuses on low‑latency routing, efficient batching, and predictable operations on customer infrastructure.
Key Technologies
PagedAttention
A memory-management technique for the attention key-value (KV) cache that stores cache entries in fixed-size blocks, reducing fragmentation and enabling efficient batching of requests with different sequence lengths.
Benefit: Improves throughput for variable‑length sequences common in Arabic text processing.
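The mechanism can be illustrated with a toy block table. This is a minimal sketch of the general PagedAttention idea, not Atlas Serve internals; BLOCK_SIZE and the class names are invented for the example.

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

class BlockTable:
    def __init__(self):
        self.blocks = []      # physical block IDs, in logical order
        self.num_tokens = 0

    def append_token(self, allocator):
        # A new physical block is needed only at block boundaries, so a
        # sequence wastes at most one partially filled block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
short_seq, long_seq = BlockTable(), BlockTable()
for _ in range(10):
    short_seq.append_token(allocator)
for _ in range(500):
    long_seq.append_token(allocator)
print(len(short_seq.blocks), len(long_seq.blocks))  # 1 32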
Continuous Batching
Iteration-level batching that admits new requests into the running batch at each decode step, rather than waiting for the current batch to finish, reducing latency and increasing throughput.
Benefit: Minimizes idle GPU time and maximizes inference efficiency across varying request patterns.
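A minimal scheduling loop illustrates the idea; the request format and decode_step function are invented for the sketch and do not reflect Atlas Serve's scheduler.

from collections import deque

# Toy continuous-batching loop: requests join the running batch at
# every decode step instead of waiting for the current batch to drain.
# A request here is just a dict with a token budget; a real engine
# would run one forward pass per step for the whole batch.

def decode_step(batch):
    finished = []
    for req in batch:
        req["remaining"] -= 1        # emit one token per request
        if req["remaining"] == 0:
            finished.append(req)
    return finished

def serve_loop(waiting, max_batch_size=4):
    running = []
    while waiting or running:
        # Admit new requests between steps, up to the batch limit.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        done = decode_step(running)
        # Finished requests leave immediately, freeing slots.
        running = [r for r in running if r not in done]

requests = deque({"id": i, "remaining": n} for i, n in enumerate([3, 10, 5, 2, 8]))
serve_loop(requests)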
Arabic-Optimized Tokenization
Custom tokenizer trained on Arabic text corpora, with a vocabulary that reflects Arabic morphological patterns such as roots, affixes, and clitics.
Benefit: Reduces token fragmentation for Arabic text and improves context window utilization.
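Fragmentation is commonly quantified as fertility: the average number of tokens produced per whitespace-delimited word. The sketch below measures it with stand-in tokenizers; it assumes nothing about the MX4 tokenizer API.

# Fertility = tokens per word; lower means less fragmentation and
# better context-window utilization for the same text.

def fertility(encode, text: str) -> float:
    words = text.split()
    return len(encode(text)) / len(words)

# Stand-in tokenizers for illustration: a byte-level fallback fragments
# Arabic heavily, while a word-level stand-in represents an
# Arabic-aware vocabulary.
byte_level = lambda s: list(s.encode("utf-8"))
word_level = lambda s: s.split()

sample = "الذكاء الاصطناعي يغير العالم"  # "AI is changing the world"
print(fertility(byte_level, sample))  # many tokens per word
print(fertility(word_level, sample))  # 1.0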
Performance Validation
Performance depends on model choice, hardware, and deployment topology. We provide benchmark guidance and validate throughput and latency during pilots.
Optimization Strategies
Request Batching
Group multiple requests together to maximize GPU utilization and reduce per-request overhead.
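A minimal sketch of the grouping step, assuming a simple in-memory queue; in practice the engine's scheduler (see Continuous Batching above) forms these groups dynamically.

def make_batches(requests, max_batch_size):
    # Slice the queue into fixed-size groups; each group is served by a
    # single forward pass, amortizing per-call overhead (scheduling,
    # kernel launches) across all requests in the batch.
    for i in range(0, len(requests), max_batch_size):
        yield requests[i:i + max_batch_size]

prompts = [f"prompt-{i}" for i in range(100)]
batches = list(make_batches(prompts, max_batch_size=32))
print([len(b) for b in batches])  # [32, 32, 32, 4]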
KV Cache Sharing
Share key-value caches across similar requests to reduce memory and computation overhead.
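One common form is prefix sharing: requests with an identical prompt prefix (for example, a shared system prompt) reuse the same cached attention state. A toy sketch, with invented names and a string standing in for the real KV tensors:

import hashlib

# Toy prefix cache: requests that share a prompt prefix reuse one
# KV-cache entry instead of recomputing it per request.

class PrefixKVCache:
    def __init__(self):
        self.entries = {}   # prefix hash -> [kv_state, refcount]

    def get_or_compute(self, prefix: str, compute_kv):
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key not in self.entries:
            self.entries[key] = [compute_kv(prefix), 0]
        self.entries[key][1] += 1          # shared across requests
        return self.entries[key][0]

cache = PrefixKVCache()
system = "أنت مساعد مفيد."   # shared system prompt: "You are a helpful assistant."
kv_a = cache.get_or_compute(system, compute_kv=lambda p: f"kv({len(p)} chars)")
kv_b = cache.get_or_compute(system, compute_kv=lambda p: f"kv({len(p)} chars)")
assert kv_a is kv_b   # second request reuses the cached prefix state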
Prompt Caching
Cache precomputed embeddings and attention states for repeated prompts to improve latency.
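This pairs with the prompt_cache_ttl setting in the configuration below. A minimal TTL cache sketch, with invented names and a string standing in for the precomputed state:

import time

# Minimal TTL cache for precomputed prompt state: repeated prompts hit
# the cache until their entry expires.

class PromptCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}   # prompt -> (state, expiry time)

    def get(self, prompt: str):
        entry = self.store.get(prompt)
        if entry and entry[1] > time.monotonic():
            return entry[0]           # cache hit
        self.store.pop(prompt, None)  # expired or missing
        return None

    def put(self, prompt: str, state):
        self.store[prompt] = (state, time.monotonic() + self.ttl)

cache = PromptCache(ttl_seconds=300)
if (state := cache.get("summarize this report")) is None:
    state = "precomputed embeddings/attention state"  # expensive path
    cache.put("summarize this report", state)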
Speculative Decoding
Use a smaller draft model to propose the next several tokens, then verify them with the larger model for faster generation.
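A toy greedy version of the draft-propose/target-verify loop, with invented stand-in models; accepting the longest matching prefix is what keeps the output identical to decoding with the target model alone.

# Toy greedy speculative decoding: the draft proposes k tokens; the
# target verifies them and accepts the longest prefix it agrees with,
# appending its own correction at the first mismatch.

def speculative_decode(draft_next, target_next, context, k=4, max_tokens=16):
    out = list(context)
    while len(out) - len(context) < max_tokens:
        # 1) Draft model proposes k tokens cheaply, one at a time.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2) Target verifies; a real engine scores all k positions in a
        #    single forward pass.
        for tok in proposal:
            if target_next(out) == tok:
                out.append(tok)               # accepted
            else:
                out.append(target_next(out))  # target's correction
                break
    return out[len(context):len(context) + max_tokens]

target_next = lambda seq: (len(seq) * 7) % 13                          # toy "model"
draft_next = lambda seq: (len(seq) * 7) % 13 if len(seq) % 3 else 0    # mostly agrees
# Output matches target-only greedy decoding regardless of draft quality.
assert speculative_decode(draft_next, target_next, [1, 2, 3]) == \
       [(n * 7) % 13 for n in range(3, 19)]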
Configuration
Atlas Serve can be configured for different deployment scenarios and performance requirements.
# Atlas Serve Configuration
serve:
  model: mx4-atlas-core
  optimization:
    paged_attention: true
    continuous_batching: true
    max_batch_size: 0        # tune per hardware
    max_sequence_length: 0   # tune per model

  arabic_optimization:
    custom_tokenizer: true
    morphological_awareness: true

  caching:
    kv_cache_sharing: true
    prompt_cache_ttl: 0      # optional

  performance:
    throughput_priority: balanced

  security:
    enclave_mode: true
    activity_journal: true
Integration & Deployment
Atlas Serve integrates with Atlas Runtime for secure, high-performance inference:
from mx4 import AtlasServe

# Initialize Atlas Serve with configuration
serve = AtlasServe(
    model="mx4-atlas-core",
    config="atlas-serve-config.yaml",
    gpu_memory_utilization=0.9  # Maximize GPU usage
)

# High-performance inference with streaming
response = serve.generate(
    prompt="اكتب مقالة عن الذكاء الاصطناعي",  # "Write an article about artificial intelligence"
    max_tokens=500,
    temperature=0.7,
    stream=True
)

for chunk in response:
    print(chunk.text, end="", flush=True)
Scaling & Load Balancing
Deploy multiple Atlas Serve instances for high-availability and load distribution:
# Multi-instance load balancing configuration
load_balancer:
  algorithm: least_connections
  health_check_interval: 10s

instances:
  - name: serve-1
    model: mx4-atlas-core
    gpus: [0, 1]
    max_batch_size: 32

  - name: serve-2
    model: mx4-atlas-core
    gpus: [2, 3]
    max_batch_size: 32

  - name: serve-lite
    model: mx4-atlas-lite
    gpus: [4]
    max_batch_size: 64

routing:
  default: serve-1
  fallback: serve-lite
Note: Atlas Serve is optimized for Arabic language tasks. For general-purpose English tasks, consider using standard inference engines. See Atlas Runtime documentation for deployment options.