
Rate Limits

Understand MX4 Atlas API rate limits and how to handle them effectively.

Last updated on February 2, 2026

Rate Limit Overview

MX4 Atlas implements rate limiting to ensure fair usage and maintain service stability. Rate limits are applied per API key and are measured in requests per minute (RPM) and tokens per minute (TPM).

Default Limits (Starter Plan)

Requests per minute (RPM): 100
Tokens per minute (TPM): 10,000
Requests per hour: 1,000

Pro & Enterprise: Contact sales for custom limits tailored to your workload.

Rate Limit Headers

Every API response includes headers that indicate your current rate limit status:

x-ratelimit-limit-requests: Maximum requests per minute
x-ratelimit-limit-tokens: Maximum tokens per minute
x-ratelimit-remaining-requests: Remaining requests in the current window
x-ratelimit-remaining-tokens: Remaining tokens in the current window
x-ratelimit-reset-requests: Time at which the request window resets (Unix timestamp)
x-ratelimit-reset-tokens: Time at which the token window resets (Unix timestamp)
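
For example, you can inspect these headers on any response. The sketch below uses the requests library against the chat completions endpoint; the endpoint path, bearer-token auth scheme, and API key are assumptions based on the examples later on this page.

import requests

resp = requests.post(
    "https://api.mx4.ai/v1/chat/completions",
    headers={"Authorization": "Bearer mx4-sk-..."},
    json={
        "model": "mx4-atlas-core",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)

# Each value comes back as a string header on the HTTP response
print("Requests remaining:", resp.headers.get("x-ratelimit-remaining-requests"))
print("Tokens remaining:", resp.headers.get("x-ratelimit-remaining-tokens"))
print("Requests reset at:", resp.headers.get("x-ratelimit-reset-requests"))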

Handling Rate Limits

When you exceed rate limits, the API returns a 429 status code. Implement exponential backoff and retry logic in your applications.

rate_limit_handling.py (Python)

import time
import openai
from openai import OpenAI

client = OpenAI(
    api_key="mx4-sk-...",
    base_url="https://api.mx4.ai/v1"
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="mx4-atlas-core",
                messages=messages
            )
            return response
        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise e
            # Exponential backoff: wait 2^attempt seconds
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time} seconds...")
            time.sleep(wait_time)
    return None
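
A minimal call site might look like this (the prompt is illustrative):

messages = [{"role": "user", "content": "Summarize our API usage guidelines."}]
response = make_request_with_retry(messages)
if response is not None:
    print(response.choices[0].message.content)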

Advanced Retry Strategies

Exponential Backoff with Jitter

Reduce the thundering herd effect by adding jitter to retry delays:

advanced_retry_logic.py (Python)

import time
import random
import openai

# Reuses the `client` configured in rate_limit_handling.py above

def request_with_exponential_backoff(messages, max_retries=5):
    base_delay = 1  # Start with 1 second

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="mx4-atlas-core",
                messages=messages
            )
            return response
        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise e

            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            total_delay = delay + jitter

            print(f"Attempt {attempt + 1}: waiting {total_delay:.2f} seconds")
            time.sleep(total_delay)
        except openai.APIError as e:
            # Don't retry on non-rate-limit errors
            raise e
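
You can also combine retries with the reset headers documented above to wait only as long as needed. The sketch below assumes the 429 response carries x-ratelimit-reset-requests as a Unix timestamp and that the exception exposes the raw HTTP response (e.response in recent openai-python releases); verify both for your setup.

import time
import openai

def request_with_header_aware_backoff(messages, max_retries=5):
    # Reuses the `client` configured earlier on this page
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="mx4-atlas-core",
                messages=messages
            )
        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            reset_at = e.response.headers.get("x-ratelimit-reset-requests")
            if reset_at is not None:
                # Sleep until the window resets (assumed Unix timestamp), at least 1 second
                wait_time = max(float(reset_at) - time.time(), 1)
            else:
                wait_time = 2 ** attempt  # Fall back to plain exponential backoff
            time.sleep(wait_time)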

Monitoring & Optimization

Log Rate Limit Headers

Track remaining tokens/requests in your monitoring system to predict limit breaches.
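
For example, a thin wrapper can record utilization after every call. This sketch uses the SDK's with_raw_response accessor (available in recent openai-python releases) to read the headers, with the standard logging module as a stand-in for your metrics pipeline:

import logging

logger = logging.getLogger("mx4.rate_limits")

def completion_with_logging(messages):
    # with_raw_response exposes the HTTP headers alongside the parsed body
    raw = client.chat.completions.with_raw_response.create(
        model="mx4-atlas-core",
        messages=messages,
    )
    logger.info(
        "requests_remaining=%s tokens_remaining=%s",
        raw.headers.get("x-ratelimit-remaining-requests"),
        raw.headers.get("x-ratelimit-remaining-tokens"),
    )
    return raw.parse()  # Returns the regular ChatCompletion object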

Batch Requests

Group multiple queries into single API calls where possible to reduce request count.
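
One simple pattern, sketched below, is to fold several questions into a single prompt so a batch of queries costs one request instead of many (the prompt format is illustrative, not an official batching API):

def batched_completion(questions):
    # Numbering the questions makes it easy to split the answers afterwards
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return client.chat.completions.create(
        model="mx4-atlas-core",
        messages=[{
            "role": "user",
            "content": "Answer each question, numbering your answers:\n" + numbered,
        }],
    )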

Request Queue

Implement a request queue to smooth out traffic spikes and avoid rate limit bursts.

Token Optimization

Use shorter messages, remove unnecessary context, and implement caching for repeated queries.
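
Caching can be as simple as keying completed responses on the request payload. This in-memory sketch is illustrative; a shared cache such as Redis would be the usual choice in production:

import hashlib
import json

_response_cache = {}

def cached_completion(messages):
    # Identical prompts reuse the stored response instead of spending tokens again
    key = hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = client.chat.completions.create(
            model="mx4-atlas-core",
            messages=messages,
        )
    return _response_cache[key]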

Optimization Techniques

Request Queuing Pattern

request_queue.py (Python)

import asyncio
import time

class RateLimitedQueue:
    def __init__(self, max_rpm=100, max_tpm=10000):
        self.max_rpm = max_rpm
        self.max_tpm = max_tpm
        self.request_times = []  # Timestamps of recent requests
        self.token_usage = []    # (timestamp, estimated tokens) pairs

    async def add_request(self, messages, tokens_estimate=200):
        while True:
            now = time.time()
            # Drop entries older than the 1-minute window
            self.request_times = [t for t in self.request_times if now - t < 60]
            self.token_usage = [(t, n) for t, n in self.token_usage if now - t < 60]

            if (len(self.request_times) < self.max_rpm and
                    sum(n for _, n in self.token_usage) + tokens_estimate < self.max_tpm):
                self.request_times.append(now)
                self.token_usage.append((now, tokens_estimate))
                return

            # Wait before checking the window again
            await asyncio.sleep(1)
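
Usage could look like the following sketch, reusing the imports and client from earlier on this page; the prompts, token estimate, and the synchronous client call are illustrative (the SDK also offers AsyncOpenAI for fully non-blocking code):

async def main():
    limiter = RateLimitedQueue(max_rpm=100, max_tpm=10000)
    prompts = ["Summarize document A", "Summarize document B", "Summarize document C"]

    for prompt in prompts:
        # Blocks until the request fits inside the current one-minute window
        await limiter.add_request([{"role": "user", "content": prompt}], tokens_estimate=300)
        response = client.chat.completions.create(
            model="mx4-atlas-core",
            messages=[{"role": "user", "content": prompt}],
        )
        print(response.choices[0].message.content)

asyncio.run(main())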

Monitoring Best Practices

Alert on Thresholds

Set alerts when remaining tokens drop below 20% of limit to proactively manage load.
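
As a rough sketch, a check like the one below can run after each response; send_alert is a placeholder for whatever your monitoring stack provides:

def check_token_budget(headers, threshold=0.2):
    # Alert when fewer than 20% of tokens remain in the current window
    remaining = int(headers.get("x-ratelimit-remaining-tokens", 0))
    limit = int(headers.get("x-ratelimit-limit-tokens", 1))
    if remaining / limit < threshold:
        send_alert(f"MX4 token budget low: {remaining}/{limit} remaining")  # placeholder hook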

Track Usage Patterns

Monitor peak usage times and adjust request distribution to avoid consistent bottlenecks.

Upgrade Planning

If you consistently use 80% or more of your limits, upgrade your plan for more headroom and better cost efficiency.

Increasing Limits

Higher rate limits are available for enterprise customers. Contact our sales team to discuss your requirements and upgrade your plan.

Plan Limits

Starter: 100 RPM, 10K TPM
Pro: 500 RPM, 50K TPM
Enterprise: Custom limits (1K-10K+ RPM)

Sales: sales@mx4.ai