
Rate Limits

Understand MX4 Atlas API rate limits and how to handle them effectively.

Last updated on February 2, 2026

Rate Limit Overview

MX4 Atlas implements rate limiting to ensure fair usage and maintain service stability. Rate limits are applied per API key and are measured in requests per minute (RPM) and tokens per minute (TPM).

Limits by Plan

Requests per minute (RPM): Varies by plan
Tokens per minute (TPM): Varies by plan
Requests per hour: Varies by plan

View your current limits in Atlas Studio, or contact sales for limits tailored to your workload.

Rate Limit Headers

Every API response includes headers that indicate your current rate limit status:

x-ratelimit-limit-requests: Maximum requests per minute
x-ratelimit-limit-tokens: Maximum tokens per minute
x-ratelimit-remaining-requests: Remaining requests in the current window
x-ratelimit-remaining-tokens: Remaining tokens in the current window
x-ratelimit-reset-requests: Time at which the request window resets (Unix timestamp)
x-ratelimit-reset-tokens: Time at which the token window resets (Unix timestamp)

Header names and windows can vary by deployment. Use the response headers from your environment as the source of truth.
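The examples below use the OpenAI-compatible Python SDK; with that client, the raw HTTP response, including these headers, can be inspected via with_raw_response. A minimal sketch, assuming the MX4 endpoint returns the headers listed above:

read_rate_limit_headers.py

from openai import OpenAI

client = OpenAI(
    api_key="mx4-sk-...",
    base_url="https://api.mx4.ai/v1"
)

# with_raw_response exposes the underlying HTTP response so headers can be read.
raw = client.chat.completions.with_raw_response.create(
    model="mx4-atlas-core",
    messages=[{"role": "user", "content": "ping"}]
)

# Header names follow the list above; confirm them against your deployment.
print("Requests remaining:", raw.headers.get("x-ratelimit-remaining-requests"))
print("Tokens remaining:", raw.headers.get("x-ratelimit-remaining-tokens"))

completion = raw.parse()  # the parsed ChatCompletion object, as usual
print(completion.choices[0].message.content)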

Handling Rate Limits

When you exceed rate limits, the API returns a 429 status code. Implement exponential backoff and retry logic in your applications.

rate_limit_handling.py

import time
import openai
from openai import OpenAI

client = OpenAI(
    api_key="mx4-sk-...",
    base_url="https://api.mx4.ai/v1"
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="mx4-atlas-core",
                messages=messages
            )
            return response
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: wait 2^attempt seconds
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time} seconds...")
            time.sleep(wait_time)

Advanced Retry Strategies

Exponential Backoff with Jitter

Reduce the thundering herd effect by adding jitter to retry delays:

advanced_retry_logic.py

import time
import random
import openai
from openai import OpenAI

client = OpenAI(
    api_key="mx4-sk-...",
    base_url="https://api.mx4.ai/v1"
)

def request_with_exponential_backoff(messages, max_retries=5):
    base_delay = 1  # Start with 1 second

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="mx4-atlas-core",
                messages=messages
            )
            return response
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise

            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            total_delay = delay + jitter

            print(f"Attempt {attempt + 1}: waiting {total_delay:.2f} seconds")
            time.sleep(total_delay)
        except openai.APIError:
            # Don't retry on non-rate-limit errors
            raise

Monitoring & Optimization

Log Rate Limit Headers

Track remaining tokens/requests in your monitoring system to predict limit breaches.

Batch Requests

Group multiple queries into single API calls where possible to reduce request count.
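For example, several short questions can often be combined into one chat completion call. A minimal sketch, assuming a numbered-list prompt format works for your use case (the questions are illustrative):

batch_queries.py

from openai import OpenAI

client = OpenAI(
    api_key="mx4-sk-...",
    base_url="https://api.mx4.ai/v1"
)

questions = [
    "Summarize the document in one sentence.",
    "List three key risks it mentions.",
    "Who is the intended audience?"
]

# One request instead of three: ask for numbered answers matching the numbered questions.
prompt = "Answer each question separately, numbered to match:\n" + "\n".join(
    f"{i + 1}. {q}" for i, q in enumerate(questions)
)

response = client.chat.completions.create(
    model="mx4-atlas-core",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)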

Request Queue

Implement a request queue to smooth out traffic spikes and avoid rate limit bursts.

Token Optimization

Use shorter messages, remove unnecessary context, and implement caching for repeated queries.
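For repeated, identical queries, even a small in-memory cache avoids spending requests and tokens twice. A minimal sketch; the cache key scheme is an assumption, and a production setup would likely add eviction and a shared store:

response_cache.py

import hashlib
import json
from openai import OpenAI

client = OpenAI(
    api_key="mx4-sk-...",
    base_url="https://api.mx4.ai/v1"
)

_cache = {}

def cached_completion(messages, model="mx4-atlas-core"):
    # Key on the exact model + messages payload; only exact repeats hit the cache.
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = client.chat.completions.create(model=model, messages=messages)
    return _cache[key]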

Optimization Techniques

Request Queuing Pattern

request_queue.py

import asyncio
import time

class RateLimitedQueue:
    def __init__(self, max_rpm, max_tpm):
        self.max_rpm = max_rpm  # set based on your plan limits
        self.max_tpm = max_tpm  # set based on your plan limits
        self.request_times = []  # timestamps of recent requests
        self.token_usage = []    # (timestamp, token_estimate) pairs

    async def add_request(self, messages, tokens_estimate=200):
        # Waits until the request (and its token estimate) fits the per-minute limits.
        while True:
            now = time.time()
            # Drop entries older than one minute
            self.request_times = [t for t in self.request_times if now - t < 60]
            self.token_usage = [(t, n) for t, n in self.token_usage if now - t < 60]

            if (len(self.request_times) < self.max_rpm and
                    sum(n for _, n in self.token_usage) + tokens_estimate < self.max_tpm):
                self.request_times.append(now)
                self.token_usage.append((now, tokens_estimate))
                return

            # Wait before checking again
            await asyncio.sleep(1)
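A sketch of how the queue might be used, assuming the async variant of the same OpenAI-compatible client; the limits below are placeholders, so set them from your plan or the response headers:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="mx4-sk-...",
    base_url="https://api.mx4.ai/v1"
)

async def main():
    # Placeholder limits; use the values for your plan.
    queue = RateLimitedQueue(max_rpm=60, max_tpm=10000)
    messages = [{"role": "user", "content": "Hello"}]

    # Blocks until the request fits in the current one-minute window,
    # then the caller issues the request as usual.
    await queue.add_request(messages, tokens_estimate=50)
    response = await client.chat.completions.create(
        model="mx4-atlas-core",
        messages=messages
    )
    print(response.choices[0].message.content)

asyncio.run(main())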

Monitoring Best Practices

Alert on Thresholds

Set alerts when remaining quota drops below a safe threshold to proactively manage load.
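A minimal sketch of such a check; send_alert is a hypothetical hook for whatever alerting system you use, and the threshold is an example value:

alert_on_thresholds.py

ALERT_THRESHOLD = 0.2  # alert when less than 20% of the request quota remains

def check_rate_limit_headroom(headers, send_alert):
    # Header names as documented above; confirm them against your deployment.
    limit = int(headers.get("x-ratelimit-limit-requests", 0))
    remaining = int(headers.get("x-ratelimit-remaining-requests", 0))
    if limit and remaining / limit < ALERT_THRESHOLD:
        send_alert(f"Rate limit headroom low: {remaining}/{limit} requests remaining")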

Track Usage Patterns

Monitor peak usage times and adjust request distribution to avoid consistent bottlenecks.

Upgrade Planning

If you consistently hit limits, upgrade your plan for higher throughput and stability.

Increasing Limits

Higher rate limits are available for enterprise customers. Contact our sales team to discuss your requirements and upgrade your plan.

Plan Limits

Starter: Baseline limits for evaluation
Pro: Higher limits for production pilots
Enterprise: Custom limits defined in contract

Sales: sales@mx4.ai