
Chat Completions

Generate responses for a given conversation.

Last updated on February 2, 2026
POST https://api.mx4.ai/v1/chat/completions

Creates a model response for the given chat conversation. This endpoint is fully compatible with the OpenAI Chat Completions API.

Request Body

model (string, required)

ID of the model to use. See the Model Garden for available options (e.g., 'mx4-atlas-core').

messages (array, required)

A list of messages comprising the conversation so far.

temperature (number, optional)

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

max_tokens (integer, optional)

The maximum number of tokens that can be generated in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length.

top_p (number, optional)

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

stream (boolean, optional)

If set, partial message deltas will be sent. Tokens will be sent as data-only server-sent events as they become available.

presence_penalty (number, optional)

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

frequency_penalty (number, optional)

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text, decreasing the model's likelihood to repeat the same line verbatim.
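
Because the endpoint is OpenAI-compatible, these parameters can be passed straight through the OpenAI Python SDK. The sketch below shows a minimal non-streaming request that combines the sampling parameters; the parameter values are illustrative, not recommendations.

python
import os

import openai

# The endpoint is OpenAI-compatible, so the standard client works
# once base_url points at the MX4 API.
client = openai.OpenAI(
    api_key=os.getenv("MX4_API_KEY"),
    base_url="https://api.mx4.ai/v1",
)

response = client.chat.completions.create(
    model="mx4-atlas-core",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of nucleus sampling in two sentences."},
    ],
    temperature=0.3,        # lower values give more focused, deterministic output
    top_p=0.9,              # nucleus sampling: keep the top 90% probability mass
    max_tokens=200,         # cap on generated tokens
    presence_penalty=0.0,
    frequency_penalty=0.2,  # gently discourage verbatim repetition
)

print(response.choices[0].message.content)
print(response.usage.total_tokens)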

Use Cases

Q&A Systems

Build smart question-answering systems, customer support chatbots, and knowledge base assistants.

Translation

Translate content between Arabic dialects and other languages with cultural context awareness.

Summarization

Create concise summaries of long documents, reports, and conversations in real time.

Content Generation

Generate contextual responses for creative writing, code completion, and content creation.

Example Request

bash
curl https://api.mx4.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MX4_API_KEY" \
  -d '{
    "model": "mx4-atlas-core",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant specialized in Arabic language and MENA markets."
      },
      {
        "role": "user",
        "content": "اشرح لي كيفية استخدام الذكاء الاصطناعي في التجارة الإلكترونية"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

Example Response

json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "mx4-atlas-core",
  "system_fingerprint": "fp_44709d6fcb",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "يمكن استخدام الذكاء الاصطناعي في التجارة الإلكترونية بعدة طرق... "
    },
    "logprobs": null,
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 156,
    "total_tokens": 198
  }
}

Streaming Example

Enable streaming to receive tokens in real-time as they are generated, perfect for responsive chat interfaces and live content generation.

streaming_example.py (python)
import os

import openai

client = openai.OpenAI(
    api_key=os.getenv("MX4_API_KEY"),
    base_url="https://api.mx4.ai/v1"
)

# Stream the chat completion and print tokens as they arrive
stream = client.chat.completions.create(
    model="mx4-atlas-core",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        # "Write a short article about innovation"
        {"role": "user", "content": "اكتب مقالة قصيرة عن الابتكار"}
    ],
    stream=True,
    temperature=0.7
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Best Practices

System Prompts

Use clear system prompts to define the assistant's behavior, tone, and expertise level for consistent responses.

Context Management

Maintain conversation history but trim old messages to stay within token limits and reduce costs.
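
A minimal trimming sketch, assuming a rough estimate of about four characters per token; the helper name and budget are placeholders, and a production implementation should count tokens with the model's actual tokenizer.

python
def trim_history(messages, max_input_tokens=3000):
    """Keep the system prompt plus the most recent turns that fit the budget.

    Token counts are estimated at ~4 characters per token; swap in a real
    tokenizer for production use.
    """
    def est_tokens(msg):
        return max(1, len(msg["content"]) // 4)

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_input_tokens - sum(est_tokens(m) for m in system)
    kept = []
    for msg in reversed(rest):          # walk newest to oldest
        cost = est_tokens(msg)
        if budget - cost < 0:
            break
        kept.append(msg)
        budget -= cost

    return system + list(reversed(kept))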

Temperature Tuning

Use lower temperature (0.3-0.5) for factual tasks, higher (0.8-1.0) for creative generation.

Streaming for UX

Implement streaming for chat applications to show token-by-token generation and improve perceived responsiveness.

Troubleshooting

Tokens Exceeded

The API returns a "context_length_exceeded" error when the input messages plus the requested max_tokens exceed the model's context length.

Solution: Trim conversation history, summarize older messages, or lower the max_tokens parameter.

Rate Limit Errors

API returns 429 status code when rate limits are exceeded.

Solution: Implement retry logic with exponential backoff, or upgrade your plan for higher limits. See the Rate Limits documentation.
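
One simple approach is a retry wrapper with exponential backoff and jitter. A sketch, assuming the OpenAI Python SDK; the helper name, retry count, and delays are arbitrary.

python
import random
import time

import openai

def create_with_backoff(client, max_retries=5, **kwargs):
    """Retry chat completions on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(2 ** attempt + random.random())

Call it in place of client.chat.completions.create, e.g. create_with_backoff(client, model="mx4-atlas-core", messages=[...]).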

Inconsistent Responses

Same prompt produces different outputs across requests.

Solution: Lower the temperature for more consistent outputs (0.0-0.3 is recommended), or reduce top_p to limit sampling diversity.
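
For example, a near-deterministic configuration might look like the sketch below (values are illustrative; even at temperature 0, outputs are not guaranteed to be bit-identical across requests).

python
response = client.chat.completions.create(
    model="mx4-atlas-core",
    messages=[{"role": "user", "content": "List the capital of each GCC country."}],
    temperature=0.0,   # near-greedy decoding for maximal consistency
    top_p=1.0,         # leave nucleus sampling effectively disabled
    max_tokens=200,
)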