Features

Text Generation

Text generation is the core capability of MARA Cloud. When you send a prompt to the API, the model processes your input and returns a generated response, whether that's answering a question, writing content, summarizing a document, or completing a conversation.
This guide walks you through the different ways to generate text, how to choose the right model for your task, how to write effective prompts, and how to manage conversations that span multiple turns. If you're new to working with language model APIs, start here. If you've already made your first request through the Quickstart, this guide will help you get more out of the API.

How to generate text

MARA Cloud offers three approaches to text generation, each suited for different use cases.

Standard (non-streaming)

The simplest approach. You send a request and receive the complete response once generation finishes. Best for batch processing or when you don't need real-time output.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://bczfskny6zqw.poweredby.snova.ai/v1",
    api_key="your-mara-api-key",
)

completion = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant."},
        {"role": "user", "content": "Explain what an API is in one paragraph."},
    ],
)

print(completion.choices[0].message.content)
```

Streaming

Instead of waiting for the full response, streaming delivers tokens as they are generated. This is ideal for chat interfaces, real-time applications, or anywhere you want to show output progressively.
```python
completion = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant."},
        {"role": "user", "content": "Explain what an API is in one paragraph."},
    ],
    stream=True,
)

for chunk in completion:
    # Some chunks (such as the final one) carry no content, so guard against None.
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```

Note: Each streamed chunk may contain multiple tokens. Keep this in mind if you're measuring throughput or calculating tokens per second.

Asynchronous

If your application handles multiple requests concurrently or uses non-blocking I/O, the async client lets you run generation without blocking the main thread.
```python
from openai import AsyncOpenAI
import asyncio

async def main():
    client = AsyncOpenAI(
        base_url="https://bczfskny6zqw.poweredby.snova.ai/v1",
        api_key="your-mara-api-key",
    )

    completion = await client.chat.completions.create(
        model="MiniMax-M2.5",
        messages=[
            {"role": "system", "content": "You are a helpful writing assistant."},
            {"role": "user", "content": "Explain what an API is in one paragraph."},
        ],
    )

    print(completion.choices[0].message.content)

asyncio.run(main())
```

Picking the right model

Not all models are created equal. The right choice depends on what you're building.
| What matters most | Go with |
| --- | --- |
| Complex reasoning, nuanced tasks | A larger model like gpt-oss-120B |
| Speed and low latency | A smaller, faster model |
| Cost efficiency | A smaller model that meets your accuracy bar |
| Maximum accuracy | The largest model you can afford |
The best approach is to experiment. Try a few models from the Model Catalog with your actual prompts and evaluate the results before committing.
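
One way to structure that experiment is a small harness that runs the same prompt through each candidate and collects the outputs for side-by-side review. This is a sketch, not part of the MARA API: the `generate` callable is assumed to be a wrapper around the chat completions call shown above, and the model names you pass in would come from the Model Catalog.

```python
def compare_models(generate, models, prompt):
    """Run one prompt against each candidate model and return the
    outputs keyed by model name, for side-by-side review.

    `generate` is assumed to be a thin wrapper around the chat
    completions call shown above: generate(model, prompt) -> str.
    """
    return {model: generate(model, prompt) for model in models}
```

A handful of representative prompts is usually enough to reveal whether a smaller model clears your accuracy bar before you commit.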

Writing effective prompts

The quality of the model's output depends heavily on how you write your prompt. Good prompts are specific, structured, and provide enough context for the model to understand what you need.

Key elements of a strong prompt

  • Persona: Tell the model who it is. "You are a senior backend engineer" produces very different output than a generic prompt.
  • Context: Give background information. The more relevant context you provide, the better the response.
  • Output format: Be explicit about how you want the answer. "Respond in JSON with keys: title, summary, tags" is far more useful than "give me a summary."
  • Task: Clearly state what you want. Vague instructions lead to vague responses.
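
As a sketch, here is how the four elements might come together in a single request. The persona and context live in the system message; the task and output format go in the user message. The scenario (a documentation review) is illustrative.

```python
# Persona and context belong in the system message.
system_prompt = (
    "You are a senior backend engineer reviewing internal documentation. "  # persona
    "The audience is new hires with no prior exposure to our stack."        # context
)

# Task and output format belong in the user message.
user_prompt = (
    "Summarize the onboarding guide below. "                # task
    "Respond in JSON with keys: title, summary, tags."      # output format
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
```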

Techniques that improve results

  • In-context learning: Include one or two examples of the desired output in your prompt. The model picks up on the pattern.
  • Chain-of-Thought: Ask the model to think step-by-step before giving a final answer. This significantly improves reasoning on complex problems.
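
For example, in-context learning can be expressed directly in the message list: a couple of user/assistant pairs establish the pattern, and the final user message is the actual input. The sentiment-classification task here is illustrative.

```python
# Two worked examples establish the input/label pattern;
# the last user message is the real input to classify.
messages = [
    {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
    {"role": "user", "content": "Review: The checkout flow was fast and painless."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: Support never answered my ticket."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: Setup took five minutes and everything just worked."},
]
```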

Understanding messages and roles

Every request to the chat completions API is structured as a list of messages. Each message has a role that tells the model who is speaking and a content field with the actual text.
| Role | Purpose |
| --- | --- |
| system | Sets the model's behavior and personality for the entire conversation. |
| user | Your input or question. |
| assistant | The model's previous responses. Include these to give the model memory of the conversation. |
| tool | The result of a tool or function call (see Function Calling & JSON Mode). |
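
To make the tool role concrete, here is a hypothetical message sequence in the OpenAI-compatible format: the assistant requests a function call, and the tool message carries the result back, linked by `tool_call_id`. The ids, function name, and payloads are illustrative; see Function Calling & JSON Mode for the full workflow.

```python
# A tool result re-entering the conversation. The tool_call_id ties the
# tool message to the assistant's earlier call. All values are illustrative.
messages = [
    {"role": "user", "content": "What's the weather in Berlin?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'},
        }],
    },
    {"role": "tool", "tool_call_id": "call_abc123", "content": '{"temp_c": 18, "sky": "clear"}'},
]
```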

Multi-turn conversations

Language models don't have built-in memory. Each API call is independent. To create the experience of a continuous conversation, you pass the full message history with every request.
```python
completion = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[
        {"role": "user", "content": "Hi! My name is Peter and I'm 31. What is 1+1?"},
        {"role": "assistant", "content": "Nice to meet you, Peter. 1 + 1 equals 2."},
        {"role": "user", "content": "What is my age?"},
    ],
)

print(completion.choices[0].message.content)
```

Expected output:

```text
You told me earlier, Peter. You're 31 years old.
```

Because the earlier messages are included, the model can reference Peter's name and age even though it has no memory of the previous exchange.

Things to watch out for in long conversations

  • Context window limits: Every model has a maximum number of tokens it can process in a single request. If your conversation history exceeds this, the request will fail. Check the Model Catalog for each model's context window.
  • No persistent memory: The model only knows what's in the current request. If you drop older messages, the model loses that context.
  • Token costs add up: Every token in the message history counts toward your usage. For long-running conversations, consider summarizing older exchanges to keep costs in check.
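
A simple way to keep history bounded is to trim it before each request. The sketch below keeps the system message plus the most recent turns that fit within a rough character budget; a real implementation would count tokens with the model's tokenizer rather than characters, and might summarize the dropped turns instead of discarding them.

```python
def trim_history(messages, max_chars=4000):
    """Keep the system message plus the most recent turns that fit
    within a rough character budget.

    Characters are a crude stand-in for tokens here; swap in the
    model's tokenizer for accurate budgeting.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    kept, total = [], 0
    # Walk backwards from the newest message, keeping turns until
    # the budget is exhausted.
    for msg in reversed(rest):
        total += len(msg["content"])
        if total > max_chars:
            break
        kept.append(msg)

    return system + list(reversed(kept))
```

Call `trim_history` on your message list before each request so the system message always survives while the oldest turns drop off first.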

Next steps