Features

Text Generation

Text generation is the core capability of MARA Cloud. When you send a prompt to the API, the model processes your input and returns a generated response, whether that's answering a question, writing content, summarizing a document, or completing a conversation.
This guide walks you through the different ways to generate text, how to choose the right model for your task, how to write effective prompts, and how to manage conversations that span multiple turns. If you're new to working with language model APIs, start here. If you've already made your first request through the Quickstart, this guide will help you get more out of the API.

How to generate text

MARA Cloud offers three approaches to text generation, each suited for different use cases.

Standard (non-streaming)

The simplest approach. You send a request and receive the complete response once generation finishes. Best for batch processing or when you don't need real-time output.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://bczfskny6zqw.poweredby.snova.ai/v1",
    api_key="your-mara-api-key",
)

completion = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant."},
        {"role": "user", "content": "Explain what an API is in one paragraph."},
    ],
)

print(completion.choices[0].message.content)
```

Streaming

Instead of waiting for the full response, streaming delivers tokens as they are generated. This is ideal for chat interfaces, real-time applications, or anywhere you want to show output progressively.
```python
completion = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant."},
        {"role": "user", "content": "Explain what an API is in one paragraph."},
    ],
    stream=True,
)

for chunk in completion:
    # Some chunks (such as the final one) carry no content, so guard against None.
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```

Note: Each streamed chunk may contain multiple tokens. Keep this in mind if you're measuring throughput or calculating tokens per second.

Asynchronous

If your application handles multiple requests concurrently or uses non-blocking I/O, the async client lets you run generation without blocking the main thread.
```python
from openai import AsyncOpenAI
import asyncio

async def main():
    client = AsyncOpenAI(
        base_url="https://bczfskny6zqw.poweredby.snova.ai/v1",
        api_key="your-mara-api-key",
    )

    completion = await client.chat.completions.create(
        model="MiniMax-M2.5",
        messages=[
            {"role": "system", "content": "You are a helpful writing assistant."},
            {"role": "user", "content": "Explain what an API is in one paragraph."},
        ],
    )

    print(completion.choices[0].message.content)

asyncio.run(main())
```

Picking the right model

Not all models are created equal. The right choice depends on what you're building.
| What matters most | Go with |
| --- | --- |
| Complex reasoning, nuanced tasks | A larger model like gpt-oss-120B |
| Speed and low latency | A smaller, faster model |
| Cost efficiency | A smaller model that meets your accuracy bar |
| Maximum accuracy | The largest model you can afford |
The best approach is to experiment. Try a few models from the Model Catalog with your actual prompts and evaluate the results before committing.
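
One way to structure that experiment is a small harness that runs the same prompt through each candidate and collects the outputs for side-by-side review. This is a sketch, not part of the MARA API: the `generate` callable is assumed to be a wrapper around the chat completions call shown above, and the model names you pass in would come from the Model Catalog.

```python
def compare_models(generate, models, prompt):
    """Run one prompt against each candidate model and return the
    outputs keyed by model name, for side-by-side review.

    `generate` is assumed to be a thin wrapper around the chat
    completions call shown above: generate(model, prompt) -> str.
    """
    return {model: generate(model, prompt) for model in models}
```

A handful of representative prompts is usually enough to reveal whether a smaller model clears your accuracy bar before you commit.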

Writing effective prompts

The quality of the model's output depends heavily on how you write your prompt. Good prompts are specific, structured, and provide enough context for the model to understand what you need.

Key elements of a strong prompt

  • Persona: Tell the model who it is. "You are a senior backend engineer" produces very different output than a generic prompt.
  • Context: Give background information. The more relevant context you provide, the better the response.
  • Output format: Be explicit about how you want the answer. "Respond in JSON with keys: title, summary, tags" is far more useful than "give me a summary."
  • Task: Clearly state what you want. Vague instructions lead to vague responses.
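
As a sketch, here is how the four elements might come together in a single request. The persona and context live in the system message; the task and output format go in the user message. The scenario (a documentation review) is illustrative.

```python
# Persona and context belong in the system message.
system_prompt = (
    "You are a senior backend engineer reviewing internal documentation. "  # persona
    "The audience is new hires with no prior exposure to our stack."        # context
)

# Task and output format belong in the user message.
user_prompt = (
    "Summarize the onboarding guide below. "                # task
    "Respond in JSON with keys: title, summary, tags."      # output format
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
```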

Techniques that improve results

  • In-context learning: Include one or two examples of the desired output in your prompt. The model picks up on the pattern.
  • Chain-of-Thought: Ask the model to think step-by-step before giving a final answer. This significantly improves reasoning on complex problems.
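
For example, in-context learning can be expressed directly in the message list: a couple of user/assistant pairs establish the pattern, and the final user message is the actual input. The sentiment-classification task here is illustrative.

```python
# Two worked examples establish the input/label pattern;
# the last user message is the real input to classify.
messages = [
    {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
    {"role": "user", "content": "Review: The checkout flow was fast and painless."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: Support never answered my ticket."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: Setup took five minutes and everything just worked."},
]
```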

Understanding messages and roles

Every request to the chat completions API is structured as a list of messages. Each message has a role that tells the model who is speaking and a content field with the actual text.
| Role | Purpose |
| --- | --- |
| system | Sets the model's behavior and personality for the entire conversation. |
| user | Your input or question. |
| assistant | The model's previous responses. Include these to give the model memory of the conversation. |
| tool | The result of a tool or function call (see Function Calling & JSON Mode). |
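
To make the tool role concrete, here is a hypothetical message sequence in the OpenAI-compatible format: the assistant requests a function call, and the tool message carries the result back, linked by `tool_call_id`. The ids, function name, and payloads are illustrative; see Function Calling & JSON Mode for the full workflow.

```python
# A tool result re-entering the conversation. The tool_call_id ties the
# tool message to the assistant's earlier call. All values are illustrative.
messages = [
    {"role": "user", "content": "What's the weather in Berlin?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'},
        }],
    },
    {"role": "tool", "tool_call_id": "call_abc123", "content": '{"temp_c": 18, "sky": "clear"}'},
]
```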

Multi-turn conversations

Language models don't have built-in memory. Each API call is independent. To create the experience of a continuous conversation, you pass the full message history with every request.
```python
completion = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[
        {"role": "user", "content": "Hi! My name is Peter and I'm 31. What is 1+1?"},
        {"role": "assistant", "content": "Nice to meet you, Peter. 1 + 1 equals 2."},
        {"role": "user", "content": "What is my age?"},
    ],
)

print(completion.choices[0].message.content)
```

Expected output:

```text
You told me earlier, Peter. You're 31 years old.
```

Because the earlier messages are included, the model can reference Peter's name and age even though it has no memory of the previous exchange.

Things to watch out for in long conversations

  • Context window limits: Every model has a maximum number of tokens it can process in a single request. If your conversation history exceeds this, the request will fail. Check the Model Catalog for each model's context window.
  • No persistent memory: The model only knows what's in the current request. If you drop older messages, the model loses that context.
  • Token costs add up: Every token in the message history counts toward your usage. For long-running conversations, consider summarizing older exchanges to keep costs in check.
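
A simple way to keep history bounded is to trim it before each request. The sketch below keeps the system message plus the most recent turns that fit within a rough character budget; a real implementation would count tokens with the model's tokenizer rather than characters, and might summarize the dropped turns instead of discarding them.

```python
def trim_history(messages, max_chars=4000):
    """Keep the system message plus the most recent turns that fit
    within a rough character budget.

    Characters are a crude stand-in for tokens here; swap in the
    model's tokenizer for accurate budgeting.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    kept, total = [], 0
    # Walk backwards from the newest message, keeping turns until
    # the budget is exhausted.
    for msg in reversed(rest):
        total += len(msg["content"])
        if total > max_chars:
            break
        kept.append(msg)

    return system + list(reversed(kept))
```

Call `trim_history` on your message list before each request so the system message always survives while the oldest turns drop off first.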

Next steps