One backend with six things AI applications actually need — inference, vector, queues, state, gateway, and governance. OpenAI-compatible where it can be, standards-based where it can't. One auth boundary, one invoice, one region. Below: what's in each, what the code looks like, what's live.
Managed vLLM endpoints for the leading open-source models. Drop-in replacement for the OpenAI SDK — chat, embeddings, function calling, streaming, batch, structured outputs. Pay per token, billed in local currency. No GPU management, no quotas, no waitlists.
/v1/chat/completions, /v1/embeddings, /v1/completions. Function calling, JSON mode, streaming, batch endpoints.

from openai import OpenAI

client = OpenAI(
    api_key="mnr_...",
    base_url="https://infer.manara.ai/v1",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "Reply in Arabic."},
        {"role": "user", "content": "What is RAG?"},
    ],
    stream=True,
)

for chunk in response:
    # delta.content is None on role-only and final chunks, so fall back to ""
    print(chunk.choices[0].delta.content or "", end="")

# ~180ms TTFT · ~30ms p50 · billed in local currency
# JSON mode + tool use, same SDK
response = client.chat.completions.create(
    model="qwen-2.5-72b",
    messages=[...],
    tools=[{
        "type": "function",
        "function": {
            "name": "search_invoices",
            "parameters": {...},
        },
    }],
    response_format={"type": "json_object"},
)
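Once the model replies with a tool call, your application still has to run it. A minimal sketch, assuming the response follows the OpenAI chat-completions shape, where each tool call carries a function name and a JSON-encoded arguments string. The `search_invoices` body and its `customer_id` parameter are hypothetical stand-ins, since the real schema is elided above.

```python
import json

def search_invoices(customer_id: str) -> list[dict]:
    # Hypothetical local implementation; a real one would query your database.
    return [{"id": "INV-1", "customer_id": customer_id}]

# Map the tool names the model may emit to local callables.
TOOLS = {"search_invoices": search_invoices}

def dispatch_tool_call(name: str, arguments_json: str):
    """Run one tool call: look up the function, decode its JSON arguments, call it."""
    if name not in TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    kwargs = json.loads(arguments_json)  # arguments arrive as a JSON string
    return TOOLS[name](**kwargs)

# In real use, name and arguments would come from
# response.choices[0].message.tool_calls[i].function
result = dispatch_tool_call("search_invoices", '{"customer_id": "C-42"}')
print(result)
```

Rejecting unknown tool names explicitly keeps a hallucinated function name from turning into an arbitrary attribute lookup.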
Managed pgvector with hybrid search. Embeddings co-located with inference — your retrieval, rerank, and model call hit the same region with single-digit-millisecond internal latency. Built on Postgres, so you query vectors and structured data in one SQL statement.
/v1/embeddings. Per-token billing, same invoice.

# 1. embed
emb = client.embeddings.create(
    model="bge-large",
    input="PDPL data residency rules",
).data[0].embedding

# 2. retrieve: same region, single-digit ms
docs = manara.vector.search(
    index="compliance-kb",
    query_vector=emb,
    hybrid=True,
    rerank="bge-reranker-v2",
    top_k=8,
)

# 3. generate: co-located inference
answer = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": f"Context: {docs}"},
        {"role": "user", "content": q},
    ],
)

# End-to-end RAG. Never leaves the region.
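Hybrid search means merging a keyword ranking with a vector ranking into one result list. This page does not say how Manara fuses the two, so the sketch below uses reciprocal rank fusion, a common technique, purely as an illustration. The function name, the document IDs, and the constant `k = 60` are assumptions, not Manara's API.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["doc-a", "doc-b", "doc-c"]  # e.g. from full-text search
vector_hits = ["doc-b", "doc-d", "doc-a"]   # e.g. from nearest-neighbour search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear near the top of both lists rise above documents that rank well in only one, which is the behaviour hybrid retrieval is after.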
Long-running tasks, scheduled runs, retries, fan-out, dead-letter queues. The plumbing every AI agent eventually needs — typed step functions that resume after failure and don't lose state when the model times out.
import { defineAgent, step } from "@manara/agents";

export const processInvoice = defineAgent({
  name: "process-invoice",
  async run({ invoiceId }) {
    // each step is durable + retryable
    const pdf = await step("fetch-pdf", () =>
      storage.get(`invoices/${invoiceId}.pdf`));
    const fields = await step("extract", () =>
      llm.extract(pdf, { schema: InvoiceSchema }));
    const approval = await step("wait-approval", {
      timeout: "7d", // suspend, resume later
    });
    await step("book", () => ledger.write(fields));
  },
});
// resumes from any step on failure
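The idea behind "resumes from any step" can be sketched in a few lines: checkpoint each step's result under its name, and on replay skip any step that already has one. This is an illustration of the general technique, not the `@manara/agents` implementation. `DurableRun` is a hypothetical class, and the in-memory dict stands in for a durable store.

```python
from typing import Any, Callable

class DurableRun:
    """Replays a workflow, skipping steps whose results were already checkpointed."""

    def __init__(self, checkpoints: dict[str, Any]):
        self.checkpoints = checkpoints  # stand-in for a durable store

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.checkpoints:
            return self.checkpoints[name]  # completed earlier: do not re-run
        result = fn()                      # may raise; nothing is saved then
        self.checkpoints[name] = result
        return result

calls: list[str] = []
attempts = {"extract": 0}

def fetch() -> str:
    calls.append("fetch")
    return "pdf-bytes"

def flaky_extract(pdf: str) -> dict:
    calls.append("extract")
    attempts["extract"] += 1
    if attempts["extract"] == 1:
        raise TimeoutError("model timed out")  # simulated first-attempt failure
    return {"source": pdf, "total": 100}

def workflow(run: DurableRun) -> dict:
    pdf = run.step("fetch-pdf", fetch)
    return run.step("extract", lambda: flaky_extract(pdf))

store: dict[str, Any] = {}
try:
    workflow(DurableRun(store))       # first attempt fails at "extract"
except TimeoutError:
    pass
fields = workflow(DurableRun(store))  # retry: "fetch-pdf" is not repeated
print(calls, fields)
```

The retry re-enters the workflow from the top, but only the failed step does real work again.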
Managed Postgres with branching, point-in-time recovery, read replicas, and scale-to-zero. Object storage with an S3-compatible API and in-region durability. The boring parts — done right, in the same region as your model.
# Postgres connection: drop-in for any client
DATABASE_URL="postgres://[email protected]:5432/app"

-- vector + relational in one query
SELECT i.id, i.customer_id, i.total,
       (i.notes_emb <=> $1) AS distance
FROM invoices i
WHERE i.region = 'ksa'
ORDER BY distance
LIMIT 10;

# branching for previews
$ manara db branch create --from main --name pr-247
✓ branched in 2.1s
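In pgvector, the `<=>` operator in the query above is cosine distance: one minus the cosine similarity of the two vectors, so smaller means closer. A small sketch of the same ordering in plain Python, with made-up two-dimensional vectors and row IDs standing in for real embeddings:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 for identical directions, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 0.0]
rows = {  # hypothetical invoice IDs mapped to toy embeddings
    "inv-1": [1.0, 0.0],
    "inv-2": [0.0, 1.0],
    "inv-3": [1.0, 1.0],
}

# Equivalent of ORDER BY distance: nearest rows first.
ranked = sorted(rows, key=lambda r: cosine_distance(rows[r], query))
print(ranked)
```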
// S3-compatible: keep your aws-sdk
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({
  endpoint: "https://s3.khi.manara.ai",
  region: "khi",
  credentials: { ... },
});

await s3.send(new PutObjectCommand({
  Bucket: "invoices-2026",
  Key: "INV-2026-0421.pdf",
  Body: buffer,
}));
Route across local Manara models and frontier providers (OpenAI, Anthropic, Google) from a single endpoint. Per-team metering and spend caps. Semantic caching to cut cost. Fallback chains so a regional outage doesn't bring your app down. Content policy at the gateway, not in your app.
# gateway config-as-code
routes:
  - match: { team: "support" }
    primary: "manara/llama-3.3-70b"
    fallback:
      - "manara/qwen-2.5-72b"
      - "openai/gpt-4o-mini"
    cache: { semantic: 0.92, ttl: "24h" }
    budget:
      monthly: "USD 2000"
      on_exceed: "downgrade"
  - match: { team: "compliance" }
    primary: "manara/jais-30b"   # keep PII in-country
    policy: "no-frontier"        # never route off-region
# one key, many models, predictable spend
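What a route like the first one does at request time can be sketched as: check the semantic cache, otherwise walk the fallback chain in order. This is a rough illustration, not the gateway's implementation. It assumes `semantic: 0.92` is a cosine-similarity threshold on prompt embeddings, ignores the TTL and budget settings, and uses hypothetical names (`SemanticCache`, `call_with_fallback`, `route`) and a fake model call.

```python
import math
from typing import Callable, Optional

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """Returns a stored answer when a new prompt embedding is close enough to an old one."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, embedding: list[float]) -> Optional[str]:
        best, best_sim = None, self.threshold
        for stored, answer in self.entries:
            sim = cosine_similarity(stored, embedding)
            if sim >= best_sim:
                best, best_sim = answer, sim
        return best

    def put(self, embedding: list[float], answer: str) -> None:
        self.entries.append((embedding, answer))

def call_with_fallback(models: list[str], call: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order; return the first (model, answer) that succeeds."""
    errors = []
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:  # any provider error moves us down the chain
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))

def route(embedding, models, call, cache) -> tuple[str, str]:
    cached = cache.get(embedding)
    if cached is not None:
        return "cache", cached
    model, answer = call_with_fallback(models, call)
    cache.put(embedding, answer)
    return model, answer

def fake_call(model: str) -> str:
    if model == "manara/llama-3.3-70b":
        raise ConnectionError("regional outage")  # simulated primary failure
    return f"answer from {model}"

chain = ["manara/llama-3.3-70b", "manara/qwen-2.5-72b", "openai/gpt-4o-mini"]
cache = SemanticCache(threshold=0.92)
first = route([1.0, 0.0], chain, fake_call, cache)    # primary down, fallback answers
second = route([0.99, 0.1], chain, fake_call, cache)  # near-duplicate prompt, cache hit
print(first, second)
```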
The governance layer regulated buyers can't ship without — built into the same platform as your inference, vector, and queues. Three capabilities, one boundary: every prompt is traced, every PII field is detected and masked, every call is written to a tamper-evident audit log. None of it leaves the region.
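Two of these capabilities are easy to picture in code. The sketch below masks one kind of PII (email addresses, via a simple regex) and appends each call to a hash-chained log, where every entry's hash covers the previous entry's hash, so editing any past record breaks verification. It illustrates the general technique only. The function names, the record fields, and the choice of SHA-256 are assumptions, not a description of Manara's implementation.

```python
import hashlib
import json
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def mask_pii(text: str) -> str:
    """Replace email addresses with a fixed token before the text is stored."""
    return EMAIL.sub("[EMAIL]", text)

def _digest(prev_hash: str, record: dict) -> str:
    payload = json.dumps(record, sort_keys=True)  # stable serialisation
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_entry(log: list[dict], record: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)})

def verify(log: list[dict]) -> bool:
    """Recompute the chain; any edited or reordered entry makes this False."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or _digest(prev, entry["record"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

audit_log: list[dict] = []
append_entry(audit_log, {"team": "support", "prompt": mask_pii("Refund sara@example.com")})
append_entry(audit_log, {"team": "compliance", "prompt": mask_pii("Summarise policy")})
intact = verify(audit_log)

audit_log[0]["record"]["prompt"] = "edited after the fact"
tampered_ok = verify(audit_log)
print(intact, tampered_ok)
```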
Pay for what you use. Inference per token. Compute, vector, and storage per second, per GB, per region. All invoiced in local currency, no minimum commitments, no extraterritorial metering.
See full pricing →

No card required. One repo, one base_url change, six primitives ready to call.