Platform — The serverless backend for AI apps

01 · Core · Inference

OpenAI-compatible API.
Open-source models.

Managed vLLM endpoints for the leading open-source models. Drop-in compatible with the OpenAI SDK: chat, embeddings, function calling, streaming, batch, and structured outputs. Pay per token, billed in local currency. No GPU management, no quotas, no waitlists.

API surface: /v1/chat/completions, /v1/embeddings, /v1/completions. Function calling, JSON mode, streaming, batch endpoints.
Fine-tuning: LoRA and QLoRA on your data. Adapters served from the same endpoint, billed at base-model rates.
SDKs: openai-python, openai-node, openai-go, Vercel AI SDK, LangChain, LlamaIndex, or anything else that speaks the OpenAI protocol.
Latency: ~30ms p50 in-region. Streaming TTFT typically under 200ms for 70B-class models at 8k context.
SLA: 99.95% on production-tier models. Dedicated capacity available for enterprise workloads.

Model            Context   Price / 1M tokens (USD)   Status
Llama 3.3 70B    128k      in $0.60 / out $0.90      Live
Qwen 2.5 72B     128k      in $0.55 / out $0.85      Live
DeepSeek V3      64k       in $0.30 / out $0.90      Live
Jais 30B         8k        in $0.40 / out $0.60      Live
Mistral Large    128k      in $2.00 / out $6.00      Live
Phi-4            16k       in $0.10 / out $0.20      Live
Your fine-tune   base      base rates                LoRA
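To make the per-token pricing concrete, here is a minimal sketch that estimates the cost of a request from the table above. The dictionary keys are the display names from the table, not confirmed API model ids, and the helper `estimate_cost` is a hypothetical name used only for illustration.

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), copied from the pricing table.
# Keys are the table's display names, not necessarily API model ids.
PRICES_PER_MILLION = {
    "Llama 3.3 70B": (Decimal("0.60"), Decimal("0.90")),
    "Qwen 2.5 72B": (Decimal("0.55"), Decimal("0.85")),
    "DeepSeek V3": (Decimal("0.30"), Decimal("0.90")),
    "Jais 30B": (Decimal("0.40"), Decimal("0.60")),
    "Mistral Large": (Decimal("2.00"), Decimal("6.00")),
    "Phi-4": (Decimal("0.10"), Decimal("0.20")),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    price_in, price_out = PRICES_PER_MILLION[model]
    total = price_in * input_tokens + price_out * output_tokens
    return total / Decimal(1_000_000)


if __name__ == "__main__":
    # A 2,000-token prompt with a 500-token completion on Phi-4.
    print(estimate_cost("Phi-4", 2_000, 500))
```

Decimal is used instead of float so that small per-request amounts add up without rounding drift when summed over many calls.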

Python
inference.py

from openai import OpenAI

client = OpenAI(
    api_key="mnr_...",
    base_url="https://infer.manara.ai/v1",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "Reply in Arabic."},
        {"role": "user",   "content": "What is RAG?"},
    ],
    stream=True,
)

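for chunk in response:
    # Some streamed chunks carry no choices or an empty delta; skip those
    # so that "None" is never printed.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")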

# ~180ms TTFT · ~30ms p50 · billed in local currency
        

Function calling
tools.py

# JSON mode + tool use — same SDK
response = client.chat.completions.create(
    model="qwen-2.5-72b",
    messages=[...],
    tools=[{
        "type": "function",
        "function": {
            "name": "search_invoices",
            "parameters": {...},
        },
    }],
    response_format={"type": "json_object"},
)
        

02 · Memory · Vector + Embeddings

RAG without round trips.

Managed pgvector with hybrid search. Embeddings co-located with inference — your retrieval, rerank, and model call hit the same region with single-digit-millisecond internal latency. Built on Postgres, so you query vectors and structured data in one SQL statement.

Engine: Managed pgvector on Postgres 16. HNSW and IVFFlat indexes. Full SQL surface, so you can JOIN embeddings with your business tables.
Embeddings: BGE-large, E5-mistral, Cohere-compatible endpoints, plus OpenAI-compatible /v1/embeddings. Per-token billing, same invoice.
Hybrid search: BM25 lexical + dense vector + cross-encoder reranking, all server-side, all in-region.
Scale: Tested to 100M vectors per index. Read replicas for retrieval-heavy workloads. Async write paths for ingestion bursts.
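Hybrid search has to merge a lexical ranking and a dense-vector ranking into one list before reranking. The page does not say how Manara fuses them; the sketch below uses reciprocal rank fusion (RRF), one common technique, purely as an illustration. The function name and the constant `k=60` are assumptions, not platform settings.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]   # lexical ranking (illustrative ids)
    dense_hits = ["b", "c", "d"]  # vector ranking (illustrative ids)
    print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

Documents that appear in both lists rise to the top, which is the behaviour a hybrid retriever generally wants before handing candidates to a cross-encoder.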

Python
rag.py

# 1. embed
emb = client.embeddings.create(
    model="bge-large",
    input="PDPL data residency rules",
).data[0].embedding

# 2. retrieve — same region, single-digit ms
docs = manara.vector.search(
    index="compliance-kb",
    query_vector=emb,
    hybrid=True,
    rerank="bge-reranker-v2",
    top_k=8,
)

# 3. generate — co-located inference
answer = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": f"Context: {docs}"},
        {"role": "user",   "content": q},
    ],
)

# End-to-end RAG. Never leaves the region.
        

03 · Work · Queues & Workers

Durable jobs.
Real agents.

Long-running tasks, scheduled runs, retries, fan-out, dead-letter queues. The plumbing every AI agent eventually needs — typed step functions that resume after failure and don't lose state when the model times out.
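The idea of a step that resumes after failure can be sketched in a few lines. This is not the @manara/agents implementation; it is a minimal Python illustration in which an in-memory dict stands in for durable storage, and `DurableRun` and `run_workflow` are hypothetical names.

```python
from typing import Any, Callable, Dict, List


class DurableRun:
    """Runs named steps, saving each result so a rerun can skip finished work."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        # In a real system this would be a database table, not a dict.
        self.checkpoints = checkpoints

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.checkpoints:
            return self.checkpoints[name]  # already done: reuse the result
        result = fn()                      # may raise; nothing is saved then
        self.checkpoints[name] = result
        return result


def run_workflow(run: DurableRun, calls: List[str], fail_extract: bool) -> Any:
    """Two-step workflow; `calls` records which step bodies actually executed."""

    def fetch() -> str:
        calls.append("fetch")
        return "pdf-bytes"

    def extract() -> Dict[str, int]:
        calls.append("extract")
        if fail_extract:
            raise TimeoutError("model timed out")
        return {"total": 100}

    run.step("fetch-pdf", fetch)
    return run.step("extract", extract)
```

If the first run fails in "extract", a second run over the same checkpoint store skips "fetch-pdf" and retries only the failed step.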

Step functions: Typed, durable steps. Each step checkpointed. Failure resumes from the last checkpoint, not the top.
Scheduling: Cron, intervals, delayed dispatch, time-zone-aware. Backed by Postgres, so there is no separate queue infra to babysit.
Retries & DLQ: Exponential backoff, jitter, max-attempt policies. Failed jobs land in a dead-letter queue with full context for replay.
Observability: Per-job traces, per-step timing, per-attempt logs. Replay from any step. Inspect state at every checkpoint.
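The retry policy described above (exponential backoff, jitter, a max-attempt limit, then a dead-letter queue) can be sketched as follows. The defaults and the names `backoff_delay` and `run_with_retries` are assumptions for illustration; the platform's actual policy options are not documented on this page.

```python
import random
import time
from typing import Any, Callable, Dict, List, Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def run_with_retries(
    job: Callable[[Any], Any],
    payload: Any,
    dead_letters: List[Dict[str, Any]],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Run job(payload); on repeated failure, record it in dead_letters."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return job(payload)
        except Exception as exc:  # a sketch: real code would filter error types
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    # Keep enough context to replay the job later.
    dead_letters.append(
        {"payload": payload, "error": repr(last_error), "attempts": max_attempts}
    )
    return None
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting, for example by passing `delays.append` to collect the computed delays.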

TypeScript
agent.ts

import { defineAgent, step } from "@manara/agents";

export const processInvoice = defineAgent({
  name: "process-invoice",
  async run({ invoiceId }) {

    // each step is durable + retryable
    const pdf = await step("fetch-pdf", () =>
      storage.get(`invoices/${invoiceId}.pdf`));

    const fields = await step("extract", () =>
      llm.extract(pdf, { schema: InvoiceSchema }));

    const approval = await step("wait-approval", {
      timeout: "7d",    // suspend, resume later
    });

    await step("book", () => ledger.write(fields));
  },
});

// resumes from any step on failure
        

04 · Data · Postgres + Storage

Serverless Postgres.
S3-compatible storage.

Managed Postgres with branching, point-in-time recovery, read replicas, and scale-to-zero. Object storage with an S3-compatible API and in-region durability. The boring parts — done right, in the same region as your model.

Postgres: Postgres 16 with pgvector, pg_search, PostGIS. Scale-to-zero on inactive databases. Branching for previews.
Recovery: Point-in-time recovery to any second within the retention window. Continuous WAL archival. Snapshot replication available.
Replicas: Read replicas in the same region or across federated regions. Async logical replication for analytics.
Object storage: S3-compatible API. Versioning, lifecycle policies, signed URLs, multipart uploads. In-region by default.

SQL + CLI
db.sql

-- Postgres connection — drop-in for any client
DATABASE_URL="postgres://...@db.manara.ai:5432/app"

-- vector + relational in one query
SELECT
  i.id, i.customer_id, i.total,
  (i.notes_emb <=> $1) AS distance
FROM invoices i
WHERE i.region = 'ksa'
ORDER BY distance
LIMIT 10;

# branching for previews
$ manara db branch create --from main --name pr-247
✓ branched in 2.1s
        

TypeScript
storage.ts

// S3-compatible — keep your aws-sdk
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({
  endpoint: "https://s3.khi.manara.ai",
  region:   "khi",
  credentials: { ... },
});

await s3.send(new PutObjectCommand({
  Bucket: "invoices-2026",
  Key:    "INV-2026-0421.pdf",
  Body:   buffer,
}));
        

05 · Control · AI Gateway

One key.
Many models.

Route across local Manara models and frontier providers (OpenAI, Anthropic, Google) from a single endpoint. Per-team metering and spend caps. Semantic caching to cut cost. Fallback chains so a regional outage doesn't bring your app down. Content policy at the gateway, not in your app.
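A semantic cache returns a stored response when a new prompt's embedding is close enough to one already seen. The sketch below shows that idea with cosine similarity and a linear scan; it is an illustration, not the gateway's implementation. The 0.92 default mirrors the `semantic: 0.92` value in the gateway.yaml sample on this page, the class name is hypothetical, and TTL handling is omitted.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Stores (embedding, response) pairs and serves near-duplicates."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self.entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the most similar cached response at or above the threshold."""
        best: Optional[str] = None
        best_score = self.threshold
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score >= best_score:
                best, best_score = response, score
        return best
```

A higher threshold trades cache hit rate for a lower risk of serving an answer to a question that only looks similar.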

Multi-model: Single API key, many providers. Route by model, team, latency, or cost. Swap providers without redeploying.
Semantic cache: Vector-based response cache with configurable similarity thresholds. Cuts redundant inference cost.
Spend controls: Per-team, per-key, per-model quotas. Hard caps, soft alerts, automatic downgrade ("over 80%? route cheaper").
Fallback chains: Primary → secondary → tertiary, with per-link timeout policies. Resilient against outages.
Content policy: Pre-call input filtering, post-call output filtering. Custom policies per team or use-case.
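A fallback chain tries each provider in order and moves on when a link fails or times out. The following is a minimal client-side sketch of that control flow, assuming each link is a callable that accepts a prompt and a timeout and raises on failure. The function name and the link tuple layout are hypothetical; the gateway performs this routing server-side.

```python
from typing import Callable, List, Sequence, Tuple

# One link: (provider name, callable(prompt, timeout_seconds) -> text, timeout).
Link = Tuple[str, Callable[[str, float], str], float]


def call_with_fallback(chain: Sequence[Link], prompt: str) -> Tuple[str, str]:
    """Return (provider_name, response) from the first link that succeeds."""
    failures: List[str] = []
    for name, call, timeout_s in chain:
        try:
            return name, call(prompt, timeout_s)
        except Exception as exc:  # timeouts and provider errors alike
            failures.append(f"{name}: {exc!r}")
    raise RuntimeError("all links failed: " + "; ".join(failures))


if __name__ == "__main__":
    def primary(prompt: str, timeout_s: float) -> str:
        raise TimeoutError(f"no answer within {timeout_s}s")

    def secondary(prompt: str, timeout_s: float) -> str:
        return f"echo: {prompt}"

    chain = [
        ("manara/llama-3.3-70b", primary, 2.0),
        ("manara/qwen-2.5-72b", secondary, 5.0),
    ]
    print(call_with_fallback(chain, "hello"))
```

Returning the provider name alongside the response lets the caller record which link actually served the request.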

YAML
gateway.yaml

# gateway config-as-code
routes:
  - match: { team: "support" }
    primary: "manara/llama-3.3-70b"
    fallback:
      - "manara/qwen-2.5-72b"
      - "openai/gpt-4o-mini"
    cache: { semantic: 0.92, ttl: "24h" }
    budget:
      monthly: "USD 2000"
      on_exceed: "downgrade"

  - match: { team: "compliance" }
    primary: "manara/jais-30b"     # keep PII in-country
    policy: "no-frontier"           # never route off-region

# one key, many models, predictable spend
        

06 · Trust · Governance

Trace every prompt.
Mask every PII field.
Audit every call.

The governance layer regulated buyers can't ship without — built into the same platform as your inference, vector, and queues. Three capabilities, one boundary: every prompt is traced, every PII field is detected and masked, every call is written to a tamper-evident audit log. None of it leaves the region.

Prompt traces: Full request/response capture with token-level cost, latency, and model attribution. Searchable, filterable, exportable. OpenTelemetry-compatible.
Evaluations: Eval harness with built-in metrics (faithfulness, relevance, toxicity) plus custom Python evals. Run on traffic samples, datasets, or A/B variants.
PII masking: Named entity recognition for 40+ PII types (national IDs, phone numbers, account numbers, health identifiers) in English and Arabic. Mask pre-call and post-retrieval.
Audit log: Tamper-evident, hash-chained log of every privileged action: key creation, model swap, policy change, data access. Exportable for regulators.
RBAC + SSO: Role-based access, SCIM provisioning, SAML/OIDC SSO. Per-region scopes. Approval workflows for production changes.
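A hash-chained log is tamper-evident because each entry's hash covers the previous entry's hash, so editing any record breaks every link after it. The sketch below illustrates the principle with SHA-256 over a JSON body; the entry fields and function names are assumptions, not Manara's actual log schema.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(action: str, actor: str, prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    body = {"action": action, "actor": actor, "prev_hash": prev_hash}
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log: List[Dict[str, str]], action: str, actor: str) -> None:
    """Append an entry whose hash commits to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append(
        {
            "action": action,
            "actor": actor,
            "prev_hash": prev_hash,
            "hash": _digest(action, actor, prev_hash),
        }
    )


def verify_chain(log: List[Dict[str, str]]) -> bool:
    """Recompute every hash and check each link points at its predecessor."""
    prev_hash = GENESIS_HASH
    for entry in log:
        expected = _digest(entry["action"], entry["actor"], entry["prev_hash"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

An auditor who holds only the latest hash can detect any later rewrite of earlier entries by re-running the verification.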

Incoming

"Refund customer 03-12345-X with card 4242-..."

▸ PII mask

"Refund customer [ID] with card [CARD]"
2 entities detected · 0ms overhead

▸ Inference

llama-3.3-70b · Riyadh · 184ms

▸ Trace

trace_id captured · tokens · cost · latency

▸ Audit log

hash-chained entry · tamper-evident

Response

✓ never left region

OpenTelemetry · OIDC / SAML · SCIM · SOC2-ready

Six primitives. One platform.

OpenAI-compatible API.
Open-source models.

RAG without round trips.

Durable jobs.
Real agents.

Serverless Postgres.
S3-compatible storage.

One key.
Many models.

Trace every prompt.
Mask every PII field.
Audit every call.

Per token. Per second. Per region.

Start with the $50 credit. Ship before the meeting.
