Six primitives. One platform.

One backend with six things AI applications actually need — inference, vector, queues, state, gateway, and governance. OpenAI-compatible where it can be, standards-based where it can't. One auth boundary, one invoice, one region. Below: what's in each, what the code looks like, what's live.

01 · Core · Inference

OpenAI-compatible API.
Open-source models.

Managed vLLM endpoints for the leading open-source models. A drop-in replacement for the OpenAI API: keep the OpenAI SDK and change the base_url. Covers chat, embeddings, function calling, streaming, batch, and structured outputs. Pay per token, billed in local currency. No GPU management, no quotas, no waitlists.

  • API surface: /v1/chat/completions, /v1/embeddings, /v1/completions. Function calling, JSON mode, streaming, batch endpoints.
  • Fine-tuning: LoRA and QLoRA on your data. Adapters served from the same endpoint, billed at base-model rates.
  • SDKs: openai-python, openai-node, openai-go, Vercel AI SDK, LangChain, LlamaIndex, or anything else that speaks the OpenAI protocol.
  • Latency: ~30ms p50 in-region. Streaming TTFT typically under 200ms for 70B-class models at 8k context.
  • SLA: 99.95% on production-tier models. Dedicated capacity available for enterprise workloads.
Model            Context   Price / 1M tokens         Status
Llama 3.3 70B    128k      in $0.60 / out $0.90      Live
Qwen 2.5 72B     128k      in $0.55 / out $0.85      Live
DeepSeek V3      64k       in $0.30 / out $0.90      Live
Jais 30B         8k        in $0.40 / out $0.60      Live
Mistral Large    128k      in $2.00 / out $6.00      Live
Phi-4            16k       in $0.10 / out $0.20      Live
Your fine-tune   base      base rates                LoRA
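
To make the per-token pricing concrete, here is a minimal sketch that estimates the cost of one request from the rates listed above. The `PRICES` dict and the `request_cost` helper are illustrative names, not part of any Manara SDK; only the rates and model ids come from this page.

```python
# Illustrative helper, not part of any SDK.
# Rates are USD per 1M tokens, copied from the pricing list as (input, output).
PRICES = {
    "llama-3.3-70b": (0.60, 0.90),
    "qwen-2.5-72b": (0.55, 0.85),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Example: 2,000 prompt tokens and 500 completion tokens on Llama 3.3 70B.
cost = request_cost("llama-3.3-70b", 2_000, 500)
print(f"${cost:.5f}")
```

Input and output tokens are priced separately, so long completions can dominate the bill even when prompts are short.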
Python
inference.py
from openai import OpenAI

client = OpenAI(
    api_key="mnr_...",
    base_url="https://infer.manara.ai/v1",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "Reply in Arabic."},
        {"role": "user",   "content": "What is RAG?"},
    ],
    stream=True,
)

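for chunk in response:
    # some chunks carry no choices or an empty delta; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")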

# ~180ms TTFT · ~30ms p50 · billed in local currency
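
If you want to check the time-to-first-token figure against your own traffic, one rough approach is to time the first chunk of a stream. The sketch below is an assumption-laden illustration: `measure_ttft` and `fake_stream` are hypothetical names, and the fake generator stands in for a real streaming response so the example runs without network access.

```python
import time
from typing import Iterable, Optional, Tuple


def measure_ttft(chunks: Iterable[str]) -> Tuple[Optional[float], str]:
    """Consume a stream of text chunks; return (seconds to first chunk, full text)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for text in chunks:
        if ttft is None:
            # The first chunk has arrived: record the elapsed time once.
            ttft = time.perf_counter() - start
        parts.append(text)
    return ttft, "".join(parts)


def fake_stream():
    # Hypothetical stand-in for a streaming response; the sleep simulates latency.
    time.sleep(0.01)
    yield "Hello"
    yield ", world"


ttft, text = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, text: {text!r}")
```

With a real client you would pass a generator that yields each non-empty `delta.content` from the streamed response.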
Function calling
tools.py
# JSON mode + tool use — same SDK
response = client.chat.completions.create(
    model="qwen-2.5-72b",
    messages=[...],
    tools=[{
        "type": "function",
        "function": {
            "name": "search_invoices",
            "parameters": {...},
        },
    }],
    response_format={"type": "json_object"},
)
02 · Memory · Vector + Embeddings

RAG without round trips.

Managed pgvector with hybrid search. Embeddings co-located with inference — your retrieval, rerank, and model call hit the same region with single-digit-millisecond internal latency. Built on Postgres, so you query vectors and structured data in one SQL statement.

  • Engine: Managed pgvector on Postgres 16. HNSW and IVFFlat indexes. Full SQL surface, so you can JOIN embeddings with your business tables.
  • Embeddings: BGE-large, E5-mistral, Cohere-compatible endpoints, plus OpenAI-compatible /v1/embeddings. Per-token billing, same invoice.
  • Hybrid search: BM25 lexical + dense vector + cross-encoder reranking, all server-side, all in-region.
  • Scale: Tested to 100M vectors per index. Read replicas for retrieval-heavy workloads. Async write paths for ingestion bursts.
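
This page does not say how the lexical and dense rankings are merged before reranking. One common technique is reciprocal rank fusion (RRF), sketched below as an illustration of the idea, not as a description of Manara's implementation. The function name, the document ids, and the constant `k = 60` are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids; each hit contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a BM25 query and a dense-vector query.
bm25_hits = ["doc-a", "doc-b", "doc-c"]
dense_hits = ["doc-b", "doc-d", "doc-a"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that appear in both lists rise to the top, which is the behavior hybrid search is after; a cross-encoder reranker would then rescore the fused candidates.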
Python
rag.py
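# Reuses `client` from inference.py; `manara` is assumed to be an
# initialized platform SDK client.
q = "PDPL data residency rules"

# 1. embed
emb = client.embeddings.create(
    model="bge-large",
    input=q,
).data[0].embedding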

# 2. retrieve — same region, single-digit ms
docs = manara.vector.search(
    index="compliance-kb",
    query_vector=emb,
    hybrid=True,
    rerank="bge-reranker-v2",
    top_k=8,
)

# 3. generate — co-located inference
answer = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": f"Context: {docs}"},
        {"role": "user",   "content": q},
    ],
)

# End-to-end RAG. Never leaves the region.
03 · Work · Queues & Workers

Durable jobs.
Real agents.

Long-running tasks, scheduled runs, retries, fan-out, dead-letter queues. The plumbing every AI agent eventually needs — typed step functions that resume after failure and don't lose state when the model times out.

  • Step functions: Typed, durable steps. Each step checkpointed. Failure resumes from the last checkpoint, not the top.
  • Scheduling: Cron, intervals, delayed dispatch, time-zone-aware. Backed by Postgres, so there is no separate queue infra to babysit.
  • Retries & DLQ: Exponential backoff, jitter, max-attempt policies. Failed jobs land in a dead-letter queue with full context for replay.
  • Observability: Per-job traces, per-step timing, per-attempt logs. Replay from any step. Inspect state at every checkpoint.
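
The retry policy described above (exponential backoff, jitter, a max-attempt limit, then a dead-letter queue) can be sketched in a few lines. This is a generic illustration using the "full jitter" variant, not the platform's implementation; `run_with_retries`, its parameters, and the in-memory `dead_letter` list are hypothetical.

```python
import random
import time
from typing import Any, Callable, List, Optional


def run_with_retries(
    job: Any,
    handler: Callable[[Any], Any],
    dead_letter: List[dict],
    max_attempts: int = 3,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> Any:
    """Run handler(job), retrying with exponential backoff and full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return handler(job)
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Out of attempts: park the job with context for later replay.
                dead_letter.append(
                    {"job": job, "error": repr(exc), "attempts": max_attempts}
                )
                return None
            # Wait a random time in [0, min(cap, base * 2^attempt)].
            sleep(rng.uniform(0, min(cap, base * 2 ** attempt)))


def always_fails(job):
    raise TimeoutError("model timed out")


dlq: List[dict] = []
outcome = run_with_retries("invoice-1", always_fails, dlq, sleep=lambda s: None)
print(outcome, dlq)
```

The `sleep` parameter is injectable so the example runs instantly; in production you would leave the default.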
TypeScript
agent.ts
import { defineAgent, step } from "@manara/agents";

export const processInvoice = defineAgent({
  name: "process-invoice",
  async run({ invoiceId }) {

    // each step is durable + retryable
    const pdf = await step("fetch-pdf", () =>
      storage.get(`invoices/${invoiceId}.pdf`));

    const fields = await step("extract", () =>
      llm.extract(pdf, { schema: InvoiceSchema }));

    const approval = await step("wait-approval", {
      timeout: "7d",    // suspend, resume later
    });

    await step("book", () => ledger.write(fields));
  },
});

// resumes from any step on failure
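
To show what "resumes from the last checkpoint" means in practice, here is a toy Python model of durable steps. It assumes nothing about how `@manara/agents` works internally: `DurableRun` and the in-memory `store` dict are hypothetical stand-ins for a runner backed by a durable store.

```python
from typing import Any, Callable, Dict, List


class DurableRun:
    """Toy durable runner: each finished step's result is checkpointed by name."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        self.checkpoints = checkpoints  # stands in for a durable store

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.checkpoints:
            # Finished in an earlier attempt: reuse the saved result.
            return self.checkpoints[name]
        result = fn()  # may raise, in which case nothing is saved
        self.checkpoints[name] = result
        return result


calls: List[str] = []
store: Dict[str, Any] = {}


def fetch() -> str:
    calls.append("fetch")
    return "pdf-bytes"


def extract(pdf: str) -> dict:
    calls.append("extract")
    if calls.count("extract") == 1:
        raise TimeoutError("model timed out")  # fail on the first attempt only
    return {"source": pdf, "total": 100}


def workflow(run: DurableRun) -> dict:
    pdf = run.step("fetch-pdf", fetch)
    return run.step("extract", lambda: extract(pdf))


try:
    workflow(DurableRun(store))  # first attempt fails inside "extract"
except TimeoutError:
    pass

result = workflow(DurableRun(store))  # second attempt skips "fetch-pdf"
print(calls, result)
```

The second run never calls `fetch` again because its result was checkpointed, which is the property that keeps long agent runs from repeating expensive work.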
04 · Data · Postgres + Storage

Serverless Postgres.
S3-compatible storage.

Managed Postgres with branching, point-in-time recovery, read replicas, and scale-to-zero. Object storage with an S3-compatible API and in-region durability. The boring parts — done right, in the same region as your model.

  • Postgres: Postgres 16 with pgvector, pg_search, PostGIS. Scale-to-zero on inactive databases. Branching for previews.
  • Recovery: Point-in-time recovery to any second within the retention window. Continuous WAL archival. Snapshot replication available.
  • Replicas: Read replicas in the same region or across federated regions. Async logical replication for analytics.
  • Object storage: S3-compatible API. Versioning, lifecycle policies, signed URLs, multipart uploads. In-region by default.
SQL + CLI
db.sql
-- Postgres connection — drop-in for any client
DATABASE_URL="postgres://<user>@<host>:5432/app"

-- vector + relational in one query
SELECT
  i.id, i.customer_id, i.total,
  (i.notes_emb <=> $1) AS distance
FROM invoices i
WHERE i.region = 'ksa'
ORDER BY distance
LIMIT 10;

# branching for previews
$ manara db branch create --from main --name pr-247
✓ branched in 2.1s
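
The `<=>` operator in the query above is pgvector's cosine-distance operator. As a reference for what the `distance` column contains, here is a minimal pure-Python sketch of the same quantity; the function name and the sample rows are illustrative only.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity, the quantity pgvector's <=> computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical rows: (id, embedding). Smaller distance means more similar.
query = [1.0, 0.0]
rows = [("inv-1", [0.0, 1.0]), ("inv-2", [1.0, 0.0]), ("inv-3", [-1.0, 0.0])]

ranked = sorted(rows, key=lambda row: cosine_distance(row[1], query))
print([row_id for row_id, _ in ranked])
```

Distances range from 0 (same direction) to 2 (opposite direction), so `ORDER BY distance` returns the closest matches first.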
TypeScript
storage.ts
// S3-compatible — keep your aws-sdk
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({
  endpoint: "https://s3.khi.manara.ai",
  region:   "khi",
  credentials: { ... },
});

await s3.send(new PutObjectCommand({
  Bucket: "invoices-2026",
  Key:    "INV-2026-0421.pdf",
  Body:   buffer,
}));
05 · Control · AI Gateway

One key.
Many models.

Route across local Manara models and frontier providers (OpenAI, Anthropic, Google) from a single endpoint. Per-team metering and spend caps. Semantic caching to cut cost. Fallback chains so a regional outage doesn't bring your app down. Content policy at the gateway, not in your app.

  • Multi-model: Single API key, many providers. Route by model, team, latency, or cost. Swap providers without redeploying.
  • Semantic cache: Vector-based response cache with configurable similarity thresholds. Cuts redundant inference cost.
  • Spend controls: Per-team, per-key, per-model quotas. Hard caps, soft alerts, automatic downgrade ("over 80%? route cheaper").
  • Fallback chains: Primary → secondary → tertiary, with per-link timeout policies. Resilient against outages.
  • Content policy: Pre-call input filtering, post-call output filtering. Custom policies per team or use-case.
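
A fallback chain boils down to trying providers in order and returning the first success. The sketch below shows that control flow only; it leaves out per-link timeouts and is not the gateway's implementation. `call_with_fallback` and the fake provider functions are hypothetical, while the model names are taken from this page.

```python
from typing import Callable, List, Tuple

Provider = Tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, chain: List[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, response) on first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Record the failure and move on to the next link in the chain.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary_down(prompt: str) -> str:
    raise ConnectionError("regional outage")


def secondary_ok(prompt: str) -> str:
    return f"answer to: {prompt}"


used, answer = call_with_fallback(
    "What is RAG?",
    [("manara/llama-3.3-70b", primary_down), ("manara/qwen-2.5-72b", secondary_ok)],
)
print(used, answer)
```

Keeping this logic at the gateway means application code sees a single call that either succeeds or fails once every link has been tried.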
YAML
gateway.yaml
# gateway config-as-code
routes:
  - match: { team: "support" }
    primary: "manara/llama-3.3-70b"
    fallback:
      - "manara/qwen-2.5-72b"
      - "openai/gpt-4o-mini"
    cache: { semantic: 0.92, ttl: "24h" }
    budget:
      monthly: "USD 2000"
      on_exceed: "downgrade"

  - match: { team: "compliance" }
    primary: "manara/jais-30b"     # keep PII in-country
    policy: "no-frontier"           # never route off-region

# one key, many models, predictable spend
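
The `cache: { semantic: 0.92 }` setting above reads like a similarity threshold. Assuming it is a cosine-similarity cutoff (the page does not say), a semantic cache lookup could work roughly as follows. The `SemanticCache` class is a hypothetical sketch, and TTL handling is omitted.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


class SemanticCache:
    """Return a cached response when a new prompt embedding is close enough."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Sequence[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self.entries.append((embedding, response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            # Keep the closest entry that clears the threshold.
            if score >= self.threshold and score > best_score:
                best_score, best_response = score, response
        return best_response


cache = SemanticCache(threshold=0.92)
cache.put([1.0, 0.0], "cached answer")

hit = cache.get([0.99, 0.05])   # nearly the same direction as the cached prompt
miss = cache.get([0.6, 0.8])    # similarity 0.6, below the threshold
print(hit, miss)
```

A higher threshold trades cache hit rate for a lower risk of returning an answer to a subtly different question.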
06 · Trust · Governance

Trace every prompt.
Mask every PII field.
Audit every call.

The governance layer regulated buyers can't ship without — built into the same platform as your inference, vector, and queues. Three capabilities, one boundary: every prompt is traced, every PII field is detected and masked, every call is written to a tamper-evident audit log. None of it leaves the region.

  • Prompt traces: Full request/response capture with token-level cost, latency, and model attribution. Searchable, filterable, exportable. OpenTelemetry-compatible.
  • Evaluations: Eval harness with built-in metrics (faithfulness, relevance, toxicity) plus custom Python evals. Run on traffic samples, datasets, or A/B variants.
  • PII masking: Named entity recognition for 40+ PII types (national IDs, phone numbers, account numbers, health identifiers) in English and Arabic. Mask pre-call and post-retrieval.
  • Audit log: Tamper-evident, hash-chained log of every privileged action: key creation, model swap, policy change, data access. Exportable for regulators.
  • RBAC + SSO: Role-based access, SCIM provisioning, SAML/OIDC SSO. Per-region scopes. Approval workflows for production changes.
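
A hash-chained log is tamper-evident because each entry's hash covers the previous entry's hash, so editing any record breaks every link after it. The following is a minimal sketch of that idea using SHA-256; the field names, `append_entry`, and `verify` are illustrative assumptions, not the platform's log format.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(action: str, actor: str, prev: str) -> str:
    body = {"action": action, "actor": actor, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: List[Dict[str, str]], action: str, actor: str) -> None:
    """Append an entry whose hash covers its content and the previous hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append(
        {"action": action, "actor": actor, "prev": prev,
         "hash": _digest(action, actor, prev)}
    )


def verify(log: List[Dict[str, str]]) -> bool:
    """Recompute every hash and check that each entry links to the one before."""
    prev = GENESIS
    for entry in log:
        expected = _digest(entry["action"], entry["actor"], entry["prev"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


log: List[Dict[str, str]] = []
append_entry(log, "key.create", "alice")
append_entry(log, "policy.change", "bob")
intact = verify(log)

log[0]["actor"] = "mallory"  # simulate tampering with an old entry
tampered_ok = verify(log)
print(intact, tampered_ok)
```

An auditor only needs the log itself to run this check, which is what makes an exported chain useful to a regulator.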
Incoming
"Refund customer 03-12345-X with card 4242-..."
▸ PII mask
"Refund customer [ID] with card [CARD]"
2 entities detected · 0ms overhead
▸ Inference
llama-3.3-70b · Riyadh · 184ms
▸ Trace
trace_id captured · tokens · cost · latency
▸ Audit log
hash-chained entry · tamper-evident
Response
✓ never left region
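
The masking step in the flow above can be approximated with pattern matching. The platform describes named entity recognition; the sketch below swaps in plain regular expressions to show the input/output shape only. The two patterns, the `mask_pii` helper, and the full card number in the sample (the page truncates its own) are hypothetical.

```python
import re
from typing import Tuple

# Hypothetical patterns: a 16-digit card number and an ID shaped like NN-NNNNN-X.
PATTERNS = [
    ("CARD", re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")),
    ("ID", re.compile(r"\b\d{2}-\d{5}-[A-Z]\b")),
]


def mask_pii(text: str) -> Tuple[str, int]:
    """Replace each match with a [LABEL] placeholder; return (masked text, count)."""
    total = 0
    for label, pattern in PATTERNS:
        text, replaced = pattern.subn(f"[{label}]", text)
        total += replaced
    return text, total


masked, found = mask_pii(
    "Refund customer 03-12345-X with card 4242-4242-4242-4242"
)
print(masked, found)
```

Masking before the model call means the raw identifiers never appear in prompts, traces, or cached responses.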
OpenTelemetry · OIDC / SAML · SCIM · SOC2-ready
Pricing

Per token. Per second. Per region.

Pay for what you use. Inference per token. Compute, vector, and storage per second, per GB, per region. All invoiced in local currency, no minimum commitments, no extraterritorial metering.

See full pricing →
Inference · 70B: $0.60 per 1M input tokens
Vector storage: $0.30 per GB / month
Postgres: $0.04 per compute-hour

Start with the $50 credit. Ship before the meeting.

No card required. One repo, one base_url change, six primitives ready to call.