Platform — VertexStudio

The runtime

One operating layer, five jobs done right

Most teams stitch a model gateway, an agent framework, an eval harness, a guardrails layer, and a metrics stack into something brittle. The VertexStudio Runtime collapses those into a single control plane behind one API: route every request to the right model, run agents that actually finish, gate every deploy on evals, enforce guardrails on every request, and watch token-level cost and latency in real time.

router_config.yaml

# VertexStudio Unified Router v3
router:
  strategy: adaptive_latency
  nodes:
    - id: edge-npu-cluster
      type: edge
      latency_p99: 6ms
      cost_per_token: 0.00001
    - id: on-prem-h100
      type: gpu_cluster
      latency_p99: 42ms
      cost_per_token: 0.00008
    - id: cloud-burst
      type: cloud
      latency_p99: 180ms
      cost_per_token: 0.00020

  rules:
    - if: latency_sla < 10ms
      route_to: edge-npu-cluster
    - if: batch_size > 32
      route_to: on-prem-h100
    - default: cloud-burst

# ✓ Routing 2.4M req/day | Avg cost: $0.00003
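
To make the rule order in router_config.yaml concrete, here is a minimal Python sketch of first-match routing. It is not the Runtime's implementation: the `Request` fields, the `route` function, and the assumption that rules are checked top to bottom with the first match winning are all illustrative. Only the node ids and thresholds come from the config above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    # Hypothetical request fields mirroring `latency_sla` and `batch_size` in the config.
    latency_sla_ms: Optional[float] = None
    batch_size: int = 1


# Ordered (predicate, node id) pairs mirroring the `rules:` list.
RULES: list[tuple[Callable[[Request], bool], str]] = [
    (lambda r: r.latency_sla_ms is not None and r.latency_sla_ms < 10, "edge-npu-cluster"),
    (lambda r: r.batch_size > 32, "on-prem-h100"),
]
DEFAULT_NODE = "cloud-burst"


def route(request: Request) -> str:
    """Return the node id of the first matching rule, else the default node."""
    for predicate, node in RULES:
        if predicate(request):
            return node
    return DEFAULT_NODE
```

Under this reading, a request that has both a tight latency SLA and a large batch goes to the edge cluster, because the latency rule is listed first.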

agent_workflow.py

# VertexStudio Agent Orchestrator
from vertexstudio import AgentGraph, Memory
# Assumes `llm` (model client) and `tools` (tool registry) are configured elsewhere.

graph = AgentGraph("research_agent")

@graph.node
async def planner(state):
    plan = await llm.plan(state.task)
    return {"steps": plan.steps}

@graph.node
async def executor(state):
    results = []
    for step in state.steps:
        result = await tools[step.tool](
            step.args, memory=Memory.get()
        )
        results.append(result)
    return {"results": results}

@graph.node
async def synthesizer(state):
    return await llm.synthesize(state.results)

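# Wire the nodes in order. A chained arrow is not valid Python,
# so pairwise edge() calls are assumed here.
graph.edge("planner", "executor")
graph.edge("executor", "synthesizer")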
# ✓ 25 max steps | Persistent memory | Auto-retry
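
The planner, executor, synthesizer loop can also be sketched without the SDK. The version below uses only `asyncio`; the stand-in functions, the retry budget of 2, and the way the step cap is applied are assumptions for illustration, since the page only states "25 max steps" and "Auto-retry".

```python
import asyncio

MAX_STEPS = 25    # mirrors the "25 max steps" note
MAX_RETRIES = 2   # assumed retry budget; the page only says "Auto-retry"


async def planner(task: str) -> list[dict]:
    # Stand-in for llm.plan(): returns a fixed two-step plan.
    return [{"tool": "search", "args": task}, {"tool": "summarize", "args": task}]


async def call_tool(step: dict) -> str:
    # Stand-in for a real tool call.
    return f"{step['tool']}({step['args']})"


async def executor(steps: list[dict]) -> list[str]:
    results = []
    for step in steps[:MAX_STEPS]:                 # cap the number of executed steps
        for attempt in range(MAX_RETRIES + 1):
            try:
                results.append(await call_tool(step))
                break                              # step succeeded, move on
            except Exception:
                if attempt == MAX_RETRIES:
                    raise                          # retry budget exhausted
    return results


async def synthesizer(results: list[str]) -> str:
    # Stand-in for llm.synthesize(): joins the tool outputs.
    return " -> ".join(results)


async def run(task: str) -> str:
    return await synthesizer(await executor(await planner(task)))
```

Calling `asyncio.run(run("some task"))` walks the three stages in order and returns the synthesized string.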

observability.yaml

# VertexStudio Observability Stack
telemetry:
  traces:
    backend: opentelemetry
    sampling_rate: 1.0
    token_level: true
    agent_step_tracing: true

  metrics:
    backend: prometheus
    dashboards: grafana
    alerts:
      - name: latency_spike
        threshold: p99 > 50ms
        action: page_on_call
      - name: cost_overrun
        threshold: hourly_tokens > 10M
        action: auto_throttle

  cost_tracking:
    per_team: true
    per_model: true
    anomaly_detection: ml_based

# ✓ 99.97% uptime | <5min MTTR
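
As a rough illustration of how the two alert thresholds in observability.yaml might be evaluated, the sketch below computes a nearest-rank p99 and compares it and the hourly token count against the configured limits. The function names and the nearest-rank percentile method are assumptions; a Prometheus-backed deployment would normally compute percentiles from histograms instead.

```python
import math


def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(99 * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]


def evaluate_alerts(latencies_ms: list[float], hourly_tokens: int) -> list[str]:
    """Return the actions whose thresholds are exceeded."""
    actions = []
    if p99(latencies_ms) > 50:            # latency_spike: p99 > 50ms
        actions.append("page_on_call")
    if hourly_tokens > 10_000_000:        # cost_overrun: hourly_tokens > 10M
        actions.append("auto_throttle")
    return actions
```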

.vertexstudio-ci.yaml

# VertexStudio ML CI/CD Pipeline
pipeline:
  trigger: push
  stages:

    - name: train
      runner: h100-cluster
      script: python train.py
      artifacts: model_checkpoint

    - name: evaluate
      gates:
        - metric: accuracy > 0.94
        - metric: latency_p99 < 10ms
        - metric: regression_delta < 1%

    - name: canary_deploy
      traffic_split: 5%
      duration: 30min
      auto_promote: on_success

    - name: production
      strategy: rolling
      zero_downtime: true
# ✓ Avg deploy time: 8min | 0 regressions
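
The evaluate stage boils down to a set of threshold checks that must all pass before canary deploy. A minimal sketch, assuming metrics arrive as a plain dict and that a missing metric should fail the gate (both are assumptions, as are the key names `latency_p99_ms` and `regression_delta_pct`):

```python
# Gate predicates mirroring the `gates:` list in .vertexstudio-ci.yaml.
GATES = {
    "accuracy": lambda v: v > 0.94,
    "latency_p99_ms": lambda v: v < 10,
    "regression_delta_pct": lambda v: v < 1,
}


def failed_gates(metrics: dict) -> list[str]:
    """Return the names of gates that fail; a missing metric counts as a failure."""
    return [
        name
        for name, passes in GATES.items()
        if name not in metrics or not passes(metrics[name])
    ]
```

An empty result means the build may proceed; any returned name blocks promotion and says which gate to look at.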

guardrails.py

# VertexStudio Guardrails Engine
from vertexstudio.guardrails import (
    Pipeline, PII_Detector, ContentFilter,
    PromptInjection, AuditLogger,
)

guards = Pipeline([
    PII_Detector(
        entities=["NAME","EMAIL","SSN","PHI"],
        action="redact",
        confidence=0.92
    ),
    ContentFilter(
        categories=["harmful","illegal","bias"],
        model="vertex-guard-v2"
    ),
    PromptInjection(
        scan="jailbreak|override|ignore",
        action="block_and_alert"
    ),
    AuditLogger(
        immutable=True,
        compliance=["SOC2","HIPAA","FedRAMP"]
    )
])

# ✓ <0.5ms overhead | 99.3% precision
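
For intuition about what a redact-then-block pipeline does, here is a deliberately simple stand-in built on regular expressions. It swaps the model-based detectors above for plain pattern matching, covers only EMAIL and SSN, and reuses the `jailbreak|override|ignore` scan string from the config. The `apply_guards` function and its return shape are hypothetical.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
INJECTION = re.compile(r"jailbreak|override|ignore", re.IGNORECASE)


def apply_guards(text: str) -> dict:
    """Block suspected prompt injection; otherwise redact emails and SSNs."""
    if INJECTION.search(text):
        return {"action": "block_and_alert", "text": None}
    redacted = EMAIL.sub("[EMAIL]", text)
    redacted = SSN.sub("[SSN]", redacted)
    return {"action": "allow", "text": redacted}
```

A keyword scan like this produces false positives on ordinary sentences containing "ignore" or "override", which is one reason a production guard would pair it with a classifier and a confidence threshold.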

Now in private beta

Two products, one runtime story

The runtime above isn't a roadmap slide — it's shipping as two products in private beta. Request early access to run them against your own models and traffic.

Beta · VertexStudio Runtime

Ship production AI behind one API

VertexStudio Runtime is the Cloudflare-native control plane that turns the five jobs above into one endpoint. Point your app at it once; it routes each request to the right model, runs your agents, enforces guardrails, and meters every token — at the edge, close to your users.

Quality-aware routing across Workers AI, OpenAI, Anthropic, and your own self-hosted backends, with edge semantic caching and Llama Guard guardrails.
Stateful agents — a planner → executor → synthesizer loop with real tool calls, Vectorize-backed memory, and SSE streaming.
Observability built in — per-project request, latency, token, cost, and cache-savings metrics, backed by D1.
Multi-tenant by default — orgs and projects, scoped hashed API keys, and an append-only audit log.
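
Semantic caching, mentioned in the routing bullet above, can be pictured as a nearest-neighbour lookup over prompt embeddings. The sketch below is an in-memory toy, not how the Runtime caches at the edge: the `SemanticCache` class, the 0.95 similarity threshold, and the assumption that the caller supplies embeddings are all illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold                  # assumed similarity cutoff
        self.entries: list[tuple[list[float], str]] = []

    def put(self, embedding: list[float], response: str) -> None:
        self.entries.append((list(embedding), response))

    def get(self, embedding: list[float]):
        """Return the cached response of the most similar entry, or None on a miss."""
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None
```

A hit skips the model call entirely, which is where the cache-savings metric in the observability bullet would come from.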

Cloudflare Workers · D1 · Vectorize · Durable Objects · Workers AI

Request early access

[Image: VertexStudio Runtime control plane showing request, token, cost, and cache metrics, a latency chart, and the model router routing across edge and cloud providers.]
Runtime control plane: live routing, cost, and latency across every provider.

Beta · QuantizedOps

Make any model smaller, faster, and trusted

QuantizedOps takes a model from import to a validated, optimized artifact you can actually ship. Pick a target, and it converts, evaluates, benchmarks, and gates the result — so you choose a quantization with eyes open on size, latency, and accuracy, not on faith.

Import anything — Hugging Face, or upload .gguf/.onnx; convert to llama.cpp / Ollama GGUF (Q4_K_M, Q5_K_M, Q8_0, F16) or ONNX Runtime INT8.
Validated, not guessed — every job runs baseline-vs-converted evaluation, runtime validation, and benchmarking, with a shareable report.
Compare and serve — rank variants on size, latency, throughput, and accuracy delta, then serve a finished artifact behind an OpenAI-compatible endpoint.
Built to run — multi-tenant RBAC, API keys, and audit; Kubernetes-native with scale-to-zero GPU workers and a conversion cache.
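
The compare step can be thought of as filter-then-sort: drop variants that lose too much accuracy, then order the rest by what matters for the target. A minimal sketch, assuming a hypothetical `Variant` record, an accuracy-loss budget of 0.01, and latency-then-size as the sort key (none of which are stated by QuantizedOps):

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str               # e.g. a quantization profile such as "Q4_K_M"
    size_gb: float
    latency_ms: float
    accuracy_delta: float   # converted minus baseline; negative means accuracy was lost


def rank(variants: list[Variant], max_accuracy_loss: float = 0.01) -> list[Variant]:
    """Keep variants within the accuracy budget, fastest first, smaller size breaking ties."""
    eligible = [v for v in variants if -v.accuracy_delta <= max_accuracy_loss]
    return sorted(eligible, key=lambda v: (v.latency_ms, v.size_gb))
```

Swapping the sort key, for example to put size first for an edge target, changes the ranking without touching the accuracy filter.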

llama.cpp · ONNX Runtime · Ray · Kubernetes · vLLM (roadmap)

Request early access

[Image: QuantizedOps model optimization console showing an import-to-serve pipeline, a conversion job, and a comparison of quantization profiles by size, latency, and accuracy.]
Optimization console: convert, validate, and compare quantizations before you ship.

One Unified Runtime.
Zero Compromises.

Inference to Action
in Milliseconds

Run the Runtime
on Your Workload