Owned inference infrastructure

CueCode private inference cloud

Frontier-class open coding models on memory-dense Apple Silicon pods, sold per engineer seat behind an OpenAI-compatible endpoint.

Explore the unit model

See the data plane

agent stream · 60 tok/s

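// CueCode keeps the edit loop moving
// read, model, and applyPatch are illustrative helpers
async function repairFailingTest() {
  const trace = await read("pytest.log");
  const patch = await model.stream({
    repo: "checkout-service",
    trace, // pass the failing-test log to the model
    floor: 12, // minimum decode rate, tok/s
    endpoint: "private"
  });

  return applyPatch(patch);
}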

~75%

lower seat cost than frontier power-user API spend

floor-bounded decode streams per four-node pod

$110K

pilot pod to validate the load-bearing throughput number

The product

Predictable private inference for teams that live inside agents.

Engineering teams are already mixing Cursor, Copilot, Claude, GPT, Gemini, and internal agents. The result is expensive, hard to govern, and increasingly tied to metered usage. CueCode replaces that chaos with owned capacity: a per-seat line item, auditable usage, and private model serving on hardware with known provenance.

The service runs open coding models on four-node Apple Silicon pods. Customers keep the familiar API surface while CueCode owns the serving layer, scheduler, cache placement, metering, and isolation.
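If the endpoint is OpenAI-compatible, an existing client should in principle need only a different base URL and key. The sketch below builds such a request, assuming the standard `/v1/chat/completions` route. The host `cuecode.example`, the model id `open-coder`, and the `CUECODE_API_KEY` variable are hypothetical placeholders, not published values.

```javascript
// Sketch only: host, model id, and key name are hypothetical placeholders.
const BASE_URL = "https://cuecode.example/v1";

// Build an OpenAI-style chat completion request without sending it.
function buildChatRequest(apiKey, prompt) {
  return {
    url: `${BASE_URL}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model: "open-coder", // hypothetical model id
        stream: true, // ask for a token stream instead of one final response
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage (needs network access and a real endpoint):
// const { url, init } = buildChatRequest(process.env.CUECODE_API_KEY, "Fix the failing test");
// const response = await fetch(url, init);
```

Keeping the request builder separate from the network call makes it easy to point the same client at a different base URL.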

Trusted access

Open weights on owned infrastructure, with tenant boundaries wired into the cache key.

Predictable spend

Seats price concurrency instead of surprise token bills from heavy agentic work.

Control plane

SSO, budgets, audit logs, per-seat metering, and endpoint compatibility.

Dedicated paths

Pooled, reserved, and dedicated tiers map isolation and concurrency to price.
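One way to read "tenant boundaries wired into the cache key" is that the tenant id is part of every cache lookup, so identical prompt prefixes from two tenants can never resolve to the same entry. A minimal sketch of that idea follows. The class name, key layout, and sample ids are assumptions for illustration, not the product's actual scheme.

```javascript
// Hypothetical tenant-scoped cache: the tenant id is a mandatory key component.
class TenantScopedCache {
  constructor() {
    this.entries = new Map();
  }

  // JSON-encode the parts so the key is unambiguous for any id or token list.
  static key(tenantId, modelId, prefixTokens) {
    return JSON.stringify([tenantId, modelId, prefixTokens]);
  }

  put(tenantId, modelId, prefixTokens, value) {
    this.entries.set(TenantScopedCache.key(tenantId, modelId, prefixTokens), value);
  }

  get(tenantId, modelId, prefixTokens) {
    return this.entries.get(TenantScopedCache.key(tenantId, modelId, prefixTokens));
  }
}

const cache = new TenantScopedCache();
const sharedPrefix = [101, 2023, 2003]; // the same token prefix seen by two tenants

cache.put("tenant-a", "open-coder", sharedPrefix, "kv-block-for-a");
console.log(cache.get("tenant-a", "open-coder", sharedPrefix)); // kv-block-for-a
console.log(cache.get("tenant-b", "open-coder", sharedPrefix)); // undefined
```

Reuse still works inside a tenant, while a cross-tenant hit is impossible by construction rather than by policy check.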

Why now

Open coding models are good enough. The access layer is not.

AI coding spend is fragmenting

Teams are buying multiple copilots, chat tools, and agents at once. Heavy users create the most value and the least predictable bills.

Native open APIs are not enough

The best open coding models are attractive, but enterprise buyers need provenance, governance, stable capacity, and clear jurisdiction.

Owned capacity changes the unit

A private pod turns metered token surprise into a seat contract: concurrency, isolation, auditability, and a fixed line item.

Why it works

Memory bandwidth, idle gaps, and a batcher built for coding agents.

MoE inference is memory-bound

Decode reads the active weights from memory for every generated token, so throughput tracks memory bandwidth more than compute. Apple Silicon unified memory gives the pod dense, low-power bandwidth where the workload needs it.

Agents are bursty

Coding agents decode, call tools, wait, and resume. CueCode backfills the idle gaps so one pod can serve many engineers without wasting cycles.

Aggregate tokens are the unit

One forward pass advances every admitted stream by one token. The business sells the aggregate capacity above the 12 tok/s floor.

Flow test

Good tok/s is the difference between waiting and staying in the problem.

Agentic coding is not a paragraph in chat. It is code, diffs, logs, traces, and tool output arriving while the engineer is still reasoning. Once the stream falls below the floor, attention breaks.

Retention floor: 12 tok/s
CueCode target: 60 tok/s
Sample completion: 4.6s

Unit model

Move the assumptions. Watch the pod economics move with them.

5.1

engineers per pod at this workload

Single-stream speed: 15.7 tok/s
Aggregate output: 55 tok/s
Decode capacity: 4.6 streams
Capex payback: 1.2 years

Model formula: single = 28 / (1 + context / 64K); aggregate = single × batch; capacity = aggregate / 12, where 12 tok/s is the retention floor.
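The formula can be sketched as a small function. Two things here are assumptions: "64K" is read as 64,000 tokens, and the inputs (a 50,000-token context and an effective batch of 3.5) are illustrative values that happen to reproduce the displayed figures under that reading. The page does not state them. Engineers per pod and capex payback depend on inputs the formula does not show, so they are left out.

```javascript
const FLOOR_TOK_PER_SEC = 12; // retention floor from the page
const PEAK_SINGLE_STREAM = 28; // numerator of the page's formula, tok/s
const CONTEXT_SCALE = 64000; // assumption: "64K" read as 64,000 tokens

// Returns single-stream speed, aggregate output, and floor-bounded stream capacity.
function podUnitModel(contextTokens, batch) {
  const single = PEAK_SINGLE_STREAM / (1 + contextTokens / CONTEXT_SCALE);
  const aggregate = single * batch;
  const capacity = aggregate / FLOOR_TOK_PER_SEC;
  return { single, aggregate, capacity };
}

// Illustrative inputs, not stated on the page.
const unitModel = podUnitModel(50000, 3.5);
console.log(
  unitModel.single.toFixed(1),
  unitModel.aggregate.toFixed(0),
  unitModel.capacity.toFixed(1)
);
// prints: 15.7 55 4.6
```

Longer contexts lower single-stream speed, so capacity falls unless the batch grows to compensate.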

Seat shape

One pod model, three ways to buy the concurrency it creates.

Pooled

$1,500/seat-mo

Shared capacity for light and mid-market teams that want predictable spend without dedicated hardware.

Reserved

$2,500+/seat-mo

Guaranteed stream headroom for engineers running multiple agents, branch explorations, and long test-fix loops.

Dedicated

Custom/pod

Hardware-isolated pods for sensitive codebases, regulated teams, and the heaviest orchestration workloads.

The moat

The scheduler decides which streams occupy the scarce decode slots.

Diagram: four-node pod scheduler

Nodes: Mac 01 (L 1-20), Mac 02 (L 21-40), Mac 03 (L 41-60), Mac 04 (L 61-80)

Waiting streams: Q1 Q2 Q3

Admission controller: fairness, floor, seat headroom

KV placement: resident, prefetch, tier to SSD

floor · residency · entitlement · independence

Admitted batch: S1 S2 S3 S4 S5

Step 1

Retire finished streams

Finished, stopped, or cancelled generations leave immediately so a waiting stream can take the slot on the next token boundary.
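The retire-then-backfill step can be sketched as one scheduler tick. This is a simplified illustration, not the product's scheduler. The stream shape `{ id, remaining, cancelled }` is hypothetical, and admission is plain first-in-first-out, whereas the admission controller described here also weighs fairness, floor, and seat headroom.

```javascript
// One token boundary: retire, backfill, then decode one token per admitted stream.
// Hypothetical stream shape: { id, remaining, cancelled }.
function schedulerStep(admitted, waiting, maxSlots) {
  // 1. Retire finished or cancelled streams so their slots free up immediately.
  const active = admitted.filter((s) => s.remaining > 0 && !s.cancelled);

  // 2. Backfill free slots from the head of the waiting queue (FIFO for simplicity).
  while (active.length < maxSlots && waiting.length > 0) {
    active.push(waiting.shift());
  }

  // 3. One forward pass advances every admitted stream by one token.
  for (const s of active) {
    s.remaining -= 1;
  }
  return active;
}

let admittedBatch = [
  { id: "S1", remaining: 1 },
  { id: "S2", remaining: 3 },
];
const waitingQueue = [{ id: "Q1", remaining: 2 }];

admittedBatch = schedulerStep(admittedBatch, waitingQueue, 2); // S1 emits its last token
admittedBatch = schedulerStep(admittedBatch, waitingQueue, 2); // S1 retired, Q1 takes the slot
console.log(admittedBatch.map((s) => s.id)); // [ 'S2', 'Q1' ]
```

The function mutates the stream objects in place, which keeps the sketch short. A real scheduler would also track per-stream token rates against the floor.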


Pilot gate

One four-node pod validates or kills the thesis before scale capital.

$110K experiment

Four 512 GB nodes, fabric, rack, instrumentation, and the real serving path.

Throughput pass bar

Hold at least two reserved or five pooled streams above the 12 tok/s floor.

Commercial proof

Three to five design partners willing to pay per seat after the pilot.
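The throughput pass bar can be expressed as a check over measured per-stream decode rates. The sketch below assumes the bar is met when either condition holds, treats a stream at exactly 12 tok/s as passing, and uses a hypothetical measurement format with made-up sample rates.

```javascript
const PASS_BAR_FLOOR = 12; // tok/s floor from the pilot gate

// measurements: { reserved: number[], pooled: number[] } of per-stream tok/s (hypothetical format).
function meetsThroughputBar(measurements) {
  const countAtFloor = (rates) => rates.filter((r) => r >= PASS_BAR_FLOOR).length;
  const reservedOk = countAtFloor(measurements.reserved) >= 2;
  const pooledOk = countAtFloor(measurements.pooled) >= 5;
  return reservedOk || pooledOk;
}

// Made-up sample rates for illustration only.
console.log(meetsThroughputBar({ reserved: [14.2, 12.5], pooled: [] })); // true
console.log(meetsThroughputBar({ reserved: [14.2], pooled: [13, 12.1, 11.8, 12.4] })); // false
```

A real pilot would need each stream to hold the floor over a sustained window, not just at a single sample point.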

Design partners

Bring CueCode to teams with real agentic coding load.

The first deployment should measure batching efficiency, duty cycle, time-to-first-token, tiering latency, and buyer appetite for private per-seat inference.

Start a pilot conversation