Building Block

MoAI Fabric

A software-defined fabric that moves KV cache directly between heterogeneous chips and software stacks, making prefill-decode disaggregation work in production across vendor, generation, and parallelism boundaries.

The Problem

KV Cache Is Where Heterogeneity Fails

Current inference software stacks assume that the producer and the consumer of a KV cache are identical. When they aren't, KV cache transfer becomes the bottleneck that prevents efficient use of heterogeneous chips.

Cross-Vendor Transport

Direct GPU-to-GPU RDMA is vendor-locked. There is no native path for moving KV cache bytes between chips from different vendors, only a prohibitively slow detour through CPU memory.

Memory Layout

Different attention implementations arrange KV tensors differently in GPU memory. A producer's bytes can't be read by a consumer that expects a different layout.
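
The layout mismatch can be illustrated with a minimal sketch. It assumes two hypothetical layouts, [tokens, heads, head_dim] for the producer and [heads, tokens, head_dim] for the consumer; the names `offset_thd`, `offset_htd`, and `repack` are made up for this example and are not part of MoAI Fabric.

```python
from itertools import product

def offset_thd(t, h, d, T, H, D):
    """Flat offset in a hypothetical [tokens, heads, head_dim] layout."""
    return (t * H + h) * D + d

def offset_htd(t, h, d, T, H, D):
    """Flat offset in a hypothetical [heads, tokens, head_dim] layout."""
    return (h * T + t) * D + d

def repack(buf, src_offset, dst_offset, T, H, D):
    """Copy every element from the source layout to the destination layout."""
    out = [None] * len(buf)
    for t, h, d in product(range(T), range(H), range(D)):
        out[dst_offset(t, h, d, T, H, D)] = buf[src_offset(t, h, d, T, H, D)]
    return out

T, H, D = 3, 2, 4  # tokens, KV heads, head dimension (toy sizes)

# The producer stores the logical index (t, h, d) as the "value",
# so any misread is easy to spot.
producer = [None] * (T * H * D)
for t, h, d in product(range(T), range(H), range(D)):
    producer[offset_thd(t, h, d, T, H, D)] = (t, h, d)

# A consumer that indexes the raw buffer with its own layout
# asks for (token 1, head 0, dim 0) and gets a different element.
misread = producer[offset_htd(1, 0, 0, T, H, D)]

# After an explicit layout translation the consumer reads correctly.
consumer = repack(producer, offset_thd, offset_htd, T, H, D)
correct = consumer[offset_htd(1, 0, 0, T, H, D)]

print("misread:", misread, "correct:", correct)
```

In this toy case the untranslated read returns the element for token 0, head 1 instead of token 1, head 0, which is the kind of silent corruption a layout translation step is meant to prevent.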

Data Type & Quantization

Different precisions and quantization schemes encode the same value into different bit patterns. Bytes moved across them become unrelated numbers without explicit translation.
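
As a small illustration of the point above, the sketch below takes FP16 and BF16 as one example pair of formats (an assumption chosen for simplicity; real deployments may involve other precisions or quantization schemes). The helper names are hypothetical, and the BF16 conversion truncates rather than rounds to keep the code short.

```python
import struct

def fp16_bits(x):
    """Return the 16-bit pattern of x encoded as IEEE half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def fp16_bits_to_float(bits):
    """Decode a 16-bit pattern as IEEE half precision."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def bf16_bits_to_float(bits):
    """Decode a 16-bit pattern as bfloat16 (the top 16 bits of a float32)."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

def float_to_bf16_bits(x):
    """Encode x as bfloat16 by truncating a float32 (no rounding, for brevity)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

raw = fp16_bits(1.0)                    # producer writes 1.0 as FP16
misread = bf16_bits_to_float(raw)       # consumer reinterprets the bytes as BF16

# Explicit translation: decode with the producer's format,
# then re-encode with the consumer's format.
translated = float_to_bf16_bits(fp16_bits_to_float(raw))
recovered = bf16_bits_to_float(translated)

print(hex(raw), misread, hex(translated), recovered)
```

Here the same two bytes that mean 1.0 in FP16 decode to 0.0078125 when read as BF16; only the decode-then-re-encode path preserves the value.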

Parallel Partitioning

Different parallelism strategies split KV cache across multiple GPUs in different ways. A naive 1:1 GPU-to-GPU transfer can't reconstruct the right data.
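
To make the partitioning mismatch concrete, here is a minimal sketch that assumes KV heads are split evenly and contiguously across tensor-parallel ranks on both sides. The functions `head_ranges` and `transfer_plan` are hypothetical names for this example; they only compute which producer rank must send which heads to which consumer rank.

```python
def head_ranges(num_heads, tp):
    """Contiguous, even split of KV heads across `tp` ranks (an assumption)."""
    if num_heads % tp != 0:
        raise ValueError("num_heads must be divisible by the parallel degree")
    per_rank = num_heads // tp
    return [range(r * per_rank, (r + 1) * per_rank) for r in range(tp)]

def transfer_plan(num_heads, src_tp, dst_tp):
    """List (src_rank, dst_rank, heads) for every overlapping pair of shards."""
    plan = []
    for dst_rank, dst_heads in enumerate(head_ranges(num_heads, dst_tp)):
        for src_rank, src_heads in enumerate(head_ranges(num_heads, src_tp)):
            lo = max(src_heads.start, dst_heads.start)
            hi = min(src_heads.stop, dst_heads.stop)
            if lo < hi:  # the two shards share at least one head
                plan.append((src_rank, dst_rank, list(range(lo, hi))))
    return plan

# Prefill on 4 GPUs, decode on 2 GPUs: each decode GPU gathers from two producers.
gather_plan = transfer_plan(num_heads=8, src_tp=4, dst_tp=2)

# Prefill on 2 GPUs, decode on 4 GPUs: each prefill GPU scatters to two consumers.
scatter_plan = transfer_plan(num_heads=8, src_tp=2, dst_tp=4)

for src, dst, heads in gather_plan:
    print(f"producer {src} -> consumer {dst}: heads {heads}")
```

With different parallel degrees on each side, no rank has a single peer: the transfer becomes a many-to-one or one-to-many pattern, which is why a 1:1 copy cannot reconstruct the data.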

Solution

Direct, Compatible KV Cache Transfer Across Vendors

MoAI Fabric moves KV cache directly between GPUs of any vendor, translating between memory layouts, dtypes, quantization schemes, and parallel partitioning along the way.

[Diagram: GPU (Vendor A), GPU (Vendor B); labels: KV Cache Compatibility, Cross-Vendor Direct RDMA]

What It Enables

Decouple Prefill and Decode

Once KV cache movement is no longer locked to identical hardware and software, prefill and decode can be deployed independently — each on the right vendor, generation, and parallelism for the job.

Across Vendors

Run prefill on NVIDIA GPUs and decode on AMD GPUs, or the reverse. Fabric translates the KV cache between vendor-specific formats and moves it directly across the network, with no slow CPU detour and no vendor lock-in for either phase.

Across Generations

Mix GPU generations across phases — for example, B300 for prefill and H200 for decode. Different generations often use different KV cache formats; Fabric reconciles them transparently, so older inventory keeps earning its place alongside the newest chips.

Independent Sizing and Parallelism

Choose the GPU count and parallelism strategy for prefill and decode independently, driven by your latency and throughput SLOs. Fabric handles the KV cache partitioning mismatch when the two phases run at different scales.