MoAI Fabric
A software-defined fabric that moves KV cache directly between heterogeneous chips and software stacks, making prefill-decode disaggregation work in production across vendor, generation, and parallelism boundaries.
KV Cache Is Where Heterogeneity Fails
Current inference software stacks assume that KV cache producers and consumers are identical. When they aren't, KV cache transfer becomes the bottleneck that prevents efficient use of heterogeneous chips.
Cross-Vendor Transport
Direct GPU-to-GPU RDMA is vendor-locked. Moving KV cache bytes between chips from different vendors has no native path — only a prohibitively slow detour through CPU memory.
Memory Layout
Different attention implementations arrange KV tensors differently in GPU memory. A producer's bytes can't be read by a consumer that expects a different layout.
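To make the layout mismatch concrete, here is a minimal sketch in plain Python. It assumes a producer that stores KV values as [tokens, heads, head_dim] and a consumer that expects [heads, tokens, head_dim]. Both layouts, and the helper names `flat_index` and `relayout`, are illustrative assumptions, not the layouts or API of any particular attention kernel or of MoAI Fabric.

```python
def flat_index(shape, idx):
    """Row-major offset of a multi-dimensional index within a flat buffer."""
    offset = 0
    for dim, i in zip(shape, idx):
        offset = offset * dim + i
    return offset


def relayout(buf, tokens, heads, head_dim):
    """Copy a flat [tokens, heads, head_dim] buffer into [heads, tokens, head_dim] order."""
    out = [None] * len(buf)
    for t in range(tokens):
        for h in range(heads):
            for d in range(head_dim):
                src = flat_index((tokens, heads, head_dim), (t, h, d))
                dst = flat_index((heads, tokens, head_dim), (h, t, d))
                out[dst] = buf[src]
    return out


TOKENS, HEADS, HEAD_DIM = 2, 2, 2

# Tag every element so its origin is visible: token*100 + head*10 + dim.
producer = [
    t * 100 + h * 10 + d
    for t in range(TOKENS)
    for h in range(HEADS)
    for d in range(HEAD_DIM)
]

# The consumer wants (head=1, token=0, dim=0) and indexes with its own layout.
probe = flat_index((HEADS, TOKENS, HEAD_DIM), (1, 0, 0))

raw_read = producer[probe]  # 100: actually token 1, head 0
fixed_read = relayout(producer, TOKENS, HEADS, HEAD_DIM)[probe]  # 10: token 0, head 1
```

The raw bytes arrive intact in both cases. Only the relayout step makes them mean what the consumer expects.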
Data Type & Quantization
Different precisions and quantization schemes encode the same value as different bit patterns. Without explicit translation, bytes copied from one scheme to another decode as unrelated numbers.
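A small sketch of what this looks like in practice, using only Python's standard `struct` module. It assumes a producer that writes little-endian FP16 and a consumer that expects symmetric INT8 with a single per-tensor scale. The scale value and the function names are hypothetical choices for illustration, not a description of any specific quantization scheme.

```python
import struct


def fp16_bytes(values):
    """Encode floats as little-endian IEEE 754 half precision."""
    return struct.pack(f"<{len(values)}e", *values)


def misread_as_int8(raw):
    """What a consumer sees if it treats the same bytes as signed 8-bit integers."""
    return struct.unpack(f"<{len(raw)}b", raw)


def translate_fp16_to_int8(raw, scale):
    """Decode FP16, then quantize symmetrically: q = round(v / scale), clamped to int8."""
    count = len(raw) // 2
    values = struct.unpack(f"<{count}e", raw)
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Recover approximate float values from INT8 codes."""
    return [q * scale for q in quantized]


SCALE = 1 / 64  # assumed per-tensor scale, a power of two so the arithmetic is exact

raw = fp16_bytes([1.0, -1.5])

misread = misread_as_int8(raw)                   # (0, 60, 0, -66)
translated = translate_fp16_to_int8(raw, SCALE)  # [64, -96]
restored = dequantize(translated, SCALE)         # [1.0, -1.5]
```

Reinterpreting the bytes yields four integers with no relation to the two original values. Decoding and re-quantizing yields codes that dequantize back to them.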
Parallel Partitioning
Different parallelism strategies split KV cache across multiple GPUs in different ways. A naive 1:1 GPU-to-GPU transfer can't reconstruct the right data.
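The partitioning mismatch can be sketched as a transfer plan. The example below assumes that KV heads are split contiguously and evenly across tensor-parallel ranks on both sides. That is one common arrangement, but it is an assumption here, and `head_range` and `transfer_plan` are hypothetical helpers rather than part of any real API.

```python
def head_range(num_heads, tp_size, rank):
    """Half-open range of KV heads owned by `rank` under an even contiguous split."""
    per_rank = num_heads // tp_size
    return rank * per_rank, (rank + 1) * per_rank


def transfer_plan(num_heads, src_tp, dst_tp):
    """List (src_rank, dst_rank, head_start, head_end) for every overlapping slice."""
    if num_heads % src_tp or num_heads % dst_tp:
        raise ValueError("this sketch requires heads to divide evenly across ranks")
    plan = []
    for src in range(src_tp):
        s_lo, s_hi = head_range(num_heads, src_tp, src)
        for dst in range(dst_tp):
            d_lo, d_hi = head_range(num_heads, dst_tp, dst)
            lo, hi = max(s_lo, d_lo), min(s_hi, d_hi)
            if lo < hi:  # the two ranks share at least one head
                plan.append((src, dst, lo, hi))
    return plan


# Prefill on 4 GPUs, decode on 2 GPUs, 8 KV heads in total.
gather_plan = transfer_plan(num_heads=8, src_tp=4, dst_tp=2)
# [(0, 0, 0, 2), (1, 0, 2, 4), (2, 1, 4, 6), (3, 1, 6, 8)]

# The reverse direction: each source GPU fans out to two destinations.
scatter_plan = transfer_plan(num_heads=8, src_tp=2, dst_tp=4)
# [(0, 0, 0, 2), (0, 1, 2, 4), (1, 2, 4, 6), (1, 3, 6, 8)]
```

In the first plan each decode GPU must gather slices from two prefill GPUs, and in the second each prefill GPU must split its shard across two decode GPUs. Neither case can be served by pairing GPUs one-to-one.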
Direct, Compatible KV Cache Transfer Across Vendors
MoAI Fabric moves KV cache directly between GPUs of any vendor, translating between memory layouts, dtypes, quantization schemes, and parallel partitioning along the way.
[Diagram: GPU (Vendor A) and GPU (Vendor B), labeled "KV Cache Compatibility" and "Cross-Vendor Direct RDMA"]
Decouple Prefill and Decode
Once KV cache movement is no longer locked to identical hardware and software, prefill and decode can be deployed independently — each on the right vendor, generation, and parallelism for the job.
Across Vendors
Run prefill on NVIDIA GPUs and decode on AMD GPUs, or the reverse. Fabric translates the KV cache between vendor-specific formats and moves it directly across the network, with no slow CPU detour and no vendor lock-in for either phase.
Across Generations
Mix GPU generations across phases — for example, B300 for prefill and H200 for decode. Different generations often use different KV cache formats; Fabric reconciles them transparently, so older inventory keeps earning its place alongside the newest chips.
Independent Sizing and Parallelism
Choose the GPU count and parallelism strategy for prefill and decode independently, driven by your latency and throughput SLOs. Fabric handles the KV cache partitioning mismatch when the two phases run at different scales.