Keep Your AI Workloads Running Through GPU Failures
Fault Detection
GPU failures sometimes occur silently, making them difficult to detect. With experience operating thousands of AMD GPUs, Moreh offers a fault-tolerance platform that detects faulty GPUs, sends alerts, and removes them from scheduling.
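The detect, alert, and remove-from-scheduling cycle described above can be illustrated with a minimal sketch. This is not Moreh's implementation: the `GpuPool` class, the `probe` callable, and the alert message format are all hypothetical names chosen for illustration. The probe stands in for whatever check can expose a silent fault, for example running a small computation and comparing the result with a known answer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class GpuPool:
    """Hypothetical bookkeeping for which GPUs a scheduler may use."""
    schedulable: Set[str]
    quarantined: Set[str] = field(default_factory=set)
    alerts: List[str] = field(default_factory=list)

    def run_health_check(self, probe: Callable[[str], bool]) -> None:
        # sorted() returns a new list, so the set can be modified in the loop.
        for gpu in sorted(self.schedulable):
            try:
                healthy = probe(gpu)
            except Exception:
                healthy = False  # a probe that crashes counts as a failure
            if not healthy:
                self.schedulable.discard(gpu)   # remove from scheduling
                self.quarantined.add(gpu)       # keep for later inspection
                self.alerts.append(
                    f"GPU {gpu} failed health check; removed from scheduling"
                )
```

Calling `run_health_check` periodically with a suitable probe keeps faulty devices out of the schedulable set, while the `alerts` list stands in for a real notification channel.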
Failover for Training
With MoAI Training Framework’s built-in checkpoint-restart, training jobs survive GPU failures. Execution resumes automatically from the latest checkpoint on standby nodes, without user intervention.
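As a rough illustration of the checkpoint-restart idea in general, the sketch below saves progress periodically and resumes from the last saved step after a simulated failure. It does not use the MoAI Training Framework API: `GpuFailure`, `train`, `run_with_restart`, and the JSON checkpoint layout are assumptions made for this example, and the "training step" is a placeholder sum.

```python
import json
import os


class GpuFailure(RuntimeError):
    """Stand-in for a hardware fault surfaced to the training loop."""


def load_checkpoint(path):
    # Start from scratch when no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "state": 0}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, ckpt):
    # Write to a temp file, then rename, so a crash cannot leave a partial file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(ckpt, f)
    os.replace(tmp, path)


def train(path, total_steps, every, fail_at=None):
    ckpt = load_checkpoint(path)
    step, state = ckpt["step"], ckpt["state"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise GpuFailure(f"simulated failure at step {step}")
        state += step  # placeholder for one real training step
        step += 1
        if step % every == 0:
            save_checkpoint(path, {"step": step, "state": state})
    return state


def run_with_restart(path, total_steps, every, fail_at):
    try:
        return train(path, total_steps, every, fail_at)
    except GpuFailure:
        # A real system would reschedule the job onto a standby node here.
        return train(path, total_steps, every, fail_at=None)
```

Because the placeholder step is deterministic, the restarted run ends in the same state as an uninterrupted one; only the work done since the last checkpoint is repeated.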
Failover for Inference
MoAI Inference Framework keeps an inference endpoint alive by routing requests to standby vLLM instances when a GPU failure occurs.
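The routing behavior can be sketched in a few lines. This is an illustrative model only, not the MoAI Inference Framework or the vLLM API: `FailoverRouter`, `BackendDown`, and the two example backends are hypothetical, and each backend is modeled as a plain callable rather than a real serving instance.

```python
from typing import Callable, Sequence


class BackendDown(Exception):
    """Hypothetical signal that a backend's GPU has failed."""


class FailoverRouter:
    def __init__(self, backends: Sequence[Callable[[str], str]]):
        # Index 0 is the primary instance; the rest are standbys.
        self.backends = list(backends)

    def handle(self, prompt: str) -> str:
        # Try backends in order, dropping any that report a failure.
        while self.backends:
            backend = self.backends[0]
            try:
                return backend(prompt)
            except BackendDown:
                self.backends.pop(0)  # stop routing to the failed instance
        raise RuntimeError("no healthy backends left")


def broken_primary(prompt: str) -> str:
    # Example backend that always fails, simulating a lost GPU.
    raise BackendDown("GPU lost")


def standby(prompt: str) -> str:
    # Example backend that answers normally.
    return f"standby:{prompt}"
```

With `FailoverRouter([broken_primary, standby])`, the caller sees a normal response from the standby, and the failed instance receives no further requests.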