Overview
Moreh’s mission is to provide alternatives to NVIDIA GPUs for AI data centers through advanced software technologies. As part of this effort, we have been working closely with Tenstorrent and will be launching a data center solution in Q4 2025. Tenstorrent, led by legendary semiconductor architect Jim Keller, delivers scalable hardware through network-integrated AI chips. On top of that, Moreh adds its unique cluster architecture and software for efficiently utilizing many chips, completing a full-stack solution. We are confident that this is the best option for minimizing the total cost of ownership (TCO) of AI data centers.
This article describes the architecture of the Tenstorrent solution we provide. Our approach, chip architecture, cluster architecture, and software architecture differ fundamentally from those of conventional NVIDIA GPUs and DGX systems. We explain how this enables us to optimize large-scale AI infrastructure. Below is a summary of our differentiators:
- Approach
- We employ a larger number of lighter chips compared to GPUs, achieving high performance and efficiency at the cluster level rather than at the individual chip level.
- To realize this, a scalable network architecture and software capable of efficiently leveraging so many chips are essential.
- Since individual chips do not require extremely high performance, they can be built on older process nodes (e.g., 6 nm or 12 nm) and use GDDR memory instead of HBM, thereby maximizing overall cost efficiency.
- The chips are not limited to inference but can be used for both training and inference. This is a crucial factor for large-scale AI data centers when adopting a new type of processor.
- Chip architecture
- Large software-managed SRAMs (approximately 1.5 MB per core) are adopted instead of a complex hardware-managed memory hierarchy such as coherent shared caches. With proper software support, this can minimize off-chip memory bandwidth requirements.
- Intra-chip inter-core communication is performed explicitly through a 2D torus Network-on-Chip (NoC), rather than indirectly via shared memory or caches. This allows direct data exchange between cores without consuming bandwidth from off-chip memory or shared caches, while giving the software more room to optimize data movement.
- A block floating-point format is supported, where 16 adjacent elements share a common exponent. This reduces memory footprint and bandwidth requirements by approximately half, without causing significant impact on accuracy.
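To make the block floating-point idea concrete, here is a minimal sketch of the general technique: 16 adjacent elements share one exponent, and each element keeps only a small signed mantissa. The block size of 16 matches the description above, but the mantissa width and rounding behavior are illustrative assumptions, not Tenstorrent’s exact on-chip format.

```python
import math

BLOCK_SIZE = 16  # 16 adjacent elements share a common exponent

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to block floating-point.

    Illustrative sketch only: mantissa width and rounding are assumptions,
    not the exact hardware format. Storage is one shared exponent plus
    16 x 8-bit mantissas (~136 bits) versus 16 x 16 bits for FP16
    (256 bits) -- roughly half the footprint and bandwidth.
    """
    assert len(block) == BLOCK_SIZE
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * BLOCK_SIZE
    # Shared exponent taken from the largest-magnitude element:
    # max_abs = m * 2**shared_exp with 0.5 <= m < 1.
    shared_exp = math.frexp(max_abs)[1]
    scale = 2 ** (mantissa_bits - 1) / 2 ** shared_exp
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(lo, min(hi, round(x * scale))) for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits=8):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    scale = 2 ** shared_exp / 2 ** (mantissa_bits - 1)
    return [m * scale for m in mantissas]
```

Because the shared exponent tracks the largest element in the block, small elements lose some precision relative to per-element formats, but for the locally similar magnitudes typical of neural-network tensors the accuracy impact stays small.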
- Cluster architecture
- Each chip is equipped with built-in Ethernet interfaces, enabling direct data transfer between two linked chips with low latency and without CPU intervention.
- Multiple chips are interconnected through a torus network, without requiring a complex switch network (similar to Google’s TPU clustering approach). A torus network is beneficial for communication patterns of typical AI workloads.
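The torus topology above can be sketched in a few lines: every chip at coordinate (x, y) has exactly four neighbors, with edge links wrapping around, so no central switch is needed and the worst-case hop distance is halved compared to a plain 2D mesh. The functions below are a generic 2D-torus illustration; actual Tenstorrent cluster dimensions vary by product.

```python
def torus_neighbors(x, y, width, height):
    """The four direct neighbors of chip (x, y) in a 2D torus.

    Wraparound (modulo) links mean every chip has the same degree,
    so nearest-neighbor collectives map uniformly onto the network.
    Illustrative sketch; real topologies vary by product.
    """
    return [
        ((x - 1) % width, y),   # west
        ((x + 1) % width, y),   # east
        (x, (y - 1) % height),  # north
        (x, (y + 1) % height),  # south
    ]

def torus_hops(a, b, width, height):
    """Minimum hop count between chips a and b, using wraparound links."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)
```

For example, in a 4x4 torus the chips at (0, 0) and (3, 0) are one hop apart via the wraparound link, whereas in a mesh they would be three hops apart. Ring-based collectives such as all-reduce, which dominate AI training traffic, only ever use these neighbor links, which is why a torus serves typical AI workloads well without a switch fabric.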
- Software architecture
- We provide an inference framework that performs distributed inference across multiple nodes and chips, presenting them as a single unified endpoint, and a training framework that allows multiple nodes and chips to operate as a single PyTorch device.
- Data distribution, task allocation, and inter-chip communication are automated by software. Consequently, even though a cluster contains more chips than a comparable GPU cluster, the overall infrastructure becomes easier to use: workloads are distributed so that communication over the torus network remains efficient.
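To illustrate what "automated by software" means at the simplest level, the sketch below shards an input batch across chips and gathers the results back, so the caller only ever deals with one logical device. The function names and even-split policy are illustrative assumptions, not the actual framework API.

```python
def shard_batch(batch, num_chips):
    """Split a batch as evenly as possible across chips.

    Illustrative sketch of software-managed data distribution:
    the user submits one batch; placement is decided here, not by the user.
    """
    base, rem = divmod(len(batch), num_chips)
    shards, start = [], 0
    for i in range(num_chips):
        size = base + (1 if i < rem else 0)  # first `rem` chips get one extra
        shards.append(batch[start:start + size])
        start += size
    return shards

def run_distributed(batch, num_chips, chip_fn):
    """Run chip_fn on each shard and concatenate results in order.

    chip_fn stands in for per-chip execution; in a real system the
    framework would also overlap compute with torus communication.
    """
    results = []
    for shard in shard_batch(batch, num_chips):
        results.extend(chip_fn(shard))
    return results
```

In the real frameworks this decomposition also accounts for torus locality, placing communicating shards on neighboring chips; the point of the sketch is only that the decomposition is invisible to the user, who sees a single endpoint or a single PyTorch device.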
Please read further in the PDF file.