AI Field Day 5 - Enfabrica

For a complete list of AI Field Day 5 delegate coverage, please visit the AI Field Day 5 page hosted by Tech Field Day.

Notes and Takeaways

The founders of Enfabrica saw that the nature of computation was changing when GPUs began to appear in hyperscale data centers that were still built on the networks of that day. Three years later…

FLOPS demands are outpacing DRAM and interconnect technologies. MFU and HFU (model and hardware FLOPS utilization) indicate that networking performance will remain the arbiter of how useful computation at scale actually is.
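
For context, MFU is simply the ratio of the FLOPS a training job actually sustains to the peak FLOPS the hardware could deliver. A quick back-of-the-envelope sketch in Python (every parameter value below is an illustrative assumption, not a figure from the presentation):

    # Back-of-the-envelope MFU estimate (illustrative numbers, not vendor data).
    # MFU = achieved model FLOPS per second / peak hardware FLOPS per second.
    params = 70e9                  # model parameters (assumed)
    tokens_per_second = 2.0e6      # aggregate training throughput (assumed)
    flops_per_token = 6 * params   # common approximation for one fwd+bwd pass

    num_gpus = 2048                # accelerators in the job (assumed)
    peak_flops_per_gpu = 989e12    # roughly one H100's dense BF16 peak (assumed)

    achieved = tokens_per_second * flops_per_token
    peak = num_gpus * peak_flops_per_gpu
    print(f"estimated MFU: {achieved / peak:.1%}")
    # If network stalls cut tokens_per_second, MFU drops even though the GPUs
    # themselves got no slower, which is the point being made here.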

Meeting this demand requires a high-performance, distributed server networking solution.

You’re probably wondering how we got here.

  • In the beginning, there was ccNUMA (and the mainframe): tightly coupled system designs that scale up.
  • Later, an adjacent approach that scales out eventually became what we now call the hyperscale cloud service providers.

Essentially, these two worlds are colliding under AI workload demands. A convergence of the fully distributed scale-out and the centralized scale-up is underway: kernels meeting CUDA, IPC domains meeting RPC domains.
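
As a rough illustration of that IPC/RPC distinction (nothing below is Enfabrica-specific; it simply contrasts a shared-memory hand-off inside one host with a message pushed through the network stack):

    # Minimal contrast between an IPC-style transfer (shared memory inside one
    # host) and an RPC-style transfer (a serialized message over the network).
    # Purely illustrative; not Enfabrica code.
    from multiprocessing import shared_memory
    import pickle
    import socket

    payload = bytes(range(256)) * 4

    # IPC domain: the "transfer" is just a memory write that another local
    # process can map by name; no NIC, no serialization.
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    shm.close()
    shm.unlink()

    # RPC domain: the payload is serialized and pushed through the network stack.
    def rpc_send(host: str, port: int, obj) -> None:
        """Hypothetical helper: ship an object to a remote peer over TCP."""
        with socket.create_connection((host, port)) as s:
            s.sendall(pickle.dumps(obj))

    # The colliding-worlds argument is that AI workloads now need both paths to
    # behave like a single fabric rather than two very different domains.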

Along the journey, new approaches introduced fabrics that are independent of one another and tied to particular levels of the system. Primitives that stay bound to any single one of those fabrics ultimately only serve to reduce MFU as the system scales.

And now… Enfabrica has entered the market with a solution. The Enfabrica answer to these colliding worlds is a SuperNIC that speaks memory, speaks packets, converts between the two, and elastically binds them.

The acronym (or backronym) is Accelerated Compute Fabric (ACF), and the ACF-S is the SuperNIC that implements it.

  • Scaling AI workloads is hard with prior technologies
    • While compute performance has increased dramatically, data movement capabilities (IO and memory bandwidth) haven’t kept pace.
    • This disparity creates a bottleneck, limiting the efficiency of large-scale AI systems.
    • Existing networking architectures, designed for traditional computing, struggle to handle the unique demands of AI workloads.
    • Scaling up using tightly coupled interconnects (like NVLink) has its own limitations.
    • Scaling out using distributed systems introduces latency and communication overhead.
  • ACF is designed to address these challenges.
    • ACF combines the advantages of both approaches, offering high bandwidth and low latency for both local (IPC) and distributed (RPC) communication.
    • Faster data movement between GPUs and across the network.
    • By integrating multiple networking functions, including PCIe switching, RDMA, and high-speed Ethernet, it has the potential to reduce system complexity and associated cost.
    • By optimizing data transfers, minimizing memory copies, and enhancing network resilience, it potentially contributes to better compute utilization and fewer job failures.
  • Key claims
    • Compatible with existing APIs and libraries, minimizing disruption for developers (see the sketch after this list).
    • Utilizes established protocols like Ethernet, IP, and RoCE, ensuring interoperability with current infrastructure.
    • Can be deployed in existing server chassis or as part of a disaggregated (composable) system, providing flexibility for different use cases.
    • Enables flexible (composable) system configurations, allowing for different types of processing units and storage to be interconnected for optimized performance.
    • Designed to support future standards like CXL 2.0 and Ultra Ethernet, ensuring adaptability and longevity.
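
To ground the "compatible with existing APIs and libraries" claim: the code that benefits is ordinary collective-communication code, which already rides on Ethernet/IP/RoCE underneath. A minimal sketch using PyTorch's standard distributed API (an illustrative assumption on my part; this code was not shown in the presentation):

    # A routine data-parallel all-reduce via PyTorch's standard distributed API.
    # The application code is unchanged regardless of what NIC or fabric sits
    # underneath; the collective rides on NCCL/RoCE (or plain TCP with "gloo").
    import torch
    import torch.distributed as dist

    def main() -> None:
        # Rank and world size are normally injected by the launcher (torchrun).
        dist.init_process_group(backend="gloo")  # "nccl" on GPU clusters
        rank = dist.get_rank()

        # Stand-in for a gradient tensor produced by a backward pass.
        grad = torch.full((1024,), float(rank))

        # The all-reduce is where interconnect bandwidth and latency show up:
        # every rank finishes holding the sum of all ranks' tensors.
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)

        if rank == 0:
            print("reduced first element:", grad[0].item())
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with something like torchrun --nproc_per_node=2 allreduce_sketch.py (the filename is hypothetical), this runs over plain TCP today; the vendor claim is that the same unmodified code path is what their fabric accelerates.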

Presenter Videos

To see a playlist of presenter videos, visit Tech Field Day and hit subscribe.

More presenters from AIFD5

For a complete list of AI Field Day 5 delegate coverage, please visit the AI Field Day 5 page hosted by Tech Field Day.