Networking-and-Traffic
12 posts
At Open Compute Project Summit (OCP) 2024, we’re sharing details about our next-generation network fabric for our AI training clusters. We’ve expanded our network hardware portfolio and are contributing two new disaggregated network fabrics and a new NIC to OCP. We look forward to continued collaboration with OCP to open designs for racks, servers, storage [...] The post OCP Summit 2024: The open future of networking hardware for AI appeared first on Engineering at Meta.
At the Open Compute Project (OCP) Global Summit 2024, we’re showcasing our latest open AI hardware designs with the OCP community. These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components. By sharing our designs, we hope to inspire collaboration and foster innovation. If you’re passionate about building [...] The post Meta’s open AI hardware vision appeared first on Engineering at Meta.
AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training large models with hundreds of billions of parameters, such as Llama 3.1 405B. This week at ACM SIGCOMM 2024 in Sydney, Australia, we are sharing details on the network we have built at Meta [...] The post RoCE networks for distributed AI training at scale appeared first on Engineering at Meta.
Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization. The tail utilization optimizations at Meta have had a profound impact on model serving capacity footprint and reliability. Failure rates, which are mostly timeout errors, were reduced by two-thirds; the compute footprint delivered 35% more work for [...] The post Taming the tail utilization of ads inference at Meta scale appeared first on Engineering at Meta.
Threads has entered the fediverse! As part of our beta experience, now available in a few countries, Threads users aged 18+ with public profiles can now choose to share their Threads posts to other ActivityPub-compliant servers. People on those servers can now follow federated Threads profiles and see, like, reply to, and repost posts from [...] The post Threads has entered the fediverse appeared first on Engineering at Meta.
Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta’s family of apps. We’ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across layers such as BWE, network resiliency, and transport. We’re sharing our experiment results from this approach, some of [...] The post Optimizing RTC bandwidth estimation with machine learning appeared first on Engineering at Meta.
While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP) that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources. In our own tests, SPTP boasts comparable performance to PTP, but with significant improvements in [...] The post Simple Precision Time Protocol at Meta appeared first on Engineering at Meta.
Meta is building for the future of AI at every level – from hardware like MTIA v1, Meta’s first-generation AI inference accelerator, to publicly released models like Llama 2, Meta’s next-generation large language model, as well as new generative AI (GenAI) tools like Code Llama. Delivering next-generation AI products and services at Meta’s scale also [...] The post Watch: Meta’s engineers on building network infrastructure for AI appeared first on Engineering at Meta.
Meta presents Chakra execution traces, an open graph-based representation of AI/ML workload execution, laying the foundation for benchmarking and network performance optimization. Chakra execution traces represent key operations (such as compute, memory, and communication), data and control dependencies, timing, and resource constraints. In collaboration with MLCommons, we are seeking industry-wide adoption for benchmarking. Meta open [...] The post Using Chakra execution traces for benchmarking and network performance optimization appeared first on Engineering at Meta.
We’re introducing Arcadia, Meta’s unified system that simulates the compute, memory, and network performance of AI training clusters. Extracting maximum performance from an AI cluster and increasing overall efficiency warrant a multi-input system that accounts for various hardware and software parameters across compute, storage, and network collectively. Arcadia gives Meta’s researchers and engineers valuable insights [...] The post Arcadia: An end-to-end AI system performance simulator appeared first on Engineering at Meta.
Meta is transferring its IP for Evenstar, a program to accelerate the adoption of open RAN technologies, to the Open Compute Project (OCP). Meta will contribute Evenstar’s radio unit design to OCP, giving the telecom industry its first open, white box radio unit solution. The TIP Open RAN community will leverage the Evenstar radio unit [...] The post Meta’s Evenstar is transitioning to OCP to accelerate open RAN adoption appeared first on Engineering at Meta.
What the research is: Millisampler is one of Meta’s latest characterization tools and allows us to observe, characterize, and debug network performance at high-granularity timescales efficiently. This lightweight network traffic characterization tool for continual monitoring operates at fine, configurable timescales. It collects time series of ingress and egress traffic volumes, number of active flows, incoming [...] The post A fine-grained network traffic analysis with Millisampler appeared first on Engineering at Meta.