Data-Center-Engineering
13 posts
[...] Read More... The post Powering AI innovation by accelerating the next wave of nuclear appeared first on Engineering at Meta.
At the Open Compute Project (OCP) Summit 2024, we’re sharing details about our next-generation network fabric for our AI training clusters. We’ve expanded our network hardware portfolio and are contributing two new disaggregated network fabrics and a new NIC to OCP. We look forward to continued collaboration with OCP to open designs for racks, servers, storage [...] Read More... The post OCP Summit 2024: The open future of networking hardware for AI appeared first on Engineering at Meta.
At the Open Compute Project (OCP) Global Summit 2024, we’re showcasing our latest open AI hardware designs with the OCP community. These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components. By sharing our designs, we hope to inspire collaboration and foster innovation. If you’re passionate about building [...] Read More... The post Meta’s open AI hardware vision appeared first on Engineering at Meta.
We’re sharing more about the role that reinforcement learning plays in helping us optimize our data centers’ environmental controls. Our reinforcement learning-based approach has helped us reduce energy consumption and water usage across various weather conditions in our data centers. Meta is revamping its new data center design to optimize for artificial intelligence and the [...] Read More... The post Simulator-based reinforcement learning for data center cooling optimization appeared first on Engineering at Meta.
[...] Read More... The post Read Meta’s 2024 Sustainability Report appeared first on Engineering at Meta.
We are introducing a new metric, real-time server fleet utilization effectiveness, as part of the RETINAS initiative to help reduce emissions and achieve net zero emissions across our value chain in 2030. This new metric allows us to measure server resource usage (e.g., compute, storage) and efficiency in our large-scale data center server fleet in [...] Read More... The post RETINAS: Real-Time Infrastructure Accounting for Sustainability appeared first on Engineering at Meta.
Delivering new AI technologies at scale also means rethinking every layer of our infrastructure, from silicon and software systems to our data center designs. For the second year in a row, Meta’s engineering and infrastructure teams returned for the AI Infra @ Scale conference, where they discussed the challenges of scaling up an [...] Read More... The post Aparna Ramani discusses the future of AI infrastructure appeared first on Engineering at Meta.
AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training, enabling large models with hundreds of billions of parameters, such as Llama 3.1 405B. This week at ACM SIGCOMM 2024 in Sydney, Australia, we are sharing details on the network we have built at Meta [...] Read More... The post RoCE networks for distributed AI training at scale appeared first on Engineering at Meta.
We are open-sourcing DCPerf, a collection of benchmarks that represents the diverse categories of workloads that run in data center cloud deployments. We hope that DCPerf can be used more broadly by academia, the hardware industry, and internet companies to design and evaluate future products. DCPerf is available now on GitHub. Hyperscale and cloud datacenter [...] Read More... The post DCPerf: An open source benchmark suite for hyperscale compute applications appeared first on Engineering at Meta.
Meta is currently operating many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, meticulously designed to support the scaling demands of compute and storage. A year ago, however, as the industry reached a critical inflection point due to the rise of artificial intelligence (AI), we [...] Read More... The post Maintaining large-scale AI capacity at Meta appeared first on Engineering at Meta.
Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. We are strongly committed to open [...] Read More... The post Building Meta’s GenAI Infrastructure appeared first on Engineering at Meta.
While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol, Simple Precision Time Protocol (SPTP), that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources. In our own tests, SPTP boasts comparable performance to PTP, but with significant improvements in [...] Read More... The post Simple Precision Time Protocol at Meta appeared first on Engineering at Meta.
What the research is: Millisampler is one of Meta’s latest characterization tools and allows us to observe, characterize, and debug network performance at high-granularity timescales efficiently. This lightweight network traffic characterization tool for continual monitoring operates at fine, configurable timescales. It collects time series of ingress and egress traffic volumes, number of active flows, incoming [...] Read More... The post A fine-grained network traffic analysis with Millisampler appeared first on Engineering at Meta.