ML-Applications

35 posts

Meta Andromeda: Supercharging Advantage+ automation with the next-gen personalized ads retrieval engine
Andromeda is Meta’s proprietary machine learning (ML) system designed for retrieval in ads recommendation, focused on delivering a step-function improvement in value to our advertisers and people. The system pushes the boundary of cutting-edge AI for retrieval with NVIDIA Grace Hopper Superchip and Meta Training and Inference Accelerator (MTIA) hardware through innovations in ML [...]
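To ground the term “retrieval” above: retrieval is the stage that narrows a huge candidate set of ads down to a small set worth ranking. Below is a minimal, hypothetical sketch of embedding-based retrieval using a brute-force dot product; the shapes, names, and scoring are illustrative assumptions, not Andromeda’s actual design.

    import torch

    def retrieve_top_k(user_embedding, ad_embeddings, k=100):
        # user_embedding: (dim,); ad_embeddings: (num_ads, dim)
        scores = ad_embeddings @ user_embedding      # dot-product relevance scores
        return torch.topk(scores, k).indices         # indices of the top-k candidate ads

    user = torch.randn(64)
    ads = torch.randn(100_000, 64)
    candidates = retrieve_top_k(user, ads)           # handed off to the ranking stage

In practice, retrieval at this scale typically relies on approximate nearest-neighbor search and accelerator hardware rather than a brute-force scan.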

12/2/2024

Sequence learning: A paradigm shift for personalized ads recommendations
AI plays a fundamental role in creating valuable connections between people and advertisers within Meta’s family of apps. Meta’s ad recommendation engine, powered by deep learning recommendation models (DLRMs), has been instrumental in delivering personalized ads to people. Key to this success was incorporating thousands of human-engineered signals, or features, in the DLRM-based recommendation system. [...]
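As a rough illustration of what “sequence learning” means in this context: instead of consuming thousands of hand-engineered features, a model encodes the user’s raw event sequence directly. The sketch below is a minimal, hypothetical PyTorch example; the module names, sizes, and scoring are assumptions, not Meta’s production architecture.

    import torch
    import torch.nn as nn

    class SequenceRecommender(nn.Module):
        def __init__(self, num_event_types=10_000, dim=128, num_layers=2):
            super().__init__()
            self.event_embedding = nn.Embedding(num_event_types, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.scorer = nn.Linear(dim, 1)  # engagement-probability logit

        def forward(self, event_ids, ad_embedding):
            # event_ids: (batch, seq_len) user engagement events, newest last
            # ad_embedding: (batch, dim) embedding of the candidate ad
            seq = self.encoder(self.event_embedding(event_ids))
            user_repr = seq[:, -1, :]                      # summary of the user's history
            return self.scorer(user_repr * ad_embedding)   # user-ad interaction score

    model = SequenceRecommender()
    logits = model(torch.randint(0, 10_000, (8, 50)), torch.randn(8, 128))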

11/19/2024

OCP Summit 2024: The open future of networking hardware for AI
At the Open Compute Project (OCP) Summit 2024, we’re sharing details about our next-generation network fabric for our AI training clusters. We’ve expanded our network hardware portfolio and are contributing two new disaggregated network fabrics and a new NIC to OCP. We look forward to continued collaboration with OCP to open designs for racks, servers, storage [...]

10/15/2024

Meta’s open AI hardware vision
At the Open Compute Project (OCP) Global Summit 2024, we’re showcasing our latest open AI hardware designs with the OCP community. These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components. By sharing our designs, we hope to inspire collaboration and foster innovation. If you’re passionate about building [...]

10/15/2024

How open source AI can improve population estimates, sustainable energy, and the delivery of climate change interventions
Data for Good at Meta is open-sourcing the data used to train our AI-powered population maps. We’re hoping that researchers and other organizations around the world will be able to leverage these tools to assist with a wide range of projects, including those on climate adaptation, public health, and disaster response. The dataset and code [...]

10/3/2024

Simulator-based reinforcement learning for data center cooling optimization
We’re sharing more about the role that reinforcement learning plays in helping us optimize our data centers’ environmental controls. Our reinforcement learning-based approach has helped us reduce energy consumption and water usage across various weather conditions in our data centers. Meta is revamping its new data center design to optimize for artificial intelligence and the [...]
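For readers unfamiliar with the setup, simulator-based reinforcement learning means the control policy is trained against a simulation of the cooling plant rather than the live facility. The toy sketch below shows the shape of such a loop with tabular Q-learning; the simulator dynamics, states, actions, and reward are invented for illustration and are not the system described in the post.

    import random

    class ToyCoolingSimulator:
        """Hypothetical stand-in for a data center cooling simulator."""
        def __init__(self):
            self.temp = 30.0  # supply air temperature in degrees C (illustrative)

        def step(self, setpoint):
            # IT load warms the room; airflow cools it but costs energy (toy dynamics).
            self.temp += (35.0 - self.temp) * 0.1 - setpoint * 0.5 + random.uniform(-0.2, 0.2)
            energy = setpoint ** 2
            reward = -energy - (5.0 if self.temp > 32.0 else 0.0)  # penalize overheating
            return round(self.temp), reward

    actions = [0, 1, 2, 3]        # discretized airflow setpoints
    q = {}                        # Q-table keyed by (rounded temp, action)
    sim, state = ToyCoolingSimulator(), 30
    for episode in range(1000):
        if random.random() < 0.1:
            a = random.choice(actions)                               # explore
        else:
            a = max(actions, key=lambda x: q.get((state, x), 0.0))   # exploit
        next_state, r = sim.step(a)
        best_next = max(q.get((next_state, x), 0.0) for x in actions)
        q[(state, a)] = q.get((state, a), 0.0) + 0.1 * (r + 0.95 * best_next - q.get((state, a), 0.0))
        state = next_state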

9/10/2024

How PyTorch powers AI training and inference
Learn about new PyTorch advancements for LLMs and how PyTorch is enhancing every aspect of the LLM lifecycle. In this talk from AI Infra @ Scale 2024, software engineers Wanchao Liang and Evan Smothers are joined by Meta research scientist Kimish Patel to discuss our newest features and tools that enable large-scale training, memory efficient [...]

8/23/2024

Inside the hardware and co-design of MTIA
In this talk from AI Infra @ Scale 2024, Joel Colburn, a software engineer at Meta, technical lead Junqiang Lan, and software engineer Jack Montgomery discuss the second generation of MTIA, Meta’s in-house training and inference accelerator. They cover the co-design process behind building the second generation of Meta’s first-ever custom silicon for AI workloads, [...]

8/22/2024

Bringing Llama 3 to life
Llama 3 is Meta’s most capable openly available LLM to date, and the recently released Llama 3.1 will enable new workflows, such as synthetic data generation and model distillation, with unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed-source models. At AI Infra @ Scale 2024, Meta engineers discussed every step of how we [...]

8/21/2024

Aparna Ramani discusses the future of AI infrastructure
Delivering new AI technologies at scale also means rethinking every layer of our infrastructure, from silicon and software systems to our data center designs. For the second year in a row, Meta’s engineering and infrastructure teams returned for the AI Infra @ Scale conference, where they discussed the challenges of scaling up an [...]

8/20/2024

How Meta animates AI-generated images at scale
We launched Meta AI with the goal of giving people new ways to be more productive and unlock their creativity with generative AI (GenAI). But GenAI also comes with challenges of scale. As we deploy new GenAI technologies at Meta, we also focus on delivering these services to people as quickly and efficiently as possible. [...]

8/14/2024

RoCE networks for distributed AI training at scale
AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training and enabling large models with hundreds of billions of parameters, such as Llama 3.1 405B. This week at ACM SIGCOMM 2024 in Sydney, Australia, we are sharing details on the network we have built at Meta [...]

8/5/2024

Meet Caddy – Meta’s next-gen mixed reality CAD software
What happens when a team of mechanical engineers gets tired of looking at flat images of 3D models over Zoom? Meet the team behind Caddy, a new CAD app for mixed reality. They join Pascal Hartig (@passy) on the Meta Tech Podcast to talk about teaching themselves to code, disrupting the CAD software space, how [...]

7/18/2024

AI Lab: The secrets to keeping machine learning engineers moving fast
The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. It allows us to continuously A/B test common ML workflows, enabling proactive improvements and automatically preventing regressions on TTFB. AI Lab prevents TTFB regressions [...]
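As a hedged illustration of the kind of guardrail such a framework might provide, the sketch below compares mean TTFB between a baseline and a candidate change and flags a regression beyond a threshold. The function name, threshold, and numbers are assumptions, not AI Lab’s actual interface.

    from statistics import mean

    def ttfb_regression(baseline_secs, candidate_secs, max_regression_pct=5.0):
        """Return True if the candidate's mean TTFB regresses beyond the threshold."""
        baseline, candidate = mean(baseline_secs), mean(candidate_secs)
        regression_pct = (candidate - baseline) / baseline * 100.0
        return regression_pct > max_regression_pct

    # Example: a candidate that slows TTFB from ~60s to ~66s (about +10%) is flagged.
    print(ttfb_regression([59.8, 60.1, 60.3], [65.9, 66.2, 66.4]))  # True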

7/16/2024

Taming the tail utilization of ads inference at Meta scale
Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization. The tail utilization optimizations at Meta have had a profound impact on model serving capacity footprint and reliability. Failure rates, which are mostly timeout errors, were reduced by two-thirds; the compute footprint delivered 35% more work for [...]
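To make “tail utilization” concrete: the busiest machines in a serving fleet can sit far above the fleet average, and it is that tail which drives timeouts and stranded capacity. The numbers in the sketch below are illustrative only.

    def percentile(values, pct):
        ordered = sorted(values)
        index = min(int(round(pct / 100.0 * (len(ordered) - 1))), len(ordered) - 1)
        return ordered[index]

    host_utilization = [0.45] * 90 + [0.55] * 8 + [0.93, 0.97]  # toy fleet of 100 hosts
    print(sum(host_utilization) / len(host_utilization))  # mean is roughly 0.47
    print(percentile(host_utilization, 99))               # p99 is 0.93, about twice the mean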

7/10/2024

Meta’s approach to machine learning prediction robustness
Meta’s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta’s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupted benefit delivery to our users and advertisers. To minimize disruptions and ensure our ML systems are intrinsically [...]

7/10/2024

Leveraging AI for efficient incident response
We’re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system. The system uses a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations. Our testing has shown this new system achieves 42% accuracy in identifying root causes for investigations at their [...]
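A minimal sketch of the two-stage idea described above, assuming a hypothetical incident/change schema: heuristics first narrow the candidate code changes, then a language model ranks the survivors (stubbed here with a recency sort). None of the field names or the ranking stub reflect Meta’s actual system.

    def retrieve_candidates(changes, incident):
        """Heuristic filter: keep changes that touched affected services before the incident began."""
        return [c for c in changes
                if c["service"] in incident["affected_services"]
                and c["landed_at"] <= incident["started_at"]]

    def rank_with_llm(candidates, incident):
        """Placeholder for LLM-based ranking; here we simply sort by recency."""
        return sorted(candidates, key=lambda c: c["landed_at"], reverse=True)

    incident = {"affected_services": {"ads-ranking"}, "started_at": 1720000000}
    changes = [
        {"id": "D123", "service": "ads-ranking", "landed_at": 1719999000},
        {"id": "D124", "service": "feed", "landed_at": 1719998000},
    ]
    print(rank_with_llm(retrieve_candidates(changes, incident), incident)[0]["id"])  # D123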

6/24/2024

Maintaining large-scale AI capacity at Meta
Meta is currently operating many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, meticulously designed to support the scaling demands of compute and storage. A year ago, however, as the industry reached a critical inflection point due to the rise of artificial intelligence (AI), we [...]

6/12/2024

Introducing the next-gen Meta Training and Inference Accelerator
[...]

4/10/2024

Optimizing RTC bandwidth estimation with machine learning
Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta’s family of apps. We’ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across layers such as BWE, network resiliency, and transport. We’re sharing our experiment results from this approach, some of [...]
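As a rough illustration of ML-based bandwidth estimation, the sketch below predicts available bandwidth from recent transport signals rather than a hand-tuned rule. The feature set, weights, and linear form are illustrative assumptions; a production estimator would be a trained model.

    def estimate_bandwidth_kbps(recv_rate_kbps, rtt_ms, loss_pct,
                                weights=(0.9, -2.0, -50.0), bias=100.0):
        """Toy linear estimator over recent network signals (a trained model in practice)."""
        estimate = bias + weights[0] * recv_rate_kbps + weights[1] * rtt_ms + weights[2] * loss_pct
        return max(estimate, 0.0)

    # Example: a healthy link versus a lossy, high-latency link.
    print(estimate_bandwidth_kbps(1500.0, 40.0, 0.1))   # ~1365 kbps
    print(estimate_bandwidth_kbps(1500.0, 250.0, 2.0))  # ~850 kbps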

3/20/2024