Data Infrastructure
Tue Nov 19 2024
Sequence learning: A paradigm shift for personalized ads recommendations
AI plays a fundamental role in creating valuable connections between people and advertisers within Meta’s family of apps.
Data Center Engineering
Tue Oct 15 2024
OCP Summit 2024: The open future of networking hardware for AI
At the Open Compute Project (OCP) Summit 2024, we’re sharing details about our next-generation network fabric for our AI training clusters.
Meta’s open AI hardware vision
At the Open Compute Project (OCP) Global Summit 2024, we’re showcasing our latest open AI hardware designs with the OCP community.
AI Research
Thu Oct 03 2024
How open source AI can improve population estimates, sustainable energy, and the delivery of climate change interventions
Data for Good at Meta is open-sourcing the data used to train our AI-powered population maps.
Tue Sep 10 2024
Simulator-based reinforcement learning for data center cooling optimization
We’re sharing more about the role that reinforcement learning plays in helping us optimize our data centers’ environmental controls.
Fri Aug 23 2024
How PyTorch powers AI training and inference
Learn about new PyTorch advancements for LLMs and how PyTorch is enhancing every aspect of the LLM lifecycle.
Thu Aug 22 2024
Inside the hardware and co-design of MTIA
In this talk from AI Infra @ Scale 2024, Joel Colburn, a software engineer at Meta, technical lead Junqiang Lan, and software engineer Jack ...
Wed Aug 21 2024
Bringing Llama 3 to life
Llama 3 is Meta’s most capable openly available LLM to date.
Tue Aug 20 2024
Aparna Ramani discusses the future of AI infrastructure
Delivering new AI technologies at scale also means rethinking every layer of our infrastructure – from silicon and software systems and even...
Wed Aug 14 2024
How Meta animates AI-generated images at scale
We launched Meta AI with the goal of giving people new ways to be more productive and unlock their creativity with generative AI (GenAI).
Mon Aug 05 2024
RoCE networks for distributed AI training at scale
AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for traini...
ML Applications
Thu Jul 18 2024
Meet Caddy – Meta’s next-gen mixed reality CAD software
What happens when a team of mechanical engineers gets tired of looking at flat images of 3D models over Zoom? Meet the team behind Caddy, a n...
DevInfra
Tue Jul 16 2024
AI Lab: The secrets to keeping machine learning engineers moving fast
The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers.
Wed Jul 10 2024
Taming the tail utilization of ads inference at Meta scale
Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization.
Meta’s approach to machine learning prediction robustness
Meta’s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per...
Mon Jun 24 2024
Leveraging AI for efficient incident response
We’re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system.
Wed Jun 12 2024
Maintaining large-scale AI capacity at Meta
Meta is currently operating many data centers with GPU training clusters across the world.
Wed Apr 10 2024
Introducing the next-gen Meta Training and Inference Accelerator
Wed Mar 20 2024
Optimizing RTC bandwidth estimation with machine learning
Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Met...
Mon Mar 18 2024
Logarithm: A logging engine for AI training workflows and services
Systems and application logs play a key role in operations, observability, and debugging workflows at Meta.
Tue Mar 12 2024
Building Meta’s GenAI Infrastructure
Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters.
Mon Jan 29 2024
Improving machine learning iteration speed with faster application build and packaging
Slow build times and inefficiencies in packaging and distributing execution files were costing our ML/AI engineers a significant amount of t...
Thu Jan 18 2024
Lazy is the new fast: How Lazy Imports and Cinder accelerate machine learning at Meta
At Meta, the quest for faster model training has yielded an exciting milestone: the adoption of Lazy Imports and the Python Cinder runtime.
Thu Jan 11 2024
How Meta is advancing GenAI
What’s going on with generative AI (GenAI) at Meta? And what does the future have in store? In this episode of the Meta Tech Podcast, Meta e...
Tue Dec 19 2023
AI debugging at Meta with HawkEye
HawkEye is the powerful toolkit used internally at Meta for monitoring, observability, and debuggability of the end-to-end machine learning ...
Wed Nov 15 2023
Watch: Meta’s engineers on building network infrastructure for AI
Meta is building for the future of AI at every level – from hardware like MTIA v1, Meta’s first-generation AI inference accelerator, to publi...
Wed Oct 18 2023
How Meta is creating custom silicon for AI
Olivia Wu, Meta’s Technical Lead for Infra Silicon, discusses the design and development of Meta’s first-generation AI inference accelerator...
Thu Sep 07 2023
Using Chakra execution traces for benchmarking and network performance optimization
Meta presents Chakra execution traces, an open graph-based representation of AI/ML workload execution, laying the foundation for benchmarkin...
Arcadia: An end-to-end AI system performance simulator
We’re introducing Arcadia, Meta’s unified system that simulates the compute, memory, and network performance of AI training clusters.
Thu Aug 24 2023
Code Llama: Meta’s state-of-the-art LLM for coding
Mon Aug 14 2023
Meta Connect 2023: September 27 – 28
Wed Aug 09 2023
Scaling the Instagram Explore recommendations system
Explore is one of the largest recommendation systems on Instagram.
Thu May 18 2023
MSVP is Meta’s first video processing ASIC
Meta introduces its first-generation AI inference accelerator