ML-Applications
Meta
Thu Jul 18 2024
Meet Caddy – Meta’s next-gen mixed reality CAD software
What happens when a team of mechanical engineers gets tired of looking at flat images of 3D models over Zoom? Meet the team behind Caddy, a n...
DevInfra
Tue Jul 16 2024
AI Lab: The secrets to keeping machine learning engineers moving fast
The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers.
Wed Jul 10 2024
Taming the tail utilization of ads inference at Meta scale
Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization.
Data-Infrastructure
Meta’s approach to machine learning prediction robustness
Meta’s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per...
Mon Jun 24 2024
Leveraging AI for efficient incident response
We’re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system.
Data-Center-Engineering
Wed Jun 12 2024
Maintaining large-scale AI capacity at Meta
Meta is currently operating many data centers with GPU training clusters across the world.
AI-Research
Wed Apr 10 2024
Introducing the next-gen Meta Training and Inference Accelerator
Wed Mar 20 2024
Optimizing RTC bandwidth estimation with machine learning
Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Met...
Mon Mar 18 2024
Logarithm: A logging engine for AI training workflows and services
Systems and application logs play a key role in operations, observability, and debugging workflows at Meta.
Tue Mar 12 2024
Building Meta’s GenAI Infrastructure
Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters.
Mon Jan 29 2024
Improving machine learning iteration speed with faster application build and packaging
Slow build times and inefficiencies in packaging and distributing execution files were costing our ML/AI engineers a significant amount of t...
Thu Jan 18 2024
Lazy is the new fast: How Lazy Imports and Cinder accelerate machine learning at Meta
At Meta, the quest for faster model training has yielded an exciting milestone: the adoption of Lazy Imports and the Python Cinder runtime.
Thu Jan 11 2024
How Meta is advancing GenAI
What’s going on with generative AI (GenAI) at Meta? And what does the future have in store? In this episode of the Meta Tech Podcast, Meta e...
Tue Dec 19 2023
AI debugging at Meta with HawkEye
HawkEye is the powerful toolkit used internally at Meta for monitoring, observability, and debuggability of the end-to-end machine learning ...
Wed Nov 15 2023
Watch: Meta’s engineers on building network infrastructure for AI
Meta is building for the future of AI at every level – from hardware like MTIA v1, Meta’s first-generation AI inference accelerator to publi...
Wed Oct 18 2023
How Meta is creating custom silicon for AI
Olivia Wu, Meta’s Technical Lead for Infra Silicon, discusses the design and development of Meta’s first-generation AI inference accelerator...
Thu Sep 07 2023
Using Chakra execution traces for benchmarking and network performance optimization
Meta presents Chakra execution traces, an open graph-based representation of AI/ML workload execution, laying the foundation for benchmarkin...
Arcadia: An end-to-end AI system performance simulator
We’re introducing Arcadia, Meta’s unified system that simulates the compute, memory, and network performance of AI training clusters.
Thu Aug 24 2023
Code Llama: Meta’s state-of-the-art LLM for coding
Mon Aug 14 2023
Meta Connect 2023: September 27 – 28
Wed Aug 09 2023
Scaling the Instagram Explore recommendations system
Explore is one of the largest recommendation systems on Instagram.
Thu May 18 2023
MSVP is Meta’s first video processing ASIC
Meta introduces its first-generation AI inference accelerator