Production-Engineering

17 posts

How did the Threads iOS team maintain the app’s performance during its incredible growth? Here’s how Meta’s Threads team thinks about performance, including the key metrics we monitor to keep the app healthy. We’re also diving into some case studies that impact publish reliability and navigation latency. When Meta launched Threads in 2023, it became [...] The post How we think about Threads’ iOS performance appeared first on Engineering at Meta.

12/18/2024

Andromeda is Meta’s proprietary machine learning (ML) system design for retrieval in ad recommendation focused on delivering a step-function improvement in value to our advertisers and people. This system pushes the boundary of cutting-edge AI for retrieval with NVIDIA Grace Hopper Superchip and Meta Training and Inference Accelerator (MTIA) hardware through innovations in ML [...] The post Meta Andromeda: Supercharging Advantage+ automation with the next-gen personalized ads retrieval engine appeared first on Engineering at Meta.

12/2/2024

AI plays a fundamental role in creating valuable connections between people and advertisers within Meta’s family of apps. Meta’s ad recommendation engine, powered by deep learning recommendation models (DLRMs), has been instrumental in delivering personalized ads to people. Key to this success was incorporating thousands of human-engineered signals or features in the DLRM-based recommendation system. [...] The post Sequence learning: A paradigm shift for personalized ads recommendations appeared first on Engineering at Meta.

11/19/2024

At Open Compute Project Summit (OCP) 2024, we’re sharing details about our next-generation network fabric for our AI training clusters. We’ve expanded our network hardware portfolio and are contributing two new disaggregated network fabrics and a new NIC to OCP. We look forward to continued collaboration with OCP to open designs for racks, servers, storage [...] The post OCP Summit 2024: The open future of networking hardware for AI appeared first on Engineering at Meta.

10/15/2024

Llama 3 is Meta’s most capable openly available LLM to date, and the recently released Llama 3.1 will enable new workflows, such as synthetic data generation and model distillation, with unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed source models. At AI Infra @ Scale 2024, Meta engineers discussed every step of how we [...] The post Bringing Llama 3 to life appeared first on Engineering at Meta.

8/21/2024

We launched Meta AI with the goal of giving people new ways to be more productive and unlock their creativity with generative AI (GenAI). But GenAI also comes with challenges of scale. As we deploy new GenAI technologies at Meta, we also focus on delivering these services to people as quickly and efficiently as possible. [...] The post How Meta animates AI-generated images at scale appeared first on Engineering at Meta.

8/14/2024

Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization. The tail utilization optimizations at Meta have had a profound impact on model serving capacity footprint and reliability. Failure rates, which are mostly timeout errors, were reduced by two-thirds; the compute footprint delivered 35% more work for [...] The post Taming the tail utilization of ads inference at Meta scale appeared first on Engineering at Meta.

7/10/2024

Meta is currently operating many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, meticulously designed to support the scaling demands of compute and storage. A year ago, however, as the industry reached a critical inflection point due to the rise of artificial intelligence (AI), we [...] The post Maintaining large-scale AI capacity at Meta appeared first on Engineering at Meta.

6/12/2024

When Threads first launched, one of the top feature requests was for a web client. In this episode of the Meta Tech Podcast, Pascal Hartig (@passy) sits down with Ally C. and Kevin C., two engineers on the Threads Web Team that delivered the basic version of Threads for web in just under three months. [...] The post Behind the scenes of Threads for web appeared first on Engineering at Meta.

5/14/2024

Meta’s family of apps serves trillions of image download requests every day. And if you’re into high-quality images, you’ve probably noticed that Instagram and Threads have added support for high dynamic range (HDR) photos. Now people on Threads and Instagram can upload and share images that are more true-to-life, with the full color and range [...] The post Bringing HDR photo support to Instagram and Threads appeared first on Engineering at Meta.

3/26/2024

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP) that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources. In our own tests, SPTP boasts comparable performance to PTP, but with significant improvements in [...] The post Simple Precision Time Protocol at Meta appeared first on Engineering at Meta.

2/7/2024

On July 5, 2023, Meta launched Threads, the newest product in our family of apps, to an unprecedented success that saw it garner over 100 million sign ups in its first five days. A small, nimble team of engineers built Threads over the course of only five months of technical work. While the app’s production [...] The post How Meta built the infrastructure for Threads appeared first on Engineering at Meta.

12/19/2023

Python plays a big part at Meta. It powers Instagram’s backend and plays an important role in our configuration systems, as well as much of our AI work. Meta even made contributions to Python 3.12, the latest version of Python. On this episode of the Meta Tech Podcast, Meta engineer Pascal Hartig (@passy) is joined by Amethyst [...] The post Writing and linting Python at scale appeared first on Engineering at Meta.

11/21/2023

Python 3.12 is out! It includes new features and performance improvements – some contributed by Meta – that we believe will benefit all Python users. We’re sharing details about these new features that we worked closely with the Python community to develop. This week’s release of Python 3.12 marks a milestone in our efforts to [...] The post Meta contributes new features to Python 3.12 appeared first on Engineering at Meta.

10/5/2023

Fixit is dead! Long live Fixit 2 – the latest version of our open-source auto-fixing linter. Fixit 2 allows developers to efficiently build custom lint rules and perform auto-fixes for their codebases. Fixit 2 is available today on PyPI. Python is one of the most popular languages in use at Meta. Meta’s production engineers (PEs) [...] The post Fixit 2: Meta’s next-generation auto-fixing linter appeared first on Engineering at Meta.

8/7/2023

Short-lived certificates (SLCs) are part of our latest efforts to further secure our Transport Layer Security (TLS) private keys on our edge networks. SLCs have a very short exposure compared to traditional certificates and lower the chances of a compromised private key being abused. Implementing SLCs has required us to address tradeoffs between operability and [...] The post Using short-lived certificates to protect TLS secrets appeared first on Engineering at Meta.

8/7/2023

We’re rolling out MySQL Raft with the aim of eventually replacing our current MySQL semisynchronous databases. The biggest win of MySQL Raft was simplification of the operation and making MySQL servers take care of promotions and membership. This gave the provable safety of Raft and reduced significant operational pain. Making MySQL server a true [...] The post Building and deploying MySQL Raft at Meta appeared first on Engineering at Meta.

5/16/2023