Data-Center-Engineering
Meta
Mon Aug 26 2024
RETINAS: Real-Time Infrastructure Accounting for Sustainability
We are introducing a new metric, real-time server fleet utilization effectiveness, as part of the RETINAS initiative to help reduce emission...
AI-Research
Tue Aug 20 2024
Aparna Ramani discusses the future of AI infrastructure
Delivering new AI technologies at scale also means rethinking every layer of our infrastructure – from silicon and software systems and even...
Data-Infrastructure
Wed Jul 10 2024
Meta’s approach to machine learning prediction robustness
Meta’s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per...
Mon Jun 24 2024
Leveraging AI for efficient incident response
We’re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system.
Wed Jun 19 2024
PVF: A novel metric for understanding AI systems’ vulnerability against SDCs in model parameters
We’re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems’ vulnerability against sil...
Wed Jun 12 2024
How Meta trains large language models at scale
As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challengin...
Mon Jun 10 2024
Serverless Jupyter Notebooks at Meta
At Meta, Bento, our internal Jupyter notebooks platform, is a popular tool that allows our engineers to mix code, text, and multimedia in a ...
Wed May 22 2024
Composable data management at Meta
In recent years, Meta’s data management systems have evolved into a composable architecture that creates interoperability, promotes reusabil...
Mon Mar 18 2024
Logarithm: A logging engine for AI training workflows and services
Systems and application logs play a key role in operations, observability, and debugging workflows at Meta.
Tue Dec 19 2023
AI debugging at Meta with HawkEye
HawkEye is the powerful toolkit used internally at Meta for monitoring, observability, and debuggability of the end-to-end machine learning ...
infrastructure
Pinterest
Tue Nov 07 2023
Running Unified PubSub Client in Production at Pinterest
Tue Oct 31 2023
Automating data removal
Meta’s Systematic Code and Asset Removal Framework (SCARF) has a subsystem for identifying and removing unused data types.
Tue Oct 24 2023
Automating dead code cleanup
Meta’s Systematic Code and Asset Removal Framework (SCARF) has a subsystem for identifying and removing dead code.
Tue Oct 17 2023
Automating product deprecation
Systematic Code and Asset Removal Framework (SCARF) is Meta’s unused code and data deletion framework.
Thu Sep 07 2023
Arcadia: An end-to-end AI system performance simulator
We’re introducing Arcadia, Meta’s unified system that simulates the compute, memory, and network performance of AI training clusters.
Tue Aug 29 2023
Scheduling Jupyter Notebooks at Meta
At Meta, Bento is our internal Jupyter notebooks platform that is leveraged by many internal users.
Tue May 16 2023
Building and deploying MySQL Raft at Meta
We’re rolling out MySQL Raft with the aim of eventually replacing our current MySQL semisynchronous databases.