Infrastructure

23 posts

An incident report for the Canva outage on November 12, 2024.

Brendan Humphreys, 12/20/2024

Cloudflare’s global fleet benefits from being managed by open source firmware for the Baseboard Management Controller (BMC), OpenBMC. This has come with various challenges, some of which we discuss here with an explanation of how the open source nature of the firmware for the BMC enabled us to fix the issues and maintain a more stable fleet.

Nnamdi Ajah, 10/22/2024

The Kubernetes team runs several multi-tenant clusters across Cloudflare’s core data centers. When multi-tenant cluster isolation is too limiting for an application, we use KubeVirt. KubeVirt is a cloud-native solution that enables our developers to run virtual machines alongside containers.

Justin Cichra, 10/8/2024

How we improved our continuous integration build times from hours to less than 30 minutes.

Marco Lacava, 7/30/2024

As engineers working at Spotify, we frequently find ourselves explaining our robust data platform to fellow professionals who are contemplating [...] The post Data Platform Explained appeared first on Spotify Engineering.

Spotify Engineering, 4/2/2024

GitHub uses MySQL to store vast amounts of relational data. This is the story of how we seamlessly upgraded our production fleet to MySQL 8.0. The post Upgrading GitHub.com to MySQL 8.0 appeared first on The GitHub Blog.

Jiaqi Liu, 12/7/2023

At Spotify, we have experimented with the Bazel build system since 2017. Over the years, the project has matured, and support for more languages and ecosystems has been added, thanks to the open source community and its maintainers at Google. In 2020, it became clear that the future of our client development required a unified [...] The post Switching Build Systems, Seamlessly appeared first on Spotify Engineering.

Spotify Engineering, 10/17/2023

TL;DR: At Spotify, we run containerized workloads in production across our entire organization in five regions, with our main production workloads in Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP). If we detect suspicious behavior in our workloads, we need to be able to quickly analyze it and determine if something malicious has [...] The post Analyzing Volatile Memory on a Google Kubernetes Engine Node appeared first on Spotify Engineering.

Spotify Engineering, 6/22/2023

How we reliably migrated hundreds of GBs of relational database data for our service split project.

Dafu Ai, 6/19/2023

This is part 3 in our series on Fleet Management at Spotify and how we manage our software at scale. See also part 1 and part 2. For the third part of this Fleet Management series, we’ll discuss what we call “fleet-wide refactoring” of code across thousands of Git repos: the tools we’ve built to [...] The post Fleet Management at Spotify (Part 3): Fleet-wide Refactoring appeared first on Spotify Engineering.

Spotify Engineering, 5/15/2023

Understanding our data and usage patterns was the real key.

Josh Smith, 5/4/2023

This is part 2 in our series on Fleet Management at Spotify and how we manage our software at scale. See also part 1 and part 3. At Spotify, we adopted the declarative infrastructure paradigm to evolve our infrastructure platform’s configuration management and control plane approach, allowing us to manage hundreds of thousands of cloud [...] The post Fleet Management at Spotify (Part 2): The Path to Declarative Infrastructure appeared first on Spotify Engineering.

Spotify Engineering, 5/3/2023

As the field of machine learning (ML) continues to evolve and its impact on society and various aspects of our lives grows, it is becoming increasingly important for practitioners and innovators to consider a broader range of perspectives when building ML models and applications. This desire is driving the need for a more flexible [...] The post Unleashing ML Innovation at Spotify with Ray appeared first on Spotify Engineering.

Spotify Engineering, 2/1/2023