Infrastructure
23 posts
An incident report for the Canva outage on November 12, 2024.
Cloudflare’s global fleet benefits from being managed by open source firmware for the Baseboard Management Controller (BMC), OpenBMC. This has come with various challenges, some of which we discuss here with an explanation of how the open source nature of the firmware for the BMC enabled us to fix the issues and maintain a more stable fleet.
The Kubernetes team runs several multi-tenant clusters across Cloudflare’s core data centers. When multi-tenant cluster isolation is too limiting for an application, we use KubeVirt. KubeVirt is a cloud-native solution that enables our developers to run virtual machines alongside containers.
How we improved our continuous integration build times from hours to less than 30 minutes.
As engineers working at Spotify, we frequently find ourselves explaining our robust data platform to fellow professionals who are contemplating [...] The post Data Platform Explained appeared first on Spotify Engineering.
GitHub uses MySQL to store vast amounts of relational data. This is the story of how we seamlessly upgraded our production fleet to MySQL 8.0. The post Upgrading GitHub.com to MySQL 8.0 appeared first on The GitHub Blog.
At Spotify, we have experimented with the Bazel build system since 2017. Over the years, the project has matured, and support for more languages and ecosystems has been added, thanks to the open source community and its maintainers at Google. In 2020, it became clear that the future of our client development required a unified [...] The post Switching Build Systems, Seamlessly appeared first on Spotify Engineering.
TL;DR: At Spotify, we run containerized workloads in production across our entire organization in five regions where our main production workloads are in Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP). If we detect suspicious behavior in our workloads, we need to be able to quickly analyze it and determine if something malicious has [...] The post Analyzing Volatile Memory on a Google Kubernetes Engine Node appeared first on Spotify Engineering.
How we reliably migrated hundreds of GBs of relational DB data for our service split project.
This is part 3 in our series on Fleet Management at Spotify and how we manage our software at scale. See also part 1 and part 2. For the third part of this Fleet Management series, we’ll discuss what we call “fleet-wide refactoring” of code across thousands of Git repos: the tools we’ve built to [...] The post Fleet Management at Spotify (Part 3): Fleet-wide Refactoring appeared first on Spotify Engineering.
Understanding our data and usage patterns was the real key.
This is part 2 in our series on Fleet Management at Spotify and how we manage our software at scale. See also part 1 and part 3. At Spotify, we adopted the declarative infrastructure paradigm to evolve our infrastructure platform’s configuration management and control plane approach, allowing us to manage hundreds of thousands of cloud [...] The post Fleet Management at Spotify (Part 2): The Path to Declarative Infrastructure appeared first on Spotify Engineering.
As the field of machine learning (ML) continues to evolve and its impact on society and various aspects of our lives grows, it is becoming increasingly important for practitioners and innovators to consider a broader range of perspectives when building ML models and applications. This desire is driving the need for a more flexible [...] The post Unleashing ML Innovation at Spotify with Ray appeared first on Spotify Engineering.