Infrastructure

Post-incident-review PIR Incident Infrastructure Post-mortem

Canva incident report: API Gateway outage

An incident report for the Canva outage on November 12, 2024.

Brendan Humphreys 12/20/2024

Hardware Traffic Infrastructure Envoy gRPC load-balancing Networking Web-Server

What’s new with Robinhood, our in-house load balancing service

Yi-Shu Tai 10/23/2024

Cloudflare

Infrastructure Open-Source OpenBMC Servers Firmware

Is this thing on? Using OpenBMC and ACPI power states for reliable server boot

Cloudflare’s global fleet benefits from being managed by open source firmware for the Baseboard Management Controller (BMC), OpenBMC. This has come with various challenges, some of which we discuss here with an explanation of how the open source nature of the firmware for the BMC enabled us to fix the issues and maintain a more stable fleet.

Nnamdi Ajah 10/22/2024

Cloudflare

Kubernetes Infrastructure

Leveraging Kubernetes virtual machines at Cloudflare with KubeVirt

The Kubernetes team runs several multi-tenant clusters across Cloudflare’s core data centers. When multi-tenant cluster isolation is too limiting for an application, we use KubeVirt. KubeVirt is a cloud-native solution that enables our developers to run virtual machines alongside containers.

Justin Cichra 10/8/2024

Infrastructure Developer-Platform Continuous-Integration

Faster continuous integration builds at Canva

How we improved our continuous integration build times from hours to less than 30 minutes.

Marco Lacava 7/30/2024

Metadata Caching Databases Infrastructure storage Panda Chrono

Meet Chrono, our scalable, consistent, metadata caching solution

Lihao He, Ganesh Rapolu, Yu-Wun Wang 7/25/2024

Data Data-Science Infrastructure Platform

Data Platform Explained

As engineers working at Spotify, we frequently find ourselves explaining our robust data platform to fellow professionals who are contemplating [...]

Spotify Engineering 4/2/2024

GitHub

Engineering availability databases Infrastructure MySQL

Upgrading GitHub.com to MySQL 8.0

GitHub uses MySQL to store vast amounts of relational data. This is the story of how we seamlessly upgraded our production fleet to MySQL 8.0.

Jiaqi Liu 12/7/2023

Hardware Traffic AI Infrastructure data-center Networking 400G sustainability

From AI to sustainability, why our latest data centers use 400G networking

Daniel Parker and Amit Chudasma 11/14/2023

Developer-Tools Infrastructure Open-Source People

Switching Build Systems, Seamlessly

At Spotify, we have experimented with the Bazel build system since 2017. Over the years, the project has matured, and support for more languages and ecosystems has been added, thanks to the open source community and its maintainers at Google. In 2020, it became clear that the future of our client development required a unified [...]

Spotify Engineering 10/17/2023

Developer-Tools Infrastructure Platform backend

Analyzing Volatile Memory on a Google Kubernetes Engine Node

TL;DR: At Spotify, we run containerized workloads in production across our entire organization in five regions, where our main production workloads are in Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP). If we detect suspicious behavior in our workloads, we need to be able to quickly analyze it and determine if something malicious has [...]

Spotify Engineering 6/22/2023

Backend Infrastructure

Relational Database Migration with AWS Database Migration Service (DMS)

How we reliably migrated hundreds of GBs of relational DB data for our service split project

Dafu Ai 6/19/2023

data-center Hardware Magic-Pocket sustainability Infrastructure

How the data center site selection process works at Dropbox

Edward del Rio 6/13/2023

Developer-Tools Infrastructure People Platform engineering-leadership

Fleet Management at Spotify (Part 3): Fleet-wide Refactoring

This is part 3 in our series on Fleet Management at Spotify and how we manage our software at scale. See also part 1 and part 2. For the third part of this Fleet Management series, we’ll discuss what we call “fleet-wide refactoring” of code across thousands of Git repos: the tools we’ve built to [...]

Spotify Engineering 5/15/2023

Amazon-S3 Infrastructure Cost-Optimisation

How Canva saves millions annually in Amazon S3 costs

Understanding our data and usage patterns was the real key.

Josh Smith 5/4/2023

Developer-Tools Infrastructure People Platform engineering-culture engineering-leadership

Fleet Management at Spotify (Part 2): The Path to Declarative Infrastructure

This is part 2 in our series on Fleet Management at Spotify and how we manage our software at scale. See also part 1 and part 3. At Spotify, we adopted the declarative infrastructure paradigm to evolve our infrastructure platform’s configuration management and control plane approach, allowing us to manage hundreds of thousands of cloud [...]

Spotify Engineering 5/3/2023

single-page-applications Web Edison Infrastructure tech-debt Front-end webserver developer-productivity

How Edison is helping us build a faster, more powerful Dropbox on the web

Giles Copp 4/11/2023

Hardware Infrastructure storage cost-optimization areal-density HAMR Magic-Pocket sustainability SMR

After four years of SMR storage, here's what we love, and what comes next

Eric Shobe 3/8/2023