SRE

3 posts

We run a wide fleet of machines on the Cloudflare global network. This post describes how we minimize customer disruption during large-scale machine reboots using math!

Opeyemi Onikute, 7/12/2023

Guaranteeing that our servers are continually upgraded to secure and vetted operating systems is one major step that we take to ensure our members and customers can access LinkedIn to look for new roles, access new learning programs, or exchange knowledge with other professionals. LinkedIn has a large fleet of on-premises servers that depend on internal tooling to stay on the latest operating systems. This post will introduce an internal tool that serves as an interface for managing servers' lifecycles at LinkedIn scale. We will emphasize the rationale behind […]

Rohit Jamuar, 5/11/2023

Saira joined our Bangalore site reliability engineering (SRE) team to tackle large-scale site engineering challenges and grow. She highlights for us the impactful work she found here, from ushering in LinkedIn’s next-generation server query system, which runs over a fleet of 350,000 servers, to mentoring the next generation of female engineers: In my engineering career, I’ve always followed the path less taken. As a student, I became interested in cloud computing and systems engineering after working on an OpenStack project. My passion for this engineering area led to me pursuing a […]

12/16/2022