How To Safely Make Changes In Large Distributed Systems

Name: How To Safely Make Changes In Large Distributed Systems
Uploaded: 2026-05-28T12:00:19.0000000+00:00
Duration: 25 min 14 s
Description: Distributed systems are hard. Making changes in distributed systems is even harder.

Transcript is auto-generated and may contain errors.

Hey folks, I'm just driving to work. It's Friday. I'm just realizing now there are a lot of videos I still haven't uploaded. By the time you're watching this, it's probably two weeks out, so I've got to catch up on that. Just been busy, been on call. One of the things top of mind for me, and I don't know how this conversation is going to go, is change management: safely making changes in systems. I wanted to see if I could frame this slightly more generally at first, then focus a little more on changes and deployment, and on making changes over a large surface area.

So the general framing I want to start with is this idea in software development. I don't know if I can say most people, but I think a lot of people are familiar with the Agile Manifesto, or at least some of the concepts in there, right? There have been periods of time where agile was super in favor, then it fell completely out of favor, and I don't even know where people stand on it anymore. But regardless of where you stand on agile, something I suspect many people align with is this idea of people over process, right? Maybe that's already a hot take. Maybe people don't agree with that. But there's this balance around having some type of processes in place.

By the way, I'm just realizing now that I don't know if it's a Canadian thing to pronounce "process" one way versus the other. Someone said that's a Canadian thing, and I don't know which way is which, because I think I say both. So I don't know which end that puts me on. Anyway, I think you have to strike a balance: if you have no processes in place, things are kind of chaotic, but if everything is overprocessed, it's almost like you can't get anything done, because there's inevitably going to be a lot of friction in scenarios that the process doesn't fit. But you have to have a process.

So from an agile perspective, and I want to talk about agility here, not the definition of agile software development, I like the idea of having some process in place, some guidelines. But in the end we need to make sure there are escape hatches, or that we can lean on people to make the right decisions in the right situations, because there will always be exceptions. That's going to be one of the focal points as I talk through this: in any situation where you say "we have a rule," there will be an exception to the rule at some point. So with that said, how do you strike this balance, right?

Because if you have so many things in place, it feels like what you'll hear people say about larger companies: there's a lot of red tape to get something done, right? We're just trying to do something simple, but there's so much red tape. What does that mean, and why does that happen? I think a lot of it is because as things grow, you get more people and more systems. And I don't just mean the software systems you're shipping and deploying. You have different teams on different platforms doing different things, and then you're sharing something, like the fact that everyone's using Azure DevOps or GitHub or something else.

You have these systems in place, and having some common ways to do stuff probably makes a lot of sense, so that everyone's not reinventing the wheel all the time and relearning the same issues, right? One team is like, "Oh crap, we don't have this kind of stuff locked down in our repositories. We have to have better access policies." If every team had to go learn that from scratch, there's a lot of waste there, right? So there are things that you put in place across the board, and that's just how it works. But when you keep layering this stuff on over time and things are growing, inevitably, the larger the surface area that you apply rules and processes to, the more exceptions are going to come up. Right?

So when I think about this kind of stuff in a general sense, I don't know exactly how I would bucket these types of things, but I think there are probably things where, 95 to 99% of the time, you do it, you enforce it, and it should be a lot of friction for people to work around it, right? I'm going to try to come up with an example off the top of my head. If you had some type of automation that scanned for secrets, credentials, passwords, and stuff like that in your CI/CD, to me that's probably something that should be really high friction to work around. Now, does that mean there's never a situation where you could allow it? I think the answer is no.

And just to make up another random example for that: what if there's a team building a tool, and they have a unit test, and they're literally trying to build software that does something with the format of secrets? So they quite literally need to embed a secret into text. And it's not even a real secret. It's just that the variable is called secret, and the actual text has the shape of some generic secret. Then it gets flagged, and it's like, "Oh, you can't commit this. You can't get a build." And it's like, well, this is actually an exception, because it's not a real secret and we should be allowed to do this. So that would be a really rare case, right?
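To make the "high friction, but not impossible" idea concrete, here is a minimal sketch of a secret scan with an explicit escape hatch. Everything in it is an assumption for illustration: the detection pattern, the `Exemption` record, and the `blocked_files` function are hypothetical and do not describe any particular scanner or the tooling mentioned in the video.

```python
import re
from dataclasses import dataclass

# Hypothetical detection rule: a variable named like a secret, assigned a
# quoted string of 16 or more characters. Real scanners use many more rules.
SECRET_PATTERN = re.compile(
    r"""(?i)\b(secret|password|api_key)\b\s*=\s*["'][^"']{16,}["']"""
)


@dataclass(frozen=True)
class Exemption:
    path: str         # file the exemption applies to
    reason: str       # why the match is not a real secret
    approved_by: str  # who signed off on the exception


def blocked_files(files: dict[str, str], exemptions: list[Exemption]) -> list[str]:
    """Return the paths that should fail the build."""
    # The friction: an exemption only counts if it carries both a written
    # reason and a named approver. A bare path is ignored.
    exempt = {
        e.path for e in exemptions
        if e.reason.strip() and e.approved_by.strip()
    }
    return sorted(
        path for path, text in files.items()
        if SECRET_PATTERN.search(text) and path not in exempt
    )


files = {
    "src/app.py": 'password = "hunter2hunter2hunter2"',
    "tests/test_format.py": 'secret = "FAKE-0000-0000-0000-0000"',
    "README.md": "no secrets here",
}
test_only = Exemption("tests/test_format.py", "fake value with the shape of a secret", "reviewer")
print(blocked_files(files, []))
print(blocked_files(files, [test_only]))
```

The design choice here is that the default is to block, and the exception has to be written down and owned by someone, which is the pause-and-think step rather than a silent bypass.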

But for the most part, you have a paved path that protects everyone else for the 99% case, just as an example, right? I'm sure you could think of others for different scenarios, but in my mind I have a bucket that looks like that. I think there's probably a similar bucket that should be a lot less strict but still causes some friction. We talked about this the other day with code coverage, right? I was saying that I don't love the idea of gating on code coverage because of what I've seen in practice once you're near the threshold of the code coverage you should be hitting. I think there's value in knowing the number. I think there's value in trying to maintain high code coverage.

But what I was saying in that video was that once it's near the threshold, there are either edge cases where the tooling's not quite working and is misreporting, or you have people gaming it to just try and get the number higher, but the tests are kind of dog anyway. So what's really the value? You're doing something to get value out of it and really chasing the wrong thing in the end. But to use that as an example, I think code coverage could be one that falls into this bucket, if your organization is one that's like, "We want to maintain that." I'm not here to say that's a good or bad idea, but that could be one of these things where most of the time, maybe not 99% of the time, maybe 70 to 90% of the time, you're like, "This should be doing the right thing."

And then for the other part of that, you have some escape hatches that people can use to get around it, right? But it makes people pause, and this is part of it, right? You have to pause and be like, "Wait a second. This thing's here for a reason, right?" It's trying to give me this paved path. It's trying to protect me. It's trying to protect the service. It's trying to protect the code. Whatever it is, there's a purpose for it, and I'm finding that I need to work around it. So at least it gets you to pause and go, hm.

Right. Should I be doing this? I think even that, as a minimum, is helpful for a lot of things, just to get you to stop and think, "Am I doing the right thing?" I'm just trying to come up with another example. If we think about flaky tests in CI/CD, right? Some people would automatically say, "Absolutely not." That's one of those 99.99% things: as soon as you have a flaky test, your test is evicted, it's a no-go, and you're getting pinged for it. "Hey, we removed your flaky test. Too bad. Here's an email, here's a bug, whatever." Other people might say, "Hey, you know what, maybe that falls into the overwhelming-majority bucket, but there are some escape hatches." It depends on who you are and on your organization.

That's, I think, a very polarizing one for some people. I can imagine some people hearing me say this will be kind of pissed off and want to get to the comments. And I'm just talking about it generally. I'm not even telling you what my opinion is on it. So you kind of have these buckets. And then I think there's even a much, much lower tier, which is: we have recommended ways to do things, but we shouldn't enforce them. These are suggested guidelines, there's been success shown with them, so if you're looking for guidance, follow this, right? To give you an example, I think I've been pretty fortunate that in my career so far no one has forced upon us "you must work this way in your teams." I think part of that is because I've been a manager, so fortunately I've been able to have some flexibility that way.

But I feel like I've always been supported by my managers, that there's a level of trust: "Hey, this is your team to work with. We trust that you're going to do the right thing." Now, we've tried all sorts of different things, especially being in a startup before Microsoft. But just to again make up an example, maybe most teams are like, "We follow some type of scrum process, we have some type of sprint planning, we have some type of daily or weekly sync meeting." These are guidelines. No one's going to say, "You must do this or else your day doesn't progress," but here are some general frameworks, right? So I kind of see things on a bit of a spectrum, maybe in some buckets.

Now shifting over to this idea of change management. For context, when I started at Microsoft, it was on the M365 deployment team. The interesting thing from my perspective is that if we consider the size of the fleet, the surface area, the number of services and things you're actually dealing with, it's kind of outrageous. There's a lot of capacity, there are a lot of services, and there's a deployment infrastructure to take care of that. Now, one of the interesting things about technology is that we have technology that allows us to go fast, which is great, right? My opinion, at least, is that you always want technology to have the flexibility to do things fast.

But from a process perspective, I think it's important that we have guard rails, and this is where it's like, okay, what's the right threshold for that? So when we're doing deployments, we actually have a deployment cadence. There's a rhythm for doing it. There is a time period that you have to accept when you're deploying things, and this is for the overwhelming majority of use cases, right? I guess it would be something like the 99.9% case. We have something in place that purposefully slows you down. So does the technology allow you to go very fast? Yeah, it allows you to go pretty damn fast. It's not quite snap your fingers and it's everywhere across the planet, but pretty fast, right?

I think maybe some partner services would disagree, but I think pretty fast. The reality, though, is that we don't allow that by default, and it's actually quite an exceptional case if you do need to go faster, because there are risks with doing so, and because of how all of these things work together across an enormous platform. Think about rushing to try and get a feature out, right? "Oh, I just want to see my feature get somewhere." Yes, everyone's feature is important, especially in isolation. But when you think about it in the grand scheme of things, compared to the entire platform, the entire surface area of all of M365, is it the most important thing? Maybe not. And the unfortunate reality is that when people rush things, people break things.

Not intentionally, of course. I don't have the data, but I'm pretty confident that statistically you see a lot more shit breaking when people are trying to go faster. Especially with fixes, right? "I have a fix for something. It's so urgent. Let me just rush it out." This is why we try to go backwards instead of forwards a lot of the time, right? It used to work, so put it back to where it used to work, and do that fast. So the point here is that there is process in place to try and enforce that. For some people that might be very frustrating, right? I certainly found it so, and there are still instances where I do, like, "Man, why is that so slow?" But I think it is for quite a good reason.
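The ideas in this part of the talk (a cadence that slows you down by default, going faster only as an approved exception, and rolling back instead of fixing forward) can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of the M365 deployment system: the `Rollout` class, the ring names, the bake-time numbers, and the `healthy` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rollout:
    rings: list[str]        # hypothetical deployment rings, smallest first
    min_bake_hours: int     # the enforced cadence between rings
    versions: dict[str, str] = field(default_factory=dict)  # ring -> version

    def deploy(self, old: str, new: str, bake_hours: int,
               healthy: Callable[[str], bool],
               exception_approved: bool = False) -> str:
        # Guard rail: going faster than the cadence is refused by default
        # and only allowed through an explicitly approved exception.
        if bake_hours < self.min_bake_hours and not exception_approved:
            raise PermissionError("bake time is below the cadence; an approved exception is required")

        self.versions = {ring: old for ring in self.rings}
        deployed: list[str] = []
        for ring in self.rings:
            self.versions[ring] = new
            deployed.append(ring)
            # A real system would wait bake_hours here before checking health.
            if not healthy(ring):
                # Go backwards, not forwards: restore the last known good version.
                for done in deployed:
                    self.versions[done] = old
                return f"rolled back at {ring}"
        return "complete"


rollout = Rollout(rings=["canary", "region-a", "worldwide"], min_bake_hours=24)
print(rollout.deploy("v1", "v2", bake_hours=24, healthy=lambda ring: True))
print(rollout.deploy("v1", "v2", bake_hours=24, healthy=lambda ring: ring != "region-a"))
```

The sketch stops at the first unhealthy ring, so later rings never see the bad version, and the rollback path needs no approval because returning to a known good state is the low-risk direction.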

The tricky part, going back to what I was saying about surface area, is that because it is a blanket rule across a very large surface area, the number of exceptions that come up is always higher. Right? When you try to generalize things over a huge surface area, there are just going to be more exceptions. So that's deployment. Now there's another thing that's similar to deployment. It's still around change management, and we have concepts for flighting, right? So not only do we deploy, we also flight changes. Again, I think it's good that we have processes in place for this. I genuinely think it's good that we have guard rails.

Depending on when you're watching this and how much you pay attention to the news, over the past year or two there have been some examples of changes, I think in Azure, I think in AWS, I don't know if Google had one, and there was some other tech company that had an issue, which might have been Cloudflare. Basically, configuration that was incorrect went too far too fast, either by accident or through some other mechanism, and it was a big problem, right? I'm not saying that those places, and this includes Azure, don't have technology to allow them to safely make changes. I'm just saying that this is an example of why we have such process and technology. So yeah, when you have such a surface area with so much potential variety in different situations, you want to be as safe as you can when making a change.
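The talk does not explain how flighting is implemented, so the following is only a generic sketch of one common approach: exposing a change to a deterministic percentage of users and capping how quickly that percentage can grow, so a bad configuration cannot go "too far too fast." The function names `in_flight` and `next_percent`, the flight name, and the 10-point step are assumptions made for illustration.

```python
import hashlib


def in_flight(flight: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user in or out of a flight."""
    # Hashing the flight name with the user id gives each user a stable
    # bucket from 0 to 99, so raising percent only ever adds users.
    digest = hashlib.sha256(f"{flight}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


def next_percent(current: int, requested: int, max_step: int = 10) -> int:
    """Cap how far exposure may grow in one step; shrinking is never capped."""
    return min(requested, current + max_step, 100)


users = [f"user-{i}" for i in range(1000)]
exposed = sum(in_flight("new-config", u, 10) for u in users)
print(f"{exposed} of {len(users)} users see the change at 10%")
print(next_percent(current=5, requested=100))
```

Asking for 100% from 5% yields 15% under these assumed limits, while a request to drop back to 0% is honored immediately, mirroring the idea that going backwards should be fast and going forwards should be deliberate.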

Now, that's sometimes at odds with "I want to get my work done. I want to fix this bug. There's a live site issue. I need to get this out." There's friction there, right? You want to go fast, but at the same time, I think these things are rightfully there to slow you down. After spending a good five and a half years now, first on the deployment side and now as a customer of the deployment team, along with being a customer of this other technology for flighting, I do think that these things are valuable, and I do think that we need to make sure that there are escape hatches.

I think both are very valuable, right? Because if it's so strict that there's never any exception, you're going to run into issues. I think that's true in most situations where you have rules so firm you can't do anything about them. And because it will be a rarer situation, it's going to feel more urgent, more problematic, and then if there's no escape hatch, you're kind of screwed. So anyway, I've been thinking about this a lot because of the type of work that I do, and I often find myself kind of fighting with flighting systems or change management systems. Again, not because I think they're a bad idea. I think they're incredibly valuable. I think having the right sort of enforcement for the overwhelming majority is incredibly important.

And then I do think we also have to challenge what the status quo is. Without going into the specifics of the work that I'm doing, I think it's a class of work that of course warrants being extremely safe. Again, changes over a large surface area. But I think there are also classes of what we do where, if you bucket them all in the same group, it's quite unfair, because I would very much argue some are much, much more risky and others are at the exact opposite end of that spectrum, where I can't even imagine there being risk.

It would be very difficult for me to even consider what could go wrong, compared to some of the others where I'm like, "Oh yeah, that could have a huge blast radius." When you put these all together and treat them exactly the same, I think that's where you get a lot of friction. So again, when you have a platform or a technology coupled with process that's applied to everyone universally, there are more opportunities for friction. I think that's where, in many companies, you get this red tape experience where you're like, "Man, it's hard to do anything."
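One way to picture the alternative to treating every change identically is a policy that scales the gates with estimated risk. The sketch below is purely illustrative: the `Risk` tiers, the `Gate` fields, the thresholds in `classify`, and every number in `POLICY` are assumptions, not values from the talk or from any real change management system.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Gate:
    rollout_stages: int  # how many stages the change must pass through
    bake_hours: int      # wait time between stages
    approvers: int       # sign-offs needed before starting


# Hypothetical policy table: stricter gates for a larger blast radius.
POLICY = {
    Risk.LOW: Gate(rollout_stages=1, bake_hours=0, approvers=0),
    Risk.MEDIUM: Gate(rollout_stages=3, bake_hours=4, approvers=1),
    Risk.HIGH: Gate(rollout_stages=5, bake_hours=24, approvers=2),
}


def classify(services_affected: int, reversible: bool) -> Risk:
    """Estimate risk from blast radius and whether the change can be undone."""
    if not reversible or services_affected > 10:
        return Risk.HIGH
    if services_affected > 1:
        return Risk.MEDIUM
    return Risk.LOW


def gate_for(services_affected: int, reversible: bool) -> Gate:
    return POLICY[classify(services_affected, reversible)]


print(gate_for(services_affected=1, reversible=True))
print(gate_for(services_affected=50, reversible=True))
```

The hard part in practice is the classification itself, which is exactly the conversation the talk argues teams should make space for.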

And for all of those reasons, I think it's important that we try to challenge these things. Again, not to fight the system and be like, "We don't need your rules and we don't need your guidelines. Let us do what we want so we can just get our stuff done." It's about trying to understand why these things are there to protect and help you, and then genuinely asking whether you're truly working in a different sort of setup, right? Is your scenario actually unique in that it does warrant some exception? Because it very well might. And I think that if we never give opportunity to such conversations, that's where it's like, "Oh, it's always been this way." To some people, it's, "This sucks," or, "It's never going to be better."

"It's always been bad." It might feel that way, and I feel like it's because people either didn't speak up about it, or they did and there was no movement, right? You have to at least make the space for those conversations. Otherwise you get trapped in process, and then there's tons of red tape, maybe in a lot of cases for no good reason. So I do think it's important to have those open conversations about whether we need to change something or whether this is an exception, right? So anyway, those are some random thoughts. Not a fully cohesive scenario to go through. I've just been thinking more recently about change management, being safe making changes, and reminding myself why these things are there, the importance and value of that.

And then also trying to have this conversation starter: are these genuinely exceptional cases that shouldn't fit that? What does that mean? What does that look like? Is that a bigger pattern, right? Is it a one-off thing that I'm frustrated with and that doesn't fit, or is this genuinely bigger than that? This guy's got the parking spot I want, and he's not even in a car. He's on a bucket. Cool. Anyway, I'm at the office. I've got to jump on a call in 10 minutes, and then I'm on call until 6 p.m. So, thanks for watching. If you've got questions about software engineering or career development, that kind of stuff, leave them below in the comments. If you have thoughts on change management, safe deployment, or different processes for guard rails, I'd love to hear from you in the comments.

Otherwise, if you have questions that you want to submit anonymously, just go to codecommute.com. You can give me a little write-up there. I don't get any information about who you are unless you type it, and I have no reason to even try to infer it. Even if you put in company names and stuff, I try not to mention that. But yeah, happy to try and help by making you a video. It's just my perspective, and hopefully it helps. Take care.

Frequently Asked Questions

These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.

How do you balance process and agility when deploying changes across a large surface area? I think you have to strike a balance between having some processes and staying agile. If you have no processes in place things can get chaotic, but if everything is overprocessed it creates friction and slows you down. There will always be exceptions to rules, and you need escape hatches so teams can make the right decisions in the right situations.
What role do guard rails and flighting play in safe changes? I think deployment in large platforms uses guard rails and a defined cadence. We have a deployment cadence that slows you down by default, and going faster is an exceptional case because there are risks with changing a large surface area. We also use flighting to gate changes, and I believe these guard rails exist to protect the service and the code.
Can you give examples of situations where strong rules should be enforced vs. exceptions? I think in most cases you want to enforce the rules, but there should be exceptions. For example, automation that scans for secrets should be high friction most of the time, though I can imagine rare cases where you need to allow an exception. I also hear debates about flaky tests in CI/CD: some teams evict flaky tests, others argue for escape hatches. There are also lower-tier guidelines that are recommended but not enforced.