From the ExperiencedDevs subreddit, let's look at systems where you are the single point of failure.
Transcript is auto-generated and may contain errors.
Hey folks, we're going to the ExperiencedDevs subreddit. This is a post asking for thoughts on engineers who create systems that are maintainable and operable by others on a team, i.e. they are not a single point of failure, versus the inverse, where engineers seem to create systems where they are the single point of failure. I think this topic hasn't come up directly on this channel before, maybe indirectly, and I've talked about it from some angles before, whether on social media or in live streams. So the topic is not necessarily new, but maybe we can think about it from some different angles here.
With this kind of example, I think it's pretty common for people to jump to conclusions or look at things more on the surface. We should talk about that, and then we can see how things might differ if we start to look at them from different angles. When we talk about single points of failure like this, one of the really common explanations is that it's almost a job security thing. In the Reddit post they framed it like it stands out more with some specific setups or roles, like contractors. You have people who think, well, if I'm the only one that can do it, then they need to keep me.
I'm important, therefore I will be the single point of failure. In the case of a contractor, it's like, okay, cool, I've secured that they need to keep paying me going forward, that kind of thing. I'm not saying this doesn't exist or that it's never a reason, but I often don't think malicious intent is the primary motivator. Again, I'm not saying that no one is ever malicious. I just generally think that most people's primary motivation is not to be malicious. So yes, we have single points of failure like this where people are trying to have job security, but I think there are some other things to consider here.
So let's think about the observed side effects if we're comparing these two types of scenarios, where you have someone who has... holy cow, sorry, that was not okay. There's someone making a left-hand turn, not from the left turn lane, and that was almost a pretty bad accident. Jeez. Probably hard for you to see because you're not watching in front of me.
Okay, symptoms. I don't know if that's even the right word. When we compare these, instead of looking at the intentions, what are the things we observe in scenarios where there is a system or process that is maintainable and operable without a single point of failure, versus one that isn't? What do we have in either case?
In a system without a single point of failure, maybe you have documentation that explains how to work with it when it's not doing the right thing. Maybe you have monitors and alerting so that you know when the system's not doing the right thing, and those alerts and monitors have instructions for what to go diagnose when they fire. Or you have all of those things plus a self-healing mechanism. I'm just making this up, but it restarts itself, and there's enough redundancy that it can restart itself and not be unavailable, so instances of it are recovering. Obviously I'm making up a super generic thing here, but we have different things like that.
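As a rough illustration of the generic setup described above (alerts that carry their own instructions, plus redundant instances that restart themselves), here is a minimal Python sketch. Every name in it (`Alert`, `Instance`, `heal`, `is_available`, the runbook text) is hypothetical and not from the video, and the "restart" is a stand-in rather than a real process manager.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """An alert that carries its own instructions (all fields hypothetical)."""
    name: str
    runbook: str  # what to go diagnose when this fires


@dataclass
class Instance:
    """One redundant copy of a service."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        # Stand-in for a real restart; here it only marks the instance healthy.
        self.healthy = True
        self.restarts += 1


def is_available(instances: list[Instance]) -> bool:
    """The service stays up as long as at least one instance is healthy."""
    return any(inst.healthy for inst in instances)


def heal(instances: list[Instance], alert: Alert) -> list[str]:
    """Restart unhealthy instances and return notices for whoever is on call."""
    notices = []
    for inst in instances:
        if not inst.healthy:
            inst.restart()
            notices.append(
                f"{alert.name}: restarted {inst.name}. Runbook: {alert.runbook}"
            )
    return notices


alert = Alert(name="worker-unhealthy", runbook="Check recent deploys, then queue depth.")
pool = [Instance("worker-1"), Instance("worker-2", healthy=False)]

still_up = is_available(pool)  # redundancy: worker-1 keeps serving
notices = heal(pool, alert)    # self-healing: worker-2 gets restarted
```

The point of the sketch is that the notice includes the runbook text, so the person reading it does not need the original author to know what to look at next.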
We also have things like the code being accessible, understandable, readable, and highly tested. That means that if someone has to go in and make a change to fix a bug or add a feature, without the single point of failure, they can. It's not, "oh god, I have to go touch Jim's code, and no one's ever been able to read Jim's code because Jim is such a wizard of software engineering that the code is on a different level." What I'm hinting at, maybe poorly, is that almost every step of the way, the different parts of a software system are built with this kind of scalability, this lack of a single point of failure, in mind. So all the way from the code to the operations, it's done in a way that others can jump in.
And this isn't a simple thing to do. Maybe you take one of these things and say, okay, sure, write the code in a way that doesn't suck, so other people are able to read it. That's why there are people on pull requests, to make sure the code is flowing in a way the team can work with. But to truly go build something end to end where this is a factor along the way is non-trivial. I would say that for most people there's extra work and effort that goes into making that happen. So if this is the case, the thing that I'm positing here is that most of the time it's not a malicious thing when this happens.
I think a lot of the time when this happens, it comes down to two things. One is time, and the other is personal priority, which I would say is not a malicious thing, but it kind of conflates what a goal should be. It may also be a combination of both. On the first one, what I mean by time is that most people at most places are probably under a lot of stress to continually get more and more work done. So there's always this feeling of a lot of pressure, of feeling rushed: get something going, get it working, move on to the next thing. And as I was just saying, it often takes more time and effort to go build systems end to end like this.
If we're constantly looking for shortcuts to make things happen, again not because we're being malicious, just because we're trying to get more work done, then I think it's easy to start missing things like this. Even if you're doing it at the coding part, and then you say, "Okay, it's going to have metrics and monitoring," cool, you're checking the right boxes along the way. But even with that, it might be: okay, the monitor's firing and you're not there. Is it obvious why it's firing? Are there instructions for what to do? Does everyone have the right permissions to access the right dashboards and systems when monitors are firing? Did you check all of the boxes?
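The questions above amount to a checklist, and a single unchecked item is a gap. A minimal sketch of that idea, assuming a hypothetical `HandoffRecord` whose fields simply mirror those questions (none of these names come from the video):

```python
from dataclasses import dataclass


@dataclass
class HandoffRecord:
    """Hypothetical record of what exists for one monitor besides the alert itself."""
    monitor: str
    cause_is_obvious: bool      # is it obvious why it's firing?
    has_instructions: bool      # are there instructions for what to do?
    team_has_permissions: bool  # can everyone reach the dashboards and systems?


def gaps(record: HandoffRecord) -> list[str]:
    """Return the unchecked boxes; an empty list means no known gaps."""
    checks = {
        "cause is not obvious from the alert": record.cause_is_obvious,
        "no instructions for what to do": record.has_instructions,
        "team lacks permissions to dashboards or systems": record.team_has_permissions,
    }
    return [problem for problem, ok in checks.items() if not ok]


record = HandoffRecord(
    monitor="queue-depth-high",
    cause_is_obvious=True,
    has_instructions=True,
    team_has_permissions=False,
)
missing = gaps(record)  # one gap is enough to leave people blocked
```

Here two of the three boxes are checked, yet `missing` is still non-empty, which is the situation where the alert fires and nobody but the author can act on it.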
All it takes is one gap in that process and suddenly people feel blocked. Okay, thank you for making the alert. It's firing. We know the system's on fire. At least we know that, and that's genuinely better than not knowing. But now what do we do? How do we make it not on fire? That also needs to feel very smooth end to end. So time and feeling rushed, which I would say could very much be team culture or organization culture, is a big factor. Ultimately, I think time is one big part.
The other thing I mentioned was around personal alignment, motivation, that kind of thing. And what I mean by that specifically is... this guy's got to go. Sorry, this person was in the far left carpool lane. Everyone was trying to pass them, so they moved over so people could pass them, but now they're in front of me going not fast enough.
So, on the personal motivation and priority alignment side: I think I've talked about this before in other videos, but one of the things that can happen here, again not a malicious thing, is about how people are set up, especially by their managers. This is again very much a team culture or corporate culture kind of thing, depending on the organization. Instead of incentivizing individuals to be promoted and receive rewards and compensation for making everyone around them better, it's very much a "tell me what you were able to deliver" setup.
When it's all about you, I don't know a better way to make this clear, but people will prioritize that. I don't have stats to prove this, but from working in a startup for 8 years versus working in a big tech company for 5 and a half years, where one of these two places is very much about levels and motivation for promotions, I can absolutely see a shift in how people personally align their focus. If I work on these things, if I go get this done, if it's something I own end to end, then I can claim that I had the impact. It's about what I can show that I did. And don't get me wrong, a lot of the time it's true that what I did had big impact.
I had the impact across a broader surface area. So there's this facade of, yes, it had team-wide or organization-wide impact. But the entire motivation was not "how do I go do something to make the team better?" or "how do I go do something to make our organization better overall?" That's the facade. The reason you're doing it is not because you're motivated by that. The reason you're doing it is because you want a promotion. Those are different motivations. I'm not saying those two things cannot coexist, like, I want to have a very big positive impact because I care about what I'm doing, and it would be great to be compensated for that. I'm just saying the primary motivator makes a big difference. And when that happens, again, it's not a malicious thing.
It's not about sabotage or anything like that. It's just that you're focused on you. You hunker down and you try to get it done, again with timelines and constraints like that, but this is for you to own end to end. I think a lot of the time when this happens, you are genuinely not framing things as "how can I make this work for a broader team?" You're getting it to a point where you can say, "how can I demonstrate that I've built the thing that has the impact?" Great, that's done. Check the box. Move on to the next thing. How do I go repeat that? The problem is that now you create a bunch of things like this that are not enabling a whole team to be more effective overall. They're a deliverable that, yes, can help a team.
Absolutely. I'm not dismissing that part. But is it manageable by anyone on the team? Is it designed to be that way, or is it designed so that hopefully it doesn't have any issues, and when it does, you're going to be the person to deal with it, but hopefully that's infrequent enough? Because that doesn't scale at some point when you own a bunch of critical pieces. Especially when there are live services involved, inevitably shit's going to break. And now the whole team is paying a tax for a thing that's supposed to help, because you were too focused on it being for your promotion. So, those are the two other things I wanted to talk about. I'm just getting to CrossFit here, but basically, the first is time: people don't do it maliciously, they just don't have sufficient time to really do a good job of building things end to end in a way that's going to be manageable by a team.
The other part is this personal motivation around promotions, rewards, compensation, and stuff like that. Again, not a malicious thing, just sort of the wrong focus: you delivered the impact, check the box, move on to the next thing. Anyway, those are some thoughts. I'd be curious if you have other thoughts, too. I'm not saying these are the only things, but these are two that come to mind aside from just the, you know, hashtag job security. So, if you've got questions on software engineering, career development, leave them below in the comments. Otherwise, go to code.com. You can submit stuff anonymously that way. And I will see you in the next video.
Frequently Asked Questions
These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.
- What does the speaker say about the primary motivations behind engineers becoming the single point of failure?
- I don't think malicious intent is the primary motivator. I identify time pressure and personal priorities (like promotions) as two big factors. I note that under time pressure, people rush and cut corners, which can create single points of failure. I also point out that personal priorities can drive people to focus on delivering for themselves rather than for the team.
- What practices does the speaker describe to avoid a single point of failure in a system?
- I describe several things that can help: documentation that explains how to work with the system, monitors and alerting with instructions for what to diagnose when they fire, and a self-healing or redundant setup so instances recover automatically. I also talk about code being accessible, readable, and highly tested so others can make changes without depending on one 'wizard' like Jim. I note that it's not trivial and there can be gaps, but these practices help keep the team from being blocked when one person is unavailable.
- How does the speaker describe organizational culture and motivation as factors in this issue?
- I discuss that time pressure and personal motivation around promotions are two big factors. I point out that in some organizations promotions and rewards are tied to what you deliver rather than how you help the team, which can lead to a focus on self rather than collective impact. I also say that this isn't malicious, but it can result in a 'check the box, move on' mindset that hampers building systems that the whole team can own.