A user submitted a question asking if it's normal for things to blow up in production when developers release features. Is this a sign of a bigger problem or just typical for dev teams?
Auto-Generated Transcript
Transcript is auto-generated and may contain errors.
Hey folks, we're going to take a YouTube question from web developer ninja 9220. I think this is an interesting one: "Can you talk about features blowing up in dev teams' faces when first releasing in the wild? Like, is this common? If it's happening, is it a lack of good testing/design?" I think there's a bunch of stuff that could be happening here, and it's a good one to talk through. And apologies in advance: I've had the flu for the past week. It's a long weekend, so I'm trying to catch up on making some videos, but I'm a little congested, and in the last video I just filmed I had to pause to cough a little bit. So apologies in advance if I have to get the mic out of the way, but I'll do my best. So, features blowing up in a dev team's face when they're first releasing in the wild.
Okay. I would say if this is consistently happening, you've probably got some things to sort through. The stance I have on this is that no matter how much planning and upfront work you do, there are always possibilities where things go wrong and don't work as you expect. So when this person asks about features blowing up in the whole team's face, that sounds pretty extreme, and I would ask: what's the severity? When we say "blow up," what's actually going on? My perspective is that anytime we fail at things, there are opportunities to learn, and not only from the perspective of a single developer. So let's say one developer working on a feature releases it as part of the team and it doesn't work. It's not like the whole team goes, "Hey buddy, you've got to figure out how to not suck so much in the future; it's on you to get better."
Sure, the one individual who was maybe primarily responsible for the feature has some learning and some things to improve, but the whole team always has some responsibility and accountability. That might mean asking: do we need better code reviews? Who was working on the build system, and why didn't it catch that? Do we have flighting mechanisms in place? Do we have A/B testing? Did your test suite not run? Did you not even write the tests? Were there no policies to enforce that? Was there no alerting or telemetry? There are a million different angles from which you can look at something like a feature blowing up. Or let's give another example: maybe literally every single technical detail along the way is perfect somehow, right? Whatever "perfect" means.
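To make the "flighting mechanisms" idea concrete, here's a minimal sketch of a percentage-based feature flag in TypeScript. All the names here (`isEnabled`, `hashUser`, the `new-checkout` flag) are hypothetical; real teams usually use a flag service rather than hand-rolling this.

```typescript
// A feature flag with a rollout percentage, so a new code path reaches
// only a slice of users at first. Names and shapes are illustrative.
interface Flag {
  name: string;
  rolloutPercent: number; // 0..100
}

// Deterministic hash into a 0..99 bucket, so the same user always gets
// the same decision (no flip-flopping between requests).
function hashUser(userId: string): number {
  let h = 0;
  for (const ch of userId) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return h % 100;
}

function isEnabled(flag: Flag, userId: string): boolean {
  return hashUser(userId) < flag.rolloutPercent;
}

// Roughly 10% of users see the new path; a broken feature "blows up"
// for a small slice, not for everyone at once.
const newCheckout: Flag = { name: "new-checkout", rolloutPercent: 10 };

function checkout(userId: string): string {
  return isEnabled(newCheckout, userId) ? "new checkout flow" : "old checkout flow";
}
```

Hashing the user ID makes the decision sticky per user, which also gives you clean buckets for A/B comparison, and dialing `rolloutPercent` back to 0 is an instant kill switch.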
But the functionality you landed for the user is not actually what they want. So you delivered the perfect thing, executed perfectly, but it's not even what they want. Is that blowing up in their face? Right? There are so many different ways to interpret this. And to me, it always comes back to this: if it didn't work the way you were hoping, whether it didn't meet the user's expectations, there were crashes, there were latency issues, there were other performance problems, there were availability issues, you name it, something didn't go how you expected, there's always an opportunity to learn. You should go back to your process and figure out: how do we tweak or improve what we're doing? How do we make a change so that next time we can see if it makes things better?
And you keep iterating, because it's never going to be perfect. Even when you think you've reached some kind of perfect state, it doesn't stay perfect. You have to keep evolving and changing, because the world we're working in continues to evolve and change. Okay. So, features blowing up in dev teams' faces when first releasing to the wild. There are different ways to interpret this. Is this a first release, like an MVP or a product launch? If that's the case, did you somehow have so much demand that you couldn't scale? I feel like that's probably a rare thing, and it's a great problem to have if you built something awesome and go, "Oops, it doesn't scale because we had a million people sign up on day one." If so, what do you learn from that?
How did you not anticipate that? And PS, follow up with me, because I would love to learn what you did to get a million people to sign up on day one. Or is it the wrong thing? You launched your MVP, you had all these beta or alpha users, and they're like, "This is pretty terrible." It's not a performance issue, a latency or availability issue, or things crashing; it's just not really what they want. It kind of sucks. Okay, so what do you learn from that? How were you talking to potential customers about what to build? Did you just guess? Did you actually solve the problem of some group of users and then accidentally present it to a different group of users, confusing the problem? That's more of a product-fit type of thing, right?
It's maybe a little less about software engineering and a little more about product management or product engineering. People love slapping "engineering" as a word onto everything: prompt engineering, AI engineering, product engineering. By the way, are any of us actually engineers? Certainly not at the professional level, right? Do any of us have a professional designation? People get really bummed out when I post about that kind of stuff online. For the record, I'm also not a professional engineer by trade. But there are a lot of ways to interpret this, right? So, is it a lack of good testing? Is it a lack of design? I think it could be, but I don't think it's exclusive to that, which is why I was trying to give some other examples. So let's flip it, right?
Let's say you did an architectural design for this new feature. You're talking about what things are going to look like for the database and how that's going to scale, whether you're going to have capacity concerns, and you're doing performance analysis on the queries you'll need to read and write. On top of that, in your design and your implementation, you're talking about the testing. You implement the tests; you have all this test coverage. So on paper you have what looks like really good architecture and really good testing. Okay, now when it comes to reality: did your architecture and testing and all of this stuff match reality? Did you have a misunderstanding of what reality is? It might sound kind of funny. You might say, "Well, if you did, then you have bad architecture and testing." And it's like, sure, maybe.
I wouldn't say it's necessarily bad, but you can have a small misunderstanding in some of the assumptions about your environment that has a big impact you didn't expect. Right? I've literally worked with engineers on designs where multiple people reviewed at principal-plus level, up to VP and architect-level people. The design gets done, signed off on, starts rolling out, and we're like, "Wait a second, that query performance doesn't scale; we're actually getting throttled by the other service." Does that mean the architecture was bad? Does that mean their testing was bad? Is that their fault? No. Missed assumption, right? A misunderstanding. That's fine. Does that mean the entire thing's invalidated? Nope. In this case, I would argue that a lot of good things happened, because it was caught very, very early, which means we can pivot on part of that design.
We don't have to scrap the whole thing; we pivot on part of it. We didn't cause catastrophic outages. We probably didn't cause any outages, because we were able to catch it early, when there was enough signal and enough realistic representation of the environment. So I would say it's not strictly because of bad design or bad testing, and I think it really comes down to the classification of what "blowing up" means. I do think that in many scenarios where you're rolling out a feature, and this person's username is web developer ninja, so I'm assuming they're talking about web development, maybe some type of distributed system, or maybe it's just a website: did you push straight to prod? Is there a staging or canary environment?
Could you have tested it that way? Or do you have that, but it's not actually representative of production? It's a proxy for it, but not close enough. So it looked good in canary or in your testing environment, but there's some detail that's not the same as production, and that's what screwed you up, right? There are so many things to think about here. The reason I'm giving all these different examples is to bring it back to this: at the end of the day, shit's going to break. You're going to build stuff that's not right, whether it's not what the user wanted, or it was technically busted, or something else happened. It could be anything. It's not going to be possible to eliminate all problems. You can do lots of work up front, and you should do planning and architectural designs and things like that.
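As a sketch of how a canary environment can catch a bad release before it reaches everyone, here's a hypothetical promotion gate in TypeScript that compares the canary's error rate against the production baseline. The function names, `Stats` shape, and tolerance value are illustrative assumptions, not any particular tool's API.

```typescript
// Request/error counts observed for one environment over some window.
interface Stats {
  requests: number;
  errors: number;
}

function errorRate(s: Stats): number {
  return s.requests === 0 ? 0 : s.errors / s.requests;
}

// Promote the canary only if its error rate is not meaningfully worse
// than the baseline (here: within 1 percentage point). Otherwise roll back.
function shouldPromote(baseline: Stats, canary: Stats, tolerance = 0.01): boolean {
  return errorRate(canary) <= errorRate(baseline) + tolerance;
}

const baseline: Stats = { requests: 100_000, errors: 120 }; // ~0.12% errors
const badCanary: Stats = { requests: 1_000, errors: 50 };   // 5% errors: roll back
const goodCanary: Stats = { requests: 1_000, errors: 2 };   // 0.2% errors: promote
```

In practice you'd also compare latency percentiles and let the canary soak for a while before promoting. The point is the blast radius: a busted feature fails for the canary slice, not for all of production, and the same caveat from the video applies, since a canary that isn't representative of production can still pass this gate.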
You should do these things. They help. But things will go wrong. You're not going to have a perfect career where 100% of everything you've ever touched worked perfectly the first time and couldn't be improved. And if you accept that, then we can think about the other part: when things go wrong, because they will, what do we do about it? That's why I always come back to a blameless culture and taking accountability, and that means stepping in even if you're thinking, "I didn't touch that code. I didn't even look at the code review. I wasn't the one who pressed approve." Okay, but should you have been? Maybe, because maybe you could have caught it, right? We all have a part to play, no matter how you slice and dice it. So if you're all on the same team, try to make things better, try to learn, work together, and be accountable together for improving things.
That doesn't mean you radically change your process every time something happens. Keep making incremental improvements, keep testing those improvements, and keep getting better. And don't stop, because as I said a little earlier, if you stay static thinking you've reached a perfect solution, you're going to have a bad surprise later. So, I hope that helps. I think it's a really interesting question. If you have questions, leave them below in the comments. And web developer ninja 9220, if there's something more specific you want to ask about this, please follow up; happy to do my best on it. Thanks. See you in the next one.
Frequently Asked Questions
These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.
- Is it common for features to 'blow up' when first released to production?
- I believe that no matter how much planning and upfront work you do, there are always possibilities where things go wrong and don't work as expected. While consistent failures might indicate issues to sort through, occasional problems are normal and present opportunities to learn and improve.
- What should a development team do when a feature fails after release?
- When a feature doesn't work as hoped, whether due to crashes, performance issues, or not meeting user expectations, I think the whole team shares responsibility. We should analyze what went wrong, review our processes like testing, design, and monitoring, and make incremental improvements to avoid repeating the same mistakes.
- Can good architecture and testing still result in feature failures in production?
- Yes, even with thorough architectural design and testing, misunderstandings or incorrect assumptions about the environment can cause issues. I’ve seen cases where everything was reviewed and signed off, yet real-world factors like query performance or service throttling caused problems. This doesn’t mean the design or testing was bad, but highlights the need to catch such issues early and pivot accordingly.