I Broke Production. Here's What I Learned...

Yes, even after 20+ years of writing code and 10+ years of shipping software professionally... I break prod. Here's my embarrassing story of breaking production!

📄 Auto-Generated Transcript

Transcript is auto-generated and may contain errors.

Hey folks, I'm just going to do a quick update here. I've been on call this week, and when I'm on call, I generally don't get a lot of driving in because... Whoa. Sorry. I don't know if you could hear that. That was pretty crazy. It sounded like an alarm, except I'm driving on the highway. Maybe that audio won't come through, but you can probably see by the look on my face that I was very confused. It's actually just a construction site, and I don't know what kind of machine they were using, but it sounded like a telephone or an alarm going off. Hard to explain. I'm looking for flashing lights and there's nothing. Anyway, I've been on call this week. When I'm on call, I don't get to drive much because of the timing of the on-call shift.

So, I just did a video. So, I'm just going to do a quick one to give folks an update on things I'm working on that I think are interesting. I had a fun bug in my BrandGhost platform from the weekend that I spent some time debugging. It's a really embarrassing problem, and I thought it would be kind of fun to talk about briefly. I had mentioned previously that I was doing a lot of refactoring so that I could use my Needler dependency injection type-scanning framework. I had this all working locally and I was ready to push it up. There's like 1,500 tests that pass. A bunch of them literally stand up the application and go make web requests to it.

I feel pretty good about it. It runs the scheduler, and it does a good job of actually exercising the system. But I pushed it to production overnight, and when it had deployed, I went to the website and was interacting with it. Works great, awesome, because especially for this kind of stuff, in my mind, it's going to blow up right away, I would think. But again, I have tests that run the app end to end. Wasn't seeing an issue. Okay, so as you might have guessed, there's a problem. In the morning when I woke up, I had a message in support and someone was like, "Hey, my post didn't go out." And I looked at my posts and I was like, interesting. My post also didn't go out.

The whole point of the application is that posts are supposed to go out. It's a scheduling tool. So I was like, "Oh my god. Okay, so immediately revert, right?" I know that I'm the person who did that. I must be. So revert, no worries. It's very quick to go back. Fortunately it was a Sunday night, which is probably when the least amount of posting from schedules happens. But that's a real issue that I caused in my production system for paid users. So while I'm laughing about it, it's more that I'm embarrassed by it, right? I'm not laughing like, "Oh, whatever, it's just my users." I'm laughing because in hindsight, when I talk about what this issue was, even though I have so much test coverage and high confidence in it, it really opened my eyes.

So what had happened... well, the next step that I took after I reverted is I said, man, that's really embarrassing, because if the core of the application doesn't work and I don't know about it, even though I have all of this test coverage that proves it works end to end, there's a problem. There's a gap, right? And in my mind, there's got to be a really obvious thing going on here. So I said, I'm going to put all of this effort in to make sure I have better health checks and visibility. So I wrote all this health-checking stuff that tells me: is my scheduler running? And I want to know the last run, like the last 100 times it ran. I need those details. So, I built it, right?
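As a rough illustration of the kind of health check described here (not the actual BrandGhost code), the following Python sketch keeps the most recent scheduler runs and reports the scheduler as unhealthy when it has not run recently, or has never run at all. The class name, the field names, and the 10-minute staleness threshold are all assumptions made up for the example; only the "last 100 runs" figure comes from the transcript.

```python
from collections import deque
from datetime import datetime, timedelta, timezone
from typing import Deque, Optional


class SchedulerHealth:
    """Hypothetical tracker: remembers recent run times and flags staleness."""

    def __init__(self, max_runs: int = 100,
                 max_silence: timedelta = timedelta(minutes=10)):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.runs: Deque[datetime] = deque(maxlen=max_runs)
        self.max_silence = max_silence  # assumed threshold, not from the video

    def record_run(self, when: datetime) -> None:
        self.runs.append(when)

    def check(self, now: datetime) -> dict:
        last: Optional[datetime] = self.runs[-1] if self.runs else None
        # A scheduler that never started has no runs and must count as unhealthy.
        healthy = last is not None and (now - last) <= self.max_silence
        return {"healthy": healthy, "last_run": last,
                "recorded_runs": len(self.runs)}


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

never_started = SchedulerHealth()
print(never_started.check(now)["healthy"])  # False

running = SchedulerHealth()
running.record_run(now - timedelta(minutes=2))
print(running.check(now)["healthy"])  # True
```

The detail that matters for this story is the "never started" case: a check that only looks for crashes or errors would stay green if the scheduler was simply never launched.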

And I was running it locally and I get these health checks and I'm like, "Hell yeah." Not going to mess this up again, because if I see those health checks not working, I'll know right away. So, great. Push it up to production. Health checks are showing my stuff's not running. I'm like, "God damn it." Okay. But again, I had checked this stuff locally. Everything's working. So what's going on? And then I got pretty lucky: while I'm redeploying, I went and checked my dashboard for the environment variables. And I'm like, something's got to be misconfigured here. And I said, I don't think that my scheduler is crashing. I don't think it's ever being started. That's my suspicion, right? That's my intuition with this kind of problem, based on being the person who built the system.

So I looked at the configuration and there was something set that by default is the opposite value. So the only way that I could ever see that value is if something explicitly set it. But I only ever explicitly set that when I'm running locally. So I said, "No way. I didn't commit my local changes. I didn't break that. There's no way." And I went back to Git and I looked and I proved it. I proved it to myself. I didn't change that. I said, "That's the only thing it could be and I didn't do it. So what the hell's going on?" And then it dawned on me that I swapped two lines of code. I swapped two lines of code that changed the order in which the config was loaded. So I was always overriding some configuration with another one.
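To make the failure mode concrete, here is a minimal Python sketch of layered configuration where a later layer overrides an earlier one, so swapping two load calls flips the effective value. It is an illustration under assumptions, not the real code: the `load_config` helper, the `scheduler_enabled` key, and the three layers are hypothetical, since the video does not name the actual settings or sources.

```python
from typing import Any, Dict, Mapping


def load_config(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge config layers in order; a later layer overrides an earlier one."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical layers; the real keys and sources are not named in the video.
defaults = {"scheduler_enabled": True}
local_overrides = {"scheduler_enabled": False}  # only meant to apply locally
environment = {"scheduler_enabled": True}       # what production asks for

# Intended order: the environment layer is loaded last, so it wins.
intended = load_config(defaults, local_overrides, environment)

# Same three layers with the last two load calls swapped.
swapped = load_config(defaults, environment, local_overrides)

print(intended["scheduler_enabled"])  # True
print(swapped["scheduler_enabled"])   # False
```

Nothing in either layer changed between the two calls, which matches why the Git history for the setting itself looked clean: only the precedence changed.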

Literally just swapped two lines of code. And then I had this existential crisis where I said, "Oh my god, I need a way to detect this, right?" It's so embarrassing. So now, not only do I have the health checks and stuff in place, I also have alerting and monitoring on for logging. So if there are anomalies in my logs, or if I'm not getting logs, that kind of thing, then I'll be alerted immediately, or like within 10 minutes. I think that's immediate enough. So, you know, super embarrassing for me, but like I said, the reason I'm laughing is because I swapped two lines of code. That's embarrassing. But it caught a really big gap in the observability of my system, right? Like I said, I was running the application. It was all working.
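The "alert me if I'm not getting logs" idea can be sketched in a few lines of Python. This is only an illustration: the function name and message format are invented for the example, and the video does not say which monitoring tool is actually used. The 10-minute window is the figure mentioned above.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def log_silence_alert(log_times: Iterable[datetime], now: datetime,
                      window: timedelta = timedelta(minutes=10)) -> Optional[str]:
    """Return an alert message if no log entry falls inside the window, else None."""
    latest = max(log_times, default=None)
    if latest is None:
        return "no logs received at all"
    silence = now - latest
    if silence > window:
        minutes = int(silence.total_seconds() // 60)
        return f"no logs for {minutes} minutes"
    return None


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

quiet = [now - timedelta(minutes=45), now - timedelta(minutes=30)]
print(log_silence_alert(quiet, now))  # no logs for 30 minutes

recent = [now - timedelta(minutes=3)]
print(log_silence_alert(recent, now))  # None
```

Alerting on the absence of expected output is what catches a component that was never started, since it produces no errors to alert on.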

It's just that when I launched it, I basically had a setting that explicitly said don't run that thing. So, yeah, pretty dumb. So, just a heads up on what I'm going to be doing. For .NET developers, I use Aspire. Some of you may know what it is, some of you may not. It has a pretty cool dashboard, but I had this idea where I was going to build a Blazor admin dashboard, just because Aspire is very generic. It's good. It's very generic though. And there's some admin tooling and debugging stuff that I want to put into a dashboard personally. So I had this Blazor admin dashboard, but I would only ever run it locally sometimes. But what I decided I'm going to do is I just had AI rewrite it.

So I was changing the paradigm for it. To briefly explain: because that dashboard existed in my codebase and I was only ever running it locally, I could just reuse my tech stack, right? I don't need to make API calls to my server because I can make the same calls natively through my code path. I don't need to go out to the internet to do it. But I said I actually want to use this to check that this other app is running now. So I had AI go rewrite it. Right now, I don't have all of those APIs in place that I need, which is fine. But I had AI go rewrite this dashboard to basically strip out all of the code that was there that was calling through the native code paths that I have.

Strip that out and then stub it with, like, this is where the API call is going to go. I have to go build those still, but I'm going to make some videos where I demonstrate. I'll do another Needler refactor so folks can see that, because it undid all of my dependency injection and put all this in. I can't stand it. So I'll do another refactor video like that with Needler. This will be for a Blazor application. We'll get to see authentication, and we'll get to see how to set that up with Blazor in general. So we'll do that, and what else? I'll see what else I can carve out of that. But that's going to be a real admin dashboard. That's a real example to walk through. So I thought that would be kind of cool and relevant.

But yeah, just for a couple of details on how this worked: I had this admin dashboard. Like I said, the calling convention into the code paths is something that I wanted to change. It's not that it was wrong before. I just don't want to follow that path. But the dashboard itself still had the screens and stuff I wanted. I just want to call my live server for it. Oh my god, buddy. Worst driver I've seen today. Holy cow. Slowed down. Light turns red. We easily could have gone through before, but they slowed down. And then they still went through, but I can't go because they're an absolute dummy. Anyway, not that the code was wrong before, but I wanted to actually call the live service. So it did a pretty good job of keeping things intact.

The first pass through, it just deleted a bunch of stuff. Yeah, kind of annoying. I was using Claude Code, and I said, go back in the Git history, you can go back this many commits, because I was committing a couple of things along the way before I realized it had just deleted some pages out of the app. I said go back a couple of commits and bring those suckers back. I had to coach it a little bit through making sure that it was compiling code. So that was an important detail. We had to fix the auth. Auth was the biggest thing to go address, because when it had deleted some code and replumbed the authentication, it made all of these nice settings and stuff. So I was configuring these settings. This part's been vibe coded, apparently, right?

So I'm configuring these settings and I'm like, why isn't my auth working? That's the right value. And then I go check, and it put all these settings in place but didn't consume them. So I was like, "Hey man, these 20 fields I just configured, you've got to use them." So yeah, kind of just stupid back and forth this way, but honestly, it did a pretty good job. Maybe I'll make some videos showing how to get AI to either reuse some of the existing endpoints that fit, or, hey, I need to create them. And this will be interesting because it's a front end and a back end in the same repository to work through. So we can have AI do both sides of it, which I think will be kind of fun.
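One way to catch the "settings were defined but never consumed" problem mentioned above is to track which settings are actually read. The Python sketch below is a hypothetical illustration only: the `TrackedSettings` class and the setting names are invented, and this is not what Claude Code generated or what the dashboard uses.

```python
from typing import Any, Dict, Set


class TrackedSettings:
    """Hypothetical wrapper that remembers which settings were actually read."""

    def __init__(self, values: Dict[str, Any]):
        self._values = dict(values)
        self._read: Set[str] = set()

    def get(self, key: str) -> Any:
        value = self._values[key]  # raises KeyError for unknown settings
        self._read.add(key)
        return value

    def unused(self) -> Set[str]:
        """Settings that were configured but never read by any code."""
        return set(self._values) - self._read


# Made-up setting names, purely for illustration.
settings = TrackedSettings({
    "auth_authority": "https://login.example.com",
    "auth_client_id": "dashboard",
})

client_id = settings.get("auth_client_id")  # the only setting the code consumes
print(sorted(settings.unused()))  # ['auth_authority']
```

Checking `unused()` after startup, or in a test, would flag configured fields that no code path ever reads.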

Like, here's what I need in the front end. Cool. You have access to the whole backend repository, so do you have what you need there? If not, how are we going to go build that to follow the backend practices, not just build some API that the front end needs completely in isolation from all of the backend technology? So we'll see. I feel pretty good about it. I think it'll do a good job, especially based on what I was seeing today with how it was able to iterate through things. So, looking forward to that. But that's been my journey since the weekend. Oh, is there a spot right here? Nice. We got the close spot. It's never free. Let's get this parked. I like these spots because then there's no one to my left and I can park way over the line.

You maybe could see it on the ground at some point in another video, but there's these fat margins at the end spots, so I can park into that. That way, because all of these spots are compact, the person on the other side of me isn't going to ding my door. Ask me how I know about that. I will see you in the next video. Take care.

Frequently Asked Questions

These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.

What caused the production issue in my BrandGhost platform despite having extensive test coverage?
The production issue was caused by swapping two lines of code, which changed the order in which configuration was loaded. As a result, one configuration source always overrode another, and the deployed app ended up with a setting that explicitly told it not to run the scheduler, even though all tests passed locally.
How did I improve observability and monitoring after the production issue?
I implemented better health checks that report on the scheduler's last run times and overall status. Additionally, I set up alerting and monitoring for logging anomalies, so I get notified within about 10 minutes if something goes wrong or if logs stop coming in.
What changes am I making to the Blazor admin dashboard to improve its functionality?
I had AI rewrite the Blazor admin dashboard to remove native code path calls and stub in API calls to the live server. This will let the dashboard talk to the live service instead of only running locally. I'm also working on authentication and still need to build the backend APIs that support this new approach.