Do You Have A HIGH Tolerance For Bugs In Your Software?

From the ExperiencedDevs subreddit, this Redditor was curious about folks who had worked in different domains where there were high and low tolerances for bugs. Sounds funny -- let's discuss!

📄 Auto-Generated Transcript

Transcript is auto-generated and may contain errors.

Hey folks, I'm just leaving the office. We're going to go to ExperiencedDevs for the topic today, and I thought this one was pretty interesting. The question sounds kind of funny on the surface, but I peeked through the comments and was surprised to see that people weren't just bashing it; they were chiming in with real examples. The question was basically: what sorts of projects have you worked on in your career where the tolerance for bugs is very low? The reason it sounds kind of funny is that, well, we're building software, so shouldn't the tolerance for bugs always be very low? But there was this acknowledgement that there are some types of things people build where you can absolutely be more tolerant of bugs.

And I thought this would be kind of a fun one to talk through, because in theory we say we don't want to have any bugs, and we go through all these processes, but in practice that's almost never what happens. So I figured I'd share some background on spaces I've worked in where the tolerance for bugs was very low, but I can also talk about where there was, well, it sounds funny to say there was tolerance, because what I don't mean to suggest is that we didn't care. I want to be transparent about that the whole way through: at no point am I saying, oh, we never cared, therefore we just wrote buggy software. It's probably going to sound like that as I'm talking, but it's never what I mean. It's just that there was more rigor around the other areas. So I want to get that disclaimer out of the way.

For me, it was digital forensic software. Let me explain some of the background as to why this stuff is super critical, and then how some of these pieces work. One of the really key parts of digital forensics software is being able to structure otherwise unstructured data. What does that mean? Well, something that would be very simple for digital forensic software to do is look across a file folder hierarchy and get you all of the documents, right?

Find me all the files that end in .doc or .docx or .pdf, right? That's a pretty low-hanging-fruit approach to getting you documents. But what about documents that have been deleted? In digital forensics tooling, we need to be able to recover as much information as possible and structure it. That means oftentimes you're dealing with stuff that is not necessarily in an easily accessible format. It might not be something where you can just open the file, because you're literally taking pieces of file segments from a hard drive. My god, come on, people. So it's not as simple as opening a file within a folder. Sorry, I'm trying to get on the highway, and no one was moving over, and then we had people going too slow to merge onto the highway and not helping.

So when you have data like this, a lot of the time, and I don't want to say the majority of the time, because that's not really fair, there are a lot of files that are intact on a machine, but there are going to be a lot of situations where you're going through parts of hard disks or archives of different files, whatever it happens to be, and you're dealing with data that is not guaranteed to be structured perfectly for you. Now, when we go to restructure that data, there's going to be opportunity to mess things up. But what we need to be very clear about, and this is where there's a very low tolerance for bugs, is what we're reporting when we structure that data. Things like timestamps: we need to be extremely clear and accurate with a timestamp.

Why? Because that's being used in a court of law. There's evidence that comes from our tool that's used in a court of law to either try to prove someone guilty or help prove someone's innocence. If we're wrong about the timestamps, that could tell a dramatically different story. So it's really about trying to make sure, I can't remember the exact mission statement, but the saying for the company was something like bring the truth to light. We had it on the walls and everything, about uncovering the truth, because that's the whole goal of the software: to make sure that... Oh, come on, buddy. You're going to cause an accident. Holy, I don't know if you could see that. The Mercedes actually locked the seat belt on me.

Absolute dumbest people drive around here. I cannot believe it. This person's lined up in traffic. The guy in front of me almost hit the person, and then he tried to pull out again. If you're not driving at a normal speed, you cannot pull out in front of traffic. Oh. That's the first one we got on video, I think, on this channel. So, what was I even saying? Right, we're talking about uncovering the truth. Things like timestamps we need to be extremely accurate with, because if we misrepresent them, not only are we not doing what we set out to do, but we could literally have the opposite effect, because now the story the data is telling is no longer the true story. Another problem is if that data is shown to be inaccurate.

Then you erode the trust in the tool. How can they trust the rest of the data you're showing? You have bugs in your software, so how do we trust the data you're showing? That becomes really problematic. Not only does it look terrible in court, they could toss the evidence out, but even worse would be if it was wrong and they used it. That's not okay. So for us, that kind of thing was really important. Give me one sec, because I've got to switch lanes. Just want to make sure this huge F-150 was going fast enough out there. Another thing to comment on is that sometimes the data we're recovering is not as obvious as one might think. I can't remember the specifics, but my wife was watching the Karen Read trial.

She loves watching Court TV and the lawyer streamers who talk through court proceedings, and it was cool because she was watching this Karen Read trial and they were talking about the software that I helped build. Super cool. They even had an expert witness come on that I used to work with, and I was like, "There's Jessica Hyde. I know Jessica." So I could kind of nerd out a little bit, because my wife, rightfully so, is not super interested in the work that I do, since it's pretty nerdy and she has better things to care about. But she was really interested in the court side. So when they were talking about the digital evidence, she could ask me questions and I could explain, from the software side, what goes into that.

Anyway, there was a part of this trial where they were talking about some of the recovered evidence. Like I was saying earlier, we want to recover and present as much data as possible, but not all of it is in a state where, and this is going to sound kind of weird, it's always easily explainable. For example, you might come across timestamps in the data that you're recovering. You want to get back as many of those as possible, because timestamps can really help paint a picture of the activity that's happening. The tricky part is that when we think about forensics and recovering data, for the most part it's not like people are building applications or file formats thinking, I'm going to write this so that someone can use it for digital forensics.

They're writing it because they're software developers. They're thinking, I've got to store this datetime because that's what I'm going to use for caching or something else. It's not meant to be a forensic record; it's meant to support the functionality of the application. Now, what happens is that when we're trying to recover data, we come across a timestamp and say, oh, we'd better include that because it's going to be useful. And someone asks, well, what does it mean? Sometimes it's clearly the last write time or the last modified time, right? There are some really obvious ones. The message was sent at this point. Well, wait. Is that when the message was sent, or when the message was actually received on the device? Is that when the message was read? All of a sudden these things become more nuanced.

And there was this thing in the Karen Read trial, and sorry if I butcher this, because I wasn't really paying attention, my wife was. There was some data that was recovered where they were talking about the timestamp for browser tab activity, and they were trying to figure out: was this the time the search was performed? Was this the time the tab was activated? Was this the time the tab went into the background? There was a timestamp, and it wasn't clear exactly what it represented, but they were trying to prove or disprove that this was the time a search occurred. Now, if you label something and that label isn't accurate, that's a problem. So it's extremely important to preserve the fidelity of the data as much as possible. You're not taking timestamps and rounding them to the nearest day.

And you're not calling things what they're not. You have to be very explicit and transparent about what that data is. You can keep going through this exercise with different types of data and understand why it's really important. In digital forensics, historically, investigators and examiners are obsessed with hashing things. Absolutely obsessed. Take a cryptographic hash of something, because if a bit changes, they can say, oh, we're no longer talking about the same data. Over-the-top obsessed with it. Part of me gets it, and the other part of me is like, at some point this is getting silly. I'd see people hashing things where I'm thinking, you've already proven it's the same thing through two other mechanisms, why are you hashing it yet again? Seemingly complete nonsense, but I get the motivation behind it.
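To make that concrete, here's a minimal sketch (not our actual product code) of why hashing is such a powerful integrity check: change a single bit anywhere in the data and the digest changes completely.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so even huge evidence images hash safely."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# 's' (0x73) and 'S' (0x53) differ by exactly one bit, yet the digests are
# totally different. That's why examiners hash evidence before and after handling it.
original = hashlib.sha256(b"acquired evidence bytes").hexdigest()
tampered = hashlib.sha256(b"acquired evidence byteS").hexdigest()
assert original != tampered
```

The streaming read matters in practice: forensic images can be hundreds of gigabytes, so you never want to load the whole thing into memory just to hash it.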

So they're obsessed with that kind of stuff, which is a good reminder, as a developer, to understand your customer: that's how much they care about the accuracy of this stuff. There's one more thing I wanted to talk about on that, and I'm forgetting it. It'll come back to me, but data accuracy is incredibly important. Oh, this is what it was. If you keep going through this exercise of talking about the different types of data that get recovered, essentially the product is a big search engine, right? A search engine for digital devices. That means that when we run a search, if you were to do that search again, you should expect the exact same set of data to come back. And if you did it again, you should expect the exact same set of data to come back.

Now, that means if you're making updates to the software and you rerun the same search, you should get the exact same set of data back, or more data, right? You've made updates to it; you'd better not be missing stuff. It should be more data, or the same. The exception is if you're fixing bugs and correcting things. So the accuracy in going from one version to the next, making sure you're not regressing, is critical. Because think about what a regression means here: between one version and the next, now we're missing the timestamps, now we're missing the records that had those timestamps. Isn't that a big problem? Now we're no longer showing data that we know is there. So being bug-free in that regard was absolutely critical for us.
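That invariant, "same search on a newer build returns the same records or more, unless a documented fix explains a removal", can be sketched roughly like this. The function and record names here are made up for illustration; this is the shape of the check, not our real test harness.

```python
def check_no_regression(old_results: set[str], new_results: set[str],
                        known_fixes: set[str] = frozenset()) -> set[str]:
    """Return record IDs the new build lost that no bug fix accounts for.

    old_results / new_results: record identifiers from the same search run
    against the previous and current builds. known_fixes: records the old
    build wrongly reported, which a fix is allowed to remove.
    """
    missing = old_results - new_results      # records the new build no longer finds
    return missing - known_fixes             # anything left is an unexplained regression

# Same search on two builds: finding more data is fine.
old = {"msg-1", "msg-2", "ts-3"}
new = {"msg-1", "msg-2", "ts-3", "ts-4"}
assert check_no_regression(old, new) == set()

# A record disappearing with no documented fix behind it is the red flag.
new_lossy = {"msg-1", "ts-3", "ts-4"}
assert check_no_regression(old, new_lossy) == {"msg-2"}
```

Running a check like this across versions is what turns "you'd better not be missing stuff" from a hope into something a build pipeline can enforce.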

Zero tolerance, right? Just not okay. Now, where were we more tolerant? Again, the disclaimer is that it's not like, oh, we just don't care, we'll go build crappy software. But it was the difference between, do we need to stay up all night so we can hot fix this, versus, the release goes out in two weeks, so we can put the fix in and it'll be ready when we release. Where we were more lenient when we caught bugs was really around workflows that weren't blocking. So if you were interacting with the user interface and you pressed a button and it caused some weird state, but you could still perform the action you needed, it was just kind of glitchy. Or we had two or three paths to do the same thing, for whatever reason, the same thing you're trying to accomplish in the end.

One of those stops working and you can't click through, but you can go through the other paths. We're potentially not going to hot fix that. It's not like we want to ignore it; we're going to fix it, but we're not going to rush and stop everything we're doing. So we had to ask ourselves: is this going to block someone from being able to perform their examination or investigation on this data? If it's not, and it's not going to put them in a state where they're going to misrepresent the data, then we can perhaps be more lenient. Same suite of software we're talking about, but that's something to consider. Certainly regressions were the biggest things. So I'll give you another angle on this.

One of the teams I managed was responsible for doing mobile acquisitions. I was just describing how we would scan through data and structure it, looking for data that's either intact or deleted and recovering it, going from unstructured to structured data. The team I was responsible for, for many years, handled mobile acquisition. You would connect a mobile device, and then we would collect all the data we could from it, as best we could. It's not doing the other part I just described, the search engine; it's basically trying to make as high-fidelity a copy of that device as possible. Of course, in a perfect world, that means a byte-for-byte exact copy of the device, right?

If it's a byte-for-byte exact copy of the device, you literally have a perfect replica of it. Excellent. It's not easy to do, and we were always looking for new techniques to be able to do it, because that's the ultimate best case. But a lot of the time with mobile devices you're not able to, just because of limitations, especially with the hardware. So you would fall back to what we call a logical acquisition, versus what I was just describing, which is a physical acquisition. You would default back to the best logical acquisition you could do, to get as much information off the device so that when we go to search it, we can get as much information as possible. So where am I going with this in terms of bug tolerance?

Well, mobile phones are tricky because there are a ton of them. There are so many device models; it is absolutely insane. And we were a small team, like a handful of people. I can remember when we were literally researching different devices and recovery techniques. We were a small company, and there were times I'd go to the mall needing to spend five grand on phones, and some of the kiosks literally would not let us buy them. They were like, no, we're not selling that to you. We want to buy $3,000 worth of phones from you. Nope, won't do it. In hindsight, I'm assuming that's because they thought we might be criminals trying to do malicious things with phones, so they wouldn't sell us devices.

Like we were coming in with stolen credit cards trying to buy a bunch of phones or something, but we were literally doing the opposite; we were trying to support police officers. So with all these devices, and the ways we were trying to recover data from them, there's a lot of cutting-edge work and a lot of surface area that is physically impossible for us to cover. For example, we might be trying to use a mechanism that lets us do certain things on Samsung devices, but how many Samsung devices are there? There are tons of just Samsung devices, or Qualcomm chipsets, or HTC phones, or Sony phones. The matrix of all of these combinations is nuts.

And there would be times, looking at this matrix, where we'd say, hey, there's a technique we might be able to use to get a good acquisition of this type of device, but we're still not realistically going to be able to buy all of those devices. It would be nuts, one of every model; we can't do that, it's not scalable. So what would we do? We would do our best. We would get some devices to work on, make sure we had the technique in place, and then we would basically open it up as an experimental new feature, with all of this telemetry in place to let us know what types of devices people were putting through those workflows.

And what would happen is we could measure the success rates and see, ah, this new experimental feature we've turned on is working really well for A, B, and C, but if we go back to our matrix, for D, E, and F it's actually terrible. So we'd go, great, we can narrow this down, and then we might need to buy more devices in those other categories for further research. Because it was a new feature, when it wasn't working, it would just fail. It's not a regression; it never worked in the first place. We were basically just potentially opening up new doors for people, and that's somewhere we could be more tolerant of a bug. We're saying it's supposed to work across this group of devices.

If it doesn't, it's not the end of the world. It's not like we gave them something and lied about it, saying, here's data from the device, when it's totally not. It would just end up not supporting the device, and they could try another mechanism that we have. So my point is that for these scenarios, it didn't have to be perfect. And that actually gave us a ton of agility in development. We could do a best-effort thing for these new approaches, pack it with telemetry, and then monitor that telemetry. We could see all the times it was failing on things that should be passing. Sometimes we'd see, hey, they're putting devices through this path that we didn't even think would work, but now we have concrete evidence that it's happening.

Or any combination of that you can imagine, and that let us build confidence or understand where to invest. So that's an example of somewhere we could be more tolerant of bugs, or of being imperfect, because we used that data as a feedback loop. Reflecting on it, it sounds kind of silly, but one of the most powerful things that team had going for us was our ability to be truly agile in that way, given that we had a really good data feedback loop. We could iterate super fast. Oh man, sorry. There's metal on the road, and I made sure I drove around it, but I was pretty nervous. Sharp metal. Anyway, we could use data very effectively and iterate very fast.
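The core of that feedback loop is simple aggregation: group the experimental-acquisition results by device model and compute a success rate per model. Here's a rough sketch of the idea; the event shape and model names are invented stand-ins for whatever the real telemetry payloads looked like.

```python
from collections import defaultdict

def success_rates(events):
    """Aggregate (device_model, succeeded) telemetry into per-model success rates."""
    totals = defaultdict(lambda: [0, 0])  # model -> [successes, attempts]
    for model, ok in events:
        totals[model][1] += 1
        if ok:
            totals[model][0] += 1
    return {model: s / n for model, (s, n) in totals.items()}

# Hypothetical telemetry from an experimental acquisition path:
events = [
    ("samsung-a", True), ("samsung-a", True), ("samsung-a", False),
    ("htc-x", False), ("htc-x", False),
]
rates = success_rates(events)
# samsung-a mostly works; htc-x clearly needs more research (or a device purchase).
assert rates["samsung-a"] > 0.6 and rates["htc-x"] == 0.0
```

Even a table this simple tells you where to spend the next $3,000 of phone-buying budget, which is exactly how we used it.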

For us, we didn't continuously ship, but we had continuous builds. It was a really cool thing for our team, and there are many teams like this, especially in web development, but we were desktop software. The other teams in the company were doing monthly releases, and they had to do all this work to gear up for each one. For us, you could take any of the builds, and this is how it should be, right? If someone said, we want to do a release tomorrow, great, just take the build. Well, don't you need time to prepare? No. If it's built, it's ready. It's done. There's nothing more. We have 100% confidence in whatever comes out of that build system.

Take it every time. That's how that product was built from the beginning. We can't really say that about some of the other ones, so they had more things that had to be in place, which is fine. But for us, that's how it went. And that meant that if we wanted to hot fix something, or do a one-off release, at any point we could say, oh, this is the coolest feature, we don't want to wait a month to release it, we want to release it right now. We could just upload the build; it's ready. And then if marketing wanted to do anything around it, they could. So it was really cool for desktop software; we got to be super agile that way. I'm not using agile to talk about the development process, just literal agility. And that came with telemetry as well.

So yeah, I feel like without the telemetry it would not have worked. The fact that we could release very rapidly was good, but we also had visibility. For a little extra context, you might be saying, well dude, obviously there's telemetry in all this software. No. A little background: in digital forensics especially, some people literally keep their forensic workstations disconnected from any network, so telemetry doesn't make sense a lot of the time. There was this notion that people would be so averse to telemetry that there would be no point ever putting it into a product. In fact, it would scare people away. This is forensic software, and you're putting telemetry in it? Get out of here. Absolutely not. And we actually used this product to prove that wasn't the case.

We were very transparent about what we would send back: just usage statistics about what the software was doing, so we could improve the product. We never sent back actual data being examined or extracted or searched or anything like that, just operational information so we could debug better. So we had a lot of disclaimers and so on, and it was opt-in. Another thing we had was automatic updates, and that turned out to be one of the best pieces of telemetry, because we were told the industry standard is that these machines aren't on the internet, no chance. Well, if that's the case, then having automatic updates makes literally no sense at all, right? And then it was, oh, interesting. How many updates are happening, and how many downloads do we get off the website versus automatic updates through the app? Yeah, a lot more people have automatic updates on.

This light's broken. Damn it. Come on, man. It's already gone through twice. This bozo's got to back up, and the guy behind him's got to move forward. So dumb. Anyway, yeah, we used telemetry a lot and kind of pioneered it for our company. I've told this story on Code Commute before, but we used telemetry really effectively, even with error reporting. There was a really awesome guy we had in tech support. I remember when we hired him; in my career, one of the best hires for a role ever. He's so good at talking with customers. He's not a developer, but he gets the technical stuff, and he can work with people that are pissed off about things and just navigate it. I don't know how he has the patience for it; he's amazing at it.

So we would work with him, and it got to the point where we could tell him we'd fixed issues before customers even reported them, because we could see them in telemetry and because we could build that fast. This guy's running out to press the button. Nice. Because we could build and release fast, if a customer needed a hot fix build, we could just say, it's on the portal, go give it to them. But yeah, overall, I'm going to wrap this up, because I'm waiting at this light and I don't want to keep rambling.

Kind of an interesting reflection, I thought: the same suite of software, with scenarios where there's truly no tolerance for any bugs at all, and other situations where it's not that we were willing to have bugs, but if they came up, we weren't staying up all night to hot fix them. Different scenarios, and that's kind of neat.

So anyway, if you have questions about software engineering or career advice, leave them below in the comments or go to codecommute.com, and you can check out my other channels where I have more software engineering content. Dev Leader is my main channel, with programming tutorials and C# or AI tooling. Then I have the Dev Leader podcast, where I interview other software engineers, and a live stream every Monday at 7:00 p.m. Pacific. And then there's Dev Leader Path to Tech, where I do resume reviews and put together videos on transitioning into a software developer role. So, thank you so much for watching, and I will see you next time.

Frequently Asked Questions

These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.

In what types of software projects is the tolerance for bugs extremely low, and why?
I worked on digital forensic software where the tolerance for bugs was very low because the software is used to structure otherwise unstructured data for legal evidence. Accuracy is critical, especially for things like timestamps, since incorrect data could misrepresent the truth in court and erode trust in the tool.
How do you handle bug tolerance differently when developing features for mobile device data acquisition?
For mobile acquisitions, we dealt with many device models and hardware limitations, so we were more tolerant of bugs in new or experimental features. We would release best-effort implementations with telemetry to monitor success rates and failures, allowing us to iterate quickly and improve support without risking critical errors.
What role does telemetry play in managing bug tolerance and software quality in your projects?
Telemetry was essential for us to monitor how features performed in real-world use, detect issues before customers reported them, and maintain high confidence in our builds. It enabled us to release quickly and fix problems proactively while respecting user privacy by only collecting operational data, not sensitive forensic information.