Got skill issues like me? No worries -- we don't need to be the most amazing prompt sorcerers in the universe if we can do a better job of keeping our agents on track with other guard rails!
Auto-Generated Transcript
Transcript is auto-generated and may contain errors.
Hey folks, I'm just headed to CrossFit here. The camera seems zoomed in on my head too much -- something weird is going on there; we'll figure it out later. I'm going to talk about some AI software development from over the weekend, and a direction I've been trying to push some stuff in to see if it helps. I think I have enough content from the weekend that I'll make some YouTube videos on Dev Leader, which is my main YouTube channel, and actually walk through code to explain some of this. This is going to sound more specific to .NET developers, but depending on the language and infrastructure you're using, there are probably some very similar things you can explore.
Basically, I'm getting a little fed up with the only feedback being -- whenever LLMs and agents are hallucinating or building crap code -- that it's purely a skill issue. And it's always the same people who have apparently been building AI systems for 20 years, who are experts, who haven't had to manually type a line of code ever because their agent army does everything perfectly. So: skill issue, sure. But what I'm finding frustrating is that with the same set of guard rails in place -- if you're using Copilot, your copilot-instructions -- and prompting seemingly the same way, results still vary. I'm not saying my prompts can't be improved; I'm sure they can be. But I don't think that's the only thing going on.
You can go from working with an agent that produces code where you're like, heck yeah, that's pretty sweet -- to things degrading. The example I was thinking through from the weekend: working with Copilot and all of a sudden it just starts deleting code, or iterating on a file with it and all of a sudden it stops remembering how to do any formatting. It's making the edits, but out of nowhere there's no spacing between the lines and no indentation. So pardon me, but what did I change about my prompt mid-conversation that has the agent going: ah, you know what, screw it, we don't need any formatting? Stuff like this happens that's pretty stupid, and given how much I use these tools, it happens often enough that it's frustrating. And yes, we can improve our prompts. More recently -- especially because I do a lot of work through GitHub Copilot, where the agents run in the cloud -- I've been using custom agents. I think you can do this in VS Code, but in Visual Studio I don't think you can, and I'm kind of frustrated with Microsoft about that. Yes, I work there, and it's still frustrating for me: I use Visual Studio Enterprise, the one that costs money, and the feature set is lagging behind Visual Studio Code like crazy. Unless I'm just missing something -- and I'm on Visual Studio Enterprise insiders, so it's the latest features -- you can't use custom agents there.
Like, why? But anyway -- I use this in GitHub Copilot online, so having a custom agent alongside my copilot-instructions has been saving my sanity, and then on top of that I try to prompt better. I'm hoping all of this gives it more and more guard rails to do less and less dumb stuff. Another example that comes to mind -- probably a bit more substantial than silly formatting stuff -- is tests. In BrandGhost I have an established pattern, and this is not for you to decide whether it's good or bad or right or wrong; it's just the pattern in the codebase: when you are testing things, we resolve the system under test from our dependency container. The reason I'm doing this is that I'm trying to shift as much as possible away from unit tests that are just mocked services.
Not that I have a hatred for that -- I actually love writing code that is unit testable that way. But in every test file, your system under test gets resolved from the dependency container. My logic is that as things evolve over time -- more or fewer constructor parameters, that kind of stuff -- we're touching mocks, constructors, and test setup less. I want to minimize that effort so that you're using as many real things as possible, because we can: everything is already hooked up to use a database when it's running, and it works perfectly fine. The only things I really want to mock are external calls to services I don't own, or third-party library boundaries -- things I don't have control over, because I'm not testing those.
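To make this concrete, here's a minimal sketch of what that pattern can look like in a test. The type names (`TestServices`, `IArticlePublisher`, `FakeEmailClient`, and so on) are hypothetical stand-ins for illustration, not the actual BrandGhost code:

```csharp
// Illustrative sketch only -- these types are hypothetical stand-ins,
// not the actual BrandGhost test code.
using Microsoft.Extensions.DependencyInjection;
using Xunit;

public sealed class ArticlePublisherTests
{
    [Fact]
    public async Task PublishAsync_PersistsArticle()
    {
        // Build the same DI container the application uses; only
        // third-party boundaries get swapped for fakes.
        await using var services = TestServices.Build(s =>
            s.AddSingleton<IEmailClient>(new FakeEmailClient()));

        // Resolve the system under test from the container instead of
        // new-ing it up with a pile of hand-built mocks.
        var publisher = services.GetRequiredService<IArticlePublisher>();

        await publisher.PublishAsync(new Article("Hello"));

        // Real wiring (including the database) was exercised, so we can
        // assert against the real store.
        var store = services.GetRequiredService<IArticleStore>();
        Assert.NotNull(await store.FindByTitleAsync("Hello"));
    }
}
```

The payoff is that when a constructor gains or loses a dependency, tests written this way don't need to change -- the container wiring absorbs it.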
So what does Copilot do? Despite my documentation in the repository saying one thing, my prompt saying to follow the patterns of the functional tests in the codebase, and my agent having the same instructions, it will go write tests and then not do that. So I guess any time I'm telling an agent to go write code, my prompt must literally contain an example of how to do it -- even though the documentation exists, and the prompts suggest reading the documentation and following existing patterns. It just doesn't. Part of me is like, okay, with this type of thing, any time I tell an agent to do work, you're telling me that I personally, as the prompter, have to tell it every single possible pattern I want it to consider. I get it.
It probably does help, but do you see how ridiculous that sounds? You're telling me that for any coding pattern that exists in my codebase, if I want it done right, I have to pass a snippet of that code into the prompt. It's just not scalable. And I get that if I'm having a back-and-forth with an agent, at that point I do have to tell it: here's an example, follow this, because clearly you're not -- and I'm sure it'll get back on track; that's how I've been correcting it. But it's frustrating, because I'll kick off a bunch of work to get done overnight, and then I wake up and review the code, and it's always things like this, right?
Sure, there are some things where I'm letting the agent explore and I'm not expecting it to get them right. But there's other stuff -- for example, today I asked it to go add some telemetry, and the actual code change is perfectly fine, it's great, but the tests are not. And this is a recurring thing. So now I can't have that code landed and running, which delays it even more. I would really just like it to get that part right, because it's a simple thing: I know that with one message to the LLM -- "that's not how we test; here's an example of how we test" -- it will do it. But why couldn't it do that from the beginning, despite all of these attempts to say, go code this way?
So where I'm heading with all of this: I'm starting to add analyzers to the code. If you're not familiar with .NET, we have Roslyn, and we can write custom analysis rules that essentially do static analysis for us. You can write some pretty complicated stuff to enforce rules, which is really neat. I've known about these for ages, but I've never done it -- I've talked about this on Twitter in different comment threads too. Source generators and analyzers are two things in .NET that I've unfortunately spent zero time on. It's one of those things where I'm like, that sounds really cool and seems helpful, but I've got other stuff to do, and it's never really come up where I depend on it.
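For anyone who hasn't seen one, the rough shape of a Roslyn analyzer looks like this. The rule here -- flagging type names ending in "Manager", with a made-up `BG0001` id -- is purely for illustration, not one of the actual rules in my codebase:

```csharp
// Illustrative Roslyn analyzer sketch; the rule and diagnostic id are
// made-up examples, not actual BrandGhost rules.
using System;
using System.Collections.Immutable;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Diagnostics;

[DiagnosticAnalyzer(LanguageNames.CSharp)]
public sealed class NoManagerSuffixAnalyzer : DiagnosticAnalyzer
{
    private static readonly DiagnosticDescriptor Rule = new(
        id: "BG0001",
        title: "Avoid 'Manager' suffix",
        messageFormat: "Type '{0}' should not end in 'Manager'",
        category: "Naming",
        // Severity can be info, warning, or error.
        defaultSeverity: DiagnosticSeverity.Warning,
        isEnabledByDefault: true);

    public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics
        => ImmutableArray.Create(Rule);

    public override void Initialize(AnalysisContext context)
    {
        context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
        context.EnableConcurrentExecution();

        // Runs against every named type as the compiler sees it.
        context.RegisterSymbolAction(ctx =>
        {
            var symbol = (INamedTypeSymbol)ctx.Symbol;
            if (symbol.Name.EndsWith("Manager", StringComparison.Ordinal))
            {
                ctx.ReportDiagnostic(
                    Diagnostic.Create(Rule, symbol.Locations[0], symbol.Name));
            }
        }, SymbolKind.NamedType);
    }
}
```

Once a project references the analyzer, violations show up in the IDE and in `dotnet build` like any other diagnostic.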
But that time was this weekend. I said: this kind of thing is happening too often, where there are patterns in the codebase that are just completely being missed despite them being literally everywhere, and it's time to do something about it. With analyzers, we can write essentially custom rules so that when your code compiles, it gets evaluated against the rules you set. And you can have different diagnostic severity levels -- info, warning, or error, that kind of thing. So I started putting a bunch of these in. The way I approached it: GitHub has Copilot Spaces, and because BrandGhost is a multi-repository offering -- our front end and back end are in different repositories, for example, same with our blog -- in GitHub Copilot Spaces we can add different repositories as sources of information. Then when I chat with the LLM, it's kind of like chatting with ChatGPT, but it has full context on my codebase because it can literally use the repositories I've linked up. So I had a conversation with it the other night and basically said: I want to start putting analyzers into the repository, specifically for the backend code, because that's in C#. What I'd like you to do is look at the copilot-instructions, look at the custom agent file, look in the documentation folder, and then also analyze the code. And this is such a generic thing to say, "analyze the code," but: analyze the code, and look for common patterns that are reused across the codebase.
Give me a bulleted list of all the things where you think having analyzers in place would help enforce these patterns. And then I tried to be specific: don't go try to write the analyzer code in this chat; just give me the list and your reasoning for why you feel each one is an impactful area to put an analyzer over. It spat out like 20 things. By the way, I've repeated this exercise, and after the initial set -- where I implemented a bunch -- it's given me even 30 things. So it's pretty interesting; it goes pretty deep. It did a really, really good job, even though someone who listened to that prompt would probably argue it's a terrible way to do it.
Like, you're going to get results from all over the place -- what does this even mean? I don't know; I'm just experimenting with it. But it did a good job and called out a lot of areas where I've literally seen pull requests where I'm fighting with Copilot like, "Dude, just look at the file beside this one -- you completely made up new patterns. Don't do that." So I picked 8 to 10 analyzers. I was doing this in bed on Saturday night: I sat there for half an hour on my phone filing GitHub issues, which would have been way more effective at a computer, but I was already in bed and too excited to go to sleep until I did it. I fired off 8 to 10 of these things and had it go write custom analyzers for me overnight.
Now, they weren't perfect, and I wasn't expecting them to be. I've never even written an analyzer, so I don't feel like I'm in a good seat to tell Copilot how to do it right or wrong if I've never done it before. It made 8 to 10 of these pull requests, and I checked them out one by one and looked at what each was trying to do. (By the way, I still need to go properly understand how analyzers work and get set up, but now I have a bunch of examples to play with.) And I tried them out, right? If you change some of the code that the analyzer is supposed to catch and try to build, it should catch it.
It should underline it in Visual Studio and say, "No, no, no, that's not right -- here's why." So I played around and did some iteration back and forth. I think I've finished all but one so far; that one's still in progress because it's way more complicated. But yeah, it was really cool. It was like, "Hey, here's a scenario you're missing." Or I would pull down the pull request branch and none of the code compiles, and I'm like, hold on -- this is actually a totally valid code snippet, and the analyzer is missing that scenario, or it's flagging things it shouldn't. So it went both ways: some things were missed, and some things were aggressively included.
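For what it's worth, this kind of include/exclude iteration can also be captured as tests instead of manual checks. The `Microsoft.CodeAnalysis.Testing` packages support something like the sketch below, where `NoManagerSuffixAnalyzer` and the `BG0001` id are hypothetical stand-ins for whatever rule you're verifying:

```csharp
// Sketch using the Microsoft.CodeAnalysis.Testing packages;
// the analyzer under test (NoManagerSuffixAnalyzer) is a hypothetical
// stand-in for one of your own rules.
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Testing;
using Microsoft.CodeAnalysis.Testing;
using Xunit;

public sealed class NoManagerSuffixAnalyzerTests
{
    [Fact]
    public async Task FlagsManagerSuffix()
    {
        // The {|BG0001:...|} markup marks where the diagnostic is expected.
        await new CSharpAnalyzerTest<NoManagerSuffixAnalyzer, DefaultVerifier>
        {
            TestCode = "public class {|BG0001:OrderManager|} { }",
        }.RunAsync();
    }

    [Fact]
    public async Task IgnoresOtherNames()
    {
        // No markup means no diagnostics are expected anywhere.
        await new CSharpAnalyzerTest<NoManagerSuffixAnalyzer, DefaultVerifier>
        {
            TestCode = "public class OrderService { }",
        }.RunAsync();
    }
}
```

Each "here's a scenario you're missing" exchange with the agent can then become another test case, so the fix sticks.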
I just iterated with it -- here's an example to include, here's one to exclude, write the test -- and we went back and forth. Now that I have a bunch of examples, I could probably go do some of that myself. But it's an interesting crossroads, because I now have a bunch of analyzers in my code -- I think we're up to 24 custom ones in total -- and I didn't write any code for them. I don't actually know in practice how they work. I understand conceptually, but in practice, I didn't do it. I want to go learn; I'm genuinely curious about analyzers and source generators. But at this point, there are like 24 of them and I don't know how they work. Pretty neat.
The idea behind all of this is that now that I have these analyzers in place, I'm hoping to see dramatic improvements in following standards. The reason: the code literally will not compile unless you do it the way I say. I asked Copilot kind of the same thing last night, lying in bed: now that I've added all these analyzers, re-perform your analysis -- what are the gaps? I didn't make it much further than asking, because I had CrossFit super early in the morning and wasn't staying up for it. That's a when-I-get-home thing, and then we'll put it to the test. I don't know -- maybe I prioritized all the wrong things.
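The "will not compile" part is configurable per rule, by the way. Even if an analyzer ships its rule as a warning by default, the severity can be escalated in `.editorconfig` -- `BG0001` here is a placeholder id, not an actual rule of mine:

```ini
# Illustrative .editorconfig fragment: escalate a custom analyzer rule
# so violations fail the build instead of just warning.
[*.cs]
dotnet_diagnostic.BG0001.severity = error
```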
But the whole point of this was guard rails. I'm hoping that having more programmatic guard rails in place will make working with agents a little less frustrating on the stuff I feel it should just get right, because the patterns are everywhere. It almost feels like it has to go out of its way to introduce new patterns instead of taking the easy route, which is: look beside you. I assume it's because it's trained on so much data that whatever I have is not the common thing. Pretty interesting stuff. Like I said, I'm going to make some YouTube videos on this on my main channel, Dev Leader, and walk you through some of these things. And when I'm ready -- when I've learned how analyzers work and feel like I can talk through them -- I'll make some videos on that too.
So yeah, that's all. If you have questions you want answered, leave them below in the comments. Otherwise: codekim.com -- submit anonymously. Bye.
Frequently Asked Questions
These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.
- Why do AI coding agents like GitHub Copilot sometimes produce poor or inconsistent code?
- I find that despite using the same prompts and instructions, AI agents often hallucinate or produce bad code, like losing formatting or deleting code unexpectedly. It feels like the agent sometimes ignores established patterns or instructions, which is frustrating because it means I have to constantly correct it or provide explicit examples in my prompts.
- How can custom analyzers help improve code quality when working with AI agents?
- I've started adding custom analyzers in my codebase to enforce coding patterns programmatically. These analyzers act as guardrails by performing static analysis and preventing code from compiling if it doesn't follow the rules I set. This helps reduce the amount of dumb mistakes AI agents make by ensuring the code adheres to our standards before it runs.
- What challenges do you face when prompting AI agents to write tests following your project's patterns?
- Even though I document the testing patterns and include instructions in my prompts, AI agents like Copilot often write tests that don't follow those patterns. This means I have to provide explicit code examples in the prompt to get it right, which isn't scalable. It's frustrating because the agents don't always read or follow the existing documentation or codebase conventions on their own.