Working With AI Agents Through Iterative Loops With Guard Rails


In this video, I talk about some of the development I'm doing leveraging AI agents. I've been experimenting more with letting them work through longer-running tasks, setting up guard rails, and then ensuring there's a feedback loop in place.

📄 Auto-Generated Transcript

Transcript is auto-generated and may contain errors.

Hey folks, we'll do a little AI talk today. Just doing a bit of reflection on stuff I've been thinking about at work and in personal projects. So, where to start? At a super high level, there are a couple of things I've been playing around with. I talked a little bit before about how I have a dependency injection discovery-helper kind of framework I'm building. I hadn't noticed this was happening, probably a side effect of not actively developing this kind of stuff at work on a day-to-day basis, but Microsoft had Semantic Kernel, and it looks like they've gone to the evolution of that, which I suppose is the Microsoft Agent Framework. It seems like a very similar setup, but a lot more oriented around orchestrating agents. When I look at the methods and such, it's all similar, right?

Like, it's different but very similar, and I can see it being a natural evolution of what's going on. In my dependency injection scanning tooling, I had support for Semantic Kernel, so I said, okay, time to also add in something for the Microsoft Agent Framework, and it's been cool. I've been sitting with Copilot, chatting back and forth in the CLI, and I started off by saying, hey, we have support for this in Needler (that's the name of the framework). We have support for Semantic Kernel, and there's this Microsoft Agent Framework, so what can we do here? It started off trying to draw some parallels, saying, hey, we do it for Semantic Kernel, we could do the same thing here.

Great. So we build this basis, but as it's implementing it, it's not that it's wrong, but I'm looking at it and it just doesn't feel like it's adding value. So I'm, I don't know, not frustrated exactly, but puzzled: okay, what more could we do here? Because there's no point in having this dependency injection stuff all hooked up if it doesn't really add any value. So I'm in plan mode with Copilot in the CLI, going back and forth and getting it to brainstorm different ideas, and a lot of them are like, oh, we could use this extension method, and I'm like, sure, it's not bad, but so what? That doesn't really belong in this dependency injection thing we're building.

It's just convenient; I could make another library that's like helpers or something. Finally it suggests something interesting. We had some example applications built with Needler and the Microsoft Agent Framework, and it said, well, we defined the agents this way in code; what if instead we used an attribute and did a bit of source generation for some scenarios? And then I had this moment where I was like, "Oh, wait a second." Not just that. I think it was trying to do source generation for a prompt or something, and I was like, you have all these other fields you're setting up at runtime. What if instead we could have an empty class declaration, annotate it with an attribute, and then at source generation time we build out what that is and register the agent?
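
To make that idea concrete: Needler itself is C# and would do this with an attribute plus a source generator, but the shape of it is easy to sketch with a Python decorator standing in for the attribute. All the names here (the decorator, the registry, the agent fields) are made up for illustration, not Needler's actual API:

```python
AGENT_REGISTRY = {}

def agent(name, instructions, model="gpt-4o"):
    """Stand-in for a hypothetical [GenerateAgent(...)] attribute: the class body
    stays empty, and the 'generator' (here, just a decorator) builds the
    registration from the annotation's fields."""
    def register(cls):
        AGENT_REGISTRY[name] = {
            "type": cls,
            "name": name,
            "instructions": instructions,
            "model": model,
        }
        return cls
    return register

@agent(name="summarizer", instructions="Summarize the input text in two sentences.")
class SummarizerAgent:
    pass  # no runtime wiring code; everything comes from the annotation

# At startup, the dependency injection container can enumerate AGENT_REGISTRY
# and register each agent instead of hand-writing the construction code.
print(AGENT_REGISTRY["summarizer"]["instructions"])
```

The point is that the class body stays empty: everything needed to register the agent lives in the annotation, so the registration code can be generated rather than written by hand at runtime.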

So I started going down this path and playing with some things. That's been pretty cool, and I'm doing this brainstorming, conversational loop with Copilot on Needler now: give me some ideas, we'll talk through them, I want to see some example code. Obviously it doesn't know things; it's not a sentient being that's a developer. Sometimes it makes suggestions and I'm like, that's not really actually helping anyone. But when I can see some of the examples and go back and forth, I can see how something ties into our goals with the dependency injection and source generation stuff. Then I sign off on it and have it go build it. It puts it into the example application that's in the repository.

So we can keep building on it and see these things evolve. Then we go back to the drawing board after it's all done and tested: okay, what else can we do? How do we keep making things better? Once those are feeling pretty good, how do we branch into other feature areas of the Microsoft Agent Framework, build those up, maybe get some more ideas, and go back to some of the originals? It's been a pretty cool loop. Why? No good reason; I'm not even using this in production or anything, though maybe I will be able to. This is really so I'm getting practice on this sort of design loop. And honestly, Copilot's building all the code in this repository.

I did so much of the original code to get things working the way I wanted. I've made other videos about how I used AI to migrate it to a lot of source generation support, but since then it's mostly been AI writing the code and me guiding and architecting it. So that's one thing. The other thing I've been doing is making upgrades to my blog, and this stems from some of the stuff we're doing in Brand Ghost. For those of you that haven't watched my other videos, Brand Ghost is a social media and content cross-posting platform that I have. I use it for all of my social media, and we sell it to content creators. One of the things that's really nice about Brand Ghost, from my own experience, is that it's built on everything I've done as a content creator.

It's just systemizing it for others to use. And there are some interesting scenarios here. Some people don't know this, but as a content creator I actually started by blogging; I started in 2013 and gave up on it. When I started getting back into social media and content creation, I did pick up blogging again. And the first thing I did for Brand Ghost, before Brand Ghost was actually a thing, was to start writing tools to help me blog more effectively. For example, using AI to get the ideas and the structure of the blog articles. This was early on in using AI to help with this kind of stuff.

Thinking back on it, it was so convoluted, because what AI produced was such garbage that you'd go, "Cool, I can see why this would help, but it's just not quite there." It would be like using some of the old AI image generators: "Cool, it made a picture, but I can't use this for anything." So now that we're trying to build up our marketing for Brand Ghost as a business, making sure we have blog articles and all of that, it's like, okay, how do we get back to systemizing this? As a content creator, I'm really helpful for Brand Ghost as the expert user: I can share my insights and experiences and help guide what we're doing. And when it comes to writing blogs, that was the first thing I cut when I was getting overwhelmed.

I was like, it's too much effort to write blogs. So I'm trying to force myself to get back into it, and I figured it was time to see what we've got going on. So I'm updating my blog, looking at tooling like PageSpeed Insights to see, since I haven't touched my blog in ages in terms of infrastructure, whether things have degraded and how it's performing, and using AI with MCP tools to do the measurement and then going back into this design loop. Now, the thing I want to focus on for the rest of this video is some of these high-level loops or processes I'm doing. I've seen a couple of headlines for different articles about this kind of thing.

I have a feeling what I'm leaning into here is sort of harness-driven engineering. But I don't actually know all the fancy terms; I don't read as much as I do things. So I'm speculating that what I'm about to describe is probably along those lines, but this is the evolution of my own stuff. I already described this brainstorm loop: see it in the plan, go implement, then go back to the drawing board. But there are some other loops that I think are interesting. The concept I want to share is really about when I'm having AI go off and do work. I think we've all experienced this, right?

Obviously, if you let it go do bigger bodies of work, more tasks, automating a whole bunch of stuff, inevitably it finishes and you're like, "Cool, it did a lot of stuff. Maybe most of it's good, but there's always some junk sprinkled in." There always is. And I feel like this is inevitable; we haven't solved this with people either, right? If you told someone, hey, I want you to go build this, this, and this, and then you don't have any interaction, even if they're someone you really trust, a great software engineer, if you had opinions on anything they were building, by the time they came back and said, "I built this, this, and this," you might look at some of those things and go, "Oh, that's just not how I would have done it," or "That's not exactly what I wanted." It's just reality, right?

It's no different with people. That's why having tighter feedback loops can really help: if you're getting people to build stuff and they get a certain way through and you're syncing with them, you can go, "Oh, cool, yep, that's exactly what we talked about." Or, "Oh, wait, what's going on here? Let's take some time to see if we're still heading in the right direction." With AI, I definitely notice I want to be more hands-off; one of the benefits is that if I can just give it work, great. At work I was trying some things out with investigating live-site issues, and I don't need it to be perfect.

I would love it to be, but I don't need it to give me a result at the end that's definitively, 100% the root cause. That's very helpful and very nice, but I'm also okay if it just pulls a bunch of data, gives me some overall analysis of it, and has suggestions, because that, at least in the work I've been doing, is where the most time is spent on this kind of stuff. The tricky part is that if you're reading through what it's doing, I will catch it doing stuff where I'm like, that's just wrong. I see what you did there, but it is just wrong; you made the wrong decision about something. So how do we find ways to keep improving the next time we go do this?

Right? Because if I were to run the same type of investigation on a slightly different scenario, maybe I get lucky and it does better and doesn't make the same stupid mistake, or maybe it repeats it. So what am I trying to do with a loop like that, or in a development loop where I'm having it build things? Just as an example, at the end of it I go, "Cool, this was great," but it keeps doing certain things. One of the things that was really pissing me off was that it kept making these huge test files and putting regions in to break up the code so you can collapse them. This is in C#. And I'm like, there's no reason we need 2,000 lines of tests in a single file. Just make a new file. You literally segregated the code by regions to indicate these are completely different things.

Why are you doing this? Right? But the reason it keeps doing it is that I'm not telling it not to, and it's not baked into the work that it's doing. We all know we have these instruction files. So what I've been trying to do more of is build flows where I can feel more comfortable letting things go execute, and then I have enough information at the end to go back over it, whether that's reviewing a bunch of commits locally or, in the investigation example I was sharing, looking at data points like the queries that it ran and the charts. Looking over this stuff and having feedback on it is great, but I'm also trying to do a couple of other things. One, and this is the more meta part, is that I'm trying to build another sort of layer.

It's an extra level of analysis: great, you did the work, thank you for doing all of that. Now I want you to go back over all of your work and be the reviewer of it. I want you to do what I've been doing, which is critique it. And I don't mean a code review. Sure, that's part of it; that could be a step. But I don't just mean a code review. I mean things like: you were calling these tools and they were failing, and you just kept moving and decided it was okay that we had a failure. Make note of that. Let's keep going through all of this. You ran a query in some tool and you got zero results back. Does that mean the thing doesn't exist, or does that mean the tool failed?

Do you know the answer to that? You're not sure? Okay, make a note of that. So basically, I have this follow-up flow that I can go run to do an analysis and a critique of all the things it had done in the hands-off stage. Obviously this has flaws too; it's not perfect, because it's again doing more hands-off work, analyzing itself. But what I've found really cool about this is that it does a pretty good job of finding some of the gaps. And it doesn't have to be perfect, because every time we do it and it finds some gaps, it's like, okay, let's get the list of those things and try to make some improvements on them. So I can ask it to recommend the changes we should make to the agents MD file, or call out some of the tools that might be broken that we should go fix.

Call out whether we have test coverage gaps. You can basically do this analysis orchestration and have it look at different areas of what it was doing and report back. From there you have a whole bunch of ideas, and you can ask Claude or Copilot to go implement them. Then the next time you run this really big loop, hopefully it's moving things in the right direction. It won't necessarily be guaranteed to fix problems, but what I'm finding really fascinating with all this is that because we can iterate so fast, if we just keep making incremental improvements in terms of guidance and making the tooling more solid, whatever it is you're having done through this analysis, as long as it keeps getting a little bit better each time, that's great, to me at least.

Could it be more efficient at getting better faster? Yeah, 100%. But if we keep moving in a positive direction, that's great. You will periodically find things where it's just systematically doing a bad job, and you might have to intervene a little more. I'm trying to think of a good example. Early on with this kind of stuff, even before doing these loops, one of the things I was noticing was: cool, it's writing tests, the tests are passing, but in the tests it's writing against code that doesn't actually exist in production. It's like: here's a scenario I think is important. Oh, it's hard to test. Let me make up the code to test against. Now the test passes and we have a scenario.

Great. And it's like, yes, but that's not real. You're testing something that simply isn't real. So you intervene: cool, maybe we need to introduce some code coverage tools so that you have some evidence. And I'm going to come back to this evidence word in a second. Maybe we change some of the structure around how we're guiding things. I'm not someone who is hell-bent on doing TDD; personally, it's just not how my brain works when I'm building a lot of software. But I do find it's been pretty helpful with agents. And even if it's test-driven, that's not going to solve all the problems. I like telling it things like: I want you to report back to me when you had a red test and then it was green after. I want you to try doing a manual mutation in the code that the test is supposed to be covering.

Report back to me; I want evidence. So it's about giving it better guidance, and then also this evidence concept. The point is understanding that some of these orchestration flows, where you're leaving agents to go do a bunch of work, are not going to be perfect. You try to figure out what the guard rails are, but you also use this second level of analysis back over top of it, so that you have another set of eyes critiquing the whole flow. I find that's been interesting. I don't have it perfected by any means, but I am trying to invest more and more time into these concepts. Now, I wanted to come back to this evidence word, because it's pretty neat.

A couple of the scenarios I'm thinking about. One is this live-site investigation tool I was talking about for work, because I noticed that when it was pulling data and making suggestions or giving me its analysis, sometimes it was really difficult, because it would say things and I'd think, that seems very believable. And then if I went out of my way to look for data and found something conflicting, I could go back to it and say, "Hey, you told me this. Where's your evidence for that?" And sometimes it would say, "Actually, I was totally speculating." And it's like, ah, okay. Total speculation is not necessarily zero help, right?

If you're all out of ideas and you're like, "Hey, I here's at least something to consider." That could be super helpful when investigating, but I don't want you to treat it as a factual statement because that's extremely misleading to me. And that can also be extremely misleading to the agent or follow-up agents that are using that data. So, that's one thing. Um, when I was doing some uh iterations on my my blog performance, same thing, right? It was doing like page speed insights. It was saying, okay, like here's where things are slow or fast. Here's some other things to report on. And it was making a bunch of suggestions. And I was doing this a whole bunch. And I started to go, wait a second, like we've already fixed this. Like two iterations ago, we we had fixed this. And so I would say to it, you know, where is where's the evidence of this?

And it's like, "Oh, you're right. That's actually not in the PageSpeed Insights results; that was just something I was thinking up." And these are all new sessions, by the way, so it's not maintaining the context; I should have said that earlier. So it's quite fascinating: it's not that the suggestion itself was a bad idea, it's that it was presented as a fact. "You have this gap. We need to go fix it." And it's like, but I don't. When I push back, it's like, "You're right, I don't know if you do, but I thought that might be a helpful thing." So that's the second part, and I'm going to tie it all together.

The third is, again, when building actual code: it says it's finished testing or whatever, but sometimes I go run the application and it crashes on startup. I'd say, you told me you ran the tests, and if you ran the tests, it shouldn't crash. And it's like, "Oh, I did, but I only ran these two tests." Or, "Yeah, I did, but I only wrote tests for these things over here, and I didn't test this stuff over there." So it said we were all done, it said we had all the test coverage, but we don't. When you push back and start asking it for evidence for the claims it's making, a lot of the time it's like, "Oh, you're right.

I don't have that." And in my experience so far, which is of course limited (I haven't been doing this for years), it seems like this problem gets worse and worse when you're chaining agents together, because they're not doing a good job asking these questions back, if that makes sense. They go, "Oh, the agent before me said X. I guess it's X," and then they build on top of it, and a false assumption early in the whole flow can just derail things. So to tie all these things together, one of the things I've been trying to do, and it's especially helpful when you run this analysis loop back on top of it, is to have it emit data as it goes through the orchestration. For example: when you're collecting data on things or making suggestions, whatever it is, I want you to emit the evidence for it.

You made this claim? Cool, what is it based on? Show me the query you ran; give me something about the results. Then, inevitably, when I'm looking at the result at the end and I go, "That's not right," I can see where it made the error. Or, depending on how you want to do this kind of stuff, you can say: it is just an assumption, so at least that's tracked the whole way through. "This is based on an assumption; it's not grounded in evidence, but based on what I was focused on, here's an idea." That way, downstream, whether that's another agent or me, at least we know. It's not, "Wait a second, someone made this claim that seems pretty critical. Where's the data? Let's go back and find it."

So I find that's been working really well. Another flavor of this, for emitting data beyond just evidence: some of these orchestration runs take a long time, so where is it spending time? In the one I was doing at work, some of the queries were outrageous; it's trying to analyze traffic patterns for millions and millions of requests, billions in some cases, over some ridiculous time period. And sure, maybe you do need to do that, but it's doing ten of the same call. Should we cache that? Or it's doing a call that times out, then trying again, and it times out again before it gives up. So having some data to talk about the performance characteristics is super important.

I need to do this; actually, I have it instrumented, but I haven't gone back and reanalyzed it. In the stuff I was doing for blog performance, it's not even just the PageSpeed Insights call that takes time (Google's doing the collection); the analysis after that was taking like 15 minutes for an agent, and it's like, what are you possibly doing? It's literally already done the data collection. So instrument those steps to emit data that gives you ideas about performance, and then you can go back over it, either yourself or with help from AI, and optimize. You got to let me in.
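
The instrumentation I'm describing can be sketched as a wrapper that emits a timing record per step and caches repeated identical calls, so the same expensive query isn't re-run ten times. The step names and the fake query are illustrative stand-ins:

```python
import time
from functools import wraps

TIMINGS = []  # (step_name, seconds) emitted as the run progresses
_CACHE = {}

def instrumented(step_name):
    """Wrap an orchestration step: record its duration, memoize repeats."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args):
            key = (step_name, args)
            if key in _CACHE:  # repeated identical call: serve the cached result
                TIMINGS.append((step_name, 0.0))
                return _CACHE[key]
            start = time.perf_counter()
            result = fn(*args)
            TIMINGS.append((step_name, time.perf_counter() - start))
            _CACHE[key] = result
            return result
        return wrapper
    return deco

@instrumented("traffic_query")
def traffic_query(window):
    time.sleep(0.01)  # stand-in for an expensive analytics query
    return f"stats for {window}"

for _ in range(3):
    traffic_query("last_7_days")  # only the first call pays the cost

slowest = max(TIMINGS, key=lambda t: t[1])
print(slowest[0], len(TIMINGS))
```

With the timings emitted, the follow-up analysis pass has real numbers to point at ("this step ran three times, two were cache hits") instead of guessing where the run went slow.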

Thanks very much. So that's been super helpful in terms of supplementing this analysis step that happens after the fact, when I'm trying to improve for the next time: how do I give it better data to make more informed decisions? When I have an orchestration to analyze what the first orchestration did, I can say: I want one of the agents to focus purely on performance. I want one of the agents doing this analysis to focus purely on whether we had the right evidence. I want one of the agents to focus on whether we're missing tools, or whether we need better guidance in the agents MD; are there gaps? Giving it better data to work with helps it make better decisions, or at least better recommendations.
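
That second-pass analysis orchestration might look roughly like this: one reviewer focus per agent, all reading the first run's artifacts and reporting back recommendations. The focus prompts, the run-log shape, and the injected `ask_agent` callable are all hypothetical; in a real setup `ask_agent` would call Claude or Copilot:

```python
REVIEW_FOCUSES = {
    "performance": "Where did the run spend time? Flag repeated or timed-out calls.",
    "evidence":    "List every claim that is not backed by a query and result.",
    "tooling":     "Which tool calls failed or returned zero results? Are tools broken?",
    "guidance":    "What should change in the agents MD to prevent repeat mistakes?",
}

def review_pass(run_log, ask_agent):
    """Run one reviewer per focus over the first orchestration's artifacts.
    ask_agent(focus_prompt, run_log) -> list of recommendations; it's injected
    here so the loop itself stays testable without a live model."""
    report = {}
    for focus, prompt in REVIEW_FOCUSES.items():
        report[focus] = ask_agent(prompt, run_log)
    return report

# Fake agent for illustration: just acknowledges what it was asked to review.
fake = lambda prompt, log: [f"reviewed {len(log)} events for: {prompt[:20]}"]
report = review_pass(["event-a", "event-b"], fake)
print(sorted(report))
```

Each focus produces its own list of improvements, which becomes the work queue for the next iteration of the big loop.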

So these are some of the concepts I've been playing around with and trying to refine, but obviously, like I said, it's certainly not perfect. Will it ever be? Probably not. But it's quite interesting, because the goal state for me would be having agents that run and can build my code out, but in a way that models what I want. Not only the behavior, obviously; I want stuff built that meets my expectations in terms of functional and non-functional requirements, but I'm also interested in the architecture. I do care about how things are split up, because I might have plans later to move some of that around, and I don't want to be in a position where I'm talking to an agent for a week trying to refactor or rewrite things. I want to avoid those types of problems.

So if I can guide it early on around some of the philosophies I have in software development, and it starts building things out in the way I expect and sticks to it, that would be ideal. We're not there yet, which is okay. But part of me thinks a critical aspect here is really having this feedback loop. Because let's pretend I wanted to go build a whole new thing from scratch, whatever the thing is. And I'm sitting there going, "Okay, I know that to make this work effectively, I have to go write a really good agents MD file, try to get down all of my thoughts, follow these patterns, do these things, and so on.

I've got to get it rock solid so that when it starts, it's good." Inevitably, and I can say this with a 100% guarantee, I will have missed some things that I care about. It's just reality. And even if I could magically write them all down, there are going to be situations where the agent deviates from them, because it was doing the thing that seemed to make the most sense in that context. So having feedback loops built in, where it's like, "Hey, look, you took these actions. Does that actually align with what we need?", I think they're necessary. And I'm hoping that over time, what I really don't want to do is start from scratch on my engineering philosophies every time I start a project. I mean, they might change from project to project.

They'll evolve over time. It would be super cool to be able to just define this stuff so that the next time I work on a project, it starts with everything I've tried to document before to guide agents, and for every new project I have some type of loop where I can say "go build this stuff" with a higher and higher degree of confidence. This stuff is definitely moving very fast and it's very fascinating. I'm excited to keep playing with it and learning about it, but it's also tricky to know: are there things I'm trying to solve for in my workflow where ultimately the problem itself kind of goes away? Maybe. I don't know yet. But I think it's good practice and good exploration. So those are mostly my thoughts on what I've been doing recently.

Some type of harness, if you will, or iterative approach. The tricky part is making everything I just talked about automatic. Do the work with your orchestration; run this thing after that evaluates how the agents performed in the orchestration; have some way to score it to make sure we're improving (again, what does that look like? No idea); have the set of improvements, implement the improvements for itself, and then the next time it does a batch of work, see whether we're getting better at doing this kind of thing. So, I guess, is that called an eval? I don't know; I don't know what the right terms are. But having something like that wired up, where every time it's doing work it just automatically gets a little bit better, is where I'd love to head.

So, anyway, it's been a lot of fun. I've got to do a few more videos; I haven't been making them. I've been pretty swamped and kind of regret it. Even for Code Commute: as I'm filming this, it's Wednesday morning and I haven't even posted Code Commute videos, and I have a bunch recorded. I've just been kind of exhausted from other things. But we'll get them uploaded. And I've got to record some Dev Leader videos so I can show you some of the stuff I'm trying to build out. It's a little easier to talk through some of it when there are working examples and not just me in a car waving my hands and talking. So we'll see how that goes. But yeah, we'll wrap it up there. I'm just getting to work.

If you have questions or thoughts, leave them in the comments, of course. Otherwise, if you want to be kept anonymous, just go to codecommute.com; there's a form there that you can submit. If you do submit questions that way, please include as much detail as you're willing to provide: the more detail, the easier I find it is to write an answer or make a video response that's helpful. When it's pretty generic, it's pretty challenging to give advice that I feel is helpful specifically for you. So that's my ask, but I do appreciate you being here. Yeah, my other YouTube channel is Dev Leader. That's my main one, where I'm not in a car and there are edited videos, tutorials, stuff like that, especially around how I'm using AI and building stuff in C#.

And then I do live streams on the Dev Leader Podcast YouTube channel, every Monday at 7 p.m. Pacific. So, hope to see you on one of those. Take care. I'll see you in the next one.

Frequently Asked Questions

These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.

How do you use AI to improve software development through iterative feedback loops?
I use AI agents to build code and then run analysis loops where the AI critiques its own work, identifies gaps, and suggests improvements. This iterative process involves reviewing commits, checking test coverage, and refining guidance to gradually improve the quality and alignment of the code with my expectations. The goal is to have AI incrementally get better at following my engineering philosophies and producing functional, well-architected software.
What challenges do you face when letting AI agents work hands-off on complex tasks?
When AI agents work hands-off, they often produce results that are mostly good but include some errors or assumptions without evidence. For example, they might make wrong decisions, generate overly large test files, or produce code that passes tests but doesn't reflect real scenarios. To manage this, I implement tighter feedback loops and have the AI emit evidence for its claims so I can review and correct mistakes, ensuring better outcomes over time.
How do you incorporate evidence and performance data into AI agent workflows?
I require AI agents to emit evidence for their claims during orchestration, such as queries run or results obtained, so I can verify the basis of their suggestions. Additionally, I instrument the workflows to capture performance data like timing and resource usage, which helps identify inefficiencies or redundant operations. This data-driven approach allows me to analyze, critique, and optimize the agents' work, leading to more informed decisions and continuous improvement.