Agentic harnesses are REALLY proving just how powerful they are when you try to recreate them in code.
Transcript is auto-generated and may contain errors.
Hey folks, we're going to talk about AI because we do that. Um, yeah, I figured I'd talk about some of the stuff I'm playing around with on the AI side of things. Um, maybe some ideas for some videos. Um, this wants me to turn here. What the heck? Sorry, I have I'm on my way to a doctor's appointment instead. And so this is telling me to go a different route than I was expecting, but maybe that's cuz of traffic. Um, so yeah, I've been doing a bunch of stuff with uh I guess like Evals is the best way to put it. And so trying to do a better job of putting together some like agentic workflows, but like one of the challenges is of course that when you're using LLMs, they're non-deterministic. So, we end up finding ourselves in situations where we're trying to build something that's repeatable because we're building software systems and then you're like, "Cool, like this is working.
I want to enhance it. I want to change it." Whatever. You make an adjustment and all of a sudden everything breaks and it's like, "Okay, well, what the hell?" Then you realize a small prompt change ends up causing some type of cascading failure in terms of the LLM's ability to do the right thing. And it's more nuanced than that, too, because it's not just here's a prompt, let me go fire it off. It's not just you're in Copilot CLI or Claude Code and you're going back and forth with a skill or something, because the harnesses are so good. They're just so good at trying stuff. Even if there's a script, they're like, "Oh, you know, I ran this script and it errored. Let me just modify the script or let me make a copy of it and fix it and keep going." And so, that kind of shit's awesome.
But when you're trying to do it more programmatically, without some of these kick-ass harnesses, it's a little bit more difficult. And so I don't know how far off we are. Maybe it's because of the libraries and stuff I'm using, but I would love to have the equivalent to what feels like Copilot CLI or Claude Code as a harness inside of my code. And I know there's the Copilot SDK. I know there's Microsoft Agent Framework, and I'm using both of these things, but I'm not going to be shipping code that relies on Copilot CLI. To me, that doesn't make sense. And so I need a harness that's like that without shipping Copilot CLI. And by the way, you'll hear me talking a lot about Copilot CLI, not because I'm against Claude Code or anything like that.
If you're new to my channel, just a disclaimer, right? I'm not against Claude Code. I'm not like, "Everything Copilot CLI is the best." I just happen to be using it. I have used both. I started with Claude Code before Copilot CLI, but I just happen to gravitate towards Copilot CLI, and it is what it is. It's kind of where I'm at. So, if that bothers you, I'm so sorry. But I'm trying to be very transparent that I'm not married to one of these harnesses. It's just I happen to be using Copilot CLI. I've also figured out my token limit issue, conveniently, thanks to being a Microsoft employee. So I'm navigating that with Copilot CLI a little bit more effectively. So given that, I'm able to build skills and workflows this way where the harness works super well, and now I'm trying to get to a point where I can do it programmatically.
You can cross the road, buddy. There you go. I'm facing a lot of challenges because the harness is so good and because the libraries I'm using are just not replicating that harness capability as well. So that means that I'm spending an awful lot of time tuning these things. I'm trying to... Not even on the highway, really. This is crazy. It's going to be the weirdest drive. Yeah, a lot of time. I have a skill that does sort of an orchestration of some agentic flows and, like I said, it works super well in the harness. I can literally one-shot stuff, walk away, come back, and feel very confident about what it produces. And sorry, I'm not telling you this to brag about what it does. I'm telling you this to say that the harness supports really complex things.
So, it can orchestrate multiple agents in a pipeline. They'll do like fan out. They'll come back together. And I would love to sit here and claim it's because I'm I'm so good at prompting and stuff like no. Uh this is just basically going through workflows um having copilot CLI kind of look at what it was doing and then saying make a skill for this and then rerunning it you know and then over iterations and iterations kind of refining it more to be like wait we can do these steps in parallel blah blah blah. So yeah, going from the harness running the skills to something more programmatic. You're not going to take your advanced degree, buddy. That's awkward. Um, it it's just it's night and day.
And so I have tried. Honestly, it's been over a month of my outside-of-work dev time being dedicated towards trying to migrate some of this stuff, and I'm at the point where I feel like I'm losing my mind because the overall progress seems like it's negligible. A month of coding stuff outside of work for something that's already working in a harness, to try and replicate it. And I have to keep... I have to sneeze so bad. I'm sorry. I have to keep reminding myself that it's not just about the final outcome of this thing. There's a lot of stuff that's happening along the way. And one of those things has been, back to this idea of evals. So to back up a little bit, as I've been doing some of this work to port stuff over from the harness to a more codified view, we have things like tools, right?
So you can give agents tools. So the harness is able to go call... let's say a skill is kind of like a tool, right? So you can give these harnesses these skills to go use, which can go call scripts, whatever. Even, you know, read file, write file, these are tools to go interact with the file system. Web search. But these... the sneeze is right there. Come on. Oh, there it is. I'm so sorry for sneezing into the mic. Had no choice. There's another one ready. Maybe not ready, but it's going to sneak up. In the codified harness, right? So, not Claude Code, not Copilot CLI, but doing this in C# with Microsoft Agent Framework, these tools don't exist, right? Like they're not... My goodness. I'm so sorry. I don't know. I can't mute the mic as I sneeze.
I probably could, but it's a lot of work. It's not that the tools are impossible to make. In C#, it's pretty straightforward to go read and write a file. The challenging part is that you do have to go make the tools. So finding what tools to go make, and then the reality is your tools are going to work differently than the tools in the harness, right? For example, if I wrote a crappy version of an edit file tool, or a write file tool, then the way that the LLM might use that is that anytime it wants to write a file, it's basically going to say, "Here's the entire contents of the file. Let me go blast it in." Okay. Now, what happens when the model that you've picked only has so much context and it can't write the contents of a file in one tool call?
Well, you'd say, well, duh. Just it will it will do multiple calls to the tool. Yeah, but if your tool only writes a file and doesn't support appending, then then your tool's not going to work, right? So, this is a pretty simple example. You say, "Okay, Nick, that's really simple to go make another tool that appends." Okay, so now you have what? A a write file, a append file. What about like is there a way that you could do um you know more surgical insertions or surgical replaces? Right? So like the point is that these really trivial things that we very much take for granted when we're using a harness. these very trivial things like we have to go build all of them and then because they're so simple sometimes we just forget that like how you go to use them may look different so
the, you know, file editing is just one example like that. Then we layer on some constraints. I have some situations where certain files, I don't want going over a certain size. So how do I enforce that, right? I feel like with my harness, if I talk to Copilot CLI or Claude, if I was like, hey, go do some types of edits like this, but I don't want this file to go over a certain size... I get that it's not programmatically enforced, or maybe it would write a script so that it is programmatically enforced, but the point is that the harness is good enough that I have a lot of trust that it will do the right thing the overwhelming majority of the time. And I don't have that trust or that reinforcement when I'm doing it in code. So these seemingly trivial things, stacked up, continue to make my life a living hell.
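To make the file-tool problem concrete, here is a minimal sketch of the three tools mentioned above (write, append, surgical replace) sharing one size limit. It is written in Python for brevity rather than the C# used in the video, and every name in it (`Workspace`, `FileTooLargeError`, `max_chars`) is hypothetical, not something from Microsoft Agent Framework or any harness.

```python
class FileTooLargeError(Exception):
    """Raised when an edit would push a file past the configured size cap."""


class Workspace:
    """Hypothetical in-memory workspace: file paths mapped to text contents."""

    def __init__(self, max_chars: int = 2000) -> None:
        self.files: dict[str, str] = {}
        self.max_chars = max_chars  # assumed per-file cap, measured in characters

    def _store(self, path: str, content: str) -> None:
        # Single choke point, so every tool enforces the same cap.
        if len(content) > self.max_chars:
            raise FileTooLargeError(
                f"{path} would be {len(content)} chars; the limit is {self.max_chars}"
            )
        self.files[path] = content

    def write_file(self, path: str, content: str) -> None:
        """Replace the whole file. Only works if the content fits in one tool call."""
        self._store(path, content)

    def append_file(self, path: str, content: str) -> None:
        """Add to the end, so a long file can be built across several tool calls."""
        self._store(path, self.files.get(path, "") + content)

    def replace_text(self, path: str, old: str, new: str) -> None:
        """Surgical edit: swap exactly one occurrence of `old` for `new`."""
        current = self.files.get(path)
        if current is None:
            raise FileNotFoundError(path)
        if current.count(old) != 1:
            raise ValueError(f"expected exactly one match for {old!r} in {path}")
        self._store(path, current.replace(old, new))


# Example: build a file in pieces, then make a targeted edit.
ws = Workspace(max_chars=40)
ws.write_file("notes.md", "# Lore\n")
ws.append_file("notes.md", "The desert is old.\n")
ws.replace_text("notes.md", "old", "ancient")
```

Routing every write through `_store` is one way to keep the size rule from depending on the model remembering it; whether characters are the right unit for the cap is an assumption of this sketch.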
And so as I'm tweaking things, it's like, you know, you're tweaking a a tool that's shared across a bunch of agents in different parts of uh different workflows and all of a sudden this thing that you've been trying to make work just falls apart because you were kind of uh whatever you built was relying on the tool to work this very specific way and now it just doesn't. Or let me give you another example. You introduce another tool and these other agents automatically are able to see it, let's say, because of how it's set up and then all of a sudden they become super ineffective because they think that's the right tool to use and because maybe the way it's named or something else and all of a sudden they're falling apart and breaking. So, I keep having for like a month, I'm not exaggerating, I keep having these situations where I feel like I'm taking a step forward and I'm like, "Oh, like my um what's a good recent example?
My my edit file with a size limits call like it's been working, but at the same time, I realize that the error that it returns to the LLM is confusing." So whatever operation it was trying to do um it ends up failing because the the file size restriction and like it the LLM doesn't really understand why that's happening based on the error from the tool call. So then it's like okay like I'll just try it again like maybe it was intermittent or it goes in a completely different direction because it's like confused as to why such a result is coming back from that tool call. Um it's like oh for example I I just wanted to replace some text. I'm calling the replace text function and then it's like file too long and it's like well okay and sometimes it's like recording in its chat like okay I you know I replace the text even though it didn't work because it doesn't know how to associate file too long with the text not being replaced.
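The confusing "file too long" result can be illustrated with a small sketch. This is an assumption about what a clearer tool response might look like, not the actual tool from the video: the idea is that a failed call says plainly that nothing was applied, why, and that retrying unchanged will not help. It is in Python rather than C#, and `ToolResult` and `replace_text` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    ok: bool
    message: str  # the text the model reads after the tool call


def replace_text(files: dict[str, str], path: str, old: str, new: str,
                 max_chars: int = 2000) -> ToolResult:
    """Hypothetical replace tool whose failures state what did NOT happen."""
    current = files.get(path)
    if current is None:
        return ToolResult(False, f"NOT APPLIED: {path} does not exist. Nothing was changed.")
    if old not in current:
        return ToolResult(
            False,
            f"NOT APPLIED: the text to replace was not found in {path}. Nothing was changed.",
        )
    updated = current.replace(old, new, 1)  # replace only the first occurrence
    if len(updated) > max_chars:
        over = len(updated) - max_chars
        return ToolResult(
            False,
            f"NOT APPLIED: the replacement would make {path} {len(updated)} chars, "
            f"{over} over the {max_chars}-char limit. The file is unchanged. "
            "Shorten the new text or remove other content first; retrying the same "
            "call will fail the same way.",
        )
    files[path] = updated
    return ToolResult(
        True, f"APPLIED: replaced 1 occurrence in {path}; file is now {len(updated)} chars."
    )
```

Compared with a bare "file too long", the message ties the size limit directly to the replace that was attempted, which is the association the model was missing in the example above.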
Just like like this. I don't think you have a stop. Oh, you're turning. It's a late signal there. Um, so what I've tried to lean more into is like, okay, well, for tool calls and stuff like that, I can unit test my tools, right? I can write I can do TDD. I can get all my tools and stuff set up so that I'm giving them parameters. I'm seeing how they work. That's all goodness. But like how the LLM uses the tool, um the prompts that I have, all of this stuff is non-deterministic. And so what do we do about that? And I guess like the industry standard type of thing. By the way, if you're like more of an expert on this kind of stuff, like comment, man. Like I'm happy to hear it. I'm sharing with you like what I've been doing. What is this?
Oh no. This is one of those... We have sort of spots around here, and I might have already gotten dinged for it. There's some school zones and you have to go less than 20 miles an hour. Otherwise, if there's a camera, then you'll just get a speeding ticket in the mail. Like if you're going 23, you'll get a ticket. It's painful. So these evals you can run, you can do programmatic evals. So you go run your LLM scenario and then you check some output programmatically, like you would with a normal assertion in a unit test or something. So you can do that. Is this guy gonna go? Or we have LLM-based assertions. So you can have an LLM as a judge, which is really cool too.
So you run your agentic workflow, or whatever needed an LLM, and then when it comes to asserting things at the end, you can check programmatically for some stuff, or you can run yet another LLM to go score whatever was happening, again based on your criteria, whatever prompt you give it. But the problems can compound, right? Because if you have shitty prompts for the things that you're building and you're like, "Okay, I need an LLM to judge the output," if you have more shitty prompts going into your LLM judge, then all of a sudden you have a dumb LLM doing stuff and then another dumb LLM judging it. And it's like, how do you trust any of it, right? It gets kind of silly. So, it's just one of these things where, I don't know, getting it right is not... what's a good way to say this?
Getting it right is so much harder than getting something running. And yeah, the brittleness that can be associated with this kind of stuff is painful, because you'll find yourself in a situation where you have an LLM as a judge running and you're like, okay, well, that was dumb that you scored it this way. So, okay, in this case, can I do a programmatic check to also look for this other thing? And so you're kind of mixing programmatic checks with an LLM. And in my opinion, when you start getting into this territory, it's a slippery slope of, are you writing programmatic checks for things that an LLM truly should be able to do if you could just prompt it more effectively, right? Like if you're looking for semantic stuff in the output to go validate and assert on and you're like, well, I'll just go write a regex that can do that.
Like, is that the right move? Because now you have this, you know, maybe mostly working, busted-ass regex that is giving you some confidence, but at the end of the day, you know that if an LLM were to read the sentence, it could tell you the right thing. It's just that in some other context, maybe your prompt was not effective or something. So anyway, these types of things are plaguing me, and if I'm honest with myself, I think that I assumed, especially with the help of AI writing a lot of this, that I would have been good at it, right? I think I assumed that I would have had it working. It would have been simple. Could have flown through it. But if I'm honest with myself, I'm literally using a new tech stack.
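The two kinds of evals mentioned here, programmatic checks and an LLM as a judge, can be sketched side by side. This is a rough illustration in Python, not the setup from the video: the `Judge` type, the specific checks, and the 0.7 pass threshold are all assumptions, and `stub_judge` stands in for a real model call so the example runs offline.

```python
from typing import Callable

# A judge takes (criteria, output) and returns a score from 0.0 to 1.0.
# In a real eval this would call a model; here it is a hypothetical interface.
Judge = Callable[[str, str], float]


def programmatic_checks(output: str, max_chars: int) -> dict[str, bool]:
    """Deterministic assertions, the same kind a normal unit test would make."""
    return {
        "non_empty": bool(output.strip()),
        "within_size": len(output) <= max_chars,
        "no_placeholder": "TODO" not in output,
    }


def evaluate(output: str, judge: Judge, criteria: str,
             max_chars: int = 500, pass_score: float = 0.7) -> dict:
    """Run the cheap deterministic checks first, then ask the judge."""
    checks = programmatic_checks(output, max_chars)
    if not all(checks.values()):
        # The output already broke a hard rule, so skip the judge entirely.
        return {"passed": False, "checks": checks, "judge_score": None}
    score = judge(criteria, output)
    return {"passed": score >= pass_score, "checks": checks, "judge_score": score}


def stub_judge(criteria: str, output: str) -> float:
    """Stand-in for an LLM call; always returns the same score."""
    return 0.9


result = evaluate("A short summary of the lore.", stub_judge, "Is it on topic?")
```

Keeping the judge's criteria as an explicit argument makes it easier to see when a bad score comes from a weak judging prompt rather than from the output being scored.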
I'm using new concepts that I've never used before, like this way, and and then expecting that, you know, it's going to be uh a cakewalk, and then being frustrated when it's not. So, I think I'm saying this out loud to like remind myself and if you're I don't know if you're in a similar situation where you're getting caught up on some stuff like yeah, I don't know, makes make some space for yourself to be to be kind. Um because yeah, at the end of the day, it's been about a month for me working on some of this stuff, this some particular things I'm building and I feel like I don't have anything to show for it, but there's been a ton of learning. There's been a ton of like the some of the pieces that I'm building up, you know, whether they were simple tools, right?
like the concept of working in a an in-memory workspace instead of a file system, right? So, I had this from the beginning. Um, and yeah, like it's actually gone through a ton of revisions, right? How I how I make and compose tools for LLMs to use has gone through a ton of revisions. So like instead of having uh tools that you know maybe they're different tools but across all of them I need some kind of special check. Well instead of coding that into every tool I can I can add a decorator right I can decorate all of the tools with one line and like that's a cool win. Now does it break a bunch of when I go to do it? Yeah. Yeah, it does because it never worked that way. But then so I'm paying this tax of like I have to go back and now like I broken a bunch of stuff.
But going forward now all of these tools are decorated the exact same way. Now they'll all enforce the same standards. Um and that means that as I go forward, if I add a new tool, I don't have to go remember to do the file size check. I don't have to remember that um you know I'm writing a tool call that has like broken URLs being inserted. Um I can validate those things. I can validate them with a decorator across all of the tools. So there's like there's been patterns like this that I've been finding for myself that like I said if I if I'm honest the maybe the outcome of having this whole thing working end to end is not met but some of the learning has been has been really important and some of the infrastructure around it has been improving. Um what else? I think you know having full diagnostics in your um in your LLM flows is is just been so such a critical thing for me.
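The decorator idea can be sketched roughly like this. It is a Python illustration of the pattern, not the C# implementation from the video: `guarded`, `MAX_CHARS`, and the deliberately simple URL check are all assumptions made for the example.

```python
import functools
import re
from typing import Callable

MAX_CHARS = 2000  # assumed shared size limit, in characters
URL_PATTERN = re.compile(r"https?://\S+")
# Deliberately simple well-formedness test: scheme, dotted host, optional path.
VALID_URL = re.compile(r"^https?://[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+(/\S*)?$")


def guarded(tool: Callable[..., str]) -> Callable[..., str]:
    """Wrap a tool that returns new file content and apply the shared checks."""

    @functools.wraps(tool)
    def wrapper(*args, **kwargs) -> str:
        content = tool(*args, **kwargs)
        if len(content) > MAX_CHARS:
            raise ValueError(
                f"{tool.__name__}: result is {len(content)} chars, limit is {MAX_CHARS}"
            )
        for raw in URL_PATTERN.findall(content):
            url = raw.rstrip(".,;:)")  # drop sentence punctuation stuck to the URL
            if not VALID_URL.match(url):
                raise ValueError(f"{tool.__name__}: malformed URL {url!r}")
        return content

    return wrapper


@guarded
def append_text(existing: str, addition: str) -> str:
    return existing + addition


@guarded
def replace_text(existing: str, old: str, new: str) -> str:
    return existing.replace(old, new)
```

A new tool picks up both checks by adding the one `@guarded` line, so there is nothing to remember per tool. The tradeoff described above still applies: tools written before the decorator existed may start failing once it is attached.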
I think that um I think making assumptions about what's getting you know logged or output or tracked that kind of stuff. Um come on man. uh making assumptions about it is like a terrible move because I've wasted so much time where um you know I'm doing these runs, I'm I'm having AI look at the analysis and we're like tuning things and then it reaches a point where it's like you know some things aren't really making sense. like we're touching something and it's breaking some other stuff and then I'll be like wait a second like you know to the LLM how did you arrive at this conclusion like show me the data for that and it's like oh we actually we don't have that like you know whatever we've been running doesn't actually output that and I'm like well what do you mean it doesn't output
that like we've been you know we've been doing iterations of this thing touching these other variables my assumptions been that you would have told me if something regressed, but like we weren't even tracking it. So, no that we that we weren't aware that it regressed. Like there isn't even data for it. So I think yeah as much you know the one of the things I just want to reinforce here is when you're doing these uh these workflows programmatically and trying to tune them like if you don't have uh a ton of visibility as much as you can possibly get for your your agentic flows. Um it it's it's just really hard to tune them because you might make progress in one direction and then in some other direction like you've completely destroyed it, right? When things are more programmatic, it's a lot easier for us to assert on these things and check them.
And you know, this is this is why a lot of the time we put tests in place, right? You have good a good regression test suite. you can go touch different parts of the code and you're like, I know that I didn't break the rest of the Um, so you can tune stuff, you can modify and and have more confidence. Um, not always, but you know, this is one example. And so, yeah, I think without the diagnostics, you're you're you're really in the dark. And when you're working with AI and it's giving you this this sense of confidence in the analysis, um it might be showing you data, right? Showing you data that is confirming some other things, but it's not showing you. It's almost like it doesn't show you what it doesn't know. So like you have to know. you have to know so that you can tell it like hey man you're not showing me this or we don't have data on that.
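One way to make "it doesn't show you what it doesn't know" visible is to compare each run against a baseline and list the metrics that were never recorded alongside the ones that regressed. The sketch below is a Python illustration under assumptions: the metric names, the "higher is better" convention, and the tolerance are all hypothetical.

```python
def compare_runs(baseline: dict[str, float], current: dict[str, float],
                 required: set[str], tolerance: float = 0.02) -> dict[str, list[str]]:
    """Compare metrics (higher is better) and surface what is NOT being tracked."""
    report: dict[str, list[str]] = {"regressed": [], "improved": [], "missing": []}
    for name in sorted(required):
        if name not in baseline or name not in current:
            # No data in at least one run, so nothing can be claimed about it.
            report["missing"].append(name)
            continue
        delta = current[name] - baseline[name]
        if delta < -tolerance:
            report["regressed"].append(name)
        elif delta > tolerance:
            report["improved"].append(name)
    return report


baseline = {"task_success": 0.70, "format_valid": 0.95}
current = {"task_success": 0.85, "format_valid": 0.80}
required = {"task_success", "format_valid", "citation_accuracy"}

report = compare_runs(baseline, current, required)
# report["improved"]  -> ["task_success"]
# report["regressed"] -> ["format_valid"]
# report["missing"]   -> ["citation_accuracy"]
```

The `missing` list is the important part: a metric that is required but absent shows up explicitly instead of being silently left out of the analysis.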
So I think that's another big learning that I've had: before going too far in the iterations, really go through the data that's being output. I've been using, in Copilot, they call it the rubber duck agent. It's an experimental feature. I haven't looked at what prompt it uses or whatever, but it's been super helpful. It's not perfect. Sometimes it has suggestions that are kind of stupid, but if you're not familiar with this, I think Claude has a similar thing. It's not quite like the code review agent, but it's almost like a devil's advocate agent. So when your LLM is doing stuff and presenting results or even planning, you can ask it, and sometimes it will do it itself if you have the feature turned on. It will go talk to a rubber duck agent and get this sort of, adversarial is too strong of a word, but kind of a devil's advocate perspective on things.
And so I found that before I'm going to get a bunch of analysis on data done to go tune some of these agentic flows, I'll say, go talk to the rubber duck agent and make sure we have all of the data we need. And that's caught a lot too. So yeah, I've been doing a lot of that. And just to kind of switch gears a little bit, that's been mostly in the BrandGhost space. So some of the stuff that I've been building, it's probably extra frustrating because we can move pretty fast in BrandGhost and I just feel like this has been really difficult for me. It's almost like, what's a good word for it? Shame is probably too strong of a word. I don't think that I'm ashamed of it.
Maybe embarrassed. Something between ashamed, embarrassed, and frustrated. Probably some of those emotions. That's kind of how I feel about my current development on this. And I guess one more thing to mention before I totally switch gears is that the nice part about going through this too is that I'm abstracting these lessons that I'm learning into my NuGet package called Needlr, which I use for dependency injection, but it has a bunch of setup stuff for Microsoft Agent Framework as a dedicated package too. So if I find that I'm building out these patterns over and over again with some of this agent work, I'm pulling it into Needlr so that I can compose them more straightforwardly. I kind of have it at a framework level. Okay.
The other thing that's been uh fun and this is sort of like it's just on the side and especially because I think for me when I'm not making progress on stuff like I need a way to kind of reset my brain. You know, you'll hear people saying like, "Hey, go for a walk or go take a shower or whatever." Like get away from your computer. I think for me, I just need to like switch what I'm thinking about it. I can still be programming and it's fine. And um so I I mentioned in a couple of videos recently, I've been having AI revibe code one of my uh old sort of RPG games that I was putting together. Um one of these things that like will never be finished, but it's just kind of fun to like see the pieces. And so I have it I I showed it on a live stream a couple weeks ago.
It's nothing it's nothing impressive, but if you consider that I've not seen a single line of code for it, it becomes I think a little bit more uh interesting, right? It's like it's got a it's like client server base. It's got a Unity uh Unity 3D like engine support. The entire framework itself is platform agnostic. So there's a reusable RPG framework for building games. Then on top of that is my game specific stuff that's still engine agnostic. And then on top of that is like the engine specific uh game. So kind of designed it with this uh separated out architecture which for me like that's the that's the fun part like I like thinking about things and designing them that way like actually you know actually going to implement it I'm like less less intrigued by uh which has been a good learning with AI that like understanding what parts I genuinely enjoy.
So, this thing's been iterating for me. Um, you can walk around in the game world. It will load and do like procedurally generated maps that are they're terrible looking, but you get some, you know, some dungeon rooms that are stitched together. Uh, it's got, you know, lighting. I just had it doing some really basic AI stuff over the weekend. Uh, not not LLM AI, but like having, you know, enemies that will pathfind you and attack and have uh sequences and stuff like that. So now when you go into the dungeons, you'll actually have uh like sk like they're named skeletons. They don't actually look like skeletons right now, but um you'll have them kind of chase after you and attack you. So that's cool. Um I was trying to get some projectile stuff working. That has been a nightmare. Um, I think that got extra bad because it broke something along the way.
And so it's like it did all of this projectile work and when I went to go run it and play with it, it was actually um like other things were broken. So like the I think even right now the heads up display like that has your life and mana and experience bars and that kind of stuff, your skill slots, that's all completely busted. And so it's it's infuriating because um for a lot of things it's able to like it it will tell me like oh I can't validate that and I'm like no you can like uh it's like you need to go run Unity to validate that and I'm like nope you have play mode tests like you can absolutely go assert all of this stuff and it's like oh you're right I can um and then it will go iterate it will find and fix bugs and then I'll go back to Unity to try it.
And there's still some discrepancy between how I'm loading and playing scenes in Unity versus what it's doing. So, I have to get to the bottom of that and make that consistent and then enforce it programmatically so that it can't break that pattern. I have to provide instructions in the AGENTS.md so it knows, and then I need these glob-matching instructions to make sure that it knows when you're touching files like this you can't make these types of decisions because you'll break So that's been really slow. But at the same time, I was like, "Okay, that's chipping away at that stuff." The other part that I never focus on when it comes to games is I need the actual game content. So the last time, a few years back, when I was trying to play around with this, I was trying to capture more game world, story, lore, that kind of stuff.
Just into documents, and this is really around the time when we were getting things like ChatGPT. So I was like, hey, this would be really cool to help me do some more writing just to get some content, right? But this is at the same time where anything that you put into ChatGPT was like, "in the vast realm of the digital landscape." You know exactly what I'm talking about. It's just absurd AI slop. So that didn't go very far, cuz I'm like, I can't go pollute all of this game content with this complete Um, but now the models and stuff are way better. So I started saying, okay, can I just even sit in the Copilot CLI while other stuff is running to go build out in Unity and whatever else? Can I have it work through some of the lore?
Can I have it make sure that the timeline for some of the game world is actually consistent? Can I explore that? I started having it build out a content editor. So again, I haven't seen any of the code for the content editor, but it's a little app. It's a Blazor app, and I have it so that there's a concept art pipeline. So that was kind of cool. I hadn't done any local models successfully, ever. For whatever reason, anytime I've attempted to run local models, they really don't work. If I get results, it takes like 30 seconds on my gaming rig and I'm like, I'm doing something wrong here, cuz there's no way anyone's using this. And so these are image generation models
and I thought that they would be significantly less performant, but they're working, and it's pretty cool because I have this concept of, can I take prompt templates and layer them together? So for example, if I wanted to do concept art for landscape or scenery, right? I want a prompt that has something about landscape and scenery, then can I layer on the biome that we're in? So desert or jungle or whatever else. Okay, cool. Then can we go one step further? I want to add basically an à la carte selection of, now I want to have some buildings or I want to have whatever, and just getting it to layer in these different pieces. And so it's been working. It's pretty cool. Then we played around a little bit with repeatable characters, right?
Like if I want some NPC or I want to have a a figure from lore, right? Like can I generate a character and then can I basically use the LLM to repeatedly generate that character in different situations? Uh and I know it's AI. It's not going to be deterministic, but can it do a good job like repeating the face so that at least in concept art, you're like, "Yes, I know that's the same person." Uh, so that's not perfect, but it was working. So that's pretty cool. Uh, and then the other part, and this is really why I I started doing it, was that I need inventory graphics because when you're dropping loot in the game and you want to equip the items or look in your inventory right now, it's just like, is it not here? What the heck? Oh, no, it is here. Um, right now it's just like these placeholders and it looks pretty stupid.
Um, like I I want it to like if I'm going to do this, I want it to look like a game. And so right now it really it looks like someone who has never touched Unity in their life um is putting together a game. And like I have worked in Unity. I'm just terrible at art. So, um I I wanted to get this like inventory icon pipeline put together. And again, I wanted it to have these composable features. So, I could say things like, I want a shield or I want a crossbow or I want a sword and I want it to be made of metal or I want it to be made of crystal. Uh and I want it to have an ice effect. and I can basically pick and choose these different pieces. And so that's been working really really well. Um I am I'm so impressed with how that works that uh it's Yeah, it's been super exciting.
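The layered prompt idea (a base subject, then pick-and-choose pieces like item, material, and effect) might look something like the sketch below. This is a guess at the shape of the technique in Python, not the actual pipeline from the video: the layer names, the prompt wording, and `build_icon_prompt` are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PromptLayer:
    name: str
    text: str


@dataclass
class PromptRecipe:
    """Ordered stack of layers: base subject first, then modifiers."""
    layers: list[PromptLayer] = field(default_factory=list)

    def add(self, layer: PromptLayer) -> "PromptRecipe":
        # Return a new recipe so a shared base can be reused safely.
        return PromptRecipe(self.layers + [layer])

    def render(self) -> str:
        return ", ".join(layer.text for layer in self.layers)


# Hypothetical layer catalogs; the wording is illustrative only.
BASE_ICON = PromptLayer("base", "game inventory icon, centered, plain background")
ITEMS = {"shield": "a shield", "sword": "a sword", "crossbow": "a crossbow"}
MATERIALS = {"metal": "made of metal", "crystal": "made of crystal"}
EFFECTS = {"ice": "with an ice effect"}


def build_icon_prompt(item: str, material: str, effect: Optional[str] = None) -> str:
    """Compose one prompt from the selected pieces; the effect layer is optional."""
    recipe = PromptRecipe([BASE_ICON])
    recipe = recipe.add(PromptLayer("item", ITEMS[item]))
    recipe = recipe.add(PromptLayer("material", MATERIALS[material]))
    if effect is not None:
        recipe = recipe.add(PromptLayer("effect", EFFECTS[effect]))
    return recipe.render()


prompt = build_icon_prompt("shield", "crystal", "ice")
```

The same structure would cover the concept art case by swapping the catalogs for scenery, biome, and building layers.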
I think I missed my turn somehow. That's confusing. Okay. Well, I guess I'm at where I need to go. You got to go, buddy. I can't turn in there when you're there. So, let me get in here and I got to get parked. I'm so confused by this parking lot, though. Cool. Well, yeah, I will probably do a couple videos on Dev Leader showcasing that cuz I think it's pretty cool. And, uh, yeah, thanks for watching and sticking through some of the AI stuff. And I will see you in the next video. Take care.
Frequently Asked Questions
These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.
- How do you handle non-determinism when building repeatable agentic workflows?
- I find that LLMs are non-deterministic, and when I'm building repeatable software, a small prompt change can cause cascading failures. I rely on evals, including programmatic evals and LLM-based assertions, to test and score the outputs. This helps me understand and manage the brittleness as I iterate.
- What challenges do you face when migrating from harness-based workflows to programmatic code?
- I've been trying to migrate from relying on a harness like Copilot CLI or Claude Code to something more programmatic, and it's night and day. The harnesses are incredibly capable, and the libraries I'm using don't replicate that capability well, so I end up tuning a lot. I've been at it for over a month outside of work with little progress, and I feel like I'm losing my mind.
- How have diagnostics and tooling patterns helped you manage agentic flows?
- I've found that having full diagnostics in my LLM flows is critical. I've started decorating all of my tools with a single decorator to enforce the same standards across tools, which makes adding new tools easier. This approach reduces duplication and helps ensure consistency, so when I add a new tool I don't have to remember to apply the file size check.