My Agent Orchestration Is Finally One-Shotting From One-Liners


In this video, I touch on the first milestone where my agent orchestration is finally getting from simple prompt to built app in one shot.

📄 Auto-Generated Transcript

Transcript is auto-generated and may contain errors.

Hey folks, I'm just driving to the office here. Have, uh, employees visiting from Mexico and Costa Rica today. Well, this whole week, which is awesome. So excited to see them. I just stopped at Crumbl Cookies cuz that's the only thing I know how to do for getting something to bring in for snacks and stuff like that. Other people, like, bake things or, I don't know. I'm not good at it. So I go to Crumbl Cookies, but that's what I'm doing. That's why I'm a little off my normal schedule this morning. But I figured, I did a couple videos that I recorded earlier to and from CrossFit. One was from Reddit, the other one was a submitted question. I got to go straight here. And, uh, I figured we'd just use this drive in to do a bit of an AI recap on some things I'm working on.

Bad timing for a light to change. Come on. And so some of the recent videos that I was putting together were around, um, Ralph loops. I'm starting to try looking at that on the side. Um, and then something called Symphony, which, my understanding is, is like a spec that was put together, I think by OpenAI, around orchestration patterns for multi-agent development environments. And so with Symphony, the idea was that the spec was open and they said, like, build your own, which is cool, right? And it's like, here is this high-level spec. You could build this in any language or whatever you want. Um, and then they, I guess, had, I think they said it was in Elixir. They had an offering, like they put a sample one together. But basically, here's the spec, build your own. Have fun with it.

And so what I had been doing was I started with putting a Ralph loop together, like a bit of a harness, um, using Copilot CLI. And for context, right, for folks that don't know, uh, I'm not against Claude Code or anything like that. Uh, I use Copilot CLI a lot. Um, the primary reason right now is just for tokens. Uh, I have a bit of a path to have lots of tokens. So I'm using Copilot CLI primarily. And so I started with this Ralph loop harness where I could basically say, in a very straightforward way, like, go do a Ralph loop on this task, and then have an agent kicked off in its own little harness to basically keep iterating until it delivers or runs out of time or, you know, fails to meet the conditions.
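The Ralph loop described here (keep iterating until the agent delivers, runs out of time, or fails to meet the conditions) can be sketched roughly as below. This is a minimal Python sketch under stated assumptions, not the actual harness: `run_agent` and `is_done` are hypothetical callables standing in for one agent session and for the completion check, and the iteration and time limits are illustrative defaults.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    status: str      # "delivered", "timed_out", or "failed"
    iterations: int  # number of agent sessions that actually ran


def ralph_loop(
    run_agent: Callable[[str, int], str],  # hypothetical: runs one agent session
    is_done: Callable[[str], bool],        # hypothetical: checks the exit condition
    task: str,
    max_iterations: int = 10,
    time_budget_s: float = 60.0,
) -> LoopResult:
    """Re-run the agent on the same task until it delivers or a limit is hit."""
    deadline = time.monotonic() + time_budget_s
    for attempt in range(1, max_iterations + 1):
        if time.monotonic() > deadline:
            return LoopResult("timed_out", attempt - 1)
        output = run_agent(task, attempt)
        if is_done(output):
            return LoopResult("delivered", attempt)
    # Every allowed iteration ran without meeting the conditions.
    return LoopResult("failed", max_iterations)


if __name__ == "__main__":
    # Stand-in agent that only "delivers" on its third attempt.
    def fake_agent(task: str, attempt: int) -> str:
        return "DONE" if attempt >= 3 else "still working"

    print(ralph_loop(fake_agent, lambda out: out == "DONE", "build the dashboard"))
```

The three exit statuses map to the three outcomes mentioned above; a real harness would presumably also capture logs and the agent's evidence for each iteration.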

And as I was putting this together, came across Symphony and was like, "Hey, wait." Like, to me, this is actually the more interesting thing. I thought it would be fun to put together this Ralph loop harness. And then when I saw Symphony, I was like, "No, like, that's really the thing that I was hoping to kind of put together." So I started bridging them together. And so over the last little bit, um, kind of on the side, like literally outside of work, outside of BrandGhost development, I just have another Copilot terminal that's kind of, uh, vibe coding this Ralph loop, and then gave it the Symphony spec. Is that a cop? It is. What are they doing there? Maybe they pulled someone over and they're just waiting now.

Anyway, um, and so finally, like, over this past week, I guess over this weekend, is really where I've had my first, um, end-to-end experiment work with my Symphony and Ralph kind of loop put together. And the way that I was doing this was, like, for me to be able to call it successful, I need to be able to put together a heavier spec. And when I say me, like, I'll have the LLM do it. Put together a spec and basically hand it over to this big harness that says, like, here's the work. Here's how it's scheduled. And, like, go divvy it up and let these workers run as Ralph loops to go do this work.

And so in order for it to be successful, the LLM that's running the whole thing for me is not allowed to interfere. Because, you know, as you're vibe coding and developing stuff in your terminal, working with agents, if it's running tests and stuff for you, like, sometimes, you know, the idea is it's trying to make forward progress and help. And so it's like, oh, there's a bug, let me fix it and then we'll keep running the experiment. And I had to keep telling it, no. Like, as soon as you need to intervene, it's great you found an issue. Yes, we should fix it, but we have to restart the whole experiment. And the reason is that for it to be a success in my mind, it needs to be completely hands-off. That is the entire reason I want to do this, is that I'm combining this concept of, like, Ralph loops or whatever, just having agents kind of iterate in isolation.

Uh, having an orchestration layer on top of it. And then the other things that I've been talking about recently are, like, my coded templates and instructions that I share. And so the idea being that, because you don't need any of this, and you could probably have success one-shotting stuff and get lucky and it's cool, but I want the code that's produced after to be something that is built on, like, sort of my years of engineering experience. And that means I don't just want random patterns scattered everywhere just because it ended up getting it to work. Because we all, well, maybe we don't all know this, but from my lived experience, yeah, you can get some stuff tossed together pretty quick. But if you change direction at all, right, you want to add new features, or you just want to change any direction that you didn't perfectly plan for in the beginning, if your code wasn't designed in such a way that it's extensible, that it's modular, that you have, you know, the easy ability to test different aspects of it, you end up in this spot where it's like you touch anything. People are slamming on their brakes. I watched someone have to swerve around a car.

Um, you end up in this spot where you touch anything and the whole system's brittle. And then this is where, to an exaggerated point, when you have this in companies, people are like, man, we really need to rewrite this thing, right? So I'm not saying that this is going to solve the need to rewrite things ever, but I want software built based on my engineering experience. Like, this is very opinionated, right? Based on my engineering experience, following the guardrails that I want, so that if I were to go look at the code after, I would feel like, yep, this feels like it's built on, you know, what I want. Not I open up the code and I go, "Holy shit." Like, I have no idea how anyone got this to work. Like, this is a nightmare.

And that's been a lot of my experience, right? If I have something really big that's vibe coded end to end, it doesn't have to be a full application. It could be a big feature. And I've had this problem in BrandGhost where I'm like, cool, let me go have AI build this thing end to end, and then it does it and I look at it and I'm like, man, it literally works, cuz I'll run it and I'll try it and I'm like, it works, but no, there's no way I could let this into my code base because it introduced, like, you know, three new patterns for something we already have a well-established pattern for. Um, the tests it wrote are completely, because it introduced these new patterns that don't lend themselves well to being tested, so then the only way it could do it was to, like, use reflection. As a .NET developer, so, reflection, so that I can kind of, like, bastardize the type system. Just, like, just not good. And so then I spend a lot of time trying to put it back on track because I'm like, you had the right concepts sometimes, and in how you chose to implement those individual pieces, you just went off the rails picking patterns.

So I'm doing all of this because I genuinely believe um because like it's already happening. I am not writing much code anymore. And I don't mean that like the typical doom and gloom like oh software developers are replaced. No, I mean like it's uh I'm still heavily involved in the architecture. I'm looking at a lot of code reviewing it making sure the patterns are in the right direction because I am spending so much time looking and seeing these patterns like go off into completely divergent paths. agents running into issues where they're like, "Oh, like, you know, the scope of this work is going to be crazy because we have to go touch all this or it's a surprise that the scope is so big." Like, there are ways around this if you put some thought into it up front and then like stay the course. It's a very cool color.

Purple, like a Mer, is that called Merlin on BMWs? Merlin purple. I don't know. It's nice. It's really hot in here. Holy crow. So, the experiment that finally I had run end to end was just basically um kind of a a little bit of meta concept here, but I'm having a uh an example of the harness go build a mini dashboard for the harness, which is kind of neat. So, uh, there's a basically go build me a front-end app that could do things like monitor agent development. And I should have said this earlier cuz I mentioned it in the other videos. I'm not building this because I think that I'm the first person to think about all this stuff. By the way, I am confident that many people have already built working versions of this and they're awesome and amazing. I'm doing it because this is how I like to learn and explore.

Um, so I finally had it do an end-to-end loop without intervening. And then I asked Copilot, I said, like, I want you to go review what it built, right? Cuz the idea with the Ralph loops and then this orchestration on top is that the Ralph loops will get to a point where they're like, I've delivered based on the evidence I need to supply. And so, based on what was asked of me, here's evidence. I have proven it works. Based on what was asked, I'm done. Thank you. Next. And then the orchestration is supposed to keep feeding the system, like, the next bit of work to do. Um, and then there's some other kind of interesting concepts, like if you were to give it work that's impossible or it gets stuck, like, can it repl, like, just some meta things on top, but side note.
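The hand-off described here, where a worker is only "done" once it supplies the evidence that was asked for and the orchestration then feeds the next piece of work, might look something like the following. It is a sketch under assumptions: the `WorkItem` and `Report` shapes, the evidence names, and the stop-on-first-stuck-item policy are all hypothetical and not taken from the Symphony spec or the real harness.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class WorkItem:
    name: str
    required_evidence: set[str]  # hypothetical labels, e.g. {"tests_pass"}


@dataclass
class Report:
    item: str
    evidence: dict[str, str] = field(default_factory=dict)


def accepted(item: WorkItem, report: Report) -> bool:
    """Done means every required piece of evidence was actually supplied."""
    return item.required_evidence.issubset(report.evidence)


def orchestrate(
    items: Iterable[WorkItem],
    worker: Callable[[WorkItem], Report],  # hypothetical: one Ralph-loop worker
) -> tuple[list[str], list[str]]:
    """Feed work items one at a time; stop at the first item lacking evidence."""
    queue = deque(items)
    completed: list[str] = []
    stuck: list[str] = []
    while queue:
        item = queue.popleft()
        report = worker(item)
        if accepted(item, report):
            completed.append(item.name)
        else:
            stuck.append(item.name)
            break  # assumption: later items depend on this one
    return completed, stuck
```

Treating missing evidence as "stuck" rather than "done" is the design choice doing the work here: the worker's own claim of success is never enough on its own.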

So it it went through this plan built a dashboard co-pilot reviewed it and it was like yep like it built what we need. So I said, "Cool. We've proven end to end, like this is just one time, by the way. Maybe we give it something else that's a little bit more complicated or whatever." And then we start to uncover some other issues in the harness. But the whole point is like I want the concept to be proven that it works. And having one success tells me like it's a non-trivial thing. And having one success tells me like the concept works. Is it? Do I think it's perfect? Absolutely not. Are there bugs? Absolutely. Can it be optimized? Absolutely. But like, let's do baby steps, right? So, it does this end to end trial, evaluates it. Things are good. So, I go, awesome. Um, we're going to try doing the real thing now.

So, I sat down with Copilot and, man, there's so much traffic. Sat down with Copilot and said, "Okay, yeah, it was great. We did, like, a small dashboard, but I want a completely feature-rich, professional, like, stylized dashboard with modular panels. Like, I want the real thing. I want a control center for this orchestration. Like, we're going to have this thing built." So, uh, you know, really cool stuff. I'm sure most people have now at this point, but if you haven't used, like, planning mode in Copilot or in Claude Code, highly recommend you do it. Um, you can really put together some interesting designs and specs and stuff. Uh, even just doing it and thinking through it with someone else, in this case the someone else is an LLM, is so valuable. Um, yes, I trust that you are a smart software engineer.

I like to think that I am as well. But the value that we get out of just discussing ideas with people is so tremendous because it gives you different perspective. So I sat down with it. We planned out a bunch of features. It asked, I think, really good questions. Like, it got me thinking about features I wasn't thinking of. Um, and yeah, so built out this whole thing, and then it sat there for a while cranking out, like, specs. Um, this might be, out of anything I've sort of vibe coded all at once, probably the biggest set of spec files that I've ever seen. That's not to say that it's going to be the most complicated thing I've had AI build, but in terms of, like, a one-shot set of specs, it might be, which is pretty neat. So now my goal is not, hey, go run this thing and don't intervene.

I actually wanted to go build this thing, right? I've proven the concept. I don't, in this case, I don't need to keep running the trials and say, "Oh, no, like if you touch it, restart it." The concept works. That's awesome. So if something breaks along the way, like say it stalls or um you know it's getting through different phases and like it's collecting data and says this step had to retry like 10 times. Like to me that suggests something's a little funky. So I wanted to keep track of all these things. Yes, we can go fix them. Um and the most recent example I think on I don't know how many total phases there are but it was on phase 11. The other um sort of trial that I had done had uh eight phases in total uh that were less complicated. This one was at phase 11 with much more complicated like sets of work to do and then it got stuck.

And it got stuck. I don't know the exact reason, but what it got stuck on was it needed input. It was doing something and the LLM was like, "Hey, I can't get this. Like, I need someone else's opinion on this." And so the orchestration layer has, like, a human-in-the-loop concept. Um, the abstraction of this is named poorly, because a human in the loop is one variation of it. But it's essentially like, I need an adult to tell me what to do. If you've used, like, autopilot mode in Copilot CLI, you'll see in the chat sometimes the LLM is like, I need some input. And then the autopilot mode responds automatically and it's like, use your best judgment. Um, and so that's one flavor of this human-in-the-loop concept.

So, when I wired this all up to go have it built, I said, "Hey, I'm not going to be the human in the loop." Like the there's going to be an LLM responder that's a human in the loop. And if you've used uh I can't remember what Claude has, uh but Copilot CLI has like a rubber duck agent, right? So, basically, go ask some other agent. I realize you're stuck and that's cool. Like yes, it needs to be figured out, but I'm I'm being hands off with this. Is this the best way to build software, by the way? Like probably not, but let's see. So, you go ask some other agent. So, this LLM responder was wired up. It did trigger in some earlier phases, but on phase 11 in this particular case, it got stuck. And so, when I checked it this morning, it was like, "Hey, here's how far we got.
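The "human in the loop" abstraction described above has several flavors: an actual human, an automatic "use your best judgment" reply, or another LLM. One way to picture it is as interchangeable responders behind one interface. This is a hedged sketch, not the harness's real design: the `Responder` protocol, the class names, and `ask_llm` are all hypothetical, and the fall-through ordering is an assumption.

```python
from typing import Callable, Protocol, Sequence


class Responder(Protocol):
    """Anything that can answer a blocked agent's question."""

    def respond(self, question: str) -> str: ...


class HumanResponder:
    """Blocks until a person types an answer at the terminal."""

    def respond(self, question: str) -> str:
        return input(f"Agent needs input: {question}\n> ")


class BestJudgmentResponder:
    """Autopilot-style flavor: always tells the agent to decide for itself."""

    def respond(self, question: str) -> str:
        return "Use your best judgment."


class LlmResponder:
    """Delegates the question to another model via a caller-supplied function."""

    def __init__(self, ask_llm: Callable[[str], str]) -> None:
        self._ask_llm = ask_llm  # hypothetical: wraps some second agent session

    def respond(self, question: str) -> str:
        return self._ask_llm(question)


def answer_blocker(question: str, responders: Sequence[Responder]) -> str:
    """Try each responder in order, falling through when one fails."""
    for responder in responders:
        try:
            return responder.respond(question)
        except Exception:
            continue  # assumption: a broken responder should not stall the run
    raise RuntimeError("no responder could answer the question")
```

For a hands-off run, the list passed to `answer_blocker` might be an `LlmResponder` followed by a `BestJudgmentResponder`, with `HumanResponder` reserved for runs where stopping to ask a person is acceptable.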

Got stuck on this." So, I said, "Okay, cool. Like, I need you to go investigate why it got stuck." It seems like, uh, something kind of flaky. Like, the session that was starting up the LLM just broke, and there's no retry, there's no timeout, like, there's no resiliency around it. Uh, and Copilot, when it was investigating, said, like, the previous four attempts in the whole run, they all worked perfectly fine. So it's not like the feature itself is broken, to have an LLM jump in and be the human in the loop. It's just that there must be some resiliency that we need to put in place, or maybe there's an edge case or something else. So, this thing got stalled. So, I said, "Cool." Um, worked with Copilot to come up with some fixes to add some resiliency, all this good stuff.
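The missing resiliency mentioned here (no retry, no timeout around the session that starts the LLM) is commonly handled with a wrapper like the one below. This is a generic sketch of retry-with-timeout using only the Python standard library, not the fix that was actually committed; `call_with_resilience` and its parameters are hypothetical names.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_resilience(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    timeout_s: float = 30.0,
    backoff_s: float = 1.0,
) -> T:
    """Call fn, retrying when it raises or exceeds timeout_s."""
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn)
        try:
            # result() raises a timeout error if fn is still running after timeout_s.
            return future.result(timeout=timeout_s)
        except Exception as exc:
            last_error = exc
        finally:
            # Caveat: a hung call keeps running in its thread; Python cannot
            # kill it, so a real harness would likely use a subprocess instead.
            pool.shutdown(wait=False)
        if attempt < attempts:
            time.sleep(backoff_s * attempt)  # simple linear backoff
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Wrapping only the flaky session start, rather than the whole phase, keeps a transient failure from stalling the run while still surfacing a persistent one as an error.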

And then I said, literally before I left for work, I said, "You're gonna go fix this. You're going to commit it." And instead of running that trial experiment from from the start, I said, "This isn't a trial anymore." Like, we're building this thing. So just have it resume, which is actually a use case in and of itself because when I hopefully go to use this in the future, the goal isn't let me go orchestrate a really complex thing to build and if there's ever an issue, just uh restart the whole thing from scratch. I I want to be able to resume. So, um, what we're going to see is that hopefully it starts off at phase 10 where everything was completely good to go and, um, we'll have delivered the fixes to the harness. And I'm hoping that when I go home, either it's still running cuz there's a lot of work to do or it at least got further um, than phase 11, which is just the the very next step.
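Resuming a run instead of restarting it from scratch usually comes down to recording which phases finished and skipping them next time. The following is a minimal sketch of that idea, assuming a JSON checkpoint file; the file layout, `run_phase`, and the phase names are hypothetical, and the real harness may track state quite differently.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_completed(checkpoint: Path) -> list[str]:
    """Return the phases recorded as finished, or an empty list on a fresh run."""
    if not checkpoint.exists():
        return []
    return json.loads(checkpoint.read_text())["completed"]


def run_phases(
    phases: Iterable[str],
    run_phase: Callable[[str], None],  # hypothetical: raises if a phase stalls
    checkpoint: Path,
) -> list[str]:
    """Run phases in order, skipping any that an earlier run already finished."""
    completed = load_completed(checkpoint)
    for name in phases:
        if name in completed:
            continue  # finished in an earlier run, so skip it on resume
        run_phase(name)
        completed.append(name)
        # Persist after every phase so a crash loses at most the phase in flight.
        checkpoint.write_text(json.dumps({"completed": completed}))
    return completed
```

With this shape, a run that stalls in phase 11 leaves phases up to 10 recorded, and calling `run_phases` again with the same checkpoint picks up at phase 11.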

Um, now what we might realize is that the reason it was asking for a human in the loop in the first place, maybe the spec that was put together is just wrong. Maybe, um, you know, like with anything, I was one-shotting all of this huge spec that was being put together. It's entirely possible that Copilot originally put something into the spec that's in conflict with itself, and it's, like, physically impossible to meet all of the requirements, and as a result it needs, you know, a human in the loop. Now, maybe that's something that once we review it, we go, cool, we just have to do better planning. Maybe that's something where I can say, you know what, my orchestrator, um, especially when there's an LLM as the human in the loop, needs to be more intelligent, and actually, uh, maybe I need to tell it to go look at bigger parts of the plan to go figure things out.

Like, I don't know. My point is maybe we learn that there's some part of the harness that we can make more intelligent in that way. Uh, if you, you know, put something in your spec that's kind of at odds with some other part, you can either let a human, like, jump in, like, say it stops everything and it's like, hey, we need a human input on this. And I suspect for real things I build, that's probably how I would want to do it, unless I get to the point where I can trust my agents so much to make decisions how I would that, um, I don't need to be the human in the loop. Okay, someone's got to let me in. There we go. Um, but we're not there yet, right? So that's why I'm doing this other stuff with putting guardrails and things in place, uh, these glob-based instruction files, all of this stuff.

I'm trying to get to a point where I can let an agent just do work and it has enough framework around it to to kind of operate how I would navigate and hopefully better, right? So, we'll see what comes up out of that. Uh, but my, like I said, my hope is that it gets at least further than this step and then once all of this is done, right? So, say we get to the end of building this dashboard. Cool. Like I want to integrate the dashboard literally back into the project itself so that my my orchestration framework has a cool UI that I can like work with. Maybe I need to go build out programmatic APIs. Maybe I need an MCP server. I don't know. It's it's fun to keep building on top of it. But the idea, if it isn't obvious yet, the idea is that I want this system to be able to eventually um what's a good way to say it?

Like I want it to be able to build its own features kind of like recursive if that makes sense. So right now I'm having it build a dashboard. It's doing it in a dedicated repo. All that kind of stuff. That's fine. And what I'd like to do after is I will I told co-pilot and I'll tell it after, hey, great. If this works, like basically merge this code back in, but in the future it would be really cool if it could basically spin off um you know, I don't know how it would do it necessarily. Does it need to just make a branch of itself or does it need to clone the repo separately? I just don't know. But I'd like to have it basically where I can say, "Hey, I want you to self-improve and build this feature out." And then it uses the orchestration framework to go do that.

Um, one of the, I wanted to, as I'm getting closer to work, just kind of talk about how I envision using something like this in the future. Because even if it's not, you know, the system that I'm putting together right now, it probably won't be long term, because I'm sure other people will have done this better because it's their entire job or what their company does. Um, but my kind of goal is that it's sort of this hybrid between what, uh, Copilot, sorry, GitHub Copilot cloud agents were doing and how I was using that, mixed with, uh, sort of this, you know, this power that I'm seeing with having the harness in your terminal.

Um cuz if for those of you that don't know, I was saying last year I did so much development with like just kicking off uh GitHub copilot agents and having them build stuff in the cloud, but then the capabilities of the harness were like uh in a terminal were so much better that it was like really hard to go delegate work to the cloud like that. And they'll converge, right? Um I think they must. So, what I'd love to see is that in the future as I'm building software, I'm regardless of, you know, if I'm driving in my car or sitting at my desk or somewhere on my phone, I can describe. Oh, man. Did I miss part of my head shaving? No, I can feel it. Oops. Um, I can describe the the feature I want, the bug fix I want, and I can put it into a queue of work and basically agents will will pick away at that work.

And as you're hearing me say that, you're probably like, "Yes, like, you can literally already do that, right? Make a GitHub issue, assign it to Copilot, and so on and so forth." Um, yes, but I'm kind of talking about it less from the perspective of, like, one-off things. Uh, I would very much like to just basically be in a position where I can continuously fill a queue with work to do and, you know, orchestration just continues to pick up work. Um, has enough intelligence that if I basically don't give a lot of detail, I can trust it to go spec out the right thing. Uh, that it can split up work appropriately across agents and deliverables as it needs to.
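The "continuously fill a queue and let agents pick away at it" idea maps onto a plain producer/consumer queue. Below is a small sketch using Python's standard `queue` and `threading` modules; `handle` is a hypothetical stand-in for handing one request to an agent, and the worker count and the `None` shutdown sentinel are illustrative choices, not details from the actual system.

```python
import queue
import threading
from typing import Callable


def start_workers(
    work: "queue.Queue[str | None]",
    handle: Callable[[str], str],  # hypothetical: gives one request to an agent
    results: list[str],
    count: int = 2,
) -> list[threading.Thread]:
    """Start worker threads that keep pulling requests until they see None."""

    def loop() -> None:
        while True:
            item = work.get()
            try:
                if item is None:
                    return  # sentinel: this worker shuts down
                results.append(handle(item))
            finally:
                work.task_done()  # lets work.join() know this entry is finished

    threads = [threading.Thread(target=loop, daemon=True) for _ in range(count)]
    for thread in threads:
        thread.start()
    return threads


if __name__ == "__main__":
    work: "queue.Queue[str | None]" = queue.Queue()
    done: list[str] = []
    workers = start_workers(work, lambda req: f"handled: {req}", done)
    for request in ["fix login bug", "add export panel", "update docs"]:
        work.put(request)  # requests can keep arriving while workers run
    work.join()            # block until everything queued so far is processed
    for _ in workers:
        work.put(None)
    print(sorted(done))
```

The producer side never needs to know how many agents exist or how work is split, which is the property the passage above is after.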

So it's really, like, what I'd like to move to is more than, obviously more than the Ralph loop itself, more than just the orchestration of, can you assign, you know, work out. And one more, I guess it's like kind of, uh, two separate layers on top. One is sort of like the interaction and visibility with what's going on with this entire fleet as it pertains to the work I'm trying to assign. And then this, like, the part that replaces me, which is, like, I don't want to have to figure out the optimal way to divvy up work. I don't want to do that. Uh, I have to spend a lot of time doing that. That's, like, that's what I even do. Um, a lot of my job, when I, like, for real work, I talk about prioritizing, right? Part of prioritizing is, like, there's technical limitations and things we have to figure out and navigate that.

So it's not just what's the most important thing. It's like what's actually feasible on timelines and putting this all together. And so I'm I'm hoping and this is it's a hypothesis, right? I don't know this, but I'm hoping that we can get to this point where I can think about it's less about um the technical challenges like how are we going to restructure code to make sure that we can get this deliverable in. I'm like let AI figure that out. I mean I kind of like part of me likes that but not under pressure. It's fun to do in side projects because I get to explore and and learn, but for real work like I don't want to do that. I want to work with stakeholders, understand like truly what's going to make a platform better, help drive business, like whatever our goals are, I want to focus on that.

And then sort of the downstream part is agents kind of figuring out, cool, given what we have to accomplish, how do we go do it, and how do we, like, piece these things together and then fan out the work, do it, figure it out. Um, but, like, sort of that next level of planning and how things mesh together, I think that that's where I want to get to after this sort of more thin orchestration layer. I should also mention, like, this thing that was built, uh, based on the Symphony spec, like, I can hook it up to a repository and I can send issues. Like, I can file the issues and then it will monitor and pick up the work. So it doesn't have to be something that you, like, one-shot in a terminal. Uh, it very much is something that you can hook up to a repository and it will watch.

So, um, yeah, just a couple of interesting things, I guess. But that's what I'm looking at. Uh, literally completely on the side. It's got, I don't know. It's all in the background. Is there a fire drill? A lot of people outside. I've never seen so many people. Weird. Guess I'll find out. Yeah, that's the kind of stuff that I feel like I need to spend time on in some capacity, because if I only focus purely on BrandGhost and, like, delivering on that, which I think is obviously important, um, and I only focus on what I'm doing in the office, I feel like I'm not getting enough pure exposure to some of these other concepts. So, got to play around with it all. That's my update. See you in the next one.

Frequently Asked Questions

These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.

How do you keep the LLM from intervening during end-to-end runs?
I keep the LLM running the whole thing completely hands-off. As soon as it needs to intervene, it's great you found an issue, but we have to restart the whole experiment. The goal is to have the end-to-end run be completely hands-off so the orchestration can operate without manual intervention.
What happened when the orchestration stalled at phase 11 and how did you address it?
It got stuck on phase 11 because it needed input from someone else and the session that started the LLM broke with no retry. I wired up a human in the loop responder so another agent could provide the input when needed. We added resiliency and I told it to commit the fixes and resume rather than restarting from scratch.
What is your long-term vision for this orchestration system and how do you want AI to handle work?
I see this as a hybrid between what GitHub Copilot cloud agents were doing and the terminal harness I use. I want to continuously fill a queue with work and have agents pick away at that work, describing the feature or bug and letting them spec it out. I want to minimize my involvement in the low level technical decisions and focus on stakeholders and business goals while the orchestration handles planning and distribution. I also want it to self-improve and spin off features, and eventually integrate dashboards and APIs.