What's involved with being a developer on a platform team compared to other types of teams?
Transcript is auto-generated and may contain errors.
Hey folks, I'm going to pick a random topic today. I guess I didn't check ExperiencedDevs, and I don't think I have any pending questions. Apologies if I have missed yours. We're going to talk about being a platform team when you're supporting a lot of live services. I wanted to talk through this because it could be a unique experience for some people, depending on what they've been building in their careers. For example, say you're building a particular product, like a mobile app. You have the product that you're actually shipping to customers, and very likely there's a back-end service for it as well. So you have a product and a service, and depending on the scale, it might be that you have many services.
It might be that you have some type of platform for other people to build more service integrations with. Depending on what you're doing, you may only be exposed to certain parts of this. To continue the example, think about some of my experience prior to Microsoft: I was building desktop applications for digital forensics. I spent almost a decade doing that, and there was no live service. We had some things that were considered live services, but the core of our offering was very much a desktop product. So if you were working at that company, you might have been on a team that only ever thought about building a product and shipping it on some type of regular cadence, not a live service.
It's not even cold. There's no excuse for this thing to be beeping. So, yeah, there are all sorts of things in between, right? I wanted to talk specifically not about having a live service, but about when you're the platform that the live services depend on. This often comes with some amount of scale, because if you're talking about a platform in general: what is a platform? Why do you need a platform? Isn't it just a client and one or many services? Where does a platform fit into this? When we talk about platforms, at least the way that I want to frame it here, it's some shared offering that is provided to services so that they don't have to go duplicate the effort.
Okay. So just to make up some examples: maybe there is a company with a bunch of services for their product or products, so there's also a bunch of live services running. And they go, wait a second, we have 10 or 20 teams, whatever, all building these services, and every single one of them has addressed authentication in a different, unique way. When we think about that, it doesn't really make sense, because we are one company and our customers are sort of in the same grouping. It makes sense that we should have auth done in a consistent way. So instead of having everyone do their own, what we should do is have an authentication platform that people can build on top of. And you can keep extending this to other concepts, right?
Maybe there is uh like an analytics platform for the company, right? So that every single service is not building their own uh analytics uh pipelines ingestion um dashboards like maybe maybe you create a platform such that it makes it very easy for everyone to do it in a consistent way and then the only manual effort is really like is kind of tuning it like you have all of this ingestion you have all this data awesome uh it's compliant awesome and you want to see particular things based on your data. Cool. You just customize that. You pick what you want to send for analytics. You pick what you want to or how you want to see it and like we'll handle the rest with our platform. So again, the idea being that platforms are uh they are also, you know, live services themselves, but they're kind of like infrastructure for the live services.
And the side effect of that if you're kind of thinking about this or trying to you know maybe think of other examples like what might be a platform uh the side effect of this is that if there's an issue with the platform it can have sweeping impact across all of the services that depend on that platform right so uh what we don't want in really big distributed systems is to have single points of failure so platforms have to be extremely resilient because uh you know I mean everything should be ideally resilient but if a platform is going to have an issue it's going to have an impact that is not just you know the platform has one service oh no that's having an issue it's the platform's having an issue and the 10 20 50 100 hundreds of services are also now impacted. Okay. So hopefully that makes sense so far at least for framing.
So when we think about being on a platform team, there's not only this sort of uh you know shared experience with other live service teams where it's like hey there's an issue with your service. You're going to have angry customers, right? We want to make sure that we can do our best work to uh have our services up and running, be performant, low latency, all of these different things that we're optimizing for in our services. We want that so that our customers are able to use the products and services we have and they're happy. So that's a shared experience with platform teams. But there's an extra layer to it which is like well and this gets kind of funny when we talk about this even internally like when we say customers who are the customers for a platform team there's like multiple levels of it
right because you could say well the immediate customers immediate customers are actually the other services in our company that use our platform right they are customers to us we are building a platform for them so that you know their their services their job is easier um more effective all these things. So internal stakeholders are sort of your customer but at the same time like ultimately for a business you all have the same customers right like uh it's not just okay we only care about internal teams that's the immediate customer that bike is going very fast they're the immediate customer but um you know it it's still important of course to think about the end user, right? So, almost like several layers of customers and depending on what your platform is and how this uh how this looks in your company like you may have multiple levels of like who you consider customers.
So sometimes like I find the word customer can be a little bit confusing in conversations because of this for platform teams and so I try to pick um different words just to to be a little bit more specific about what we're talking about right like so sometimes I will say uh like end user as a customer like a very specific group of people so if I'm if I'm talking about our platform which is a routing plane Um, if I want to talk about customers that are external to the company, I would often say something like end users, right? People that are using whoever else's product or service beyond our platform. Those people are the end users. So, I try to pick more specific terminology. I might say um service owners for our uh sort of immediate customer. So the other people building on top of our platform are uh service owners, right?
So um we have that and I will sometimes like mix the language a little bit like uh we use like the word partner a lot when we're working. So for me the terminology I will say partner whether it is a um what's the right word like a like a sibling team like a different platform team. So let's go back to the analytics example. If we're the routing plane team, there is a team that does like a lot of like uh aggregate analytics in common ways. So like they are to to us they are a partner team, right? So we don't necessarily like have to depend on each other for certain things. But the relationship is like there might be initiatives where we're partnering together as a uh as a larger platform to go uh work on the same types of challenges. Um, and then I might use the word partner as well.
Like I might say partner service owner sometimes where I'm talking about again a service owner using our platform, but we're working together to try and like address uh, you know, a feature we're building uh, for them and they're like a good uh, user of it uh, or you know, working through particular incidents. I I will often use the word partner. Um, by the way, I'm saying all of this just to to kind of frame up like some thinking. And I think that it's helpful to get uh like ter like it might sound like it's maybe a waste of time and you're like, who cares? But I think it's really helpful to get common terminology because when we start talking about things that are at this scale and we're trying to think about like who we support, who we interact with, I think that it's really important to make sure that we're talking the same way, right?
I was just in a conversation yesterday, and from our side, on our platform, we were saying the word client. It's an overloaded term, because for us a client is quite literally whoever's calling our routing plane. We don't know and we don't care what it is, but it's coming from some IP address. That's a client to us. But to the partner team we were working with in this case, which is also a service team, so partner service owners, a client means a different thing. It's more of a specific application and less about the IP. So again, just a quick example to explain that common terminology is very helpful for having clear, meaningful conversations.
It's just too easy to not have the same terminology, be talking about things, and find that you're talking past each other. You miscommunicate, and that's the root of basically all the problems we have in software engineering. So with that as some framing, one of the things that I wanted to spend a little bit of time on is platform teams. Yes, we are a live service, like the live services that we're supporting with our platform. Yes, we have some shared challenges, like the fact that it is a live service. We need to make sure that we have basically 24/7 global support. It's big. And when I say it's big, you might have different perceptions of big. When I'm saying big, I'm talking quite literally across the entire planet: trillions of requests being routed in a day, and it's not slowing down, right?
So it's it's very very big and there's hundreds of services that we support and that means if we think about end users there's uh you know there's hundreds of millions of end users that are that are calling in given the traffic that we have that is in the trillions. And uh one more thing to note is like there are end users sending requests into our you know hitting our platform and there's also services that are hitting our platform talking to other services right so there's lots of different pieces moving around here. I wanted to like I said I wanted to do a little bit more focus though on the live support of things like this. And what's really challenging is that when you have a large platform where you you have to support you know whether it's 10 20 uh 100 hundreds of of consumers of your platform.
So in our case, let's say service owners that are using our platform: when you have to support them, the unfortunate reality is this. I'm just going to make up numbers. If there are 100 teams you support with your platform and each one has 10 people, that's a thousand people. You have a thousand people that you support, but your team is not a thousand people. You may also have a team of 10 people. Maybe it's a little bigger than the other teams, say 20 people, but that's 20 people supporting an entire platform that is responsible for 100 teams in this case, each with 10 people. So there is a very big discrepancy between the number of people that could need support from you directly and the number of people you have to work with.
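To put the made-up numbers above side by side, here is a minimal sketch of the arithmetic. The function name and parameters are hypothetical and exist only for illustration; the figures (100 teams, 10 people each, a platform team of 20) are the ones from the example.

```python
def support_ratio(teams: int, people_per_team: int, platform_team_size: int) -> float:
    """Return how many potentially supported people there are per platform engineer."""
    supported_people = teams * people_per_team
    return supported_people / platform_team_size


# 100 teams of 10 people, supported by a platform team of 20.
print(support_ratio(teams=100, people_per_team=10, platform_team_size=20))  # 50.0
```

With these illustrative numbers, each person on the platform team is on the hook for roughly 50 people who could need help, which is why one-by-one support stops scaling.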
And so it's really challenging to try and optimize like effectiveness with support. So what do I mean by that? Well, coming from a place that was desktop software, right? We have end users. So, we don't have other services that depend on us. We're not a platform to other services. Um, but we are making desktop software. And really, the people that we care most about are going to be the people. I would say even for the digital forensic stuff, it's it is still two levels. Um, you could argue or think about this in different ways, but the two levels that I would often think about are the examiners and investigators, different roles, but people that are interacting and using the software sort of that is the direct customer. Um, the part you could really argue is like they're not the ones paying for it. It's usually like management or or something else.
That aside, the other layer is what I would say are the people that they're either defending or prosecuting. And for us, a lot of the time, unfortunately, based on the cases, that's victims. In many cases, that would be children. It's really unfortunate and dark, but that's one of the reasons why doing some of that work was so empowering, to really feel like, okay, we're going to make a difference here. But you have in that example two layers of customers or end users. So it always felt like: drop anything we're doing, it doesn't matter about feature delivery or whatever else, if there's something that we can do that is going to directly translate into helping save lives. And when I say helping save lives, you can think about this however you choose to.
When I say helping save lives, it could be helping prove that someone is bad and getting them behind bars, or the opposite: someone's being prosecuted and they're innocent, and you're making sure that you're not damning them for the rest of their life. So it just felt like, okay, something's going on, and it could be very worthwhile, with very specific big impact, to go pivot and make sure that we're doing the right thing. I feel very strongly about that because I've spent a large part of my career doing that. And yes, it's chaotic. It's absolutely chaotic, but especially going from a startup. One second. Especially going from a startup where, uh, this wants me to get off here, but that's not right.
I'm not doing that. Going from a startup where you're kind of fighting to stay alive, and then also having that kind of mission and impact, that's personally kind of ingrained in me. I have a belief that that's the most valuable thing to do, that kind of work. So when we come over to a platform team, if there's a partner, like a service owner using our platform, and they're having a problem, my go-to internal feeling is: what can we possibly do to help them? We need to do something to help them. This is our customer. We've got to do something. For me, one of the biggest challenges is that it simply does not scale. It's truly impractical to be able to jump into everything like that, because we would never do anything else.
We would like without exaggerating we would just be doing that all the time because there are so many partners there are so many people with questions there like uh there are always going to be given the surface area some type of you know live site thing going on just it's the surface area is too big for for there to not constantly be things like that. So we had this challenge which is like how do you try to give the best possible experience supporting people that use your platform when when it doesn't scale to kind of give them the best kind of support and it means you kind of have to make trades right and so I think I find myself in situations where you know prioritizing that kind of thing is very difficult. I want to make sure as much as possible that as an engineering manager, I'm trying to I use the word like shield uh shield my team from randomization.
So if there's something that comes up and a partner is like, hey, we need help with this. Maybe it's not even like an actual incident, but like hey, we're building something new or like we've rearchitected something and we need, you know, different kind of support here. Um, I don't I don't want to randomize my team because again, these types of things come up all the time because we are a platform and we do support hundreds of services. So, I don't want to randomize my team, but I also want to make sure that I can help do the right thing for partners. And so, uh, the the role of an engineering manager, which is something we talked about in previous videos, is like becomes like significantly more weight on the prioritization part because there is an unbounded amount of work and things to focus on. And I in my opinion that becomes the number one thing to try and uh optimize for, right?
It's like you you literally cannot do it all. Right. If I could just keep adding people to the team, I I still don't think that that like sure, yes, more people on the team that were ramped up and effective, yes, that would be helpful. We could do more. Um but it's like at some point it stops being a it stops being something that scales that way. And so one of the one of the issues to consider here is like how much can you jump into fight fires give attention to these partner teams versus listening to signal coming from multiple teams and going okay there is a class of issue concern question whatever it happens to be right there is a class of this instead of us going one by one by one to try and help every individual this way. Like we know that doesn't scale.
If we tried to, we might be able to keep up very briefly and then we're we're drowning. So, do we see classes of things like this? And if so, can we start prioritizing that work? Right? So the the shift comes or goes from at least my mental model of like I want to try and help every partner that we can and jump into it to more like I have to pump the brakes pretty hard. I have to do a lot more shielding from my team to keep them from being fully randomized on because then they're just kind of in like this this doom loop of like what's the next fire to put out? I have to shield them from that which unfortunately means like in what I perceive is probably a worse experience for partner teams where they're like hey like we just need support here.
Um, so finding this balance between like how do I not um how do I not like throw like immediately throw resources in like time, effort, people includes myself, right? How do I not just do that? Make sure that we can at least like unblock partners so they're not just like we're screwed like we literally have a live service and we're screwed. Like how do we get that kind of balance? Um, obviously quick wins wherever we can to make sure that it's a good experience, but still listening to classes of things coming up and then going, "Okay, we've heard this kind of thing coming up, you know, n number of times over the past x period. Um, this much like variance in it, these different factors like we should be talking about like what we do as a platform to address that." And those types of things can vary dramatically, right?
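The idea of hearing the same kind of thing "n number of times over the past x period" can be sketched as a small tally over support requests. This is only an illustration under assumptions: the `SupportRequest` record, the issue class labels, and the thresholds (`window_days`, `min_count`, `min_teams`) are all hypothetical and not taken from any real system described here.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SupportRequest:
    """Hypothetical record of one partner asking for help."""
    team: str
    issue_class: str
    opened: date


def recurring_classes(
    requests: list[SupportRequest],
    today: date,
    window_days: int = 30,
    min_count: int = 3,
    min_teams: int = 2,
) -> list[str]:
    """Return issue classes seen at least min_count times, from at least
    min_teams distinct teams, within the last window_days days."""
    cutoff = today - timedelta(days=window_days)
    recent = [r for r in requests if r.opened >= cutoff]

    counts = Counter(r.issue_class for r in recent)
    teams_by_class: dict[str, set[str]] = {}
    for r in recent:
        teams_by_class.setdefault(r.issue_class, set()).add(r.team)

    return sorted(
        issue_class
        for issue_class, count in counts.items()
        if count >= min_count and len(teams_by_class[issue_class]) >= min_teams
    )


# Made-up sample data for illustration only.
sample = [
    SupportRequest("team-a", "config-question", date(2024, 6, 1)),
    SupportRequest("team-b", "config-question", date(2024, 6, 10)),
    SupportRequest("team-c", "config-question", date(2024, 6, 20)),
    SupportRequest("team-a", "latency-regression", date(2024, 6, 25)),
    SupportRequest("team-d", "config-question", date(2024, 3, 1)),  # outside the window
]
print(recurring_classes(sample, today=date(2024, 6, 30)))  # ['config-question']
```

Requiring more than one distinct team is one way to separate a class of issue from a single noisy partner; the right thresholds would depend on the platform.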
Like sometimes sometimes it's like uh just I'll give you a couple of examples, right? Sometimes it's us changing an internal process, right? It could be um I'll just I'll make up an example. We have our on call rotations. Maybe one of the things we're observing is like uh there are classes of partners that are reaching out to us in certain ways that um based on how they do that. It's we don't have visibility like they're kind of waiting for help and we're like there's no one on the other side that's like kind of getting the call. So like are there things that we can do in our operations that make that better? Are there things where it's like literally get documentation in place? So, um that we can put that in front of partners when they're having challenges, right? Like help them solve their own problems.
And that class of uh issue is very interesting to me because um I don't have like stats to prove this but my my feeling is that there is an overwhelming number of situations where it's like people like say for all of the partner teams that contact us and they're like help us with a problem like you know we're we're using your platform we don't know what's going on. We we suspect it's your platform. Right? There is an overwhelming number of issues where it's like someone just has a question about the platform. Someone is doing something with the platform incorrectly because they're kind of guessing or making assumptions. Um there is an like a I'm not saying that we're perfect and we never have issues but comparatively there are far fewer things where it's like look uh I guess two things like one like there is
sort of a an immediate issue with the platform like something has very obviously regressed and like we need help to turn off whatever immediately like there is that class of issue and this happens with every live service as much as we try to prevent it. And then the other is like, hey, we've observed over time like something has changed or something is changing. Like there are trends in data that say something's not quite right anymore. Those classes of things exist for sure and like in my opinion those need the most attention because they require hands-on time to go investigate, to go help. But the other class of issue that I'm talking about originally, one sec, uh is in my opinion something that is very much solved or improved at least with better documentation and I feel like having uh better agentic tooling around that.
Um I've seen different flavors of this kind of thing where like I think a lot of people are as I say this you might be jumping to like you know support chat bots. I think that's one thing. Um I think there are like whether it's dashboards, different types of tools like basically just giving people the right tools to help them solve their problems because a lot of the times like for this class of issue we spend a lot of time just trying to ask questions back like tell us actually what you're observing. So there's a lot of this kind of going on and then at the end of it we end up like kind of checking some dashboards and systems and we go oh like you know you're you're doing this wrong or you made a incorrect assumption about this and like here you go like make you here's how you would make a change uh go do that and we're pretty confident you'll be good.
So do they need us for that? Like obviously the answer is currently yes or has been historically yes and I I really think that this class of issue is dramatically improved with like better documentation tooling uh and AI support. Um on the AI support part I haven't seen this done perfectly yet but I'm I'm hopeful. What I what I don't think is uh you know ideal is jumping right to let's have AI automatically like basically just try to resolve people's issues. It's like trying to do too much at once and then so it's trying to do too much at once and then doesn't do it well and then the expectation you have of it is like well it was supposed to go fix the whole thing and it's not. So you're always disappointed. Um I think that there's like intermediate steps, right? Like you could just to make up an example.
If 99% of the time we get an engagement from a team saying they're having an issue, our response is, okay, we need this information from you because you didn't provide it yet, could we not take one initial step through whatever tooling, automation, or AI you choose, to improve this very incrementally by making sure that we get this upfront? We literally cannot help you until we have this kind of information. Can we take that incremental step? Then you might uncover another problem, which is great: okay, we see that you're asking for that, but we don't even know how to get that information. Can we then take one incremental step to improve the documentation or make tools or dashboards to help people get that?
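One way to picture that first incremental step is an intake check that asks for missing information before anyone on the platform team is pulled in. This is a sketch under assumptions: the field names in `REQUIRED_FIELDS` and the wording of the replies are hypothetical placeholders, not a description of any real intake process.

```python
# Hypothetical fields a platform team might need before it can investigate.
REQUIRED_FIELDS = ("service_name", "time_window", "observed_behavior", "request_id")


def missing_fields(ticket: dict) -> list[str]:
    """Return the required fields that are absent or blank in the ticket."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = ticket.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def triage(ticket: dict) -> str:
    """Ask for missing details up front; otherwise pass the ticket along."""
    missing = missing_fields(ticket)
    if missing:
        return "Before we can help, please provide: " + ", ".join(missing)
    return "Thanks, we have what we need to start investigating."


incomplete = {"service_name": "checkout", "observed_behavior": "  "}
print(triage(incomplete))
```

If partners then report that they don't know how to find one of these fields, that is the signal for the next incremental step: documentation or a dashboard that surfaces it.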
So basically, can we take the incremental steps to give a paved path for people? And at some point along the way in that paved path, I'm sure you have things like: can we use AI chat bots, AI workflows, whatever it is? Personally, I think that is one of the things from a platform team that would make a dramatic impact. I'm thinking about our own team and the types of issues we have to deal with, and prioritizing that kind of thing is one of the areas that I'm hopeful about. Instead of me always feeling bad that we can't just jump in to solve everyone's problem, if we go a little slower for a shorter period of time and build better things like that, can we help other people solve their own problems?
And kind of get some of our own team efficiency back. So anyway, that was a lot of rambling, but those are some thoughts that I've had, especially recently, thinking about platform teams, thinking about people engaging with us, and thinking about these scenarios where we definitely want to offer as much support as we possibly can. But how do you do that without grinding progress on everything else to a halt and randomizing everyone on the team to go jump into that kind of stuff? Anyway, there's some perspective. So, if you've got questions related to this, or anything in software engineering or career development, let me know in the comments or go to codecommute.com. Happy to make a video response for you. You can submit anonymously. I have no idea who you are unless you literally write it into the message, which you can do if you want.
Um, but I I basically will take whatever you write, talk about it in a more anonymized way. So, even if you're saying company names and stuff, I won't mention it. Uh, and I'm just hopeful that if I can share some experience or even different perspective that that's helpful in some way. I don't mean to come across like I have answers to everything because I absolutely don't. So, I hope to see you in the next video. Take care.
Frequently Asked Questions
These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.
- How do you define a platform and its role in supporting live services?
- I frame a platform as a shared offering that is provided to services so that they don't have to duplicate the effort. Platforms are also live services themselves, but they're infrastructure for the live services. The side effect is that if there's an issue with the platform it can have sweeping impact across all of the services that depend on that platform.
- What are the challenges of prioritizing work when supporting a platform with many dependent services?
- I find it's very challenging to optimize effectiveness with support because there is an unbounded amount of work and things to focus on. It can be hard to scale when there are many teams involved, so I have to shield my team from random fires and avoid constant firefighting. So I prioritize by classes of issues and look for scalable improvements to unblock partners.
- What improvements or tooling could help platform teams reduce support load and empower partners?
- I think better documentation and agentic tooling around that would dramatically help. I also see potential in incremental AI support rather than trying to automatically resolve everything right away, so we can build a paved path and improve upfront information before bringing in smarter automation.