Tips For Navigating On Call Rotations As A Software Engineer

The dreaded on-call rotation! It's scary business being an OCE for the first time, so let's chat through some tips on how to get started.

📄 Auto-Generated Transcript

Transcript is auto-generated and may contain errors.

hey folks I figured I'd do a bit of a talk on on call rotations um and it's kind of top of mind for me right now because I'm on call this week I had a bunch of videos filmed prior to this week and um I think I had like six ready to go out five or six so if you're watching like even this one will be posted later in the week it's only Tuesday right now um but I'm I'm trying to catch up because I'm not driving to work and um that's when I film all my videos right so I'm not driving to work because I am on call uh and I just don't want to have any time in the car where I might get paged or anything like that so and it's it would be different if it was like a five minute drive

or something so it's not it's a lot longer I could get stuck in traffic or whatever could be like up to an hour if you watch some of my videos they're truly like 50 minutes um so top of mind for me this week I figured I would kind of go through it just to share some different thoughts especially around like when I'm getting people on my teams to get ready for on call shifts for the first time there's a lot of like people get really nervous right it's like oh no like I've never had to do this before like you know what's involved so I figured I'd talk through that just a reminder if you want questions answered leave them in the comments please and thank you um you can send a message to Dev leader that's also my main YouTube

channel but it's the sort of the The Branding I use for all my social media send me a message add as much detail as you want I'll keep it Anonymous to answer your questions uh it's kind of like how I drive content on this channel in particular uh and then worth noting too I started doing resume reviews on my main Channel if you go to Dev leader and uh you look for the resume review videos um you'll see in those videos how to submit yours if you're interested so with that said let's talk about on call rotations um so for those of you that aren't familiar with on call rotations I'm going to describe it a little bit um I must add the caveat that like every single place you go like from Team to team even at the same organization it could look different

so the idea is that if you have a live service of some sort that you are available to be able to assist in helping with issues in the live service in a nutshell right so pretty uh pretty simple definition but the complexity around that can can grow like tremendously depending on what's going on so for example um you could have uh like your shift length in terms of like how many days you're on uh primary backup kind of setup you could have um within a day like how many hours for the shift you have so uh just in terms of the scheduling and all of that like you can have like or even the frequency right are you on call once a month is it once a quarter um so there's a whole bunch of different variables before I move on like I will share

that I've done I've done some overnight on calls um I have done I've never had like a 24-hour shift but I've done like 16 hour shifts 12 hour shifts um and for the most part the way that I've had on call shifts set up is there's at least one other geography like a different time zone that is typically set up so that's been like at Microsoft in particular the teams I've been on that's one of the things that's really beneficial is that the teams I've been on we've at least structured between like North America and China which is great um so I've had that I've had periods where uh the on call shifts are uh I would say like more frequent like at least once a month and that's like I've had periods where I'm managing two teams one of the teams needs help with the

on call shift in terms of staffing that up and as the manager like I step in as much as I can to try and help there um so had that I've had like far less frequent and there's pros and cons to these right like I I feel like a lot of people don't love the idea of on call um because it's just extra uh but in terms of like learning and experience and stuff it's actually it's you know silver lining I feel like it's one of the best ways to learn and get ramped up in an area it's like a lot of stress in the beginning because you're probably panicking like how do I actually help here but you're kind of forced to like learn things which is helpful seems kind of funny but so I've had a whole bunch of different experiences doing

this over the past five years um I'm sure for people that have had longer careers than me in a area that has live Services you I'm sure you've had more experiences on call than I have and um that's like I spent a lot of my first part of my career shipping desktop based software so no on call lots of late nights fixing bugs but no on call so um okay with that said um I think some of the the major advice that I like to give individuals before getting ready for their on call shift is like your your responsibility when you're on call is not that you are the person that fixes all of the bugs um and I think that's one of the reasons why people get very anxious about on call shifts they're like okay so I work in this part of the

team but I'm on call and responsible for all of this stuff they're like I don't know any of that stuff I haven't seen that code before what if there's a problem um so I think a lot of people get stressed out about this because how could they be responsible for going to to take care and fix issues like having to go into the code and fix the bugs in these areas they don't know so um at least in the teams that I've worked on that has never been the expectation of course like if it comes down to it and you're able to then great um but a lot of the time it's like we can get the subject matter experts involved um biggest pieces of advice though I would say are like making sure that Partners so if you have like a partner team reaching

out to you um because I've worked on platform teams so there's a lot of partner teams that own services that rely on us um so if we have that or you know let's start with that if we have teams that are reaching out to us one of the things that I say is like make sure that you're acknowledging them fast there's like from my experience doing this most of the time people just want to make sure that like they're that someone's listening because they're on call too and they're panicking they're like hey something's wrong with my service or something isn't going as expected or I have questions and they're sitting there going like please please please respond right especially if they're new to it too um so in my opinion and my experience like getting back to people fast and just being like hey I

acknowledge this even if you don't have the answer right away but being like hey read received like looking into it for you I think that goes a tremendous distance so I would highly recommend leaning into that because the other approach I've seen people take is like they see the email come in or they see you know the message come in and they're like okay like I got to go I got to go get the answer first I really got to go do a deep dive and investigate this and then like hours will go by and then like that team is like they're trying to escalate or they're trying to reach back out and you know reach out to other people and it's it's like all you had to do was let them know like I'm looking into this so one of my pieces of advice

is like acknowledge uh early because people really do appreciate it um I would say like as a beginner a piece of advice is like there's and this is again from my experience there's something to be said about like if you get called uh like so a team pages you and they need help there's something to be said about acknowledging that like hey look I'm new to this or hey like I'm trying my best here uh I have seen that have it doesn't sound like it's a calming effect cuz you're like what do you mean like you're the new person but I've been in situations where and like many of them many situations where I'm like hey like I haven't had to deal with this type of problem yet so please bear with me I'm I'm reaching out to other people for help I'm looking up

the resources and acknowledging that with other people I've noticed that they do really appreciate it I've noticed in some situations by the way this is all internal I should have clarified that internal teams I could imagine with a customer they might be like what the hell right like get me the expert right now uh so my apologies for not clarifying that sooner but I found with internal teams I've been in situations where it's like hey this is this part's new to me like please bear with me I'm trying and I've had people be like hey like I'm new to this too so like thank you so much for just helping like I really appreciate it like I'm learning here with you I don't know like I've statistically had like a very large number of situations like that so that's something that uh I don't know

at this point I kind of encourage just transparency around like hey look I'm looking into this for you I might need a little bit of extra time to get the right expert involved or to to find the right resource but like I am trying my best I think people do appreciate that at least where I have worked so that's another piece of advice um I think something else to add is like people will say to me like okay well on call shifts are generally like you might have them during your working hours so for me like my current shifts are 12 hours long and if my working hours are like 9 to 5 let's say uh I would do like a let's say 6:00 a.m. to 6:00 p.m. shift someone might say well hey or on the weekend because it'll go on the weekend too so say on the weekend at 6:00 a.m.

to 6:00 p.m. someone might say like well does that mean I can't like leave home am I stuck like what if I have to get groceries like what if I have to take a shower like what do I do so my my piece of advice here is um if you are and we have primary backup your setup or situation might look different um and we have like paging mechanisms that will fall back to your uh to your backup so um if if it's something like you're taking a shower or whatever I would just be like turn your phone on like so you can hear it ring and then uh because we have like a couple of mechanisms for paging I would say just like you know if you miss the the initial page like just you know try to get out of the

shower enough to dry your hand off or something to like respond to the next page and then at that point if you're like okay like I'm literally like you know I'm naked because I just had a shower like get dressed like do what you got to do but at least you acknowledge the page and then I would say like you know within a couple of minutes like 5 10 minutes it's try to make sure that you can get to a computer to go you know help um and this is the same thing like with uh like leaving home and stuff right so I've definitely had situations in the past where um it's the end of my shift or like towards the end and because I used to go to the gym at night and my on call shifts used to be 16 hours then it

would mean that like I can't get to the gym until like 10:00 and I would be like look like I'm going to go at 9:00 and the gym is only a couple minutes away so I've been in situations where I get to the gym and then I have been paged so okay I'm at the gym I get paged I respond to it on my phone I check initially what's going on and then I can say oh like this situation's already understood or whatever I can transfer what's happening to the right owning team I could do a quick investigation uh because sometimes especially on platform teams you might find situations where someone's like hey we think this is your platform and depending on your experience and the amount of information that's provided you might be able to say actually no like you're you're you're observing

something that looks like our team but like based on this information it's actually this other team it's so not an uncommon thing um and I've had situations like that where I'm paged and instead of me being like Oh man like now I have to go home drive the whole five minutes home um it's like oh actually no I can transfer that that's fine or it's understood it's fine and then it's great I just keep going the gym um and then I've had other times where I'm like nope I got to leave and the reality is that say say during the week just to give you an example that was Monday to Friday instead of me saying well I definitely can't go to the gym any day this week it would be like okay maybe the one day I got paged and had to leave and

I was still home within five minutes and on a computer right like um so there's flexibility I would say but you don't want to find yourself in a situation where you cannot help because then you're kind of dropping the ball um I would say if you're like if you need that kind of flexibility so for example you're like I have to grocery shop on the weekend or I'm going to starve like I need to be able to step away to do that um most on call shifts have like some type of backup most of the time like if you're smaller company or like your team doesn't have that then I mean you probably want to sort this kind of thing before your rotation but for us and like the just under 5 years I've been at Microsoft we've always had like primary backup at least

and um you know there's times where I would say to my backup like hey I got to step out for a minute like are you okay to to watch for alerts like I just have to like you know we have to go to Costco like we don't have any food it's like no problem or I have to step out like um I'm just making up like dog got into something or dogs are barking outside and I have to go like take care of something or uh my wife called for help I have to go help her like can can you you know there's an hour left in the shift can you kind of just watch for the alerts and like I I'll you know be doing that um but communication with your backup and stuff is just key because if you walk away and then

you don't respond and people are expecting you to like you end up being like the face of your your team at that point people are depending on you so um yeah just making sure that you can communicate with the backup but otherwise like there's some flexibility there if you are the backup I would say you have a little bit more flexibility right and for us as an example like weekends are generally more quiet there's less teams people at work doing stuff so things are more quiet but um it doesn't mean like oh I'm back up I'm just going to like I'm going to take a vacation this weekend and like you know not have service or something like that no um but I feel I personally feel a little bit more relaxed I just make sure that I'm available so if my primary messages me

and they're like hey I need help I can say no problem like you know I'll be online within like within 30 minutes like tops because it's like hey I'm just I'm at the grocery store and uh like let me know like give me any details and I'll be home within like 30 minutes or something and I've always found like something like that seems okay but um the more like this is kind of nuanced right like you have to sort this kind of stuff out on your team I encourage you to have those conversations ahead of time um because I think you would appreciate the same thing right if if your backup was like oh like I'm actually I'm in the car for another three hours and you're like uh well what kind of backup are you like that's not that's not a lot of help

right have those conversations ahead of time I'll give you another example like I swapped rotations with someone and I said I can swap but on this particular evening I need to finish a little bit early and they said no problem we'll sort that out like easy conversation to have just have the conversation okay so couple things we talked about so far just due to a little check-in um and some things that I should kind of drill more into uh when you're on call you are not solely responsible for fixing all the code so we should probably come back to this um I do recommend you engage early um I was talking about like sort of a response time um and then also like this transparency around like if you're new to it like I think acknowledging that's kind of helpful uh so okay well what

happens right you get paged like what do you need to do now there's a lot of different things that like different teams will handle this in different ways um the teams that I have worked on try to build up repositories of like how you sort through things that come up when you're uh running a service right so if you can imagine you have automation that's kicking off um alerts to you saying like hey this doesn't look good over here or you have a team that's like hi like we have this issue what I like I would recommend this based on my experience now because this has seemed to work pretty well is trying to document the processes that people follow um to be able to sort through things right so if you see these patterns where like this type of issue comes up right there's

a computer that's running in a data center somewhere things aren't going so great on it okay like well what do you do because that's not a situation that like may like won't ever happen again it's entirely possible and likely there's going to be a situation in the future where you're getting signals from a machine where it's like hey there's a lot going on here someone should look at this okay well how do you get more information how how do you figure out the steps or like what you should do next to go investigate so these are the types of things I would recommend um either putting into like a Wiki um some type of repository of information I think now like these days especially with like all of the AI tools there's probably a lot of awesome stuff you can do to have like

sort of this repository of information set up and you could you could ask AI like hey I'm encountering this type of thing like this would probably be something that your company runs not like hey ChatGPT here's all of our internal documentation tell me what to do but um I I can imagine like leveraging AI tools here right to be like hey here's what I'm observing could you suggest where I should go look right like do we have documentation on this can you point out like maybe a good starting point um if you don't have AI for that then like yeah by all means like having something like a Wiki that's easily searchable something else to think about um and another piece of advice is like when people are writing up like here's how I navigated solving this thing um this is kind of might

sound counterintuitive but I've noticed sometimes when you have experts write like here's how to go solve this problem what happens is that they probably call out some critical things which is great because they're the expert but they end up making some assumptions that they don't include in the steps or like what to be thinking about like they kind of do little shortcuts because in their mind they're the expert they've done it a million times so I would recommend um if you are someone that's more Junior right and you're on call you had to go troubleshoot something and there wasn't documentation I would say like you should go write that and what I recommend is like you don't just say to the more senior person like hey you're the expert in this area you go write at all I would say you should try to take

a stab at writing that at a minimum if the other person's going to write it I think it's essential that someone who's less experienced goes through and exercises it to make sure that it makes sense to them and that's because the next person that's on call that's going to need that is probably not the expert they didn't need it in the first place so you want to kind of like sanity check that stuff and have someone that's less experienced go through and say okay I can follow these steps um so yeah you could have someone more experienced write it but have someone less experienced kind of exercise it and make sure that things feel good um but this has been something that works really well for us um I would say that if you have alerting systems and things like that um you can even you

know depending on the complexity of your systems and stuff like if you can call out like you know here's here's the resource for going to like troubleshoot this like great like the more that you can streamline trying to help the people that have to go put out the fires the better um and just kind of think about it right if you're an on call engineer and you're it's always going to feel some level of stress when you're on call right like you have your normal work stuff that might already be stressful for you and now it's like on top of that you're going to have either teams reaching out to you you got you get these paging alerts coming in and you're like man like I don't want to do this um you're stressed out about it it's going to be harder to like to

problem solve and think through things just because of that elevated stress so think about like what little things you could do just to alleviate some amount of stress right so like I said having the stuff documented could you imagine how much more comforting it is when you're like I don't know what the hell this this issue is like I got this paging alert it says something's broken and you're like I just don't know and then you take a piece of that and you search it in your Wiki and there's this stepbystep guide and you're like oh man like thank you um and then it doesn't work partway through um but yeah it's like it's a lot more reassuring and can calm people they can think through things more clearly so um trying to think some other pieces of advice um this is maybe like a

team dynamic kind of thing so I've worked on teams uh where there are there's a pretty pretty big dichotomy of like expertise so there would be some people on the team that have been around for ages and they're truly like experts like you know call them like historians because they know all of the the ins and outs like the legacy parts of the systems all the quirks and stuff like that and they also just have experience navigating all these things and um you can kind of put like any type of on call issue in front of them even if they haven't seen something like that before they can probably figure it out pretty quick um so we got people like that there's a so that's a small population there's also a small population of like folks in the middle and then I've noticed like on

the teams I've been on personally is that there's a larger like a much larger population of like newer developers and that could be new to the team but also like Junior so a combination of the two and of course people uh when they're on call and they start to feel anxious and it's like oh something comes in if you don't know the answer right away this kind of is like if you've watched my other videos about um like Junior developers asking questions and stuff um we fall into this trap sometimes of like like I don't know what to do please help when someone ends up doing the work for you it feels like hell yeah like we got the problem solved right like the fire has been put out the problem is that it kind of perpetuates this so in my example that I'm giving

you when you have the experts on the team and you're like hey so and so is the expert I'll just reach out for help they'll definitely be able to help you know it and then their help to you is they go oh like this is the exact issue here's exactly how you solve it and sometimes it's like and I already did it it's like it feels good because the problem solved but the the challenge is like next time it happens like you might feel like oh it was just easy you just do these steps but you didn't practice it and I have absolutely found I'm sharing this with you because I found myself in these situations where I'm like oh like thankfully issues been averted um or good and then it happens the next time I'm on call and I'm like I can't remember the

other step that was taken like I have an idea and then I have to reach out to the person again I'm like I remember we did something regarding this and I have to like basically what I'm trying to say is and if you're the person asking the questions I urge you to kind of think about this kind of stuff this way is don't ask people to solve your specific problem I actually did this today and I wanted to kind of share something I'm not going to give you the details obviously but um I was trying to troubleshoot something and I know the person that I reached out to is an absolute ninja with all of the things related to like our live site service like livesite issues for our service they can navigate anything they know where all of the tools are they know where

all the signals are all the dashboards they can they're amazing and I was working with my uh sort of partner in terms of our on call rotation we were looking at something and we were like this isn't really adding up between the two of us isn't really adding up and I said I should reach out to this person but I know from my experience so far with people like this that are so good I'm like I need to get better at this so what I did was I said hey in a situation that looks like X I said number one I think that I would expect this to kind of let them know like I've been thinking about this I'm not just like you know pinging you to say like please do this for me but I'm like I would expect this and also

like um if I wanted to find more where would I go and kind of hinted like I can't seem to find like documentation on this what I didn't do was like paste all of the information give them all the details cuz I know that they would go solve it immediately because they can and they could do it super fast in this in this situation at least I was like this isn't like the world is on fire and I need someone to go like put it out immediately I was like I know there's an opportunity here where we can learn and like I'm not saying like take our time for a week but we could at least like go poke around for a bit and try to learn so I asked it this particular way so that I could get the resources and then I

could go do the work that was the intention so I structured it that way on purpose now in this case it was kind of ironic because I didn't tell them the information and they happened to go they happen to go look up the information they knew exactly what I was talking about and I uh whatever it's this just a tangent but I ended up taking a screenshot of it to the other on call and I said I didn't even tell this person what the issue was and they they they figured it out so that's how good they were um the point is that uh I I encourage you if it's not something where the entire world is on fire and you need like to immediately like resolve everything because it's so bad I would recommend that if you have the opportunity try taking the steps

yourself if someone's like hey give me a call we'll screen share and I'll walk you through it you walk through it you be in the driver seat it's it's a little slower and it might feel clumsy cuz someone's it's like pair programming right someone's telling you like how to go drive and you're like oh like where's this button I have to click it seems like it's so insignificant but I strongly encourage you do it because that compared to just watching someone do it I can tell you at least from my experience you you don't retain it not not as well at least so that's my my suggestion is like try to do the driving yourself um and if you know that you're talking to an expert on something try to frame your uh your situation a little bit more generically um so that you have

the opportunity to go follow up and dive in so um I think that's where I'll wrap it up um I wanted to say like if you have questions about on call shifts and stuff like ask them below this is the first time I think I've talked about it so um a lot of the other topics are variations of similar things so I've kind of thought a lot more about them this is kind of the first time I've been thinking about this because it's top of mind for me so um yeah if you have questions leave them below um quick reminder right like this is going to be different everywhere that I switched teams last year our on call is different some very similar pieces but it's different um so just keep that in mind and um yeah at the end of the day at least

of the teams I've worked on everyone's extremely supportive so I think sometimes people go into these shifts feeling like you know it's me against the world but like I I've reminded people that I've worked with like whole team's here to support you you're not alone you're going to be the face of the team so the alerts come to you Partners will reach out to you you're the face of the team but it's okay we're all here to support you um and I I've even told some folks on my team that are newer like hey I know you're going to be like this is your first rotation over the weekend like if I'm around like hey just so you know I'm around this weekend ping me if you need me like I try to make myself available because I want people to not be going into

that panicking the whole time so um hopefully you have a team that you're surrounded by that's there to support you and I hope that's a good reminder that like you know you get some backup I'll wrap it up there thanks folks I appreciate it I'll see you next time
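
The primary/backup paging described in the transcript (miss the first page, catch the next one, and fall back to the backup if the primary never acknowledges) can be sketched roughly as below. This is a minimal illustration only, not how any real paging system works: the `Responder` class, the `page_with_fallback` function, and the two-attempts-per-person policy are all hypothetical names and assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Responder:
    """Hypothetical on-call person for this sketch."""
    name: str
    # Callback that returns True if the person acknowledges the given
    # page attempt (1 = first page, 2 = second page, and so on).
    acknowledges: Callable[[int], bool]


def page_with_fallback(
    responders: Sequence[Responder], attempts_each: int = 2
) -> Optional[str]:
    """Page each responder in order and return the name of the first one
    who acknowledges, or None if nobody does.

    Assumption: each person gets `attempts_each` pages before the page
    falls through to the next person (primary first, then backup).
    """
    for responder in responders:
        for attempt in range(1, attempts_each + 1):
            if responder.acknowledges(attempt):
                return responder.name
    return None


# The primary misses the first page (in the shower) but catches the second.
primary = Responder("primary", lambda attempt: attempt >= 2)
backup = Responder("backup", lambda attempt: True)
print(page_with_fallback([primary, backup]))
```

If the primary never acknowledges at all, the same call returns the backup instead, which is why telling your backup before you step away matters: they are the next person the page lands on.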

Frequently Asked Questions

These Q&A summaries are AI-generated from the video transcript and may not reflect my exact wording. Watch the video for the full context.

How should I handle communication when I'm on call and receive a page from a partner team?
I recommend acknowledging the page quickly, even if you don't have an answer right away. Let the partner team know you received their message and are looking into it. This helps reduce their anxiety and shows that someone is actively working on the issue.
What advice do you have for new engineers who are nervous about their first on call rotation?
I tell new engineers that they are not expected to fix all the bugs themselves, especially in unfamiliar areas. Being transparent about your experience level and letting others know you're trying your best can actually help. Most internal teams appreciate honesty and collaboration when you're learning.
How can I effectively learn from on call incidents instead of just relying on experts to fix problems?
I suggest trying to troubleshoot issues yourself first and only asking experts for guidance rather than complete solutions. When working with experts, try to be in the driver's seat by having them walk you through the steps. This hands-on approach helps you retain knowledge better and prepares you for future incidents.
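
The transcript also suggests keeping troubleshooting steps in an easily searchable wiki, so that you can take a piece of a paging alert and find a step-by-step guide. A minimal sketch of that lookup follows, assuming a plain dictionary stands in for the wiki. The `find_runbooks` function, the runbook titles, and the simple word-overlap matching are all hypothetical; a real team would use whatever search its wiki provides.

```python
def find_runbooks(alert_text: str, runbooks: dict[str, str]) -> list[str]:
    """Return runbook titles that share words with the alert text,
    best match first.

    Assumption: words shorter than 3 characters are ignored so that
    filler words like "on" do not count as matches.
    """
    alert_words = {w for w in alert_text.lower().split() if len(w) >= 3}
    scored = []
    for title in runbooks:
        title_words = {w for w in title.lower().split() if len(w) >= 3}
        overlap = len(alert_words & title_words)
        if overlap:
            scored.append((overlap, title))
    # Highest overlap first; ties broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored]


# Hypothetical wiki contents: title -> troubleshooting steps.
wiki = {
    "high cpu on host": "1. Open the host dashboard. 2. Find the busiest process.",
    "disk full on host": "1. Check log volume. 2. Rotate or archive old logs.",
}
print(find_runbooks("ALERT cpu high on host-17", wiki))
```

Having someone less experienced run a search like this against a real alert is one way to check that the documentation is actually findable by the next person on call.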