
Taking on-call responsibilities is one of the best ways to fully understand a system. While it can be stressful at times, it teaches you a lot about building reliable software systems.
I’ve taken on-call responsibilities for most of my career. Here, I have listed some tips that I have found useful while handling on-call and while training others to take on-call.
Before Outage
- Know every touchpoint through which your clients interact with your system, for example REST API layers, HDFS data, etc.
- Know every component of the systems you handle, and periodically verify that you have the required access to each of them.
- Maintain a runbook to refer to, even for trivial fixes. Engineers often forget how to perform trivial fixes if they haven’t done them for a long time.
- Maintain links to all system-related dashboards, graphs, log analysis tools, etc. in a single place so you don’t go hunting for them during an incident.
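The periodic access check can be automated in a small way. Below is a minimal sketch, assuming each component has some cheap probe you can call (a login attempt, a read-only query, a directory listing). The component names and the probes themselves are hypothetical placeholders; swap in whatever checks make sense for your systems.

```python
from typing import Callable, Dict, List


def find_access_gaps(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every access probe and return the components that failed.

    A probe that raises is treated the same as one that returns False,
    because either way you would be locked out during an incident.
    """
    gaps = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            gaps.append(name)
    return gaps


def _expired_vpn_probe() -> bool:
    # Hypothetical probe that fails by raising, e.g. an expired credential.
    raise ConnectionError("vpn profile expired")


# Hypothetical components; real probes would talk to the actual systems.
example_checks = {
    "rest-api-admin": lambda: True,
    "hdfs-cluster": lambda: False,
    "prod-db-readonly": _expired_vpn_probe,
}

if __name__ == "__main__":
    print(find_access_gaps(example_checks))
```

Running something like this on a schedule, well before your shift starts, surfaces expired credentials while there is still time to fix them.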
During Outage
Identify the “blast radius”, i.e. the extent of the impact. Is it affecting 1% of your users or 50% of your users? A single region or multiple regions? How long has it been broken?
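If you already have per-region counts from your dashboards, the first two questions reduce to simple arithmetic. Here is a rough sketch, assuming you can get total and affected user counts for each region; the `RegionStats` type, the region names and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RegionStats:
    total_users: int
    affected_users: int


def blast_radius(stats: Dict[str, RegionStats]) -> dict:
    """Summarise impact as an overall percentage plus the regions hit."""
    total = sum(s.total_users for s in stats.values())
    affected = sum(s.affected_users for s in stats.values())
    regions = sorted(r for r, s in stats.items() if s.affected_users > 0)
    pct = 100.0 * affected / total if total else 0.0
    return {"affected_pct": round(pct, 1), "affected_regions": regions}


if __name__ == "__main__":
    # Hypothetical numbers: one region half broken, the other healthy.
    summary = blast_radius({
        "us-east": RegionStats(total_users=1000, affected_users=500),
        "eu-west": RegionStats(total_users=1000, affected_users=0),
    })
    print(summary)
```

A summary like this is also handy to paste into status updates, since it answers the questions stakeholders ask first.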
Ensure you communicate clearly and frequently to keep all stakeholders informed of the steps being taken to resolve the incident and the estimated time of resolution.
Don’t hesitate to escalate based on the impact. Escalations are of two types:
- When a member of another team is unable to provide a fix within the SLA, ask them to engage additional members from their team.
- When you are unable to fix an issue that lies within your team’s responsibility, ask other team members for help.
Don’t fiddle with critical systems you don’t fully understand, for example production databases, payment systems, etc. One wrong move can lead to irrecoverable data loss, financial loss, etc.
Always prioritise quick mitigation of impact over deep root cause analysis. This can include turning off feature flags, shifting traffic to another region, traffic shedding, etc.
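To make traffic shedding concrete, here is a small sketch of one possible scheme: requests carry a priority, and a single "shed level" knob decides which priorities are still served. The priority names and the `shed_level` knob are assumptions for illustration, not a description of any particular load balancer or framework.

```python
# Hypothetical priority classes, ordered from least to most important.
PRIORITY_RANK = {"low": 0, "normal": 1, "critical": 2}


def should_serve(priority: str, shed_level: int) -> bool:
    """Decide whether a request is served at the current shed level.

    shed_level 0 serves everything, 1 drops "low" traffic,
    2 drops "low" and "normal" traffic and keeps only "critical".
    """
    return PRIORITY_RANK[priority] >= shed_level


if __name__ == "__main__":
    for level in (0, 1, 2):
        served = [p for p in PRIORITY_RANK if should_serve(p, level)]
        print(level, served)
```

The point of a knob like this is that it can be turned during an incident without a deploy, which buys time for the root cause analysis to happen afterwards.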
Try to stay calm during the incident. This is easier said than done. It is completely normal to feel tense if you are taking on-call for the first time or handling a new system. Some things only improve with experience, but you can focus on actions that help you stay calm. Identify what’s making you nervous during on-call and come up with a plan to address those factors. For example, remove any distractions affecting you while you work toward mitigating the incident. This can be as simple as asking a teammate to handle communication on the voice bridge or chat while you focus on mitigation.
After Outage
Immediately after the outage is resolved, briefly document what went wrong, the root cause and the steps taken to fix it. You can fill in the rest of the details later as time permits. These documents are extremely valuable for postmortems, for improving systems and as a knowledge base for current and future teammates.
Read postmortems of all kinds: old and new, from your team and from other teams or companies. I used to enjoy reading postmortem documents from older incidents in my team, which helped improve my understanding of the system. Reading postmortems from other teams and companies helps us look at our own systems with a fresh perspective and identify ways to improve their reliability.
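A fixed skeleton makes it easier to write this note while the details are fresh. The sketch below is one possible shape, assuming a plain-text note is enough for the first pass; the `IncidentNote` type and its field names are my own invention, and the example incident is made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentNote:
    title: str
    what_went_wrong: str
    root_cause: str = "TBD"  # fine to leave open right after the outage
    fix_steps: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the note as plain text with numbered fix steps."""
        lines = [
            f"Incident: {self.title}",
            f"What went wrong: {self.what_went_wrong}",
            f"Root cause: {self.root_cause}",
            "Steps taken to fix:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.fix_steps, start=1)]
        return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical incident, for illustration only.
    note = IncidentNote(
        title="Elevated API errors",
        what_went_wrong="Requests to the REST API layer were failing",
        fix_steps=["Turned off the new feature flag", "Confirmed error rate recovered"],
    )
    print(note.render())
```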
An on-call engineer should strive to improve the reliability of the system, not just troubleshoot and fix issues.
This post will be updated as and when I think of something new.