A beginner’s guide to tackling oncall responsibilities like a pro
Taking oncall responsibilities is one of the best ways to fully understand a system. While it can be stressful at times, it teaches you a lot about building reliable software systems. I’ve taken oncall responsibilities for most of my career. Here, I have listed some tips that I have found useful while handling oncall and while training others to take oncall.

Before Outage

- Be aware of all the touchpoints through which your clients interact with your system, for example REST API layers, HDFS data, etc.
- Know every component of the systems you handle, and periodically ensure that you have the required access to each of them.
- Maintain a runbook to refer to, even for trivial fixes. Engineers often forget how to perform trivial fixes if they haven’t done them for a long time.
- Maintain links to all system-related dashboards, graphs, log analysis tools, etc. in a single place so you don’t go hunting for them during an incident.

...
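The tip about periodically checking access to every component can be partly automated. Below is a minimal sketch, assuming each component exposes a TCP endpoint you can connect to. The `COMPONENTS` inventory, its host names, and its ports are hypothetical placeholders, and a successful TCP connection only shows that the network path is open; it does not prove your credentials still work.

```python
import socket

# Hypothetical inventory: component names, hosts, and ports are placeholders.
COMPONENTS = {
    "rest-api": ("api.example.internal", 443),
    "hdfs-namenode": ("namenode.example.internal", 8020),
}


def check_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def unreachable_components(components, timeout=3.0):
    """Return the sorted names of components that could not be reached."""
    return sorted(
        name
        for name, (host, port) in components.items()
        if not check_reachable(host, port, timeout)
    )
```

Calling `unreachable_components(COMPONENTS)` from a scheduled job and alerting on a non-empty result is one way to find lost access before an incident rather than during one.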