Fires and alerts management
The primary responsibility of firefighting is to handle pager alerts and to work on unscheduled urgent issues that arise during a sprint and could not have been planned in advance.
When an alert reaches the Opsgenie pager, sent by email from either a server or a client, it follows the escalation path below. The pager notifies the people in each level in turn: either one of the people being paged takes the alert, or it escalates to the next level after a few minutes:
- Firefighters: people on rotation during the current sprint.
- Normal Working Hours: any other member of any of the non-support cells
- Firefighting managers
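The escalation path above can be pictured as a walk through an ordered list of levels. The following is only a minimal sketch of that idea, not a description of how Opsgenie is implemented or configured; the `Level` class, the `first_responder` function and the people's names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Level:
    """One step of the escalation path (hypothetical model)."""
    name: str
    people: list


def first_responder(levels, takes_alert):
    """Walk the levels in order and return (level name, person) for the
    first person who takes the alert, or None if nobody does."""
    for level in levels:
        for person in level.people:
            if takes_alert(person):
                return level.name, person
        # Nobody in this level took the alert: it escalates to the next one.
    return None


# Hypothetical roster following the three levels listed above.
path = [
    Level("Firefighters", ["ff1", "ff2"]),
    Level("Normal Working Hours", ["dev_a", "dev_b"]),
    Level("Firefighting managers", ["manager_1", "manager_2"]),
]

# Example: both firefighters miss the page, and dev_b picks it up.
result = first_responder(path, lambda person: person == "dev_b")
print(result)
```

In this model, an alert only reaches the firefighting managers when nobody in the first two levels takes it.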
By default, generalist and project cells have two firefighters for each sprint, designated as Firefighter 1 and 2, and they usually share the same set of responsibilities.
How to assign firefighting hours within the cell is up to each cell, as long as the total number of hours allocated for a sprint matches the firefighting needs of the cell and the distribution of firefighting hours between cell members allows for (near) round-the-clock coverage.
The general recommendation is to designate about 7.5% of a cell's total capacity to firefighting. For example, if the cell's total capacity equals 10 FTs (i.e., 10 people working 40h/week), it would dedicate 30h per sprint to firefighting, and split these hours between 2-3 firefighters.
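The arithmetic behind this example can be checked with a short sketch. Note that the figures in the example correspond to applying the 7.5% share to the cell's weekly capacity (10 × 40h = 400h); the function names and the even split between firefighters are assumptions made for illustration, since each cell decides its own distribution.

```python
def firefighting_hours(people, weekly_hours=40, share_percent=7.5):
    """Hours to reserve for firefighting, following the example above:
    the share is applied to the cell's weekly capacity."""
    capacity = people * weekly_hours
    return round(capacity * share_percent / 100, 2)


def split_evenly(total_hours, firefighters):
    """Hypothetical even split; cells may distribute hours differently."""
    return round(total_hours / firefighters, 2)


total = firefighting_hours(10)
for count in (2, 3):
    print(f"{count} firefighters: {split_evenly(total, count)}h each of {total}h")
```

With 10 full-time members this gives 30h, that is 15h each for two firefighters or 10h each for three.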
To allow each cell to make adjustments based on changing circumstances such as cell size and average volume of firefighting work, they can define their own firefighting regimen, which must respect the general constraints listed below and be documented under Cells > Cell-Specific Rules.
Custom firefighting rules for individual cells are always subject to the following constraints:
- They must be reviewed and approved unanimously by all members of the affected cell as well as the CEO during the cell's inception, and at any time thereafter when a change is proposed.
- Changes to cell-specific firefighting rules must be in line with OpenCraft's general process for making decisions, i.e., they must be submitted as a merge request to this handbook.
- The default OpenCraft client-facing SLA of 24 hours always applies. It can only be replaced by a stricter SLA. Firefighting regimens should be built with that in mind.
Project cell-specific constraints
In cases where firefighting help from a generalist cell becomes necessary, time should be logged in such a way as to use the project cell's budget.
Incidents management documentation
If you are handling an incident and don't know where to start, have a look at:
- General technical documentation to deal with incidents
- How to fix the mail server when it’s blocked by OVH
- Troubleshooting issues with the load balancers
- Incident triage guidelines
For anyone in the escalation queue
For anyone in the pager escalation queue, or handling alerts:
- I will keep at least 1h for each sprint to be available for handling fires
- I will add myself to the pager rotation, making sure that:
- My on-call times cover at least my working hours.
- (Optional) If I want to help cover any additional hours, I will adjust my pager schedule accordingly.
- My on-call times include each day I will be working over the course of the entire sprint.
- I will make sure that the pager is always able to interrupt me while I am on call, or working.
- 💡 For example, you should keep your phone within earshot, or on your body if it is set to vibrate. Further measures for minimizing the likelihood of missing pager alerts will depend on your notification settings.
- I will snooze pager alerts rather than acknowledge them.
- 💡 Unlike acknowledged alerts, snoozed alerts will start sending notifications again after the snooze period ends, making them less likely to be forgotten and left without proper resolution.
- I will handle alerts that escalate to me via the pager or are reported by other team members or clients, with Braden arbitrating priorities.
- I will help triage and prioritize these issues, asking other team members for advice as needed, and escalating to Braden/Xavier if all else fails.
- If a client misuses firstname.lastname@example.org for requests that turn out to be non-urgent, I will politely remind them that this e-mail address is reserved for critical and major incidents as defined in OpenCraft's SLA.
- I will inform clients about critical issues immediately, or reply to them if they reported the issues themselves.
- I will remove email@example.com from the list of recipients when replying to clients, to avoid triggering repeated pages for the same issue.
- I will update clients on the current status of relevant investigations regularly, and look out for additional messages (containing follow-up questions and/or information) from them.
- I will ask clients for final confirmation that all issues have been fixed before closing any alerts.
- If I am the first team member to act on a given pager alert, I will assign the alert to myself and monitor for escalation (or at least make sure that another team member does the same).
- I will not record any time on my firefighting ticket for the current sprint; instead I will always use a dedicated ticket (or appropriate epic) for logging my time.
- I will be subscribed to the
- 💡 Be sure to filter messages from these lists to a separate folder, so that you only look at them when you need to, but also make sure that any such emails that explicitly include you in To/CC *will* arrive in your inbox.
- I will follow the triage guidelines when handling incoming emergencies.
- I will reserve the required number of firefighting hours for my cell for fulfilling the duties listed below, and will proactively pursue those duties.
Before the sprint
- If available, I will pick up some CAT-2 tickets [1] and add them to the upcoming sprint, with a total estimated effort that roughly matches my firefighting hours.
💡 This will help me make additional room for firefighting more easily during the sprint in case the amount of firefighting work that comes up exceeds what I can handle within the number of hours allocated to my firefighting ticket.
- To limit the sustainability impact of this practice, I will generally prefer CAT-2 tickets from client epics over CAT-2 tickets from internal epics.
- I understand that in some situations there might not be enough CAT-2 work available to allow all firefighters to do this, and/or the amount of CAT-1 work that needs to be completed in the upcoming sprint might be too large to fit in one or more CAT-2 tickets (without jeopardizing deadlines of important projects).
- I will assign tickets for the rest of my committed hours to myself as I normally would, but I will also make sure that a few additional tickets are left either in Stretch Goals or in the following sprint.
💡 This will keep me from running out of work in case there isn't enough firefighting work to fill my hours.
- I will assign these tickets to myself and find a reviewer for them so that they are ready to pull into the sprint if necessary.
- If I am Firefighter 1 (FF1), I will check the OpenCraft calendar for the time and day of the next Social Chat for my cell and make any necessary adjustments. I'll also post a topic for the social chat on the forum.
- If I am going to be on vacation during a sprint where I am scheduled to be on firefighter rotation:
- I will make sure to have a backup who will step in while I am away. Even if the timezone is already covered, there is safety in numbers: it makes it more likely that an alert will be picked up early, or that a firefighter will be available at that time.
- I will make sure backups are clearly indicated on the rotations calendar, so we all know quickly who to contact during alert escalations.
- To also help with coverage, I will try to get a backup from the timezone with the least coverage during that sprint – though again if that’s not possible, redundancy will still be helpful.
1: CAT-2 tickets generally don't have an end-of-sprint deadline, so they can usually be swapped out of the sprint without a lot of discussion and/or coordination with other cell members.
During the sprint
- I will work on the following, listed by decreasing priority:
- Handling emergencies from other team members or from clients, as described above.
- Handling critical bugs reported by QA teams.
- Deploying hotfixes and security patches on client instances.
- Following up on issues affecting periodic build instances as necessary (when prompted by the team member responsible for Community Liaison).
- Providing reviews for tasks from the current sprint that are missing a reviewer.
- Completing any personal spillover from the previous sprint.
- Working on client requests that can't wait until the next sprint, in particular in the first week of a sprint.
- Ensuring a clean sprint by helping other team members with their tasks, in particular in the second week of a sprint.
- Additional tasks from Stretch Goals or the following sprint that I lined up for myself before the sprint.
- I will only pull these tickets into the current sprint if I am confident that I will have time to finish them in addition to the firefighting.
- I will document incidents as they happen in the DevOps review document.
- If necessary, I will swap some CAT-2 tickets out of my sprint to make additional room for addressing fires.
For firefighting managers
Firefighting managers are mainly coordination roles, ensuring that firefighting is handled properly, and dealing with alerts that come up outside of the hours covered by firefighters or escalate. Their goal is generally to find someone to handle the alert rather than actually firefighting it, with the only exception being when nobody could be found to handle a truly critical alert.
There are two firefighting managers for the whole team, across all cells. They are in timezones far apart, so that the hours during which escalations must be managed can be split between them and mostly fall within each manager's day.
Other responsibilities include:
- Keeping an eye on the #Hosting channel on Mattermost
- Checking that the people who should be on the pager escalation path for any given sprint have added their hours to Opsgenie
- Keeping an eye on the list of Opsgenie alerts, ensuring that every alert is being addressed
- Making sure that clients are notified and kept informed of the incident response
- Ensuring that a ticket and/or a follow-up task is created for each issue, and that it has been fully investigated
- Checking that an entry in the DevOps review document is created for each incident, and posting a summary to the OpenCraft Ops Review forum thread after each sprint.
- Being the point of contact for firefighters, and providing long-term knowledge and consistency about firefighting to the team (the firefighter role rotates very frequently, making it hard to have a long-term vision while addressing issues)
Rough estimation per sprint:
- 1h on the first Tuesday to check the FF rotations on Opsgenie and do the other checks that recur every sprint
- 5-20 minutes every day to check and address issues with pages
- 1h to reply to pings, answer tickets and nudge firefighters as needed
Total: 5h recurring task
The firefighting managers (or their backups) log their time normally on individual tickets, like firefighters. The exception is time spent outside of their work hours dealing with an escalation that reaches their level at the top of the queue: this time is logged twice, so that it is paid 2x. It is logged once in the normal task (to ensure it is also billed when appropriate), and a second time in a specific internal "sprint manager escalation" task, which is also used to track how much firefighting managers are being paged, and the corresponding budget. This should be rare, as the rest of the work on the role, such as prior coordination or checking the Opsgenie roster every sprint, can be done asynchronously during normal work hours.
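As an illustration of this double-logging rule, here is a minimal sketch. It does not use any real time-tracking API: the `worklog_entries` function and the ticket key `CLIENT-123` are hypothetical, and only the "sprint manager escalation" task name comes from the text above.

```python
# Internal task named in the text above.
ESCALATION_TASK = "sprint manager escalation"


def worklog_entries(ticket, hours, escalation_outside_work_hours):
    """Return the (task, hours) entries a firefighting manager would log."""
    # Time is always logged on the normal task, so it can be billed.
    entries = [(ticket, hours)]
    if escalation_outside_work_hours:
        # Logged a second time, so that the time ends up being paid 2x.
        entries.append((ESCALATION_TASK, hours))
    return entries


# Hypothetical example: 1.5h spent on an escalation at night.
night = worklog_entries("CLIENT-123", 1.5, escalation_outside_work_hours=True)
# Same work done during normal hours is logged once.
day = worklog_entries("CLIENT-123", 1.5, escalation_outside_work_hours=False)
print(night)
print(day)
```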
Procedure to handle escalations
When an alert escalates to a firefighting manager, the general procedure to deal with the alert is the following:
- Go to the devops channel, mention that the page has escalated, asking if anyone online is available to look at it, and pinging explicitly the firefighters who might have missed the page on their phone
- If nobody answers immediately, check if the page needs urgent attention, or if it's something that can wait for the firefighters to come back online:
- if it can wait until then, snooze the alert in Opsgenie until the time when the next firefighter rotation starts
- if it can't wait, snooze for 5, 10 or 30 minutes depending on the alert urgency, to give time for someone to see either the chat pings or the pager alert
- If the alert re-escalates after the snooze, repeat the previous step, but this time also ping @here or @channel depending on the issue's importance and urgency, to widen the number of people being pinged
- If there are still no answers in the chat, the alert re-escalates, and it is an important and urgent issue, then it's time to either start solving the problem personally, or use the contacts spreadsheet to find the right person to call on the phone to help.
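The procedure above can be summarized as a small decision function. This is a hedged sketch rather than an official tool: the function name, its parameters and the wording of the returned actions are hypothetical, it assumes that nobody has answered yet, and it assumes that a third escalation of a non-critical alert simply repeats the widened ping.

```python
def escalation_actions(times_escalated, can_wait, important_and_urgent):
    """Actions for a firefighting manager when nobody has answered yet.

    times_escalated: how often the alert has reached the manager (1, 2, ...).
    can_wait: True if the alert can wait for the next firefighter rotation.
    important_and_urgent: True for issues that cannot be left unattended.
    """
    if times_escalated >= 3 and important_and_urgent:
        # Last resort: handle it personally or call someone.
        return ["solve the problem personally or phone someone "
                "from the contacts spreadsheet"]

    ping = "ping the firefighters in the devops channel"
    if times_escalated >= 2:
        # Widen the audience on re-escalation.
        ping += " and ping @here or @channel"

    if can_wait:
        snooze = "snooze until the next firefighter rotation starts"
    else:
        snooze = "snooze for 5, 10 or 30 minutes depending on urgency"
    return [ping, snooze]


for step in escalation_actions(2, can_wait=False, important_and_urgent=True):
    print(step)
```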
The process of creating JIRA tickets for firefighters and assigning the appropriate number of hours to them is automated: SprintCraft takes care of it based on information from the Weekly Rotations Schedule.
This document supports configuring the number of firefighters as well as the number of hours to be allocated to each firefighting ticket separately for each cell, which means that the automation as a whole can be easily adapted to match cell-specific firefighting regimens. Detailed information on how the automation works and the required format of the data in the Weekly Rotations Schedule can be found in the Completing sprints section of the SprintCraft documentation.