
Firefighting

By default, generalist and project cells have two firefighters for each sprint, designated as Firefighter 1 and 2, and they usually share the same set of responsibilities.

How to assign firefighting hours within the cell is up to each cell, as long as the total number of hours allocated for a sprint matches the firefighting needs of the cell and the distribution of firefighting hours between cell members allows for (near) round-the-clock coverage.

The general recommendation is to designate about 7.5% of a cell's total capacity to firefighting. For example, if the cell's total capacity equals 10 FTs (i.e., 10 people working 40h/week), it would dedicate 30h per sprint to firefighting, and split these hours between 2-3 firefighters.
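
The arithmetic behind this recommendation can be sketched as follows. This is only an illustration, not part of OpenCraft's tooling: the function names and the even split between firefighters are assumptions, and the example uses the 400h (10 people × 40h) from the text above as the capacity basis.

```python
def firefighting_hours(capacity_hours: float, share: float = 0.075) -> float:
    """Hours to reserve for firefighting (hypothetical helper)."""
    return capacity_hours * share


def split_evenly(total_hours: float, firefighters: int) -> list[float]:
    """Even split, for illustration; cells may distribute hours differently."""
    return [total_hours / firefighters] * firefighters


capacity = 10 * 40  # 10 people working 40h/week, as in the example above
total = firefighting_hours(capacity)
print(total)                   # 30.0
print(split_evenly(total, 2))  # [15.0, 15.0]
print(split_evenly(total, 3))  # [10.0, 10.0, 10.0]
```

A cell with a different size or firefighting volume would pass its own capacity and share, in line with its cell-specific rules.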

To allow for adjustments based on changing circumstances such as cell size and average volume of firefighting work, each cell can define its own firefighting regimen. That regimen must respect the general constraints listed below and be documented under Cells > Cell-Specific Rules.

General constraints

Custom firefighting rules for individual cells are always subject to the following constraints:

  • They must be reviewed and approved by all members of the affected cell as well as the CEO, both during the cell's inception and whenever a change is proposed thereafter.
  • Changes to cell-specific firefighting rules must be in line with OpenCraft's general process for making decisions, i.e., they must be submitted as a merge request to this handbook.
  • The default OpenCraft client-facing SLA of 24 hours always applies. It can only be replaced by a stricter SLA. Firefighting regimens should be built with that in mind.

Project cell-specific constraints

In cases where firefighting help from a generalist cell becomes necessary, time should be logged in such a way as to use the project cell's budget.

Triage guidelines

General guidelines

To keep the sustainability impact of firefighting incidents in check, firefighters should distinguish between urgent and non-urgent issues:

  • Urgent issues are those that fall into the critical and major categories of incidents as defined by OpenCraft's SLA. They should be handled immediately.
  • Non-urgent issues are those that fall into the minor category of incidents as defined by OpenCraft's SLA. They should be handled asynchronously, within the default response time of 24h.

💡 The Service Level Agreement (SLA) describes the contractual terms we generally apply to handle emergencies.

In addition to categorizing issues as urgent/non-urgent based on the criteria mentioned above, firefighters should use the following guidelines when triaging specific types of issues:

  • Is the issue urgent? For example, are one or more instances down? Is there risk of them going down any moment?
    • If so, whichever FF encounters it first should begin working on it immediately, irrespective of the cell that they belong to and any budget or sustainability concerns.
  • Is the issue less urgent? For example, is there a pending certificate rotation in a few days?
    • If so, pass the issue to the DevOps cell.
    • If the DevOps cell does not have capacity, it should adjust task priorities for the current sprint and delay non-urgent work as necessary to be able to address the issue in the current sprint.
    • Non-DevOps cells are generally not expected to handle non-urgent infrastructure-related issues. However, a non-DevOps cell can decide to handle a non-urgent issue if the following applies:
      • The cell is in a sustainable state.
      • The issue needs to be handled before the end of the current sprint.
      • The DevOps cell is unable to make the necessary adjustments to fit the issue into the current sprint.

Note: Cells that are currently not sustainable should generally refrain from handling anything but the most urgent infrastructure-related issues.
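
The decision flow for infrastructure-related issues described above can be summarized in a small sketch. All names here (`InfraIssue`, `triage_infra_issue`, the flag names and the returned strings) are hypothetical and exist only to make the rules explicit; the function does not replace a firefighter's judgment.

```python
from dataclasses import dataclass


@dataclass
class InfraIssue:
    urgent: bool                   # e.g. instances are down or about to go down
    due_this_sprint: bool = False  # must be handled before the sprint ends


def triage_infra_issue(issue, *, in_devops_cell, cell_sustainable, devops_can_fit):
    """Return the suggested action for the firefighter's cell (illustrative)."""
    if issue.urgent:
        # Whichever FF sees it first starts immediately, regardless of
        # cell, budget, or sustainability concerns.
        return "handle immediately"
    if in_devops_cell:
        # DevOps reprioritizes the sprint rather than postponing the issue.
        return "handle in current sprint, delaying non-urgent work if needed"
    if cell_sustainable and issue.due_this_sprint and not devops_can_fit:
        # All three conditions must hold for a non-DevOps cell to step in.
        return "may handle in own cell"
    return "pass to DevOps cell"


rotation = InfraIssue(urgent=False, due_this_sprint=True)
print(triage_infra_issue(rotation, in_devops_cell=False,
                         cell_sustainable=False, devops_can_fit=False))
# pass to DevOps cell
```

Note that an unsustainable cell falls through to "pass to DevOps cell" for anything that is not urgent, which matches the note above.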

Client-related issues will often be covered by client budgets, so sustainability is not necessarily a concern. However, it still makes sense to evaluate these issues based on the following criteria, and proceed as appropriate:

  • Is the issue urgent? For example, is the client's instance down, or does the client need to immediately scale their instance, or do they have some other urgent request?
    • It should be handled by whichever FF encounters it first, irrespective of the cell that they belong to. (This may be harder or impossible in the case of more specialised clients like Yonkers or LX.)
  • Is the issue less urgent? For example:
    • Did the client request a minor change, fix, etc. that needs to be completed in the same sprint that it was raised?
      • It should be completed by an FF from the appropriate cell.
    • Did the client request a minor change, fix, etc. that does not need to be completed in the same sprint that it was raised?
      • An FF from the appropriate cell should create a ticket for addressing the issue and ping the client owner on it.
      • The client owner should adjust planning for upcoming sprints and schedule the ticket as appropriate.
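
The three outcomes for client issues listed above can be written down as a short sketch. The function name, parameters, and step wording are illustrative assumptions, not an existing tool.

```python
def triage_client_issue(urgent: bool, needed_this_sprint: bool = False) -> list[str]:
    """Return the follow-up steps for a client issue (illustrative only)."""
    if urgent:
        return ["first FF to encounter the issue handles it, whatever their cell"]
    if needed_this_sprint:
        return ["FF from the appropriate cell completes it this sprint"]
    return [
        "FF from the appropriate cell creates a ticket and pings the client owner",
        "client owner schedules the ticket in an upcoming sprint",
    ]


for step in triage_client_issue(urgent=False, needed_this_sprint=False):
    print("-", step)
```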

Budget checks

Before handling minor incidents and support requests, firefighters should make sure that the work will be covered by client-approved budgets.

Client owners should generally be able to answer relevant questions about existing budgets that would be appropriate to use for the pending work, so firefighters should turn to them for help and information.

If there is no existing budget that would be appropriate to use, firefighters should come up with a rough estimate for the number of hours required to complete the work (or at least a first investigation), and get on-the-spot approval from the client for spending that time. Note that:

  • Approval must be available in written form.
  • Depending on how things develop, additional budget might need to be requested more than once over the course of a sprint.

Upstream issues

For example: One of OpenCraft's XBlocks needs a fix for an issue that is currently breaking edx-platform:master.

These types of issues are rare, but upstream generally expects us to be accountable for parts of the Open edX code base that we maintain, in particular for any code that we merge into upstream repositories. (With greater control over the project comes greater responsibility 🕴)

So when these types of issues do occur, core contributors from our team should step in and fix them -- irrespective of whether they are on firefighting duty for the current sprint or not.

Also, the time spent working on this type of issue should be counted as core contributor hours. Client maintenance budget(s) should be employed, whenever possible, to fund those hours.

Guidelines for security issues

Security issues include vulnerabilities, breaches, and leaks of any type of non-public information. They may affect specific clients, OpenCraft's internal infrastructure, or both.

In general, critical/high-severity security issues should be treated like urgent client/infrastructure-related issues: Whichever FF encounters a given issue first should begin working on it immediately, irrespective of the cell that they belong to and any budget or sustainability concerns.

For detailed instructions on how to deal with security issues, refer to the relevant sections of our security policy.

Dependency upgrades addressing security issues

GitHub scans repository dependencies and sends security alerts based on severity levels of vulnerabilities that need fixing. (By default, all public repositories are scanned; for private repositories, admins have to explicitly permit the scanning.)

These types of security alerts should be triaged as follows:

  • Critical/High level severity: As there could be a risk of compromise or significant downtime for users, these vulnerabilities must be patched as soon as possible. The firefighters should create the necessary ticket(s) for applying the patch(es) and start working on them right away.
  • Medium level severity: Firefighters should report these vulnerabilities, making sure that the team is aware of them, and create tickets for patching them. These tickets can be scheduled for the next sprint. However, if any firefighters from the current sprint have time, they can work ahead on these tickets (unless doing so would involve neglecting the sustainability-related considerations listed above).
  • Low level severity: Firefighters should report these vulnerabilities and create tickets for patching them with a lower priority. These tickets can be scheduled for a future sprint and prioritized by owners of their parent epics as appropriate.
  • Undefined or unclear severity: Vulnerabilities or security fixes whose severity is unknown or unclear should be reported and discussed with the team before taking any action.
  • Dependabot PRs: Dependabot PRs for security fixes and/or bumping versions of vulnerable dependencies should be handled by the firefighters based on the process for various severity levels described above.

Note that the severity level mentioned in a vulnerability report may not always match our own assessment. For example, a vulnerability categorized as critical may not be deemed critical by us if it affects a disabled feature.

Firefighters should take this into account when following up on security alerts, and discuss with other members of the team to resolve any doubts they might have.
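
As a compact summary of the severity rules above, the mapping could be sketched like this. The names (`ACTIONS`, `triage_alert`, `assessed_severity`) are hypothetical; the optional second argument models the team's own assessment overriding the severity stated in the report.

```python
from typing import Optional

PATCH_NOW = "create ticket(s) and start patching right away"

# Action per severity level, following the list above.
ACTIONS = {
    "critical": PATCH_NOW,
    "high": PATCH_NOW,
    "medium": "report to the team, create tickets, schedule for the next sprint",
    "low": "report, create lower-priority tickets for a future sprint",
}

# Anything else counts as undefined or unclear severity.
DISCUSS_FIRST = "report and discuss with the team before taking any action"


def triage_alert(reported_severity: Optional[str],
                 assessed_severity: Optional[str] = None) -> str:
    """Pick an action, preferring the team's own assessment when there is one."""
    severity = (assessed_severity or reported_severity or "").lower()
    return ACTIONS.get(severity, DISCUSS_FIRST)


print(triage_alert("critical"))
# create ticket(s) and start patching right away
print(triage_alert("critical", assessed_severity="low"))
# report, create lower-priority tickets for a future sprint
```

The second call corresponds to the example above of a vulnerability reported as critical that only affects a disabled feature.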

Guidelines for false positives

If a false positive triggers the pager, firefighters should treat it as a proper issue.

This means that when a false positive occurs (whether it comes from OpenCraft's internal infrastructure or from client-specific infrastructure), the resulting ticket should be scoped to fix whatever triggered it, in order to keep the percentage of false positives low.

This is important because if the percentage of false positives increases over time, it can lead to alarm fatigue and make the team less effective at identifying and properly addressing real issues.

Tooling

The process of creating JIRA tickets for firefighting and assigning the appropriate number of hours to them is automated: SprintCraft takes care of it based on information from the Weekly Rotations Schedule.

The Weekly Rotations Schedule supports configuring the number of firefighters, as well as the number of hours to be allocated to each firefighting ticket, separately for each cell. This means that the automation as a whole can be easily adapted to match cell-specific firefighting regimens. Detailed information on how the automation works and the required format of the data in the Weekly Rotations Schedule can be found in the Completing sprints section of the SprintCraft documentation.


Last update: 2022-08-05