Firefighting Issues - Triage Guidelines

General guidelines

To keep the sustainability impact of firefighting incidents in check, firefighters should distinguish between urgent and non-urgent issues:

Urgent issues are those that fall into the critical and major categories of incidents as defined by OpenCraft's SLA. They should be handled immediately.
Non-urgent issues are those that fall into the minor category of incidents as defined by OpenCraft's SLA. They should be handled asynchronously, within the default response time of 24h.

💡 The Service Level Agreement (SLA) describes the contractual terms we generally apply to handle emergencies.
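
The urgent/non-urgent split described above can be summarised as a small lookup. The following is a minimal sketch only: `Category`, `Handling`, and `triage_urgency` are hypothetical names introduced for illustration, and the authoritative definitions of the incident categories remain in the SLA.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Category(Enum):
    """SLA incident categories named in this section."""
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"


@dataclass(frozen=True)
class Handling:
    urgent: bool
    # None means "handle immediately"; otherwise the asynchronous response window.
    respond_within: Optional[timedelta]


def triage_urgency(category: Category) -> Handling:
    """Map an SLA category to the handling described above (illustrative only)."""
    if category in (Category.CRITICAL, Category.MAJOR):
        return Handling(urgent=True, respond_within=None)
    # Minor incidents are handled asynchronously, within the default 24h.
    return Handling(urgent=False, respond_within=timedelta(hours=24))
```

For example, `triage_urgency(Category.MINOR)` yields a non-urgent handling with a 24-hour response window.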

In addition to categorizing issues as urgent/non-urgent based on the criteria mentioned above, firefighters should use the following guidelines when triaging specific types of issues:

Is the issue urgent? For example, are one or more instances down? Is there a risk of them going down at any moment?
- If so, whichever firefighter (FF) encounters it first should begin working on it immediately, irrespective of the cell that they belong to and any budget or sustainability concerns.
Is the issue less urgent? For example, is there a pending certificate rotation in a few days?
- If so, pass the issue to the DevOps cell.
- If the DevOps cell does not have capacity, it should adjust task priorities for the current sprint and delay non-urgent work as necessary to be able to address the issue in the current sprint.
- Non-DevOps cells are generally not expected to handle non-urgent infrastructure-related issues. However, a non-DevOps cell can decide to handle a non-urgent issue if the following applies:
  - The cell is in a sustainable state.
  - The issue needs to be handled before the end of the current sprint.
  - The DevOps cell is unable to make the necessary adjustments to fit the issue into the current sprint.

Note: Cells that are currently not sustainable should generally refrain from handling anything but the most urgent infrastructure-related issues.
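
The infrastructure triage questions above amount to a short decision procedure. Here is a hedged sketch of it; the function and enum names are hypothetical, and it assumes that all three listed conditions must hold before a non-DevOps cell may take on a non-urgent issue (the guidelines do not state this explicitly).

```python
from enum import Enum


class Handler(Enum):
    FIRST_FIREFIGHTER = "whichever firefighter encounters the issue first"
    DEVOPS_CELL = "DevOps cell"
    OWN_CELL_MAY_HANDLE = "the non-DevOps cell may decide to handle it"


def route_infrastructure_issue(
    urgent: bool,
    devops_can_fit_in_sprint: bool = True,
    own_cell_sustainable: bool = False,
    due_this_sprint: bool = False,
) -> Handler:
    """Suggest who handles an infrastructure issue (illustrative only)."""
    if urgent:
        # Urgent issues ignore cell boundaries, budget, and sustainability.
        return Handler.FIRST_FIREFIGHTER
    # Assumption: all three conditions are required together.
    if own_cell_sustainable and due_this_sprint and not devops_can_fit_in_sprint:
        return Handler.OWN_CELL_MAY_HANDLE
    # Default: pass the issue to the DevOps cell, which reprioritises as needed.
    return Handler.DEVOPS_CELL
```

Note that an unsustainable cell always falls through to the DevOps cell for non-urgent issues, which matches the note above.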

For client-specific issues, client budgets will often cover the work, so sustainability is not necessarily a concern. However, it still makes sense to evaluate these issues based on the following criteria, and proceed as appropriate:

Is the issue urgent? For example, is the client's instance down, or does the client need to immediately scale their instance, or do they have some other urgent request?
- It should be handled by whichever FF encounters it first, irrespective of the cell that they belong to. (This may be harder or even impossible for more specialised clients such as Yonkers or LX.)
Is the issue less urgent? For example:
- Did the client request a minor change, fix, etc. that needs to be completed in the same sprint in which it was raised?
  - It should be completed by an FF from the appropriate cell.
- Did the client request a minor change, fix, etc. that does not need to be completed in the same sprint in which it was raised?
  - An FF from the appropriate cell should create a ticket for addressing the issue and ping the client owner on it.
  - The client owner should adjust planning for upcoming sprints and schedule the ticket as appropriate.
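
The client-issue criteria above can likewise be written down as a small function that returns the next steps. This is a sketch under stated assumptions: `client_issue_next_steps` and its two boolean inputs are hypothetical, and the returned strings simply paraphrase the guidelines.

```python
def client_issue_next_steps(urgent: bool, due_this_sprint: bool) -> list[str]:
    """Return the next steps for a client issue (illustrative only)."""
    if urgent:
        return ["First FF to encounter the issue handles it, irrespective of cell."]
    if due_this_sprint:
        return ["An FF from the appropriate cell completes the work this sprint."]
    # Non-urgent and not due this sprint: ticket now, scheduling later.
    return [
        "An FF from the appropriate cell creates a ticket and pings the client owner.",
        "The client owner schedules the ticket in an upcoming sprint.",
    ]
```

The `due_this_sprint` flag is ignored for urgent issues, since those are handled immediately regardless of planning.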

Budget checks

Before handling minor incidents and support requests, firefighters should make sure that the work will be covered by client-approved budgets.

Client owners should generally be able to answer relevant questions about existing budgets that would be appropriate to use for the pending work, so firefighters should turn to them for help and information.

If there is no existing budget that would be appropriate to use, firefighters should come up with a rough estimate for the number of hours required to complete the work (or at least a first investigation), and get on-the-spot approval from the client for spending that time. Note that:

Approval must be available in written form.
Depending on how things develop, additional budget might need to be requested more than once over the course of a sprint.
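
As a rough illustration of the budget check described above, the sketch below decides whether work on a minor incident or support request may start. The `BudgetCheck` record and its field names are hypothetical; in practice the information comes from the client owner and from the client's written approval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BudgetCheck:
    # Name of an existing, appropriate client-approved budget, if there is one.
    existing_budget: Optional[str] = None
    # Rough estimate (in hours) sent to the client when no budget exists.
    estimate_hours: Optional[float] = None
    # True only if the client's approval is available in written form.
    written_approval: bool = False


def may_start_work(check: BudgetCheck) -> bool:
    """Return True if the work is covered by a client-approved budget."""
    if check.existing_budget is not None:
        return True
    # No existing budget: an estimate plus written on-the-spot approval is needed.
    return check.estimate_hours is not None and check.written_approval
```

Because additional budget may need to be requested more than once per sprint, a check like this would be repeated each time the approved hours run out.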

Upstream issues

For example: One of OpenCraft's XBlocks needs a fix for an issue that is currently breaking edx-platform:master.

These types of issues are rare, but upstream generally expects us to be accountable for parts of the Open edX code base that we maintain, in particular for any code that we merge into upstream repositories. (With greater control over the project comes greater responsibility 🕴)

So when these types of issues do occur, core contributors from our team should step in and fix them -- irrespective of whether they are on firefighting duty for the current sprint or not.

Also, the time spent working on this type of issue should be counted as core contributor hours. Client maintenance budget(s) should be employed, whenever possible, to fund those hours. (Note that the two are not mutually exclusive: this type of upstream work can always count as core contributor time from a community commitment standpoint, but the budget that funds the time on our side can come either from client budgets or from an internal unbilled "Contributions" budget.)

Guidelines for security issues

Security issues include vulnerabilities, breaches, and leaks of any type of non-public information. They may affect specific clients, OpenCraft's internal infrastructure, or both.

In general, critical/high-severity security issues should be treated like urgent client/infrastructure-related issues: Whichever FF encounters a given issue first should begin working on it immediately, irrespective of the cell that they belong to and any budget or sustainability concerns.

For detailed instructions on how to deal with security issues, refer to the following sections of our security policy:

Reporting security vulnerabilities
Reporting and responding to security breaches

Dependency upgrades addressing security issues

GitHub scans repository dependencies and sends security alerts based on severity levels of vulnerabilities that need fixing. (By default, all public repositories are scanned; for private repositories, admins have to explicitly permit the scanning.)

These types of security alerts should be triaged as follows:

Critical/High level severity: As there could be a risk of compromise or significant downtime for users, these vulnerabilities must be patched as soon as possible. The firefighters should create the necessary ticket(s) for applying the patch(es) and start working on them right away.
Medium level severity: Firefighters should report these vulnerabilities, making sure that the team is aware of them, and create tickets for patching them. These tickets can be scheduled for the next sprint. However, if any firefighters from the current sprint have time left towards the end of the sprint, they can work ahead on these tickets (unless doing so would mean neglecting sustainability-related considerations).
Low level severity: Firefighters should report these vulnerabilities and create tickets for patching them with a lower priority. These tickets can be scheduled for a future sprint and prioritized by owners of their parent epics as appropriate.
Undefined or unclear severity: Vulnerabilities or security fixes whose severity is unknown or unclear should be reported and discussed with the team before taking any action.
Dependabot PRs: Dependabot PRs for security fixes and/or bumping versions of vulnerable dependencies should be handled by the firefighters based on the process for various severity levels described above.
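
The severity-based triage above maps naturally onto a lookup table. The following sketch assumes the alert's severity arrives as a plain string such as "critical" or "low"; that input format, and the names `Action` and `triage_alert`, are assumptions made for illustration and do not describe a specific GitHub API.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PATCH_NOW = "create ticket(s) and start patching right away"
    NEXT_SPRINT = "report, create tickets, schedule for the next sprint"
    FUTURE_SPRINT = "report, create lower-priority tickets for a future sprint"
    DISCUSS_FIRST = "report and discuss with the team before acting"


# Hypothetical severity labels, matched case-insensitively.
_ACTIONS = {
    "critical": Action.PATCH_NOW,
    "high": Action.PATCH_NOW,
    "medium": Action.NEXT_SPRINT,
    "low": Action.FUTURE_SPRINT,
}


def triage_alert(severity: Optional[str]) -> Action:
    """Map a reported severity to the triage action listed above."""
    key = (severity or "").strip().lower()
    # Missing or unrecognised severities are discussed before any action.
    return _ACTIONS.get(key, Action.DISCUSS_FIRST)
```

Falling back to `DISCUSS_FIRST` for anything unrecognised mirrors the rule for undefined or unclear severity, so an unexpected label never silently lowers the priority of an alert.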

Note that:

The severity level mentioned in a vulnerability report may not always match our own assessment. For example, a vulnerability categorized as critical may not be deemed critical by us if it affects a disabled feature.
- Firefighters should take this into account when following up on security alerts, and discuss with other members of the team to resolve any doubts they might have.
Just like other issues related to OpenCraft's internal infrastructure, dependency upgrades for internal infrastructure components must be handled by the DevOps cell.
- Tickets for applying these upgrades should therefore always be created in the Serenity (SE) project.
- Dependency upgrades for critical/high-severity issues are the only exception: If firefighters from other cells become aware of them while all members of the DevOps cell are out-of-office, they should start working on applying them immediately (as mentioned above).
  - Tickets for logging time spent applying these upgrades should still be created in the Serenity (SE) project, and members of the DevOps cell should take over any work that remains by the time they are back to work.

Guidelines for false positives

If a false positive triggers the pager, firefighters should treat it as a proper issue.

This means that when a false positive occurs (whether it comes from OpenCraft's internal infrastructure or from client-specific infrastructure), the scope of the resulting ticket is to fix whatever caused it, in order to keep the percentage of false positives low.

This is important because if the percentage of false positives increases over time, it can lead to alarm fatigue and make the team less effective at identifying and properly addressing real issues.
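
To track whether the share of false positives is growing, the percentage can be computed from a simple record of pager alerts. This is a minimal sketch: the function name and the input format (one boolean per alert) are assumptions, and no particular threshold is implied by these guidelines.

```python
def false_positive_percentage(alerts: list[bool]) -> float:
    """Return the percentage of pager alerts that were false positives.

    `alerts` holds one entry per pager alert: True for a false positive,
    False for a real issue.
    """
    if not alerts:
        # No alerts recorded, so there is nothing to report.
        return 0.0
    return 100.0 * sum(alerts) / len(alerts)
```

For instance, one false positive among four alerts gives `false_positive_percentage([True, False, False, False]) == 25.0`.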

False positives coming from OpenCraft's internal infrastructure should be handled by the DevOps cell (and corresponding tickets created in the Serenity (SE) project).

Last update: 2024-01-11