On-Call Process :: Discuss/Sign up until Feb 22

There is a proposal for an on-call process for Status.

Please:

(1) Read the process proposal;
(2) Leave your comments/questions by Feb 22;
(3) If you want to participate, either leave a comment here or ping me at igor[at]status.im or PM me on Status.

FAQ
q1. We will get rid of the cluster eventually; does it make sense to monitor it?
a1. Yes, I think so. While we have it, let’s run it properly. Also, being on-call for security incidents will be useful even without a cluster.


On-call process

To keep our infrastructure always accessible and to keep our service
and organization running, we need to make sure that there is always
someone monitoring it and ready to take action if an incident happens.

1. What do we monitor?

1.1. Status Cluster

The first important part of the infrastructure is the Status cluster.
It hosts our bootnodes and mailservers. Bootnodes are necessary
for the Status client to connect to other P2P clients.
Mailservers store offline messages.

1.1.1. Handling incidents

If an incident happens, one should try to use our tools to resolve it.
If that is not possible, the issue should be escalated.

After the incident is mitigated, a post-mortem should be written within the week,
explaining what happened, what the causes were, and how we can avoid it
in the future.

1.1.1.1. Monitoring tools

+---------------------------+------------------------------------------+
| Name                      | URL                                      |
+---------------------------+------------------------------------------+
| Dynamic Hosted Inventory  | https://consul.status.im/                |
| Monitoring Dashboards     | https://grafana.status.im/               |
| Log Aggregation Dashboard | https://kibana.status.im/                |
| Canaries Dashboard        | https://canary.status.im/                |
| Metrics Alarming          | https://prometheus.status.im/alerts      |
| Alerts Management         | https://alerts.status.im/                |
| ElasticSearch UI          | https://es.status.im/                    |
| PagerDuty Dashboard       | https://statusim.pagerduty.com/incidents |
+---------------------------+------------------------------------------+
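
As an illustration of how these tools can be queried programmatically, the
sketch below lists the alerts currently firing on the Metrics Alarming
endpoint. It assumes the standard Prometheus HTTP API is exposed at that URL
and that no extra authentication is needed; adjust to the real setup.

    # Minimal sketch: list alerts currently firing on the Metrics Alarming endpoint.
    # Assumes the standard Prometheus HTTP API and no additional authentication.
    import json
    import urllib.request

    PROMETHEUS_ALERTS = "https://prometheus.status.im/api/v1/alerts"

    def firing_alerts():
        """Return the alerts Prometheus currently reports as firing."""
        with urllib.request.urlopen(PROMETHEUS_ALERTS) as resp:
            data = json.load(resp)
        return [a for a in data["data"]["alerts"] if a.get("state") == "firing"]

    for alert in firing_alerts():
        print(alert["labels"].get("alertname"), "-", alert["annotations"].get("summary", ""))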

1.1.1.2. Runbooks
The starting point for all runbooks is the
infra-docs repo.

If you want to start diagnosing issues with hosts or services,
the runbooks linked there are the place to start.
A list of known issues is usually maintained within those runbooks.

1.2. Status high-value contracts

We also monitor the high-value smart contracts that Status holds, as well as
the multisignature addresses that control them.
These contracts/addresses hold all of Status’ ETH holdings from the ICO,
control of the SNT token itself, and various dApp functionality
(ENS names, Tribute to Talk, etc.).
The other watched addresses are the various Status Finance addresses
required to manage and disburse employee funds and initiatives.

1.2.1. How are they watched?

Currently, all of the related addresses are added to an etherscan.io watch list.
When they are used, an email is sent to the [email protected]
Google Group with the details of the attempted transaction.

Eventually, we will replace this system with a function-level alerting platform,
which is being developed.
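
Until that platform exists, an automated check could, for example, poll the
public Etherscan account API for recent transactions on each watched address.
The sketch below is only an illustration under that assumption; the address
and API key are placeholders, not the real watch list.

    # Minimal sketch: poll Etherscan for recent transactions on watched addresses.
    # Assumes the public Etherscan "account/txlist" API; the address and API key
    # below are placeholders, not the real watch list.
    import json
    import urllib.parse
    import urllib.request

    ETHERSCAN_API = "https://api.etherscan.io/api"
    API_KEY = "<etherscan-api-key>"  # placeholder
    WATCHED = {
        "0x0000000000000000000000000000000000000000": "example multisig",  # placeholder
    }

    def recent_transactions(address, start_block=0):
        """Return the transaction list Etherscan reports for an address."""
        params = urllib.parse.urlencode({
            "module": "account",
            "action": "txlist",
            "address": address,
            "startblock": start_block,
            "sort": "desc",
            "apikey": API_KEY,
        })
        with urllib.request.urlopen(f"{ETHERSCAN_API}?{params}") as resp:
            body = json.load(resp)
        return body.get("result", [])

    for addr, label in WATCHED.items():
        txs = recent_transactions(addr)
        if txs:
            print(f"{label} ({addr}): latest tx {txs[0]['hash']}")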

1.2.2. Handling incidents

The process of evaluating incidents is not set in stone yet.
Because alerting on Etherscan is binary (an email is either sent for a transaction or not),
you have to manually check each email that comes into [email protected]
and evaluate whether it reflects “normal behavior” or “abnormal behavior”
for that particular address.

Corey will share the credentials via Lastpass to log into the Etherscan account.
Each address is labeled with its purpose in the “watch list” section of the account,
which helps establish what normal behavior is.
Corey will also create documents that specify the escalation process
if something is deemed to be “abnormal” for a given address,
as it differs depending on the context of the contract/address.

https://notes.status.im/m6BQO64dTNKcKoUR_Ge8FA

1.3. HackerOne validated reports

We are in contract with HackerOne for a long-lasting bug-bounty program.
We define the scope and payouts for bounties, and they invite people in their
hacker community to try to break anything within the scope and submit reports on it.
Our level of service includes a triage team from HackerOne.
That means they check for disclosure quality, verify every bug,
make sure it is within scope, and estimate its criticality before handing it off
to us to check and mitigate.

When our private campaign starts with HackerOne (H1), we will start receiving validated bug reports from the HackerOne triage team.
The H1 scope is the current releases for mobile and desktop.
These will be webhooked into PagerDuty (PD) or VictorOps (VO).
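
As a rough illustration of what such a webhook could produce on our side, the
sketch below opens a PagerDuty incident for a triaged report. It assumes the
standard PagerDuty Events API v2; the routing key and report details are
hypothetical placeholders, and the real integration would be configured in
H1/PD rather than hand-rolled.

    # Minimal sketch: open a PagerDuty incident for a validated HackerOne report.
    # Assumes the standard PagerDuty Events API v2; the routing key and report
    # details are hypothetical placeholders.
    import json
    import urllib.request

    PD_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
    ROUTING_KEY = "<pagerduty-integration-routing-key>"  # placeholder

    def trigger_pd_incident(report_title, report_url, severity="warning"):
        """Send a "trigger" event so the on-call person gets paged."""
        event = {
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": f"H1 report: {report_title}",
                "source": "hackerone",
                "severity": severity,  # critical, error, warning or info
            },
            "links": [{"href": report_url, "text": "HackerOne report"}],
        }
        req = urllib.request.Request(
            PD_EVENTS_URL,
            data=json.dumps(event).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(resp.status, resp.read().decode())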

1.3.1. Handling incidents

Monitor PD/VO for reports; when one is received, within 24 hours:

  • Bring up the ticket in the “Status.im Team” section of the H1 inbox
  • Explicit details of the H1 ticket process can be found in the
    “HackerOne Status.im Team Inbox Management” doc on CodiMD
  • Create an associated issue with the appropriate swarm in Status

2. What does it mean to be on-call?

2.1. A laptop with all credentials and necessary access, and a mobile phone, should always be within reach;
2.2. The PagerDuty app should be installed on the phone, and the phone number should be registered in PagerDuty.

3. Compensation structure

There are two types of payments that are made for being on-call:

3.1. Being on-call compensation

A flat fee per week of on-call assignment, paid regardless of whether there were any incidents. The payment is $200 in SNT and is paid out cumulatively as part of the quarterly SNT bonus.

3.2. Incident resolution compensation

Compensation is paid per hour of incident resolution. The rate is $20 per hour in SNT, paid out cumulatively as part of the quarterly SNT bonus. It requires the incident to be resolved and a post-mortem to be written.

NB! Time spent writing the post-mortem does not count towards incident resolution time.
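
For clarity, a minimal sketch of how a quarterly payout would be tallied under
the rates above (USD values, paid out in SNT); the example figures are
hypothetical:

    # Minimal sketch: tally the quarterly on-call payout from the rates above.
    # USD amounts, paid out in SNT; the example inputs are hypothetical.
    WEEKLY_ON_CALL_FEE = 200   # flat fee per week of on-call assignment
    INCIDENT_HOURLY_RATE = 20  # per hour of incident resolution

    def quarterly_payout(weeks_on_call, incident_hours):
        """Total USD value of the quarterly SNT bonus for on-call work."""
        return weeks_on_call * WEEKLY_ON_CALL_FEE + incident_hours * INCIDENT_HOURLY_RATE

    # Example: 3 weeks on-call and 5 hours of incident resolution in a quarter
    print(quarterly_payout(weeks_on_call=3, incident_hours=5))  # 700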
