Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure
Baseline Availability Mechanisms:
- Plan capacity for projected demand and tolerate failures of key components through redundant hardware, cabling, and even power.
- Primary-backup replication for logically centralized control-plane components (see the sketch after this list).
- An offline approval process for deciding traffic priorities across applications.
- Regression-test control-plane changes before rollout. Periodically exercise disaster scenarios. Write post-mortem reports.
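To make the primary-backup bullet concrete, here is a minimal sketch of how a backup controller might take over when the primary stops heartbeating. Everything here (the `Controller` class, the interval and timeout values) is my own illustration, not something the paper specifies.

```python
# Hypothetical primary-backup failover for a logically centralized controller.
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between primary heartbeats (assumed)
FAILOVER_TIMEOUT = 3.0     # backup promotes itself after this much silence

class Controller:
    def __init__(self, name, is_primary):
        self.name = name
        self.is_primary = is_primary
        self.last_heartbeat = time.monotonic()

    def receive_heartbeat(self):
        # Backup records the time of the primary's latest heartbeat.
        self.last_heartbeat = time.monotonic()

    def maybe_promote(self):
        # If the primary has been silent for too long, the backup takes over.
        silence = time.monotonic() - self.last_heartbeat
        if not self.is_primary and silence > FAILOVER_TIMEOUT:
            self.is_primary = True
            print(f"{self.name}: promoting self to primary "
                  f"after {silence:.1f}s without a heartbeat")

backup = Controller("backup-controller", is_primary=False)
backup.receive_heartbeat()      # heartbeat arrives from the primary
backup.maybe_promote()          # within the timeout: nothing happens
backup.last_heartbeat -= 10     # simulate a long silence from the primary
backup.maybe_promote()          # now the backup promotes itself
```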
I found a few things about their post-mortem practice interesting:
- Post-mortem reports are blame-free and non-judgemental. I'm not sure whether they actually enforce this rule (e.g., keeping failures out of peer performance reviews), but I think it's a very healthy attitude toward failure, because you can't learn lessons unless you are honest.
- They confirm the root cause by reproducing the failure. From a scientific perspective, I think this is very good, but I wonder whether these reproductions affect their real business.
- A failure is documented in a post-mortem only if it's new. I wonder how they know whether a failure is new. Is it based on the memory of the availability team members? That seems manageable for 100 reports, but as reports accumulate it gets harder to remember every previous one; a simple signature-based lookup like the one sketched below is one way I imagine it could be automated.
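On the question of recognizing a repeated failure, here is a rough sketch of the kind of fingerprinting I have in mind. The fields and the hashing scheme are purely my guess, not anything the paper describes.

```python
# Hypothetical deduplication: fingerprint a failure by a few structured fields
# so a new incident can be checked against previously filed post-mortems.
import hashlib

def failure_signature(component: str, trigger: str, root_cause: str) -> str:
    """Hash normalized fields into a short signature for lookup."""
    normalized = "|".join(s.strip().lower() for s in (component, trigger, root_cause))
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

# Signatures of failures that already have post-mortems (made-up examples).
seen = {failure_signature("b4 control plane", "config push", "race on route withdrawal")}

candidate = failure_signature("B4 Control Plane", "config push", "race on route withdrawal")
if candidate in seen:
    print("likely a repeat of an earlier post-mortem")
else:
    print("looks new; file a post-mortem")
```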
In the last paragraph of Section 3.2, the paper mentions that draining traffic for every upgrade could affect their serving capacity. I get the point, but I'm curious what their overprovisioning ratio is.
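For my own intuition about that question, a back-of-the-envelope calculation: if traffic is spread evenly across sites and one site is drained for an upgrade, the survivors must absorb its share. The numbers below are made up; the paper does not give the actual ratio.

```python
# If one of n_sites is drained, the remaining sites must carry the drained
# site's share of peak load, so per-site capacity needs a multiplier of at
# least n_sites / (n_sites - drained). Illustrative only.
def required_overprovision(n_sites: int, drained: int = 1) -> float:
    """Minimum capacity multiplier so the remaining sites can carry peak load."""
    return n_sites / (n_sites - drained)

for n in (2, 4, 10):
    print(f"{n} sites, 1 drained -> overprovision >= {required_overprovision(n):.2f}x")
```

So with only a handful of sites the headroom needed is substantial (2x for two sites), while with many sites it shrinks toward a few percent, which is presumably why draining is tolerable at Google's scale.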