Today we performed a real-life DR test.
Disaster scenario for 22-01
- All production indexers eaten by an Algoazaur 🦖
- Hot-standby machines (replicated postgres) taken over by AlgoMonkeys 🐒🐒🐒
- Offsite daily backup available, but the data is one week old (backup monitoring had failed as well)
- No spare bare metal available for restore (who keeps that nowadays?)
- Archival algod service is not affected 😅
Procedure
Task completion times are rounded up to the nearest 5 minutes.
| Task | Minutes |
|---|---|
| Rent a compatible bare-metal machine from Hetzner | 20 |
| Provision a standard Ubuntu 20.04 server | 5 |
| Deploy the standardized AlgoNode docker-compose stack (indexer + PostgreSQL 14) | 5 |
| Download and multicore-decompress the daily snapshot (300 GB → 900 GB) | 55 |
| Indexer catchup (5 days; progress check sketched below) | 15 |
| Cloudflare reconfiguration and tests | 10 |
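For the catchup step, progress can be tracked by polling the indexer's REST API. Below is a minimal sketch: it assumes the indexer's `/health` endpoint reports the latest indexed round as a top-level `round` field (true for Indexer v2, but verify against your version), and the host/port are placeholder values, not a description of our actual tooling.

```python
#!/usr/bin/env python3
"""Rough catchup-progress monitor for an Algorand indexer.

A minimal sketch, assuming /health exposes the latest indexed round as a
top-level "round" field (Indexer v2 behavior) on the default port 8980.
Adjust INDEXER_HEALTH to match your own deployment.
"""
import json
import time
import urllib.request

INDEXER_HEALTH = "http://localhost:8980/health"  # placeholder, not AlgoNode's setup
POLL_SECONDS = 60

def current_round() -> int:
    # Read the latest indexed round from the health endpoint.
    with urllib.request.urlopen(INDEXER_HEALTH, timeout=10) as resp:
        return int(json.load(resp)["round"])

def main() -> None:
    last = current_round()
    while True:
        time.sleep(POLL_SECONDS)
        now = current_round()
        rate = (now - last) / POLL_SECONDS  # indexed rounds per second
        print(f"round={now} rate={rate:.1f} rounds/s")
        last = now

if __name__ == "__main__":
    main()
```

Left running next to the indexer, this shows roughly how fast the backlog is shrinking; once the reported round stops climbing rapidly, the indexer is near the chain tip and traffic can be switched back.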
Conclusions:
- RTO: < 2 hrs (the steps above total 110 minutes); RPO: 0 minutes, i.e. no data loss, since the missing rounds are re-indexed straight from the chain (the beauty of blockchain)
- The DR process would benefit from a 10 Gbps uplink; the snapshot download/decompress was by far the longest step
- We got lucky renting a new, compatible server on the first attempt
- Renting a server from the Hetzner auction would be less risky (and less costly)
We had only two indexers back then, so this exercise had its value.
Now we operate multiple regional clusters of indexers :)