Today we performed a real-life DR test.
Disaster scenario for 22-01
- All production indexers eaten by an Algoazaur 🦖
- Hot-standby machines (replicated postgres) taken over by AlgoMonkeys 🐒🐒🐒
- Offsite daily backup available but data is one week old (monitoring failed as well)
- No spare baremetal available for restore (who does that nowadays?)
- Archival algod service is not affected 😅
Task completion times, each rounded up to the nearest 5 minutes:
|Task|Time (minutes)|
|---|---|
|Rent compatible baremetal machine from Hetzner|20|
|Provision standard Ubuntu 20.04 server|5|
|Deploy standardized AlgoNode docker-compose with indexer and PG14|5|
|Download and multicore decompress the daily snapshot (300GB → 900GB)|55|
|Indexer catchup (5 days)|15|
|Cloudflare reconfiguration, tests|10|
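
As a quick sanity check, the step times above can be totaled to see the overall recovery time and which step dominates. This is just a small sketch: the durations are copied from the table, while the dictionary and variable names are made up for illustration.

```python
# Step durations in minutes, copied from the table above.
# The dictionary itself is illustrative, not part of any AlgoNode tooling.
steps = {
    "Rent compatible baremetal machine from Hetzner": 20,
    "Provision standard Ubuntu 20.04 server": 5,
    "Deploy AlgoNode docker-compose with indexer and PG14": 5,
    "Download and decompress the daily snapshot": 55,
    "Indexer catchup": 15,
    "Cloudflare reconfiguration, tests": 10,
}

total_minutes = sum(steps.values())
slowest_step = max(steps, key=steps.get)
share = steps[slowest_step] / total_minutes

print(f"Total recovery time: {total_minutes} min")
print(f"Slowest step: {slowest_step} ({share:.0%} of total)")
```

The total comes to 110 minutes, with the snapshot download and decompression taking half of it.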
- RTO: <2 hrs, RPO: 0 minutes (no data loss, the beauty of blockchain)
- DR Process would benefit from 10Gbps uplink
- Got lucky with renting new compatible server
- Renting an auctioned server would be less risky (and less costly)
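
To get a feel for the uplink remark, here is a rough back-of-the-envelope estimate of how long moving the 300GB compressed snapshot would take at different link speeds. It assumes decimal gigabytes and full line-rate throughput. The 1 Gbps baseline is an assumption for comparison, not a figure from the test, and decompression time is ignored.

```python
def transfer_minutes(size_gb: float, link_gbps: float) -> float:
    """Minutes to move size_gb (decimal gigabytes) over a link running at link_gbps."""
    gigabits = size_gb * 8          # bytes -> bits
    seconds = gigabits / link_gbps  # ideal line rate, no protocol overhead
    return seconds / 60

SNAPSHOT_GB = 300  # compressed snapshot size from the table

# 1 Gbps is an assumed baseline; 10 Gbps is the uplink suggested above.
for gbps in (1, 10):
    print(f"{gbps:>2} Gbps: {transfer_minutes(SNAPSHOT_GB, gbps):.0f} min")
```

Under these idealized assumptions the download alone drops from about 40 minutes to about 4, so decompression would then become the step to optimize.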
We had only 2 indexers way back then, so this exercise had its value.
Now we operate multiple regional clusters of indexers :)