Indexer Disaster Recovery Test

DR test.

BY Skan / ON Jan 12, 2022

Today we performed a real-life DR test.

Disaster scenario for 22-01

  • All production indexers eaten by an Algoazaur 🦖
  • Hot-standby machines (replicated postgres) taken over by AlgoMonkeys 🐒🐒🐒
  • Offsite daily backup available but data is one week old (monitoring failed as well)
  • No spare baremetal available for restore (who does that nowadays?)
  • Archival algod service is not affected 😅

Procedure

Task completion times are rounded up to the nearest 5 minutes.

Name                                                                    Minutes
Rent compatible baremetal machine from Hetzner                          20
Provision standard Ubuntu 20.04 server                                  5
Deploy standardized AlgoNode docker-compose with indexer and PG14       5
Download and multicore decompress the daily snapshot (300GB → 900GB)    55
Indexer catchup (5 days)                                                15
Cloudflare reconfiguration, tests                                       10
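
As a quick sanity check, the step times from the table above can be added up and compared with a recovery time objective. This is only an illustrative sketch: the 120-minute target and the variable names are assumptions for the example, not part of our tooling.

```python
# Step durations in minutes, copied from the procedure table above.
steps = {
    "Rent compatible baremetal machine from Hetzner": 20,
    "Provision standard Ubuntu 20.04 server": 5,
    "Deploy standardized AlgoNode docker-compose with indexer and PG14": 5,
    "Download and multicore decompress the daily snapshot": 55,
    "Indexer catchup": 15,
    "Cloudflare reconfiguration, tests": 10,
}

# Assumed target for illustration: recover in under 2 hours.
RTO_TARGET_MINUTES = 120

total_minutes = sum(steps.values())
slowest_step = max(steps, key=steps.get)

print(f"Total recovery time: {total_minutes} min")
print(f"Within target: {total_minutes < RTO_TARGET_MINUTES}")
print(f"Slowest step: {slowest_step} ({steps[slowest_step]} min)")
```

The snapshot download and decompression step takes half of the total, which is why it is the obvious place to look for savings.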

Conclusions:

  • RTO: under 2 hours, RPO: 0 minutes (no data loss - the beauty of blockchain)
  • The DR process would benefit from a 10Gbps uplink
  • We got lucky renting a new compatible server
  • Renting an auctioned server would be less risky (and less costly)
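
To put the uplink point in perspective, here is a rough back-of-the-envelope estimate of how long it takes to move the 300GB compressed snapshot at different link speeds. It is a sketch under stated assumptions: the 1 Gbps baseline is assumed for comparison (the link speed used in the test is not stated here), the helper name `transfer_minutes` is hypothetical, and the estimate ignores protocol overhead, storage throughput and decompression time.

```python
def transfer_minutes(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time in minutes for size_gb gigabytes over a link_gbps link.

    Uses decimal units (1 GB = 8 gigabits) and assumes the link is fully
    saturated, so real-world times will be longer.
    """
    size_gigabits = size_gb * 8
    seconds = size_gigabits / link_gbps
    return seconds / 60


SNAPSHOT_GB = 300  # compressed snapshot size from the procedure table

# 1 Gbps is an assumed baseline; 10 Gbps is the uplink suggested above.
for gbps in (1, 10):
    print(f"{gbps} Gbps: ~{transfer_minutes(SNAPSHOT_GB, gbps):.0f} min")
```

Under these idealized assumptions the download alone drops from tens of minutes to a few minutes, at which point decompression is likely to become the limiting factor.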

We had only 2 indexers back then, so this exercise had its value.
Now we operate multiple regional clusters of indexers :)
