We’ve come a long way in the last year in the way we operate our sites. We’ve stabilized our applications, improved their response time, and increased their availability.
To accomplish these improvements we’ve performed a series of database maintenance operations, ranging from hardware upgrades, to new database servers, to configuration changes that required a restart. In each of these operations we had one common goal: minimize the interruption to our customers.
Today we are releasing a small script that has made our lives, and our customers’ lives, a whole lot better. We use this script to change the roles of our databases from replication masters to slaves, and vice versa. The gain comes from the script performing every step a human used to perform, only more quickly and more reliably.
Without this script we used to spend minutes accomplishing these maintenance tasks. With the script we’ve swapped databases under production load with no user-noticeable interruption!
The script has lots of hard-coded paths, users, and other assumptions. But this is too good to keep to ourselves. We’re sharing it with you in the hope that it will improve your operations experience, and that you will contribute back changes that make it even better.
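For readers who have not done a role swap by hand, the sequence usually looks something like the sketch below. This is not the released script. It is a minimal Python illustration that only builds the ordered list of standard MySQL statements for a planned master/slave swap and connects to nothing. The function names, host names, and binary log coordinates are hypothetical, and replication credentials and the VIP move are left out.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (host the statement runs on, SQL statement)


def cutover_steps(old_master: str, new_master: str) -> List[Step]:
    """Ordered statements that move write traffic to the new master."""
    return [
        # Stop accepting ordinary writes on the current master.
        (old_master, "SET GLOBAL read_only = ON"),
        # Read the final binlog position on the old master ...
        (old_master, "SHOW MASTER STATUS"),
        # ... and confirm the slave has applied everything up to it.
        (new_master, "SHOW SLAVE STATUS"),
        # Detach the new master from replication.
        (new_master, "STOP SLAVE"),
        # Record the coordinates the old master will replicate from.
        (new_master, "SHOW MASTER STATUS"),
        # Open the new master for writes.
        (new_master, "SET GLOBAL read_only = OFF"),
    ]


def repoint_steps(old_master: str, new_master: str,
                  log_file: str, log_pos: int) -> List[Step]:
    """Statements that turn the old master into a slave of the new one.

    log_file and log_pos come from SHOW MASTER STATUS on the new master.
    MASTER_USER and MASTER_PASSWORD are omitted for brevity.
    """
    change = (
        f"CHANGE MASTER TO MASTER_HOST='{new_master}', "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos}"
    )
    return [(old_master, change), (old_master, "START SLAVE")]


if __name__ == "__main__":
    plan = cutover_steps("db1", "db2")
    plan += repoint_steps("db1", "db2", "mysql-bin.000042", 107)
    for host, sql in plan:
        print(f"{host}: {sql}")
```

Keeping the plan as plain data makes it easy to review or dry-run before anything touches a production database. The window during which writes can fail is roughly the time between the first and last statement of `cutover_steps`.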
Nick
on 30 Nov 12
Why MIT as the license as opposed to BSD?
Davy
on 30 Nov 12
I love seeing posts like this. Could you explain how your setup works as far as the proxy? When it is paused and restarted, what is occurring behind the scenes and what are you using for a proxy?
Mikael
on 30 Nov 12
You may have left one of your DB passwords in the README file on GitHub; check database_one.slave_password.
Taylor
on 30 Nov 12
@Davy No proxy. We just point the apps at the VIP. When we run the script we just restart the app / reset DB connections.
@Mikael No security issue there. Made them the same.
Joseph
on 30 Nov 12
@Taylor I was wondering how you prevent dropped writes. Is that even possible? What’s “the damage” when running this? A minute or two of errors? What’s the /tmp/hold part doing?
Taylor
on 01 Dec 12
@Joseph We usually just accept that a few writes will fail. In most cases that’s a number between 1 and 10 (usually closer to 1). The /tmp/hold part is for another blog post ;)
Davy
on 01 Dec 12
@Taylor Looking forward to the post about /tmp/hold. Hope you won’t keep us waiting too long :)
Aaron
on 01 Dec 12
Are you using XtraDB Cluster? Or just “standard” master-master replication? Or something else?
Valerie Parham-Thompson
on 03 Dec 12
I wonder if you used or considered MMM failover and found it didn’t work in your environment.
“When we run the script we just restart the app / reset DB connections.” Do you mean you manually restart the app? The use of vIPs should make that seamless.
Joseph
on 03 Dec 12
@Taylor I think dropping writes for a short time span is completely acceptable. Looking forward to the /tmp/hold story! I’m currently looking at products like Tungsten to help accomplish the same.