A common obstacle we face when releasing new features is making production schema changes in MySQL. Many new features require additional columns or indexes. Running an “ALTER TABLE” in MySQL to add the needed columns and indexes locks the table, hanging the application. We need a better solution.
Option 1: Schema Change in Downtime
This is the simplest option. Put the application into downtime and perform the schema change. It requires us to have the application down for the duration of the “ALTER TABLE”. We’ve successfully used this option for smaller tables that can be altered in seconds or minutes. However, for large tables the alter can take hours, making this option less than desirable.
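For illustration, a downtime change of the kind described above might look like the following sketch (the table and column names here are made up):

ALTER TABLE todos
  ADD COLUMN completed_at DATETIME NULL,
  ADD INDEX index_todos_on_completed_at (completed_at);

On a small table this finishes in seconds; on a table with hundreds of millions of rows the same statement rebuilds the entire table and can run for hours while writes are blocked.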
Option 2: Role Swap
This is the option that we have used in the past to perform schema changes on large tables. There are quite a few steps which make this option error-prone. When the changes are made, a short downtime to swap the application between the master and replica is required.
Here’s a sample of the process we follow to change the roles of the current master “A” and the current replica “B” (a consolidated sketch of the statements appears after the list).
- On B, STOP SLAVE.
- On B, SHOW MASTER STATUS. This requires that binary logging is enabled on B.
- On B, execute schema changes with ALTER TABLE statements.
- On A, CHANGE MASTER TO MASTER_HOST='B', MASTER_USER='...', MASTER_PASSWORD='...', MASTER_LOG_FILE='...', MASTER_LOG_POS=nnn; according to the master status reported by B.
- On B, START SLAVE.
- Let B catch up to A.
- Stop applications writing to A, and watch SHOW PROCESSLIST until activity is totally quiet.
- On A, SET GLOBAL read_only=1;
- On B, STOP SLAVE once you’re sure it has applied all the changes replicated from A.
- On B, SET GLOBAL read_only=0;
- Reconfigure your applications to write to B instead of A, and restart applications.
- START SLAVE on A.
- Finally… have a stiff drink. You need it at this point.
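Consolidated, the statements above look roughly like this. The host names, replication user, and log file/position are placeholders; the real coordinates come from B's SHOW MASTER STATUS output, and the ALTER is just an example.

-- On B: stop replication and record B's own binary log coordinates.
STOP SLAVE;
SHOW MASTER STATUS;   -- binary logging must be enabled on B

-- On B: make the schema change.
ALTER TABLE todos ADD COLUMN completed_at DATETIME NULL;

-- On A: point A at B using the coordinates B reported (placeholder values shown).
CHANGE MASTER TO MASTER_HOST='B', MASTER_USER='repl', MASTER_PASSWORD='...',
  MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=107;

-- On B: resume replicating from A and let it catch up.
START SLAVE;

-- Once application writes to A have stopped and the processlist is quiet:
SET GLOBAL read_only = 1;   -- on A
STOP SLAVE;                 -- on B, after everything from A has been applied
SET GLOBAL read_only = 0;   -- on B; point the applications at B and restart them
START SLAVE;                -- on A, which now replicates from B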
Option 3: pt-online-schema-change
We have recently started using pt-online-schema-change, developed by the folks at Percona, to perform our schema updates without needing to take downtime. During the online schema change the original table is not locked and continues to take reads and writes. In very basic terms, here is what happens when you run pt-online-schema-change (a rough SQL sketch follows the list).
- Create new table with same structure as original.
- Update schema on new table.
- Copy rows in batches from original table.
- Move original table out of the way and replace with new table.
- Drop old table.
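Sketched out in SQL, those steps look roughly like the following; the table and column names are hypothetical, and the real tool also creates triggers on the original table so that writes made during the copy are carried over to the new table.

-- 1. Create a new table with the same structure as the original.
CREATE TABLE _todos_new LIKE todos;

-- 2. Apply the schema change to the new, still-empty table (fast).
ALTER TABLE _todos_new ADD COLUMN completed_at DATETIME NULL;

-- 3. Copy rows across in chunks, naming only the columns the two tables share;
--    the tool repeats this with advancing id ranges.
INSERT LOW_PRIORITY IGNORE INTO _todos_new (id, title, created_at)
  SELECT id, title, created_at FROM todos WHERE id BETWEEN 1 AND 1000;

-- 4. Atomically swap the tables.
RENAME TABLE todos TO _todos_old, _todos_new TO todos;

-- 5. Remove the original.
DROP TABLE _todos_old;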
Copying all rows of a table to the new table creates a lot of data that needs to be replicated to your slave. Fortunately, pt-online-schema-change will monitor the status of your slave and pause the data copy process if replication falls too far behind. We have also found that the data copy process can be disk intensive and will impact MySQL’s performance. Even with pt-online-schema-change we still choose to perform schema updates after hours to limit the impact on our applications. There are a number of settings in pt-online-schema-change which you can tune to maximize performance during the schema change. We also thoroughly test each schema change in staging prior to running it in production.
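As one sketch of what that tuning looks like, here is a hypothetical invocation; the database, table, column, and threshold values are all made up and the option list is not exhaustive (see the Percona Toolkit documentation for the full set):

# Copy in 1000-row chunks, pause if a slave lags more than 1 second,
# and back off when Threads_running climbs above 25.
pt-online-schema-change \
  --alter "ADD COLUMN completed_at DATETIME NULL" \
  --chunk-size 1000 \
  --max-lag 1 \
  --max-load Threads_running=25 \
  D=app,t=todos --execute

The chunking and lag checks are what let the copy run while the application stays up, at the cost of a longer overall runtime.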
Next up: MySQL HA and Failover
Nugget
on 14 May 12
It’s amazing to learn that MySQL still suffers from these sorts of immaturity challenges. Does adding or dropping an index still require the table to be duplicated during the process?
Tim
on 14 May 12
Or, you can use Postgresql which doesn’t have this limitation.
Mathieu Allaire
on 14 May 12
Yeah, postgresql is really awesome. I still don’t get why it’s not on Amazon RDS already…
BillP
on 14 May 12
This is where using a commercial database like SQL Server shines. No issues with online column adds (provided the column is allowed to be NULL/does not need a default value) and no issues with index creation.
In the Enterprise edition (for a bigger fee) you can even reorg/rebuild existing indices online with no system interruption.
Cedric
on 14 May 12
What about having a dual master MySQL setup?
A is the active master, B the standby master.
1. Stop B->A replication (run STOP SLAVE on A), but keep A->B.
2. Perform the ALTER TABLE on B.
3. When done, switch all connections to B (make sure that A->B replication is not lagging before making the switch).
4. START SLAVE on A.
5. Wait for A to catch up with B.
?
Alexis
on 14 May 12
I used the same strategy as option 3. But MySQL was very slow for 15 minutes after the old and new tables were swapped. Do you have this issue too, and if so, how did you solve it?
Tim
on 14 May 12
Anyone
If anyone is interested in a very deep discussion on this topic and how to (painfully) address it, Facebook has a long blog post on exactly this subject – which also includes issues MySQL has with Replication and Cleanup.
Will
on 14 May 12
@Tim: Many of us are big fans of Postgres at 37signals; if we were starting from scratch and didn’t have so much MySQL infrastructure already, we might have chosen Postgres.
@Alexis: Likely due to invalidating a bunch of data in the buffer pool. John ran the change at the weekend during a low-traffic period, minimising the impact of an increase in load. The DB is also stored on Fusion IO, so hitting the permanent store isn’t as much of a problem as with disks and SSDs.
Tim
on 14 May 12
@Will
You just rewrote Basecamp from scratch.
Just say’n.
Gil Hildebrand
on 14 May 12
We use “Multi Master Replication Manager for MySQL” (aka MMM), also developed by the folks at Percona, to automate a lot of the steps in option 2. Assuming you have a multi master setup, the advantage of this approach over option 3 is that the replication thread is not impacted in any way. Here’s what the steps look like:
1) Run ALTER TABLE on all of the slaves individually. Use SET SQL_LOG_BIN=0; just to be safe.
2) mmm_control move_role writer db2 (another way of expressing this is mmm_control set_offline db1)
3) SET SQL_LOG_BIN=0; on db1. Then run ALTER TABLE.
4) mmm_control move_role writer db1. Done!
Will
on 14 May 12
@Tim
We have close to 30 MySQL databases + the supporting infrastructure for things like backing up, restoring, failing over, provisioning etc. There was more to consider than just which database we preferred when starting the new project.
Nu Park
on 14 May 12
@Will
Just throwing good money after the bad. It’ll just be that much more expensive to switch later.
Douglas F Shearer
on 14 May 12
Are you integrating pt-online-schema-change with ActiveRecord’s migrations, or managing schema changes outside of Rails?
lian
on 15 May 12
@Tim: why “Postgresql doesn’t have this limitation”?
John
on 15 May 12
@Douglas: We perform the schema changes outside of AR’s migrations.
Alice Young
on 15 May 12
If you used Java+Hibernate, you would not have any of these problems.
Chad
on 15 May 12
When I worked at OpenFeint, we used the role swap method. It was slightly easier than in the article, because we had master-master replication. Of course, when you’re running a global gaming network that doesn’t really have an ‘off-peak’, no kind of maintenance is ‘easy’.
AC
on 16 May 12
@John
Because Postgres isn’t limited like MySQL and instead supports schema changes without locking the table.
Note that SQL Server and Oracle also support this functionality.
Tim most likely brought this topic up because many people bitch about the hard workarounds they have to do due to MySQL limitations instead of just pitching a vastly superior product like Postgres… which, by the way, is even more open than MySQL given its BSD license (free as in beer).
dstroma
on 18 May 12
The way to take the pain out of MySQL is to switch to PostgreSQL.
@lian, Postgres uses a Multiversion Concurrency Control (MVCC) model to prevent excessive locking while maintaining ACIDity.