A common obstacle we face when releasing new features is making production schema changes in MySQL. Many new features require additional columns or indexes. Running an “ALTER TABLE” in MySQL to add the needed columns and indexes locks the table, hanging the application. We need a better solution.
Option 1: Schema Change in Downtime
This is the simplest option. Put the application into downtime and perform the schema change. It requires us to have the application down for the duration of the “ALTER TABLE”. We’ve successfully used this option for smaller tables that can be altered in seconds or minutes. However, for large tables the alter can take hours, making this option less than desirable.
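For illustration, a downtime change of the kind described above might look like the following sketch (the table and column names here are made up):

ALTER TABLE todos
  ADD COLUMN completed_at DATETIME NULL,
  ADD INDEX index_todos_on_completed_at (completed_at);

On a small table this finishes in seconds; on a table with hundreds of millions of rows the same statement rebuilds the entire table and can run for hours while writes are blocked.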
Option 2: Role Swap
This is the option that we have used in the past to perform schema changes on large tables. There are quite a few steps which make this option error-prone. When the changes are made, a short downtime to swap the application between the master and replica is required.
Here’s a sample of the process we follow to change the roles of the current master “A” and the current replica “B” (a consolidated sketch of the statements appears after the list).
- On B, STOP SLAVE.
- On B, SHOW MASTER STATUS. This requires that binary logging is enabled on B.
- On B, execute schema changes with ALTER TABLE statements.
- On A, CHANGE MASTER TO MASTER_HOST='B', MASTER_USER='...', MASTER_PASSWORD='...', MASTER_LOG_FILE='...', MASTER_LOG_POS=nnn; according to the master status reported by B.
- On B, START SLAVE.
- Let B catch up to A.
- Stop applications writing to A, and watch SHOW PROCESSLIST until activity is totally quiet.
- On A, SET GLOBAL read_only=1;
- On B, STOP SLAVE once you’re sure it has applied all the changes replicated from A.
- On B, SET GLOBAL read_only=0;
- Reconfigure your applications to write to B instead of A, and restart applications.
- START SLAVE on A.
- Finally… have a stiff drink. You need it at this point.
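Consolidated, the statements above look roughly like this. The host names, replication user, and log file/position are placeholders; the real coordinates come from B's SHOW MASTER STATUS output, and the ALTER is just an example.

-- On B: stop replication and record B's own binary log coordinates.
STOP SLAVE;
SHOW MASTER STATUS;   -- binary logging must be enabled on B

-- On B: make the schema change.
ALTER TABLE todos ADD COLUMN completed_at DATETIME NULL;

-- On A: point A at B using the coordinates B reported (placeholder values shown).
CHANGE MASTER TO MASTER_HOST='B', MASTER_USER='repl', MASTER_PASSWORD='...',
  MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=107;

-- On B: resume replicating from A and let it catch up.
START SLAVE;

-- Once application writes to A have stopped and the processlist is quiet:
SET GLOBAL read_only = 1;   -- on A
STOP SLAVE;                 -- on B, after everything from A has been applied
SET GLOBAL read_only = 0;   -- on B; point the applications at B and restart them
START SLAVE;                -- on A, which now replicates from B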
Option 3: pt-online-schema-change
We have recently started using pt-online-schema-change, developed by the folks at Percona, to perform our schema updates without needing to take downtime. During the online schema change the original table is not locked and continues to take reads and writes. In very basic terms, here is what happens when you run pt-online-schema-change (a rough SQL sketch follows the list).
- Create new table with same structure as original.
- Update schema on new table.
- Copy rows in batches from original table.
- Move original table out of the way and replace with new table.
- Drop old table.
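Sketched out in SQL, those steps look roughly like the following; the table and column names are hypothetical, and the real tool also creates triggers on the original table so that writes made during the copy are carried over to the new table.

-- 1. Create a new table with the same structure as the original.
CREATE TABLE _todos_new LIKE todos;

-- 2. Apply the schema change to the new, still-empty table (fast).
ALTER TABLE _todos_new ADD COLUMN completed_at DATETIME NULL;

-- 3. Copy rows across in chunks, naming only the columns the two tables share;
--    the tool repeats this with advancing id ranges.
INSERT LOW_PRIORITY IGNORE INTO _todos_new (id, title, created_at)
  SELECT id, title, created_at FROM todos WHERE id BETWEEN 1 AND 1000;

-- 4. Atomically swap the tables.
RENAME TABLE todos TO _todos_old, _todos_new TO todos;

-- 5. Remove the original.
DROP TABLE _todos_old;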
Copying all rows of a table to the new table creates a lot of data that needs to be replicated to your slave. Fortunately, pt-online-schema-change will monitor the status of your slave and pause the data copy process if replication falls too far behind. We have also found that the data copy process can be disk intensive and will impact MySQL’s performance. Even with pt-online-schema-change we still choose to perform schema updates after hours to limit the impact on our applications. There are a number of settings in pt-online-schema-change which you can tune to maximize performance during the schema change. We also thoroughly test each schema change in staging prior to running it in production.
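As one sketch of what that tuning looks like, here is a hypothetical invocation; the database, table, column, and threshold values are all made up and the option list is not exhaustive (see the Percona Toolkit documentation for the full set):

# Copy in 1000-row chunks, pause if a slave lags more than 1 second,
# and back off when Threads_running climbs above 25.
pt-online-schema-change \
  --alter "ADD COLUMN completed_at DATETIME NULL" \
  --chunk-size 1000 \
  --max-lag 1 \
  --max-load Threads_running=25 \
  D=app,t=todos --execute

The chunking and lag checks are what let the copy run while the application stays up, at the cost of a longer overall runtime.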
Next up: MySQL HA and Failover
Nugget
on 14 May 12
It’s amazing to learn that MySQL still suffers from these sorts of immaturity challenges. Does adding or dropping an index still require the table to be duplicated during the process?
Tim
on 14 May 12
Or, you can use Postgresql which doesn’t have this limitation.
Mathieu Allaire
on 14 May 12
Yeah, postgresql is really awesome. I still don’t get why it’s not on Amazon RDS already…
BillP
on 14 May 12
This is where using a commercial database like SQL Server shines. No issues with online column adds (provided the column is allowed to be NULL/does not need a default value) and no issues with index creation.
In the Enterprise edition (for a bigger fee) you can even reorg/rebuild existing indices online with no system interruption.
Cedric
on 14 May 12
What about having a dual master MySQL setup?
A is the active master, B the standby master.
1. Stop B->A replication (run STOP SLAVE on A), but keep A->B.
2. Perform the ALTER TABLE on B.
3. When done, switch all connections to B (make sure that A->B replication is not lagging before making the switch).
4. START SLAVE on A.
5. Wait for A to catch up with B.
?
Alexis
on 14 May 12
I used the same strategy as option 3. But MySQL was very slow for 15 minutes after the old and new tables were swapped. Do you have this issue too, and if so, how did you solve it?
Tim
on 14 May 12
Anyone
If anyone is interested in a very deep discussion on this topic and how to (painfully) address it, Facebook has a long blog post on exactly this subject – which also includes issues MySQL has with Replication and Cleanup.
Will
on 14 May 12
@Tim: Many of us are big fans of Postgres at 37signals; if we were starting from scratch and didn’t have so much MySQL infrastructure already, we might have chosen Postgres.
@Alexis: Likely due to invalidating a bunch of data in the buffer pool. John ran the change at the weekend during a low-traffic period, minimising the impact of an increase in load. The DB is also stored on Fusion IO, so hitting the permanent store isn’t as much of a problem as with disks and SSDs.
Tim
on 14 May 12
@Will
You just rewrote Basecamp from scratch.
Just say’n.
Gil Hildebrand
on 14 May 12
We use “Multi Master Replication Manager for MySQL” (aka MMM), also developed by the folks at Percona, to automate a lot of the steps in option 2. Assuming you have a multi master setup, the advantage of this approach over option 3 is that the replication thread is not impacted in any way. Here’s what the steps look like:
1) Run ALTER TABLE on all of the slaves individually. Use SET SQL_LOG_BIN=0; just to be safe.
2) mmm_control move_role writer db2 (another way of expressing this is mmm_control set_offline db1)
3) SET SQL_LOG_BIN=0; on db1. Then run ALTER TABLE.
4) mmm_control move_role writer db1. Done!
Will
on 14 May 12
@Tim
We have close to 30 MySQL databases + the supporting infrastructure for things like backing up, restoring, failing over, provisioning etc. There was more to consider than just which database we preferred when starting the new project.
Nu Park
on 14 May 12
@Will
Just throwing good money after the bad. It’ll just be that much more expensive to switch later.
Douglas F Shearer
on 14 May 12
Are you integrating pt-online-schema-change with ActiveRecord’s migrations, or managing schema changes outside of Rails?
lian
on 15 May 12
@Tim: why “Postgresql doesn’t have this limitation”?
John
on 15 May 12
@Douglas: We perform the schema changes outside of AR’s migrations.
Alice Young
on 15 May 12
If you used Java+Hibernate, you would not have any of these problems.
Chad
on 15 May 12
When I worked at OpenFeint, we used the role swap method. It was slightly easier than in the article, because we had master-master replication. Of course, when you’re running a global gaming network that doesn’t really have an ‘off-peak’, no kind of maintenance is ‘easy’.
AC
on 16 May 12
@John
Because Postgres isn’t limited like MySQL and instead supports schema changes without locking the table.
Note that SQL Server and Oracle also support this functionality.
Tim most likely brought this topic up because many people bitch about the hard workarounds they have to do due to MySQL limitations instead of just pitching a vastly superior product like Postgres… which, by the way, is even more open than MySQL given its BSD license (free as in beer).
dstroma
on 18 May 12
The way to take the pain out of MySQL is to switch to PostgreSQL.
@lian, Postgres uses a Multiversion Concurrency Control (MVCC) model to prevent excessive locking while maintaining ACIDity.