Unless you are working on a small, internal application for an organisation, eventually you’re going to have to consider how to scale your database service to handle more traffic. If you consult the Internet for advice, the two most frequent answers you’ll get will be “Use {other system} because it is WEB SCALE” and “Set up replicated servers”.
Migrating from one database system to another, although potentially a painful and complex undertaking, may actually be the best solution. Although MySQL is capable of being used to build things like social graph databases and document stores, it will never offer the same performance as a dedicated graph database or document store. Use the right tool for the job.
If MySQL truly is the right tool for the job, and you’ve optimised your schemata and queries as far as they’ll go, then replication is one way of scaling up operations.
Even if you don’t have plans to use replication straight away, it’s easier to set up from a new, empty installation than to add to an existing, running system. Here are a few good reasons to use replication.
Spread query load across several servers
One of the most common uses for replication is to create a farm of database servers to deliver data to applications. Whenever data needs to be retrieved, any server in the slave herd can be used to service the request.
Servers within the herd may be chosen using methods as simple as picking from a list in a round-robin fashion, or using systems that actively monitor the load on each server and select the least loaded one. In addition, having a herd of slave servers to pick from can help reduce the impact of server failure, with bigger herds offering greater resilience: the failure of a server in a two-server herd results in the loss of 50% of query capacity, whereas the failure of a server in a herd of one hundred means a loss of only 1%.
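As a rough illustration, the selection logic in the application can be very small indeed. The sketch below shows both approaches in Python; the host names are placeholders, and the load figures for the least-loaded strategy are faked, where a real system would pull them from a monitoring service.

```python
import itertools
import random

# Hypothetical herd of read-only slaves; the host names are placeholders.
SLAVE_HERD = ["slave1.db.example.com", "slave2.db.example.com",
              "slave3.db.example.com", "slave4.db.example.com"]

# Round-robin: cycle through the herd, handing out one server per request.
_rotation = itertools.cycle(SLAVE_HERD)

def pick_slave_round_robin():
    """Return the next slave in the rotation."""
    return next(_rotation)

def pick_slave_least_loaded(load_by_host):
    """Return the slave currently reporting the lowest load."""
    return min(load_by_host, key=load_by_host.get)

if __name__ == "__main__":
    print([pick_slave_round_robin() for _ in range(6)])
    fake_load = {host: random.random() for host in SLAVE_HERD}  # stand-in for real monitoring data
    print(pick_slave_least_loaded(fake_load))
```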
Selective replication can be used to create servers that only handle certain schemata and/or tables – some with fast CPUs for queries requiring heavy calculations, some with large disks for BLOB storage. Map/reduce systems can use an entire herd to parallelise requests, splitting the query into selective ranges and then combining the results into a single data set. Deliberately pausing the replication process can be used to create snapshots of the database, frozen in time at a particular point, which can be used by report-generating processes that are otherwise unable to use transactions.
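As a sketch of that last idea, a reporting job could pause the SQL thread on a dedicated slave, run its queries against the frozen data, and then let replication resume. This assumes the mysql-connector-python package, placeholder host names and credentials, an account with the privileges needed to control replication, and a made-up table for the example query.

```python
import mysql.connector  # assumes the mysql-connector-python package is installed

# Placeholder connection details for a dedicated reporting slave.
conn = mysql.connector.connect(host="report-slave.example.com",
                               user="report", password="secret")
cur = conn.cursor()

# Stop applying relay log events: the data is now frozen at a single point
# in time, while the I/O thread carries on fetching events from the master.
cur.execute("STOP SLAVE SQL_THREAD")

try:
    # ... long, multi-statement reporting queries run here against frozen data ...
    cur.execute("SELECT COUNT(*) FROM sales.orders")  # hypothetical table
    print(cur.fetchone())
finally:
    # Resume applying events so the slave catches up with the master again.
    cur.execute("START SLAVE SQL_THREAD")
    cur.close()
    conn.close()
```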
Provide tiered levels of service
It’s common for web services to provide either a “free tier” or time-limited access to users, so that they can evaluate the service and see the benefits of upgrading to a paid account.
The “free tier” may be serviced by a smaller/older/cheaper herd of slaves on a “best effort” basis, whereas the “premium tier” may use a slave herd capable of higher performance and uptime. Both herds would contain the same data (making it easy to migrate users between the tiers as they pay – or don’t!), allowing organisations to better manage resources (the premium tier may have an expensive 24×7 response team, while the free tier is handled by regular staff during office hours).
Once the user has been authenticated, the application can decide which herd to connect to. If the user upgrades their account to a (higher) paid tier, the application can immediately switch to the better herd without needing to migrate any data.
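In the application, this can amount to little more than choosing a set of connection parameters once the user’s tier is known. A minimal sketch, with hypothetical host names and a made-up user record:

```python
# Hypothetical connection settings for the two herds; any load-balancing
# logic (round-robin, least loaded) would sit behind these entries.
HERDS = {
    "free":    {"host": "free-herd.db.example.com",    "port": 3306},
    "premium": {"host": "premium-herd.db.example.com", "port": 3306},
}

def herd_for(user):
    """Pick the herd matching the user's tier, falling back to the free tier."""
    return HERDS.get(user.get("tier", "free"), HERDS["free"])

# After authentication, the application simply connects to the returned herd.
user = {"name": "alice", "tier": "premium"}   # made-up account record
print(herd_for(user))  # {'host': 'premium-herd.db.example.com', 'port': 3306}
```

Because both herds hold the same data, upgrading a user only changes which entry is looked up; no data migration is needed.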
Create a local database mirror
Depending on how applications access data, there are cases where network latency can have a significant impact on the performance of an application; having a “local” mirror (or herd of mirrors) can be desirable, if not essential.
Having a database mirror in the same country/data centre/rack as the application servers will almost always have a positive effect on performance, and provides additional resiliency in the event of failures. In addition, as these servers are “local” in nature, access to them can be restricted to “local” IP addresses – there is no need for the server to listen on, or even have, a publicly accessible IP address. This can increase the security of a system by reducing the attack surface it presents to the rest of the world.
Another common scenario is to have an “office mirror” of remote databases to provide local backups, and to service data requests without having to place a burden on live, customer-facing systems. “Heavy” operations, like detailed report generation or ETL processing by BI systems, can be left running on the office mirror, resulting in no visible impact to customers.
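Keeping that heavy work off the customer-facing servers can be as simple as holding two sets of connection settings and sending reporting queries to the mirror. A rough sketch, assuming the mysql-connector-python package; the host names, credentials, and report query are placeholders.

```python
import mysql.connector  # assumes the mysql-connector-python package is installed

# Placeholder settings: writes and customer-facing reads use the live master,
# while long-running reports go to the office mirror.
LIVE   = {"host": "db-master.example.com",       "user": "app",    "password": "secret"}
MIRROR = {"host": "office-mirror.example.local", "user": "report", "password": "secret"}

def run_report(sql):
    """Run a heavy, read-only query against the office mirror."""
    conn = mysql.connector.connect(**MIRROR)
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()

# Hypothetical monthly report; the live, customer-facing servers never see it.
rows = run_report("SELECT customer_id, SUM(total) FROM sales.orders GROUP BY customer_id")
```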
Secure access to sensitive tables/fields
Sometimes, it may be appropriate to limit access to tables/fields to create “write only” data on exposed servers. This data isn’t truly write-only (a different MySQL user account with greater privileges would be able to access it), but it does allow for a certain amount of additional security (and security should be layered, in any case).
Storage of payment details, for example, may benefit from this pseudo “write only” access – once committed to the database, it should not be accessible again from the frontend. If the application’s standard database credentials lack the privileges to read these details, it would be harder for an attacker to coerce the information back out using simple SQL injection – they would need to persuade the application to use credentials it doesn’t normally use. Once this data has been replicated to a more secure environment, a different application (using different credentials that do have access to the fields) may be granted access to it. Care would need to be taken to secure the binary logs, both in storage and in transmission: they would contain this “secure” information, and so would be vulnerable to being read either on disk or in transit to the slaves.
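A sketch of how the privilege split might look, using hypothetical account, schema, and table names: the frontend account can insert payment rows but never read them back, while a separate account used only in the secure environment can. This assumes the mysql-connector-python package and that the two accounts already exist.

```python
import mysql.connector  # assumes the mysql-connector-python package is installed

# Run once by an administrator; all names, hosts, and subnets below are hypothetical.
admin = mysql.connector.connect(host="db-master.example.com",
                                user="admin", password="secret")
cur = admin.cursor()

# The frontend account may write payment rows, but cannot SELECT them back,
# so a simple SQL injection through the web application cannot read them.
cur.execute("GRANT INSERT ON shop.payment_details TO 'webapp'@'10.0.0.%'")

# The back-office account, used only from the secure environment, may read them.
cur.execute("GRANT SELECT ON shop.payment_details TO 'billing'@'10.1.0.%'")

cur.close()
admin.close()
```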
Create hot-spare servers to replace a failed master server
Even with careful selection of hardware and networking, there may be a failure that takes the master server offline. Depending on the nature of the failure, and the availability of things like hardware spares or on-site technicians, the amount of downtime involved may exceed service level agreements.
There may come a point where the decision is made to abandon the previous master server completely, and replace it with a new one. Without replication, this would involve restoring from the most recent backup, and potentially losing all transactions made since that backup was taken.
Using replication, any one of the slaves may be considered a candidate for the new master. These servers would require no backup restoration, since they already hold a copy of the master’s data, and promoting one would generally result in the loss of only the few transactions that didn’t manage to be transmitted through the binary log system.
This provides “hot-spares” for the master – servers that can be quickly promoted to take over in case of master server failure. Once the failure is detected, systems can (either automatically, or by being triggered manually) select one slave to promote, and configure the remaining slave herd to use the new master. Service downtime could be measured in seconds, rather than hours.
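The promotion itself boils down to a handful of statements: stop the chosen slave’s replication threads, discard its slave configuration, and point the rest of the herd at it. A heavily simplified sketch, assuming the mysql-connector-python package, GTID-based replication (so MASTER_AUTO_POSITION can be used), and placeholder host names and credentials; a real failover tool would first check which slave is the most up to date before choosing one to promote.

```python
import mysql.connector  # assumes the mysql-connector-python package is installed

NEW_MASTER = "slave1.db.example.com"                           # slave chosen for promotion
REMAINING  = ["slave2.db.example.com", "slave3.db.example.com"]
ADMIN      = {"user": "repl_admin", "password": "secret"}      # placeholder credentials

def run(host, statements):
    """Open a connection to host and execute each statement in turn."""
    conn = mysql.connector.connect(host=host, **ADMIN)
    cur = conn.cursor()
    for stmt in statements:
        cur.execute(stmt)
    cur.close()
    conn.close()

# 1. Promote the chosen slave: stop replication and drop its slave settings.
run(NEW_MASTER, ["STOP SLAVE", "RESET SLAVE ALL"])

# 2. Re-point the remaining herd at the new master.
for host in REMAINING:
    run(host, [
        "STOP SLAVE",
        f"CHANGE MASTER TO MASTER_HOST='{NEW_MASTER}', MASTER_USER='repl', "
        "MASTER_PASSWORD='secret', MASTER_AUTO_POSITION=1",
        "START SLAVE",
    ])
```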
Create consistent database backups
One of the difficulties facing database operators is generating regular, consistent database backups. Some “hot” backup solutions deliberately create “dirty” copies of the database and then roll them back until a good position is found. This generally works for non-replicated systems, but when replication is being used, the recorded master position may be ahead of the transactions actually committed to the tables – the rollback performed by the backup tool may not roll the recorded master position back to match. If this backup is used to rebuild the database, some transactions may be lost.
The easiest way to obtain a consistent backup is to stop the server (thus making sure all transactions are committed or rolled back, all dirty buffers are written, and all undo/redo logs are empty) and copy the files. Normal filesystem backup tools can be used, as the entire database state would be safely frozen, and the data and configuration files can easily be copied to other servers to clone new slaves. In many cases, if replication on a slave fails, it may be quicker and easier to just wipe the failed slave’s data and re-clone it from a consistent backup – the newly cloned slave should automatically connect to the master and bring itself back up to date.
Stopping the main database server on a regular basis is unlikely to be an acceptable proposal, but given that a properly running slave server will contain an identical copy of the data (up to a given transaction), slaves lend themselves nicely to this task. If a slave is not part of a regular service herd, or is cleanly removed from the herd (either prior to shutdown, or as part of a slave herd monitoring system), then the impact of regular shutdowns will be minimal.
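A rough sketch of such a backup run on a slave, written for a Linux host with systemd; the service name, data directory, and backup path are all placeholders, and the slave is assumed to have already been removed from the service herd.

```python
import subprocess
from datetime import date

DATADIR = "/var/lib/mysql"                        # placeholder data directory
BACKUP  = f"/backups/mysql-{date.today()}.tar.gz" # placeholder backup target

# Stop the slave's MySQL service so every transaction is committed or rolled
# back and all buffers are flushed to disk before the files are copied.
subprocess.run(["systemctl", "stop", "mysql"], check=True)

try:
    # With the server down, an ordinary filesystem tool gives a consistent copy.
    subprocess.run(["tar", "czf", BACKUP, DATADIR], check=True)
finally:
    # Restart the slave; it reconnects to the master and catches up on its own.
    subprocess.run(["systemctl", "start", "mysql"], check=True)
```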
Okay, I’m sold! Let’s do this!
Before you set up replication, it’s best to properly understand the process of how it works – this will help you plan storage and network requirements, and to diagnose and fix problems if and when they occur. Being familiar with the flow of data within and between servers may also allow you to develop new and interesting solutions to problems, or spot ways in which the replication process can be used to deliver efficiency gains.