PHP Load Balancing Part 1 and Part 2


Scaling a PHP MySQL Web Application, Part 1

By Eli White

Tips for scaling your PHP-MySQL Web app based on real-world experiences at Digg, TripAdvisor, and other high-traffic sites.

Published April 2011

Creating a Web application – actually writing the core code – is often the initial focus of a project. It’s fun, it’s exciting, and it’s what you are driven to do at the time. Hopefully, however, there comes a point when your application starts to take more traffic than originally expected. That is usually when people start thinking about how to scale their Website.

Ideally, you should be thinking about how your application is going to scale from the moment you first write code. That is not to say you should spend a lot of early development effort targeting an unknown future – who knows whether your application will ever hit traffic levels that require a scalability effort? The most important lesson here is to understand what you will need to do to scale later. Knowing that, you can do only what you need at each phase of your project without “coding yourself into a corner”, ending up in a situation where it’s hard to take the next scalability step.

I’ve worked for a number of companies and projects that over time had to deal with massive levels of Web traffic. These include Digg, TripAdvisor, and the Hubble Space Telescope project. In this two-part article I will share some of the lessons learned, and take you step by step through a standard process of scaling your application.

Performance vs. Scalability

Before we go much further, we should discuss the differences (and similarities) between performance and scalability.

Performance, in the context of a Web application, is how fast you can serve data (pages) to the end user. When people talk about increasing the performance of their application, they are typically talking about making it take 300ms instead of 500ms to generate the content.

Scalability, in contrast, is the quality that enables your application to grow as your traffic grows. A scalable application is one that theoretically, no matter how much traffic is sent toward it, can have capacity added to handle that traffic.

Scalability and performance are obviously interrelated. As you increase the performance of an application, it requires fewer resources to handle a given load, which makes scaling easier. Similarly, you can’t really call an application scalable if it requires one Web server per user to perform adequately, as it would be untenable for you to provide that.

PHP Tuning

One thing to look at first is any low-hanging fruit within your Web server and PHP setup. There are some very easy changes that can immediately increase performance and perhaps remove the need to scale at this point entirely, or at least make scaling easier.

The first of these is installing an opcode cache. PHP is a scripting language, and therefore recompiles the code upon every request. Installing an opcode cache into your Web server can circumvent this limitation. Think of an opcode cache as sitting between PHP and the server machine; after a PHP script is first compiled, the opcode cache remembers the compiled version and future requests simply pull the already compiled version.

There are numerous opcode caches available. Zend Server comes with one built-in and Microsoft provides one for Windows machines called ‘WinCache’. One of the most popular ones is an open source product called APC. Installing any of these products is very straightforward, and doing so will give you immediate and measurable performance gains.
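As a rough sketch of what that installation looks like for APC (the directive values here are illustrative, and exact settings vary by APC version and platform), you typically install the extension via PECL or your package manager and then enable it in php.ini:

extension=apc.so        ; apc.dll on Windows
apc.enabled=1
apc.shm_size=64M        ; shared memory set aside for compiled scripts
                        ; (older APC versions expect a bare number of megabytes)

APC also ships with an apc.php status page that you can drop into your Web root to watch hit rates and confirm the cache is actually being used.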

The next step that you should evaluate is if you can remove the dynamic nature of any of your Web pages. Often Web applications will have pages that are being generated by PHP but that actually rarely change. Examples might be an FAQ page, or a press release. Caching the generation of these pages and serving the cached content so that PHP doesn’t need to do any work can save many CPU cycles.

There are multiple ways to approach this task. One of them is to actually pre-generate HTML pages from your PHP and let the HTML pages be directly served to the end users. This could be done as a nightly process perhaps, so that any updates to the Web pages do in fact go live eventually, but on a delayed schedule. Implementing this can be as simple as running your PHP scripts from the command line, piping their output to .html files, and then changing links in your application.
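As a rough sketch of that nightly process (the script names and output paths below are illustrative, not from the original article), a small CLI script can render each rarely changing page and write the result out as static HTML:

<?php
// generate_static.php - run nightly (for example from cron) to refresh static
// copies of rarely changing pages.
$pages = array(
    'faq.php'   => '/var/www/html/faq.html',
    'press.php' => '/var/www/html/press.html',
);

foreach ($pages as $script => $htmlFile) {
    ob_start();             // capture everything the page prints
    include $script;        // run the page much as the Web server would
    file_put_contents($htmlFile, ob_get_clean());
}
?>

Your Web server can then serve the .html files directly, with PHP involved only when the generator runs.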

However, there is an even easier approach that requires less effort: implementing an on-the-fly cache. Basically, your entire script output is captured into a buffer and saved to the filesystem, to memory/cache, to the database, and so on. All future requests for that same script just read from the cached copy. Some templating systems (such as Smarty) do this for you automatically, and there are some nice drop-in packages that can handle it as well (such as jpcache).

The simplest version of this is fairly easy to write yourself: the following code, injected at the top of a typical PHP page, will handle it (obviously, adjust the timeout and cache directory to meet your needs). Encapsulate this into a single library function and you can cache pages easily as needed. As written, it blindly caches pages based only upon the URL and GET parameters; if the page changes based upon a session, POST data, or cookies, you’d have to add those into the uniqueness of the filename created.

<?php
$timeout = 3600; // Cache lifetime: one hour
// Cache key is based on the request URL (including GET parameters).
// The cache directory must already exist and be writable by the Web server.
$file = '/tmp/cache/' . md5($_SERVER['REQUEST_URI']);

if (file_exists($file) && (filemtime($file) + $timeout) > time()) {
    // Cache hit: output the existing file to the user and stop
    readfile($file);
    exit();
} else {
    // Cache miss: capture the page output and save it once the script finishes
    ob_start();
    register_shutdown_function(function () use ($file) {
        $content = ob_get_flush();
        file_put_contents($file, $content);
    });
}
?>

 


Load Balancing

When creating Web applications, everyone starts with a single server that handles everything. That’s perfectly fine, and in fact how things should usually be done. One of the first steps you will need to take to begin scaling your Web application to handle more traffic is to have multiple Web servers handling the requests. PHP is especially well suited to horizontal scaling in this manner by simply adding more Web servers as needed. This is handled via load balancing, which, fundamentally, is simply the concept of having a central point where all requests arrive and then handing the requests off to various Web servers.

white-php-part1-f1
Figure 1
 Load balancing

There are numerous options for handling your load balancing operationally. For the most part they fall into three different categories.

The first category is software balancers: software that you install on a standard machine (usually Linux based) and that handles all the load balancing for you. Each software balancer also comes with its own extra features, such as built-in page output caching, gzip compression, and more. Most of the popular options in this category are actually multipurpose software or Web servers themselves that happen to have a reverse proxy mode you can enable for load balancing; these include Apache itself, Nginx, and Squid. There is also smaller, dedicated load-balancing software such as Perlbal. You can even do a very simple version of this in your DNS server by using DNS rotation, so that every DNS request responds with a different IP address, although that is obviously not very flexible.

The second category is hardware balancers. These are physical machines that you can buy, designed to handle very large traffic levels and with custom software built into them. Two of the better-known versions of this are the Citrix Netscaler and the F5 BigIP. The hardware solutions often provide many custom benefits, and can act as firewalls and security barriers as well.

The final category is a newer one: cloud-based solutions. When hosting with a cloud provider you obviously don’t have the ability to install a hardware solution, which restricts you to software. But most cloud providers also have their own built-in mechanisms for load balancing across their instances. In some cases these are less flexible, but they are much easier to set up and maintain.

In the end, all of these solutions, whether software or hardware, typically offer many of the same features: the ability to manipulate or cache data as it passes through, the ability to balance load either by random selection or by watching health metrics on the machines behind them, and much more. I recommend exploring what options exist for your hosting environment and simply finding the one that serves your situation best. There are very good setup guides online for any solution you choose that cover all the basics.

Load Balancing Preparation

As mentioned earlier in this article, it’s important to do only what you need, while being mentally ready to take the next steps. Rarely will you set up load balancing when you are first building your application, but you should keep a few things in mind to make sure that you don’t preclude yourself from doing it later.

First of all, if you are using any kind of local memory caching solution (such as the one provided by APC) you need to write your code without assuming that you’ll only have a single cache. Once you have multiple servers, data you store on one machine is only accessible from that machine.

Similarly, a common pitfall is assuming a single file system. PHP sessions can no longer be stored on the local file system once you are using multiple servers; you need to find another solution for them (such as a database). I’ve also seen code that stored uploaded files locally and expected to read them back later. You therefore need to assume that you shouldn’t store anything on the local file system.

Now, with all that said, you can still use the techniques listed previously if you feel there is a strong need for rapid development of your initial application. What’s important is that you encapsulate any ‘single machine’ dependent code. Make sure that it’s in a single function, or at least within a class. Then, when the time comes to move to load balancing, there’s only one place you need to update to a new solution.
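As a concrete illustration of encapsulating one of those ‘single machine’ dependencies, here is a minimal sketch of moving PHP sessions off the local file system and into MySQL, so that any Web server behind the balancer can read them. The sessions table (columns id, data, modified) and the connection details are assumptions for illustration only:

<?php
// A minimal sketch: store sessions in MySQL instead of the local file system.
// Assumes a "sessions" table with columns id (primary key), data, and modified.
$db = new PDO('mysql:dbname=mydb;host=10.1.2.3', $user, $pass);

session_set_save_handler(
    function ($savePath, $name) { return true; },  // open
    function () { return true; },                  // close
    function ($id) use ($db) {                     // read
        $stmt = $db->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute(array($id));
        return (string) $stmt->fetchColumn();
    },
    function ($id, $data) use ($db) {              // write
        $stmt = $db->prepare(
            'REPLACE INTO sessions (id, data, modified) VALUES (?, ?, NOW())');
        return $stmt->execute(array($id, $data));
    },
    function ($id) use ($db) {                     // destroy
        return $db->prepare('DELETE FROM sessions WHERE id = ?')
                  ->execute(array($id));
    },
    function ($maxlifetime) use ($db) {            // garbage collection
        return $db->prepare(
            'DELETE FROM sessions WHERE modified < NOW() - INTERVAL ? SECOND')
                  ->execute(array($maxlifetime));
    }
);
session_start();
?>

Because all of the storage logic sits behind session_set_save_handler(), the rest of your code keeps calling session_start() and using $_SESSION exactly as before.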

MySQL Master-Slave Replication

At this point you’ve hopefully got your Web server setup to be scaling wonderfully. Now most Web applications find their next bottleneck: the database itself. MySQL is very powerful but eventually you will run into the same problem as with Web servers where one simply isn’t enough. MySQL comes with a built-in solution for this called master-slave replication.

In a master-slave setup, you will have one server that acts as the master, the true repository of the data. You then setup another MySQL server, configuring it as the slave of the master. All actions that take place on the master are replayed on the slave.

white-php-part1-f2
Figure 2
 Master-slave setup

Once configured in this manner, you make sure that your code talks to the ‘proper’ database depending upon the action you need to take. Any time you need to change data on the server (UPDATE, DELETE, and INSERT commands), you connect to the master. For read access, you connect to the slave instead. This could be as straightforward as something like:

<?php
$master = new PDO('mysql:dbname=mydb;host=127.0.0.2', $user, $pass);
$master->exec('UPDATE users SET posts = posts + 1 WHERE id = 42');

$slave = new PDO('mysql:dbname=mydb;host=127.0.0.3', $user, $pass);
$results = $slave->query('SELECT id FROM posts WHERE user_id = 42');
?>


There are two benefits to taking this approach for scalability. First of all, you’ve managed to separate the database load into two parts, so right off the bat you’ve reduced the load on each server. Now, that’s typically not going to be an even split; most Web applications are very read-heavy, and so the slave will be taking more of the load. But that brings us to the second point: isolation. At this point you’ve isolated two distinct types of load, writes and reads. This will allow you greater flexibility in taking the next scalability steps.

Slave Lag

There is really only one main pitfall when doing this (other than ensuring you are always talking to the right MySQL database), and that is slave lag. A database slave doesn’t instantly have the same data that the master has. As mentioned, any commands that change the database are essentially re-played on the slaves. That means there is a (hopefully very short) period of time after an update is issued on the master before it is reflected on the slave. Typically this is on the order of milliseconds, but if your servers are overloaded, slaves can fall much further behind.

The main change this forces on your code is that you can’t write something to the master and then immediately attempt to read it back. This is otherwise a fairly common practice when you rely on SQL arithmetic, default values, or triggers: after issuing a write command, your code may not know the resulting value on the server and therefore wants to read it back to continue processing. A common example would be:

<?php
$master->exec('UPDATE users SET posts = posts + 1 WHERE id = 42');
$results = $slave->query('SELECT posts FROM users WHERE id = 42');
?>


Even with the fastest setup possible, this will not give the expected results. The update will not have been re-played against the slave before the query executes, and you will get the wrong count of posts. In cases like this, therefore, you need to find workarounds. In the absolute worst case you can query the master for this data, though that defeats the purpose of “isolation of concerns”. The better solution is to find a way to “approximate” the data. For example, when showing a user who has just added a new post, read in the number of posts first and manually add one to it before displaying the value back to the user. If it happens to be incorrect because multiple posts were being added at the same time, it’s only for that one Web page; the database is correct for other users reading the value.
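A minimal sketch of that approximation approach might look like the following ($master and $slave are the PDO connections from the earlier examples, and the user id of 42 carries over from those illustrative queries):

<?php
// Read the current count from a slave *before* writing, then display an
// adjusted value instead of immediately re-reading a possibly lagging slave.
$stmt = $slave->query('SELECT posts FROM users WHERE id = 42');
$currentPosts = (int) $stmt->fetchColumn();

// Record the new post on the master:
$master->exec('UPDATE users SET posts = posts + 1 WHERE id = 42');

// Show the user an approximated count; the slaves will catch up shortly.
echo 'You now have ', ($currentPosts + 1), ' posts.';
?>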

Multiple MySQL Slaves

At this point the next logical step is to expand your MySQL database capacity horizontally by adding additional slaves. One master database can feed any number of slave machines, within some extreme limits. Few systems will ever hit them, but there is a theoretical point at which a single master can no longer keep up with its multitude of slaves.

white-php-part1-f3
Figure 3
 Multiple MySQL slaves

I explained previously that one of the benefits of master-slave replication is the isolation of concerns. Combine this with the fact that most Web applications are read-heavy: they add data rarely compared with how often they must read data to generate every page. This means that your scalability concern is usually read access, and that’s exactly what adding more slaves gives you. Each additional slave increases your database read capacity, just as adding more Web servers does for PHP.

Now this leaves you with a potential complication: how do you determine which database to connect to? When you had just a single database, that decision was easy. Even when you had a single slave it was easy. But now, to properly balance the database load, you need to connect to a number of different servers.

There are two main solutions that people deploy in order to handle this. The first is to have a dedicated MySQL database slave for each Web server. This is one of the simplest solutions and is commonly deployed for that reason. You can envision it looking something like this:

white-php-part1-f4
Figure 4
 Dedicated database slave for each Web server

In this diagram you see a load balancer with three Web servers behind it. Each Web server speaks to its own database slave, which in turn gets data from a single master. The decision for your Web servers as to which database to access is very simple, as it’s no different than in a single-slave situation – you just configure each Web server to connect to a different database as its slave.

However in practice when this solution is deployed it’s often taken one step further: one single machine that acts as both Web server and database slave. This simplifies matters greatly as connections to your slave are simply to ‘localhost’ and balancing is therefore built in.

There is a large drawback to taking this simple approach however, and it causes me to always shy away from it: You end up tying the scalability of your Web servers to your MySQL database slaves. Again, isolation of concerns is a big thing for scaling. Quite often the scaling needs of your Web servers and of your database slaves are mutually independent. Your application might need the power of only three Web servers, but need 20 database slaves if it’s very database intensive. Or you might need three database slaves and 20 Web servers if it’s PHP- and logic-intensive. By tying the two together, you are forced to add both a new Web server and a new database slave at the same time. This also means that you are adding to the maintenance overhead.

Combining the Web server and database slave onto the same machine also means you are over-utilizing resources on each machine, and you can’t acquire hardware that matches the specific needs of each service. Typically the requirements for database machines (large and fast I/O) aren’t the same as for your Web servers (fast CPU).

The solution to this is to sever the relationship between Web servers and MySQL database slaves, and randomize the connections between them. Literally, in this case, each request that comes to a Web server randomly (or with an algorithm of your choice) selects a different database slave to connect to.

white-php-part1-f5
Figure 5
 Randomizing the connections between Web servers and database slaves

This approach lets you scale your Web servers and database slaves independently. It can even enable smart algorithms that figure out which slave to connect to, based upon whatever logic works best for your application.

There’s also another great benefit that we’ve not talked about yet: stability. Via randomized connections, it’s possible for a database slave to go down and for your application to not care, as it will just pick another slave to connect to instead.

Usually it’s your job to write code within your application that picks which database to connect to. There are ways to avoid this task, such as putting all the slaves behind a load balancer and connecting to that. But having the logic in your PHP code gives you, as a programmer, the greatest flexibility in the future. Following is an example of some basic database selection code that accomplishes this task:

<?php

class DB {
    // Configuration information:
    private static $user = 'testUser';
    private static $pass = 'testPass';
    private static $config = array(
        'write' =>
            array('mysql:dbname=MyDB;host=10.1.2.3'),
        'read' =>
            array('mysql:dbname=MyDB;host=10.1.2.7',
                  'mysql:dbname=MyDB;host=10.1.2.8',
                  'mysql:dbname=MyDB;host=10.1.2.9')
        );

    // Static method to return a database connection:
    public static function getConnection($server) {
        // First make a copy of the server array so we can modify it
        $servers = self::$config[$server];
        $connection = false;
        
        // Keep trying to make a connection:
        while (!$connection && count($servers)) {
            $key = array_rand($servers);
            try {
                $connection = new PDO($servers[$key], 
                    self::$user, self::$pass);
            } catch (PDOException $e) {}
            
            if (!$connection) {
                // We couldn't connect.  Remove this server:
                unset($servers[$key]);
            }
        }
        
        // If we never connected to any database, throw an exception:
        if (!$connection) {
            throw new Exception("Failed: {$server} database");
        }
        
        return $connection;
    }
}

// Do some work

$read = DB::getConnection('read');
$write = DB::getConnection('write');

. . .

?>


Of course there are numerous enhancements you may want (and should) make to the above code before it could be used in production. You probably want to log the individual database connection failures. You shouldn’t store your configuration as static class variables because you can’t change your configuration without code changes. Plus, in this setup, all servers are treated as equals. It can be worthwhile to add the idea of server ‘weighting’ so that some servers could be assigned less traffic than others. In the end you’ll most likely want to encapsulate this logic into a greater database abstraction class that gives you even more flexibility.
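As one example of those enhancements, server weighting can be kept very simple. The following sketch (weights and hosts are illustrative, not from the original article) expands each DSN into a pick list so that array_rand() naturally favors the heavier servers:

<?php
// Expand weighted servers into a flat pick list: a server with weight 3 appears
// three times and is therefore chosen roughly three times as often.
$readServers = array(
    'mysql:dbname=MyDB;host=10.1.2.7' => 3,   // bigger box, takes more traffic
    'mysql:dbname=MyDB;host=10.1.2.8' => 1,
    'mysql:dbname=MyDB;host=10.1.2.9' => 1,
);

$pickList = array();
foreach ($readServers as $dsn => $weight) {
    for ($i = 0; $i < $weight; $i++) {
        $pickList[] = $dsn;
    }
}

$dsn = $pickList[array_rand($pickList)];      // weighted random selection
?>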

Conclusion

By following the steps in this article, you should be well on your way to a scalable architecture for your PHP application. In the end, there really is no single solution, no magic bullet. It’s for the same reason that there isn’t just one application design or one framework: every application will have different bottlenecks and different scaling problems that need to be solved.

Part 2 of this article will discuss more advanced topics for scaling your MySQL database beyond the techniques mentioned here.



Eli White
 is a longtime PHP user and the author of the book PHP 5 in Practice. He has worked on many large-scale PHP projects, including Digg, TripAdvisor, and the Hubble Space Telescope program. He frequently speaks at PHP conferences to share his knowledge. More about Eli can be found at eliw.com.

 

Scaling a PHP MySQL Web Application, Part 2

By Eli White

Tips for scaling your PHP-MySQL Web app based on real-world experiences at Digg, TripAdvisor, and other high-traffic sites. In this portion: pooling and sharding techniques.

Published April 2011

In Part 1 of this article, I explained the basic steps to move your application from a single server to a scalable solution for both PHP and MySQL. While most Web applications will never need to expand beyond the concepts presented in that article, eventually you may find yourself in a situation where the basic steps aren’t enough.

Here are some more advanced topics on how you might change your setup to support additional throughput, isolate “concerns” better, and take your scalability to the next level.

MySQL Database Pooling

Database pooling is one fairly simple technique that allows for greater “isolation of concerns.” The concept involves virtually separating a group of database slaves into multiple pools, with each pool receiving a specific class of queries. Using a basic blog as an example case, we might end up with a pool distribution like this:

white-php-part2-f1
Figure 1
 Database pooling

Now the obvious question at this point is: why? Given that all the slaves in the above setup are identical, why would you consider doing this?

The main reason is to isolate areas of high database load. Basically, you decide which queries are OK to perform more slowly. In this example you see that we’ve set aside just two slaves for batch processes. Typically, if you have batch processes that run in the background, you aren’t worried about how fast they run; whether it takes five seconds or five minutes doesn’t matter. So it’s OK for the batch processes to have fewer resources, leaving more for the rest of the application. But perhaps more important, the kinds of queries you typically run in large batch processes (and the reason they are not done in real time in the first place) can be rather large and resource intensive.

By isolating these processes off to their own database slaves, your batch processes firing off won’t actually affect the apparent performance of your Web application. (It’s amazing how many Websites get slower just after midnight when everyone decides to run cron jobs to background-process data.) Isolate that pain and your Website performs better. You’ve now created selective scalability.

The same concept can apply to other aspects of your application as well. If you think about a typical blog, the homepage only shows a list of posts. Any comments to those posts are displayed only on a permalink page. Retrieving those comments for any given post can be a potentially painful process, given there can be any number of them. It’s further complicated if the site allows threaded comments because finding and assembling those threads can be intense.

Therein lies the benefit of database pooling. It’s usually more important for your homepage to load extremely fast; the permalink pages with comments can be slower. By the time someone loads a full post, they’ve committed and are willing to wait a little longer. Therefore, if you isolate a set of slaves (four in the example above) specifically for queries about comments, you leave the larger pool of ‘primary’ slaves to be used when generating homepages. Again you’ve isolated load and created selective scalability. Your comment and permalink pages might become slower under heavy load, but homepage generation will always be fast.

One way that you can apply scalability techniques to this pool model is to allow on-the-fly changes to your pool distribution. If you have a particular permalink that is extremely popular for some reason, you could move slaves from the primary pool to the comments pool to help it out. By isolating load, you’ve given yourself more flexibility. You can add slaves to any pool, move them between pools, and in the end dial in the performance that you need at your current traffic level.

There’s one additional benefit that you get from MySQL database pooling: a much higher hit rate on your query cache. MySQL (like most database systems) has a query cache built into it. This cache holds the results of recent queries; if the same query is re-executed, the cached results can be returned quickly.

If you have 20 database slaves and execute the same query twice in a row, you only have a 1/20th chance of hitting the same slave and getting a cached result. But by sending certain classes of queries to a smaller set of servers you can drastically increase the chance of a cache hit and get greater performance.

You will need to handle database pooling within your code – a natural extension of the basic load balancing code in Part 1. Let’s look at how we might extend that code to handle arbitrary database pools:

<?php

class DB {
    // Configuration information:
    private static $user = 'testUser';
    private static $pass = 'testPass';
    private static $config = array(
        'write' =>
            array('mysql:dbname=MyDB;host=10.1.2.3'),
        'primary' =>
            array('mysql:dbname=MyDB;host=10.1.2.7',
                  'mysql:dbname=MyDB;host=10.1.2.8',
                  'mysql:dbname=MyDB;host=10.1.2.9'),
        'batch' =>
            array('mysql:dbname=MyDB;host=10.1.2.12'),
        'comments' =>
            array('mysql:dbname=MyDB;host=10.1.2.27',
                  'mysql:dbname=MyDB;host=10.1.2.28'),
        );

    // Static method to return a database connection to a certain pool
    public static function getConnection($pool) {
        // Make a copy of the server array, to modify as we go:
        $servers = self::$config[$pool];
        $connection = false;
        
        // Keep trying to make a connection:
        while (!$connection && count($servers)) {
            $key = array_rand($servers);
            try {
                $connection = new PDO($servers[$key], 
                    self::$user, self::$pass);
            } catch (PDOException $e) {}
            
            if (!$connection) {
                // Couldn’t connect to this server, so remove it:
                unset($servers[$key]);
            }
        }
        
        // If we never connected to any database, throw an exception:
        if (!$connection) {
            throw new Exception("Failed Pool: {$pool}");
        }
        
        return $connection;
    }
}
// Do something Comment related
$comments = DB::getConnection('comments');
. . . 

?>


As you can see, very few changes were needed. This of course was purposeful. Knowing that pooling was desirable, the original code was formulated to be extensible. Any code designed to randomly select from a “pool” of read slaves in the first place should be able to be extended to simply understand that there are multiple pools to choose from.

Of course, the comments in Part 1 apply to this code as well: You will probably want to encapsulate your logic within a greater database abstraction layer, add better error reporting, and perhaps extend the features offered.

MySQL Sharding

At this point, we’ve covered all the truly easy steps toward scalability. Hopefully, you’ve found solutions among them that work for you. Why? Because the next steps can be very painful from a coding perspective. I know this personally, for they are steps that we had to take when I worked at Digg.

If you need to scale further, it’s usually for one of a couple of reasons, all of which reflect various pain points or bottlenecks in your code. It might be because your tables have gotten gigantic, with tens or hundreds of millions of rows, and your queries are simply unable to complete quickly. It could be that your master database is overwhelmed with write traffic. Or perhaps you have some other pain point similar to these situations.

One possible solution is to shard your database. Sharding is the commonly used term for a tactic practiced by many of the well-known Web 2.0 sites, although many other people use the term partitioning. Partitioning is in fact a way to split individual tables into multiple parts within one machine; MySQL has support for some forms of this, where each partition has the same table schema, which makes the data division transparent to PHP applications.

So what is sharding? The simplest definition I can come up with is just this: “Breaking your tables or databases into pieces.”

Really, that’s it. You are taking everything you have and breaking it into smaller shards of data. For example, you could move last year’s infrequently accessed blog comments to a separate database. This allows you to scale more easily and to have faster response times since most queries look at less data. It can also help you scale write-throughput by breaking out into using multiple masters. This comes at a cost, though, on the programming side. You need to handle much more of your data logic within PHP, whether that is the need to access multiple tables to get the same data, or to simulate joins that you can no longer do in SQL.

Before we get into discussing the various forms of sharding and their benefits (plus drawbacks), there is something I should mention. This entire section addresses doing manual sharding and controlling it within your PHP code. Being a programmer myself, I like having that control, as it gives me the ultimate flexibility; in the past, it has also been the solution that won out in the end. There are some forms of sharding that MySQL can do for you via features such as federation and clustering. I do recommend that you explore those features to see if they might help you without taking on the extra PHP load. For now, let’s discuss the various forms of manual sharding.

Vertical Sharding

Vertical sharding is the practice of moving various columns of a table into different tables. (Many people don’t even think of this tactic as sharding per se, but simply as good old database design via normalizing tables.) There are various methodologies that can help here, primarily based around moving rarely used or often-empty columns into an auxiliary table. This keeps the important data that you reference in one smaller table. With a smaller, narrower table, queries execute faster and there is a greater chance that the entire table might fit into memory.

A good rule of thumb often is to move any data that is never used in a WHERE clause into an auxiliary table. The idea is that you want the primary table that you perform queries against to be as small and efficient as possible. You can later read the data from the auxiliary table after you know exactly which rows you need.

An example might be a users table. You may have columns in that table that refer to items you want to keep but rarely use – perhaps the user’s full name or address, which you keep for reference but never display on the Website or search against. Moving that data out of the primary table can therefore help.

The original table might look like:

white-php-part2-f2

A vertically sharded version of this would then look like:

white-php-part2-f3

It should be noted that vertical sharding is usually performed on the same database server. You aren’t typically moving the auxiliary table to another server to save space; you are breaking a large table up and making the primary query table much more efficient. Once you’ve found the row(s) in Users that you are interested in, you can get the extra information – if needed – via highly efficient ID lookups on the UsersExtra table.
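In code, the access pattern might look something like the sketch below. The column names are assumptions for illustration (the figures above aren’t reproduced in detail), but the shape is the point: search against the narrow Users table, then pull the auxiliary columns by ID only when a page actually needs them.

<?php
// Query the small, frequently used Users table first:
$db = DB::getConnection('primary');
$stmt = $db->prepare('SELECT id, username, email FROM Users WHERE username = ?');
$stmt->execute(array('someUser'));
$userRow = $stmt->fetch(PDO::FETCH_ASSOC);

$needFullDetails = true;   // hypothetical flag: does this page show the rarely used data?
if ($userRow && $needFullDetails) {
    // Highly efficient primary-key lookup against the auxiliary table:
    $extra = $db->prepare('SELECT fullname, address FROM UsersExtra WHERE user_id = ?');
    $extra->execute(array($userRow['id']));
    $extraRow = $extra->fetch(PDO::FETCH_ASSOC);
    if ($extraRow) {
        $userRow += $extraRow;   // merge the rarely used columns in
    }
}
?>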

Manual Horizontal Sharding

Now, if vertical sharding is breaking up the columns of a table into additional tables, horizontal sharding is breaking the rows of a table into different tables. The sole goal here is to take a table that has become so long as to be unwieldy (millions upon millions of rows) and break it into usable chunks.

If you are doing this manually, by spreading the sharded tables across multiple databases, it is going to have an effect on your code base. Instead of being able to look in one location for certain data, you will need to look in multiple tables. The trick to doing this with minimal impact on your code is to make sure that the methodology used to break your table up matches how the data is accessed in the code.

There are numerous common ways to do horizontal sharding manually. One is range-based, wherein the first million row IDs are in the first table, the next million in table two, and so on. This works fine as long as you always know the ID of the row you want and can therefore have logic that picks the right table. A drawback of this method is that you’ll need to create new tables as thresholds are crossed. That means either building table-creation code into your application, or always making sure that you have a “spare” table and processes to alert you when it starts being used.

Another common solution, which avoids the need to make new tables on the fly, is to interlace the rows: for example, using the modulus function on your row IDs to divide them evenly across a pre-set number of tables. If you did this with three tables, the first row would be in table 1, the second row in table 2, the third in table 3, and the fourth row back in table 1.

This methodology still requires you to know the ID in order to access the data, but it removes the concern of ever needing a new table. At the same time, it comes with a cost: if you choose to break into three tables and eventually those tables get too large, you will reach the painful point of needing to redistribute your data yet again.
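A minimal sketch of the interlaced approach might look like this (the table names, shard count, and use of the pooling class from earlier are illustrative choices, not from the original article):

<?php
// Three pre-set shard tables: Posts_0, Posts_1, Posts_2 (illustrative names).
define('POST_SHARDS', 3);

function postsTableFor($postId) {
    // Interlaced rows: id 1 -> Posts_1, id 2 -> Posts_2, id 3 -> Posts_0, id 4 -> Posts_1, ...
    return 'Posts_' . ($postId % POST_SHARDS);
}

$postId = 42;                                   // you must already know the ID
$slave  = DB::getConnection('primary');
$stmt   = $slave->prepare('SELECT * FROM ' . postsTableFor($postId) . ' WHERE id = ?');
$stmt->execute(array($postId));
$post = $stmt->fetch(PDO::FETCH_ASSOC);
?>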

Beyond ID-based sharding, some of the best sharding techniques key off data points that closely match how the data will be accessed. For example, in a typical blog or news application, you will almost always know the date of any post that you are trying to access. Most blog permalinks store the year and month within them, such as http://myblog.example.com/2011/02/Post, which means that any time you are going to read in a post, you already know the year and month. Therefore, you could partition the data based upon the date, making a separate table for each year, or perhaps a separate table for each year/month combination.

This does mean that you need to understand the logic for other access needs as well. On the homepage of your Website, you might want to list the last five posts. To do this now, you might need to access two tables (or more). You’d first attempt to read in five posts from the current year/month table, and if you didn’t find all five, start iterating back in time, querying each table in turn until you’ve found all that you need.

In this date-based setup, you still need to make new tables on the fly, or have them pre-made and waiting. At least in this situation you know exactly when you will need new tables, and you could even have them pre-made for the next few years, or have a cron job that creates them on a set schedule.
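Here is a sketch of how the homepage logic described above might look with per-month tables. The Entries_YYYY_MM naming scheme, the posted column, and the assumption that a table exists for every month back to the blog’s first year are all illustrative choices, not from the original article:

<?php
function entriesTableFor($year, $month) {
    return sprintf('Entries_%04d_%02d', $year, $month);
}

$slave  = DB::getConnection('primary');
$posts  = array();
$year   = (int) date('Y');
$month  = (int) date('n');
$oldest = 2005;                // the year the blog started (illustrative)

// Walk backwards one month at a time until we have collected five posts:
while (count($posts) < 5 && $year >= $oldest) {
    $need  = 5 - count($posts);
    $table = entriesTableFor($year, $month);
    $stmt  = $slave->query("SELECT * FROM {$table} ORDER BY posted DESC LIMIT {$need}");
    foreach ($stmt as $row) {
        $posts[] = $row;
    }
    if (--$month == 0) {       // step back into December of the previous year
        $month = 12;
        $year--;
    }
}
?>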

Other data columns can be used for sharding as well. Let’s explore another possibility using the Users table from the vertical sharding example. In this application we might realize that we will always know the username when we go to access a specific user’s information, perhaps because they have to log in and provide it first. Therefore, we could shard the table based upon the usernames, putting different sections of the alphabet into different tables. A simple two-table shard could look like:

white-php-part2-f4

This kind of horizontal sharding can be used within a single database just to make tables more manageable. More often, though, the tactic begins to see use when you have so much data that you need to start breaking it out into multiple databases, whether for storage or for throughput. In that case, once sharded, you can easily move each table onto its own master with its own slave pool, increasing not only your read throughput but your write throughput as well.
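Routing by username is then just string logic in your data layer. A minimal sketch of the two-table split shown above (the table names are illustrative) could be:

<?php
// Pick the shard table from the first letter of the username (A-M vs. N-Z).
function usersTableFor($username) {
    $first = strtoupper(substr($username, 0, 1));
    return ($first <= 'M') ? 'Users_AM' : 'Users_NZ';
}

$username = 'someUser';         // known because the user supplied it at login
$stmt = DB::getConnection('primary')->prepare(
    'SELECT * FROM ' . usersTableFor($username) . ' WHERE username = ?');
$stmt->execute(array($username));
$user = $stmt->fetch(PDO::FETCH_ASSOC);
?>

If the two tables later move onto separate masters, the same function can return a pool name instead of (or along with) a table name.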

Application-Level Sharding

At this point we’ve discussed the ideas of breaking your tables up both by rows and by columns. The next step, which I will refer to as application-level sharding, is explicitly taking tables from your database and moving them onto different servers (different master/slave setups).

The sole reason for this is to expand your write throughput and to reduce overall load on the servers by isolating queries. This of course comes with – as all sharding does – the pain of rewriting your code to understand where the data is stored, and how to access it now that you’ve broken it up.

The interesting thing about application-level sharding is that if you’ve already implemented “pool” code (as shown previously), that exact same code can be used to access application shards as well. You just need to make a separate pool for each master and make pools for each set of shard slaves. The code needs to ask for a connection to the right pool for the shard that holds your data. The only real difference from before is that you have to be extra careful with your queries, since each pool isn’t a copy of the same data.

One of the best tactics when looking to break your database up is to base it upon a natural separation of how your application accesses data. Keep related tables together on the same server so that you don’t lose the ability to quickly access similar data through the same connection (and perhaps even keep the ability to do joins across those related tables).

For example, let’s look at this typical blog database structure and how we might shard it:
 

white-php-part2-f5

Figure 5
 Keeping related tables together

As you can see, we’ve broken this database into two application-level shards. One contains the Users and Profiles tables; the other contains the blog Entries and Comments tables. The related data is kept together: with one database connection you can still read both core user information and profiles, and with another connection you can read the blog entries and their associated comments – and potentially even join between those tables if needed.
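Tying this back to the pooling code from earlier, the configuration for this split might look like the sketch below. The hosts and pool names are illustrative, and getConnection() is unchanged from the pooling example:

<?php
class DB {
    // One write pool (master) and one read pool (slaves) per application shard:
    private static $config = array(
        'users_write' => array('mysql:dbname=Users;host=10.1.3.1'),
        'users_read'  => array('mysql:dbname=Users;host=10.1.3.5',
                               'mysql:dbname=Users;host=10.1.3.6'),
        'blog_write'  => array('mysql:dbname=Blog;host=10.1.4.1'),
        'blog_read'   => array('mysql:dbname=Blog;host=10.1.4.5',
                               'mysql:dbname=Blog;host=10.1.4.6'),
    );

    // . . . getConnection() exactly as in the pooling example . . .
}

// Users and Profiles live on one shard, Entries and Comments on the other:
$users = DB::getConnection('users_read');
$blog  = DB::getConnection('blog_read');
?>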

Conclusion

In the end, all of these techniques – pooling and various forms of sharding – can be applied to your system to help it scale beyond its current limitations. They should all be used sparingly and only when needed.

While setting your system up for MySQL database pooling is actually fairly easy and painless, sharding is a different matter, and in my opinion it should only be attempted once you have no other options available to you. Every implementation of sharding requires additional coding work. The logic used to conceptually break your database or tables up needs to be translated into code. You are left without the ability to do database joins or can do them only in a limited fashion. You can be left without the ability to do single queries that search through all of your data. Plus you have to encapsulate all of your database access behind more abstraction layers that understand the logic of where the data actually exists now, so that you can change it in the future again if needed.

For that reason, only do the parts that directly match your own application’s needs and infrastructure. It’s a step that a MySQL-powered Website may need to take when data or traffic levels rise beyond manageable sizes. But it is not a step to be taken lightly.



Eli White
 is a longtime PHP user and the author of the book PHP 5 in Practice. He has worked on many large-scale PHP projects, including Digg, TripAdvisor, and the Hubble Space Telescope program. He frequently speaks at PHP conferences to share his knowledge. More about Eli can be found at eliw.com.

 

 

Creds: http://www.oracle.com/technetwork/articles/dsl/white-php-part1-355135.html
