From FreeBSDwiki

Latest revision as of 17:55, 25 August 2012

Overview of Array Types

RAID1: Simple Mirroring

While RAID1 offers slow write speeds and even some read performance disadvantages when compared to array types such as RAID3, RAID5, or RAID0+1, it does offer one potentially gigantic advantage that the more complex arrays can't match: simple survivability. If you lose a RAID controller, lose one or more drives, or even forget how RAID works or what it means, any single surviving member drive of a RAID1 array may be hooked up to a normal drive controller and operated as a standalone drive.

RAID1 is the performance leader when it comes to massively parallel read processes, but offers little or no read performance benefit to a lightly loaded server or workstation which is usually only serving one request at a time.

RAID0+1: The Mirrored Stripe Set

RAID0, unlike RAID1, offers drastically improved write performance and also drastically improved single-read performance. However, RAID0 offers no redundancy of any kind, meaning that loss of a single member drive means irrevocable loss of all data on the array. RAID0+1 is an attempt to offer the best of both worlds, by creating a stripe set of mirrored pairs (a layout that is often labeled RAID1+0 or RAID10 elsewhere) - offering most of the advantages of both protocols with few of the flaws of either. You must lose at least two member drives to cause data loss in a RAID0+1 array, and further, the two drives must both be members of a single mirrored pair, making this an extremely survivable array type in regard to member failure.
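The survivability claim can be checked by brute force. The following is only a sketch, assuming a four-drive stripe set built from two mirrored pairs; the device names are hypothetical placeholders and nothing here touches real disks.

```python
from itertools import combinations

def fatal_double_failures(mirror_pairs):
    """Return the two-drive failures that wipe out an entire mirrored pair."""
    drives = [drive for pair in mirror_pairs for drive in pair]
    whole_pairs = {frozenset(pair) for pair in mirror_pairs}
    return [combo for combo in combinations(drives, 2)
            if frozenset(combo) in whole_pairs]

# Hypothetical device names, for illustration only.
pairs = [("ad4", "ad6"), ("ad8", "ad10")]
all_double = list(combinations([d for p in pairs for d in p], 2))
fatal = fatal_double_failures(pairs)
print(f"{len(fatal)} of {len(all_double)} two-drive failures are fatal")
```

Under these assumptions only the failures that hit both halves of the same mirror lose data; every other two-drive failure leaves one copy of each stripe intact.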

While RAID0+1 does offer high performance and high survivability, it also requires a lot of drives for a relatively small array, and still can't offer the "plug any single member in as a singleton and go" safety blanket of pure RAID1. Like the other striping array types, if you lose the RAID controller itself you may have a scary nail-biter of a time trying to recover the array on replacement hardware.

RAID3/RAID4: Stripe Set with Parity Member

RAID3 is a RAID0 array with an extra member which gets a parity chunk written to it for each chunk written to the data members. RAID4 is just like RAID3, but uses block-level parity rather than byte-level parity. RAID3/RAID4 offer significant performance benefits across the board, from write performance to single-read performance to multi-read performance, although the multi-read boost is not as large as the one proper RAID1 implementations offer. The parity member also allows you to lose any single drive without losing the array, but loss of any second member is sufficient to kill the array.
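To make the parity idea concrete, here is a minimal sketch of XOR parity in general. It is not graid3's actual code; the chunk contents and function names are made up for illustration.

```python
from functools import reduce

def xor_parity(chunks):
    """Return the byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def rebuild_missing(surviving_chunks, parity):
    """Recover the single missing data chunk from the survivors plus parity."""
    return xor_parity(list(surviving_chunks) + [parity])

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # chunks on four data members
parity = xor_parity(data)                    # chunk written to the parity member

# Pretend the third data member died, then rebuild its chunk.
recovered = rebuild_missing(data[:2] + data[3:], parity)
print(recovered)
```

Because XOR is its own inverse, any one missing chunk can be rebuilt from the rest. A second missing chunk leaves too little information, which is why losing a second member kills the array.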

Use of the parity member in read operations is optional with FreeBSD's RAID3 implementation (graid3), but testing seems to show round-robin parity member reads offering little to no performance benefit with potentially severe performance decreases, so it doesn't seem to be a good idea for most settings.

Graid3 only allows array sizes of 2^n+1 members - i.e. 3, 5, 9, and so on. This may be a significant handicap to those trying to squeeze the absolute most out of an array without breaking the bank. Linux's RAID4 does not suffer from this limitation.
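The member counts listed above (3, 5, 9) are each one more than a power of two: a power-of-two number of data members plus one parity member. A small sketch of that check follows, assuming the pattern continues in the same way (17, 33, ...); the function name is hypothetical and not part of graid3.

```python
def is_valid_graid3_count(members):
    """True if members is a power-of-two count of data drives plus one parity drive."""
    data_members = members - 1
    return data_members >= 2 and (data_members & (data_members - 1)) == 0

valid = [m for m in range(1, 20) if is_valid_graid3_count(m)]
print(valid)
```

With five identical drives on hand, all five can be used; with four or six, at least one drive is left out of the array.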

RAID3 is extremely rare outside the FreeBSD world - I have neither seen nor heard of it in actual use other than with graid3 under FreeBSD. It is similarly rare to see RAID4 outside Linux's implementation.

RAID5: Stripe Set with Distributed Parity

RAID5 is extremely similar to RAID3, except that parity blocks are distributed among all member drives. Where a RAID3 array with 5 member drives writes 4 data blocks to members 1,2,3,4 and a parity block to member 5 on every cycle, a RAID5 array with 5 member drives still writes 4 data blocks and 1 parity block per cycle but rotates the member that receives parity: for example, parity might land on member 5 for the first cycle and member 4 for the next, with the data blocks going to the remaining four members each time.
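The sketch below prints one possible rotating-parity layout. It assumes four data blocks plus one parity block per stripe on five members, with parity starting on the last member and moving one member to the left each stripe. Real implementations offer several rotation schemes, so treat this ordering as illustrative only.

```python
def raid5_layout(members, stripes):
    """Return one row per stripe; 'P' marks parity, 'Dn' marks data block n."""
    rows, block = [], 0
    for stripe in range(stripes):
        # Assumed rotation: parity starts on the last member and moves left.
        parity_member = (members - 1 - stripe) % members
        row = []
        for member in range(members):
            if member == parity_member:
                row.append("P")
            else:
                row.append(f"D{block}")
                block += 1
        rows.append(row)
    return rows

layout = raid5_layout(members=5, stripes=5)
for row in layout:
    print(" ".join(f"{cell:>3}" for cell in row))
```

After five stripes every member has held parity exactly once, so no single drive becomes a dedicated parity bottleneck the way the RAID3/RAID4 parity member can.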

RAID5 and RAID3 are obviously extremely similar. Although RAID5 has dominated the industry as a whole for decades, there has been some controversy in recent years over their relative performance levels. We will test both types in this article.

FreeBSD does not currently support software RAID5 at all. Even inexpensive "hardware" controllers such as the Nvidia onboard types (which use the CPU to XOR data blocks to generate parity and vice versa, and are frequently referred to as "fakeraid") are unsupported. True hardware controllers which do not use the CPU for parity calculation are, of course, supported.

Results

RAID1 Mirror Performance

Gmirror, unfortunately, is not doing well at all at this time. 2-drive and 3-drive gmirror arrays performed grossly worse than even a single baseline drive, with a 5-drive gmirror managing to outperform the baseline 250GB drive tested but being handily beaten by the 500GB baseline drive, the Nvidia onboard RAID1 implementation, and especially the Linux RAID1 implementations, which handily dominated everything across the board.

Only results for gmirror's round-robin balance algorithm are shown here, because the load and split balance algorithms performed even more poorly than round-robin. Results for split are available as raw data if you click the image, but were not included on the graph itself. Load results are not available because initial testing showed it performing even worse than split and so the tests were not allowed to complete.

It is interesting to note that the Nvidia, Promise, and Linux RAID1 implementations all display a significant variation in how they handle simultaneous processes - all three exhibited differences up to 15, 30, and even 38 seconds in times to process otherwise identical simultaneously begun cp processes to /dev/null. While gmirror's sheer performance is abysmal, it is worth noting that it does at least handle processes consistently; it never finished processes more than a few hundred milliseconds apart.

The Promise TX-2300 RAID1 implementation was just plain poor: it performed nearly as badly as gmirror and failed to improve on the scores of the single baseline drive, while still turning in the same oddly inconsistent times that the vastly higher-performing arrays did.

The Gmirror and Linux implementations were the only ones tested which allowed RAID1 arrays with more than two member drives.

Complex Array Performance

Graid3 is doing noticeably better than gmirror. The 5-drive Graid3 implementation handily outperformed the 2-drive mirror across the board, and the 3-drive Graid3 implementation performed somewhat slower than the Linux or Nvidia 2-drive RAID1 arrays in the 2- and 3-process tests and significantly slower in the 4-process and 5-process tests, but nearly doubled their single-process performance. Also, even the 3-member graid3 array (which only has two actual data members) handily outperformed the individually much faster 500GB baseline drive, which both the 2- and 3-member gmirror arrays inexplicably failed to do. With that said, however, Linux's RAID4 array came very close to its single process performance and beat it like a drum across the rest of the board.

Nvidia's RAID0+1 presents a compelling challenge, clearly outperforming the 5-drive graid3 array in the 3, 4, and 5 process tests - but equally significantly, it's roughly on par for the 2-process test and drastically slower on the single-process test, even getting outperformed by the 3-drive RAID3 array. We'll go into more of what that means for the real world in the overall high performance section next.

Linux's RAID5 and RAID4 implementations both handily outperform 6.2-RELEASE's RAID3 under heavy load, but for single process performance graid3 is the pick of the litter out of all systems tested.

Only results for graid3's default configuration are shown here, because the -r configuration (always use the parity member during reads) performed slightly to significantly poorer in all but the 5-process test, in which it performed only very slightly better. It is possible that a more massively parallel test (or a test of a much less contiguous filesystem) would show some advantage to -r, but for these tests, no advantage is apparent. Raw data is available on the image page itself.

High Performance RAID

In this section we'll look briefly at the best performers from both the mirrors and the complex arrays tested previously. This graph particularly shows how important it is to understand the target environment when designing a storage array: we have a bewildering collection of performances completely unlike one another here.

While the 5-member Linux RAID1 array clearly dominates the field for massively multiple reads, its performance in the 1- and 2-process tests leaves a lot to be desired. Most servers, much less workstations, in the small-office arena that would be seriously considering software RAID in the first place are probably not going to put anywhere near that heavy a load on a server, making the complex arrays look much more practical. Worse, although there are no graphs up yet, the write performance on RAID1 arrays is abysmal - at absolute best, they're a little below baseline for a single member drive. This array would be a good choice for a relatively small (the size of the array is only the size of a single member drive, remember) but massively loaded database that needed to serve large numbers of concurrent requests all day long, or possibly for a medium-load server with a truly paranoid admin, but otherwise if you're going to burn five drives on an array you would probably be served better with one of the complex (striped) types.

For a lightly loaded workstation or server, FreeBSD's graid3 is at least respectable, turning in the highest single-process performance and not doing too shabbily in two-process performance.

The RAID0+1 configuration is a solid choice for truly paranoid administrators, with decent though not overwhelming performance clear across the board, some write acceleration, and extreme survivability (at least two drive failures are needed to bring down the array). However, Linux RAID5 and RAID4 outperform it all across the board, and FreeBSD RAID3 outperforms it for 1 or 2 concurrent processes. RAID5, RAID4, and RAID3 also offer better write acceleration and produce an array twice the size of the 0+1 with roughly the same number of member drives.
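The capacity comparison can be made explicit with a rough sketch, assuming equal-sized member drives and ignoring metadata overhead. The level names are just labels for this example.

```python
def usable_drives(level, members):
    """Usable capacity, in units of one member drive, for equal-sized members."""
    if level == "raid1":
        return 1                  # every member holds the same data
    if level == "raid0+1":
        return members // 2       # half the members are mirror copies
    if level in ("raid3", "raid4", "raid5"):
        return members - 1        # one member's worth of space goes to parity
    raise ValueError(f"unknown level: {level}")

print(usable_drives("raid0+1", 4))  # four drives in a mirrored stripe set
print(usable_drives("raid5", 5))    # five drives with parity
```

With 250GB members, a four-drive 0+1 array would give roughly 500GB while a five-drive parity array would give roughly 1000GB, which is where the "twice the size" figure comes from.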

Finally, Linux's RAID5 and RAID4 arrays are truly solid choices across the board: they outperform everything but graid3 on the single-process test, outperform all other contenders on the 2-process test, and outperform everything but the 5-member RAID1 across the rest of the board. When comparing these two to one another, RAID4 seems like the clear winner despite RAID5's much larger popularity: roughly 25% higher performance on the single-process test means a significant performance bump for most server and workstation workloads, with the other tests remaining about on par with RAID5.

Conclusions

Unfortunately, FreeBSD as of 6.2-RELEASE is clearly lagging behind the times in the software RAID department. The good news is, in several years of this tester's usage of gmirror it has proven perfectly reliable and easy to set up and use. The bad news is, it is an absolute performance dog, lagging behind a single baseline drive in every possible test.

Graid3 performs much better than gmirror, and manages to take top honors of all arrays tested in single-process read performance, but it suffers from immediate and drastic performance penalties under heavier load. Linux's RAID4, on the other hand, comes extremely close to graid3 in single-process performance, and both Linux RAID4 and Linux RAID5 roughly double graid3's performance across the rest of the board.

In short, if you have other good reasons to want to run FreeBSD on your server, graid3 is a workable if unimpressive solution to add some fault tolerance and a significant amount of performance to your hard drive storage. But if array performance is a higher priority in and of itself than choice of *nix - for example, in a simple Samba server - you will likely be better served running a modern Linux. With any luck, this will change in later versions of FreeBSD, but for the moment the numbers brook no argument.

Equipment

FreeBSD 6.2-RELEASE (amd64)
Ubuntu Server 7.04 (amd64)
Athlon X2 5000+
2GB DDR2 SDRAM
Nvidia nForce MCP51 SATA 300 onboard RAID controller
Promise TX2300 SATA 300 RAID controller
3x Western Digital 250GB drives (WDC WD2500JS-22NCB1 10.02E02 SATA-300)
2x Western Digital 500GB drives (WDC WD5000AAKS-00YGA0 12.01C02 SATA-300)

Methodology

The read-ahead cache was changed from the default value of 8 to 128 for all tests performed, using sysctl -w vfs.read_max=128. Initial testing showed that increasing vfs.read_max produced dramatic performance gains for all tested configurations, including the baseline single drive. The value of 128 was arrived at by doubling vfs.read_max until no further significant performance increase was seen (at vfs.read_max=256) and then backing down to the previous value.
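The doubling procedure can be sketched as a small search loop. This is an illustration only: the measure callback, the 5% threshold for "significant", and the stand-in benchmark are all assumptions, not part of the original test setup.

```python
def tune_by_doubling(measure, start, min_gain=0.05, limit=65536):
    """Double a setting until throughput stops improving by at least min_gain.

    measure(value) must return a throughput figure for that setting.
    Returns the last value that still produced a significant gain.
    """
    best_value, best_speed = start, measure(start)
    value = start * 2
    while value <= limit:
        speed = measure(value)
        if speed < best_speed * (1 + min_gain):
            break                       # no significant gain: back down
        best_value, best_speed = value, speed
        value *= 2
    return best_value

# Stand-in benchmark whose throughput plateaus at a setting of 128.
fake_measure = lambda value: min(value, 128)
print(tune_by_doubling(fake_measure, start=8))
```

In a real run, measure would apply the setting (for example through sysctl) and time an actual read test, which is far noisier than this stand-in, so repeated measurements per value would be advisable.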

Similarly, for the Linux tests the read-ahead cache was changed from the default value of 256 to 4096, using hdparm -a4096 /dev/md0. Baseline drive performance was not tested under Linux, but extremely erratic initial test results on the first RAID1 configuration tested led me to research Linux disk performance tuning so as to make a completely fair comparison. The 4096 value was arrived at by successive doubling and testing until the highest performing value was found, then testing against 3/4 its value. The RAID1 array was created using the command mdadm --create /dev/md0 --level raid1 -n 5 --assume-clean /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf, and subsequently shrunk to three members and then to two members as testing completed.

When the Linux RAID5 array was created (mdadm --create /dev/md0 --level raid5 -n 5 --assume-clean /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf) I noticed the system changed the default read-ahead cache value to 1024 - quadruple what it had been for the RAID1 array or for individual drives. Upon retesting, I arrived at hdparm -a8192 /dev/md0 as the sweet spot for a 5-member Linux RAID5 array.

For the actual testing, 5 individual 3200MB files, random1.bin through random5.bin, were created on each tested device or array using dd if=/dev/random bs=16m count=200. These files were then cp'ed from the device or array to /dev/null. Elapsed times were generated by echoing a timestamp immediately before beginning the test and immediately at the end of each individual process, then subtracting the beginning timestamp from the timestamp of the last process to complete. Speeds shown are simply the amount of data in MB copied to /dev/null (3200, 6400, 9600, 12800, or 16000) divided by the total elapsed time.
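The speed calculation described above amounts to the following sketch. The timestamps in the example are invented purely to show the arithmetic and are not measured results.

```python
def aggregate_mb_per_s(start, finish_times, file_mb=3200):
    """Total MB copied divided by the time until the last process finished.

    start and finish_times are timestamps in seconds; one finish time per
    concurrent cp process, each copying one file of file_mb megabytes.
    """
    elapsed = max(finish_times) - start
    return len(finish_times) * file_mb / elapsed

# Invented example: two concurrent copies finishing 40 s and 50 s after the start.
speed = aggregate_mb_per_s(0.0, [40.0, 50.0])
print(f"{speed:.1f} MB/s")
```

Since the slowest process sets the elapsed time, an implementation that finishes its processes far apart is charged for the straggler.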

Notes

The methodology used produces a very highly contiguous filesystem, which may skew results significantly higher than in some real-world settings - particularly in the single-process test. Presumably the multiple process copy tests would be much less affected by fragmentation in real-world filesystems, since by their nature they require a significant amount of drive seeks between blocks of the individual files being copied throughout the test.

In the 5-drive Graid3 array tested, the (significantly faster) 500GB drives were positioned as the last two elements of the array. This is significant particularly because this means the parity drive was noticeably faster than 3 of the 4 data drives in this configuration; some other testing on equipment not listed here leads me to believe that this had a favorable impact when using the -r configuration. There was not, however, enough of an improvement to make the -r results worth including on the graph.

Write performance was also tested on each of the devices and arrays listed and will be included in graphs at a later date (for now, raw data is available in the discussion page).

Googling "gmirror performance" and "gmirror slow" did not get me much of a return; just one other individual wondering why his gmirror was so abominably slow - so I reformatted the test system with 6.2-RELEASE (i386) and retested. Unfortunately, the gmirror results did not improve with the change of platform back to i386. It strikes me as very odd that graid3 with only 3 drives (therefore only 2 data drives) outperforms even a five-drive gmirror implementation. And in sharp contrast to gmirror, of course, the Linux kernel RAID1 results speak for themselves.


RAID, performance tests