Posted by hansivers on Fri 21 Apr 2006 at 11:10
There are a lot of Linux filesystem comparisons available, but most of them are anecdotal, based on artificial tasks, or completed under older kernels. This benchmark essay is based on 11 real-world tasks appropriate for a file server with older-generation hardware (Pentium II/III, EIDE hard drive).
Since its initial publication, this article has generated a lot of questions, comments and suggestions to improve it. Consequently, I'm currently working hard on a new batch of tests to answer as many questions as possible (within the original scope of the article). Results will be available in about two weeks (May 8, 2006). Many thanks for your interest and keep in touch with Debian-Administration.org! Hans
Why another benchmark test?
I found two quantitative and reproducible benchmark studies using the 2.6.x kernel (see References). Benoit (2003) implemented 12 tests using large files (1+ GB) on a Pentium II 500 server with 512MB RAM. This test was quite informative, but the results are beginning to age (kernel 2.6.0) and apply mostly to settings that manipulate large files exclusively (e.g., multimedia, scientific, databases).
Piszcz (2006) implemented 21 tasks simulating a variety of file operations on a PIII-500 with 768MB RAM and a 400GB EIDE-133 hard disk. To date, this testing appears to be the most comprehensive work on the 2.6 kernel. However, since many tasks were "artificial" (e.g., copying and removing 10 000 empty directories, touching 10 000 files, splitting files recursively), it may be difficult to transfer some conclusions to real-world settings.
Thus, the objective of the present benchmark testing is to complement some of Piszcz's (2006) conclusions by focusing exclusively on real-world operations found in small-business file servers (see Tasks description).
Test settings
Description of selected tasks
The sequence of 11 tasks (from creation of the FS to unmounting the FS) was run as a Bash script, which was completed three times (the average is reported). Each sequence takes about 7 minutes. The time to complete each task (in seconds), the percentage of CPU dedicated to the task and the number of major/minor page faults during the task were measured with the GNU time utility (version 1.7).
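As an illustration only, here is a minimal sketch of how one of the tasks could be wrapped with GNU time inside such a script (the paths, file names and mount point are hypothetical, not the exact script used for this benchmark):

    #!/bin/bash
    # Hypothetical excerpt of a benchmark script: time one task with GNU time.
    # Format: %e = elapsed seconds, %P = CPU percentage,
    #         %F = major page faults, %R = minor page faults.
    MNT=/mnt/test                          # test partition mount point (assumed)
    LOG=/root/bench.log

    /usr/bin/time -f "copy_iso %e s %P cpu %F maj %R min" -a -o "$LOG" \
        cp /data/image.iso "$MNT/"         # task: initial copy of the large ISO
    sync                                   # flush the page cache so writes are counted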
RESULTS
Partition capacity
Initial (after filesystem creation) and residual (after removal of all files) partition capacity was computed as the ratio of the number of available blocks to the number of blocks on the partition. Ext3 has the worst initial capacity (92.77%), while the other filesystems preserve almost full partition capacity (ReiserFS = 99.83%, JFS = 99.82%, XFS = 99.95%). Interestingly, the residual capacity of Ext3 and ReiserFS was identical to the initial one, while JFS and XFS lost about 0.02% of their partition capacity, suggesting that these FS can dynamically grow but do not completely return to their initial state (and size) after file removal.
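For reference, one way to read a comparable capacity ratio is a df one-liner; a hedged sketch (the mount point is hypothetical):

    # Ratio of available blocks to total blocks on the test partition
    # (hypothetical mount point /mnt/test; df -P prints one record per filesystem).
    df -P /mnt/test | awk 'NR==2 { printf "%.2f%%\n", 100 * $4 / $2 }'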
Conclusion : To make maximum use of your partition capacity, choose ReiserFS, JFS or XFS.
File system creation, mounting and unmounting
The creation of the FS on the 20GB test partition took 14.7 secs for Ext3, compared to 2 secs or less for the other FS (ReiserFS = 2.2, JFS = 1.3, XFS = 0.7). However, ReiserFS took 5 to 15 times longer to mount the FS (2.3 secs) when compared to the other FS (Ext3 = 0.2, JFS = 0.2, XFS = 0.5), and also 2 times longer to umount the FS (0.4 sec). All FS took comparable amounts of CPU to create the FS (between 59% - ReiserFS and 74% - JFS) and to mount the FS (between 6 and 9%). However, Ext3 and XFS took about 2 times more CPU to umount (37% and 45%), compared to ReiserFS and JFS (14% and 27%).
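To make the measured steps concrete, a hedged sketch of the kind of commands involved, each timed with GNU time (the partition name is hypothetical, XFS is shown as an example, and the exact mkfs options used for the benchmark are not reproduced here):

    # Illustrative creation/mount/umount steps, timed with GNU time
    # (hypothetical partition /dev/hdb1; same pattern for the other filesystems).
    /usr/bin/time -f "mkfs   %e s, %P cpu" mkfs.xfs -f /dev/hdb1
    /usr/bin/time -f "mount  %e s, %P cpu" mount /dev/hdb1 /mnt/test
    /usr/bin/time -f "umount %e s, %P cpu" umount /mnt/test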
Conclusion : For quick FS creation and mounting/unmounting, choose JFS or XFS.
Operations on a large file (ISO image, 700MB)
The initial copy of the large file took longer on Ext3 (38.2 secs) and ReiserFS (41.8) when compared to JFS and XFS (35.1 and 34.8). The recopy on the same disk favoured XFS (33.1 secs) when compared to the other FS (Ext3 = 37.3, JFS = 39.4, ReiserFS = 43.9). The ISO removal was about 100 times faster on JFS and XFS (0.02 sec for both), compared to 1.5 sec for ReiserFS and 2.5 sec for Ext3! All FS took comparable amounts of CPU to copy (between 46 and 51%) and to recopy the ISO (between 38 and 50%). ReiserFS used 49% of CPU to remove the ISO, while the other FS used about 10%. There was a clear trend for JFS to use less CPU than any other FS (about 5 to 10% less). The number of minor page faults was quite similar between FS (ranging from 600 - XFS to 661 - ReiserFS).
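For illustration, the large-file sequence amounts to commands of this kind (paths are hypothetical, not the exact script):

    # Illustrative large-file sequence:
    /usr/bin/time -f "copy   %e s %P cpu" cp /data/image.iso /mnt/test/iso1   # initial copy
    /usr/bin/time -f "recopy %e s %P cpu" cp /mnt/test/iso1 /mnt/test/iso2    # recopy on same disk
    /usr/bin/time -f "remove %e s %P cpu" rm /mnt/test/iso1 /mnt/test/iso2    # removal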
Conclusion : For quick operations on large files, choose JFS or XFS. If you need to minimize CPU usage, prefer JFS.
Operations on a file tree (7500 files, 900 directories, 1.9GB)
The initial copy of the tree was quicker for Ext3 (158.3 secs) and XFS (166.1) when compared to ReiserFS and JFS (172.1 and 180.1). Similar results were observed during the recopy on the same disk, which favoured Ext3 (120 secs) compared to the other FS (XFS = 135.2, ReiserFS = 136.9 and JFS = 151). However, the tree removal took about 2 times longer for Ext3 (22 secs) when compared to ReiserFS (8.2 secs), XFS (10.5 secs) and JFS (12.5 secs)! All FS took comparable amounts of CPU to copy (between 27 and 36%) and to recopy the file tree (between 29% - JFS and 45% - ReiserFS). Surprisingly, ReiserFS and XFS used significantly more CPU to remove the file tree (86% and 65%) while the other FS used about 15% (Ext3 and JFS). Again, there was a clear trend for JFS to use less CPU than any other FS. The number of minor page faults was significantly higher for ReiserFS (total = 5843) when compared to the other FS (1400 to 1490). This difference appears to come from a 5 to 20 times higher rate of page faults for ReiserFS in the recopy and removal of the file tree.
Conclusion : For quick operations on a large file tree, choose Ext3 or XFS. Benchmarks from other authors have supported the use of ReiserFS for operations on large numbers of small files. However, the present results on a tree comprising thousands of files of various sizes (10KB to 5MB) suggest that Ext3 or XFS may be more appropriate for real-world file server operations. Even if JFS minimizes CPU usage, it should be noted that this FS comes with significantly higher latency for large file tree operations.
Directory listing and file search in the previous file tree
The complete (recursive) directory listing of the tree was quicker for ReiserFS (1.4 secs) and XFS (1.8) when compared to Ext3 and JFS (2.5 and 3.1). Similar results were observed during the file search, where ReiserFS (0.8 sec) and XFS (2.8) yielded quicker results compared to Ext3 (4.6 secs) and JFS (5 secs). Ext3 and JFS took comparable amounts of CPU for directory listing (35%) and file search (6%). XFS took more CPU for directory listing (70%) but comparable amount for file search (10%). ReiserFS appears to be the most CPU-intensive FS, with 71% for directory listing and 36% for file search. Again, the number of minor page faults was 3 times higher for ReiserFS (total = 1991) when compared to other FS (704 to 712).
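For illustration, the listing and search tasks correspond to commands of this kind (the tree path and search pattern are hypothetical):

    # Illustrative listing and search tasks; GNU time writes its report to stderr,
    # so the command output can be discarded without losing the measurements.
    /usr/bin/time -f "listing %e s %P cpu" ls -lR /mnt/test/tree > /dev/null
    /usr/bin/time -f "search  %e s %P cpu" find /mnt/test/tree -name '*.png' > /dev/null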
Conclusion : Results suggest that, for these tasks, filesystems can be grouped as (a) quick but more CPU-intensive (ReiserFS and XFS) or (b) slower but less CPU-intensive (Ext3 and JFS). XFS appears to be a good compromise, with relatively quick results, moderate CPU usage and an acceptable rate of page faults.
OVERALL CONCLUSION
These results replicate previous observations from Piszcz (2006) about the reduced disk capacity of Ext3, the longer mount time of ReiserFS and the longer FS creation time of Ext3. Moreover, both previous reviews, like this report, observed that JFS has the lowest CPU usage of the tested FS. Finally, this report appears to be the first to show the high page-fault activity of ReiserFS on most usual file operations.
While each filesystem has its relative merits, only one filesystem can be installed on each partition/disk. Based on all the testing done for this benchmark essay, XFS appears to be the most appropriate filesystem to install on a file server for home or small-business needs:
While Piszcz (2006) did not explicitly recommend XFS, he concluded: "Personally, I still choose XFS for filesystem performance and scalability". I can only support this conclusion.
References
Benoit, M. (2003). Linux File System Benchmarks.
Piszcz, J. (2006). Benchmarking Filesystems Part II. Linux Gazette, 122 (January 2006).
| FILESYSTEM | TIME | DISK USAGE |
|---|---|---|
| REISER4 (lzo) | 1,938 | 278 |
| REISER4 (gzip) | 2,295 | 213 |
| REISER4 | 3,462 | 692 |
| EXT2 | 4,092 | 816 |
| JFS | 4,225 | 806 |
| EXT4 | 4,408 | 816 |
| EXT3 | 4,421 | 816 |
| XFS | 4,625 | 799 |
| REISER3 | 6,178 | 793 |
| FAT32 | 12,342 | 988 |
| NTFS-3g | >10,414 | 772 |
"Once I switched to ReiserFS all those problems went away."
I once used ReiserFS on a file server, and after a broken PSU took the server down it was the data that went away, because the tree ReiserFS uses was corrupted.
Actually, SWSUSP2 works well enough now that I hibernate all of the machines at home (except the server) when they're not going to be used, such as overnight. Hibernation on my HP laptop is almost infallible -- I've got an "uptime" of over a month, hibernating once or twice (and sometimes more) every day -- and, while it takes a good minute to go into hibernation, it comes out of it within 35 seconds... that's from hitting the power switch to having my KDE desktop back up.
I just wish S3 worked as well, and that the kernel folks would adopt SWSUSP2, which works so much better than the default hibernate mechanism.
--- SER
An interesting (and simple) test would be to simulate power failures during copy/delete file operations (large ISO file and file tree) and see how each FS handles each situation. But I'm aware that this is only a small part of real data-integrity testing.
If anybody has seen hard data about FS reliability, feel free to post a link here. I would be very interested to investigate this, in order to produce more comprehensive and real-world benchmarks.
You said "While recognizing the relative merits of each filesystem, an system administrator has no choice but to install only one filesystem...". I don't understand why you believe there is or should be this sort of restriction.
Can't the administrator decide to 'partition' usage onto different volumes, using different fs types based on their performance for that usage?
For example, I might create a volume to hold users' home directories, expecting many small files while requiring maximum speed, and so choose XFS, while for a volume holding large files (video, audio, backups, still images, etc.) I might choose JFS instead.
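As a purely hypothetical sketch of such a layout in /etc/fstab (device names and mount points invented for illustration):

    # Hypothetical /etc/fstab excerpt: a different filesystem per usage pattern.
    # <device>     <mount point>   <type>  <options>           <dump> <pass>
    # many small files, favour XFS:
    /dev/sda5      /home           xfs     defaults,noatime    0      2
    # large media files, favour JFS:
    /dev/sda6      /srv/media      jfs     defaults,noatime    0      2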
Note that I'm not advocating or even suggesting that the above is in some way an optimal setup, it's just an 'off the top of my head' example. The question is "Why shouldn't I be able to do this sort of thing if I so choose?" Is there something I'm missing?
The shred command works by writing random data, zeros and ones over and over to the spots on the disk where the file you want to shred is located. The hope is that with enough writes, the data will actually be overwritten on the disk. (The head of the hard drive varies a small amount as it traces its path over the disk, so the data might not be completely erased.)
The problem is, journaling filesystems write data to the journal before they write it to its final location on the disk. So even after shredding the file's blocks on the disk, an attacker might be able to recover data from wherever on the disk the journal is located, even if the data blocks themselves are unreadable.
The real issue is that shredding a file even on ext2 does not always work, because modern hard drives sometimes transparently remap bad sectors... so the drive may have moved the data away from the location where the operating system thinks it originally wrote mysecret.txt. An attacker could still read data from the "bad" sector using the right tools.
Realistically, shred should never be relied on. Using dm-crypt to encrypt a full filesystem is a much better solution, and with the power of today's CPUs, the performance cost is a small enough trade-off for secrecy.
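For illustration only, a minimal dm-crypt/LUKS sketch along the lines of that recommendation (the device and mapping names are hypothetical):

    # Hypothetical example: encrypt a whole partition with dm-crypt/LUKS so that
    # data never hits the platters in the clear (journal and remapped sectors
    # included). luksFormat destroys anything already on the partition.
    cryptsetup luksFormat /dev/hdb1
    cryptsetup luksOpen   /dev/hdb1 secure     # creates /dev/mapper/secure
    mkfs.ext3 /dev/mapper/secure
    mount /dev/mapper/secure /mnt/secure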
For more information about securely destroying data, you can read the TKS1 paper on this page, which is really interesting. (Scroll down to section 3 on page 4.)
The substantive point is that people should be much more cautious and skeptical about this report than many are being.
Indeed, notail matters a great deal. Also, mounting all filesystems with noatime should be strongly considered; access times are virtually useless information that is quite expensive to maintain. These, as well as numerous other FS configuration parameters, are commonly used by experienced administrators, and the comparison is meaningless without taking them into account. The author of this comparison means well but is apparently quite lacking in the requisite experience and expertise.
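For example, a hedged /etc/fstab sketch with those options (device names and mount points are hypothetical; notail is specific to ReiserFS, noatime applies to any of the filesystems tested):

    # Hypothetical /etc/fstab lines showing the options discussed above.
    /dev/hda2   /srv/files   reiserfs   defaults,noatime,notail   0   2
    /dev/hda3   /home        ext3       defaults,noatime          0   2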
That's dumb; the "backup" is immediately followed by deleting all the data on the original, so it's not a backup at all, it's just a pointless relocation. It would make more sense to mkfs the second disk, tree copy the first disk to the second disk, and then unmount the first disk and mount the second disk on the mount point (you of course use partition labels rather than absolute device names), which leaves the first disk as a backup that can be copied to tape or other backup media without affecting performance of the live filesystem. Of course, this all assumes that the filesystem can be unmounted in the first place, which often isn't possible -- making background defragmentation the best choice.
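A hedged sketch of that sequence, with invented device names, label and paths:

    # Sketch of mkfs-second-disk, copy, then swap mount points (all names invented).
    mkfs.xfs -f -L data2 /dev/hdd1     # fresh filesystem on the second disk
    mount /dev/hdd1 /mnt/new
    cp -a /srv/data/. /mnt/new/        # tree copy; files land contiguously
    umount /srv/data /mnt/new
    mount LABEL=data2 /srv/data        # remount by label, not device name
    # the first disk now holds an intact copy that can be dumped to tape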
Nahhh, there is an essential detail here: in-place defragmentation is done on the filesystem itself, and there is no backup. If the in-place fails, goodbye data.
You're babbling; I didn't say anything about in-place defragmentation in the statement you quoted. And when there is in-place defragmentation, who says there's no backup? One always does a backup before defragmenting. Sheesh.
Instead by making a copy to another spindle and copying back there is always a valid copy.
In your scenario, you copied one disk to another and then immediately did a mkfs on the first disk. That makes the copy STUPID, compared to simply mkfsing the second disk; in both cases you have one disk containing the original FS and the other containing an empty FS. Sheesh.
"Now, the best way to defragment is to do a disk-to-disk image backup followed by a re-format of the original partition and a disk-to-disk tree restore."
I disagree. A good defragmenter would defragment based on usage statistics. Frequently used files would be placed where the disk is faster (outer tracks) while unused files would be placed where it is slower. Files would be grouped based on usage pattern; for example, files used at boot time would be placed together.
Hi everybody,
Wow! I *really* did not expect this first contribution to generate so many comments and interesting discussions. Some sound a bit "hard", since this review has a limited and modest objective (to complement some data published by Piszcz with tasks I felt were unclear or missing), but I understand the points made by these authors. I apologize for missing or unclear information.
I'm currently working to rerun all my testing while taking into account many suggestions :
I'm also investigating the best way to test :
within the scope of small-business file server operations.
Some discussions were initiated about how to test data integrity after unexpected system shutdowns. I feel it will be a very interesting metric to benchmark, since small-business and home servers may be less likely to have power-failure protection (UPS, etc.).
QUESTION TO EXPERIENCED CONTRIBUTORS :
Since it's my first contribution to debian-administration.org, I've restricted myself to the HTML tags suggested in the "Submit an article" section. However, I agree with previous comments that it would be more interesting to publish graphs of the results. How can I do that here (other than uploading graphs to a personal website and linking them here)?
Thanks everybody!
Well, most of the improvements you suggest are already in the tests I did a while ago
Yes, I've already noticed your various posts about your work. Really interesting! Too bad I became aware of it after publishing my initial report. It would have helped me to better *bullet-proof* my methodology... :D
I feel that independent replication of results is as important as good methodology for the advancement of knowledge. I've published my initial results here and, from the beginning, I've invited readers to share comments and suggestions to improve this ongoing work. As in many other scientific fields, it's the accumulation of evidence that helps to establish facts, more than waiting for the "definitive" study to come. It's probably my statistician background speaking here, and I respect that not everybody may share this view.
I suspect that it would be far more interesting to see tests done with large partitions/filesystem sizes than I had the patience to do myself (my tests involved 4GB/8GB).
I'm actually working on a 40GB partition and I'm planning to test on a 160GB partition (transfer sizes and operations will be proportionally increased), to see whether some results scale up linearly or exponentially.
Either mail me the images to go along with the article - or host them somewhere yourself and I'll copy them over.
I'm happy to include images in pieces where they are useful, and I agree this type of article would benefit from them!
(I'd much prefer to host images here since then I don't have to worry about them disappearing. However if you feel strongly that this is not a good idea I would allow you to host them.)
Thanks for the article; I too was surprised by how many readers and commenters appreciated it!
It would be good if CPU load and kernel memory consumption were also tracked (so there was an indication of FS overhead per unit of performance), especially if the tests were run on a normal setup and on two configurations that were deliberately reduced, so that it would be possible to extrapolate how the filesystem would perform under any other configuration (assuming FS performance follows a simple curve).
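A hedged sketch of how such tracking could be bolted onto the existing script, sampling kernel slab usage and load average during a run (the log file name and interval are arbitrary):

    # Hypothetical sampler: log kernel slab usage and load average every 5 seconds
    # while the benchmark runs, to estimate per-filesystem kernel overhead.
    while sleep 5; do
        echo "$(date +%s) $(grep '^Slab:' /proc/meminfo) load: $(cut -d' ' -f1-3 /proc/loadavg)"
    done >> fs-overhead.log &
    SAMPLER=$!
    # ... run the benchmark tasks here ...
    kill "$SAMPLER"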
I use ext3 because most tools are written for it, and every Linux system supports it.