Weblog entry #57 for dkg
I was trying to dump a large Logical Volume (LV) over ethernet from one machine to another. I found some behavior which surprised me.
fun constraints
- I have only a fairly minimal debian installation on each machine (which fortunately includes netcat-traditional)
- The two machines are connected directly by a single (gigabit) ethernet cable, with no other network connection. So no pulling in extra packages.
- I have serial console access to both machines, but no physical access.
- The LV being transferred is 973GB in size according to lvs (fairly large, that is), and contains a LUKS volume, which itself contains a basically-full filesystem -- transferring just the "used" bytes is not going to save space/time.
- I want to be able to check on how the transfer is doing while it's happening.
- I want the LV to show up as an LV on the target system, and don't have tons of extra room on the target to play around with (so no dumping it to the filesystem as a disk image first).
(how do i get myself into these messes?)
Setup
The first step was to make an LV that i would transfer the data into:
0 targ:~ # lvcreate --name lv0 --size 973GB vg_targ
What i discovered
My first thought was to just use netcat directly, like this:
0 src:~ # nc -l -p 12345 < /dev/mapper/vg_src-lv0
0 targ:~ # nc src 12345 > /dev/mapper/vg_targ-lv0
But, of course, it turns out that i can't tell what's going on here -- i don't know how much data has been transferred because i could find no reporting features in netcat-traditional.
No problem (i thought), i'll stick dd inline with it, since i can send a killall -USR1 dd to get throughput reports from that lovely tool. When i do that, i find that the throughput is abysmal; on the order of 1 megabyte per second (8 Mbps), over a link that is supposed to be one gigabit per second (1000 Mbps). This is ~1% utilization of the link. ugh.
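For anyone unfamiliar with the USR1 trick: GNU dd prints its current record and byte counts to stderr when it receives SIGUSR1, and then keeps copying. Here is a minimal sketch of that, assuming GNU coreutils dd. It copies zeros to /dev/null instead of touching a real LV, dd-report.txt is just a hypothetical scratch file name, and it signals a single PID with kill rather than every running dd with killall.

```shell
#!/bin/sh
# Start an open-ended dd in the background; its status reports go to a scratch file.
dd if=/dev/zero of=/dev/null bs=1M 2> dd-report.txt &
DD_PID=$!

# Give dd a moment to start (a USR1 sent before dd installs its handler would kill it).
sleep 1

# Ask for a progress report; dd keeps running afterwards.
kill -USR1 "$DD_PID"
sleep 1

# Stop the demo copy and show what dd reported.
kill "$DD_PID"
wait "$DD_PID" 2>/dev/null || true
cat dd-report.txt
```

In a real transfer the same signal can be sent from a second shell (or the serial console) while the pipeline runs.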
My next step was to try to figure out what the source of the slowdown was. I figured there were three options: the disk write speed on the target, the disk read speed on the source, and the network link in the middle.
An individual test of the source disk read looked good:
0 src:~ # dd bs=$((1024*1024)) count=100 < /dev/mapper/vg_src-lv0 > /dev/null
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.7599 s, 59.6 MB/s
0 src:~ #
And an individual test of the target disk write looked good, with similar rates from a test like this (WARNING! This is a destructive test, as it overwrites data in the target LV. I could do this because i didn't have anything important in the LV yet. If you do this on an LV with real data it will destroy all your data. If you aren't sure you want to do this, do not try it yourself):
0 targ:~ # dd bs=$((1024 * 1024 )) count=100 < /dev/zero > /dev/mapper/vg_targ-lv0
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 2.79603 s, 37.5 MB/s
0 targ:~ #
And a test over the network looked fine:
0 targ:~ # nc -l -p 12345 > /dev/null
0 src:~ # dd bs=$((1024 * 1024)) count=100 < /dev/zero | nc targ 12345
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.90866 s, 115 MB/s
0 src:~ #
i was puzzled: disk throughput was good on both sides of the transfer, and the network traffic was nice and speedy, but putting it all together, things were slow slow slow.
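To put numbers on the puzzle: a pipeline can go no faster than its slowest stage, so the individual measurements suggest something near the 37.5 MB/s target write rate, not 1 MB/s. A rough sketch of the arithmetic, assuming the 973GB that lvs reports means GiB (2^30 bytes) and that dd's MB/s means 10^6 bytes per second; eta_hours is a made-up helper name.

```shell
#!/bin/sh
# eta_hours SIZE_GIB RATE_MBPS
# Prints the estimated transfer time in hours, to one decimal place.
# Assumes SIZE_GIB is in units of 2^30 bytes and RATE_MBPS in 10^6 bytes/s.
eta_hours() {
    awk -v gib="$1" -v rate="$2" 'BEGIN {
        seconds = (gib * 1024 * 1024 * 1024) / (rate * 1000 * 1000)
        printf "%.1f\n", seconds / 3600
    }'
}

# The slowest individually measured stage (target disk write).
echo "at 37.5 MB/s: $(eta_hours 973 37.5) hours"
# The rate actually observed with dd inline.
echo "at 1 MB/s:    $(eta_hours 973 1) hours"
```

Under those assumptions the difference is between finishing in well under a day and waiting roughly twelve days.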
You've probably already guessed the problem: buffer sizes for the writes!
It turned out that writing to the LV with successive 1KB blocks was very slow, but using 100KB blocks gave much better performance (WARNING! this is also a destructive test):
0 targ:~ # dd bs=1024 count=102400 < /dev/zero > /dev/mapper/vg_targ-lv0
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 110.697 s, 947 kB/s
0 targ:~ # dd bs=10240 count=10240 < /dev/zero > /dev/mapper/vg_targ-lv0
10240+0 records in
10240+0 records out
104857600 bytes (105 MB) copied, 29.3647 s, 3.6 MB/s
0 targ:~ # dd bs=102400 count=1024 < /dev/zero > /dev/mapper/vg_targ-lv0
1024+0 records in
1024+0 records out
104857600 bytes (105 MB) copied, 2.78096 s, 37.7 MB/s
0 targ:~ #
and, as it turns out, netcat-traditional uses hard-coded 8KB buffers for transfer, so a raw netcat transfer to disk is likely to be really slow for the setup i was looking at.
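The block-size comparison can be scripted. A small sketch, assuming GNU dd (for conv=fsync): it writes to a scratch file called blocksize-scratch.img (a hypothetical name) in the current directory rather than to the LV, so it is not destructive. The page cache and the filesystem will likely flatten the differences compared to writing straight to a block device, so treat it as a template for the method, not a reproduction of the numbers.

```shell
#!/bin/sh
# Total bytes per run; chosen so that every block size below divides it evenly.
TOTAL=10240000
TARGET=./blocksize-scratch.img

for BS in 1024 10240 102400; do
    COUNT=$((TOTAL / BS))
    echo "bs=$BS count=$COUNT"
    # conv=fsync makes dd flush to disk before reporting, so the rate
    # includes the actual write; tail keeps only dd's summary line.
    dd if=/dev/zero of="$TARGET" bs="$BS" count="$COUNT" conv=fsync 2>&1 | tail -n 1
done
# The scratch file is left in place; remove it when done.
```

Pointing TARGET at an LV would make this destructive in exactly the way the warnings above describe.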
A Solution?
So the solution i arrived at was to use dd as a buffering tool (in addition to its use as a throughput reporter). I took advantage of the fact that dd can have different buffer sizes for input and output. I told dd to use 1MB buffers on the LV side, and 8KB buffers on the netcat side:
0 src:~ # dd ibs=$((1024*1024)) obs=8192 < /dev/mapper/vg_src-lv0 | nc -l -p 12345
0 targ:~ # nc src 12345 | dd ibs=8192 obs=$((1024*1024)) of=/dev/mapper/vg_targ-lv0
one problem with this setup is that when the transfer completes, both processes hang, i think because the process from targ still has a chance to write data back to src, and nothing has told the two processes to ignore those file descriptors. Hitting Ctrl-D on the targ process should terminate both sides cleanly at that point.
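The reblocking itself can be tried without a second machine. This is only a sketch: cat stands in for the netcat hop, and src.img and dst.img are hypothetical scratch files, not LVs. Because neither dd is given a count, each one reads until end of input, and cmp confirms that nothing was lost along the way.

```shell
#!/bin/sh
# Make a 4 MiB scratch "source volume" filled with random data.
dd if=/dev/urandom of=src.img bs=1024 count=4096 2>/dev/null

# Sender side: large reads from the source, 8KB writes toward the "network".
# Receiver side: 8KB reads from the "network", large writes to the target.
dd ibs=$((1024*1024)) obs=8192 < src.img 2>/dev/null \
    | cat \
    | dd ibs=8192 obs=$((1024*1024)) of=dst.img 2>/dev/null

# Byte-for-byte comparison of source and target.
if cmp -s src.img dst.img; then
    echo "copies match"
else
    echo "copies differ"
fi
```

The same kind of comparison (or a checksum on each side) is a reasonable sanity check after the real transfer, at the cost of reading both volumes again.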
Sidetracks
Some other things i looked at:
- i looked into changing the MTU on the two NICs to support Jumbo Frames. I don't think this helped, because netcat might not take advantage of the larger frames internally, but i might be mistaken about this.
- i tried a number of things to close/lock down the irrelevant file descriptors for netcat. for example, there's no need for netcat on targ to send any data to src. I don't think this had any effect.
- Either netcat or dd is doing something unintuitive (well, i find it unintuitive) when you do netcat | dd where dd's input block size > 8192. In fact, it looks like data might even be getting dropped in this pipeline without any warning indication. This would be worth looking into further, but i'm out of time on this one at the moment. Here's the difference:
0 dkg@pip:~$ BS=8192; dd bs="$BS" count=1000 < /dev/zero | nc.traditional -w 5 -l -p 12345 > /dev/null & \
> sleep 0.01 && nc.traditional -w 1 localhost 12345 < /dev/null | dd bs="$BS" count=1000 > /dev/null ;\
> echo client returned $? ; wait
[1] 22310
1000+0 records in
1000+0 records out
8192000 bytes (8.2 MB) copied, 0.0986349 s, 83.1 MB/s
1000+0 records in
1000+0 records out
8192000 bytes (8.2 MB) copied, 0.0966373 s, 84.8 MB/s
[1]+ Done dd bs="$BS" count=1000 < /dev/zero | nc.traditional -w 5 -l -p 12345 > /dev/null
client returned 0
0 dkg@pip:~$ BS=8193; dd bs="$BS" count=1000 < /dev/zero | nc.traditional -w 5 -l -p 12345 > /dev/null & \
> sleep 0.01 && nc.traditional -w 1 localhost 12345 < /dev/null | dd bs="$BS" count=1000 > /dev/null ;\
> echo client returned $? ; wait
[1] 22315
1000+0 records in
1000+0 records out
8193000 bytes (8.2 MB) copied, 0.101437 s, 80.8 MB/s
0+1000 records in
0+1000 records out
8192000 bytes (8.2 MB) copied, 0.0980517 s, 83.5 MB/s
[1]+ Done dd bs="$BS" count=1000 < /dev/zero | nc.traditional -w 5 -l -p 12345 > /dev/null
client returned 0
0 dkg@pip:~$
Note how the incoming buffers are all underfilled (0+1000 instead of 1000+0), and the total bytes transferred are 1000 less than they should be. The only difference between the two commands is the value of $BS. Seems weird and bad to me.
A similar problem seems to happen with BS=1025, and netcat-openbsd shows similarly odd behavior.
I'm hoping that this is attributable to my own misunderstanding, and not indicative of a deep bug in one of these tools. If you can enlighten me as to why this happens, i'd be happy to hear it.
- Out of curiosity, i tried this kind of thing over the loopback on another machine using netcat-openbsd (i couldn't have used it to solve this particular problem, because the two machines in question are not on the larger network), and discovered that the two netcats have fairly different featuresets (and treat some significant options differently).
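One likely explanation for the underfilled records (0+1000) in the netcat | dd sidetrack, offered as an assumption rather than something confirmed on the original machines: dd's count limits the number of read() calls, not the number of full blocks. netcat hands data over in 8192-byte pieces, so a dd asking for 8193 bytes gets 8192 each time (a partial record), stops after 1000 of them, and leaves the last 1000 bytes unread. The sketch below reproduces the effect without netcat, using a hypothetical produce function that dribbles out 4-byte pieces, and shows GNU dd's iflag=fullblock, which keeps reading until each block is full.

```shell
#!/bin/sh
# Emit 16 bytes in four 4-byte pieces, pausing before each one so that
# a reader on the other end of the pipe sees them as separate short reads.
produce() {
    for i in 1 2 3 4; do
        sleep 0.2
        printf 'aaaa'
    done
}

# count=2 with 8-byte blocks: each read() returns only the 4 bytes available,
# and dd counts those partial reads toward count.
SHORT=$(produce | dd bs=8 count=2 2>/dev/null | wc -c)

# iflag=fullblock (a GNU extension) makes dd keep reading until each
# 8-byte block is complete before counting it.
FULL=$(produce | dd bs=8 count=2 iflag=fullblock 2>/dev/null | wc -c)

echo "without fullblock: $SHORT bytes"
echo "with fullblock:    $FULL bytes"
```

On a lightly loaded machine the first count should come up short of the 16 bytes produced and the second should not. Dropping count entirely, as in the ibs/obs pipeline from the solution, sidesteps the issue because dd then simply reads until end of input.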
Comments on this Entry
also, would that have affected the sizes of the buffered writes to disk?
What do you think about using ssh instead of nc? Is it always slower?
I agree a comparison would be useful. Why not try it out and write up a report?
http://www.ivarch.com/programs/pv.shtml
I've used it in the past for tasks similar to the one you describe here and found it quite useful.
Best regards,
Massimo