Weblog entry #57 for dkg

dd, netcat, and disk throughput
Posted by dkg on Mon 21 Dec 2009 at 06:21

I was trying to dump a large Logical Volume (LV) over ethernet from one machine to another. I found some behavior which surprised me.

fun constraints

  • I have only a fairly minimal Debian installation on each machine (which fortunately includes netcat-traditional).
  • The two machines are connected directly by a single (gigabit) ethernet cable, with no other network connection. So no pulling in extra packages.
  • I have serial console access to both machines, but no physical access.
  • The LV being transferred is 973GB in size according to lvs (fairly large, that is), and contains a LUKS volume, which itself contains a basically-full filesystem -- transferring just the "used" bytes is not going to save space or time.
  • I want to be able to check on how the transfer is doing while it's happening.
  • I want the LV to show up as an LV on the target system, and don't have tons of extra room on the target to play around with (so no dumping it to the filesystem as a disk image first).

(how do i get myself into these messes?)

Setup

The first step was to make an LV that i would transfer the data into:

0 targ:~ # lvcreate --name lv0 --size 973GB vg_targ
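
(To make sure the new LV is at least as large as the source, the exact sizes of the two devices can be compared in bytes; one way, for example, is blockdev:)

0 src:~ # blockdev --getsize64 /dev/mapper/vg_src-lv0
0 targ:~ # blockdev --getsize64 /dev/mapper/vg_targ-lv0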

What i discovered

My first thought was to just use netcat directly, like this:
0 src:~ # nc -l -p 12345 < /dev/mapper/vg_src-lv0
0 targ:~ # nc src 12345 > /dev/mapper/vg_targ-lv0

But, of course, it turns out that i can't tell what's going on here -- i don't know how much data has been transferred because i could find no reporting features in netcat-traditional.

No problem (i thought): i'll stick dd inline with it, since i can send a killall -USR1 dd to get throughput reports from that lovely tool. When i do that, i find that the throughput is abysmal: on the order of 1 megabyte per second (8 Mbps), over a link that is supposed to run at one gigabit per second (1000 Mbps). That's ~1% utilization of the link. ugh.
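
A dd-in-the-middle pipeline looks something like this (dd is inserted purely for reporting here, still at its default 512-byte block size; the exact invocation is a reconstruction, not a transcript):

0 src:~ # dd if=/dev/mapper/vg_src-lv0 | nc -l -p 12345
0 targ:~ # nc src 12345 | dd of=/dev/mapper/vg_targ-lv0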

My next step was to try to figure out what the source of the slowdown was. I figured there were three options: the disk write speed on the target, the disk read speed on the source, and the network link in the middle.

An individual test of the source disk read looked good:

0 src:~ # dd bs=$((1024*1024)) count=100 < /dev/mapper/vg_src-lv0 > /dev/null
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.7599 s, 59.6 MB/s
0 src:~ # 

And an individual test of the target disk write looked good, with similar rates from a test like this (WARNING! This is a destructive test, as it overwrites data in the target LV. I could do this because i didn't have anything important in the LV yet. If you do this on an LV with real data, it will destroy all your data. If you aren't sure you want to do this, do not try it yourself):

0 targ:~ # dd bs=$((1024 * 1024 )) count=100 < /dev/zero > /dev/mapper/vg_targ-lv0 
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 2.79603 s, 37.5 MB/s
0 targ:~ # 

And a test over the network looked fine:

0 targ:~ # nc -l -p 12345 > /dev/null
0 src:~ #  dd bs=$((1024 * 1024)) count=100 < /dev/zero | nc targ 12345
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.90866 s, 115 MB/s
0 src:~ # 

i was puzzled: disk throughput was good on both sides of the transfer, and the network traffic was nice and speedy, but putting it all together, things were slow slow slow.

You've probably already guessed the problem: buffer sizes for the writes!

It turned out that writing to the LV with successive 1KB blocks was very slow, but using 100KB blocks gave much better performance (WARNING! this is also a destructive test):

0 targ:~ # dd bs=1024 count=102400 < /dev/zero > /dev/mapper/vg_targ-lv0
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 110.697 s, 947 kB/s
0 targ:~ # dd bs=10240 count=10240 < /dev/zero > /dev/mapper/vg_targ-lv0
10240+0 records in
10240+0 records out
104857600 bytes (105 MB) copied, 29.3647 s, 3.6 MB/s
0 targ:~ # dd bs=102400 count=1024 < /dev/zero > /dev/mapper/vg_targ-lv0
1024+0 records in
1024+0 records out
104857600 bytes (105 MB) copied, 2.78096 s, 37.7 MB/s
0 targ:~ # 

And, as it turns out, netcat-traditional uses hard-coded 8KB buffers for its transfers, so a raw netcat write to disk is likely to be really slow for the setup i was looking at.
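
(If strace happens to be available, which it may well not be on a minimal install, one way to confirm what write sizes actually reach the block device is to attach to whichever process has the target LV open for writing and watch its write() calls; here $WRITER_PID just stands in for that process's pid:)

0 targ:~ # strace -e trace=write -p $WRITER_PID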

A Solution?

So the solution i arrived at was to use dd as a buffering tool (in addition to its use as a throughput reporter). I took advantage of the fact that dd can have different buffer sizes for input and output. I told dd to use 1MB buffers on the LV side, and 8KB buffers on the netcat side:

0 src:~ # dd ibs=$((1024*1024)) obs=8192 < /dev/mapper/vg_src-lv0 | nc -l -p 12345
0 targ:~ # nc src 12345 | dd ibs=8192 obs=$((1024*1024)) of=/dev/mapper/vg_targ-lv0
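
While that runs, a progress report can be coaxed out of either dd from a second console on that machine, using the same USR1 trick mentioned above (the report appears on that dd's stderr, i.e. on its console):

0 targ:~ # killall -USR1 dd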

One problem with this setup is that when the transfer completes, both processes hang, i think because the nc on targ could still, in principle, send data back to src, and nothing has told either process to ignore those file descriptors. Hitting Ctrl-D on the targ side should terminate both sides cleanly at that point.

Sidetracks

Some other things i looked at:

  • i looked into changing the MTU on the two NICs to support Jumbo Frames (a sketch of that change appears after this list). I don't think this helped, because netcat might not take advantage of the larger frames internally, but i might be mistaken about this.
  • i tried a number of things to close/lock down the irrelevant file descriptors for netcat. For example, there's no need for netcat on targ to send any data to src. I don't think this had any effect.
  • Either netcat or dd is doing something unintuitive (well, i find it unintuitive) when you do netcat | dd where dd's input block size > 8192. In fact, it looks like data might even be getting dropped in this pipeline without any warning indication. This would be worth looking into further, but i'm out of time on this one at the moment. Here's the difference:
    0 dkg@pip:~$ BS=8192; dd bs="$BS" count=1000 < /dev/zero | nc.traditional -w 5 -l -p 12345 > /dev/null & \
    > sleep 0.01 && nc.traditional -w 1 localhost 12345 < /dev/null | dd  bs="$BS" count=1000 > /dev/null ;\
    > echo client returned $? ; wait
    [1] 22310
    1000+0 records in
    1000+0 records out
    8192000 bytes (8.2 MB) copied, 0.0986349 s, 83.1 MB/s
    1000+0 records in
    1000+0 records out
    8192000 bytes (8.2 MB) copied, 0.0966373 s, 84.8 MB/s
    [1]+  Done                    dd bs="$BS" count=1000 < /dev/zero | nc.traditional -w 5 -l -p 12345 > /dev/null
    client returned 0
    0 dkg@pip:~$ BS=8193; dd bs="$BS" count=1000 < /dev/zero | nc.traditional -w 5 -l -p 12345 > /dev/null & \
    > sleep 0.01 && nc.traditional -w 1 localhost 12345 < /dev/null | dd  bs="$BS" count=1000 > /dev/null ;\
    > echo client returned $? ; wait
    [1] 22315
    1000+0 records in
    1000+0 records out
    8193000 bytes (8.2 MB) copied, 0.101437 s, 80.8 MB/s
    0+1000 records in
    0+1000 records out
    8192000 bytes (8.2 MB) copied, 0.0980517 s, 83.5 MB/s
    [1]+  Done                    dd bs="$BS" count=1000 < /dev/zero | nc.traditional -w 5 -l -p 12345 > /dev/null
    client returned 0
    0 dkg@pip:~$ 
    

    Note how the incoming buffers are all underfilled (0+1000 instead of 1000+0), and the total transfer is 1000 bytes short of what it should be (8192000 bytes instead of 8193000). The only difference between the two commands is the value of $BS. Seems weird and bad to me.

    A similar problem seems to happen with BS=1025, and netcat-openbsd shows similarly odd behavior.

    I'm hoping that this is attributable to my own misunderstanding, and not indicative of a deep bug in one of these tools. If you can enlighten me as to why this happens, i'd be happy to hear it.

  • Out of curiosity, i tried this kind of thing over the loopback on another machine using netcat-openbsd (i couldn't have used it to solve this particular problem, because the two machines in question are not on the larger network), and discovered that the two netcats have fairly different featuresets (and treat some significant options differently).
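
For reference, the jumbo-frame change mentioned in the first bullet above would look something like this on each machine (assuming the interface is eth0 and that both NICs and the switchless link can handle a 9000-byte MTU):

0 src:~ # ip link set dev eth0 mtu 9000
0 targ:~ # ip link set dev eth0 mtu 9000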

 

Comments on this Entry

Posted by Anonymous (159.149.xx.xx) on Mon 21 Dec 2009 at 09:52
Have you considered using "Pipe View"?

http://www.ivarch.com/programs/pv.shtml

I've used it in the past for tasks similar to the one you describe here and found it quite useful.

Best regards,
Massimo
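
(For anyone who can install it, a pv-based receiving side might look something like this, where -s tells pv the expected total so it can show a percentage and an ETA, and dd still does the large writes to the LV; the size is just the LV size from earlier:)

0 targ:~ # nc src 12345 | pv -s 973G | dd ibs=8192 obs=$((1024*1024)) of=/dev/mapper/vg_targ-lv0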


Posted by Anonymous (67.217.xx.xx) on Mon 21 Dec 2009 at 14:51
Yes I will second the PipeView suggestion. Never tried it with that much data, but it certainly is a good tool to try.


Posted by dkg (216.254.xx.xx) on Mon 21 Dec 2009 at 18:08
I considered pv, but i couldn't get it onto these machines, because of the constraints i mentioned earlier about lack of network access :(

Also, would that have affected the sizes of the buffered writes to disk?


Posted by justanotheruser (2001:0xx:0xx:0xxx:0xxx:0xxx:xx) on Mon 21 Dec 2009 at 11:51
To save some typing, one can replace $((1024*1024)) with 1M.
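
(With that shorthand, the receiving side of the solution above becomes, for example:)

0 targ:~ # nc src 12345 | dd ibs=8192 obs=1M of=/dev/mapper/vg_targ-lv0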


Posted by Anonymous (89.140.xx.xx) on Mon 21 Dec 2009 at 12:21
Thank you for this interesting article.

What do you think about using ssh instead of nc? Is it always slower?


Posted by dkg (216.254.xx.xx) on Mon 21 Dec 2009 at 18:11
ssh adds some CPU overhead, to be sure, and has different buffering code, which might affect the write speeds.

I agree a comparison would be useful. Why not try it out and write up a report?
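
(For anyone who wants to run that comparison, and assuming ssh access between the machines is actually available, an ssh-based version might look something like this, with dd still providing the large reads and writes on each end:)

0 src:~ # dd ibs=$((1024*1024)) obs=8192 < /dev/mapper/vg_src-lv0 | ssh targ "dd ibs=8192 obs=$((1024*1024)) of=/dev/mapper/vg_targ-lv0"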


Posted by chrysn (2001:0xx:0xx:0xxx:0xxx:0xxx:xx) on Mon 21 Dec 2009 at 18:24
you can view the progress of any file transfer by going to /proc/`pidof the_reading_program`/fd, determining which fd points to the relevant file using readlink, then going to ../fdinfo and looking at the pos field of that file descriptor's info file.
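
(A minimal sketch of that approach, assuming dd is the reading process on src and the file in question is the source LV:)

0 src:~ # PID=$(pidof -s dd)
0 src:~ # for fd in /proc/$PID/fd/*; do
>   [ "$(readlink -f "$fd")" = "$(readlink -f /dev/mapper/vg_src-lv0)" ] && \
>     grep pos "/proc/$PID/fdinfo/${fd##*/}"
> done

The pos value is the byte offset dd has read so far, which can be compared against the total size of the LV.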


Posted by dkg (216.254.xx.xx) on Mon 21 Dec 2009 at 18:32
Thanks! This is a very handy tip.


Posted by Anonymous (76.105.xx.xx) on Wed 23 Dec 2009 at 08:05
You can use "-q0" on the sending side to drop the connection when it finishes.
