Sunday, November 2, 2014

An Example of Parallel Processing

This post shows how to use parallel processing to get a CPU intensive job done faster in Unix/Linux. By splitting a large task into several parts, it is quite easy to give each part to a separate CPU, and complete the task many times faster than it would on a single processor.
These days, even small PCs and other devices often come equipped with several CPU cores. But some tasks will use only one core, sometimes using 100% of it, while other cores stand by idle. Sometimes this is a waste of resources.

Look at those CPUs

I am writing this on a Linux laptop containing 8 CPU cores. Actually it is a quad core Haswell system with 2 hardware threads per core. But Linux thinks it has 8 entirely separate processors, look:
bash-4.2$ grep proc /proc/cpuinfo
processor       : 0
processor       : 1
processor       : 2
processor       : 3
processor       : 4
processor       : 5
processor       : 6
processor       : 7
You would get similar output on any multi-processor system. For example an x86 Xeon system, a Sun Sparc “T” box or a virtual machine. However the “CPUs” are provisioned, the OS just sees them as regular CPUs.

Give the System Some Work to do

To get those processors working, give them some hard work to do. Compressing data is CPU intensive.
For example, compress (gzip) a large text file called big.txt (3.2 Gb).
bash-4.2$ ls -lh big.txt
-rw-rw-r--. 1 james james 3.2G Feb  5 21:46 big.txt
bash-4.2$ time gzip big.txt

real    0m49.897s
user    0m48.857s
sys     0m1.016s
It took about 50 seconds to compress big.txt. Repeating the test gave similar results: 50.7 s, 50.6 s and 51.2 s.

Gzip Uses Only One CPU

Running ‘top‘ in another window during the above test, it can be seen that CPU usage is about 12.5%Top shows CPU stats across all processors, so 12.5% is one eighth of the 8 CPUs in the system, or one whole CPU. The gzip task completely occupied a single core, taking 100% of its resources. Meanwhile the other 7 processors did pretty much nothing – shown by the 87.4% idle time.
top - 21:53:09 up 37 min,  2 users,  load average: 0.24, 0.20, 0.15
Tasks: 183 total,   2 running, 181 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.3 us,  0.2 sy,  0.0 ni, 87.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8096132 total,  5764756 used,  2331376 free,    45284 buffers
KiB Swap:  8388604 total,        0 used,  8388604 free,  4942988 cached
Also, running ps -eLF during the gzip shows it running on a single processor (processor number 1), as you would expect. And there is only 1 thread.
bash-4.2$ ps -eLF  | egrep 'UID|zip' | grep -v grep
UID        PID  PPID   LWP  C NLWP    SZ   RSS PSR STIME TTY          TIME CMD
james     2376  1750  2376 90    1  1143   504   1 22:05 pts/1    00:00:04 gzip big.txt

Example 1 – Run Small Jobs in Parallel

If you have many files to compress, and many CPUs, it is quicker to zip them in parallel than one at a time. Putting jobs into the background with an ampersand (&) will make them run at the same time. Doing a “wait” command will make the shell wait until the last one completes. I had a directory containing hundreds of files, each 3 Gb in size, that needed to be compressed. It was easier to write a script than do it by hand:
PARALLEL=4

ls *.dbf |
while read file
do
   jobcount=$(( $jobcount+1 ))
   print "$jobcount gzip $file &"
   gzip $file &
   if [[ $jobcount -eq $PARALLEL ]] then
      print "waiting for $jobcount jobs to complete"
      wait
      print "$jobcount jobs completed"
      jobcount=0
   fi
done

print "waiting for remaining jobs to complete"
wait
The above piece of script will steadily compress all *.dbf files in the current directory, using up to 4 cpus at once, running up to 4 “gzip” commands at the same time.
The PARALLEL variable (4) is the maximum number of jobs that will run at once. As the system contained 10 CPUs, and was not busy, it was reasonable to commandeer 4 of them. Note the wait command in the inner block will be executed when there are 4 or more files yet to be compressed. The outer loop wait will execute when less than 4 remain. I actually have the above code in a cron job, housekeeping that folder.

Example 2 – Split a Big Task into Smaller Tasks

Try the same test again, but this time split up the big file in to 8 separate parts.
First, take a check sum of the file with good old cksum
bash-4.2$ cksum big.txt
1754962233 3345539400 big.txt
Now use the split command with “-n” to chop the file up:
-rw-rw-r--. 1 james james 3.2G Feb  5 21:46 big.txt
bash-4.2$ split -n 8 big.txt
bash-4.2$ ls -lh
total 6.3G
-rw-rw-r--. 1 james james 3.2G Feb  5 21:46 big.txt
-rw-rw-r--. 1 james james 399M Feb  5 22:22 xaa
-rw-rw-r--. 1 james james 399M Feb  5 22:22 xab
-rw-rw-r--. 1 james james 399M Feb  5 22:22 xac
-rw-rw-r--. 1 james james 399M Feb  5 22:22 xad
-rw-rw-r--. 1 james james 399M Feb  5 22:22 xae
-rw-rw-r--. 1 james james 399M Feb  5 22:22 xaf
-rw-rw-r--. 1 james james 399M Feb  5 22:22 xag
-rw-rw-r--. 1 james james 399M Feb  5 22:22 xah
…giving 8 file pieces of 399 Mb each, called xaa, xab, …, xah.
(split on some older versions of Solaris and Linux doesn’t have the -n switch, but you can use -b instead to get pieces of a given size)

Run the Compress Again, Parallelized

This time, 8 gzip tasks will be run in parallel. This takes a bit of thinking about. In order to run the jobs in parallel (ie. at the same time), they need to be launched in the background of the shell. But the time command only works on jobs running in the foreground.
One solution is to use the wait command, which simply waits for all background jobs to finish. Then the commands needed would be:
gzip xaa &
gzip xab &
gzip xac &
gzip xad &
gzip xae &
gzip xaf &
gzip xag &
gzip xah &
wait
I could put that in a script and invoke it with time, but it is quicker to just use awk.
bash-4.2$ time ls x* | awk '{print "gzip "$1" &"} END {print "wait"}' | sh -x
+ gzip xaa
+ gzip xab
+ gzip xac
+ gzip xag
+ wait
+ gzip xah
+ gzip xae
+ gzip xaf
+ gzip xad

real    0m10.048s
user    1m18.151s
sys     0m1.564s
bash-4.2$ ls
big.txt  xaa.gz  xab.gz  xac.gz  xad.gz  xae.gz  xaf.gz  xag.gz  xah.gz
All of the data has been gzipped in about 10 seconds, one fifth of the time it took before. Notice that the “user” time was recorded as 1 minute 18 seconds, even though only 10 seconds of real time actually passed. This is just a quirk of the time command.
Looking at top during the parralized gzip operation, it is pretty obvious that all 8 CPUs were being hammered by 8 gzip commands:
top - 23:03:44 up  1:48,  2 users,  load average: 0.61, 0.29, 0.23
Tasks: 193 total,   8 running, 185 sleeping,   0 stopped,   0 zombie
%Cpu(s): 85.1 us,  1.5 sy,  0.0 ni, 11.6 id,  1.5 wa,  0.2 hi,  0.1 si,  0.0 st
KiB Mem:   8096132 total,  7938740 used,   157392 free,     3020 buffers
KiB Swap:  8388604 total,     6512 used,  8382092 free,  7096640 cached

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2733 james     20   0    4572    504    352 R  99.8  0.0   0:05.40 gzip
 2735 james     20   0    4572    504    352 R  99.8  0.0   0:05.39 gzip
 2736 james     20   0    4572    504    352 R  99.4  0.0   0:05.39 gzip
 2738 james     20   0    4572    500    352 R  99.1  0.0   0:05.39 gzip
 2734 james     20   0    4572    504    348 R  98.8  0.0   0:05.37 gzip
 2737 james     20   0    4572    504    352 R  97.1  0.0   0:05.33 gzip
 2731 james     20   0    4572    500    352 R  49.2  0.0   0:02.71 gzip
 2732 james     20   0    4572    504    348 D  46.2  0.0   0:02.60 gzip
and ps agrees, showing one gzip job in each processor:
bash-4.2$ ps -eLF  | egrep 'UID|zip' | grep -v grep
UID        PID  PPID   LWP  C NLWP    SZ   RSS PSR STIME TTY          TIME CMD
james     2764  2763  2764 99    1  1143   504   5 23:05 pts/1    00:00:09 gzip xaa
james     2765  2763  2765 99    1  1143   504   7 23:05 pts/1    00:00:09 gzip xab
james     2766  2763  2766 99    1  1143   504   2 23:05 pts/1    00:00:09 gzip xac
james     2767  2763  2767 99    1  1143   500   3 23:05 pts/1    00:00:09 gzip xad
james     2768  2763  2768 99    1  1143   500   1 23:05 pts/1    00:00:09 gzip xae
james     2769  2763  2769 99    1  1143   500   4 23:05 pts/1    00:00:09 gzip xaf
james     2770  2763  2770 99    1  1143   504   6 23:05 pts/1    00:00:09 gzip xag
james     2771  2763  2771 99    1  1143   504   0 23:05 pts/1    00:00:09 gzip xah

Glue it Back Together

Okay, our compressed data is still in 8 parts. Putting it back together is easy:
bash-4.2$ ls
big.txt  xaa.gz  xab.gz  xac.gz  xad.gz  xae.gz  xaf.gz  xag.gz  xah.gz
bash-4.2$
bash-4.2$ cat x* > all.gz
Now there is a file all.gz. Because of the way gzip works, this file won’t be identical to the original bigfile.txt.gz, but the decompressed data will be identical. To prove that , just gunzip it and do another checksum:
bash-4.2$ gunzip all.gz
bash-4.2$ cksum all big.txt
1754962233 3345539400 all
1754962233 3345539400 big.txt
Uncompressed, the all file gives the same checksum as big.txt, also matching the original checksum taken at the outset of the test (1754962233)

Conclusion

This is just a rough example in the shell to show the effect of multi-processing. Real applications do it differently: buy dividing processes into many threads and running threads across different CPUs. For example, Firefox will run perhaps 40 threads and spread them over all available CPUs.
Note that the gzip test is definitely CPU bound. Repeating the test gives a similar time. Disk/cache effects can be discounted because CPU is the bottleneck in this case. Even if the while file were cached in memory, it would not improve execution speed.
Obviously, this technique applies only where the main operation is “associative”. That is, where the processed data is identical to the concatenation of the processed pieces of data. This is part of the design of several compress tools and is part of the gzip RFC.

No comments:

Post a Comment