This post shows how to use parallel processing to get a CPU intensive job done faster in Unix/Linux. By splitting a large task into several parts, it is quite easy to give each part to a separate CPU, and complete the task many times faster than it would on a single processor.
These days, even small PCs and other devices often come equipped with several CPU cores. But some tasks will use only one core, sometimes using 100% of it, while other cores stand by idle. Sometimes this is a waste of resources.
Look at those CPUs
I am writing this on a Linux laptop containing 8 CPU cores. Actually it is a quad core Haswell system with 2 hardware threads per core. But Linux thinks it has 8 entirely separate processors, look:
bash-4.2$ grep proc /proc/cpuinfo processor : 0 processor : 1 processor : 2 processor : 3 processor : 4 processor : 5 processor : 6 processor : 7
You would get similar output on any multi-processor system. For example an x86 Xeon system, a Sun Sparc “T” box or a virtual machine. However the “CPUs” are provisioned, the OS just sees them as regular CPUs.
Give the System Some Work to do
To get those processors working, give them some hard work to do. Compressing data is CPU intensive.
For example, compress (gzip) a large text file called big.txt (3.2 Gb).
bash-4.2$ ls -lh big.txt -rw-rw-r--. 1 james james 3.2G Feb 5 21:46 big.txt bash-4.2$ time gzip big.txt real 0m49.897s user 0m48.857s sys 0m1.016s
It took about 50 seconds to compress big.txt. Repeating the test gave similar results: 50.7 s, 50.6 s and 51.2 s.
Gzip Uses Only One CPU
Running ‘top‘ in another window during the above test, it can be seen that CPU usage is about 12.5%. Top shows CPU stats across all processors, so 12.5% is one eighth of the 8 CPUs in the system, or one whole CPU. The gzip task completely occupied a single core, taking 100% of its resources. Meanwhile the other 7 processors did pretty much nothing – shown by the 87.4% idle time.
top - 21:53:09 up 37 min, 2 users, load average: 0.24, 0.20, 0.15 Tasks: 183 total, 2 running, 181 sleeping, 0 stopped, 0 zombie %Cpu(s): 12.3 us, 0.2 sy, 0.0 ni, 87.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 8096132 total, 5764756 used, 2331376 free, 45284 buffers KiB Swap: 8388604 total, 0 used, 8388604 free, 4942988 cached
Also, running ps -eLF during the gzip shows it running on a single processor (processor number 1), as you would expect. And there is only 1 thread.
bash-4.2$ ps -eLF | egrep 'UID|zip' | grep -v grep UID PID PPID LWP C NLWP SZ RSS PSR STIME TTY TIME CMD james 2376 1750 2376 90 1 1143 504 1 22:05 pts/1 00:00:04 gzip big.txt
Example 1 – Run Small Jobs in Parallel
If you have many files to compress, and many CPUs, it is quicker to zip them in parallel than one at a time. Putting jobs into the background with an ampersand (&) will make them run at the same time. Doing a “wait” command will make the shell wait until the last one completes. I had a directory containing hundreds of files, each 3 Gb in size, that needed to be compressed. It was easier to write a script than do it by hand:
PARALLEL=4 ls *.dbf | while read file do jobcount=$(( $jobcount+1 )) print "$jobcount gzip $file &" gzip $file & if [[ $jobcount -eq $PARALLEL ]] then print "waiting for $jobcount jobs to complete" wait print "$jobcount jobs completed" jobcount=0 fi done print "waiting for remaining jobs to complete" wait
The above piece of script will steadily compress all *.dbf files in the current directory, using up to 4 cpus at once, running up to 4 “gzip” commands at the same time.
The PARALLEL variable (4) is the maximum number of jobs that will run at once. As the system contained 10 CPUs, and was not busy, it was reasonable to commandeer 4 of them. Note the wait command in the inner block will be executed when there are 4 or more files yet to be compressed. The outer loop wait will execute when less than 4 remain. I actually have the above code in a cron job, housekeeping that folder.
Example 2 – Split a Big Task into Smaller Tasks
Try the same test again, but this time split up the big file in to 8 separate parts.
First, take a check sum of the file with good old cksum
bash-4.2$ cksum big.txt 1754962233 3345539400 big.txt
Now use the split command with “-n” to chop the file up:
-rw-rw-r--. 1 james james 3.2G Feb 5 21:46 big.txt bash-4.2$ split -n 8 big.txt bash-4.2$ ls -lh total 6.3G -rw-rw-r--. 1 james james 3.2G Feb 5 21:46 big.txt -rw-rw-r--. 1 james james 399M Feb 5 22:22 xaa -rw-rw-r--. 1 james james 399M Feb 5 22:22 xab -rw-rw-r--. 1 james james 399M Feb 5 22:22 xac -rw-rw-r--. 1 james james 399M Feb 5 22:22 xad -rw-rw-r--. 1 james james 399M Feb 5 22:22 xae -rw-rw-r--. 1 james james 399M Feb 5 22:22 xaf -rw-rw-r--. 1 james james 399M Feb 5 22:22 xag -rw-rw-r--. 1 james james 399M Feb 5 22:22 xah
…giving 8 file pieces of 399 Mb each, called xaa, xab, …, xah.
(split on some older versions of Solaris and Linux doesn’t have the -n switch, but you can use -b instead to get pieces of a given size)
Run the Compress Again, Parallelized
This time, 8 gzip tasks will be run in parallel. This takes a bit of thinking about. In order to run the jobs in parallel (ie. at the same time), they need to be launched in the background of the shell. But the time command only works on jobs running in the foreground.
One solution is to use the wait command, which simply waits for all background jobs to finish. Then the commands needed would be:
gzip xaa & gzip xab & gzip xac & gzip xad & gzip xae & gzip xaf & gzip xag & gzip xah & wait
I could put that in a script and invoke it with time, but it is quicker to just use awk.
bash-4.2$ time ls x* | awk '{print "gzip "$1" &"} END {print "wait"}' | sh -x + gzip xaa + gzip xab + gzip xac + gzip xag + wait + gzip xah + gzip xae + gzip xaf + gzip xad real 0m10.048s user 1m18.151s sys 0m1.564s bash-4.2$ ls big.txt xaa.gz xab.gz xac.gz xad.gz xae.gz xaf.gz xag.gz xah.gz
All of the data has been gzipped in about 10 seconds, one fifth of the time it took before. Notice that the “user” time was recorded as 1 minute 18 seconds, even though only 10 seconds of real time actually passed. This is just a quirk of the time command.
Looking at top during the parralized gzip operation, it is pretty obvious that all 8 CPUs were being hammered by 8 gzip commands:
top - 23:03:44 up 1:48, 2 users, load average: 0.61, 0.29, 0.23 Tasks: 193 total, 8 running, 185 sleeping, 0 stopped, 0 zombie %Cpu(s): 85.1 us, 1.5 sy, 0.0 ni, 11.6 id, 1.5 wa, 0.2 hi, 0.1 si, 0.0 st KiB Mem: 8096132 total, 7938740 used, 157392 free, 3020 buffers KiB Swap: 8388604 total, 6512 used, 8382092 free, 7096640 cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2733 james 20 0 4572 504 352 R 99.8 0.0 0:05.40 gzip 2735 james 20 0 4572 504 352 R 99.8 0.0 0:05.39 gzip 2736 james 20 0 4572 504 352 R 99.4 0.0 0:05.39 gzip 2738 james 20 0 4572 500 352 R 99.1 0.0 0:05.39 gzip 2734 james 20 0 4572 504 348 R 98.8 0.0 0:05.37 gzip 2737 james 20 0 4572 504 352 R 97.1 0.0 0:05.33 gzip 2731 james 20 0 4572 500 352 R 49.2 0.0 0:02.71 gzip 2732 james 20 0 4572 504 348 D 46.2 0.0 0:02.60 gzip
and ps agrees, showing one gzip job in each processor:
bash-4.2$ ps -eLF | egrep 'UID|zip' | grep -v grep UID PID PPID LWP C NLWP SZ RSS PSR STIME TTY TIME CMD james 2764 2763 2764 99 1 1143 504 5 23:05 pts/1 00:00:09 gzip xaa james 2765 2763 2765 99 1 1143 504 7 23:05 pts/1 00:00:09 gzip xab james 2766 2763 2766 99 1 1143 504 2 23:05 pts/1 00:00:09 gzip xac james 2767 2763 2767 99 1 1143 500 3 23:05 pts/1 00:00:09 gzip xad james 2768 2763 2768 99 1 1143 500 1 23:05 pts/1 00:00:09 gzip xae james 2769 2763 2769 99 1 1143 500 4 23:05 pts/1 00:00:09 gzip xaf james 2770 2763 2770 99 1 1143 504 6 23:05 pts/1 00:00:09 gzip xag james 2771 2763 2771 99 1 1143 504 0 23:05 pts/1 00:00:09 gzip xah
Glue it Back Together
Okay, our compressed data is still in 8 parts. Putting it back together is easy:
bash-4.2$ ls big.txt xaa.gz xab.gz xac.gz xad.gz xae.gz xaf.gz xag.gz xah.gz bash-4.2$ bash-4.2$ cat x* > all.gz
Now there is a file all.gz. Because of the way gzip works, this file won’t be identical to the original bigfile.txt.gz, but the decompressed data will be identical. To prove that , just gunzip it and do another checksum:
bash-4.2$ gunzip all.gz bash-4.2$ cksum all big.txt 1754962233 3345539400 all 1754962233 3345539400 big.txt
Uncompressed, the all file gives the same checksum as big.txt, also matching the original checksum taken at the outset of the test (1754962233)
Conclusion
This is just a rough example in the shell to show the effect of multi-processing. Real applications do it differently: buy dividing processes into many threads and running threads across different CPUs. For example, Firefox will run perhaps 40 threads and spread them over all available CPUs.
Note that the gzip test is definitely CPU bound. Repeating the test gives a similar time. Disk/cache effects can be discounted because CPU is the bottleneck in this case. Even if the while file were cached in memory, it would not improve execution speed.
Obviously, this technique applies only where the main operation is “associative”. That is, where the processed data is identical to the concatenation of the processed pieces of data. This is part of the design of several compress tools and is part of the gzip RFC.
No comments:
Post a Comment