Parallel Computing
Many of the data processing operations are data-parallel: different traces, shot gathers, frequency slices, etc. can be processed independently. Madagascar provides several mechanisms for handling this type of embarrassingly parallel applications on computers with multiple processors.
OpenMP (internal)
OpenMP is a standard framework for parallel applications on shared-memory systems. It is supported by the latest versions of GCC and by some other compilers.
To use OpenMP in your program, you do not need to add anything to your SConstruct. Just make sure the OMP libraries are installed on your system before you configure Madagascar, (or -- reinstall them and rerun the configuration command). Of course, you need to use the appropriate pragmas in your code. To find Madagascar programs that use OpenMP and that you can take as a model, run the following command:
<bash> grep "pragma omp" $RSFSRC/user/*/*.c |\ awk -F ':' '{ print $1 }' |\ uniq |\ awk -F '/' '{ print $NF }' |\ grep M </bash> On the last check (2011-08-10), 84 standalone programs (approximately 10% of Madagascar programs) were using OMP. Running this command in the directory $RSFSRC/api/c will yield a few functions parallelized with OMP (among which a Fourier Transform).
OpenMP (external)
To run on a multi-core shared-memory machine a data-parallel process that does not contain OpenMP calls, use sfomp. Thus, a call like <bash> sfradon np=100 p0=0 dp=0.01 < inp.rsf > out.rsf </bash> becomes <bash> sfomp sfradon np=100 p0=0 dp=0.01 < inp.rsf > out.rsf </bash> sfomp splits the input along the slowest axis (presumed to be data-parallel) and runs it through parallel threads. The number of threads is set by the OMP_NUM_THREADS environmental variable or (by default) by the number of available CPUs.
MPI (internal)
MPI (Message-Passing Interface) is the dominant standard framework for parallel processing on different computer architectures including distributed-memory systems. Several MPI implementations (such as Open MPI and MPICH2) are available.
An example of compiling a program with mpicc and running it under mpirun can be found in $RSFSRC/book/rsf/bash/mpi/SConstruct
MPI (external)
To parallelize a task using MPI but without including MPI calls in your source code, try sfmpi, as follows: <bash> mpirun -np 8 sfmpi sfradon np=100 p0=0 dp=0.01 input=inp.rsf output=out.rsf </bash> where the argument after -np specifies the number of processors involved. sfmpi will use this number to split the input along the slowest axis (presumed to be data-parallel) and to run it through parallel threads. Notice that the keywords input and output are specific to sfmpi and they will be used to specify the standard input and output streams of your program.
Some MPI implementations do not support system calls implemented in sfmpi and therefore will not support this feature.
MPI + OpenMP (both external)
It is possible to combine the advantages of shared-memory and distributed-memory architectures by using OpenMP and MPI together. <bash> mpirun -np 32 sfmpi sfomp sfradon np=100 p0=0 dp=0.01 input=inp.rsf output=out.rsf </bash> will distribute the job on 32 nodes and split it again on each node using shared-memory threads.
pscons
To get SCons to cut your inputs into slices, run in parallel on one multi-cpu workstation or on multiple cluster nodes and then collect, use the pscons wrapper to scons. Just running pscons with no special environment variable set is equivalent to running scons -j nproc, where nproc is the auto-detected number of threads on your system. To fully use the potential of pscons for running on a distributed-memory computer, you need to set the environment variables RSF_CLUSTER and RSF_THREADS, and to use a split argument in your SConstruct Flow statements where appropriate.
Computing on the local node only by using the option local=1
By default, with pscons, SCons wants to run all the commands of the SConstruct file in parallel. The option local=1 forces SCons to compute locally. It can be very useful in order to prevent serial parts of your python script to be run inefficiently in parallel.
<python> Flow('spike',None,'spike n1=100 n2=300 n3=1000',local=1) </python>
Computing on the nodes of the cluster specified by the environment variable $RSF_CLUSTER
<python> Flow('radon','spike','radon adj=y p0=-4 np=200 dp=0.04',split=[3,1000],reduce='cat') </python>
The option split instructs Flow to split the input file along the third axis of length 1000. If you have several source files and want to split only some of them, say the first and the third one, the option to use will be split=[3,1000,[0,2]].
If we choose $RSF_THREADS=26, we obtain, as an itermediate result in the local directory, the files spike__0.rsf, spike__1.rsf, ..., spike__25.rsf, which are sent and distributed for computation on the different nodes specified by $RSF_CLUSTER. After the parallel computation on the nodes, the resulting files radon__0.rsf, radon__1.rsf, ..., radon__25.rsf, are recombined together to create the output radon.rsf. The parameter reduce selects the type of recombination. Two typical options are reduce='cat' or reduce='add'.
Computing in parallel without using any option
This choice is appropriate when you write a python loop in your program and want it to be run in parallel. This is a way, as well, to speed up sequential parts of your program. However, the user should make judicious decisions as it can have the opposite effect. Indeed, in a serial part of the program, the second command has to wait for the first to finish the run on a different node and to communicate it.
<python> Flow('spike',None,'spike n1=100 n2=300 n3=1000') Flow('radon','spike','radon adj=y p0=-4 np=200 dp=0.04') </python>
In our example, we used 26 threads and send them on 4 nodes, using
respectively 6 CPUs on the first node, 4 CPUs on the second, and 8 CPUs
on each of the last two nodes.
<bash> export RSF_THREADS=26 export RSF_CLUSTER='140.168.1.236 6 140.168.1.235 4 140.168.1.234 8 140.168.1.233 8' </bash>
One important setting is to properly manage the temporary files location specified by $TMPDATAPATH and the data storage location specified by $DATAPATH . The temporary files used during the computation have to be stored locally on each node to avoid too much communication between the hard disks and the nodes. The paths will depend on your cluster and you can set them in your .bashrc file, for example:
<bash> export DATAPATH=/disk1/data/myname/ export TMPDATAPATH=/tmp/ </bash>
Run
Once your SConstruct file is ready and your environment variables are set, you can use the following suggested procedure. It has been tested and is currently used on a linux cluster.
- Make sure the disk located at $DATAPATH is mounted on the different nodes.
- Test if there is enough space available on the different nodes of the cluster at the location specified by $TMPDATAPATH. This directory may be filled up, if some jobs have been interrupted. Clean this up if necessary.
- Look at what is going on on your cluster with sftop.
- Everything looks good ? Then go and run pscons instead of scons.
- If you need to kill your processes on the cluster, the command sfkill can do it remotely on all the nodes for a specific job command. If you kill your jobs, check it did not filled up the $TMPDATAPATH with temporary files before you run pscons again.
One nice feature of running SCons on clusters is fault tolerance (see relevant blog post).