Many data-processing operations are data-parallel: different traces, shot gathers, frequency slices, etc. can be processed independently. Madagascar provides several mechanisms for handling this type of embarrassingly parallel application on computers with multiple processors.
- OpenMP
- MPI
- MPI + OpenMP
- SCons
OpenMP is a standard framework for parallel applications on shared-memory systems. It is supported by the latest versions of GCC and by some other compilers.
To run a data-parallel processing task on a shared-memory computer with multiple processors (such as a multi-core PC), try sfomp, as in the sketch below.
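A minimal sketch, reusing the sfradon parameters and the inp.rsf/out.rsf file names that appear in the example output at the end of this section:

    # serial run of a data-parallel task
    sfradon np=100 p0=0 dp=0.01 < inp.rsf > out.rsf

    # same task, with the input split across OpenMP threads
    sfomp sfradon np=100 p0=0 dp=0.01 < inp.rsf > out.rsf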
sfomp splits the input along the slowest axis (presumed to be data-parallel) and runs it through parallel threads. The number of threads is set by the OMP_NUM_THREADS environment variable or, by default, by the number of available CPUs.
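To set the thread count explicitly, define the variable before running the job (the value 4 below is arbitrary):

    export OMP_NUM_THREADS=4   # use 4 threads instead of one per available CPU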
MPI (Message Passing Interface) is a standard framework for parallel processing on different computer architectures, including distributed-memory systems. Several MPI implementations (such as MPICH) are available.
To parallelize a task using MPI, try sfmpi, as follows:
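One possible invocation, assuming an MPI installation that provides mpirun and reusing the sfradon example from this section; in this sketch, sfmpi is given its input and output files as input= and output= parameters rather than through Unix redirection:

    # run the task on 8 MPI processes
    mpirun -np 8 sfmpi sfradon np=100 p0=0 dp=0.01 input=inp.rsf output=out.rsf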
where the argument after -np specifies the number of processes involved. sfmpi uses this number to split the input along the slowest axis (presumed to be data-parallel) and to run the pieces through parallel processes.
Note: some MPI implementations do not support the system calls used by sfmpi and therefore cannot be used with this option.
It is possible to combine the advantages of shared-memory and distributed-memory architectures by using OpenMP and MPI together.
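A hedged sketch of a combined run; the node count of 32 matches the description below, and the file names and sfradon parameters are again illustrative:

    # 32 MPI processes, each running OpenMP threads on its node
    mpirun -np 32 sfmpi sfomp sfradon np=100 p0=0 dp=0.01 input=inp.rsf output=out.rsf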
This will distribute the job across 32 nodes and split it again on each node using shared-memory threads.
If you process data using SCons, another option is available. Change the Flow command in your SConstruct file to include a split= parameter, as in the before-and-after sketch below.
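A minimal sketch, assuming the flow runs the sfradon example from this section and that the slowest (third) axis of inp.rsf has 213 samples; the axis length must match your actual data:

    # before: a serial flow
    Flow('out','inp','radon np=100 p0=0 dp=0.01')

    # after: split axis 3 (213 samples long) into data-parallel chunks
    Flow('out','inp','radon np=100 p0=0 dp=0.01',split=[3,213])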
The optional split= parameter specifies the axis that needs to be split and the size of this axis.
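Then run something like the command below; the host name node1.utexas.edu comes from the example output, and the job and process counts are illustrative:

    # 8 parallel jobs, distributed over the local machine and one remote node
    scons -j 8 CLUSTER='localhost 4 node1.utexas.edu 4'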
The -j option instructs SCons to run in parallel with 8 threads, while the CLUSTER= option supplies it with the list of nodes to use and the number of processes to run on each node. The output may look like
    < inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=0 squeeze=n > inp__0.rsf
    < inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=42 squeeze=n > inp__1.rsf
    /usr/bin/ssh node1.utexas.edu "cd /home/test ; /bin/env < inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=84 squeeze=n > inp__2.rsf "
    < inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=126 squeeze=n > inp__3.rsf
    < inp.rsf /RSFROOT/bin/sfwindow n3=42 f3=168 squeeze=n > inp__4.rsf
    /usr/bin/ssh node1.utexas.edu "cd /home/test ; /bin/env < inp.rsf /RSFROOT/bin/sfwindow f3=210 squeeze=n > inp__5.rsf "
    < inp__0.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__0.rsf
    /usr/bin/ssh node1.utexas.edu "cd /home/test ; /bin/env < inp__1.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__1.rsf "
    < inp__3.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__3.rsf
    /usr/bin/ssh node1.utexas.edu "cd /home/test ; < inp__4.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__4.rsf "
    < inp__2.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__2.rsf
    < inp__5.rsf /RSFROOT/bin/sfradon p0=0 np=100 dp=0.01 > out__5.rsf
    < out__0.rsf /RSFROOT/bin/sfcat axis=3 out__1.rsf out__2.rsf out__3.rsf out__4.rsf out__5.rsf > out.rsf
Splitting the input with sfwindow and putting the output back together with sfcat are immediately apparent. An advantage of the SCons-based approach (in addition to documentation and reproducible experiments) is fault tolerance: if one of the nodes fails during processing, the computation can be restarted without recreating the parts that have already been computed.
All these options will continue to evolve and improve with further testing. Please report your experiences and suggestions.