Running SEED on a clusterΒΆ

SEED has a parallel version based on the Message Passing Interface (MPI) that can speed up docking campaigns on multicore cluster architectures.

In order to install it you need a C-compiled MPI implementation (for example OpenMPI) against which to link the code. For manually installed libraries you might also have to modify the Makefile.local and provide the name and path of the specific MPI library you want to use. You can compile SEED MPI implementation by running:

cd SEED/src
make seed_mpi

SEED MPI version implements a simple parallelization strategy in which each process screens a subset of the library against the same target. In order to start a SEED parallel run use:

mpirun -np N seed_4_mpi seed.inp >& log

where N is the number of required MPI ranks.

Two reading modes are available in the parallel version, single and multi, which can be specified in the third line of i7.

The single mode is recommended when running on a cluster as it requires no pre-processing of the library mol2 file and implements dynamic load balance. The master rank reads each molecule and dispatches it to the first available rank. In this way as soon as a rank is idle after finishing the previous computation, it will receive a new molecule from the master rank. Note that the master rank only reads the input file, but does not do any additional computation related to the screening.

The multi mode is the old reading mode originally implemented in SEED, without dynamic load balance; each process reads from a separate file, which means that the library has to be split beforehand into N mol2 files. Depending on the way the library was prepared, there might be some patterns (e.g., the ligands are ordered by molecular weight) which could result in degraded performance as some processes will be faster than others. In order to avoid this problem we recommend shuffling the library before splitting it. We provide scripts for both shuffling and splitting the library:

python library_name.mol2
bash library_name_shuffled.mol2 N output_name

The shuffled library will be suffixed _shuffled.mol2. The N files will be named output_name_part${i}.mol2 with i going from 0 to N-1. Note that output_name could also contain a path as it might be desirable to save the split files into a different folder. In the input file seed.inp you have to provide the basename of the library (in this case output_name.mol2) as input file, omitting the suffix. Each MPI rank i will look for its corresponding library file output_name_part${i}.mol2.

Note that, irrespective of the chosen reading mode, each rank will write log information, poses, and energies to separate output files. Each output file will be suffixed with _part${i}, where i is the rank number (the master rank being 0).

If needed, the output stuctural files and energy tables can be easily recombined together by concatenation:

cat seed_best_part*.mol2 > seed_best_all.mol2
awk 'FNR!=1' seed_best_part*.dat > seed_best_all.dat

where FNR!=1 removes the header line.

The multi reading mode can be useful when reanalyzing SEED output poses, without having to concatenate the output files, or when running with a limited number of MPI processes, as with single the master rank only reads and dispatches molecules without doing any computation. For the serial version the chosen reading mode is inconsequential as only one process will be started.

In general the SEED parallel execution is targeted towards the use on a cluster; We therefore provide a template submission script for systems using the SLURM job scheduler.