Preliminary words

This set of tutorials constitutes our “recipe cookbook” explaining how to prepare the input files (receptor and fragment library) for SEED. This is not by any means the only way to do it, and you probably can find better ways yourself, especially if you are familiar with preparing structural inputs for docking or Molecular Dynamics software.

In the following we will make use of many custom scripts (you can find them in the scripts folder) and rely on external software, most of which is open source or free for academics. In particular, for specific tasks, we will use:

Most of the tasks for which we suggest one of these softwares can be carried out with alternative ones that you might be more comfortable with. Thus you do not need to have all the software from the list installed on your system.

Before starting the input preparation tutorials read about the Input Files to understand the format for structural input read by SEED.

To run the code provided in this tutorial as is, you should set the PYTHONSCRIPTS variable to "SEED/scripts/python".

Note that the Python scripts we provide were written in Python 2. A few of them will require minor modifications to be run with Python 3.

Ligand Preparation from a PDB

In this tutorial we will learn how to prepare a ligand for SEED docking starting from a PDB file. This can be useful in a typical redocking experiment scenario. Let us assume for this case that you know the protonation state and the conformation of your ligand and you don’t want to change them.

After you have extracted the ligand from the pdb file, convert it to mol2 format:

obabel -ipdb lig.pdb -p 7.4 -omol2 > lig.mol2

Check carefully the protonation state and verify that you are really getting the molecule you want as the automatic detection of bonds from a pdb file is not always correct.

Now clean the mol2 file (the script gives unique names to atoms, set the residue number to 1 and the name to LIG):

python ${PYTHONSCRIPTS}/mol2_cleaner.py lig.mol2

Generate the CGenFF parameters for the ligand:

cgenff lig.mol2 > lig.str

Combine lig.mol2 with the charges and atom types from the CGenFF stream file lig.str:

python ${PYTHONSCRIPTS}/mol2ori_to_mol2seed4_cgenff4_singlefiles.py lig.mol2 lig.str out.mol2

Protein Preparation (OLD-STYLE)

It is now recommended to prepare the protein receptor with CAMPARI, because it allows to carry out all the necessary steps (reconstruction of missing parts, protonation, relaxation, minimization) within the same software and it directly writes a SEED-compatible mol2 file for the receptor. Further details can be found in Protein preparation with CAMPARI.

For legacy reasons we also present an alternative workflow for the preparation of the receptor, as detailed below.

In this tutorial we will learn how to prepare a receptor for SEED docking. We assume here that the starting point is a protein from the PDB with a ligand inside the protein binding pocket and some structural water molecules we want to keep as part of the receptor. In order to start, select the chains that interest you, and save the different components in your system in three separate files:

  1. protein.pdb
  2. waters.pdb containing explicit waters you want to keep
  3. ligand.pdb containing a ligand, in order to minimize the protein in presence of the ligand.

Prepare the ligand mol2 and stream file as in Tutorial Ligand Preparation from a PDB. Here we also reconvert the mol2 used for generating the stream file to a PDB (so that the residue name, indices, atom names, … are identical in the PDB and STR we will give to CHARMM)

obabel -imol2 lig.mol2 -opdb > lig.pdb

As in this tutorial we will use CHARMM for the protein preparation we first need to have a CHARMM-proof PDB file (CHARMM naming convention for atom and residues):

  • Extract the protein chain of interest and run psfgen on it. psfgen is a plugin for VMD and information about it can be found at http://www.ks.uiuc.edu/Research/vmd/plugins/psfgen/

    The input file for psfgen psf_protein.gen (I/O names and paths to be adapted) is in scripts/psfgen_files. To avoid any problems with CHARMM, renumber residues from 1. (you can use the python script PDB_cleaner_protein.py for this task).

  • Prepare a CHARMM PDB file of the explicit structural water molecules: this is the same as for the protein, but using psf_waters.gen as input to psfgen. Once again, to avoid any problems, renumber residues from 1.

The described steps can be done by running:

python ${PYTHONSCRIPTS}/PDB_cleaner_protein.py prot.pdb out.pdb
psfgen psf_protein.gen > log_protein
psfgen psf_waters.gen > log_waters

Good alternatives to prepare a CHARMM-proof receptor file are to use the CHARMM-GUI web server or the CAMPARI software package (using keyword PDB_W_CONV 3 to write CHARMM compatible pdb files).

Now run the minimization with CHARMM (see file H_min_rdie.inp). Watch out to replace all the I/O names and paths properly in the file. Also pay attention to set the correct number of water molecules.

charmm < H_min_rdie.inp > log

Extract the protein and water chains from receptor_min.pdb and convert it to a mol2 file (we use UCSF Chimera for this task, as up to our knowledge, UCSF Chimera is one of the only few free tools that can “perfectly” handle a mol2 file of a protein).

chimera prot_water.pdb

In the GUI of Chimera click on File => Save Mol2. Be sure to check “Use untransformed coordinates” and nothing else; then enter the file name and save.

Always check that you have what you expected and wanted in your output!

Now run the script to add the atom types and charges taken from CHARMM force field and contained in the file top_all36_prot.rtf.

python ${PYTHONSCRIPTS}/mol2tripos_to_seed_protein.py prot_water.mol2 ${CHARMMFILES}/top_all36_prot.rtf out_forseed.mol2

Pay attention that this script retrieves atom types and charges ONLY if the atom names in the mol2 file are consistent with the CHARMM topology file. TIP3 waters are also recognized, but the termini are not. This means that for the moment you have to fix by hand the few atom types and atom charges corresponding to your termini (charged, capped…). An alternative is to copy atom names and partial charges to the mol2 file taking them directly from the psf topology file generated by psfgen.

Library Preparation

In this tutorial we will learn how to prepare a chemical library for a prospective docking campaign with SEED. Before starting with the actual steps of the preparation there are a few preliminary points which should be taken into account:

  • Choose your library carefully and tailor it to your needs. Consider what you are interested in: small fragments? interaction with which side chains? charged molecules? | Most of the times it makes sense to pre-filter the library before running the docking, as the less noise, the better results. Does it make sense to have compounds with 7 rotatable bonds? With 5 chiral centers? With a logP of 6? With only 3 heavy atoms? Without any rings? With aggregator structures, or PAINS?

    In general, for the use with SEED, we suggest to choose compounds with: logP <= 3, rotatable bonds <= 5, N ring > 1, HAC > 5 and < 40.

    It is important to note that SEED can perform only rigid fragment docking, hence it does not account for ligand flexibility. The workaround to this is to pre-generate multiple conformers for each ligand and dock all of them separately. Of course this can work reasonably well for molecules with only a few rotatable bonds.

  • The most common format for chemical libraries is SDF and there are a few things to carefully check, according to the source of your library: Are your compounds named properly in the SDF file? It is easier to solve this before preparing the library than afterwards. Are there any unwanted elements, such as counter salts or mixtures in a single molecule? Do you have chiral centers? If yes, is the chemical library selling enantiopure compounds or solely mixtures? If it is a mixture, do you have all stereoisomers in the original library or should you generate them?

  • Never trust what you have prepared. Painfully double check as much as you can. That includes opening the output file in a text editor and verify it complies to the format you have decided to output (traditionally SDF for a normal library or MOL2 for screening with SEED); extracting n random molecules (10:sup:2) and visualizing them in PyMOL. If it does not display them properly, they are probably not compliant with the format. Always remember that most docking software are not very tolerant to mistakes in the input format.

In order to make it easier to trace back results from output to input, we highly recommend to give unique names to the fragments in the input library. Our convention (which we will follow in this tutorial) is to append to the fragment name an index for the tautomer/protomer and an index for the conformer.

As starting point we assume to have a collection of ligands in a single SDF file without defined conformations and protonation states.

  1. We first use ChemAxon MolConverter to remove counterions from the molecules.

    molconvert -F sdf library.sdf -o library_nosalt.sdf
    
  2. Then we proceed by generating tautomers with ChemAxon and keep only the ones with occupancy above a threshold (in this example 24.9%):

    cxcalc dominanttautomerdistribution -H 7.2 -f sdf -t "tauto_occupancy" library_nosalt.sdf > tautodistrib.sdf
    python ${PYTHONSCRIPTS}/sdf_select_bytag_nordkit.py tautodistrib.sdf maintauto.sdf 24.9
    

    Note that the Python script appends “tauto_number” to the molecule name so that it is unique.

  3. Now we generate the conformers and prune the ones which are too similar in terms of RMSD:

    mkdir 100conf_075rmsd/
    python ${PYTHONSCRIPTS}/sdf_conformergen_outsplit.py maintauto.sdf 100 0.75 100conf_075rmsd/ 4
    cd 100conf_075rmsd/ ; ls | sed 's/.sdf//g' > ../conformers_tautomers_original_full.list ; cd ..
    

    Note that this script outputs each structure separately.

  4. Convert the SDF files to MOL2. We use CORINA for this task but you can choose any alternatives you prefer:

    mkdir mol2_split
    for i in `cat conformers_tautomers_original_full.list`; do
      corina -i t=sdf 100conf_075rmsd/${i}.sdf -o t=mol2 -d no3d -d newtypes -o fcharges |egrep -v "\#" | awk 'NF'  > mol2_split/${i}.mol2
    done
    

    We now tar the temporary folder as we will not need it in what follows:

    tar -zcf 100conf_075rmsd_split.tgz 100conf_075rmsd/ && rm -r 100conf_075rmsd/
    
  5. We can now generate the CGenFF parameters. In order to save time we generate just the parameters for ONE conformer of each tautomer, as in the fixed charge model we use, parameters do not depend on the conformation (but of course different tautomers of the same molecules need different parameters). First we write the list of unique tautomers:

    egrep "_conf_1$" conformers_tautomers_original_full.list > original_tautomers_firstconf.list
    

    Then we can generate the parameters:

    mkdir cgenff_param
    for i in `cat original_tautomers_firstconf.list`; do
      a=`echo $i | sed 's/_conf_1//g'`
      cgenff mol2_split/${i}.mol2 > cgenff_param/${a}.str
    done &> /dev/null
    

    For molecules which did not go through parametrization succesfully, CGenFF generates non-empty files with empty parameters, and we should get rid of them. In addition, for further putative use in CHARMM, it may turn useful to rename the residue name to the standard identifier “LIG”.

    cd cgenff_param
    mkdir ../cgenff_clean
    for i in *.str ; do
      [[ `egrep "RESI" ${i}` ]] && sed -r 's/RESI ......../RESI LIG     /g' ${i} > ../cgenff_clean/${i}
    done &> /dev/null
    
    cd ../cgenff_clean ; ls | sed 's/.str//g' > ../tautomers_firstconf_cgenffparam.list ; cd ../
    for i in `cat tautomers_firstconf_cgenffparam.list` ; do
      grep $i conformers_tautomers_original_full.list ;
    done > tautomers_conformers_cgenffparam.list
    
  6. Finally we create the MOL2 library file for SEED:

    mkdir mol2seed
    for i in `cat tautomers_conformers_cgenffparam.list` ; do
      a=`echo $i | sed -r 's/_conf_[0-9]*//g' `
      python ${PYTHONSCRIPTS}/mol2ori_to_mol2seed4_cgenff4_singlefiles.py mol2_split/${i}.mol2 cgenff_clean/${a}.str mol2seed/${i}_seed.mol2
    done
    

    At this point different conformer of the same fragment (or tautomer) have the same name . To avoid any ambiguity we rename them in the MOL2 file and as a final step we reconcatenate all the files into a unique one.

    cd mol2seed ; for i in *; do sed -i "s/${i%_conf*}/${i%_seed.mol2}/" $i; done; cd ..
    cd mol2seed ; for i in * ; do cat $i >> ../library_seed.mol2 ; done ; cd ..
    

The chemical library is now ready to be docked by SEED. The steps for this tutorial can be run all together using the bash script library_preparation.sh.