AlphaFold2 with full databases and multimer option

Summary

This article describes running AlphaFold2 with full databases at the UW-Madison Center for High Throughput Computing (CHTC) and is part of an article series titled Five ways to run AlphaFold.

CHTC cluster, HTCondor, Containerization

The UW-Madison Center for High Throughput Computing (CHTC) offers free computing accounts to all UW-Madison personnel and hosts the full set of databases (about 2.5 terabytes) needed to run the full AlphaFold2 algorithm.

Predicted quaternary structure (no heme or iron.)

The CHTC is a complex network of Linux computers to which “jobs” can be submitted with a scheduler program called HTCondor. This is the same method used on the Biochemistry Compute Cluster (BCC), for which I wrote an HTCondor tutorial with a practical QuickStart section. However, the BCC is too small to hold the complete databases and does not have the newer GPU chips.

Scheduling a job with HTCondor requires 2 files: a shell script containing all the instructions needed to run the requested computations, and a submit script specifying the software used and listing the files that accompany the shell script. HTCondor jobs run by default in the “Vanilla Universe”, but to run AlphaFold2 we need to specify the “Container Universe”, which can accommodate Docker and Apptainer (formerly Singularity) containers.
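The two-file workflow can be sketched as a small pre-submission check. This is only an illustration: the function name submit_job and the file name alphafold.sub are hypothetical, while condor_submit and condor_q are the standard HTCondor commands for submitting a job and listing the queue.

```shell
#!/bin/bash
# Return 0 only if every file given as argument exists and is not empty.
check_job_files() {
    local f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}

# Hypothetical wrapper: verify the submit script ($1) and the shell
# script ($2), then hand the job to the scheduler and show the queue.
submit_job() {
    check_job_files "$1" "$2" || return 1
    condor_submit "$1"
    condor_q
}

# Usage (file names are placeholders):
#   submit_job alphafold.sub runA3.sh
```

Checking the files first avoids submitting a job that would be put on hold because an input file is missing.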

SIF container

I originally worked with Docker but switched to Apptainer along the way while trying to solve “problems” that, in the end, were not due to Docker. Partial code to build the container is located at https://github.com/jsgro/alphafold_singularity and more details will be made available elsewhere. Interested users with a CHTC account can email me to get a copy of the container alphafold-2.3.2.sif on that server.

Submit script

The submit script provides the “job requirements”, such as the container image name and location, the files to be transferred, and the requests for memory and computing resources (CPU, GPU, RAM…). The script below has advanced settings, e.g. providing the sequence file as a variable, and also contains the CHTC-specific settings for working with the GPU pool.

The multimer.sh script contains the information to run AlphaFold2 and requires 4 parameters in a specific order. The 3rd one is called DUMMY as a placeholder for the directory in which to save the results (I was never able to make this one a variable as I did for the sequence). These 4 arguments are received by the runA3.sh script below, which in turn passes them on to the multimer.sh script (this example) or the monomer.sh script. These 2 scripts are available in the GitHub repository.

universe = container
container_image = alphafold-2.3.2.sif
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = PATH_TO_CONTAINER/alphafold-2.3.2.sif, $(FASTA_file), runA3.sh, multimer.sh
executable = runA3.sh
# arguments passed to executable SH file as $1 $2 $3 $4
# ($2 should name the FASTA file that is transferred with the job)
arguments = /gpulab_data/alphafold  L1_16x5.fa DUMMY  2023-06-01

requirements = (HasGpulabData == true)
request_GPUs = 1
+WantGPULab = true
require_gpus = (GlobalMemoryMb >= 60000)
request_cpus =  4
request_memory = 350 GB
request_disk = 220 GB

error   = err_runAsif.$(Process)
output  = out_runAsif.$(Process)
log     = runAsif.$(Process).log

queue FASTA_file jobName from (
multimer.fasta   multi-test
)
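When several sequences are to be queued, the list of items can be generated rather than typed. The sketch below is a hypothetical helper (make_queue_list is not part of the repository) that prints one “file jobName” line per FASTA file. It assumes that the FASTA files sit in the same directory as the submit script and that the job name is simply the file name without its extension.

```shell
#!/bin/bash
# Print one "file jobName" line per FASTA file found in a directory,
# ready to paste between the parentheses of a "queue ... from ( )" statement.
# Assumption: the job name is the file name without its extension.
make_queue_list() {
    local dir="${1:-.}"
    local f base
    for f in "$dir"/*.fa "$dir"/*.fasta; do
        [ -e "$f" ] || continue        # skip patterns that matched nothing
        base=$(basename "$f")
        printf '%s   %s\n' "$base" "${base%.*}"
    done
}

# Usage: list the FASTA files of the current directory
#   make_queue_list .
```

Each printed line supplies the two variables named in the queue statement, FASTA_file and jobName, so one job is created per line.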

Shell script

The shell script (named runA3.sh above) contains the specific commands that will be run once the job has been accepted by a compute node for processing. Many extra commands can be included as “tests” by the user. Below is a simple but working example.

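#!/bin/bash
# $1 $2 $3 $4 come from the "arguments" line of the submit script;
# $3 (DUMMY) is replaced here by the current working directory
# run multimer or monomer
bash ./multimer.sh "$1" "$2" "$(pwd)" "$4" -m multimer
# save results and compress
# (the results directory is named after the FASTA file without its .fa extension)
name=$(basename "$2" .fa)
tar cvf "$name.tar" "$name"
gzip "$name.tar"
ls -lh "$name.tar".*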
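The archive name above is derived with basename and a fixed .fa suffix, so a file named multimer.fasta would keep its extension. As a minimal sketch, assuming that the results directory is named after the FASTA file without its extension (as the script above does), the two hypothetical functions below accept either extension and create the compressed archive in a single tar step.

```shell
#!/bin/bash
# Derive the results directory name from a FASTA file name, accepting
# either a .fa or a .fasta extension (the script above strips .fa only).
results_name() {
    local base
    base=$(basename "$1")
    case "$base" in
        *.fasta) echo "${base%.fasta}" ;;
        *.fa)    echo "${base%.fa}" ;;
        *)       echo "$base" ;;
    esac
}

# Archive and compress the results directory in one step (tar czf),
# then list the archive with its size.
archive_results() {
    local name
    name=$(results_name "$1")
    tar czf "$name.tar.gz" "$name" && ls -lh "$name.tar.gz"
}

# Usage inside the shell script, where $2 is the FASTA file:
#   archive_results "$2"
```

The resulting file has the same name as the one produced by the separate tar and gzip commands.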

monomer and multimer scripts

These scripts will evolve over time, as the names of the databases installed on the cluster are likely to be updated.

The current scripts are monomer.sh and multimer.sh; the latter requires adding “-m multimer” within the shell script.

Of note is the possibility of reducing the number of computed structures to minimize disk space use. Adding the line “--num_multimer_predictions_per_model=2” at the end of the multimer script reduces the output to 2 structures for each of the 5 AlphaFold models. This lowers the total number of predicted structures from 25 to 10.
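The arithmetic behind these totals is simply the number of models multiplied by the number of predictions per model. A minimal sketch (total_predictions is a hypothetical helper, not part of the scripts):

```shell
#!/bin/bash
# Number of structures predicted in one multimer run:
# (number of models) x (predictions per model).
total_predictions() {
    local models="$1"
    local per_model="$2"
    echo $(( models * per_model ))
}

total_predictions 5 5   # 5 predictions per model: prints 25
total_predictions 5 2   # with --num_multimer_predictions_per_model=2: prints 10
```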