CHTC cluster, HTCondor, Containerization
The UW-Madison Center for High Throughput Computing (CHTC) offers free computing accounts to all UW-Madison personnel and provides the full set of databases (about 2.5 terabytes) required to run the complete AlphaFold2 pipeline.
The CHTC is a large network of Linux servers to which “jobs” can be submitted with a scheduler program called HTCondor. This is the same method used on the Biochemistry Compute Cluster (BCC), for which I wrote an HTCondor tutorial with a useful, practical QuickStart section. However, the BCC is too small to hold the complete databases and does not have the useful, newer GPU chips.
Scheduling a job with HTCondor requires two files: a shell script containing all the instructions necessary to run the requested computations, and a submit script specifying the software to be used and listing the files that accompany the shell script. HTCondor jobs run by default in the “Vanilla Universe”, but to run AlphaFold2 we need to specify the “Container Universe”, which can accommodate Docker and Apptainer (formerly Singularity) containers.
I originally worked with Docker but switched to Apptainer along the way while trying to solve “problems” that, in the end, were not due to Docker. Partial code to build the container is located at https://github.com/jsgro/alphafold_singularity, and more details will be made available elsewhere. Interested users with a CHTC account can email me to get a copy of the container alphafold-2.3.2.sif on that server.
The submit script provides the “job requirements” such as the container image name and location, the files to be transferred, and the memory and computation resources requested (CPU, GPU, RAM…). The script below includes advanced settings, e.g. providing the sequence file as a variable, as well as the CHTC-specific settings for working with the GPU pool.
The multimer.sh script contains the information to run AlphaFold2 and requires 4 parameters in a specific order. The 3rd one is called DUMMY, a placeholder for the directory where the results are saved (I was never able to make this one a variable as I did for the sequence). These 4 arguments are passed to the runA3.sh script below, which in turn passes them on to the multimer.sh script (used in this example) or the monomer.sh script. These 2 scripts are available in the GitHub repository.
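As a sketch of that argument hand-off (the paths and file names are the example values used in this article, and the function below is hypothetical, not the actual runA3.sh), the four positional parameters travel from the submit file into runA3.sh and then into multimer.sh:

```shell
#!/bin/sh
# Hypothetical sketch of the argument flow described above; the real
# runA3.sh and multimer.sh are in the GitHub repository.
run_sketch() {
    # $1 database path, $2 FASTA file, $3 DUMMY placeholder, $4 max template date
    # runA3.sh replaces the DUMMY placeholder ($3) with the current directory
    echo "multimer.sh $1 $2 $(pwd) $4 -m multimer"
}
run_sketch /gpulab_data/alphafold L1_16x5.fa DUMMY 2023-06-01
```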
universe = container
container_image = alphafold-2.3.2.sif
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = PATH_TO_CONTAINER/alphafold-2.3.2.sif, $(FASTA_file), runA3.sh, multimer.sh
executable = runA3.sh
# arguments passed to executable SH file as $1 $2 $3 $4
arguments = /gpulab_data/alphafold L1_16x5.fa DUMMY 2023-06-01
requirements = (HasGpulabData == true)
request_GPUs = 1
+WantGPULab = true
require_gpus = (GlobalMemoryMb >= 60000)
request_cpus = 4
request_memory = 350 GB
request_disk = 220 GB
error = err_runAsif.$(Process)
output = out_runAsif.$(Process)
log = runAsif.$(Process).log
queue FASTA_file, jobName from (
    multimer.fasta, multi-test
)
The shell script (named runA3.sh above) contains the specific commands that will be run once the job has been accepted by a compute node for processing. Many extra commands can be included as “tests” by the user. Below is a simple but working example.
#!/bin/bash
# run multimer or monomer
bash ./multimer.sh $1 $2 $(pwd) $4 -m multimer
# save results and compress
tar cvf `basename $2 .fa`.tar `basename $2 .fa`
gzip `basename $2 .fa`.tar
ls -lh `basename $2 .fa`.tar.*
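The tar and gzip lines above derive the archive name from the FASTA file name ($2) by stripping the .fa suffix with basename. A self-contained sketch of that naming logic (the FASTA name and result file below are hypothetical stand-ins):

```shell
#!/bin/sh
fasta="L1_16x5.fa"                  # hypothetical FASTA name standing in for $2
base=$(basename "$fasta" .fa)       # strips the .fa suffix -> L1_16x5
mkdir -p "$base"                    # stands in for the AlphaFold results directory
echo "demo" > "$base/ranked_0.pdb"  # hypothetical result file
tar cf "$base.tar" "$base"          # archive the results directory
gzip -f "$base.tar"                 # compress to L1_16x5.tar.gz
ls -lh "$base".tar.*
```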
monomer and multimer scripts
These scripts will evolve over time, as the names of the databases installed on the cluster are likely to be updated.
Of note is the possibility to reduce the number of computed structures to minimize disk space usage: adding the line “--num_multimer_predictions_per_model=2” at the end of the multimer script will reduce the output to 2 structures for each of the 5 AlphaFold models, i.e. 10 PDB files in total instead of 25.
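The arithmetic behind that count can be sketched as follows (the multimer pipeline always runs 5 models; the flag sets how many predictions each model produces):

```shell
#!/bin/sh
models=5                  # AlphaFold2 multimer runs 5 models
predictions_per_model=2   # set by --num_multimer_predictions_per_model=2
total=$((models * predictions_per_model))
echo "$total PDB files"   # 10 instead of the default 25 (5 models x 5 predictions)
```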