Summary
Combine software and scripts running in Docker and on a local macOS computer (Intel amd64 or Apple Silicon arm64, M series) to successfully follow the Rosetta tutorial Ligand Docking with a G-Protein Coupled Receptor. This method gives access to native OS speed for the docking run while still completing all the preparatory and exploratory steps that fail, or are too complex to set up, on the local computer.
NOTE: The complete tutorial is now on page: Rosetta – Ligand Docking and extends materials from this blog.
Rosetta
The Rosetta software suite includes algorithms for computational modeling and analysis of protein structures. […] including de novo protein design, enzyme design, ligand docking, and structure prediction of biological macromolecules and macromolecular complexes.
The software is rather complex and it can be difficult to “make things work” considering the breadth of options for algorithms, hardware, and operating system support.
This post is an attempt to help users who want to run the software on their native OS (macOS). Since some functionality built into the tutorials assumes a Linux OS, some of the steps are in fact more easily handled on the Linux side, thanks to Docker.
Preparations
These instructions should benefit macOS users primarily.
Users should be somewhat familiar with the bash command line (see e.g. my Survival Command Line tutorial) and have the Docker Desktop software installed (see e.g. my tutorial Docker – Beginner for Biologists).
The Rosetta Docker image can be downloaded from Terminal with the command below, assuming that Docker is already installed:
docker pull rosettacommons/rosetta:latest
While the purpose of the Docker image is to provide access to all the Linux compiled binaries, we will take advantage of some of the Linux functionality as well as the installed Python.
Rosetta preparations
Rosetta is freely available for academic and non-commercial purposes, under license. The software can be downloaded from the links provided on the Download page.
To run the docking computation “natively” (for faster results) on the local computer, users should download the newest “release”, e.g. from the Academic download page. For this post I used Rosetta 3.14 for Apple Silicon Macs (“M1 binaries”, 13 GB); Intel-based Mac users should download the “Mac binaries” (14 GB). The unarchived file will require about 45 GB of disk space but will contain the material for all tutorials and demos.
Getting Started
We will use 2 Terminal sessions: one to navigate within the Macintosh natively, the other to run a Docker container started in such a way that both Terminal sessions share the same directory on the local computer.
Environment variables
On the Macintosh side it will be useful to create environment variables as suggested in the Rosetta Commons section How To Read These Tutorials. Assuming that the binaries are found within the Downloads directory, we can keep that location and its default name. For the M1 series the unarchived directory was called rosetta.binary.m1.release-371 and the following variables were created. I replaced my username with $USER so that these commands become generic and can be copied, with the caveat that the release directory name might be different (change accordingly!).
Open a Mac Terminal (/Applications/Utilities/Terminal.app) and paste the (edited) commands:
export ROSETTA3=/Users/$USER/Downloads/rosetta.binary.m1.release-371/main/source
export ROSETTA3_DB=/Users/$USER/Downloads/rosetta.binary.m1.release-371/main/database
export ROSETTA_TOOLS=/Users/$USER/Downloads/rosetta.binary.m1.release-371/main/tools
export ROSETTA3_DEMOS=/Users/$USER/Downloads/rosetta.binary.m1.release-371/main/demos
Copy/Paste these (or edited) commands into your Terminal for the current session. To make them permanent, add them to your .zshrc or .bashrc file. Or simply Copy/Paste again next time!
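Since every later command on the Mac side depends on these variables, it can help to confirm that they are set and point to real directories before going further. The following is only a minimal sketch: the function name check_rosetta_env is hypothetical (it is not part of Rosetta), and it assumes the four variable names used in the export commands above.

```python
import os
from pathlib import Path

# Variable names taken from the export commands above.
ROSETTA_VARS = ("ROSETTA3", "ROSETTA3_DB", "ROSETTA_TOOLS", "ROSETTA3_DEMOS")


def check_rosetta_env(env=None):
    """Return a dict mapping each variable name to a short status string.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    status = {}
    for name in ROSETTA_VARS:
        value = env.get(name)
        if not value:
            status[name] = "not set"
        elif not Path(value).is_dir():
            status[name] = "set, but directory not found"
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    for name, state in check_rosetta_env().items():
        print(f"{name}: {state}")
```

Run it with python3 from the same Terminal session in which the export commands were pasted; a variable reported as "not set" usually means the exports were typed in a different Terminal window.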
Start the Docker container
Open a new Terminal window using the top menu Shell > New Window and select a different color (Basic is the white background). I usually do this to better distinguish which Terminal I am using.
Navigate to the top level of the release directory, i.e. the main directory. For me it would be:
cd /Users/$USER/Downloads/rosetta.binary.m1.release-371/main/
Verify that this was successful with pwd and then continue.
Launch the Docker container. The -v option lets us share the current directory (i.e. main) with the container, mapping it as /data within, and the -w option makes /data the working directory:
docker run -it --rm -v ${PWD}:/data -w /data rosettacommons/rosetta
(On a Silicon M series Mac Docker will complain about the “platform” but this can be ignored.)
A listing of the files and directories should reveal the same content as the main directory:
/data# ls -F
CITING_ROSETTA.md README.md rosetta_scripts_scripts/
CLA.md database/ source/
CONTRIBUTING.md demos/ tests/
LICENSE.md documentation/ tools/
PyRosetta.notebooks/ pyrosetta_scripts/
Continue Docker Container set-up
The tutorial assumes that the user is sitting in front of a fully functional Linux computer, including a graphical interface, with all the ancillary software needed already installed. However, within the container a few important utilities are missing, so we need to add them now. You can check which Linux distribution is running with the command cat /etc/os-release, and use python -V if you want to check the installed Python version.
Then issue the following commands at the # prompt:
apt-get update
apt-get install -y wget nano pymol
With these installed we can run all of the commands in the tutorial. Adding the pymol Python module will allow the creation of PyMOL .pse files, as described in the tutorial, from the Docker Container Terminal.
Following the tutorial
Open a web browser with the Rosetta tutorial Ligand Docking with a G-Protein Coupled Receptor in order to follow the steps. We will follow some steps within macOS, while other steps are more easily performed within the Docker container. In addition to the 2 different Terminal colors (if you chose to do that), you can be guided by the fact that the Container runs as root and has # as the prompt, while on the Mac it would be $ (bash) or % (zsh) depending on the shell in use.
Within the Docker container we don’t really need the ROSETTA variables, as all paths can start with /data to be “absolute” (i.e. unambiguous).
The sections below will follow the same numbering as the tutorial.
However, read most of the information from the web page!
1. Go to the desired location:
On macOS Terminal:
cd $ROSETTA3_DEMOS/tutorials/ligand_docking/protein_prep
On Container Terminal:
cd /data/demos/tutorials/ligand_docking/protein_prep
2. Prepare a human dopamine 3 receptor structure:
The first step has to be accomplished on the Docker Container side because the Python script clean_pdb.py calls wget to download a PDB file, and wget is not installed on macOS by default. The script then calls zcat to decompress the file, but on macOS this program behaves differently. Thus it is best to accomplish this task on the Container side. Since we are sharing the directories, the resulting files will “magically” appear on the macOS side as well!
On the Container Terminal type:
/data/tools/protein_tools/scripts/clean_pdb.py 3PBL.pdb A
Files 3PBL_A.fasta and 3PBL_A.pdb should now be within the directory.
2.3. From either terminal, within the protein_prep directory, type:
cp 3PBL_A.pdb ../docking
3. Prepare the ligand files:
The command pymol eticlopride_conformers.sdf assumes a Linux computer with a graphical interface. It will not work on macOS, nor within the Docker Container, which runs “text-only.”
To open PyMOL from the command line on a Mac use the command: open -a /Applications/PyMOL.app
Then drag the file eticlopride_conformers.sdf onto PyMOL with your mouse (passing it as an argument does not open it).
3.2.3. Generate a .params file
This command can be run from either computer. However, the first line of the script reads #!/usr/bin/env python, which assumes that a python command is found on the PATH. On my Mac the interpreter is currently available as python3, and therefore the script complains with env: python: No such file or directory. This is easily fixed by adding python3 in front of the actual command.
You can run this step with either of the following commands. The first one could also work on the Mac if python is defined as such. For the Mac Terminal option I make use of the environment variable $ROSETTA3. You can use the option -h first, as suggested in the tutorial.
Option 1 (In Container): /data/source/scripts/python/public/molfile_to_params.py -n ETQ -p ETQ --conformers-in-one-file eticlopride_conformers.sdf
Option 2 (Mac Terminal): python3 $ROSETTA3/scripts/python/public/molfile_to_params.py -n ETQ -p ETQ --conformers-in-one-file eticlopride_conformers.sdf
Note: The tutorial assumes that we are within the ligand_prep/ directory, but also calls the sdf file as ligand_prep/eticlopride_conformers.sdf, which will cause an error since we are already within that directory. Hence the directory name has been removed from the above commands.
You can also verify, with the command tail ETQ.params, that the last line of the newly created file ETQ.params contains the text PDB_ROTAMERS ETQ_conformers.pdb, and then copy the files into the ligand_docking directory with the command:
cp ETQ* ../
4. Final preparations in the docking directory
First we go back to the ligand_docking directory:
cd ../
The command pymol 3PBL_A.pdb ETQ.pdb invites us to explore the complex graphically (see above).
4.2 Concatenate protein and ligand
cp protein_prep/3PBL_A.pdb .
cat 3PBL_A.pdb ETQ.pdb > 3PBL_ETQ.pdb
If you are missing these files, check the web page for instructions on obtaining them from the answers directory.
5. Rosetta wrapper and helpers
See the web page for details. Copy the files (remember that . means “current directory”):
cp docking/dock.xml .
cp docking/options .
cp docking/crystal_complex.pdb .
6. Run the docking study
This is where it is useful to have the macOS binaries installed. For large projects they will run faster than running those within the Docker Container by emulation. The name of the binary will differ depending on the operating system.
The tutorial assumes a standard installation, with binary:
$ROSETTA3/bin/rosetta_scripts.linuxgccrelease
The binaries within the container are within /usr/local/bin
and the specific one to call for this section is called
rosetta_scripts.cxx11threadserialization.linuxgccrelease
On the Mac it will be:
$ROSETTA3/bin/rosetta_scripts.static.macosclangrelease
To run the docking use the appropriate binary, followed by @options
On the Mac it would be:
$ROSETTA3/bin/rosetta_scripts.static.macosclangrelease @options
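Because the binary name differs between the two Terminals, a small wrapper can reduce typing mistakes. The sketch below only builds the command line from the binary names quoted above. The function name rosetta_scripts_command and its in_container argument are hypothetical, and the caller decides which side it is running on.

```python
import os


def rosetta_scripts_command(in_container, rosetta3=None, options_file="options"):
    """Build the rosetta_scripts command line as a list of arguments.

    Hypothetical helper: binary names and locations are the ones quoted
    in this post and may differ for other releases.
    """
    if in_container:
        binary = "/usr/local/bin/rosetta_scripts.cxx11threadserialization.linuxgccrelease"
    else:
        # On the Mac, fall back to the ROSETTA3 environment variable.
        rosetta3 = rosetta3 or os.environ["ROSETTA3"]
        binary = os.path.join(rosetta3, "bin", "rosetta_scripts.static.macosclangrelease")
    return [binary, "@" + options_file]


if __name__ == "__main__":
    # Print the container-side command without running it.
    print(" ".join(rosetta_scripts_command(in_container=True)))
```

The returned list can be passed to subprocess.run(command, check=True) from within the ligand_docking directory, which is equivalent to typing the command by hand.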
7. Rosetta models
The Rosetta models are saved with the prefix 3PBL_ETQ_ followed by a four-digit identifier, e.g. 3PBL_ETQ_0001.pdb. Each model PDB file contains the coordinates, with the Rosetta scores for that model further down the file. The data for all models is also summarized in the plain text file score.sc.
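To pick out the best-scoring model without scrolling through the score file, a few lines of Python are enough. This is a minimal sketch that assumes the usual Rosetta score file layout: lines starting with SCORE:, the first of which holds the column names, with a description column naming the model. The helper names and the tiny example text are made up for illustration and are not real results.

```python
def read_score_file(text):
    """Parse Rosetta-style score file text into a list of dicts (one per model)."""
    header = None
    rows = []
    for line in text.splitlines():
        if not line.startswith("SCORE:"):
            continue  # skip other lines, e.g. the SEQUENCE: line
        fields = line.split()[1:]
        if header is None:
            header = fields  # the first SCORE: line holds the column names
            continue
        if len(fields) != len(header):
            continue  # ignore malformed or truncated lines
        rows.append(dict(zip(header, fields)))
    return rows


def best_model(rows, column="total_score"):
    """Return the row with the lowest (most favourable) value in `column`."""
    return min(rows, key=lambda row: float(row[column]))


# Made-up values, only to show the assumed layout of the file.
example = """SEQUENCE:
SCORE: total_score interface_delta_X ligand_rms_no_super_X description
SCORE: -120.5 -10.2 1.8 3PBL_ETQ_0001
SCORE: -131.0 -12.7 0.9 3PBL_ETQ_0002
"""

if __name__ == "__main__":
    rows = read_score_file(example)
    print(best_model(rows)["description"])
```

With a real run, replace example with the contents of the score file, e.g. open("score.sc").read() if that is the file name produced on your system.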
8. Transform_accept_ratio.
9. ligand_rms_no_super_X gives the RMSD between our model ligand and the crystal structure ligand given in crystal_complex.pdb.
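As a reminder of what such a number means, the RMSD between two sets of N matched atoms is the square root of the mean squared distance between corresponding atoms. The sketch below computes it on the coordinates as given, without any superposition, which is what the column name suggests. It is an illustration of the formula only: Rosetta's own calculation may differ in details such as which atoms are included, and the function name is hypothetical.

```python
import math


def rmsd_no_superposition(coords_a, coords_b):
    """RMSD between two equally ordered lists of (x, y, z) coordinates.

    No alignment is performed: the coordinates are compared as given.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

For example, two atoms that are each displaced by 2 Å along z give an RMSD of 2 Å, while identical coordinate lists give 0.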
10. Use PyMOL to visually compare
11. Script visualize_ligand.py
Provides a quick visualization of the protein-ligand interface and saves a .pse PyMOL session file. This can be done on the Docker Container side if the pymol Python package was installed (see above). Note that the tutorial file name is 3PBL_A_ETQ_0001.pdb but our result files do not have the _A_ portion.
Run this command from the Docker Container Terminal, assuming we are within the ligand_docking directory:
scripts/visualize_ligand.py 3PBL_ETQ_0001.pdb
The new file 3PBL_ETQ_0001.pse can be opened on the Mac side, either graphically or from the Mac Terminal with the command: open ./3PBL_ETQ_0001.pse
Analysis
The out directory contains 50 precomputed structures for a better analysis, as well as the files score.sc, score_vs_rmsd.csv, and rmsds_to_best_model.data, plus several .png image files.
Change into that directory:
cd out
The following files should be present:
score.sc: summary score file for the 50 structures, as output by Rosetta.
score_vs_rmsd.csv: a comma-separated file with the filename in the first column, the total_score for the complex in the second column, the interface score in the third column, and the ligand RMSD to the native structure in the fourth column.
If score_vs_rmsd.csv is absent it can be recreated with the provided script extract_scores.bash:
../scripts/extract_scores.bash score.sc > score_vs_rmsd.csv
The next file is calculated with the script calculate_ligand_rmsd.py:
rmsds_to_best_model.data: space-separated file containing RMSD comparisons with the best-scoring model (not the crystal structure!) for all PDB files.
However, when run under Python 3 the script calculate_ligand_rmsd.py will give an error, which can be prevented by updating the print statements on lines 83 and 221: they just need parentheses added around their arguments. Open the file with a simple text editor (e.g. nano) and change line 83 from print "Doing aligning" to print("Doing aligning") and line 221 from print "file, name: "+file+' '+name to print("file, name: "+file+' '+name)
(See Appendix “Python print statement” for a command that modifies these lines without manual editing.)
Finally, comment out the last line, as the script capture_command.sh is not present.
After editing, the script can be run from the Docker Container with the command below. The tutorial suggests leaving the out directory first and running it on the predictions computed previously:
cd ../
Then:
scripts/calculate_ligand_rmsd.py -n 3PBL_ETQ_0003.pdb -c X -a 7 -o rmsds_to_best_model.data *_000*.pdb
The script calculate_ligand_rmsd.py uses the pymol Python module and therefore only works properly in the Docker Container session.
2.4. PNG files
There is no explanation of how the PNG files were produced. (See the Appendix for short R and Python scripts that create similar images from the file out/score_vs_rmsd.csv.)
The command gthumb is the Linux way to display an image and will not work from within the Docker Container. On macOS use the graphical interface (double-click on the icon) or the command line from within the macOS Terminal. For example, if the file is within out:
open -a /System/Applications/Preview.app ./out/score_vs_crystal_rmsd_plot.png
5. Look at some structures
This is mostly a visual exercise.
Appendix
Python print statement
The Python 2 print statement was converted to a function in Python 3, which changed the syntax and requires parentheses. The following regular expression command will change the print statements to the print() function syntax within the same file, using the sed (stream editor) program. On macOS the default sed version does not work with this command, and one would need to install gnu-sed (gsed) for it to work.
Thus the substitute command should be issued at the Docker Container Terminal:
sed -i -r 's/^(\s*print)\s+(.*)/\1(\2)/g' ../scripts/calculate_ligand_rmsd.py
This command was derived from a vi/vim text editor command found on stackoverflow.com.
The -i option will edit and overwrite the file calculate_ligand_rmsd.py. (Note: if there is a permission error, remove -i and redirect the output to a new file.)
Copilot provided the following explanations:
Here’s what’s happening in this sed command:
-r option allows sed to understand extended regular expressions.
s is the substitute command.
^(\s*print)\s+(.*) is the pattern to match. It matches lines that start with zero or more whitespace characters, followed by print, followed by one or more whitespace characters, and then any characters.
\1(\2) is the replacement pattern. It replaces the matched text with the leading whitespace and print, followed by the remaining characters in parentheses.
g at the end is a flag that tells sed to apply the substitution globally on each line.
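For readers who cannot use GNU sed, the same substitution can be sketched with Python's re module, which works identically on macOS and in the container. This is an illustration of the regular expression only, using hypothetical helper names. Like the sed command, it is a line-by-line text rewrite and does not understand Python syntax (for instance, print statements that span several lines).

```python
import re

# Same pattern as the sed command: group 1 is the indentation plus "print",
# group 2 is everything after the whitespace that follows it.
PRINT_STATEMENT = re.compile(r"^(\s*print)\s+(.*)$")


def convert_print_line(line):
    """Rewrite a Python 2 style `print ...` line as a `print(...)` call."""
    return PRINT_STATEMENT.sub(r"\1(\2)", line)


def convert_source(text):
    """Apply the conversion to every line of a source file's text."""
    return "\n".join(convert_print_line(line) for line in text.split("\n"))


if __name__ == "__main__":
    print(convert_print_line('print "Doing aligning"'))
```

Lines that already call print("...") with no space before the parenthesis are left unchanged, because the pattern requires whitespace after print.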
PyMOL as pymol
Note: It may be possible to use PyMOL on macOS in a similar manner as on Linux, knowing that the pymol executable is found at /Applications/PyMOL.app/Contents/bin/pymol
Scripts for PNG files
The following scripts were suggested by Copilot, with the simple question to plot the 3rd and 4th columns.
R Script
Assumes ggplot2 has been installed. Otherwise install it with this command at the R console: install.packages("ggplot2")
Make sure that you are in the directory containing the desired CSV file.
The plot is shown graphically but can be saved manually.
# Read the comma-separated data (assumes the file has no header row)
data <- read.csv("score_vs_rmsd.csv", header=FALSE)
# Load the ggplot2 library for plotting
library(ggplot2)
# Create a scatter plot of the 3rd column (V3) against the 4th column (V4)
ggplot(data, aes(x=V3, y=V4)) +
  geom_point() +
  theme_minimal() +
  labs(title="Scatter Plot of 3rd and 4th Columns",
       x="3rd Column",
       y="4th Column")
Python version
Assumes pandas and matplotlib are installed (e.g. using pip).
The script will export a PNG file called plot.png
import pandas as pd
import matplotlib.pyplot as plt
# Read the comma-separated data (assumes the file has no header row)
data = pd.read_csv('score_vs_rmsd.csv', sep=',', header=None)
# Extract the 3rd and 4th columns
x = data.iloc[:, 2]
y = data.iloc[:, 3]
# Create a scatter plot
plt.scatter(x, y)
plt.title('Scatter Plot of 3rd and 4th Columns')
plt.xlabel('3rd Column')
plt.ylabel('4th Column')
plt.grid(True)
# Save the plot to a PNG file
plt.savefig('plot.png')