Summary
1. How to easily convert a string of characters so that it appears one character per line? Either of:
echo "abcdefg" | fold -w1
echo "abcdefg" | grep -o .
2. How to convert an amino acid sequence from one-letter to three-letter code or vice versa?
Easiest is to use one of the web sites listed below.
3. How to write the 1- or 3-letter code one amino acid per line? Either of:
echo "abcdefg" | fold -w3
echo "abcdefg" | grep -o ...
One character per line
In spite of large-scale -omics analyses, it is sometimes still necessary to compute minute details of a protein or a peptide and to navigate the various formats in which sequences can be found. Recently I wanted to do a task that is rather simple and could indeed be done by hand, but finding a scriptable way to perform a computing task ensures that it is reproducible and makes it easier to record how things were done.
Question 1: How can a sequence be written one amino acid per line, e.g. with the short peptide sequence:
MQNLNDRLASYLDSVHALEEANADLEQKIKGWYE
(a small portion of a keratin protein.)
(Why would I want to do that? In short it was to paste into a spreadsheet.)
I easily found an answer on Stack Overflow (…bash-split-string-into-character-array) and was in fact surprised to find 2 possible answers:
echo "abcdefg" | fold -w1
# OR
echo "abcdefg" | grep -o .
The `fold` command was new to me, but the word itself made sense after all.
For the `grep` command I was at first wondering how it could work. The answer is simply that the dot `.` represents any character in a regular expression (*) (i.e. a text-based search pattern) and in that sense it matches every single amino acid. In normal mode this would just print the whole line once, so the magic is in the `-o` modifier, which prints each match on its own line, even though the manual page does not clearly explain what would happen:
-o, --only-matching
Prints only the matching part of the lines.
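To make the difference concrete, here is a small sketch that runs `grep` with and without `-o` on the keratin fragment quoted above and checks that `fold -w1` produces the same number of lines. The variable name `peptide` is only for illustration, and `-o` is assumed to be available (it is in GNU and BSD grep, but it is not a POSIX option).

```shell
#!/bin/sh
# Keratin fragment from the text above.
peptide="MQNLNDRLASYLDSVHALEEANADLEQKIKGWYE"

# Without -o, grep prints the whole matching line once.
printf '%s\n' "$peptide" | grep .

# With -o, every match of "." (each residue) is printed on its own line.
printf '%s\n' "$peptide" | grep -o . | head -n 3

# Both approaches give one line per residue, so both counts are 34 here.
printf '%s\n' "$peptide" | grep -o . | wc -l
printf '%s\n' "$peptide" | fold -w1 | wc -l
```

The resulting column can be pasted directly into a spreadsheet, one residue per cell.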
Amino acid conversion between 1- and 3-letter codes
Question 2: how can I convert a 1-letter code peptide sequence to a 3-letter code version?
I was surprised that I could not find an EMBOSS command for this task. However, a quick search for terms such as “amino acid one into three” turns up pages with translation tables, as well as utility pages that perform the conversion. For example:
Title | Link |
---|---|
Three-/one-letter Amino Acid Codes | https://www.bioline.com/media/calculator/01_17.html |
Sequence Manipulation Suite: One to Three | https://www.bioinformatics.org/sms2/one_to_three.html |
Sequence Manipulation Suite: Three to One | https://www.bioinformatics.org/sms2/three_to_one.html |
Peptide Amino Acids Sequence Converter: (Three-Letter to One-Letter, One-Letter to Three-Letter) | https://www.peptide2.com/peptide_amino_acid_converter.php |
However, it seems that they are all derived from the same script.
The peptide above would become:
> 34 aminoacids; Mw=3966.11Da
MetGlnAsnLeuAsnAspArgLeuAlaSerTyrLeuAspSerValHisAlaLeuGluGlu
AlaAsnAlaAspLeuGluGlnLysIleLysGlyTrpTyrGlu***
The problem with these is that the 3-letter code translation has no space between the names of the amino acids. This is useful for aligning to nucleic acid sequence codons, but I needed one per line for a protein modeling script. The `fold` and `grep` commands also work well in this case with a simple modification: `grep -o ...` and `fold -w3`.
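The two commands are not strictly equivalent when the input length is not a multiple of 3, and the web output shown above also carries a `>` header line, line breaks and trailing asterisks. The sketch below illustrates both points. The helper name `clean_three` is hypothetical, and it assumes the header always starts with `>` and that `*` is the only extra character to strip.

```shell
#!/bin/sh
# A length that is not a multiple of 3 shows how the two commands differ.
echo "MetGlnAs" | fold -w3      # keeps the leftover "As" as a last line
echo "MetGlnAs" | grep -o ...   # silently drops the leftover "As"

# clean_three (hypothetical helper): drop the ">" header line, the "*"
# characters and the line breaks, then emit one 3-letter code per line.
clean_three() {
    grep -v '^>' | tr -d '*\n' | grep -o '...'
}

clean_three <<'EOF'
> 34 aminoacids; Mw=3966.11Da
MetGlnAsnLeuAsnAspArgLeuAlaSerTyrLeuAspSerValHisAlaLeuGluGlu
AlaAsnAlaAspLeuGluGlnLysIleLysGlyTrpTyrGlu***
EOF
```

Joining the lines before splitting matters: it keeps the triplets intact even if a converter wraps its output at a width that is not a multiple of 3.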
For my final script I wanted it all upper case. The final command looked like this, shown with the first 3 amino acids:
echo "MetGlnAsn" | fold -w3 | tr '[:lower:]' '[:upper:]'
MET
GLN
ASN
The whole point of this was to create a script file for UCSF Chimera using the `swapaa` command for amino acid side-chain mutation.
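The Chimera script itself is not shown here, so the following is only a sketch of how the one-per-line output could be turned into command lines. It assumes commands of the form `swapaa TYPE :NUMBER.CHAIN`, consecutive residue numbering, and a single chain. The helper name `make_swapaa`, the start number 10 and the chain `A` are all hypothetical and would need to match the actual model.

```shell
#!/bin/sh
# make_swapaa (hypothetical helper): read one 3-letter code per line and
# print one assumed "swapaa TYPE :NUMBER.CHAIN" line per residue.
#   $1 = number of the first residue (default 1)
#   $2 = chain identifier (default A)
make_swapaa() {
    start="${1:-1}"
    chain="${2:-A}"
    awk -v n="$start" -v c="$chain" \
        'NF { printf "swapaa %s :%d.%s\n", $1, n++, c }'
}

# Example with the first 3 residues, numbered from 10 on chain A.
echo "MetGlnAsn" | fold -w3 | tr '[:lower:]' '[:upper:]' | make_swapaa 10 A
```

Redirecting the output to a file would give a script that can be reviewed before it is opened in Chimera.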
Of course, for such a short sequence this could be done by hand. But with the commands worked out in this fashion, it becomes possible to process many more sequences. That is another story.
(*) The concept of regular expressions began in the 1950s, when the American mathematician Stephen Cole Kleene formalized the description of a regular language. He had joined the mathematics department at the University of Wisconsin–Madison in 1935 and spent nearly all of his career there.
Amino acids, one and three letter codes
Amino acid | Three letter code | One letter code |
---|---|---|
alanine | ala | A |
arginine | arg | R |
asparagine | asn | N |
aspartic acid | asp | D |
asparagine or aspartic acid | asx | B |
cysteine | cys | C |
glutamic acid | glu | E |
glutamine | gln | Q |
glutamine or glutamic acid | glx | Z |
glycine | gly | G |
histidine | his | H |
isoleucine | ile | I |
leucine | leu | L |
lysine | lys | K |
methionine | met | M |
phenylalanine | phe | F |
proline | pro | P |
serine | ser | S |
threonine | thr | T |
tryptophan | trp | W |
tyrosine | tyr | Y |
valine | val | V |
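
The table above can also serve as a local lookup, as an alternative to the web converters. This is a minimal sketch rather than a tested tool: the function name `one_to_three` is hypothetical, the input is assumed to be one sequence per line with no spaces, and any character that is not in the table is written as `Xaa`, which is a choice made here for illustration.

```shell
#!/bin/sh
# one_to_three (hypothetical helper): read 1-letter sequences on stdin and
# print the 3-letter code of each residue, one per line.
one_to_three() {
    awk '
    BEGIN {
        # Letter/code pairs taken from the table above.
        map = "A Ala R Arg N Asn D Asp B Asx C Cys E Glu Q Gln Z Glx G Gly"
        map = map " H His I Ile L Leu K Lys M Met F Phe P Pro S Ser T Thr"
        map = map " W Trp Y Tyr V Val"
        n = split(map, pairs, " ")
        for (i = 1; i < n; i += 2) code[pairs[i]] = pairs[i + 1]
    }
    {
        seq = toupper($0)
        for (i = 1; i <= length(seq); i++) {
            aa = substr(seq, i, 1)
            # Characters missing from the table become "Xaa" (assumption).
            if (aa in code) { print code[aa] } else { print "Xaa" }
        }
    }'
}

# First 3 residues of the keratin fragment, in upper case.
echo "MQN" | one_to_three | tr '[:lower:]' '[:upper:]'
```

Because the output is already one residue per line, no `fold` or `grep -o` step is needed afterwards.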