Summary
1. How to easily convert a string of characters to appear one character per line? Either of:
echo "abcdefg" | fold -w1
echo "abcdefg" | grep -o .
2. How to convert an amino acid sequence from one-letter to three-letter code or vice versa? Easiest is to use one of the web sites listed below.
3. How to write the 3-letter code one amino acid per line? Either of:
echo "abcdefg" | fold -w3
echo "abcdefg" | grep -o ...
One character per line
Despite large-scale -omics analyses, it is sometimes still necessary to compute minute details of a protein or a peptide and to navigate the various formats in which sequences are found. Recently I wanted to do a task that is rather simple and could indeed be done by hand, but finding a scriptable way to perform a computing task ensures that it is reproducible and makes it easier to record how things were done.
Question 1: How can a sequence be written one amino acid per line, e.g. with this short peptide sequence (a small portion of a keratin protein):
MQNLNDRLASYLDSVHALEEANADLEQKIKGWYE
(Why would I want to do that? In short, to paste it into a spreadsheet.)
I easily found an answer on stackoverflow…bash-split-string-into-character-array and was in fact surprised to find two possible answers:
echo "abcdefg" | fold -w1
# OR
echo "abcdefg" | grep -o .
The fold command was new to me, but the word itself made sense after all.
For the grep command I wondered at first how it works. The answer is simply that the dot . matches any single character in a regular expression (*) (i.e. a text-based search pattern), so it matches every single amino acid. In normal mode grep would print the whole matching line just once, so the magic is in the -o option, even though the manual page does not make it obvious what will happen:
-o, --only-matching
Prints only the matching part of the lines.
In practice each match is printed on its own line, and since every character is a separate match, the result is one character per line.
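The effect of -o can be checked directly by running grep with and without it on the same input and counting the output lines. This is only a small sketch; the variable names are chosen here for illustration.

```shell
# Without -o: grep prints the whole matching line once.
whole=$(echo "abcdefg" | grep .)
# With -o: every match of the pattern "." is printed on its own line.
split=$(echo "abcdefg" | grep -o .)

echo "without -o:"
echo "$whole"
echo "with -o:"
echo "$split"

# Count the lines produced by each variant.
whole_lines=$(printf '%s\n' "$whole" | wc -l | tr -d ' ')
split_lines=$(printf '%s\n' "$split" | wc -l | tr -d ' ')
echo "lines without -o: $whole_lines, with -o: $split_lines"
```

The first variant gives a single line, the second gives seven, one per character.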
Amino acid conversion between 1- and 3-letter codes
Question 2: how can I convert a 1-letter code peptide sequence to a 3-letter code version?
I was surprised that I could not find an EMBOSS command for this task. However, a quick search for terms such as “amino acid one into three” turns up pages with conversion tables as well as utility pages that perform the conversion. For example:
| Title | Link |
|---|---|
| Three-/one-letter Amino Acid Codes | https://www.bioline.com/media/calculator/01_17.html |
| Sequence Manipulation Suite: One to Three | https://www.bioinformatics.org/sms2/one_to_three.html |
| Sequence Manipulation Suite: Three to One | https://www.bioinformatics.org/sms2/three_to_one.html |
| Peptide Amino Acids Sequence Converter: (Three-Letter to One-Letter, One-Letter to Three-Letter) | https://www.peptide2.com/peptide_amino_acid_converter.php |
It seems, however, that they are all derived from the same script.
The peptide above would become:
> 34 aminoacids; Mw=3966.11Da
MetGlnAsnLeuAsnAspArgLeuAlaSerTyrLeuAspSerValHisAlaLeuGluGlu
AlaAsnAlaAspLeuGluGlnLysIleLysGlyTrpTyrGlu***
The problem with these tools is that the 3-letter code output has no space between the amino acid names. This is useful for aligning to the codons of a nucleic acid sequence, but I needed one amino acid per line for a protein modeling script. The fold and grep commands also work well in this case with a simple modification: grep -o ... and fold -w3
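As a sketch of how this might look on the full converter output, the example below assumes the two sequence lines are pasted as shown above (without the header line) and that the trailing asterisks are not wanted. It joins the lines, removes the asterisks, and then cuts every 3 characters. It also shows one difference between the two commands: when the input length is not a multiple of 3, fold keeps the leftover characters while grep -o ... silently drops them.

```shell
# Three-letter output as pasted from the converter, including the line
# break and the trailing asterisks (header line left out).
three_letter='MetGlnAsnLeuAsnAspArgLeuAlaSerTyrLeuAspSerValHisAlaLeuGluGlu
AlaAsnAlaAspLeuGluGlnLysIleLysGlyTrpTyrGlu***'

# Remove the line break and the asterisks, then cut every 3 characters.
per_line=$(printf '%s\n' "$three_letter" | tr -d '\n*' | fold -w3)
printf '%s\n' "$per_line" | head -n 3
count=$(printf '%s\n' "$per_line" | wc -l | tr -d ' ')
echo "$count residues"

# With a length that is not a multiple of 3, the two commands differ:
# fold keeps the leftover "g", grep -o ... ends with "def".
fold_last=$(echo "abcdefg" | fold -w3 | tail -n 1)
grep_last=$(echo "abcdefg" | grep -o ... | tail -n 1)
echo "last line from fold: $fold_last, from grep: $grep_last"
```

The residue count of 34 agrees with the header line of the converter output, which is a quick sanity check that nothing was lost.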
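For my final script I wanted everything in upper case. The final command looked like this, shown with the first 3 amino acids (the character classes are quoted so that the shell cannot expand them as file name patterns):
echo "MetGlnAsn" | fold -w3 | tr '[:lower:]' '[:upper:]'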
MET
GLN
ASN
The whole point of this was to create a script file for UCSF Chimera using the swapaa command to mutate amino acid side chains.
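A rough sketch of how such a script file could be generated from the 3-letter sequence is shown below. The function name make_swapaa_script is made up for this example, and several things are assumptions rather than facts from my actual script: that residues are numbered from 1 in sequence order, that a single chain ID is used, and that each line has the form "swapaa TYPE :NUMBER.CHAIN". The exact syntax should be checked against the Chimera documentation before use.

```shell
# Hypothetical helper: reads a 3-letter sequence on standard input and
# writes one swapaa line per residue.
# $1 = chain ID (assumption: all residues are in this chain and are
#      numbered 1, 2, 3, ... in sequence order).
make_swapaa_script() {
  fold -w3 |
    tr '[:lower:]' '[:upper:]' |
    awk -v chain="$1" '{ printf "swapaa %s :%d.%s\n", $1, NR, chain }'
}

echo "MetGlnAsn" | make_swapaa_script A
```

For the three residues above this prints "swapaa MET :1.A", "swapaa GLN :2.A" and "swapaa ASN :3.A"; redirecting the output to a file would give a command file to open in Chimera.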
Of course, for such a short sequence this could be done by hand. With the method worked out in this fashion, however, it becomes possible to process many more sequences. But that is another story.
(*) The concept of regular expressions dates to the 1950s, when the American mathematician Stephen Cole Kleene formalized the description of a regular language. Kleene had joined the mathematics department at the University of Wisconsin–Madison in 1935, where he spent nearly all of his career.
Amino acids, one and three letter codes
| Amino acid | Three letter code | One letter code |
|---|---|---|
| alanine | ala | A |
| arginine | arg | R |
| asparagine | asn | N |
| aspartic acid | asp | D |
| asparagine or aspartic acid | asx | B |
| cysteine | cys | C |
| glutamic acid | glu | E |
| glutamine | gln | Q |
| glutamine or glutamic acid | glx | Z |
| glycine | gly | G |
| histidine | his | H |
| isoleucine | ile | I |
| leucine | leu | L |
| lysine | lys | K |
| methionine | met | M |
| phenylalanine | phe | F |
| proline | pro | P |
| serine | ser | S |
| threonine | thr | T |
| tryptophan | trp | W |
| tyrosine | tyr | Y |
| valine | val | V |
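The table above can also drive a local conversion instead of a web site. The following is a minimal sketch, not the script behind the listed pages: the function name one_to_three is made up for this example, the lookup list is typed in from the table, and printing "Xaa" for any letter not in the table is an assumption of this sketch.

```shell
# Hypothetical helper: reads a 1-letter sequence on standard input and
# prints the 3-letter code, one amino acid per line.
one_to_three() {
  awk 'BEGIN {
    # Pairs of one-letter and three-letter codes taken from the table.
    n = split("A Ala R Arg N Asn D Asp B Asx C Cys E Glu Q Gln Z Glx G Gly " \
              "H His I Ile L Leu K Lys M Met F Phe P Pro S Ser T Thr W Trp " \
              "Y Tyr V Val", t, " ")
    for (i = 1; i < n; i += 2) code[t[i]] = t[i + 1]
  }
  {
    seq = toupper($0)
    for (i = 1; i <= length(seq); i++) {
      c = substr(seq, i, 1)
      # Assumption: letters missing from the table are reported as Xaa.
      print ((c in code) ? code[c] : "Xaa")
    }
  }'
}

echo "MQNLNDRLASYLDSVHALEEANADLEQKIKGWYE" | one_to_three | head -n 3
```

Since the output is already one amino acid per line, it can be piped straight into tr '[:lower:]' '[:upper:]' when upper case is needed.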