1. How to easily convert a string of characters to appear one character per line? Either of:
echo "abcdefg" | fold -w1
echo "abcdefg" | grep -o .
2. How to convert amino acid sequence from one-letter to three-letter or vice versa?
Easiest is on a listed web site.
3. How to write the 1- or 3-letter code one amino acid per line. Either of:
echo "abcdefg" | fold -w3
echo "abcdefg" | grep -o ...
One character per line
In spite of large-scale -omics analyses, it is sometimes still necessary to compute minute details of a protein or a peptide and to navigate the various formats that can be found. Recently I wanted to do a task that is rather simple, and that may indeed be done by hand, but finding a scriptable way to perform a computing task ensures that it is reproducible, and it also makes it easier to record how things were done.
Question 1: How can a sequence be written one amino acid per line, e.g. with the short peptide sequence:
MQNLNDRLASYLDSVHALEEANADLEQKIKGWYE
(a small portion of a keratin protein.)
(Why would I want to do that? In short it was to paste into a spreadsheet.)
I easily found an answer on stackoverflow…bash-split-string-into-character-array and was in fact surprised by the 2 possible answers:
echo "abcdefg" | fold -w1
# OR
echo "abcdefg" | grep -o .
The fold command was new to me, but the word itself made sense after all.
For the grep command, I was at first wondering how it might work. The answer is simply that the dot . represents any character in a regular expression (*) (i.e. a text-based search pattern) and in that sense it matches every single amino acid. In normal mode grep would print the whole matching line just once, so the magic is in the -o modifier, which prints each match on its own output line, even though the manual pages do not clearly explain what would happen:
Prints only the matching part of the lines.
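As a minimal sketch, the same idea can be applied to the keratin peptide from Question 1. The function name split_residues is only an illustrative helper, not part of any tool, and the example assumes a plain ASCII sequence on a single line:

```shell
# Hypothetical helper: print a sequence one character per line.
# The regular expression "." matches any single character, and
# grep -o prints every match on its own line.
split_residues() {
  printf '%s\n' "$1" | grep -o .
}

peptide="MQNLNDRLASYLDSVHALEEANADLEQKIKGWYE"

# Show the first three residues, one per line
split_residues "$peptide" | head -n 3

# Count the lines: there should be one per residue
split_residues "$peptide" | wc -l
```

Replacing grep -o . with fold -w1 inside the function should give the same lines for this kind of input, and the result can be redirected to a file for pasting into a spreadsheet.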
Amino acid conversion between 1- and 3-letter codes
Question 2: how can I convert a 1-letter code peptide sequence to a 3-letter code version?
I was surprised that I could not find an EMBOSS command for this task. However, a quick search for terms such as “amino acid one into three” turns up pages with translation tables, and also utility pages that perform the conversion. For example:
|Three-/one-letter Amino Acid Codes||https://www.bioline.com/media/calculator/01_17.html|
|Sequence Manipulation Suite: One to Three||https://www.bioinformatics.org/sms2/one_to_three.html|
|Sequence Manipulation Suite: Three to One||https://www.bioinformatics.org/sms2/three_to_one.html|
|Peptide Amino Acids Sequence Converter: (Three-Letter to One-Letter, One-Letter to Three-Letter)||https://www.peptide2.com/peptide_amino_acid_converter.php|
However, it seems that they are all derived from the same script.
The peptide above would become:
> 34 aminoacids; Mw=3966.11Da
MetGlnAsnLeuAsnAspArgLeuAlaSerTyrLeuAspSerValHisAlaLeuGluGlu
AlaAsnAlaAspLeuGluGlnLysIleLysGlyTrpTyrGlu***
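For scripted use, the same translation can also be sketched locally with standard tools. This is not the script behind the web pages, only a minimal illustration: the function name one_to_three is hypothetical, the lookup table is the standard one (plus B and Z for the ambiguous codes), and printing Xaa for any other character is an assumption of this sketch.

```shell
# Hypothetical helper: translate a 1-letter sequence to 3-letter codes,
# printing one amino acid per line.
one_to_three() {
  printf '%s\n' "$1" | fold -w1 | awk '
    BEGIN {
      # Standard 1-letter to 3-letter pairs, plus B (Asx) and Z (Glx)
      map = "A Ala R Arg N Asn D Asp C Cys Q Gln E Glu G Gly H His I Ile L Leu"
      map = map " K Lys M Met F Phe P Pro S Ser T Thr W Trp Y Tyr V Val B Asx Z Glx"
      n = split(map, t, " ")
      for (i = 1; i < n; i += 2) code[t[i]] = t[i + 1]
    }
    {
      k = toupper($1)
      # Assumption of this sketch: unrecognized characters become Xaa
      print ((k in code) ? code[k] : "Xaa")
    }
  '
}

# One code per line for the first three residues
one_to_three "MQN"

# Joined without separators, in the same style as the web converters
one_to_three "MQN" | tr -d '\n'; echo
```

Because the function already prints one code per line, its output needs no further splitting, and piping it through tr -d '\n' gives the run-together form.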
The problem with the output of these converters is that the 3-letter code translation has no space between the names of the amino acids. This is useful for aligning to nucleic acid sequence codons, but I needed one per line for a protein modeling script. The fold and grep commands also work well in this case with a simple modification: fold -w3 and grep -o ...
For my final script I wanted it all upper case. The final command looked like this, shown with the first 3 amino acids:
echo "MetGlnAsn" | fold -w3 | tr '[:lower:]' '[:upper:]'
MET
GLN
ASN
The whole point of this was to create a script file for UCSF Chimera using the swapaa command for amino acid side-chain mutation.
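How such a script file might be generated can be sketched as follows. This is not the script I actually used: the helper name make_swapaa is hypothetical, chain A and numbering from residue 1 are assumptions, and the "swapaa residue-type residue-specifier" line layout should be checked against the Chimera documentation before use.

```shell
# Hypothetical helper: turn a run-together 3-letter sequence into
# one swapaa line per residue.
make_swapaa() {
  seq3=$1          # e.g. "MetGlnAsn"
  chain=${2:-A}    # assumed chain identifier
  start=${3:-1}    # assumed number of the first residue
  printf '%s\n' "$seq3" | fold -w3 | tr '[:lower:]' '[:upper:]' |
    awk -v c="$chain" -v s="$start" \
      '{ printf "swapaa %s :%d.%s\n", $1, s + NR - 1, c }'
}

# Print the command lines for the first three amino acids
make_swapaa "MetGlnAsn"
```

Redirecting the output to a file would give a command file with one mutation per line, and the chain and starting residue number can be passed as the second and third arguments.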
Of course, for such a short sequence this could be done by hand. But with the method worked out in this fashion, it would be possible to process many more sequences. That is another story.
(*) The concept of regular expressions began in the 1950s, when the American mathematician Stephen Cole Kleene formalized the description of a regular language. Kleene had joined the mathematics department at the University of Wisconsin–Madison in 1935, and he spent nearly all of his career there.
Amino acids, one and three letter codes
|Amino acid||Three letter code||One letter code|
|asparagine or aspartic acid||asx||B|
|glutamine or glutamic acid||glx||Z|