1to3 and 3to1 peptide sequence conversion

Summary

1. How to easily convert a string of characters to appear one character per line? Either of:

echo "abcdefg" | fold -w1

echo "abcdefg" | grep -o .

2. How to convert amino acid sequence from one-letter to three-letter or vice versa?
The easiest way is to use one of the web sites listed further down.

3. How to write the 3-letter code one amino acid per line? Either of:

echo "MetGlnAsn" | fold -w3

echo "MetGlnAsn" | grep -o ...

One character per line

In spite of the large-scale -omics analyses, it is sometimes still necessary to compute minute details of a protein or a peptide and to navigate the various formats that can be found. Recently I wanted to do a task that is rather simple, and that may indeed be done by hand, but finding scriptable ways to perform a computing task ensures that it is reproducible, and it also makes it easier to record how things were done.

Question 1: How can a sequence be written one amino acid per line, e.g. for the short peptide sequence MQNLNDRLASYLDSVHALEEANADLEQKIKGWYE (a small portion of a keratin protein)?
(Why would I want to do that? In short, it was to paste into a spreadsheet.)
I easily found an answer on stackoverflow…bash-split-string-into-character-array and was in fact surprised by the 2 possible answers:

echo "abcdefg" | fold -w1
# OR
echo "abcdefg" | grep -o .

The fold command was new to me, but the word itself made sense after all.
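As a small sketch, the same fold command can be applied to the keratin peptide from Question 1, with a line count as a sanity check. The variable names (peptide, one_per_line, n_lines) are only illustrative and not part of the original commands.

```shell
# Peptide from Question 1 (a small portion of a keratin protein).
peptide="MQNLNDRLASYLDSVHALEEANADLEQKIKGWYE"

# fold -w1 wraps every line at a width of 1 character: one residue per line.
one_per_line=$(printf '%s\n' "$peptide" | fold -w1)
printf '%s\n' "$one_per_line"

# Sanity check: the number of lines should equal the sequence length.
# (tr removes the padding that some wc implementations add.)
n_lines=$(printf '%s\n' "$one_per_line" | wc -l | tr -d ' ')
echo "residues: $n_lines"
```

The column of letters can then be pasted directly into a spreadsheet.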

For the grep command, I wondered at first how it could work. The answer is simply that the dot . represents any single character in a regular expression (*) (i.e. a text-based search pattern), so it matches every single amino acid. In normal mode grep would print the whole matching line just once, so the magic is in the -o modifier, which prints each match on its own line, even though the manual pages do not clearly explain this:

-o, --only-matching
Prints only the matching part of the lines.
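To make the effect of -o visible, here is a minimal comparison of grep with and without the option on a 4-residue test string. The variable names are illustrative only.

```shell
seq="MQNL"

# Without -o, grep prints the whole matching line once.
whole_line=$(printf '%s\n' "$seq" | grep .)

# With -o, each match of the pattern is printed on its own line.
# A single dot matches any one character, so every residue is a match.
per_char=$(printf '%s\n' "$seq" | grep -o .)

printf 'without -o:\n%s\n' "$whole_line"
printf 'with -o:\n%s\n' "$per_char"
```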

Amino acid conversion between 1- and 3-letter codes

Question 2: how can I convert a 1-letter code peptide sequence to a 3-letter code version?

I was surprised that I could not find an EMBOSS command for this task. However, a quick search for terms such as “amino acid one into three” returns pages with translation tables as well as utility pages that perform the conversion. For example:

Title Link
Three-/one-letter Amino Acid Codes https://www.bioline.com/media/calculator/01_17.html
Sequence Manipulation Suite: One to Three https://www.bioinformatics.org/sms2/one_to_three.html
Sequence Manipulation Suite: Three to One https://www.bioinformatics.org/sms2/three_to_one.html
Peptide Amino Acids Sequence Converter: (Three-Letter to One-Letter, One-Letter to Three-Letter) https://www.peptide2.com/peptide_amino_acid_converter.php

However, it seems that they are all derived from the same script.

The peptide above would become:

>    34 aminoacids; Mw=3966.11Da
MetGlnAsnLeuAsnAspArgLeuAlaSerTyrLeuAspSerValHisAlaLeuGluGlu
AlaAsnAlaAspLeuGluGlnLysIleLysGlyTrpTyrGlu***

The problem with these is that the 3-letter code translation has no space between the names of the amino acids. This is useful for aligning to nucleic acid sequence codons, but I needed one per line for a protein modeling script. The fold and grep commands also work well in this case with a simple modification: grep -o ... and fold -w3

For my final script I wanted it all upper case. The final command looked like this, shown with the first 3 amino acids:

echo "MetGlnAsn" | fold -w3 | tr '[:lower:]' '[:upper:]'
MET
GLN
ASN
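The converter output shown earlier also carries a header line, wrapped sequence lines and a trailing *** marker, all of which would disturb the grouping in threes. A possible sketch for cleaning these up before folding is shown here. It assumes that the header always starts with > and that asterisks occur only in the trailing marker, which is what the output above looks like but is not otherwise verified.

```shell
# Output of the web converter, pasted as-is.
converted='>    34 aminoacids; Mw=3966.11Da
MetGlnAsnLeuAsnAspArgLeuAlaSerTyrLeuAspSerValHisAlaLeuGluGlu
AlaAsnAlaAspLeuGluGlnLysIleLysGlyTrpTyrGlu***'

# 1. grep -v drops the header line (assumed to start with ">").
# 2. tr -d removes the asterisks and joins the wrapped lines.
# 3. fold -w3 writes one three-letter code per line.
# 4. tr converts everything to upper case.
three_per_line=$(
  printf '%s\n' "$converted" |
    grep -v '^>' |
    tr -d '*\n' |
    fold -w3 |
    tr '[:lower:]' '[:upper:]'
)
printf '%s\n' "$three_per_line"
```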

The whole point of this was to create a script file for UCSF Chimera using the swapaa command for amino acid side-chain mutation.
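As a rough idea of how such a script file might be generated, the sketch below numbers each three-letter code with awk and writes one swapaa line per residue. This is not the actual script: the residue numbering starting at 1 and the ":N" residue specifier are assumptions that would have to be adapted to the structure at hand.

```shell
# Number each three-letter code and emit one swapaa command per residue.
# ASSUMPTION: residues are numbered from 1 and addressed as ":N".
chimera_cmds=$(
  printf '%s\n' "MetGlnAsn" |
    fold -w3 |
    tr '[:lower:]' '[:upper:]' |
    awk '{ printf "swapaa %s :%d\n", $1, NR }'
)
printf '%s\n' "$chimera_cmds"
```

The resulting lines could be redirected to a file and opened as a command file in Chimera.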

Of course, for such a short sequence this could be done by hand. But with the method worked out in this fashion, it would be possible to process many more sequences. That is another story.


(*) The concept of regular expressions dates from the 1950s, when the American mathematician Stephen Cole Kleene formalized the description of a regular language. Kleene had joined the mathematics department at the University of Wisconsin–Madison in 1935, where he spent nearly all of his career.


Amino acids, one and three letter codes

Amino acid                    Three letter code   One letter code
alanine                       ala                 A
arginine                      arg                 R
asparagine                    asn                 N
aspartic acid                 asp                 D
asparagine or aspartic acid   asx                 B
cysteine                      cys                 C
glutamic acid                 glu                 E
glutamine                     gln                 Q
glutamine or glutamic acid    glx                 Z
glycine                       gly                 G
histidine                     his                 H
isoleucine                    ile                 I
leucine                       leu                 L
lysine                        lys                 K
methionine                    met                 M
phenylalanine                 phe                 F
proline                       pro                 P
serine                        ser                 S
threonine                     thr                 T
tryptophan                    trp                 W
tyrosine                      tyr                 Y
valine                        val                 V
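
For a fully scriptable alternative to the web converters, the table above can be used as a lookup in awk. The following is only a sketch: the function names one_to_three and three_to_one and the variable aa_table are made up for this example, and the placeholders Xaa and X for unrecognized codes are an assumption.

```shell
# Lookup table taken from the table above: one-letter code, three-letter code.
aa_table='A Ala
R Arg
N Asn
D Asp
B Asx
C Cys
E Glu
Q Gln
Z Glx
G Gly
H His
I Ile
L Leu
K Lys
M Met
F Phe
P Pro
S Ser
T Thr
W Trp
Y Tyr
V Val'

# one_to_three SEQ: print the three-letter codes, one per line.
# Table lines have 2 fields, sequence lines (one letter each) have 1 field.
one_to_three() {
  { printf '%s\n' "$aa_table"
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | fold -w1; } |
    awk 'NF == 2 { three[$1] = $2; next }
         NF == 1 { print (($1 in three) ? three[$1] : "Xaa") }'
}

# three_to_one SEQ: print the one-letter sequence on a single line.
# SEQ is expected without spaces, e.g. MetGlnAsn; case is ignored.
three_to_one() {
  { printf '%s\n' "$aa_table"
    printf '%s\n' "$1" | fold -w3; } |
    awk 'NF == 2 { one[toupper($2)] = $1; next }
         NF == 1 { k = toupper($1); printf "%s", ((k in one) ? one[k] : "X") }
         END     { print "" }'
}

one_to_three "MQN"
three_to_one "MetGlnAsn"
```

The output of one_to_three is already one amino acid per line, so it can be piped straight into tr '[:lower:]' '[:upper:]' for the upper-case version.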