Fast Sequence Composition

Sequence Composition: fast shell method

Motivation

This post was inspired to “verify” a Python Script meant to compute the sequence composition of amino acids, as suggested by ChatGPT. (See ChatGPT_Protein_Composition.ipynb, or viewer ChatGPT_Protein_Composition.ipynb.) The code below is inspired by post number-of-occurences-of-letters-in-a-word.

I found the process useful, and in addition applicable to nucleic acids as well. There is no programming, just the “stringing” of Unix utilities, each one “doing a few simple things very well” which is part of the power of Unix/Linux systems.

The example below works for a simple sequence that is copied and pasted onto a terminal using the bash or a similar text-based shell. (Windows PowerShell or the older cmd programs are very different… Windows users may be helped with how-to-install-and-use-the-linux-bash-shell-on-windows-10/.)

Version 1: short pasted sequence

Let’s assume that we have a sequence in FastA format and we want to know the base composition, or the amino acid composition.

While there are specialized software to handle such questions, Unix-like shell command can once more demonstrate its power by piping multiple simple utilities.

Piping (symbol | is used to transfer results from one software to the next.

If the sequence is small, we can use echo to start the process and then use a counting method (details below.)

echo -n GFFGAIAGFLE | sed 's/\(.\)/\1\n/g'| sort | uniq -c
  • echo starts the data stream. We add -n to remove the return character at the end of the line
  • sed is the stream editor that is used here with a regular expression that: a) recognizes any character symbolized by a dot .. The parentheses that contain the dot have to be escaped with \ to be avoid confusion with the shell. The dot is then replaced by the content of abuffernamed \1 followed by a return character, ornewline” \n.
    • Therefore each character is replace with itself plus a newline (\n)
  • sort will alphabetize the list
  • uniq -c will count each separate character (one per line.)

The output from the above example would be:

   2 A
   1 E
   3 F
   3 G
   1 I
   1 L

We could add one more command to invert the columns, with a colon added:

echo -n GFFGAIAGFLE | sed 's/\(.\)/\1\n/g'| sort | uniq -c | awk '{print $2" : "$1}'

Output:

A : 2
E : 1
F : 3
G : 3
I : 1
L : 1

Version 2: using a FastA file

This method will also work with any FastA sequence file with a single sequence within it (i.e. not a multiple sequence file.) The first thing to do is to remove the first line which contains the name of the sequence, and then apply the counting mechanism. However, we have to add a command to remove the end of lines or they will also be counted. This is done by adding tr -d '\n' within the pipeline.

For example, if we have the following sequence file within a text file called myseq.fa:

>3BT6:B
GFFGAIAGFLEGGWEGMIAGWHGYTSHGAHGVAVAADLKSTQEAINKITKNLNSLSELEV
KNLQRLSGAMDELHNEILELDEKVDDLRADTISSQIELAVLLSNEGIINSEDEHLLALER
KLKKMLGPSAVDIGNGCFETKHKCNQTCLDRIAAGTFNAGEFSLPTFDS

The command would be, with the added column inversion:

sed 1d < myseq.fa | tr -d '\n' | sed 's/\(.\)/\1\n/g'| sort | uniq -c | awk '{print $2" : "$1}'

Here we send the file content to sed with sed 1d < myseq.fa, and then pass the output to the tr utility to remove the end of line. The rest is the same as above.

The first few lines of output would be:

A : 16
C : 3
D : 10
E : 16
F : 7
G : 17
...

This is how small Unix utilities can be very helpful for things the authors had not even imagined!


More on this subject: