Sequence Vectorization in Bioinformatics

An in-depth exploration of sequence vectorization and applications

Published in

The Computational Biology Magazine

4 min readJun 11, 2021

In one of my previous articles, I mentioned how and what is vectorization. In gist, the process enables us to convert variable-length nucleotide sequences into fixed-length numeric vectors.

For all the programming lovers, the link is below;

Vectorization of DNA sequences

Efficient conversion of ACGT sequences into k-mer frequency vectors

medium.com

However, in this article, I will rather focus on utility and research importance. Hence, I will stick to my tool Seq2Vec for the rest of the article. You can follow the readme and install from the following source;

anuradhawick/seq2vec

This tool is intended to be used for data generation in Bioinformatics Machine Learning related tasks. You can use…

github.com

A Simulated Real-world Scenario

A typical analysis in metagenomics is the binning or the identification of the number of species. Therefore, let’s simulate few species reads and mix to emulate a metagenomic sample from the environment. I will be using SimLoRD to simulate metagenomics long reads (as they are likely to become popular in future).

Species details

I have set the minimum read length to be 5000 base pairs to be reasonable. Using the following sample command and table you can replicate the simulation (with some randomness of course).

simlord -rr <REF> --no-sam -mr 5000 -c 45 <OUTPUT>

Note that I use --no-sam to save disk space.

╔══════════════════════════════════════════╦═════════════════════╗
║ Species Name                             ║ Simulation Coverage ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Pseudomonas_putida.CP002290.1            ║                  83 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Metallosphaera_cuprina.NC_015435.1       ║                  14 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Chlamydophila_psittaci.CP002807.1        ║                  64 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Cyanobacterium_UCYN.NC_013771.1          ║                  71 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Nocardia_farcinica.NC_006361.1           ║                  83 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Xanthobacter_autotrophicus.NC_009720.1   ║                  11 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Clostridium_thermocellum.NC_009012.1     ║                  36 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Erysipelothrix_rhusiopathiae.NC_015601.1 ║                  69 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Variovorax_paradoxus.NC_012792.1         ║                  72 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Aeromonas_veronii.NC_015424.1            ║                  45 ║
╚══════════════════════════════════════════╩═════════════════════╝

The resulting dataset consisted of 209875 read sequences totalling 1.7GB. Note that this is not a huge dataset, yet a reasonable start.

Vectorization

Now, we can use Seq2Vec to vectorise these sequences. Using the following command, let’s make vectors of 4-mers (tetra-nucleotide frequency vectors).

seq2vec -f reads.fasta -o 4mers -k 4 -t 8

We can always use different k values but 4 is a common start. Nonetheless, starting with a small k is advisable when resource utilization is not known in advance.

The vectorization results in 209875 vectors each with 136 dimensions. At this point, we can perform different operations like clustering, embedding, etc. However, for the sake of visualization, we can simply embed the vectors into 2 dimensions and plot.

Embedding in Lower Dimensions

Lower dimensional embedding is a popular tactic to see how your data lies in space. Moreover, manifold learning gives not only the lower dimensions but also a good approximation of the arrangement of points in space using neighbourhoods. So let’s try to use UMAP (umap-learn) to embed and see how our data lies.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import umapdata = np.array(LOAD DATA)
ids = np.array(LOAD GROUND TRUTH)data_2d = umap.UMAP().fit_transform(data)plt.figure(figsize=(10,10))
sns.scatterplot(x=data_2d.T[0], y=data_2d.T[1])
plt.legend(bbox_to_anchor=(1,1), loc="upper left")

I have omitted loading data and ground truth for clarity. It would be a simple text file reading operation. Read ids represent ground truth which is typically available in FASTQ header from SimLoRD.

Our plot looks as follows;

Note that we can see clusters corresponding to different species. Also, we can see local string-like structures in clusters. This likely represents paths in each species. Cluster sizes indicate the coverage of the species.

This could be a good starting point for binning although much more nice handling needs to be done for good binning of long reads.

Conclusions

Although in this article we used simulated data, the behaviour remains quite similar in real datasets. However, a lot of noise would prevail requiring advanced algorithms and techniques other than simple embedding.

Our research group (as a part of my PhD work) has produced 2 tools (so far) that exploit these features in metagenomic binning. We also showed that binning can save a lot of RAM and time used for assembly. In some cases, we were even able to improve the metagenomics long read (PacBio/ONT) assemblies.

MetaBCC-LR

Other than the vectors discussed here we introduce a new form of a vector called coverage histogram to increase available information. Published in Bioinformatics (from ISMB 2020 conference). More details in the following links;

MetaBCC-LR: metagenomics binning by coverage and composition for long reads

Metagenomics studies have provided key insights into the composition and structure of microbial communities found in…

doi.org

anuradhawick/MetaBCC-LR

MetaBCC-LR is coded purely using C++ (v9) and Python 3.6. To run MetaBCC-LR, you will need to install the following…

github.com

LRBinner

Further improving MetaBCC-LR, we use variational auto-encoders (motivated by VAMB but significantly different) and a new distance histogram-based clustering to perform binning. This work has been accepted at WABI-2021 and will be available online soon (will update this post).

anuradhawick/LRBinner

Training on CUDA is now available! A new update will be available end of June 2021 with much faster vectorization and…

github.com

Tutorial posts on using these tools will be available soon!

Happy reading! Cheers.

Sequence Vectorization in Bioinformatics

An in-depth exploration of sequence vectorization and applications

Vectorization of DNA sequences

Efficient conversion of ACGT sequences into k-mer frequency vectors

anuradhawick/seq2vec

This tool is intended to be used for data generation in Bioinformatics Machine Learning related tasks. You can use…

A Simulated Real-world Scenario

Species details

Vectorization

Embedding in Lower Dimensions

Conclusions

MetaBCC-LR

MetaBCC-LR: metagenomics binning by coverage and composition for long reads

Metagenomics studies have provided key insights into the composition and structure of microbial communities found in…

anuradhawick/MetaBCC-LR

MetaBCC-LR is coded purely using C++ (v9) and Python 3.6. To run MetaBCC-LR, you will need to install the following…

LRBinner

anuradhawick/LRBinner

Training on CUDA is now available! A new update will be available end of June 2021 with much faster vectorization and…

Written by Anuradha Wickramarachchi