Sequence Vectorization in Bioinformatics

An in-depth exploration of sequence vectorization and applications

--

In one of my previous articles, I mentioned how and what is vectorization. In gist, the process enables us to convert variable-length nucleotide sequences into fixed-length numeric vectors.

Photo by Elena Mozhvilo on Unsplash

For all the programming lovers, the link is below;

However, in this article, I will rather focus on utility and research importance. Hence, I will stick to my tool Seq2Vec for the rest of the article. You can follow the readme and install from the following source;

A Simulated Real-world Scenario

Photo by Louis Reed on Unsplash

A typical analysis in metagenomics is the binning or the identification of the number of species. Therefore, let’s simulate few species reads and mix to emulate a metagenomic sample from the environment. I will be using SimLoRD to simulate metagenomics long reads (as they are likely to become popular in future).

Species details

I have set the minimum read length to be 5000 base pairs to be reasonable. Using the following sample command and table you can replicate the simulation (with some randomness of course).

simlord -rr <REF> --no-sam -mr 5000 -c 45 <OUTPUT>

Note that I use --no-sam to save disk space.

╔══════════════════════════════════════════╦═════════════════════╗
║ Species Name ║ Simulation Coverage ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Pseudomonas_putida.CP002290.1 ║ 83 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Metallosphaera_cuprina.NC_015435.1 ║ 14 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Chlamydophila_psittaci.CP002807.1 ║ 64 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Cyanobacterium_UCYN.NC_013771.1 ║ 71 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Nocardia_farcinica.NC_006361.1 ║ 83 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Xanthobacter_autotrophicus.NC_009720.1 ║ 11 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Clostridium_thermocellum.NC_009012.1 ║ 36 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Erysipelothrix_rhusiopathiae.NC_015601.1 ║ 69 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Variovorax_paradoxus.NC_012792.1 ║ 72 ║
╠══════════════════════════════════════════╬═════════════════════╣
║ Aeromonas_veronii.NC_015424.1 ║ 45 ║
╚══════════════════════════════════════════╩═════════════════════╝

The resulting dataset consisted of 209875 read sequences totalling 1.7GB. Note that this is not a huge dataset, yet a reasonable start.

Vectorization

Now, we can use Seq2Vec to vectorise these sequences. Using the following command, let’s make vectors of 4-mers (tetra-nucleotide frequency vectors).

seq2vec -f reads.fasta -o 4mers -k 4 -t 8

We can always use different k values but 4 is a common start. Nonetheless, starting with a small k is advisable when resource utilization is not known in advance.

The vectorization results in 209875 vectors each with 136 dimensions. At this point, we can perform different operations like clustering, embedding, etc. However, for the sake of visualization, we can simply embed the vectors into 2 dimensions and plot.

Embedding in Lower Dimensions

Lower dimensional embedding is a popular tactic to see how your data lies in space. Moreover, manifold learning gives not only the lower dimensions but also a good approximation of the arrangement of points in space using neighbourhoods. So let’s try to use UMAP (umap-learn) to embed and see how our data lies.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import umap
data = np.array(LOAD DATA)
ids = np.array(LOAD GROUND TRUTH)
data_2d = umap.UMAP().fit_transform(data)plt.figure(figsize=(10,10))
sns.scatterplot(x=data_2d.T[0], y=data_2d.T[1])
plt.legend(bbox_to_anchor=(1,1), loc="upper left")

I have omitted loading data and ground truth for clarity. It would be a simple text file reading operation. Read ids represent ground truth which is typically available in FASTQ header from SimLoRD.

Our plot looks as follows;

Plot by Author

Note that we can see clusters corresponding to different species. Also, we can see local string-like structures in clusters. This likely represents paths in each species. Cluster sizes indicate the coverage of the species.

This could be a good starting point for binning although much more nice handling needs to be done for good binning of long reads.

Conclusions

Although in this article we used simulated data, the behaviour remains quite similar in real datasets. However, a lot of noise would prevail requiring advanced algorithms and techniques other than simple embedding.

Our research group (as a part of my PhD work) has produced 2 tools (so far) that exploit these features in metagenomic binning. We also showed that binning can save a lot of RAM and time used for assembly. In some cases, we were even able to improve the metagenomics long read (PacBio/ONT) assemblies.

MetaBCC-LR

Other than the vectors discussed here we introduce a new form of a vector called coverage histogram to increase available information. Published in Bioinformatics (from ISMB 2020 conference). More details in the following links;

LRBinner

Further improving MetaBCC-LR, we use variational auto-encoders (motivated by VAMB but significantly different) and a new distance histogram-based clustering to perform binning. This work has been accepted at WABI-2021 and will be available online soon (will update this post).

Tutorial posts on using these tools will be available soon!

Happy reading! Cheers.

--

--