UMAP

seqlearner.MultiTaskLearner.visualize(method="UMAP", family=None, proportion=1.5)

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data 1. The data is uniformly distributed on a Riemannian manifold; 2. The Riemannian metric is locally constant (or can be approximated as such); 3. The manifold is locally connected.

The details about the underlying mathematics of UMAP method can be found in the following paper: - McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

We have used the sklearn wrapper function which implements UMAP and applied it on the embedding results. The visualize method has the following arguments:

Arguments

  • method: String, Possible values are TSNE and UMAP
  • family: String, Name of protein family to be visualized
  • proportion: Positive float, population proportion of number of other classes by number of

Apply UMAP visualization on CRISP Protein family

from seqlearner import MultiTaskLearner as mtl
import pandas as pd
from seqlearner import Freq2Vec
sequences = pd.read_csv("./protein_sequences.csv", header=None)
freq2vec = Freq2Vec(sequences, word_length=3, window_size=5, emb_dim=25, loss="mean_squared_error", epochs=250)
freq2vec.freq2vec_maker()
freq2vec_embedding = mtl.embed(word_length=3, embedding="freq2vec", func="sum", emb_dim=25, gamma=0.1, epochs=100)
mtl.visualize(method="UMAP", family="CRISP_family", proportion=2.0)

The visualization plot is in the following:

See Also