Stochastic Neighbor Embedding (SNE) is a family of algorithms that create a low-dimensional representation of high-dimensional data by randomly initializing positions for each data point in low-dimensional space, and then optimizing their positions iteratively to better approximate the relationships between those data points and a sampling of their neighbors in the original high dimensional space. The defining characteristic of SNE algorithms is that the appropriate place for each data point in the low dimensional space is represented as a probability distribution for each of its neighbors.
The family of SNE algorithms can be confusing; some are variations on the probability functions used to characterize the relationships between neighbors in either the low-dimensional space or the high-dimensional space. Others involve entirely different approaches, such as alternative algorithms for gradient descent, or approximate algorithms for finding neighbors. We’ll walk you through the original algorithm, its major variants, and some recently published innovations on how to minimize the runtime while maximizing the quality of the embedding.
SNE was the brainchild of Geoffrey Hinton and Sam Roweis in 2002 [1], who cited such well-known techniques as MultiDimensional Scaling (MDS) and Principal Components Analysis (PCA), along with Roweis’ earlier work on Locally Linear Embedding. Their algorithm models each data point’s neighbors using a Gaussian probability distribution whose entropy is adjusted until it’s approximately the log of the perplexity. In other words, for perplexity k, approximately k neighbors are considered, and they are weighted by distance according to a gaussian distribution. The same distribution is used in low dimensional space, and they’re compared using Kullback-Leibler Divergence (KLD), a measure of relative entropy between two probability distributions. The SNE algorithm performs a gradient descent over KLD.
The original SNE algorithm produced embeddings of ground-breaking quality. However, it suffered from two major setbacks that prevented it from achieving wide adoption. The first setback was the computational cost: each iteration of gradient descent required the comparison of every pair of data points, and so had quadratic complexity. This severely limited the size of dataset for which SNE could be computed. The second setback was the crowding problem: although data points in the low-dimensional space would end up close to their neighbors, there were often only very small gaps between dissimilar “clumps” or clusters of points.
In 2008, Laurens van der Maaten and Geoffrey Hinton published a significant improvement upon the SNE algorithm, known as t-SNE [2]. This is the t-SNE that we all (think we) know and (think we) love. This variant of SNE incorporates two major modifications to the original: (1) it simplifies the representation of neighbors into a single probability distribution by making the neighbor matrix symmetrical, and (2) it uses a Student t-distribution instead of a Gaussian distribution to characterize the low-dimensional embedding. The Student t-distribution’s long tails allow more distant neighbors greater freedom in where they are placed, which helps alleviate the crowding problem suffered by the original SNE algorithm. The “clumps” in SNE became “islands” in t-SNE.
The SNE Algorithm
Step 1: Find nearest neighbors in original data.
Step 2: Represent neighbors as a probability distribution.
Step 3: Initialize low-dimensional embedding.
Step 4a: Optimization – “early exaggeration”: optimize KLD between low-dimensional embedding and exaggerated original probability distributions.
Step 4b: Optimization: optimize KLD between low-dimensional embedding and plain original probability distributions.
You, as a practitioner of cytometry, may think that you have seen or even run many t-SNE embeddings in recent years. As it turns out, the de facto standard implementation of t-SNE in the last few years was not the one described in van der Maaten’s 2008 work, but rather “Barnes-Hut t-SNE”, which makes some approximations in order to compute a t-SNE-like embedding much faster.
Laurens van der Maaten published a follow-up manuscript in 2014 entitled “Accelerating t-SNE using tree-based algorithms” [3], in which he noted that “the gradient of the Kullback-Leibler divergence that is minimized has a very natural interpretation as an N-body system in which all of the N points in the low-dimensional embedding exert forces on each other, and the resultant force on each of the points needs to be computed.” This breakthrough observation led van der Maaten to adopt the Barnes-Hut approximation, a tree-based algorithm that was originally developed in the field of computational astrophysics for estimating the effect that each data point has on each other [4]. The Barnes-Hut approximation makes the simplifying assumption that the effects of points that are far away can be estimated as a group, rather than individually, since their effects are small anyway. This algorithm has an O(n log n) runtime per iteration, and thus offers a huge speedup relative to the original quadratic t-SNE algorithm, despite producing nearly identical embeddings.
Figure 1, linear and log-log plot comparisons of the theoretical run time complexities of t-SNE (quadratic), bht-SNE (n log n), and FIt-SNE (linear).
The t-SNE algorithm appeared in cytometry almost immediately, with the so-called “viSNE implementation” of t-SNE (2008) [5]. This implementation addressed the scalability issues of t-SNE by subsampling the data, and performing the computation in a distributed manner, on a cluster of machines. The authors demonstrated with a handful of examples that t-SNE can produce informative embeddings of cytometry datasets.
Following its introduction in 2014, Barnes-Hut t-SNE experienced rapid adoption in the field of cytometry, and its popularity has only grown as high-parameter flow and mass cytometry have reshaped the field, and new modalities like single-cell RNA-Seq have gained popularity. The Barnes-Hut approximation has become so ubiquitous that the term “t-SNE” became essentially synonymous with “Barnes-Hut t-SNE”. Around this time, we at FlowJo released our first implementation, a Barnes-Hut t-SNE plugin (FlowJo v10.1r7, 2015). In recent years, t-SNE has become so popular that speakers at the annual CYTO conference often refer to their “obligatory t-SNE plot”.
Some other variants of t-SNE have also been proposed and made available, either in commercial software or open source. One such variant is Cen-se’, which modifies Barnes-Hut t-SNE by using a Cauchy distribution to represent neighbors in the high-dimensional space. Another, HSNE, performs t-SNE hierarchically and interactively on-demand, allowing users to navigate high-level, low-resolution representations of large datasets, then focus in on certain areas for more detail [6].
The cytometry community’s uptake of t-SNE (and friends) was so rapid that, even in the midst of heavy usage, fundamental questions remained unanswered about the t-SNE algorithm and the embeddings it produces. Several researchers began independently exploring the nature of t-SNE, how best to apply it, and what to do with its results.
In 2017, George C. Linderman and his colleagues posted a pre-publication manuscript online that explored the relationship between t-SNE and clustering. They established under certain parameter settings, the early exaggeration phase of t-SNE provably produces a spectral clustering of the data. It is unclear if or how this finding pertains to typical uses of t-SNE in cytometry (whose implementations of t-SNE differ considerably from the one used by the authors), nor to what extent it indicates the practical quality of a t-SNE embedding. Although their work provides a sort of mathematical justification for gating on the “islands” of a t-SNE plot, this procedure is still contentious today for the reasons mentioned here, and it continues to be studied [8]. Linderman’s work was later published in the SIAM Journal on Mathematics of Data Science [7].
In 2018, Dmitry Kobak and Phillip Berens uploaded a pre-print manuscript exploring how to incorporate global dataset structure into a t-SNE embedding (along with some other innovations) [9]. They also include a protocol for the application of t-SNE to single-cell RNA-Seq datasets.
That same year, Anna Belkina and her collaborators (including yours truly) deposited the first version of a manuscript on optimizing the parameters of the t-SNE algorithm [10]. They performed empirical analysis of a range of cytometry datasets and determined that (1) the choice of learning rate is critical to map quality for all datasets, (2) typical default parameters that were inherited by most cytometry t-SNE implementations from original 2014 van der Maaten publication are poorly suited for large datasets that are common in cytometry applications, and (3) the duration of the early exaggeration phase of t-SNE has a profound effect on embedding quality, and should be controlled by monitoring the rate of change of Kullback-Leibler divergence. They proposed a cytometry-friendly, optimized version of the t-SNE algorithm called opt-SNE, which implements their recommendations. Notably, opt-SNE is able to produce high-quality embeddings for large datasets (>107 data points) using five to ten times fewer iterations of gradient descent, which makes embedding such datasets feasible and worthwhile for the first time. Opt-SNE is available in FlowJo v10.6 and SeqGeq v1.6, along with an open source implementation described in the paper. This work was published in November 2019 in Nature Communications [11].
Although the Barnes-Hut approximation was a huge leap forward for t-SNE in terms of speed, dataset sizes in cytometry have outpaced the growth of computing power. An O(n log n) algorithm was not sufficient. So, in 2017, mere months after posting his manuscript evaluating t-SNE as a means of clustering, George C. Linderman et al. posted a second manuscript describing how to take advantage of curious mathematical similarities in order to calculate the t-SNE gradient using interpolation and a Fast Fourier Transform (FFT) in O(n) time [12]. Additionally, they made use of a modern approximate nearest neighbor algorithm to accelerate the first step of the calculation as well. These advances, collectively termed “FIt-SNE,” produce a further four to five times speedup to the t-SNE calculation. Along with opt-SNE parameterization and some engineering improvements to the underlying code, the result is a t-SNE implementation that can embed datasets with tens of millions of events in about a day, using a modestly spec’d modern workstation running FlowJo v10.6.
Figure 2, a t-SNE embedding of 40 million cells, produced by FlowJo 10.6.
A parallel thread of research has been to leverage a GPU during the calculation of t-SNE. Several authors have had success calculating t-SNE using vector fields and GPU computation, with impressive results (t-SNE-CUDA [13], TensorFlow.js t-SNE [14]).
There’s still more work to be done. Large scale studies may collect orders of magnitude more data than current approaches support.
There are many dimensionality reduction algorithms aside from t-SNE, each with its own philosophy, resource needs, and characteristics. UMAP, a recent alternative, is derived from concepts in the mathematics of topology and set theory [15]. It is purported to partially capture both local and global structure, and in theory the algorithm should scale well. Another contender is LargeVis, whose authors claim it produces high-quality embeddings with lower computational cost than t-SNE, using a different objective function and a sampling technique to handle non-neighbor interactions [16]. An even newer option, TriMap, has only just been described in a pre-print manuscript [17]. It uses the relative distances between a sampling of non-neighbor triplets of data points to calculate a gradient towards better representation of global structure. Other methods abound (e.g., TSEE [18], EmbedSOM [19], etc.), but to date none has quite supplanted t-SNE as the method of choice for cytometric data.
It’s clear that t-SNE is a powerful tool for uncovering the structure of cytometric data. Still, it won’t remain the darling of cytometry forever. We’ll be keeping our eyes peeled for new trends, arming you with best tools available, and perhaps surprise you with a few of our own breakthroughs in the near future.
SNE: Stochastic Neighbor Embedding, the original algorithm. 2002
t-SNE: t-distributed Stochastic Neighbor Embedding, a more practical and effective version. 2008
bht-SNE: Barnes-Hut t-SNE, a huge speed improvement that made the technique feasible for use in cytometry. 2014
vi-SNE: A proprietary implementation of t-SNE. 2013
Cen-se’: A proprietary modification of t-SNE. 2019
H-SNE: A hierarchical, interactive, t-SNE-like algorithm. 2016
opt-SNE: Guidelines and minor modifications for obtaining better embeddings, especially for large datasets. 2019
FIt-SNE: FFT-Interpolated tSNE, another huge speed improvement that made the technique feasible for large cytometry datasets. 2018
tSNE-CUDA: A GPU-accelerated implementation of bht-SNE. 2018
GPGPU-tSNE: A GPU-accelerated reformulation of t-SNE. 2019
[1] Hinton, G. E., & Roweis, S. T. (2003). Stochastic neighbor embedding. In Advances in neural information processing systems (pp. 857-864).
[2] Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605.
[3] Van Der Maaten, L. (2014). Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1), 3221-3245.
[4] Barnes, J., & Hut, P. (1986). A hierarchical O (N log N) force-calculation algorithm. nature, 324(6096), 446.
[5] Amir, E. A. D., Davis, K. L., Tadmor, M. D., Simonds, E. F., Levine, J. H., Bendall, S. C., ... & Pe'er, D. (2013). viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nature biotechnology, 31(6), 545.
[6] Pezzotti, N., Höllt, T., Lelieveldt, B., Eisemann, E., & Vilanova, A. (2016, June). Hierarchical stochastic neighbor embedding. In Computer Graphics Forum (Vol. 35, No. 3, pp. 21-30).
[7] Linderman, G. C., & Steinerberger, S. (2019). Clustering with t-SNE, provably. SIAM Journal on Mathematics of Data Science, 1(2), 313-332.
[8] Eshghi, S. T., Au-Yeung, A., Takahashi, C., Bolen, C. R., Nyachienga, M. N., Lear, S. P., ... & O'Gorman, W. E. (2019). Quantitative comparison of conventional and t-SNE-guided gating analyses. Frontiers in immunology, 10.
[9] Kobak, D., & Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10(1), 1-14.
[10] Belkina, A. C., Ciccolella, C. O., Anno, R., Halpert, R., Spidlen, J., & Snyder-Cappione, J. E. (2019). Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and allow analysis of large datasets. bioRxiv, 451690.
[11] Belkina, A. C., Ciccolella, C. O., Anno, R., Halpert, R., Spidlen, J., & Snyder-Cappione, J. E. (2019). Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nature Communications, 10(1), 1-12.
[12] Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S., & Kluger, Y. (2019). Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature methods, 16(3), 243.
[13] Chan, D. M., Rao, R., Huang, F., & Canny, J. F. (2018, September). t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data. In 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (pp. 330-338). IEEE.
[14] Pezzotti, N., Thijssen, J., Mordvintsev, A., Höllt, T., Van Lew, B., Lelieveldt, B. P., ... & Vilanova, A. (2019). GPGPU linear complexity t-SNE optimization. IEEE transactions on visualization and computer graphics.
[15] McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
[16] Tang, J., Liu, J., Zhang, M., & Mei, Q. (2016, April). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th international conference on world wide web (pp. 287-297). International World Wide Web Conferences Steering Committee.
[17] Amid, E., & Warmuth, M. K. (2018). A more globally accurate dimensionality reduction method using triplets. arXiv preprint arXiv:1803.00854.
[18] An, S., Ma, L., & Wan, L. (2019). TSEE: an elastic embedding method to visualize the dynamic gene expression patterns of time series single-cell RNA sequencing data. BMC genomics, 20(2), 224.
[19] Kratochvil, M., Koladiya, A., Balounova, J., Novosadova, V., Sedlacek, R., Fišer, K., ... & Drbal, K. (2019). SOM-based embedding improves efficiency of high-dimensional cytometry data analysis. bioRxiv, 496869.