Single-cell RNA sequencing (scRNA-seq) makes it possible to characterize novel cell types and detect intra-population heterogeneity (Potter 2018). The amount of scRNA-seq data in the public domain has increased owing to technological development and efforts to obtain large-scale transcriptomic profiles of cells (Han et al. 2018). Computational algorithms to process and analyze large-scale, high-dimensional single-cell data are therefore essential. To cluster high-dimensional scRNA-seq data, dimension-reduction algorithms such as principal component analysis (PCA) (Joliffe and Morgan 1992) or independent component analysis (ICA) (Hyvärinen and Oja 2000) have been successfully applied to process and visualize the data. However, obtaining principal or independent components takes considerable time as the number of cells increases. Dimension reduction decreases processing time at the cost of losing the original cell-to-cell distances. For instance, t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten 2014) effectively embeds multidimensional data in a reduced-dimensional space for visualization, but in doing so it distorts the distances between cells. In addition, t-SNE requires considerable time to visualize and cluster large-scale scRNA-seq data. Random projection (RP) (Bingham and Mannila 2001) has been suggested as a powerful dimension-reduction method. Based on the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss 1984), RP reduces the dimension while approximately preserving the distances between points (Frankl and Maehara 1988). Theoretically, RP is very fast because it does not require calculation of pairwise cell-to-cell distances or principal components.
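The distance-preserving property of RP can be illustrated with a minimal sketch. It assumes a dense Gaussian projection matrix, which is only one of several valid constructions and is not necessarily the variant used by SHARP. The function names, the toy data, and the dimensions (500 features reduced to 200) are hypothetical choices made for illustration.

```python
import math
import random


def random_projection_matrix(n_features, n_components, rng):
    """Build an n_components x n_features matrix with N(0, 1/n_components) entries.

    Assumption: a dense Gaussian matrix; sparse constructions also exist.
    """
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
            for _ in range(n_components)]


def project(matrix, vector):
    """Project one cell (a feature vector) into the reduced space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_ratios(points, matrix):
    """Ratio of projected to original distance for every pair of points."""
    projected = [project(matrix, p) for p in points]
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            original = euclidean(points[i], points[j])
            reduced = euclidean(projected[i], projected[j])
            ratios.append(reduced / original)
    return ratios


rng = random.Random(0)  # fixed seed so the sketch is repeatable
cells = [[rng.random() for _ in range(500)] for _ in range(6)]  # toy "cells"
R = random_projection_matrix(500, 200, rng)
ratios = distance_ratios(cells, R)
print(min(ratios), max(ratios))  # ratios near 1 mean distances are roughly kept
```

No pairwise distances or eigen-decompositions are needed to build the projection itself; the distances are computed here only to check how well they survive.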
To effectively handle very large-scale scRNA-seq data without excessive distortion of cell-to-cell distances, we developed SHARP (Supplemental Code), a hyperfast clustering algorithm based on ensemble RP (Methods) (Fig. 1A). RP (Bingham and Mannila 2001) projects the original [...] for scRNA-seq data with [...] cells and [...] genes. Compared with it, a simple hierarchical clustering algorithm requires [...] log( min( [...]

[...] triangular part shows the scatter plots of the cell-to-cell distances, whereas the [...] triangular part shows the Pearson's correlation coefficient (PCC) of the corresponding two spaces.

[...] α (GCG), β (INS), acinar (PRSS1), and δ (SST) cells (Supplemental Fig. S7).

Clustering 1.3-million-cell data using SHARP

Of note, SHARP provides an opportunity to study million-cell-level data sets. Previous analysis of the scRNA-seq data with 1,306,127 cells from embryonic mouse brains (10x Genomics 2017) was performed using [...]

[...] each of the rows corresponds to a gene (or transcript), and each of the columns corresponds to a single cell. The input data can be fragments/reads per kilobase per million mapped reads (FPKM/RPKM), counts per million mapped reads (CPM), transcripts per million (TPM), or unique molecular identifiers (UMI). For consistency, FPKM/RPKM values are converted into TPM values, and UMI values are converted into CPM values.

Data partition

For a large-scale data set, SHARP partitions the data using a divide-and-conquer strategy. SHARP divides the scRNA-seq data into blocks, where each block may contain a different number of cells (i.e., [...] is the minimum integer no less than [...]). [...] in each block are as follows: if [...] = 1, [...] = 1; if [...] = 2, 3, [...]. [...] = log2([...]) [...] (0, 1], as suggested by the Johnson-Lindenstrauss lemma.

Ensemble RP

After RP, Pearson correlation coefficients between each pair of single cells were calculated using the dimension-reduced feature matrix.
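As a rough illustration of the correlation step described above, the sketch below computes pairwise Pearson correlation coefficients between cells and turns them into a distance matrix. The conversion 1 - r is a common convention assumed here; the text does not state which conversion is used. The function names and the toy reduced matrix (one row per cell) are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors.

    Assumes neither vector is constant (a zero variance would divide by zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance_matrix(cells):
    """Symmetric cell-by-cell matrix of 1 - r (assumed conversion)."""
    n = len(cells)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - pearson(cells[i], cells[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist


# Toy dimension-reduced feature matrix: three cells, four reduced features.
reduced = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the first cell
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with the first cell
]
dist = correlation_distance_matrix(reduced)
```

Because the correlations are computed on the reduced matrix, the cost of each coefficient depends on the reduced dimension rather than on the number of genes.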
An agglomerative hierarchical clustering (hclust) with the ward.D (Ward 1963) method was used to cluster the correlation-based distance matrix. We first applied RP K times to obtain K RP-based dimension-reduced feature matrices and, from them, K distance matrices. Each of the K matrices was clustered by a ward.D-based hclust. As a result, K different clustering results were obtained, each from an RP-based distance matrix, and these were combined by a weighted-based metaclustering (wMetaC) algorithm (Ren et al. 2017), detailed in the next step.

wMetaC

Compared with the traditional cluster-based similarity partitioning algorithm (CSPA) (Strehl and Ghosh 2002), which treats each instance and each cluster as equally important, wMetaC assigns different weights to different instances (or instance pairs) and different clusters to improve the clustering performance. wMetaC includes four steps: (1) calculating cell weights, (2) calculating weighted cluster-to-cluster pairwise similarity, (3) clustering on a weighted cluster-based similarity matrix, and (4) determining final results by a voting scheme. Note that wMetaC was applied to each block of single cells. The flowchart of the wMetaC ensemble clustering method is shown in Supplemental Figure S15. Specifically, for calculating cell weights, similar to the first several steps in CSPA, we first converted the individual RP-based clustering results into a colocation similarity matrix, S, whose element [...] represents the similarity between the [...] is the element in the [...] = 1 (i.e., the [...] = 0 (i.e., the [...] reaches the minimum.
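The conversion of several clustering results into a colocation similarity matrix can be sketched as follows. This follows the usual CSPA convention, in which each entry is the fraction of clustering runs that place two cells in the same cluster; the exact definition of S in the text above is incomplete, so this form is an assumption. The function name and the toy label vectors are hypothetical, and the weighting steps of wMetaC are not shown.

```python
def colocation_matrix(labelings):
    """Fraction of clustering runs in which each pair of cells shares a cluster.

    labelings: list of K label lists, one per clustering run, all of length N.
    Cluster labels only need to be consistent within a single run.
    """
    n_runs = len(labelings)
    n_cells = len(labelings[0])
    counts = [[0] * n_cells for _ in range(n_cells)]
    for labels in labelings:
        for i in range(n_cells):
            for j in range(n_cells):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    # Normalize co-occurrence counts to the range [0, 1].
    return [[c / n_runs for c in row] for row in counts]


# Three toy clustering runs over four cells (K = 3, N = 4).
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],  # same partition as the first run, labels swapped
]
S = colocation_matrix(runs)
```

Comparing labels only within a run makes the matrix insensitive to how each run happens to number its clusters, which is why the first and third runs above contribute identically.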