Advanced Single-Cell Data Analysis: Initial Dimensionality Reduction and Clustering | Dimensionality reduction | Clustering

Author: 纯情利宾立2502857907 | Source: Internet | 2023-10-12 08:52

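Some personal musings:

When it comes to clustering, k-means is the first thing that comes to mind, and there is also hierarchical clustering, but neither is used much in single-cell analysis. Why? As far as I recall, only one scoring model uses k-means, and only for a coarse clustering. (10x does PCA first and then clusters with k-means.)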

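There are plenty of single-cell tutorials around, and by now there are no fewer than 10 clustering methods designed specifically for single-cell data.

Dimensionality reduction usually goes hand in hand with clustering, so the two can be hard to tell apart.

Is PCA a method for dimensionality reduction, for clustering, or for visualization? And what about t-SNE?

With a little thought, PCA, t-SNE and the diffusionMap approach below are all dimensionality reduction methods. The difference is that PCA obtains its PCs through a purely linear transformation, whereas t-SNE and diffusionMap are both nonlinear.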

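Why reduce dimensions? First, because we have too many features: genes number in the tens of thousands, and methods like k-means only become practical after reduction. Second, only reduced data can be visualized! Three dimensions is the most we can plot; tens of thousands of dimensions cannot be visualized. In papers, though, we usually show at most the first two dimensions, because a 3D plot on a flat page works worse than a 2D one.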
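To make the "reduce first, then cluster" idea concrete, here is a minimal sketch in base R. The data are simulated and every name in it (`x`, `group`, `pcs`) is made up for illustration; it is not taken from the paper discussed below.

```r
# Toy data: 60 "cells" x 200 "genes", two groups that differ in the first 20 genes.
set.seed(1)
n.cells <- 60
n.genes <- 200
x <- matrix(rnorm(n.cells * n.genes), nrow = n.cells)
group <- rep(1:2, each = n.cells / 2)
x[group == 2, 1:20] <- x[group == 2, 1:20] + 3

# Linear dimensionality reduction: PCA on centered data.
pca <- prcomp(x, center = TRUE, scale. = FALSE)
pcs <- pca$x[, 1:2]  # keep the first two PCs

# Cluster in the reduced space.
km <- kmeans(pcs, centers = 2, nstart = 10)

# Cross-tabulate the k-means labels against the simulated groups.
tab <- table(km$cluster, group)
print(tab)
```

The two retained PCs are also exactly what you would plot, which is why reduction, clustering and visualization get mixed up so easily.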

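Clustering strategy:

Does clustering even need a strategy? Don't you just select features, pick a k, and get your clusters? Yes, in a routine analysis there really is nothing profound about it.

But we usually do not cluster for the sake of clustering. The result has to serve a biological question. If your clusters cannot be interpreted from any angle, why cluster at all? You can hardly write in a paper that you clustered the cells, found some markers, and left it at that!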

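What question?

What does question-driven clustering look like? The paper below is an example of clustering aimed at a specific problem. Prior knowledge: we know the sample contains some contaminating cells. How can clustering be used to identify them?

A specific problem like this cannot be solved by running the standard pipeline; you have to come up with an approach!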

Dimensionality reduction.

Throughout the manuscript we use diffusion maps, a non-linear dimensionality reduction technique (ref. 37). We calculate a cell-to-cell distance matrix using 1 - Pearson correlation and use the diffuse function of the diffusionMap R package with default parameters to obtain the first 50 DMCs. To determine the significant DMCs, we look at the reduction of eigenvalues associated with DMCs. We determine all dimensions with an eigenvalue of at least 4% relative to the sum of the first 50 eigenvalues as significant, and scale all dimensions to have mean 0 and standard deviation of 1.

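A somewhat ahead-of-its-time (unconventional) choice: they use diffusionMap for dimensionality reduction, computing a cell-to-cell distance matrix, obtaining 50 DMCs, identifying the significant ones, and scaling them.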
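The eigenvalue cutoff is easy to misread, so here is a small sketch of just that step. The eigenvalues are invented numbers and `coords` is random toy data standing in for the diffusion map coordinates. Following the paper's wording ("at least 4%"), the sketch uses `>=`, whereas the posted code further down uses a strict `>`.

```r
# Hypothetical eigenvalues of the first 10 components (made-up numbers, decreasing).
ev <- c(0.40, 0.25, 0.15, 0.08, 0.05, 0.03, 0.02, 0.01, 0.006, 0.004)

# Each eigenvalue relative to the sum of all computed eigenvalues.
ev.red <- ev / sum(ev)

# Keep every dimension up to the last one whose relative eigenvalue reaches the cutoff.
ev.red.th <- 0.04
evdim <- max(which(ev.red >= ev.red.th))

# Toy coordinates (cells x components); z-score each retained dimension.
set.seed(1)
coords <- matrix(rnorm(20 * length(ev)), nrow = 20)
res <- scale(coords[, 1:evdim])
cat('Using', evdim, 'dimensions\n')
```

With these numbers the fifth component (5% of the total) is the last one kept. Lowering the cutoff to 2%, as in the initial clustering, would let more components through.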

Initial clustering of all cells.

To identify contaminating cell populations and assess overall heterogeneity in the data, we clustered all single cells. We first combined all Drop-seq samples and normalized the data (21,566 cells, 10,791 protein-coding genes detected in at least 3 cells and mean UMI at least 0.005) using regularized negative binomial regression as outlined above (correcting for sequencing depth related factors and cell cycle). We identified 731 highly variable genes; that is, genes for which the z-scored standard deviation was at least 1. We used the variable genes to perform dimensionality reduction using diffusion maps as outlined above (with relative eigenvalue cutoff of 2%), which returned 10 significant dimensions.

For clustering we used a modularity optimization algorithm that finds community structure in the data with Jaccard similarities (neighbourhood size 9, Euclidean distance in diffusion map coordinates) as edge weights between cells (ref. 38). With the goal of over-clustering the data to identify rare populations, the small neighbourhood size resulted in 15 clusters, of which two were clearly separated from the rest and expressed marker genes expected from contaminating cells (Neurod6 from excitatory neurons, Igfbp7 from epithelial cells). These cells represent rare cellular contaminants in the original sample (2.6% and 1%), and were excluded from further analysis, leaving 20,788 cells.

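They identify highly variable genes, again to reduce noise. (Feature selection can actually be quite flexible. Good journals simply favor this kind of strategy, and if you pick genes purely by hand, reviewers will not be happy.)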
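A rough sketch of the "z-scored standard deviation of at least 1" rule, on simulated data. The matrix `expr.toy` and the gene names are hypothetical, and the sketch uses the plain per-gene standard deviation, whereas the posted code below works with the square root of the per-gene sum of squares of the normalized values.

```r
set.seed(1)
n.genes <- 100
n.cells <- 50

# Toy normalized expression (genes x cells); the first 10 genes get extra spread.
expr.toy <- matrix(rnorm(n.genes * n.cells), nrow = n.genes,
                   dimnames = list(paste0('gene', 1:n.genes), NULL))
expr.toy[1:10, ] <- expr.toy[1:10, ] * 3

# Per-gene standard deviation, z-scored across genes.
gene.sd <- apply(expr.toy, 1, sd)
z <- as.numeric(scale(gene.sd))

# Keep genes whose z-scored standard deviation is at least z.th.
z.th <- 1
var.genes <- rownames(expr.toy)[z >= z.th]
print(var.genes)
```

Plotting a histogram of `z` before fixing the threshold is a cheap sanity check on how many genes a given cutoff will keep.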

Jaccard similarities between cells' nearest-neighbour sets serve as the edge weights for the clustering.
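The Jaccard weighting can be sketched in base R as follows. This is only an illustration on random toy coordinates: the neighbour search uses a dense distance matrix instead of a dedicated kNN package, and all variable names are made up.

```r
set.seed(1)
# Toy coordinates: 8 cells in 2 dimensions (rows are cells).
mat <- matrix(rnorm(16), ncol = 2)
k <- 3
n <- nrow(mat)

# k nearest neighbours of each cell by Euclidean distance, excluding the cell itself.
d <- as.matrix(dist(mat))
diag(d) <- Inf
nn <- t(apply(d, 1, function(x) order(x)[1:k]))  # cells x k neighbour indices

# Binary neighbour matrix: nn.mat[i, j] = 1 if j is among the k neighbours of i.
nn.mat <- matrix(0, n, n)
nn.mat[cbind(rep(1:n, times = k), as.vector(nn))] <- 1

# Shared-neighbour counts s, then Jaccard similarity s / (2k - s):
# two sets of size k that share s members have a union of size 2k - s.
snn <- nn.mat %*% t(nn.mat)
jacc <- snn / (2 * k - snn)

# Cross-check one pair against the set definition.
direct <- length(intersect(nn[1, ], nn[2, ])) / length(union(nn[1, ], nn[2, ]))
```

A small `k` keeps neighbourhoods tight, so fewer pairs of cells share neighbours and the graph breaks into more communities. That is the lever the authors pull to over-cluster.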

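To identify rare populations, you have to over-cluster! With a small neighbourhood size of 9 they ended up with as many as 15 clusters, and then used marker genes to pick out the contaminating cells.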
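The marker-based screening boils down to a per-cluster detection rate. Below is a toy sketch with simulated counts; the marker names `markerA` and `markerB` and the cluster labels are invented, and the 1/3 threshold is the one used in the posted code below.

```r
set.seed(1)
# Made-up cluster labels for 90 cells; cluster 3 plays the rare contaminant.
clustering <- rep(1:3, times = c(40, 40, 10))

# Toy count matrix (genes x cells): both markers are almost never detected ...
cm.toy <- matrix(rpois(2 * 90, lambda = 0.05), nrow = 2,
                 dimnames = list(c('markerA', 'markerB'), NULL))
# ... except markerA, which is present in every cell of cluster 3.
cm.toy['markerA', clustering == 3] <- rpois(10, lambda = 5) + 1

# Fraction of cells per cluster in which each marker is detected (count > 0).
# Result: one row per cluster, one column per marker.
det.rates <- sapply(rownames(cm.toy),
                    function(g) tapply(cm.toy[g, ] > 0, clustering, mean))

# Flag clusters in which any marker is detected in more than a third of the cells.
contam.clusters <- as.integer(rownames(det.rates)[apply(det.rates > 1/3, 1, any)])
use.cells <- !(clustering %in% contam.clusters)
cat(sum(!use.cells), 'of', length(use.cells), 'cells flagged as contaminants\n')
```

Because the contaminant is only about 11% of the toy cells, it would be easy to lose inside a larger cluster. Over-clustering gives it a cluster of its own, where the detection rate stands out.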

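I really learned a lot from this. Code written by hand is a different thing altogether, and far more instructive than just running packaged software.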

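# Packages used below: diffusionMap (diffuse), FNN (get.knn), Matrix (sparseMatrix), igraph
library(diffusionMap)
library(FNN)
library(Matrix)
library(igraph)

# Cluster cells and remove contaminating populations.
# cm (apparently the UMI count matrix), expr (normalized expression) and md (per-cell metadata)
# are assumed to exist already; the helper functions defined further down must be loaded first.
cat('Doing initial clustering\n')
cl <- cluster.the.data.simple(cm, expr, 9, seed=42)
md$init.cluster <- cl$clustering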
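# detection rate per cluster for some marker genes
goi <- c('Igfbp7', 'Col4a1', 'Neurod2', 'Neurod6')
# det.rates: one row per cluster, one column per marker gene
det.rates <- apply(cm[goi, ] > 0, 1, function(x) aggregate(x, by=list(cl$clustering), FUN=mean)$x)
# a cluster is flagged as contaminating if any marker is detected in more than 1/3 of its cells
contam.clusters <- sort(unique(cl$clustering))[apply(det.rates > 1/3, 1, any)]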
use.cells <- !(cl$clustering %in% contam.clusters)
cat('Of the', ncol(cm), 'cells', sum(!use.cells), 'are determined to be part of a contaminating cell population.\n')
cm <- cm[, use.cells]
expr <- expr[, use.cells]
md <- md[use.cells, ]

# Dimensionality reduction via diffusion maps (used before clustering).
# ev.red.th: relative eigenvalue cutoff (0.02, i.e. 2%, in the initial clustering)
dim.red <- function(expr, max.dim, ev.red.th, plot.title=NA, do.scale.result=FALSE) {
  cat('Dimensionality reduction via diffusion maps using', nrow(expr), 'genes and', ncol(expr), 'cells\n')
  if (sum(is.na(expr)) > 0) {
    dmat <- 1 - cor(expr, use = 'pairwise.complete.obs')
  } else {
    # distance 0 <=> correlation 1
    dmat <- 1 - cor(expr)
  }
  
  # cap the number of dimensions at half the number of cells (floor keeps it a whole number)
  max.dim <- min(max.dim, floor(nrow(dmat)/2))
  dmap <- diffuse(dmat, neigen=max.dim, maxdim=max.dim)
  ev <- dmap$eigenvals
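  # relative eigenvalue: each eigenvalue as a fraction of the sum of all computed eigenvalues
  # (similar in spirit to the proportion of variance explained in PCA)
  ev.red <- ev/sum(ev)
  # index of the last dimension whose relative eigenvalue exceeds the cutoff (NA if none does)
  evdim <- rev(which(ev.red > ev.red.th))[1]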
  
  if (is.character(plot.title)) {
    # Eigenvalue plot: we observe a substantial eigenvalue drop-off after the initial components,
    # demonstrating that the majority of the variance is captured in the first few dimensions.
    plot(ev, ylim=c(0, max(ev)), main = plot.title)
    abline(v=evdim + 0.5, col='blue')
  }
  
  evdim <- max(2, evdim, na.rm=TRUE)
  cat('Using', evdim, 'significant DM coordinates\n')
  
  colnames(dmap$X) <- paste0('DMC', 1:ncol(dmap$X))
  res <- dmap$X[, 1:evdim]
  if (do.scale.result) {
    res <- scale(dmap$X[, 1:evdim])
  } 
  return(res)
}

# jaccard similarity
# rows in 'mat' are cells
jacc.sim <- function(mat, k) {
  # generate a sparse nearest neighbor matrix
  nn.indices <- get.knn(mat, k)$nn.index
  j <- as.numeric(t(nn.indices))
  i <- ((1:length(j))-1) %/% k + 1
  nn.mat <- sparseMatrix(i=i, j=j, x=1)
  rm(nn.indices, i, j)
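  # turn nn matrix into SNN matrix (shared neighbor counts) and then into Jaccard similarity:
  # two neighbor sets of size k sharing s members have intersection / union = s / (2k - s)
  snn <- nn.mat %*% t(nn.mat)
  snn.summary <- summary(snn)
  # dims is passed explicitly so the result always keeps the cells x cells shape
  snn <- sparseMatrix(i=snn.summary$i, j=snn.summary$j, x=snn.summary$x/(2*k-snn.summary$x), dims=dim(snn))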
  rm(snn.summary)
  return(snn)
}


cluster.the.data.simple <- function(cm, expr, k, sel.g=NA, min.mean=0.001, 
                                    min.cells=3, z.th=1, ev.red.th=0.02, seed=NULL, 
                                    max.dim=50) {
  if (all(is.na(sel.g))) {
    # no genes specified, use most variable genes
    # keep genes detected in at least min.cells cells and with a mean in cm of at least min.mean
    # (cm is only used for this filtering)
    goi <- rownames(expr)[apply(cm[rownames(expr), ]>0, 1, sum) >= min.cells & apply(cm[rownames(expr), ], 1, mean) >= min.mean]
    # per-gene sum of squared values in expr
    sspr <- apply(expr[goi, ]^2, 1, sum)
    # z-score the square root of that sum across genes and keep only genes above z.th
    # (worth checking the distribution with a histogram)
    sel.g <- goi[scale(sqrt(sspr)) > z.th]
  }
  cat(sprintf('Selected %d variable genes\n', length(sel.g)))
  sel.g <- intersect(sel.g, rownames(expr))
  cat(sprintf('%d of these are in expression matrix.\n', length(sel.g)))
  
  if (is.numeric(seed)) {
    set.seed(seed)
  }
  
  dm <- dim.red(expr[sel.g, ], max.dim, ev.red.th, do.scale.result = TRUE)
  
  sim.mat <- jacc.sim(dm, k)
  
  gr <- graph_from_adjacency_matrix(sim.mat, mode='undirected', weighted=TRUE, diag=FALSE)
  cl <- as.numeric(membership(cluster_louvain(gr)))
  
  results <- list()
  results$dm <- dm
  results$clustering <- cl
  results$sel.g <- sel.g
  results$sim.mat <- sim.mat
  results$gr <- gr
  cat('Clustering table\n')
  print(table(results$clustering))
  return(results)
}
