library(data.table)
library(ggplot2)
library(ggrepel)
options(Seurat.object.assay.version = "v3")
#### Make a combined Seurat Object
<- readRDS("sem_v20.tiss_a.rds") ## tonsil dataset A with CosMx v2.0
sem_v20 <- readRDS("sem_v15.tiss_b.rds") ## tonsil dataset B with CosMx v1.5 sem_v15
1 Abstract
This post discusses batch effects in CosMx; how they impact analyses, how to detect them, and how to correct them. We recommend Harmony (Korsunsky et al. 2019) as a useful method for correcting batch effects in dimension reduction embeddings such as PCA, and demonstrate an example of how to use the software package.
For those in a hurry to ‘just see the code’, feel free to jump ahead to the Harmony Analysis section of this post.
2 Introduction
When analyzing multiple samples or ‘batches’ of data collected under different technologies or sample conditions, we may need to harmonize the data in a way that preserves biological variation while minimizing technical variation.
For example, suppose I were interested in analyzing multiple breast cancer tissue samples, but these samples were processed in different labs, by different scientists, or with different sample prep protocols. These technical differences could subtly effect gene detection efficiency in ways that may influence or confound an analysis. In particular, PCA, which responds quickly to small systematic changes impacting large numbers of variables, is highly sensitive to batch effects. This sensitivity then cascades down to analyses depending on PCA, e.g. UMAP and the Leiden and Louvain algorithms.
To check for presence of batch effects and whether correction is required, we recommend plotting the ‘batch’ on a UMAP plot or in principal component (PC) space (link to example figure). In the presence of batch effects, we may observe that cells cluster together by batch more distinctly than they do by their biological cell type.
For dimension reduction and clustering analyses, there is a multitude of existing research on batch correction methodology (Luecken and Theis 2019). The Harmony method (Korsunsky et al. 2019) has been recommended as a best-in-class approach in several studies which benchmarked different batch correction methodology for scRNAseq (Tran et al. 2020) (Emmanúel Antonsson and Melsted 2024) . Harmony takes as input a low-dimensional cell embedding such as PCA, and projects cells onto a new unsupervised joint embedding that groups cells by biological cell type rather than dataset-specific conditions. This Harmony embedding can be used as an input to downstream dimension reduction and clustering algorithms like UMAP(Ghojogh et al. 2021) and Leiden(Traag, Waltman, and Van Eck 2019).
For some other downstream analyses, batch effects can be explicitly handled or may be less of a concern. In the case of differential expression (DE) analysis, we could account for the technical effects by specifying ‘batch’ as a covariate in our DE models. Supervised or semi-supervised cell typing analyses that are independent of PCA dimension reductions (i.e., InSituType (Danaher et al. 2022)), may be less affected by batch effects. If we were to observe batch effects in this setting, one useful approach would be to perform the cell typing separately by batch before aggregating the cell type annotations.
In this post, we demonstrate using Harmony for batch effect correction on two 6k-plex tonsil tissue datasets; one dataset generated with CosMx 1.5, while the other generated with CosMx 2.0. The recent CosMx 2.0 release offers noticeable increases in sensitivity; often obtaining 1.5-2x the transcripts compared to the same sample processed with previous CosMx 1.5 software. In case we want to analyze a v1.5 and v2.0 dataset in the same analysis, we might consider treating this variable as a batch effect.
3 Initial Look at the data
3.1 Plot the two batches
Let’s take a quick peek at these two tissues, where unsupervised clustering has been performed independently on each:
Code
<- function (cluster_column, x_column = "x_slide_mm", y_column = "y_slide_mm",
xyplot cls = NULL, clusters = NULL, metadata, ptsize = 0.25, plotfirst = NULL,
alphasize = 1){
<- data.table::copy(data.table::data.table(metadata))
pd if (is.null(clusters))
<- unique(pd[[cluster_column]])
clusters if (!is.null(plotfirst)) {
<- intersect(clusters, plotfirst)
plotfirst <- setdiff(clusters, plotfirst)
notplotfirst <- ggplot(pd[pd[[cluster_column]] %in% plotfirst],
p aes(.data[[x_column]], .data[[y_column]], color = .data[[cluster_column]])) +
geom_point(size = ptsize)
<- p + geom_point(data = pd[pd[[cluster_column]] %in%
p aes(.data[[x_column]], .data[[y_column]],
notplotfirst], color = .data[[cluster_column]]), size = ptsize,
alpha = alphasize) + theme_bw()
}else {
<- ggplot(pd[pd[[cluster_column]] %in% clusters], aes(.data[[x_column]],
p color = .data[[cluster_column]])) +
.data[[y_column]], geom_point(size = ptsize, alpha = alphasize) + theme_bw()
}if (is.null(cls)) {
<- rep(unname(pals::alphabet()), 100)
cls if (any(is.na(suppressWarnings(as.numeric(as.character(clusters)))))) {
<- sort(clusters)
clnames
}else {
<- sort(as.numeric(as.character(clusters)))
clnames
}<- cls[1:length(clusters)]
cls names(cls) <- clnames
<- p + scale_color_manual(values = cls, guide = guide_legend(override.aes = list(size = 4,
p alpha = 1)))
}else {
<- p + scale_color_manual(values = cls, guide = guide_legend(override.aes = list(size = 4,
p alpha = 1)))
}return(p)
}
<-
xyp ::plot_grid(
cowplotxyplot("seurat_clusters",metadata = sem_v20@meta.data,ptsize = 0.02) + coord_fixed() + labs(title = " v2.0")
xyplot("seurat_clusters",metadata = sem_v15@meta.data,ptsize = 0.02) + coord_fixed() + labs(title = "v1.5")
,nrow=1
,+ theme(plot.background = element_rect(fill = 'white',color='white')) +
) ::panel_border(remove=TRUE) cowplot
3.2 Differences in sensitivity between batches
Here, we can compute a few high-level summary stats like transcripts per cell and signal to noise ratio (SNR) for each batch.
We see ~2x increase in transcripts per cell with the v2.0 dataset, and higher signal to noise ratio (snr).
## Comparing counts/cell
.5 <- mean(Matrix::colSums(sem_v15[["RNA"]]$counts))
cts_cell_1.0 <- mean(Matrix::colSums(sem_v20[["RNA"]]$counts))
cts_cell_2
## Comparing SNR
.5 <- mean(Matrix::colMeans(sem_v15[["RNA"]]$counts)) / mean(Matrix::colMeans(sem_v15[["negprobes"]]$counts))
snr_1.0 <- mean(Matrix::colMeans(sem_v20[["RNA"]]$counts)) / mean(Matrix::colMeans(sem_v20[["negprobes"]]$counts))
snr_2
<-
summary_stats data.table(transcripts_per_cell = c(cts_cell_1.5, cts_cell_2.0)
snr = c(snr_1.5, snr_2.0)
,version = c("v1.5", "v2.0"))[] ,
print(summary_stats[])
transcripts_per_cell snr version
<num> <num> <char>
1: 448.3808 3.770205 v1.5
2: 1026.5255 6.043639 v2.0
4 Combining the data
Here we’ll combine the two datasets and prepare to harmonize them.
### Get overlapping genes (all the genes in this case) to combine the two datasets
<- intersect(rownames(sem_v20), rownames(sem_v15))
overlapping_genes <- data.table::rbindlist(list(
combined_metadata @meta.data
sem_v20@meta.data
,sem_v15
)use.names=TRUE,fill = TRUE)
,%in% colnames(sem_v15),batchid:="v1.5"]
combined_metadata[cell_ID %in% colnames(sem_v20),batchid:="v2.0"]
combined_metadata[cell_ID
### Create a combined seurat object
<-
sem_comb ::CreateSeuratObject(counts = cbind(sem_v20[["RNA"]]@counts[overlapping_genes,]
Seurat"RNA"]]@counts[overlapping_genes,])
,sem_v15[[meta.data = data.frame(combined_metadata
,row.names = combined_metadata$cell_ID)
,
)
### offset the x-coordinate of v1.5 to
### more easily plot both tissues side-by-side
<- max(sem_comb@meta.data$x_slide_mm[sem_comb@meta.data$batchid=="v2.0"]) + 1
x_offset @meta.data$x_slide_mm[sem_comb@meta.data$batchid=="v1.5"] <- sem_comb@meta.data$x_slide_mm[sem_comb@meta.data$batchid=="v1.5"] + x_offset sem_comb
5 Harmony Analysis
With our combined dataset containing all batches, we’ll perform a few standard tertiary analysis steps using the Seurat
package. After principal component analysis, we will pass the PC embeddings to Harmony for batch correction.
### Normalize, Scale, and Run PCA on the combined dataset
#### (standard data analysis steps so far)
<- Seurat::FindVariableFeatures(sem_comb)
sem_comb <- Seurat::LogNormalize(sem_comb[["RNA"]]@counts
normed scale.factor = mean(Matrix::colSums(sem_comb[["RNA"]]@counts))
,
)<- Seurat::SetAssayData(sem_comb, layer="data", new.data = normed)
sem_comb <- Seurat::ScaleData(sem_comb)
sem_comb <- Seurat::RunPCA(sem_comb) sem_comb
Now we run harmony with our PCA embeddings, specifying "batchid"
as our variable to adjust for (our indicator for v1.5 and v2.0).
<- harmony::RunHarmony(sem_comb[["pca"]]@cell.embeddings
harmony_embeddings meta_data = sem_comb@meta.data
,vars_use = "batchid"
, )
The harmony_embeddings
returned are a cells
\(\times\) # of pc's
dimension matrix, with the same structure as the PCA embeddings.
str(harmony_embeddings)
1:10,1:5] harmony_embeddings[
num [1:405690, 1:50] -0.6465 2.4765 -0.8616 -0.0705 1.3351 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:405690] "c_1_204_1008" "c_1_204_1026" "c_1_204_1032" "c_1_204_1037" ...
..$ : chr [1:50] "PC_1" "PC_2" "PC_3" "PC_4" ...
PC_1 PC_2 PC_3 PC_4 PC_5
c_1_204_1008 -0.64646187 -0.2827039 -2.8347802 1.6610461 2.2474144
c_1_204_1026 2.47653185 0.5783416 -0.4183124 0.9130872 -1.6640251
c_1_204_1032 -0.86158008 -3.1340938 -7.1958762 -3.2585168 1.8184731
c_1_204_1037 -0.07046239 0.8755237 -1.8787555 0.8000141 2.5718387
c_1_204_1046 1.33506948 1.1483278 -1.8171200 1.0649917 3.6876637
c_1_204_1051 2.10865793 1.8783302 -0.9869917 1.3061003 5.2610306
c_1_204_1068 -0.12255521 -1.0006297 -3.8287685 0.2517314 2.7422425
c_1_204_1099 -0.09401621 -0.3220125 0.3505044 -0.5086164 4.4133441
c_1_204_1100 1.21582681 1.8189899 0.1252721 -0.4560591 4.1331302
c_1_204_1110 2.92880860 1.4376377 -1.6135393 2.6411662 -0.4019986
Here are what the original PCA embeddings look like for comparison:
str(sem_comb[["pca"]]@cell.embeddings)
"pca"]]@cell.embeddings[1:10,1:5] sem_comb[[
num [1:405690, 1:50] -1.117 1.994 -1.456 -0.476 0.869 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:405690] "c_1_204_1008" "c_1_204_1026" "c_1_204_1032" "c_1_204_1037" ...
..$ : chr [1:50] "PC_1" "PC_2" "PC_3" "PC_4" ...
PC_1 PC_2 PC_3 PC_4 PC_5
c_1_204_1008 -1.1173966 -0.6867905 -2.8949895 1.6105015 2.9180760
c_1_204_1026 1.9937914 0.2274147 -0.3855459 0.8213885 -1.0934278
c_1_204_1032 -1.4558828 -3.6307548 -7.3110656 -3.5378910 2.5581377
c_1_204_1037 -0.4759800 0.5822518 -1.8031810 0.6380289 3.0891367
c_1_204_1046 0.8686478 0.7673256 -1.8542354 0.9790450 4.3717053
c_1_204_1051 1.7701286 1.6232100 -0.9709824 1.1699966 6.0201625
c_1_204_1068 -0.6146554 -1.4230712 -3.9015371 0.1792131 3.4066027
c_1_204_1099 -0.5735693 -0.6600411 0.3747940 -0.6467651 4.9810832
c_1_204_1100 0.8734515 1.5610131 0.1432252 -0.5936082 4.8778765
c_1_204_1110 2.4355067 1.0686404 -1.6010894 2.5582519 0.1892688
6 UMAP and clustering with harmony
Let’s continue the analysis. We’ll use the Harmony embeddings rather than PCA embeddings to generate a UMAP and perform unsupervised clustering on the combined dataset.
Code
## (minor side-note), this helper function `make_umap` calls the same package
## and function as Seurat::RunUMAP, but also returns the
## 'nearest-neighbor graph' used in the UMAP algorithm.
## Using the same nearest neighbor graph for clustering and UMAP,
## can give better concordance between UMAP and unsupervised clusters.
<- function(pcaobj, min_dist=0.01, n_neighbors=30, metric="cosine",key ="UMAP_" ){
make_umap <-
ump ::umap(pcaobj$reduction.data@cell.embeddings
uwotn_neighbors = n_neighbors
,nn_method = "annoy"
,metric = metric
,min_dist = min_dist
,ret_extra = c("fgraph","nn")
,verbose = TRUE)
,
<- ump$fgraph
umpgraph dimnames(umpgraph) <- list(rownames(ump$nn[[1]]$idx), rownames(ump$nn[[1]]$idx))
colnames(ump$embedding) <- paste0(key, c(1,2))
<- Seurat::CreateDimReducObject(embeddings = ump$embedding, key = key)
ump return(list(grph = umpgraph
ump = ump))
, }
##### Run helper function to make a umap and return the
##### nearest-neighbors graph used for unsupervised clustering.
<- make_umap(pcaobj = list(
umapharmony reduction.data = Seurat::CreateDimReducObject(embeddings = harmony_embeddings)
))
"umapharmony"]] <- umapharmony$ump
sem_comb[["umapharmonygrph"]] <- Seurat::as.Graph(umapharmony$grph)
sem_comb[[<- Seurat::FindClusters(sem_comb, graph.name = "umapharmonygrph")
sem_comb @meta.data$clusters_unsup_harmony <- sem_comb@meta.data$seurat_clusters sem_comb
Here’s another look at the tissues, with clustering results generated downstream of the Harmony embeddings.
Code
<-
xyp_harmony xyplot("clusters_unsup_harmony", metadata = data.table(sem_comb@meta.data),ptsize = 0.02) +
coord_fixed() + labs(title = "After integration: v2.0 (left) and v1.5 (right)")
print(xyp_harmony)
7 UMAP and clustering without harmony
Here we can re-make the UMAP and cluster without batch correction for sake of comparison.
##### Run helper function to make a umap and return the nearest-neighbors graph.
<- sem_comb[["pca"]]
pcaobj class(pcaobj)
<- make_umap(list(reduction.data = pcaobj)) umapobj
"umap"]] <- umapobj$ump
sem_comb[["umapgrph"]] <- Seurat::as.Graph(umapobj[["grph"]])
sem_comb[[
### Unsupervised clustering
<- Seurat::FindClusters(sem_comb, graph.name = "umapgrph")
sem_comb @meta.data$clusters_unsup_unintegrated <- sem_comb@meta.data$seurat_clusters sem_comb
8 Comparison with and without integration
Plotting the first two PC’s, colored by batchid
, it looks like the v1.5 data may span a more similar domain of PC1 and PC2 space after Harmony, indicating better integration.
Code
set.seed(123)
<- sample(1:nrow(harmony_embeddings)) ## shuffle cell order for plotting
shuffle_order <- c("#1f77b4", "#ff7f0e")
cls names(cls) <- c("v2.0", "v1.5")
all.equal(rownames(harmony_embeddings), sem_comb@meta.data$cell_ID)
<-
pc1vs2_harmony cbind(data.table(sem_comb@meta.data)[,.(cell_ID, batchid, x_slide_mm, y_slide_mm)]
data.table(harmony_embeddings))
,
<-
harmony_plt ggplot(pc1vs2_harmony[shuffle_order], aes(x = PC_1, y = PC_2, color=batchid)) + geom_point(size=0.2, alpha=0.5, shape=16) +
scale_color_manual(values = cls) +
theme_bw() +
coord_fixed() +
guides(color = guide_legend(override.aes=list(size=4, alpha=1)))
<-
pc1vs2_regular cbind(data.table(sem_comb@meta.data)[,.(cell_ID, batchid, x_slide_mm, y_slide_mm)]
data.table(sem_comb@reductions$pca@cell.embeddings))
,
<-
regular_pca_plt ggplot(pc1vs2_regular[shuffle_order], aes(x = PC_1, y = PC_2, color=batchid)) + geom_point(size=0.2, alpha=0.5, shape=16) +
scale_color_manual(values = cls) +
theme_bw() +
coord_fixed() +
guides(color = guide_legend(override.aes=list(size=4, alpha=1)))
<-
pc_comparison_plt ::plot_grid(harmony_plt + labs(title = "integrated using harmony")
cowplot+ labs(title = "un-integrated")) +
,regular_pca_plt theme(plot.background = element_rect(fill = 'white',color='white')) +
::panel_border(remove=TRUE)
cowplot
print(pc_comparison_plt)
Plotting the UMAP, colored by batchid
, there doesn’t appear to be an enourmous difference (which is not a bad thing), but it does look like there may be areas of the UMAP better covered by both datasets (less batch-specific) after Harmony integration.
Code
<-
umap_harmony cbind(data.table(sem_comb@meta.data)[,.(cell_ID, batchid, x_slide_mm, y_slide_mm)]
data.table(sem_comb@reductions$umapharmony@cell.embeddings))
,
<-
harmony_plt ggplot(umap_harmony[shuffle_order], aes(x = UMAP_1, y = UMAP_2, color=batchid)) + geom_point(size=0.2, alpha=0.5, shape=16) +
scale_color_manual(values = cls) +
theme_bw() +
coord_fixed() +
guides(color = guide_legend(override.aes=list(size=4, alpha=1)))
<-
umap_regular cbind(data.table(sem_comb@meta.data)[,.(cell_ID, batchid, x_slide_mm, y_slide_mm)]
data.table(sem_comb@reductions$umap@cell.embeddings))
,
<-
regular_umap_plt ggplot(umap_regular[shuffle_order], aes(x = umap_1, y = umap_2, color=batchid)) + geom_point(size=0.2, alpha=0.5, shape=16) +
scale_color_manual(values = cls) +
theme_bw() +
coord_fixed() +
guides(color = guide_legend(override.aes=list(size=4, alpha=1)))
<-
umap_comparison_plt ::plot_grid(harmony_plt + labs(title = "integrated using harmony")
cowplot+ labs(title = "un-integrated")) +
,regular_umap_plt theme(plot.background = element_rect(fill = 'white',color='white')) +
::panel_border(remove=TRUE)
cowplot
print(umap_comparison_plt)
Code
<-
umap_harmony_clusters cbind(data.table(sem_comb@meta.data)[,.(cell_ID, clusters_unsup_harmony, x_slide_mm, y_slide_mm)]
data.table(sem_comb@reductions$umapharmony@cell.embeddings))
,
<-
harmony_plt_clusters ggplot(umap_harmony_clusters, aes(x = UMAP_1, y = UMAP_2, color=clusters_unsup_harmony)) + geom_point(size=0.2, alpha=0.5, shape=16) +
scale_color_manual(values=unname(pals::alphabet())) +
theme_bw() +
coord_fixed() +
guides(color = guide_legend(override.aes=list(size=4, alpha=1)))
<-
umap_regular_clusters cbind(data.table(sem_comb@meta.data)[,.(cell_ID, clusters_unsup_unintegrated, x_slide_mm, y_slide_mm)]
data.table(sem_comb@reductions$umap@cell.embeddings))
,
<-
regular_umap_plt_clusters ggplot(umap_regular_clusters, aes(x = umap_1, y = umap_2, color=clusters_unsup_unintegrated)) + geom_point(size=0.2, alpha=0.5, shape=16) +
scale_color_manual(values=unname(pals::alphabet())) +
theme_bw() +
coord_fixed() +
guides(color = guide_legend(override.aes=list(size=4, alpha=1)))
<-
umap_comparison_plt_clusters ::plot_grid(harmony_plt_clusters + labs(title = "integrated using harmony")
cowplot+ labs(title = "un-integrated")) +
,regular_umap_plt_clusters theme(plot.background = element_rect(fill = 'white',color='white')) +
::panel_border(remove=TRUE)
cowplot
print(umap_comparison_plt_clusters)
9 Conclusion
In this post we demonstrated applying the batch effect correction technique ‘Harmony’ to two 6K-plex tonsil tissue datasets generated with very different sensitivity performance. In general, the impact of batch effects may depend on the severity of the technical artifact, as well as the scope and depth of the analysis. In the event that one wishes to analyze multiple samples together which were generated under different technical conditions, we recommend plotting the technical batch variable in UMAP or PC space as a qualitative diagnostic, and inspecting whether cells are clustered by ‘batch’. For this example, the batch effect between our two datasets was visually noticeable when plotted on a UMAP and the first two principal components. Cells were visually less segregated by batch after the correction was applied.