Cell typing is a crucial stage of analysis requiring careful attention. To ensure the integrity of your analyses, you must spend time reviewing, and possibly refining, your cell typing results.
Here we’ll demonstrate a basic cell typing workflow using the Insitutype algorithm. We’ll show the core QC plots and walk through examples of how you can refine your cell typing results.
For more details on cell typing, see these resources:
Insitutype offers 3 approaches to cell typing:
To obtain reference profiles, you can use pre-existing single cell or CosMx data, or you can pull from resources NanoString has compiled:
The CosMx-derived profiles should be preferred when available, as they escape the strong platform effects found between CosMx and scRNA-seq. For now, the library of scRNA-seq profiles covers more tissue types.
Our example dataset is from melanoma samples. We don’t have a reference profile for melanoma (and given the diversity of cancers, it’s unclear that a reference from a different melanoma would work well), but we do have reference profiles for immune and stroma cells, which have much less between-sample variability than cancer cells. So we’ll run in semi-supervised mode, looking for reference immune/stroma cells and using our de novo clusters to capture cancer cells.
We start by loading the data we’ll need, which was output by the earlier scripts from the workflow vignette:
library(pheatmap)
#> Warning: package 'pheatmap' was built under R version 4.4.0
library(InSituType)
# load CosMx data:
# note these Rmd files are for convenience during analysis, and are not a NanoString-supported format
mydir <- "../"
metadata <- readRDS(paste0(mydir, "/processed_data/metadata.RDS"))
counts <- readRDS(paste0(mydir, "/processed_data/counts.RDS"))
negcounts <- readRDS(paste0(mydir, "/processed_data/negcounts.RDS"))
um <- readRDS(paste0(mydir, "/processed_data/um.RDS"))
xy <- readRDS(paste0(mydir, "/processed_data/xy.RDS"))
Next we’ll download a reference profile matrix from a previous CosMx experiment:
refprofiles <- read.csv("https://raw.githubusercontent.com/Nanostring-Biostats/CosMx-Cell-Profiles/main/Human/IO/IO.profiles.csv", row.names = 1, header = TRUE)
refprofiles <- refprofiles[is.element(rownames(refprofiles), colnames(counts)), ]
pheatmap::pheatmap(sweep(refprofiles, 1, pmax(apply(refprofiles, 1, max), 0.2), "/"),
col = colorRampPalette(c("white", "darkblue"))(100))
Next we’ll use Insitutype to assign our cells to these previously-published cell types, while looking for 6 new clusters (it’s usually advisable to overcluster at first):
insitutypefileloc <- paste0(mydir, "/processed_data/insitutype_initial_results.RDS")
if (!file.exists(insitutypefileloc)) {
set.seed(0)
res <- InSituType::insitutype(x = counts,
neg = Matrix::rowMeans(negcounts),
reference_profiles = refprofiles,
update_reference_profiles = FALSE,
n_clust = 6)
saveRDS(res, file = insitutypefileloc)
} else {
res <- readRDS(insitutypefileloc)
}
First we choose cell type colors. Doing this by hand is usually desirable later on. But for now, before we’ve finalized our cell typing results, we can just use an automated approach.
cols <- InSituType::colorCellTypes(freqs = table(res$clust), palette = "brewers")
str(cols)
#> Named chr [1:21] "#8DD3C7" "#BEBADA" "#FB8072" "#80B1D3" "#FDB462" ...
#> - attr(*, "names")= chr [1:21] "f" "b" "a" "c" ...
cols[c("Endothelial", "Fibroblast", "B.cell")]
#> Endothelial Fibroblast B.cell
#> "#66C2A5" "#B3B3B3" "#FFFF33"
Now we enter the crucial QC phase. Below we’ll print out a variety of QC plots and discuss their interpretation. We start with UMAP and “flightpath” plots. Flightpath plots are specific to methods like Insitutype which score every cell’s probability of belonging to every cluster. They show the tendency of different cell types to be confused with each other.
# flightpath
flightpath <- InSituType::flightpath_layout(logliks = res$logliks, profiles = res$profiles)
par(mar = c(0,0,0,0))
plot(flightpath$cellpos, pch = 16, cex = 0.2, col = cols[res$clust])
text(flightpath$clustpos[, 1], flightpath$clustpos[, 2], rownames(flightpath$clustpos), cex = 0.7)
# umap
plot(um, pch = 16, cex = 0.1, col = cols[res$clust])
for (cell in unique(res$clust)) {
text(median(um[res$clust == cell, 1]), median(um[res$clust == cell, 2]), cell, cex = 0.7)
}
# xy space:
plot(xy, pch = 16, cex = 0.1, col = cols[res$clust], asp = 1)
# (to actually get a good look, print to a png to examine the full xy space in high res)
if (FALSE) {
png(paste0(mydir, "/results/cell_type_map.png"),
width = diff(range(xy[,1]))*.7, height = diff(range(xy[,2]))*.7, units = "in",
res = 500) # res of 400 is pretty good; 600 is publication-quality
par(mar = c(0,0,0,0))
plot(xy, pch = 16, col = cols[res$clust], cex = 0.1,
xlab = "", ylab = "", xaxt = "n", yaxt = "n", asp = 1)
legend("topright", pch = 16, col = cols, legend = names(cols), cex = 0.5)
dev.off()
}
Higher rates of confusion are seen between closely-related cell types, e.g. among T-cell subtypes or the (presumably tumor) clusters a, b, c, and e. But for the most part this is a good result, with no indication that any clusters are irredeemably confused.
Next we draw a heatmap of the cluster profiles. These are helpful for confirming that clusters have the expected marker genes, or for understanding what cell types unknown clusters are. It’s often useful to print out a very long pdf of this heatmap in which every gene name is visible.
# profiles heatmap
#pdf(paste0(mydir, "/results/celltypeprofiles.pdf"), height = 20)
pheatmap::pheatmap(sweep(res$profiles, 1, pmax(apply(res$profiles, 1, max), 0.2), "/")[apply(res$profiles, 1, max) > 0.2, ],
fontsize_row = 4,
col = colorRampPalette(c("white", "darkblue"))(100))
#dev.off()
# (the use of pmax(..., 0.2) prevents genes with very low signal in all cell types from appearing exciting)
And finally, we’ll want to see every cell type’s spatial spread. You generally need a pretty big plot to actually resolve details, so you’ll want to print the plot to a .png. (Don’t use vector graphics like pdf or svg - they save the location of every cell and so turn into huge files.) Code like the below is good for this:
for (cell in unique(res$clust)) {
png(paste0(mydir, "/results/celltypev1_", cell, ".png"),
width = diff(range(xy[,1]))*.7, height = diff(range(xy[,2]))*.7, units = "in",
res = 400) # res of 400 is pretty good; 600 is publication-quality
par(mar = c(0,0,0,0))
plot(xy, pch = 16, col = scales::alpha(cols[res$clust], 0.3), cex = 0.1,
xlab = "", ylab = "", xaxt = "n", yaxt = "n")
points(xy[res$clust == cell, ], pch = 16, cex = 0.1, col = "black")
legend("top", legend = cell)
dev.off()
}
Cell typing deserves careful attention: it’s a complex process, and your entire analysis depends on its results. Scrutinize the above QC plots for suspicious patterns. These could be:
For a deep dive on this topic, see the Insitutype FAQs.
Below is an example of how to refine cell typing results. We use Insitutype’s built-in function “refineClusters”.
Examining the QC plots above, we see that we have indeed overclustered a bit, resulting in novel clusters fitting the data better than our reference profiles and claiming cells that we’d like to call something else. Here, cluster “a” is stealing from stromal populations, and cluster “c” is stealing from diverse cell populations.
The solution is simple: we simply delete these clusters and force their cells to go the the next-best cell type. We’ll use the refineClusters function to perform this operation. Using the same function, we’ll also merge the T-cell clusters for simplicity, and, just for fun, we’ll subcluster tumor cluster “d”.
refinedfileloc <- paste0(mydir, "/processed_data/insitutype_refined_results.RDS")
if (!file.exists(refinedfileloc)) {
res2 <- refineClusters(
logliks = res$logliks,
counts = counts,
neg = Matrix::rowMeans(negcounts),
merges = c("T.cell.CD4" = "T-cell",
"T.cell.CD8" = "T-cell"), # merge Treg, CD4 and CD8 cells to a broader "T-cell" category
to_delete = c("a", "c"), # delete this cluster and send its cells to their next-best fit
subcluster = list("d" = 2)) # subcluster cluster a into 2 new clusters
saveRDS(res2, file = refinedfileloc)
} else {
res2 <- readRDS(refinedfileloc)
}
str(res2) # same structure as the Insitutype output, but with new cell types
#> List of 6
#> $ clust : Named chr [1:111804] "Monocyte" "Neutrophil" "Dendritic.cell" "Endothelial" ...
#> ..- attr(*, "names")= chr [1:111804] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#> $ prob : Named num [1:111804] 0.629 0.966 0.508 1 1 ...
#> ..- attr(*, "names")= chr [1:111804] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#> $ logliks : num [1:111804, 1:19] -157 -227 -259 -288 -304 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:111804] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#> .. ..$ : chr [1:19] "T-cell" "b" "e" "f" ...
#> $ profiles : num [1:1000, 1:19] 0 0.0938 0.0172 0.0288 0.0372 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:1000] "AATK" "ABL1" "ABL2" "ACACB" ...
#> .. ..$ : chr [1:19] "T-cell" "b" "e" "f" ...
#> $ sds : NULL
#> $ logliks_from_lost_celltypes: num[1:111804, 0 ]
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:111804] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#> .. ..$ : NULL
After running refineClusters, you’ll want to re-run all the above QC plots.
# choose colors:
cols2 <- InSituType::colorCellTypes(freqs = table(res2$clust), palette = "brewers")
# heatmap of profiles:
pheatmap::pheatmap(sweep(res2$profiles, 1, pmax(apply(res2$profiles, 1, max), 0.2), "/")[apply(res2$profiles, 1, max) > 0.2, ],
fontsize_row = 4,
col = colorRampPalette(c("white", "darkblue"))(100))
# flightpath:
flightpath2 <- flightpath_layout(logliks = res2$logliks, profiles = res2$profiles)
par(mar = c(0,0,0,0))
plot(flightpath2$cellpos, pch = 16, cex = 0.1, col = cols2[res2$clust])
text(flightpath2$clustpos[, 1], flightpath2$clustpos[, 2], rownames(flightpath2$clustpos), cex = 0.7)
# umap:
plot(um, pch = 16, cex = 0.1, col = cols2[res2$clust])
for (cell in unique(res2$clust)) {
text(median(um[res2$clust == cell, 1]), median(um[res2$clust == cell, 2]), cell, cex = 0.7)
}
# xy space:
plot(xy, pch = 16, cex = 0.01, col = cols2[res2$clust], asp = 1)
# xy plots highlighting individual cell types:
for (cell in unique(res2$clust)) {
png(paste0(mydir, "/results/celltypev2_", cell, ".png"),
width = diff(range(xy[,1]))*.7, height = diff(range(xy[,2]))*.7, units = "in",
res = 400) # res of 400 to limit file size; 600 is publication-quality
par(mar = c(0,0,0,0))
plot(xy, pch = 16, col = scales::alpha(cols2[res2$clust], 0.3), cex = 0.1,
xlab = "", ylab = "", xaxt = "n", yaxt = "n")
points(xy[res2$clust == cell, ], pch = 16, cex = 0.1, col = "black")
legend("top", legend = cell)
dev.off()
}
Finally we’ll save our results:
# save the whole insitutype result:
saveRDS(res2, file = paste0(mydir, "/processed_data/insitutype_results.RDS"))
# save the vector of cluster assignments:
saveRDS(res2$clust, file = paste0(mydir, "/processed_data/celltype.RDS"))