Cell typing

Introduction

Cell typing is a crucial stage of analysis requiring careful attention. To ensure the integrity of your analyses, you must spend time reviewing, and possibly refining, your cell typing results.

Here we’ll demonstrate a basic cell typing workflow using the Insitutype algorithm. We’ll show the core QC plots and walk through examples of how you can refine your cell typing results.

For more details on cell typing, see these resources:

Cell typing preparation

Insitutype offers 3 approaches to cell typing:

Unsupervised clustering: discover cell types de novo without reference to prior data or biological knowledge.
Supervised cell typing: start with “reference profiles” - a matrix of expected gene expression in each cell type - from prior data. Assign your cells to whichever reference cell type they resemble best.
Semi-supervised cell typing: assign cells to reference cell types as in supervised mode, but discover new clusters at the same time.

To obtain reference profiles, you can use pre-existing single cell or CosMx data, or you can pull from resources NanoString has compiled:

The CosMx-derived profiles should be preferred when available, as they escape the strong platform effects found between CosMx and scRNA-seq. For now, the library of scRNA-seq profiles covers more tissue types.

Our example dataset is from melanoma samples. We don’t have a reference profile for melanoma (and given the diversity of cancers, it’s unclear that a reference from a different melanoma would work well), but we do have reference profiles for immune and stroma cells, which have much less between-sample variability than cancer cells. So we’ll run in semi-supervised mode, looking for reference immune/stroma cells and using our de novo clusters to capture cancer cells.

We start by loading the data we’ll need, which was output by the earlier scripts from the workflow vignette:

library(pheatmap)
#> Warning: package 'pheatmap' was built under R version 4.4.0
library(InSituType)

# load CosMx data:
# note these Rmd files are for convenience during analysis, and are not a NanoString-supported format
mydir <- "../"
metadata <- readRDS(paste0(mydir, "/processed_data/metadata.RDS")) 
counts <- readRDS(paste0(mydir, "/processed_data/counts.RDS")) 
negcounts <- readRDS(paste0(mydir, "/processed_data/negcounts.RDS")) 
um <- readRDS(paste0(mydir, "/processed_data/um.RDS"))
xy <- readRDS(paste0(mydir, "/processed_data/xy.RDS"))

Next we’ll download a reference profile matrix from a previous CosMx experiment:

refprofiles <- read.csv("https://raw.githubusercontent.com/Nanostring-Biostats/CosMx-Cell-Profiles/main/Human/IO/IO.profiles.csv", row.names = 1, header = TRUE)
refprofiles <- refprofiles[is.element(rownames(refprofiles), colnames(counts)), ]
pheatmap::pheatmap(sweep(refprofiles, 1, pmax(apply(refprofiles, 1, max), 0.2), "/"), 
                   col = colorRampPalette(c("white", "darkblue"))(100))

Initial cell typing

Next we’ll use Insitutype to assign our cells to these previously-published cell types, while looking for 6 new clusters (it’s usually advisable to overcluster at first):

insitutypefileloc <- paste0(mydir, "/processed_data/insitutype_initial_results.RDS")
if (!file.exists(insitutypefileloc)) {
  set.seed(0)
  res <- InSituType::insitutype(x = counts, 
                               neg = Matrix::rowMeans(negcounts), 
                               reference_profiles = refprofiles,
                               update_reference_profiles = FALSE,
                               n_clust = 6)
  saveRDS(res, file = insitutypefileloc)
} else {
  res <- readRDS(insitutypefileloc)
}

First we choose cell type colors. Doing this by hand is usually desirable later on. But for now, before we’ve finalized our cell typing results, we can just use an automated approach.

cols <- InSituType::colorCellTypes(freqs = table(res$clust), palette = "brewers")
str(cols)
#>  Named chr [1:21] "#8DD3C7" "#BEBADA" "#FB8072" "#80B1D3" "#FDB462" ...
#>  - attr(*, "names")= chr [1:21] "f" "b" "a" "c" ...
cols[c("Endothelial", "Fibroblast", "B.cell")]
#> Endothelial  Fibroblast      B.cell 
#>   "#66C2A5"   "#B3B3B3"   "#FFFF33"

QC of cell typing results

Now we enter the crucial QC phase. Below we’ll print out a variety of QC plots and discuss their interpretation. We start with UMAP and “flightpath” plots. Flightpath plots are specific to methods like Insitutype which score every cell’s probability of belonging to every cluster. They show the tendency of different cell types to be confused with each other.

# flightpath
flightpath <- InSituType::flightpath_layout(logliks = res$logliks, profiles = res$profiles)
par(mar = c(0,0,0,0))
plot(flightpath$cellpos, pch = 16, cex = 0.2, col = cols[res$clust])
text(flightpath$clustpos[, 1], flightpath$clustpos[, 2], rownames(flightpath$clustpos), cex = 0.7)

# umap
plot(um, pch = 16, cex = 0.1, col = cols[res$clust])
for (cell in unique(res$clust)) {
  text(median(um[res$clust == cell, 1]), median(um[res$clust == cell, 2]), cell, cex = 0.7)  
}

# xy space:
plot(xy, pch = 16, cex = 0.1, col = cols[res$clust], asp = 1)

# (to actually get a good look, print to a png to examine the full xy space in high res)
if (FALSE) {
  png(paste0(mydir, "/results/cell_type_map.png"), 
      width = diff(range(xy[,1]))*.7, height = diff(range(xy[,2]))*.7, units = "in", 
      res = 500)  # res of 400 is pretty good; 600 is publication-quality
  par(mar = c(0,0,0,0))
  plot(xy, pch = 16, col = cols[res$clust], cex = 0.1,
       xlab = "", ylab = "", xaxt = "n", yaxt = "n", asp = 1)
  legend("topright", pch = 16, col = cols, legend = names(cols), cex = 0.5)
  dev.off()
}

Higher rates of confusion are seen between closely-related cell types, e.g. among T-cell subtypes or the (presumably tumor) clusters a, b, c, and e. But for the most part this is a good result, with no indication that any clusters are irredeemably confused.

Next we draw a heatmap of the cluster profiles. These are helpful for confirming that clusters have the expected marker genes, or for understanding what cell types unknown clusters are. It’s often useful to print out a very long pdf of this heatmap in which every gene name is visible.

# profiles heatmap
#pdf(paste0(mydir, "/results/celltypeprofiles.pdf"), height = 20)
pheatmap::pheatmap(sweep(res$profiles, 1, pmax(apply(res$profiles, 1, max), 0.2), "/")[apply(res$profiles, 1, max) > 0.2, ],  
                   fontsize_row = 4,
                   col = colorRampPalette(c("white", "darkblue"))(100))

#dev.off()
# (the use of pmax(..., 0.2) prevents genes with very low signal in all cell types from appearing exciting)

And finally, we’ll want to see every cell type’s spatial spread. You generally need a pretty big plot to actually resolve details, so you’ll want to print the plot to a .png. (Don’t use vector graphics like pdf or svg - they save the location of every cell and so turn into huge files.) Code like the below is good for this:

for (cell in unique(res$clust)) {
  png(paste0(mydir, "/results/celltypev1_", cell, ".png"), 
      width = diff(range(xy[,1]))*.7, height = diff(range(xy[,2]))*.7, units = "in", 
      res = 400)  # res of 400 is pretty good; 600 is publication-quality
  par(mar = c(0,0,0,0))
  plot(xy, pch = 16, col = scales::alpha(cols[res$clust], 0.3), cex = 0.1,
       xlab = "", ylab = "", xaxt = "n", yaxt = "n")
  points(xy[res$clust == cell, ], pch = 16, cex = 0.1, col = "black")
  legend("top", legend = cell)
  dev.off()
}

Refining cell type results

Cell typing deserves careful attention: it’s a complex process, and your entire analysis depends on its results. Scrutinize the above QC plots for suspicious patterns. These could be:

Unlikely spatial distributions
Unlikely cell type abundances, e.g. a far too many immune cells, or not enough of a basic cell type
Implausible marker genes expressed at high levels
Biologically dissimilar cell types with similar expression profiles (near each other in UMAP space, confused with each other in the flightpath, showing similar profiles in the heatmap.)

For a deep dive on this topic, see the Insitutype FAQs.

Below is an example of how to refine cell typing results. We use Insitutype’s built-in function “refineClusters”.

Examining the QC plots above, we see that we have indeed overclustered a bit, resulting in novel clusters fitting the data better than our reference profiles and claiming cells that we’d like to call something else. Here, cluster “a” is stealing from stromal populations, and cluster “c” is stealing from diverse cell populations.

The solution is simple: we simply delete these clusters and force their cells to go the the next-best cell type. We’ll use the refineClusters function to perform this operation. Using the same function, we’ll also merge the T-cell clusters for simplicity, and, just for fun, we’ll subcluster tumor cluster “d”.

refinedfileloc <- paste0(mydir, "/processed_data/insitutype_refined_results.RDS")
if (!file.exists(refinedfileloc)) {
    res2 <- refineClusters(
      logliks = res$logliks, 
      counts = counts,
      neg = Matrix::rowMeans(negcounts), 
      merges = c("T.cell.CD4" = "T-cell",
                 "T.cell.CD8" = "T-cell"),  # merge Treg, CD4 and CD8 cells to a broader "T-cell" category
      to_delete = c("a", "c"),               # delete this cluster and send its cells to their next-best fit
      subcluster = list("d" = 2))            # subcluster cluster a into 2 new clusters
  saveRDS(res2, file = refinedfileloc)
} else {
  res2 <- readRDS(refinedfileloc)
}

str(res2)   # same structure as the Insitutype output, but with new cell types
#> List of 6
#>  $ clust                      : Named chr [1:111804] "Monocyte" "Neutrophil" "Dendritic.cell" "Endothelial" ...
#>   ..- attr(*, "names")= chr [1:111804] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#>  $ prob                       : Named num [1:111804] 0.629 0.966 0.508 1 1 ...
#>   ..- attr(*, "names")= chr [1:111804] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#>  $ logliks                    : num [1:111804, 1:19] -157 -227 -259 -288 -304 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:111804] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#>   .. ..$ : chr [1:19] "T-cell" "b" "e" "f" ...
#>  $ profiles                   : num [1:1000, 1:19] 0 0.0938 0.0172 0.0288 0.0372 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:1000] "AATK" "ABL1" "ABL2" "ACACB" ...
#>   .. ..$ : chr [1:19] "T-cell" "b" "e" "f" ...
#>  $ sds                        : NULL
#>  $ logliks_from_lost_celltypes: num[1:111804, 0 ] 
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:111804] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#>   .. ..$ : NULL

After running refineClusters, you’ll want to re-run all the above QC plots.

# choose colors:
cols2 <- InSituType::colorCellTypes(freqs = table(res2$clust), palette = "brewers")

# heatmap of profiles:
pheatmap::pheatmap(sweep(res2$profiles, 1, pmax(apply(res2$profiles, 1, max), 0.2), "/")[apply(res2$profiles, 1, max) > 0.2, ],  
                   fontsize_row = 4,
                   col = colorRampPalette(c("white", "darkblue"))(100))

# flightpath:
flightpath2 <- flightpath_layout(logliks = res2$logliks, profiles = res2$profiles)
par(mar = c(0,0,0,0))
plot(flightpath2$cellpos, pch = 16, cex = 0.1, col = cols2[res2$clust])
text(flightpath2$clustpos[, 1], flightpath2$clustpos[, 2], rownames(flightpath2$clustpos), cex = 0.7)

# umap:
plot(um, pch = 16, cex = 0.1, col = cols2[res2$clust])
for (cell in unique(res2$clust)) {
  text(median(um[res2$clust == cell, 1]), median(um[res2$clust == cell, 2]), cell, cex = 0.7)  
}

# xy space:
plot(xy, pch = 16, cex = 0.01, col = cols2[res2$clust], asp = 1)

# xy plots highlighting individual cell types:
for (cell in unique(res2$clust)) {
  png(paste0(mydir, "/results/celltypev2_", cell, ".png"), 
      width = diff(range(xy[,1]))*.7, height = diff(range(xy[,2]))*.7, units = "in", 
      res = 400)  # res of 400 to limit file size; 600 is publication-quality
  par(mar = c(0,0,0,0))
  plot(xy, pch = 16, col = scales::alpha(cols2[res2$clust], 0.3), cex = 0.1,
       xlab = "", ylab = "", xaxt = "n", yaxt = "n")
  points(xy[res2$clust == cell, ], pch = 16, cex = 0.1, col = "black")
  legend("top", legend = cell)
  dev.off()
}

Finally we’ll save our results:

# save the whole insitutype result:
saveRDS(res2, file = paste0(mydir, "/processed_data/insitutype_results.RDS"))

# save the vector of cluster assignments:
saveRDS(res2$clust, file = paste0(mydir, "/processed_data/celltype.RDS"))