Background

Cell typing is always critical and occasionally difficult. Here we’ll discuss what to do when straightforward approaches fail.

Before exploring the options here, you should:

Attempt the standard approaches - InSituType, Leiden clustering and label transfer.
If you’re using InSituType, attempt the workflows described in the InSituType FAQs. That document provides detailed guidance on using InSituType’s many options, and its contents should be enough to guide you through most cell typing exercises.
Confirm your cell segmentation was successful. Use AtoMx or your favorite viewer to confirm that segmentation errors aren’t widespread.

In this document, we’ll address five common difficulties:

Fine subtyping, especially for immune cells
Cell typing based on marker genes
Studies with batch effects
Failure to find anchor cells / difficulties mapping reference profiles
Reference cell types claiming the wrong cells

Fine subtyping: take a hierarchical approach and rely on gene selection

Broad cell typing, e.g. discerning cancer cells vs. fibroblasts vs. immune cells, does best using a large number of genes - the whole panel for 1k data, and at least half the 6k panel. But when we turn to fine subtyping, e.g. discerning CD4 T-cells vs. Tregs, a much smaller gene set is relevant. If we use a complete high-plex panel for this problem, we’ll rely on potentially thousands of unhelpful genes: genes that are mainly in the background for our cell types of interest, genes for which cell segmentation errors have introduced high levels of contaminating signal into our cell types of interest, and genes with minimal variability among our cell types of interest. These genes make fine cell typing harder by introducing noise, or, in the case of segmentation errors, bias. You may find your subclusters to be driven by sample ID or by spatial context.

The solution is simple: after establishing your broad cell types, perform fine cell typing among related cell types using a useful subset of genes. Hierarchical approaches like this are common in the single cell world. Defining the “useful subset of genes” is straightforward: we want genes that differ meaningfully between our closely-related cell types, and we want to avoid genes at risk of contamination from segmentation errors. We advise using a generous gene list: including at least 1/3-1/2 of the genes in the 1k and 6k panels seems to work well.

The below code shows how this approach would work for supervised cell typing. Here, we’ll perform just one filter, removing genes at risk of bias from cell segmentation errors:

#### assumed data ---------------------------------------
# xy: 2-column matrix of cells' xy positions
# refprofiles: matrix of expected expression in genes * cells types, as 
#   is expected by InSituType.
# counts: counts matrix, cells * genes.
# clust: vector of broad cell type assignments.

#### source functions for gene selection --------------------------------------- 

source(url("https://github.com/Nanostring-Biostats/CosMx-Analysis-Scratch-Space/tree/Main/_code/cell-typing-advanced-strategies/getSubtypingGenes.R"))
# (the URL above points to a GitHub web page rather than a raw file, so source() may fail;
#  if it does, download the file from that URL and source it locally)

#### supervised subtyping workflow ---------------------------------------

## get genes safe from contamination:
safegenes <- findSafeGenes(
  counts = counts,                    # note we're entering info for all cells           
  xy = xy,                            # again entering info for all cells
  ismycelltype = (clust == "T-cell"), # logical indicating the cells to subtype
  tissue = NULL,                      # specify if tissues have xy overlap
  self_vs_neighbor_threshold = 1.75   # min ratio of expression in the cell type vs. in its neighbors
  )$safegenes            

message(paste0(length(safegenes), " genes kept from a panel of ", ncol(counts), ". Try to use at least 1/3 the panel."))

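## supervised subtyping of T-cells, using only the safe genes:
sub <- clust == "T-cell"                                  # cells to subtype
usegenes <- intersect(safegenes, rownames(refprofiles))   # safe genes present in the reference
res <- insitutype(
  x = counts[sub, usegenes],
  reference_profiles = refprofiles[usegenes, c("CD4", "CD8", "Treg")],
  n_clusts = 0)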

## update the cluster assignments:
clust[sub] <- res$clust
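
To make the "self vs. neighbor" filter more concrete, here is a minimal base R sketch of the idea on simulated data. It is not the implementation of findSafeGenes: the function name selfVsNeighborRatio, the use of k nearest spatial neighbors, and the pseudocount are all assumptions made for illustration.

```r
# Hypothetical illustration only: NOT the findSafeGenes implementation.
# For each gene, compare mean expression in the cells of interest to mean
# expression in their nearest spatial neighbors of other cell types.
selfVsNeighborRatio <- function(counts, xy, ismycelltype, k = 5, pseudo = 0.01) {
  self <- which(ismycelltype)
  other <- which(!ismycelltype)
  # distances from each cell of interest to every cell of another type
  d <- as.matrix(dist(xy))[self, other, drop = FALSE]
  # each cell's k nearest other-type neighbors, as row indices into counts
  nn <- apply(d, 1, function(row) other[order(row)[seq_len(min(k, length(row)))]])
  neighbor_ids <- unique(as.vector(nn))
  self_mean <- colMeans(counts[self, , drop = FALSE])
  neighbor_mean <- colMeans(counts[neighbor_ids, , drop = FALSE])
  (self_mean + pseudo) / (neighbor_mean + pseudo)
}

# Simulated example: geneA is specific to our cell type, while geneB is
# mostly expressed by neighboring cells and so is at risk of contamination.
set.seed(1)
n <- 200
xy <- cbind(x = runif(n), y = runif(n))
ismycelltype <- rep(c(TRUE, FALSE), each = n / 2)
counts <- cbind(
  geneA = rpois(n, ifelse(ismycelltype, 5, 0.2)),
  geneB = rpois(n, ifelse(ismycelltype, 0.5, 5))
)
ratio <- selfVsNeighborRatio(counts, xy, ismycelltype)
safegenes_sketch <- names(ratio)[ratio >= 1.75]   # same threshold as used above
print(ratio)
```

In this toy setting only geneA clears the 1.75 threshold. geneB is dominated by neighboring cells, so any counts seen in our cell type are more likely to reflect segmentation errors than biology.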

And for unsupervised clustering, we’ll filter genes both for risk of segmentation errors and for informativeness in our data. For “informativeness”, we’ll search for highly variable genes, as described by Stuart et al. (2019), and implemented in Seurat::FindVariableFeatures.

#### assumed data ---------------------------------------
# xy: 2-column matrix of cells' xy positions
# refprofiles: matrix of expected expression in genes * cells types, as 
#   is expected by InSituType.
# counts: counts matrix, cells * genes.
# clust: vector of broad cell type assignments.

#### source functions for gene selection --------------------------------------- 

source(url("https://github.com/Nanostring-Biostats/CosMx-Analysis-Scratch-Space/tree/Main/_code/cell-typing-advanced-strategies/getSubtypingGenes.R"))
# (the URL above points to a GitHub web page rather than a raw file, so source() may fail;
#  if it does, download the file from that URL and source it locally)

#### unsupervised subclustering workflow ---------------------------------------

## get genes safe from contamination:
safegenes <- findSafeGenes(
  counts = counts,                    # note we're entering info for all cells           
  xy = xy,                            # again entering info for all cells
  ismycelltype = (clust == "T-cell"), # logical indicating the cells to subtype
  tissue = NULL,                      # specify if tissues have xy overlap
  self_vs_neighbor_threshold = 1.75   # min ratio of expression in the cell type vs. in its neighbors
  )$safegenes           

## Get highly variable genes in your subset:
sub <- clust == "T-cell"                                 
subsetHVGs <- getSubclusteringGenes(
  mat = counts[sub, ],     # raw counts from only the cell type of interest
  varratiothresh = 1,      # how much var beyond what mean expression predicts
  expressionthresh = 0.2)  # min expression level

## keep genes meeting both the above criteria:
usegenes <- intersect(subsetHVGs, safegenes)
message(paste0(length(usegenes), " genes kept from a panel of ", ncol(counts), ". Try to use at least 1/3 the panel."))

## unsupervised subclustering of T-cells:
res <- insitutype(
  x = counts[sub, usegenes],
  n_clusts = 4)

## update the cluster assignments:
clust[sub] <- paste0("T-cell cluster ", res$clust)
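
For intuition about what "variance beyond what mean expression predicts" can mean, here is a simplified base R sketch. It is neither getSubclusteringGenes nor Seurat's method: the function name selectVariableGenesSketch is hypothetical, and we assume a straight-line fit of log variance on log mean as the "expected" variance, which is cruder than the fits used by published methods.

```r
# Hypothetical illustration only: NOT getSubclusteringGenes or Seurat's vst.
# Keep genes whose variance exceeds the variance predicted from their mean,
# and whose mean expression exceeds a minimum level.
selectVariableGenesSketch <- function(mat, varratiothresh = 1, expressionthresh = 0.2) {
  m <- colMeans(mat)
  v <- apply(mat, 2, var)
  ok <- m > 0 & v > 0                       # genes usable on the log scale
  fit <- lm(log(v[ok]) ~ log(m[ok]))        # assumed mean-variance trend
  ratio <- rep(NA_real_, ncol(mat))
  names(ratio) <- colnames(mat)
  ratio[ok] <- v[ok] / exp(fitted(fit))     # observed / expected variance
  keep <- !is.na(ratio) & ratio > varratiothresh & m > expressionthresh
  colnames(mat)[keep]
}

# Simulated example: 5 barely-expressed genes, 40 genes with pure Poisson
# noise, and 5 genes that differ between two hidden subpopulations.
set.seed(1)
ncell <- 300
pois_means <- c(rep(0.05, 5), seq(0.5, 10, length.out = 40))
mat <- sapply(pois_means, function(mu) rpois(ncell, mu))
group_mean <- rep(c(0.5, 8), each = ncell / 2)
hv <- sapply(1:5, function(i) rpois(ncell, group_mean))
mat <- cbind(mat, hv)
colnames(mat) <- c(paste0("low", 1:5), paste0("flat", 1:40), paste0("hv", 1:5))

usegenes_sketch <- selectVariableGenesSketch(mat)
print(usegenes_sketch)
```

The genes that separate the two hidden subpopulations pass the filter, and the barely-expressed genes are dropped. Some of the pure-noise genes will also pass by chance at a threshold of 1, which is one reason to intersect this list with other filters rather than rely on it alone.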

Cell typing based on marker genes

We generally advise against relying on raw marker gene expression for cell typing. However, a marker-centered view can occasionally be appropriate. For example, you might insist on restricting your Treg analysis to cells that are demonstrably FOXP3+. If you do want to base your cell typing on a marker gene, we recommend using smoothing/imputation to get cleaner marker gene expression values. The idea is to obtain more accurate expression values for a marker gene by borrowing information from the other genes in each cell’s profile. The intuition is simple: if you see FOXP3 expression in all the cells with highly similar expression profiles to a cell in question, then the cell in question is very probably FOXP3+ itself, whether or not you observed any FOXP3 counts in it. Or, more formally, this approach negotiates a favorable bias-variance tradeoff: in this example, we bias cells’ FOXP3 expression values to resemble those of their nearest neighbors in expression space, while greatly reducing sampling variability compared to that of a single gene’s count value.

We demonstrate a smoothing/imputation approach to marker genes here. You can also use SAVER (Huang et al, 2018) or scImpute (Li & Li, 2018) or any of the many other published single cell imputation methods. Do note that performing deeper analyses of imputed expression values, e.g. differential expression testing, is hazardous (Andrews & Hemberg, 2019), though imputed counts can be useful for plotting.
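
As a rough illustration of the nearest-neighbor idea described above, here is a minimal base R sketch on simulated data. It is not the linked implementation, nor SAVER or scImpute: the function name smoothMarkerSketch, the choice of k, the number of principal components, and the normalization are all assumptions.

```r
# Hypothetical illustration only: a simple nearest-neighbor smoother.
# Each cell's marker value is replaced by the mean normalized marker
# expression across its k nearest neighbors in expression space.
smoothMarkerSketch <- function(counts, marker, k = 20, npcs = 5) {
  # scale each cell to the average total counts, then log-transform
  totals <- rowSums(counts)
  norm <- log1p(counts / pmax(totals, 1) * mean(totals))
  # define neighbors using every gene EXCEPT the marker itself
  others <- setdiff(colnames(counts), marker)
  pcs <- prcomp(norm[, others], center = TRUE)$x[, seq_len(npcs)]
  d <- as.matrix(dist(pcs))
  # average over the k nearest cells (each cell counts as its own neighbor)
  apply(d, 1, function(row) mean(norm[order(row)[seq_len(k)], marker]))
}

# Simulated example: 100 Treg-like cells and 100 other cells, with sparse
# FOXP3 counts so that many true Tregs have zero observed FOXP3.
set.seed(1)
is_treg <- rep(c(TRUE, FALSE), each = 100)
profile_genes <- sapply(1:20, function(g) {
  hi_in_treg <- g <= 10   # first 10 genes are high in Tregs, the rest in others
  rpois(200, ifelse(is_treg == hi_in_treg, 6, 1))
})
colnames(profile_genes) <- paste0("gene", 1:20)
FOXP3 <- rpois(200, ifelse(is_treg, 1, 0.02))
counts <- cbind(profile_genes, FOXP3 = FOXP3)

smoothed <- smoothMarkerSketch(counts, marker = "FOXP3")
print(mean(counts[is_treg, "FOXP3"] == 0))   # share of Tregs with zero raw FOXP3
print(tapply(smoothed, is_treg, mean))       # smoothed FOXP3 by true group
```

Here a sizeable share of the simulated Tregs have no raw FOXP3 counts, yet every one of them receives a positive smoothed value because its neighbors in expression space are other Tregs. Any threshold you then apply to the smoothed values is a judgment call that should be checked against your own data.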

Batch effects: well-considered workflows

InSituType is quite, but not completely, robust to batch effects. Leiden clustering and other methods relying on principal components of the data can be highly sensitive to them. If you see evidence of batch effects in your data, for example strong batch dependence in your cell typing results or your UMAP projection, consider the below strategies.

Unsupervised clustering: InSituType will generally work even in the presence of batch effects. To run Leiden clustering, just begin by using any standard batch-correction method, e.g. Harmony (Korsunsky et al, 2019). Seurat’s label transfer functions offer another approach. Batch-corrected data isn’t well-suited to InSituType, which ingests raw data.
Supervised cell typing with InSituType: run InSituType separately on each tissue/batch, using the rescale=TRUE option. This will perform a batch adjustment from the reference to each new tissue. We used this approach successfully in a study of lupus nephritis (Danaher et al., 2024). You can see the code we used here.
Semi-supervised cell typing with InSituType: initially analyze a single batch of data. Once you’re satisfied that you’ve captured all the unknown and reference cell types in that batch, use it to derive a new reference matrix (use InSituType::getRNAprofiles). Then apply the above supervised cell typing strategy using this reference matrix.
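
The "run separately on each tissue/batch" pattern from the supervised strategy can be sketched as a simple split-and-recombine loop. The code below is only a skeleton on simulated data: typeByBatchSketch and toyTyper are hypothetical names, and toyTyper is a stand-in classifier (highest correlation with a reference profile), not InSituType. In a real analysis the typing function would instead wrap a call such as insitutype(x = x, reference_profiles = refprofiles, n_clusts = 0, rescale = TRUE)$clust, mirroring the calls used earlier in this document.

```r
# Hypothetical skeleton: apply a cell typing function to each batch
# separately, then recombine the results in the original cell order.
typeByBatchSketch <- function(counts, batch, type_one_batch) {
  clust <- rep(NA_character_, nrow(counts))
  for (b in unique(batch)) {
    inbatch <- batch == b
    clust[inbatch] <- type_one_batch(counts[inbatch, , drop = FALSE])
  }
  clust
}

# Toy stand-in for a supervised cell typing call (NOT InSituType):
# assign each cell to the reference profile it correlates with best.
toyTyper <- function(refprofiles) {
  function(x) {
    cors <- cor(t(x), refprofiles)   # cells x cell types
    colnames(refprofiles)[max.col(cors, ties.method = "first")]
  }
}

# Simulated example: two cell types and two batches, where the second
# batch has 3x higher counts overall.
set.seed(1)
refprofiles <- cbind(A = c(rep(10, 5), rep(1, 5)),
                     B = c(rep(1, 5), rep(10, 5)))
rownames(refprofiles) <- paste0("gene", 1:10)
truth <- rep(c("A", "B"), times = 100)
batch <- rep(c("batch1", "batch2"), each = 100)
scale <- ifelse(batch == "batch2", 3, 1)
counts <- t(sapply(seq_along(truth), function(i) {
  rpois(10, refprofiles[, truth[i]] * scale[i])
}))
colnames(counts) <- rownames(refprofiles)

clust <- typeByBatchSketch(counts, batch, toyTyper(refprofiles))
print(table(clust, batch))
```

Keeping the per-batch call behind a function argument makes it easy to swap in the real typing call and to inspect each batch's results before merging them.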

Failure to find anchor cells

InSituType has the option of using anchor cells to calibrate reference profiles for CosMx data. This calibration is generally helpful for supervised cell typing and usually essential for semi-supervised cell typing, at least if the reference came from scRNA-seq data. (For details on this calibration, see the “Updating reference profiles” section of the InSituType FAQs.)

Sometimes, InSituType will fail to discover enough “anchor cells” to perform this calibration. In this case, there are several simple next steps.

Lower the thresholds for anchor selection (min_anchor_cosine and min_anchor_llr). This is often needed in high-plex studies - the defaults were optimized for 1000-plex data.
If you only find anchors for a few cell types: use rescale = TRUE, refit = FALSE. This performs a softer calibration: it rescales each row/gene of the reference matrix, but doesn’t fully refit each profile. This operation can be powered using only a few cell types.
If you have a whole single cell reference dataset, not just profiles, then try a label transfer algorithm, e.g. Seurat’s or MaxFuse (Chen et al, 2024). If you do perform cell typing with label transfer, then using InSituType::spatialUpdate can further improve these initial results.
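
To see why lowering an anchor threshold yields more anchors, consider this conceptual base R sketch of cosine-based anchor selection on simulated data. It is not InSituType's anchor selection code, which also uses a log-likelihood ratio criterion: the function names cosineToProfiles and chooseAnchorsSketch are hypothetical, and only the cosine part of the idea is shown.

```r
# Hypothetical illustration only: NOT InSituType's anchor selection.
# Cosine similarity between each cell's counts and each reference profile.
cosineToProfiles <- function(counts, refprofiles) {
  genes <- intersect(colnames(counts), rownames(refprofiles))
  x <- counts[, genes, drop = FALSE]
  p <- refprofiles[genes, , drop = FALSE]
  cosines <- (x %*% p) / (sqrt(rowSums(x^2)) %o% sqrt(colSums(p^2)))
  cosines[!is.finite(cosines)] <- 0   # cells with zero counts get cosine 0
  cosines
}

# A cell becomes an anchor for its best-matching profile only if the
# cosine similarity reaches the threshold; otherwise it is left as NA.
chooseAnchorsSketch <- function(cosines, min_cosine) {
  best <- max.col(cosines, ties.method = "first")
  bestcos <- cosines[cbind(seq_len(nrow(cosines)), best)]
  ifelse(bestcos >= min_cosine, colnames(cosines)[best], NA_character_)
}

# Simulated example: cells with low signal ("depth") have noisier profiles
# and therefore lower cosine similarity to their true reference profile.
set.seed(1)
refprofiles <- cbind(Tcell = c(8, 6, 0.5, 0.5, 1, 1),
                     Bcell = c(0.5, 0.5, 8, 6, 1, 1))
rownames(refprofiles) <- paste0("gene", 1:6)
truth <- rep(c("Tcell", "Bcell"), each = 100)
depth <- runif(200, 0.1, 2)
counts <- t(sapply(seq_along(truth), function(i) {
  rpois(6, refprofiles[, truth[i]] * depth[i])
}))
colnames(counts) <- rownames(refprofiles)

cosines <- cosineToProfiles(counts, refprofiles)
strict <- chooseAnchorsSketch(cosines, min_cosine = 0.9)
lenient <- chooseAnchorsSketch(cosines, min_cosine = 0.6)
print(table(strict, useNA = "ifany"))
print(table(lenient, useNA = "ifany"))
```

Every anchor found at the strict threshold is also found at the lenient one, so lowering the threshold can only add anchors. The cost is that the added anchors are noisier matches, so it is worth checking that they still look like the cell types they are meant to represent.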

Reference cell types claiming the wrong cells

InSituType will occasionally assign cells to the wrong reference profile, e.g. we’ve seen some cancer cells get assigned to the “NK cells” profile. Fortunately, this kind of event is easy to detect with even cursory QC of your cell type results, and easy to fix: just use InSituType::refineClusters. Correcting this kind of error should be thought of as a part of routine human-guided cell typing, an operation to perform alongside all the other renaming, merging, and subclustering operations performed with refineClusters.
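
One kind of correction, removing a reference cell type and sending its cells to their next-best match, can be illustrated with a toy log-likelihood matrix. This is a conceptual base R sketch, not the refineClusters implementation: the function name deleteClusterSketch and the matrix values are made up for illustration, and deleting a cell type outright only makes sense when you are confident it is absent from your tissue.

```r
# Hypothetical illustration only: NOT the refineClusters implementation.
# Drop one or more cell types and reassign every cell to the best
# remaining cell type according to a cells x cell types log-likelihood matrix.
deleteClusterSketch <- function(logliks, to_delete) {
  kept <- logliks[, setdiff(colnames(logliks), to_delete), drop = FALSE]
  colnames(kept)[max.col(kept, ties.method = "first")]
}

# Made-up log-likelihoods: cell1 narrowly prefers "NK cells" over "cancer".
logliks <- rbind(
  cell1 = c(cancer = -10, "NK cells" = -8,  fibroblast = -30),
  cell2 = c(cancer = -5,  "NK cells" = -20, fibroblast = -25),
  cell3 = c(cancer = -40, "NK cells" = -35, fibroblast = -12)
)

before <- colnames(logliks)[max.col(logliks, ties.method = "first")]
after <- deleteClusterSketch(logliks, to_delete = "NK cells")
print(data.frame(cell = rownames(logliks), before = before, after = after))
```

If genuine NK cells are present alongside the misassigned cancer cells, subclustering the offending cluster and renaming the resulting subclusters is the safer route.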

References

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of single-cell data. Cell. 2019 Jun 13;177(7):1888-902.
Danaher P, Hasle N, Nguyen ED, Roberts JE, Rosenwasser N, Rickert C, Hsieh EW, Hayward K, Okamura DM, Alpers CE, Reed RC. Childhood-onset lupus nephritis is characterized by complex interactions between kidney stroma and infiltrating immune cells. Science Translational Medicine. 2024 Nov 27;16(775):eadl1666.
Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh PR, Raychaudhuri S. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods. 2019 Dec;16(12):1289-96.
Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M, Zhang NR. SAVER: gene expression recovery for single-cell RNA sequencing. Nature methods. 2018 Jul;15(7):539-42.
Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature communications. 2018 Mar 8;9(1):997.
Andrews TS, Hemberg M. False signals induced by single-cell imputation. F1000Research. 2019 Mar 5;7:1740.
Chen S, Zhu B, Huang S, Hickey JW, Lin KZ, Snyder M, Greenleaf WJ, Nolan GP, Zhang NR, Ma Z. Integration of spatial and single-cell data across modalities with weakly linked features. Nature Biotechnology. 2024 Jul;42(7):1096-106.