QC and normalization of RNA data

quality control
normalization
pre-processing
Author
Affiliations

Patrick Danaher

NanoString, a Bruker Company

Github: patrickjdanaher

Published

January 29, 2024

Modified

June 19, 2024

QC and normalization of CosMx RNA data

We’ve tried a lot of options here, and we’ve settled on very simple procedures for most cases.

We discuss them here. To see them implemented in an R-based analysis workflow, see the QC and normalization page of our basic analysis vignette

QC

QC in CosMx is motivated by known error modes. Here’s a list of major things that can go wrong:

  • A cell might be undersampled, leading to excessively low counts (Either only a tip of it is in the slide, or detection efficiency is poor within it.) Solution: remove the cell.
  • A cell might suffer extremely high background, either due to intrinsic tissue stickiness (e.g. associated with necrosis) or due to optical artifacts. Solution: remove the cell.
  • Errors in cell segmentation might assign multiple cells to the same “cell”. Solution: remove these multiplets.
  • A FOV might have low counts overall. This can be caused by imaging trouble, tissue peeling, and probably other causes. Solution: remove FOVs with low quality data. (Removing low quality cells isn’t good enough. If a bad FOV has half its cells removed, the spatial pattern implied by the remaining cells, those lucky enough to survive the cell QC, won’t be representative.)
  • A FOV’s expression profile can be distorted by image registration errors or by imaging artifacts, e.g. fluorescence hiding spots of one color. These FOVs can be analyzable if you’re careful, but given the uncertainty they pose it’s usually best to remove them.

QC logic would then proceed as follows:

  1. Remove cells with too few counts. We use fairly generous thresholds of 20 counts for our 1000plex assay and 50 counts for the 6000plex assay. Higher / more stringent thresholds would also be reasonable.
# counts is the matrix of raw expression profiles, cells in rows, genes in columns
totalcounts <- Matrix::rowSums(counts)  
drop <- totalcounts < 20
  1. Remove cells with high outlier areas. You can use Grubb’s test to detect outliers, or you can draw a histogram of cell areas and choose a cutoff on your own.

  2. Remove FOVs with poor counts. AtoMx removes FOVs based on their mean count per cell, or by a user-specified quantile of counts per cell. Filtering on % of cells flagged by the above criteria would also be reasonable.

  3. Flag FOVs with distorted expression profiles. AtoMx now flags FOVs where z-stack registration is highly unstable, but older runs won’t benefit from this update, and other effects, namely background fluorescence, can still distort FOV expression profiles. To implement steps 3 and 4, you can find our FOV QC tool here.

Normalization

Unlike scRNA-seq data, where cells tend to have somewhat consistent expression levels, spatial platforms vary widely in how much of a cell’s RNA they detect. Normalizing out this effect is important for some analyses. We make the reasonable assumption that a cell’s detection efficiency is well-estimated by its total counts, which implies we can scale each cell’s profile by its total counts:

# counts is the matrix of raw expression profiles, cells in rows, genes in columns
totalcounts <- Matrix::rowSums(counts)  
norm <- sweep(counts, 1, pmax(totalcounts, 20), "/")

Note the pmax(totalcounts, 20) term in the above. This puts a floor on how much we’ll up-scale a cell. This prevents us from taking cells with very few counts and drastically scaling them up, which gives them misleadingly distinct expression profiles.

(Note: some authors have pointed out that there’s information to be had in a cell’s total counts. For example, cancer cells tend to have high overall RNA expression. When we normalize, we lose this information. But we’ve usually found that a small price to pay to control the variability in total counts that arises from unwanted technical sources. Discerning between highly distinct cell types like cancer vs. immune cells is generally easy, while uncovering trends within a cell type is a harder task where controlling technical variability can be enabling.)

Other transformations

For most cases, we recommend keeping data on the linear scale, i.e. normalizing without further transformation. This approach keeps the data on an easily-interpretable scale and tends to perform well in analyses.

For some purposes, further transformations can make sense. Transformations like square root or “log1p” (log(1 + x)) inflate the variability of low counts and shrink the variability of high counts. In some datasets, this can make UMAPs and distance-based clustering methods like Leiden and Louvain perform better. Seurat and AtoMx both offer a log1p transformation.

AtoMx also offers a “Pearson residuals” transformation. This transformation can produce great UMAPs, but is only viable in small studies: because it creates a dense (not sparse) expression matrix, it produces an expression matrix with a huge memory footprint, potentially crashing your analysis environment.

If you do perform a non-linear transformation to run UMAP or Leiden clustering, consider using linear-scale data elsewhere in your analysis.