QC

QC of CosMx data is mostly straightforward. Technical effects can be complex, but they manifest in a limited number of ways: as low-quality cells and as FOVs with lost or biased signal.

First we’ll flag cells with poor signal, then we’ll flag FOVs. When we’re done, we’ll delete everything we’ve flagged. (The full original data will remain on AtoMx, so this step is not as drastic as it seems. For smaller studies it could still be reasonable to hold on to the pre-QC .RDS files.)

The QC steps implemented here mimic and expand on those of AtoMx. We reimplement them here to give you more granular control over how they are applied.

We start by loading the data we’ll need, which was output by the earlier scripts from the workflow vignette:

library(Matrix)
library(viridis)
#> Warning: package 'viridis' was built under R version 4.4.0
#> Loading required package: viridisLite
#> Warning: package 'viridisLite' was built under R version 4.4.0

# note these files are for convenience during analysis, and are not a NanoString-supported format
mydir <- "../"
metadata <- readRDS(paste0(mydir, "/processed_data/metadata_unfiltered.RDS")) 
counts <- readRDS(paste0(mydir, "/processed_data/counts_unfiltered.RDS")) 
negcounts <- readRDS(paste0(mydir, "/processed_data/negcounts_unfiltered.RDS")) 
falsecounts <- readRDS(paste0(mydir, "/processed_data/falsecounts_unfiltered.RDS")) 
xy <- readRDS(paste0(mydir, "/processed_data/xy_unfiltered.RDS"))

Cell-level QC

We’ll check for two kinds of bad cells: those with too few transcripts to be useful, and those that appear to result from segmentation errors.

We generally require at least 20 counts per cell in 1000-plex data and at least 50 counts per cell in 6000-plex data:

# require 20 counts per cell 
count_threshold <- 20
flag <- metadata$nCount_RNA < count_threshold
table(flag)
#> flag
#>  FALSE   TRUE 
#> 114465    754
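
If this script will be reused across panels, the count threshold could be derived from the panel’s plex rather than hard-coded. A minimal sketch, where panel_plex is a hypothetical per-study setting (not part of the data loaded above) that would replace the hard-coded threshold:

# hypothetical per-study setting; 20 counts for 1000-plex, 50 for 6000-plex:
panel_plex <- 1000
count_threshold <- ifelse(panel_plex >= 6000, 50, 20)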

Then we’ll look for implausibly large cells, which often indicate segmentation errors. You can run a formal test for outliers on the vector of cell areas, or you can choose a reasonable threshold yourself:

# what's the distribution of areas?
hist(metadata$Area, breaks = 100, xlab = "Cell Area", main = "")
# based on the above, set a threshold:
area_threshold <- 30000
abline(v = area_threshold, col = "red")

# flag cells based on area:
flag <- flag | (metadata$Area > area_threshold)
table(flag)
#> flag
#>  FALSE   TRUE 
#> 113319   1900
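
If you’d rather not pick the area cutoff by eye, a formal outlier rule can be used instead. Below is a minimal sketch using a median-plus-MADs rule; this is just one reasonable choice, and the factor of 5 is arbitrary:

# automated alternative to the visual threshold: flag cells whose area exceeds
# the median by more than 5 median absolute deviations
mad_cutoff <- median(metadata$Area) + 5 * mad(metadata$Area)
table(metadata$Area > mad_cutoff)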

FOV-level QC

We check FOVs for two indicators of artifacts: diminished total counts, and biased gene expression profiles. Procedures for both these checks, along with a detailed discussion, can be found in a new (as of April 2024) method described here. These procedures will appear in AtoMx in summer 2024.

To summarize briefly: for each “bit” in our barcodes (e.g. red in reporter cycle 8), we look for FOVs where genes using the bit are underexpressed. We also look for FOVs where total expression is suppressed.

Note that an additional approach to FOV QC, one looking at primary data accessible from AtoMx through S3, will be described in a future CosMx Analysis Scratch Space post.

Below we’ll run our FOV QC code and examine its results:

## preparing resources:
# source the FOV QC tool:
source("https://raw.githubusercontent.com/Nanostring-Biostats/CosMx-Analysis-Scratch-Space/Main/_code/FOV%20QC/FOV%20QC%20utils.R")
# load necessary information for the QC tool: the gene to barcode map:
allbarcodes <- readRDS(url("https://github.com/Nanostring-Biostats/CosMx-Analysis-Scratch-Space/raw/Main/_code/FOV%20QC/barcodes_by_panel.RDS"))
names(allbarcodes)
#> [1] "Hs_IO"    "Hs_UCC"   "Hs_6k"    "Mm_Neuro" "Mm_UCC"
# get the barcodes for the panel we want:
barcodemap <- allbarcodes$Hs_UCC
head(barcodemap)
#>    gene                          barcode
#> 1  AATK ....GG......BB........YY......YY
#> 2  ABL1 ............YYBBYY..........RR..
#> 3  ABL2 ......RR........BB........RR..RR
#> 4 ACACB BB............YY..YY....YY......
#> 5   ACE ....BB....BB........RR..RR......
#> 6 ACKR1 ......RR......GGRR......BB......

## run the method:
fovqcresult <- runFOVQC(counts = counts, xy = xy, fov = metadata$FOV, barcodemap = barcodemap, max_prop_loss = 0.3) 
#> The following FOVs failed QC for one or more barcode positions: s2f11

The plots and summaries below show which FOVs were flagged:

# map of flagged FOVs:
mapFlaggedFOVs(fovqcresult)


# list FOVs flagged for any reason, for loss of signal, for bias:
fovqcresult$flaggedfovs
#> [1] "s2f11"
fovqcresult$flaggedfovs_fortotalcounts
#> character(0)
fovqcresult$flaggedfovs_forbias
#> [1] "s2f11"

We only flagged FOVs for bias, not for signal loss, but the map of signal strength is still worth examining:

FOVSignalLossSpatialPlot(fovqcresult) 

Signal strength does vary, but it varies smoothly with the tissue biology rather than sharply at FOV borders, and no FOV is uniformly suppressed.

The function did flag one FOV, however: not for signal loss, but for biased gene expression. Specifically, counts from one barcode position appear suppressed in this FOV. Let’s look in more detail:

# spatial plots of flagged reporter positions:
par(mfrow = c(2,2))
FOVEffectsSpatialPlots(res = fovqcresult, outdir = NULL, bits = "flagged_reportercycles") 

The above plots show the view from which the algorithm flags FOVs for failed reporter cycles. It looks not at single genes, but at the collection of all genes using a given barcode bit (reporter cycle + color). Each FOV is cut into a 7x7 grid, and each grid square is colored by how strongly the total counts from a barcode bit diverge from those of similar grid squares in other FOVs.

In the above plots, we see strong loss of expression across all grid squares of the flagged FOV (highlighted yellow) in all 4 colors from reporter cycle 11. In other FOVs, we see increased (red) and decreased (blue) expression, but these changes are more modest, and they vary smoothly over space rather than with the sharp boundaries of FOVs, so we attribute these changes to biology and not technical artifacts.

A final useful plot shows which barcode bit x FOV combinations showed evidence of lost signal:

FOVEffectsHeatmap(fovqcresult) 

The FOV QC result also reports which genes are impacted in any flagged FOVs:

# genes from all barcode bits that were flagged in any FOV:
print(unique(fovqcresult$flagged_fov_x_gene[,"gene"]))
#>   [1] "ACACB"      "ACKR4"      "ACTA2"      "ACTG2"      "ACVR1"     
#>   [6] "ACVR1B"     "ADGRE2"     "ADGRF1"     "ADGRF3"     "ADGRG3"    
#>  [11] "ADGRG5"     "ADGRV1"     "ADIPOQ"     "ADIRF"      "AGR2"      
#>  [16] "AIF1"       "ALCAM"      "ANGPTL1"    "ANKRD1"     "APOA1"     
#>  [21] "ARID5B"     "ATF3"       "ATG10"      "ATP5F1E"    "AXL"       
#>  [26] "BCL2"       "BCL2L1"     "BECN1"      "CACNA1C"    "CALD1"     
#>  [31] "CALM1"      "CAV1"       "CCL21"      "CCR10"      "CCR2"      
#>  [36] "CCRL2"      "CD164"      "CD27"       "CD274"      "CD276"     
#>  [41] "CD37"       "CD3D"       "CD44"       "CD53"       "CD5L"      
#>  [46] "CD63"       "CD8A"       "CDH1"       "CDH11"      "CDKN1A"    
#>  [51] "CEACAM1"    "CELSR1"     "CELSR2"     "CHEK2"      "CLDN4"     
#>  [56] "CLEC2B"     "CLEC2D"     "CLOCK"      "CLU"        "COL12A1"   
#>  [61] "COL14A1"    "COL16A1"    "COL17A1"    "COL4A2"     "COL5A1"    
#>  [66] "CRYAB"      "CSF2RA"     "CUZD1"      "CXCL17"     "CXCR2"     
#>  [71] "CYP2U1"     "DCN"        "DHRS2"      "DPP4"       "EFNA1"     
#>  [76] "EGF"        "EIF5A/L1"   "ENTPD1"     "EOMES"      "EPCAM"     
#>  [81] "EPHA2"      "EPHA4"      "ESAM"       "FABP4"      "FAS"       
#>  [86] "FASLG"      "FES"        "FFAR2"      "FFAR4"      "FGF1"      
#>  [91] "FGF13"      "FGF7"       "FGFR3"      "FGR"        "FKBP5"     
#>  [96] "FOXF1"      "FOXP3"      "FYB1"       "FYN"        "FZD1"      
#> [101] "FZD5"       "GDF15"      "GLUD1"      "GPX1"       "GPX3"      
#> [106] "HCST"       "HDAC1"      "HEY1"       "HIF1A"      "HLA-DQB1/2"
#> [111] "HLA-DRA"    "HSD17B2"    "IAPP"       "IER3"       "IFIH1"     
#> [116] "IFIT1"      "IFIT3"      "IFNAR2"     "IFNGR2"     "IFNL2/3"   
#> [121] "IGFBP6"     "IGHA1"      "IGKC"       "IL11"       "IL17A"     
#> [126] "IL17RA"     "IL1B"       "IL1R2"      "IL1RL1"     "IL24"      
#> [131] "IL27RA"     "IL2RG"      "IL32"       "IL34"       "IL4R"      
#> [136] "IL6"        "IL7R"       "ITGA1"      "ITGAL"      "ITM2A"     
#> [141] "JUN"        "KITLG"      "KLRB1"      "KRAS"       "KRT6A/B/C" 
#> [146] "KRT80"      "KRT86"      "LAIR1"      "LAMP2"      "LAMP3"     
#> [151] "LGALS3"     "LGALS9"     "LIF"        "LINC01781"  "LINC01857" 
#> [156] "LTB"        "LTBR"       "LY6D"       "MAP1LC3B/2" "MAPK14"    
#> [161] "MECOM"      "MGP"        "MHC I"      "MRC2"       "MX1"       
#> [166] "MYH11"      "MZT2A/B"    "NCR1"       "NDRG1"      "NEAT1"     
#> [171] "Negative03" "Negative07" "NFKB1"      "NLRP1"      "NPR2"      
#> [176] "NR1H3"      "OAS2"       "OLR1"       "OSMR"       "P2RX5"     
#> [181] "PDGFC"      "PDGFRA"     "PECAM1"     "PLAC9"      "POU5F1"    
#> [186] "PPARA"      "PPARD"      "PPIA"       "PSCA"       "PTGES"     
#> [191] "PTGES3"     "PTGIS"      "QRFPR"      "RAMP1"      "RAMP3"     
#> [196] "RARG"       "RB1"        "RYK"        "RYR2"       "S100A10"   
#> [201] "S100A8"     "SEC23A"     "SEC61G"     "SERPINA1"   "SERPINA3"  
#> [206] "SIGIRR"     "SMAD4"      "SMARCB1"    "SMO"        "SOX9"      
#> [211] "SRC"        "SRGN"       "STAT1"      "STAT3"      "STAT5A"    
#> [216] "STAT5B"     "SYK"        "TCF7"       "TFEB"       "TGFB3"     
#> [221] "TGFBR2"     "THBS1"      "TLR7"       "TNF"        "TNFRSF10A" 
#> [226] "TNFRSF21"   "UBA52"      "UBE2C"      "VHL"        "VPREB3"    
#> [231] "VSIR"       "VWF"        "WNT10B"     "YES1"       "ZFP36"

Thus, although we only flagged one reporter cycle, i.e. 4 barcode bits, over a quarter of the gene panel relies on at least one of those bits, and none of these genes can be analyzed in this FOV. The simplest way forward is to flag and remove all cells in this FOV:

flag <- flag | is.element(metadata$FOV, fovqcresult$flaggedfovs)
table(flag)
#> flag
#>  FALSE   TRUE 
#> 111804   3415

Advanced users may wish to retain FOVs with detectable but modest bias. If you take that route, take great care that your analyses are robust to this kind of artifact.
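
For example, rather than dropping a flagged FOV, you could keep its cells and track which genes are unreliable there, excluding those gene x FOV combinations from downstream analyses. A rough sketch, assuming flagged_fov_x_gene carries an "fov" column alongside the "gene" column used above, with EPCAM chosen purely as an illustrative gene:

# sketch: keep cells from flagged FOVs, but exclude unreliable gene x FOV combinations
# (assumes flagged_fov_x_gene also has an "fov" column alongside "gene")
impacted <- as.data.frame(fovqcresult$flagged_fov_x_gene)
# e.g. when analyzing a single gene, drop cells from FOVs where that gene was impacted:
goi <- "EPCAM"  # illustrative gene; it appears in the flagged list above
usable <- !is.element(metadata$FOV, impacted$fov[impacted$gene == goi])
table(usable)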

Remove flagged cells and FOVs

Now that we’ve flagged cells we don’t want to analyze, we remove them from all of our data objects. All of these objects were written out in the same cell order, so we subset each of them with the same flag vector; to avoid any risk of misalignment, it’s worth confirming that alignment before subsetting.

# how many cells are we flagging?
table(flag)
#> flag
#>  FALSE   TRUE 
#> 111804   3415
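# optional sanity check before subsetting: all objects should hold the same cells
# in the same order as the metadata (a minimal check; comparing cell IDs stored in
# rownames, if your objects have them, is stricter)
stopifnot(nrow(counts) == nrow(metadata),
          nrow(negcounts) == nrow(metadata),
          nrow(falsecounts) == nrow(metadata),
          nrow(xy) == nrow(metadata))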
# subset all data objects to only the cells to be kept:
counts <- counts[!flag, ]
negcounts <- negcounts[!flag, ]
falsecounts <- falsecounts[!flag, ]
metadata <- metadata[!flag, ]
xy <- xy[!flag, ]
# overwrite saved data with filtered data:
# note these files are for convenience during analysis, and are not a NanoString-supported format
saveRDS(counts, paste0(mydir, "/processed_data/counts.RDS"))
saveRDS(negcounts, paste0(mydir, "/processed_data/negcounts.RDS"))
saveRDS(falsecounts, paste0(mydir, "/processed_data/falsecounts.RDS"))
saveRDS(metadata, paste0(mydir, "/processed_data/metadata.RDS"))
saveRDS(xy, paste0(mydir, "/processed_data/xy.RDS"))

These newly saved data objects omit the flagged cells. The original data can be recovered from AtoMx or the exported flat files.

Normalization

Normalization is straightforward: we simply divide every cell’s expression profile by its total counts. To put our counts back on a more natural scale, we then multiply all cells by the mean cell’s total counts. This can be done efficiently using matrix multiplication, which avoids converting the sparse (memory-efficient) matrix into a dense matrix.

scaling_factor <- mean(metadata$nCount_RNA)
# scale each cell (a row of counts) by scaling_factor / its total counts,
# using a sparse diagonal matrix so the counts matrix stays sparse:
norm <- Matrix::Diagonal(x = scaling_factor / metadata$nCount_RNA, names = rownames(counts)) %*% counts
saveRDS(norm, paste0(mydir, "/processed_data/norm.RDS"))
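
As a quick sanity check, every cell’s normalized expression profile should now sum to (approximately) the scaling factor, assuming nCount_RNA matches each cell’s total counts:

# each row of norm should sum to roughly the scaling factor:
summary(rowSums(norm))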

Note: if you’ll run the same pipeline over many studies, replace the mean(metadata$nCount_RNA) term with a fixed value shared by all studies. Any positive value produces equivalent data (up to a constant scale factor), but a value approximately equal to the average cell’s total counts puts the data on an easily interpretable scale. Neglecting this step will leave different studies normalized to different levels, complicating comparisons of data generated by different pipeline runs.
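
For example (the value below is purely illustrative, not a recommendation):

# fixed scaling factor shared across all studies run through this pipeline;
# choose something close to a typical cell's total counts for your panel:
scaling_factor <- 500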