Data preparation

First we’ll load the packages we need. Importantly, we’ll load “Matrix”, which handles sparse matrices. When a matrix has mainly zeroes, which is generally the case for single cell data, storing it in sparse matrix format has huge memory savings. Given the size of CosMx data, it is vital to be memory-efficient.

Note: we’ll store our count matrices with cells in rows and genes in columns. Many protocols arrange their count matrix the other way around, but our approach here will be more performant for most operations.

# necessary libraries
library(data.table) # for more memory efficient data frames
#> Warning: package 'data.table' was built under R version 4.4.0
library(Matrix) # for sparse matrices like our counts matrix

Now let’s see the files we have to work with:

# location of flat files:
myflatfiledir <- "../data/flatFiles"
dir(myflatfiledir)
#> [1] "MAR19_SlideAge_SkinCancer_4C_CP_slide4"
#> [2] "MAR19_SlideAge_SkinCancer_RT_CP_slide2"
# where to write output:
outdir <- ".."

So there are two slides, each with 4 files: an expression matrix, cell metadata, cell polygons, and a map of fov positions. Note we have omitted the transcript locations file from this analysis; that data can be useful for visualizations, but is unnecessary for the analyses we demonstrate here.

Now we’ll load in the flat files and format as we want.

Trying to load an entire matrix into memory at once, may cause out-of-memory issues for large files.
Here’s some code below which gives an example of loading in the counts matrices in chunks.
* First, we’ll use data.table::fread function which can directly read gzipped files, and determine how many rows and columns we’re working with.
* Then, we’ll read each chunk of data in, and convert it to a low-memory sparse matrix format before moving onto the next chunk.
* At the end, we’ll combine our sparse matrices together.

### automatically get slide names:
slidenames <- dir(myflatfiledir)

### lists to collect the counts matrices and metadata, one per slide
countlist <- vector(mode='list', length=length(slidenames)) 
metadatalist <- vector(mode='list', length=length(slidenames)) 

for(i in 1:length(slidenames)){
   
  slidename <- slidenames[i] 

  msg <- paste0("Loading slide ", slidename, ", ", i, "/", length(slidenames), ".")
  message(msg)    
  # slide-specific files:
  thisslidesfiles <- dir(paste0(myflatfiledir, "/", slidename))
  
  # load in metadata:
  thisslidesmetadata <- thisslidesfiles[grepl("metadata\\_file", thisslidesfiles)]
  tempdatatable <- data.table::fread(paste0(myflatfiledir, "/", slidename, "/", thisslidesmetadata))
  
  # numeric slide ID 
  slide_ID_numeric <- tempdatatable[1,]$slide_ID 
    
  # load in counts as a data table:
  thisslidescounts <- thisslidesfiles[grepl("exprMat\\_file", thisslidesfiles)]
  
  countsfile <- paste0(myflatfiledir, "/", slidename, "/", thisslidescounts)
  nonzero_elements_perchunk <- 5*10**7
  ### Safely read in the dense (0-filled ) counts matrices in chunks.
  ### Note: the file is gzip compressed, so we don't know a priori the number of chunks needed.
  lastchunk <- FALSE 
  skiprows <- 0
  chunkid <- 1
  
  required_cols <- fread(countsfile, select=c("fov", "cell_ID"))
  stopifnot("columns 'fov' and 'cell_ID' are required, but not found in the counts file" = 
              all(c("cell_ID", "fov") %in% colnames(required_cols)))
  number_of_cells <- nrow(required_cols)
  
  number_of_cols <-  ncol(fread(countsfile, nrows = 2))
  number_of_chunks <- ceiling(number_of_cols * number_of_cells / (nonzero_elements_perchunk))
  chunk_size <- floor(number_of_cells / number_of_chunks)
  sub_counts_matrix <- vector(mode='list', length=number_of_chunks)
   
  pb <- txtProgressBar(min = 0, max = number_of_chunks, initial = 0, char = "=",
                       width = NA, title, label, style = 3, file = "")
  cellcount <- 0
  while(lastchunk==FALSE){
    read_header <- FALSE
    if(chunkid==1){
      read_header <- TRUE
    }
    
    countsdatatable <- data.table::fread(countsfile,
                                         nrows=chunk_size,
                                         skip=skiprows + (chunkid > 1),
                                         header=read_header)
    if(chunkid == 1){
      header <- colnames(countsdatatable)
    } else {
      colnames(countsdatatable) <- header
    }
     
    cellcount <- nrow(countsdatatable) + cellcount     
    if(cellcount == number_of_cells) lastchunk <- TRUE
    skiprows <- skiprows + chunk_size
    slide_fov_cell_counts <- paste0("c_",slide_ID_numeric, "_", countsdatatable$fov, "_", countsdatatable$cell_ID)
    sub_counts_matrix[[chunkid]] <- as(countsdatatable[,-c("fov", "cell_ID"),with=FALSE], "sparseMatrix") 
    rownames(sub_counts_matrix[[chunkid]]) <- slide_fov_cell_counts 
    setTxtProgressBar(pb, chunkid)
    chunkid <- chunkid + 1
  }
  
  close(pb)   
  
  countlist[[i]] <- do.call(rbind, sub_counts_matrix) 
  
  # ensure that cell-order in counts matches cell-order in metadata   
  slide_fov_cell_metadata <- paste0("c_",slide_ID_numeric, "_", tempdatatable$fov, "_", tempdatatable$cell_ID)
  countlist[[i]] <- countlist[[i]][match(slide_fov_cell_metadata, rownames(countlist[[i]])),] 
  metadatalist[[i]] <- tempdatatable 
  
  # track common genes and common metadata columns across slides
  if(i==1){
    sharedgenes <- colnames(countlist[[i]]) 
    sharedcolumns <- colnames(tempdatatable)
  }  else {
    sharedgenes <- intersect(sharedgenes, colnames(countlist[[i]]))
    sharedcolumns <- intersect(sharedcolumns, colnames(tempdatatable))
  }
    
}
#> Loading slide MAR19_SlideAge_SkinCancer_4C_CP_slide4, 1/2.
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
#> Loading slide MAR19_SlideAge_SkinCancer_RT_CP_slide2, 2/2.
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%

# reduce to shared metadata columns and shared genes
for(i in 1:length(slidenames)){
  metadatalist[[i]] <- metadatalist[[i]][, ..sharedcolumns]
  countlist[[i]] <- countlist[[i]][, sharedgenes]
}
 
counts <- do.call(rbind, countlist)
metadata <- rbindlist(metadatalist)

# add to metadata: add a global non-slide-specific FOV ID:
metadata$FOV <- paste0("s", metadata$slide_ID, "f", metadata$fov)

# remove cell_ID metadata column, which only identifies cell within slides, not across slides:
metadata$cell_ID <- NULL

# isolate negative control matrices:
negcounts <- counts[, grepl("Negative", colnames(counts))]
falsecounts <- counts[, grepl("SystemControl", colnames(counts))]

# reduce counts matrix to only genes:
counts <- counts[, !grepl("Negative", colnames(counts)) & !grepl("SystemControl", colnames(counts))]

Lets take a look at the structure of the dataset we’ve assembled:

atomxdata <- list(counts = counts,
                  metadata = metadata,
                  negcounts = negcounts,
                  falsecounts = falsecounts)
str(atomxdata)
#> List of 4
#>  $ counts     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:25061453] 14 24 31 32 39 48 54 102 119 134 ...
#>   .. ..@ p       : int [1:1001] 0 9509 35265 62128 72640 83622 89279 103387 114777 154006 ...
#>   .. ..@ Dim     : int [1:2] 115219 1000
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : chr [1:115219] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#>   .. .. ..$ : chr [1:1000] "AATK" "ABL1" "ABL2" "ACACB" ...
#>   .. ..@ x       : num [1:25061453] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()
#>  $ metadata   :Classes 'data.table' and 'data.frame':    115219 obs. of  37 variables:
#>   ..$ fov                    : int [1:115219] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ Area                   : int [1:115219] 4436 4839 3618 2965 3060 5616 1667 3278 1913 8536 ...
#>   ..$ AspectRatio            : num [1:115219] 0.85 0.96 3.28 1.09 1.38 1.58 2.53 2.26 2.42 1.16 ...
#>   ..$ CenterX_local_px       : int [1:115219] 31 98 216 425 490 1316 1373 1461 1528 1652 ...
#>   ..$ CenterY_local_px       : int [1:115219] 37 38 19 31 27 39 15 20 16 54 ...
#>   ..$ Width                  : int [1:115219] 64 74 131 70 77 126 81 95 80 126 ...
#>   ..$ Height                 : int [1:115219] 75 77 40 64 56 80 32 42 33 109 ...
#>   ..$ Mean.PanCK             : int [1:115219] 219 118 192 160 69 5823 4649 6897 5618 5015 ...
#>   ..$ Max.PanCK              : int [1:115219] 1192 1108 640 852 364 24028 9496 17380 11280 15952 ...
#>   ..$ Mean.CD68              : int [1:115219] 922 443 569 518 614 173 153 144 201 182 ...
#>   ..$ Max.CD68               : int [1:115219] 4492 2784 1624 1332 1824 596 464 700 724 672 ...
#>   ..$ Mean.Membrane          : int [1:115219] 46 225 990 2740 701 5334 4659 3833 3613 8376 ...
#>   ..$ Max.Membrane           : int [1:115219] 1768 3956 6536 8048 6860 17772 14100 12852 10164 27984 ...
#>   ..$ Mean.CD45              : int [1:115219] 416 1283 1297 810 391 1442 1536 565 838 259 ...
#>   ..$ Max.CD45               : int [1:115219] 1620 6976 3968 2204 1608 7960 6584 2260 2948 3812 ...
#>   ..$ Mean.DAPI              : int [1:115219] 490 463 883 699 575 739 744 977 946 512 ...
#>   ..$ Max.DAPI               : int [1:115219] 1128 1188 1748 1476 1400 1604 1092 1512 1428 1164 ...
#>   ..$ cell_id                : chr [1:115219] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#>   ..$ assay_type             : chr [1:115219] "RNA" "RNA" "RNA" "RNA" ...
#>   ..$ version                : chr [1:115219] "v6" "v6" "v6" "v6" ...
#>   ..$ Run_Tissue_name        : chr [1:115219] "MAR19_SlideAge_SkinCancer_4C_CP_slide4" "MAR19_SlideAge_SkinCancer_4C_CP_slide4" "MAR19_SlideAge_SkinCancer_4C_CP_slide4" "MAR19_SlideAge_SkinCancer_4C_CP_slide4" ...
#>   ..$ Panel                  : chr [1:115219] "(1.0) (1.0) Human RNA Universal Cell Characterization; (1.0) (1.0) Human RNA Universal Cell Characterization Add-On" "(1.0) (1.0) Human RNA Universal Cell Characterization; (1.0) (1.0) Human RNA Universal Cell Characterization Add-On" "(1.0) (1.0) Human RNA Universal Cell Characterization; (1.0) (1.0) Human RNA Universal Cell Characterization Add-On" "(1.0) (1.0) Human RNA Universal Cell Characterization; (1.0) (1.0) Human RNA Universal Cell Characterization Add-On" ...
#>   ..$ cellSegmentationSetId  : chr [1:115219] "c03e2105-8124-46e3-91e0-72b3ca71b61f" "c03e2105-8124-46e3-91e0-72b3ca71b61f" "c03e2105-8124-46e3-91e0-72b3ca71b61f" "c03e2105-8124-46e3-91e0-72b3ca71b61f" ...
#>   ..$ cellSegmentationSetName: chr [1:115219] "Initial Segmentation" "Initial Segmentation" "Initial Segmentation" "Initial Segmentation" ...
#>   ..$ slide_ID               : int [1:115219] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ CenterX_global_px      : num [1:115219] 31 98 216 425 490 ...
#>   ..$ CenterY_global_px      : num [1:115219] 83877 83876 83895 83883 83887 ...
#>   ..$ unassignedTranscripts  : num [1:115219] 0.0494 0.0494 0.0494 0.0494 0.0494 ...
#>   ..$ nCount_RNA             : int [1:115219] 50 89 136 133 116 1009 252 590 342 1628 ...
#>   ..$ nFeature_RNA           : int [1:115219] 37 50 75 69 69 270 121 204 152 323 ...
#>   ..$ nCount_negprobes       : int [1:115219] 0 0 0 0 0 0 0 0 1 1 ...
#>   ..$ nFeature_negprobes     : int [1:115219] 0 0 0 0 0 0 0 0 1 1 ...
#>   ..$ nCount_falsecode       : int [1:115219] 0 0 2 1 1 1 1 4 1 18 ...
#>   ..$ nFeature_falsecode     : int [1:115219] 0 0 2 1 1 1 1 3 1 13 ...
#>   ..$ Area.um2               : num [1:115219] 64.2 70 52.3 42.9 44.3 ...
#>   ..$ cell                   : chr [1:115219] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#>   ..$ FOV                    : chr [1:115219] "s1f1" "s1f1" "s1f1" "s1f1" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>  $ negcounts  :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:61156] 9 165 169 177 211 217 268 274 278 333 ...
#>   .. ..@ p       : int [1:11] 0 5784 10671 14864 19998 30422 36424 46286 51522 55322 ...
#>   .. ..@ Dim     : int [1:2] 115219 10
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : chr [1:115219] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#>   .. .. ..$ : chr [1:10] "Negative1" "Negative10" "Negative2" "Negative3" ...
#>   .. ..@ x       : num [1:61156] 1 2 1 1 1 1 1 1 1 2 ...
#>   .. ..@ factors : list()
#>  $ falsecounts:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:601925] 204 237 373 378 615 622 642 686 711 744 ...
#>   .. ..@ p       : int [1:198] 0 759 1131 3234 6502 7150 7831 8715 10093 10520 ...
#>   .. ..@ Dim     : int [1:2] 115219 197
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : chr [1:115219] "c_1_1_1" "c_1_1_2" "c_1_1_3" "c_1_1_4" ...
#>   .. .. ..$ : chr [1:197] "SystemControl1" "SystemControl10" "SystemControl100" "SystemControl101" ...
#>   .. ..@ x       : num [1:601925] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()

Now our data looks like this:

Next we save the data as .RDS files. These files are smaller on disk (they benefit from more compression than flat files do), and they’re faster to read back into R.

Please note that these files hold intermediate results for convenience during analysis; they are not official file formats supported by NanoString.

if(!dir.exists(paste0(outdir, "/processed_data"))){
  dir.create(paste0(outdir, "/processed_data"))
}
saveRDS(atomxdata$counts, file = paste0(outdir, "/processed_data/counts_unfiltered.RDS"))
saveRDS(atomxdata$negcounts, file = paste0(outdir, "/processed_data/negcounts_unfiltered.RDS"))
saveRDS(atomxdata$falsecounts, file = paste0(outdir, "/processed_data/falsecounts_unfiltered.RDS"))
saveRDS(atomxdata$metadata, file = paste0(outdir, "/processed_data/metadata_unfiltered.RDS"))

Now we have data in 3 locations:

The RDS files we just created
The flat files we began with
The original data on AtoMx

This is unnecessary duplication, and since CosMx data is so large, storing extra copies can incur real cost. There are two feasible approaches:

Once you’ve finished your analysis, delete all the large .RDS files created during it. They can always be regenerated by re-running the analysis scripts. If you are computing on the cloud, re-running does incur some compute costs, but these are minor compared to the cost of indefinitely storing large files.
Once you’ve confirmed this script data loading script has run successfully, you can safely delete the flat files - you won’t be using them again. If you do need to go back to them, you can always export them from AtoMx again: AtoMx acts as a persistent, authoritative source of unmodified data.

Note on data size

A reasonably high-performance compute instance can generally handle a few slides of CosMx data using the scripts we use here. But once studies get to millions of cells, they can overwhelm the memory available to most analysts. At this point, you have two options:

Use data formats that store data on disk, not in memory, e.g. TileDB, SeuratDisk, hdf5…
Analyze your data one tissue at a time. Put all tissues through a standard analysis pipeline, then, for multi-tissue analyses, judiciously load in just the data you need for a given analysis.

Neither option is terribly convenient, but having millions or tens of millions of cells to analyze is worth some hassle.