1 Introduction – A Guide to Whole Transcriptome CosMx® SMI Analysis

Our journey into the frontier of spatial biology begins! But before we jump right into the code, it’s important to build a foundational understanding of the technology that generates our data, how to access that data, and the biological context of the sample we’ll be analyzing throughout this guide.

1.1 What is the CosMx® Spatial Molecular Imager?

Traditional methods like bulk and single-cell RNA sequencing have revolutionized biology, but they come with a critical limitation: by dissociating tissue into a suspension, they erase all spatial context. We learn which cells are present, but not where they were or how they were organized.

The CosMx SMI is an in-situ imaging technology designed to overcome this challenge. Instead of removing cells from the tissue, it measures RNA or protein molecules directly in their native environment, providing a high-plex, high-resolution map of cellular activity.

The underlying chemistry and hardware, detailed by He et al.¹, enable a multi-step process:

Labeling: Tissue sections on a glass slide are treated with a cocktail of probes designed to bind to specific target RNA molecules. These probes act as unique, programmable labels.
Imaging Cycles: The slide is placed in the instrument and imaged in a series of cycles. In each cycle, fluorescent “reporter” probes are washed over the tissue, lighting up a subset of the molecular labels.
Barcode Reading: An image is captured in each cycle. The pattern of “on” and “off” signals for a single RNA molecule across all cycles creates a unique optical barcode that identifies the gene target. The molecule’s (X, Y, Z) coordinates are precisely measured from its position in the high-resolution images.
Cell Segmentation: Finally, the identified transcripts and protein markers are assigned to individual cells. This segmentation process relies on immunofluorescence (IF) stains—such as DAPI to visualize nuclei and antibodies against proteins like CD298 to visualize cell membranes—that allow algorithms to accurately draw cell boundaries.
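To make the barcode-reading step above more concrete, here is a minimal toy sketch in Python. It is not the actual CosMx decoding algorithm: the codebook, the six-cycle barcode length, the gene assignments, and the `decode_spot` helper are all hypothetical, and the real encoding and decoding scheme described by He et al. is more involved. The sketch only shows the basic idea of matching a per-cycle on/off pattern against a lookup table while keeping each molecule's coordinates.

```python
# Toy illustration of optical barcode decoding; NOT the real CosMx algorithm.
# The codebook, barcode length, and gene assignments are made up for this example.
CODEBOOK = {
    (1, 0, 0, 1, 0, 1): "EPCAM",
    (0, 1, 1, 0, 0, 1): "CD3E",
    (1, 1, 0, 0, 1, 0): "COL1A1",
}


def decode_spot(signal):
    """Return the gene whose barcode matches the on/off signal, or None."""
    return CODEBOOK.get(tuple(signal))


# Each detected spot carries its (x, y, z) position and one on/off value per cycle.
spots = [
    {"xyz": (10.2, 4.1, 0.0), "signal": [1, 0, 0, 1, 0, 1]},
    {"xyz": (11.7, 9.3, 1.0), "signal": [0, 1, 1, 0, 0, 1]},
    {"xyz": (3.4, 2.2, 0.0), "signal": [1, 1, 1, 1, 1, 1]},  # matches no barcode
]

# Pair every spot's coordinates with its decoded gene (None when unmatched).
decoded = [(spot["xyz"], decode_spot(spot["signal"])) for spot in spots]

for xyz, gene in decoded:
    print(xyz, gene)
```

Spots whose pattern is not in the codebook come back as `None`; in this sketch they are simply left undecoded.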

The final output is a rich suite of data tables, including the foundational gene-by-cell expression matrix, spatial coordinates, and morphological features for every cell. This is complemented by the raw imaging data, which enables deeper quality control and visualization, as we’ll explore in later chapters.
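As a rough illustration of how a table of cell-assigned transcripts becomes an expression matrix, consider the sketch below. The transcript records, the use of cell ID 0 for unassigned transcripts, and the variable names are assumptions made for this example; they are not the AtoMx export format.

```python
from collections import Counter

# Hypothetical decoded transcripts as (cell_id, gene) pairs.
# In this toy example, cell_id 0 means "not assigned to any cell".
transcripts = [
    (1, "EPCAM"), (1, "EPCAM"), (1, "CD3E"),
    (2, "CD3E"), (2, "CD3E"),
    (0, "COL1A1"),
]

# Count transcripts per (cell, gene), skipping unassigned transcripts.
counts = Counter((cell, gene) for cell, gene in transcripts if cell != 0)

cells = sorted({cell for cell, _ in counts})
genes = sorted({gene for _, gene in counts})

# Dense count matrix: one row per cell, one column per gene.
# Counter returns 0 for (cell, gene) pairs that were never observed.
matrix = [[counts[(cell, gene)] for gene in genes] for cell in cells]

print(genes)
for cell, row in zip(cells, matrix):
    print(cell, row)
```

Real datasets hold many thousands of cells and genes, so this matrix is normally stored in a sparse format rather than as nested lists.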

1.2 Accessing CosMx SMI Data

After a CosMx experiment, the instrument uploads raw imaging data to the AtoMx® Spatial Informatics Platform (SIP), a cloud-based environment where the heavy computation of barcode decoding and cell segmentation occurs. Once this primary processing is complete, the data can be explored within the AtoMx SIP ecosystem or, as we will do, exported for custom downstream analysis.

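Exporting from AtoMx® SIP

For a detailed guide on navigating the AtoMx SIP interface and exporting your processed data, see the post: Using Squidpy with AtoMx® SIP exports (https://nanostring-biostats.github.io/CosMx-Analysis-Scratch-Space/posts/squidpy-essentials/squidpy-essentials.html).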

Public datasets are another invaluable resource. For example, Bruker provides a list of FFPE datasets (https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/), and this guide analyzes the human colon cancer dataset featured on that site.

1.3 Setting up Your Own Analysis Environment

Note

For this initial release the renv.lock, requirements.txt, and other helper files are not yet available. Check back later for the source code for this book.

This subsection is primarily for those who wish to use this book as a springboard for their own analysis.

When working in a mixed R and Python environment, I use the RStudio IDE and work within an analysis “Project”. While your version of R will likely be different, this book was built with R version 4.5.1 (2025-06-13), with packages managed by renv². Similarly, I’m using a pyenv virtual environment based on Python 3.10.18.

I highly recommend managing package versions within your project: this field is moving at a very rapid pace, and mismatched package versions can quickly create issues and conflicts. To “restore” the R packages, run renv::restore() within the project folder. To create the Python virtual environment using pyenv, type this into your terminal:

pyenv install 3.10.18
pyenv virtualenv 3.10.18 pyenv_simple
pyenv activate pyenv_simple
pip install -r requirements.txt

1: The pyenv install step assumes you already have pyenv installed. See Additional Resources for more information on pyenv.

To link this virtual environment to our project, I create a symbolic link to it inside the project directory and add the link’s path to the project’s .Renviron file.

For example, typing

pyenv versions

in the terminal will reveal the path where pyenv placed the virtual environment. For me it was:

 pyenv versions
  system
  3.10.18/envs/pyenv_simple
* pyenv_simple --> /opt/pyenv/versions/3.10.18/envs/pyenv_simple (set by /opt/pyenv/version)

Then, within the project directory, I can link pyenv_simple into the folder like so:

ln -s /opt/pyenv/versions/3.10.18/envs/pyenv_simple ./pyenv_simple

Now that a symbolic link to the virtual environment is present in the project directory, add this line to .Renviron and restart RStudio.

RETICULATE_PYTHON=pyenv_simple/bin/python
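If you want to double-check the link before restarting, a small script along these lines can help. This is only a sketch: the helper name `reticulate_python_from_renviron` is hypothetical, it handles only plain KEY=value lines (no quoting), and it assumes a relative RETICULATE_PYTHON value is meant relative to the project folder.

```python
from pathlib import Path


def reticulate_python_from_renviron(project_dir):
    """Return the resolved RETICULATE_PYTHON path from .Renviron, or None."""
    renviron = Path(project_dir) / ".Renviron"
    if not renviron.is_file():
        return None
    for line in renviron.read_text().splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip() == "RETICULATE_PYTHON":
            # Assumption: relative values are relative to the project folder.
            # resolve() follows the symbolic link to the real environment.
            return (Path(project_dir) / value.strip()).resolve()
    return None


# Run from the project directory to see where the link points.
target = reticulate_python_from_renviron(".")
if target is None:
    print("RETICULATE_PYTHON is not set in .Renviron")
else:
    print(f"{target} (exists: {target.exists()})")
```

If the printed path does not exist, the symbolic link or the .Renviron entry probably needs another look.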

The data itself is organized as follows: within the analysis_results folder there is a folder for input_files and another for output_files. The output_files folder contains a series of sub-folders with data, as well as a file named results_list.rds that I’ll populate throughout and use to reference, for example, the number of cells or other “small” statistics that will be handy.
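For readers setting up the same layout from scratch, the folder skeleton can be created with a few lines of Python. This is a sketch under the folder names given above; the `make_analysis_layout` helper is hypothetical, and it does not create results_list.rds.

```python
from pathlib import Path


def make_analysis_layout(root="."):
    """Create analysis_results/input_files and analysis_results/output_files."""
    results = Path(root) / "analysis_results"
    input_dir = results / "input_files"
    output_dir = results / "output_files"
    for folder in (input_dir, output_dir):
        # exist_ok=True makes the call safe to repeat on an existing project
        folder.mkdir(parents=True, exist_ok=True)
    return input_dir, output_dir
```

Calling `make_analysis_layout()` from the project root returns the two folder paths, which can then be reused when reading inputs and writing outputs.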

Finally, since Quarto renders each chapter in isolation, at the beginning of each analysis chapter we will add a “preamble” that loads common R scripts, Python functions, and the results_list.rds file.

1.4 Our Example Dataset

Throughout this book, we’ll examine these foundational WTX analyses by looking at a 400-field-of-view (FOV) section of FFPE colon adenocarcinoma. This tissue was analyzed using Bruker Spatial Biology’s pre-commercial WTX CosMx® SMI assay and is available online (https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/cosmx-human-whole-transcriptome-colon-dataset/). The primary biological questions driving this analysis are:

What are the major spatial domains within this sample and what cell types do they contain?
What pathways are enriched within these domains?

Subtitle: Technology, Data, and Your Workspace
Author: Evelyn Metzger (ORCID: 0000-0002-4074-9003)
