4 Normalization, Reduction, and Clustering – A Guide to Whole Transcriptome CosMx<sup>®</sup> SMI Analysis

R code

source("./preamble.R")
reticulate::source_python("./preamble.py")
analysis_dir <- file.path(getwd(), "analysis_results")
input_dir <- file.path(analysis_dir, "input_files")
output_dir <- file.path(analysis_dir, "output_files")
analysis_asset_dir <- "./assets/analysis_results"
qc_dir <- file.path(output_dir, "qc")
if(!dir.exists(qc_dir)){
  dir.create(qc_dir, recursive = TRUE)
}
pp_dir <- file.path(output_dir, "processing")
if(!dir.exists(pp_dir)){
  dir.create(pp_dir, recursive = TRUE)
}
results_list_file = file.path(analysis_dir, "results_list.rds")
if(!file.exists(results_list_file)){
  results_list <- list()
  saveRDS(results_list, results_list_file)
} else {
  results_list <- readRDS(results_list_file)  
}
source("./helpers.R") # contains the plotDots function

Our quality control steps in Chapter 3 identified the analyzable cells, but these their raw counts need further processing to reveal biological patterns. This chapter covers the essential pre-processing steps to make the data interpretable.

First, normalization adjusts counts to account for technical factors like cell area (i.e., larger cells typically have more counts), enabling fair comparisons between cells. Second, dimensionality reduction (using PCA and UMAP) helps reduce the total number of dimensions in the data (almost 19,000) to a manageable representation that captures key biological variation and allows visualization. Finally, clustering leverages this reduced space to group cells with similar expression profiles, which can be a foundation for downstream analyses like cell typing.

By the end of this chapter, we will have transformed the filtered count matrix into a clustered, low-dimensional dataset and classified cells into groups which, taken together, form our foundation for biological interpretation and spatial analysis. I often get asked “What is the best parameter combination for CosMx data?” and I hope by the end of this chapter you’re able to understand how parameter adjustments impact many of these processing steps.

Let’s begin by loading the data.

Python Code

adata = ad.read_h5ad(os.path.join(r.analysis_dir, "anndata-1-qc-filtered.h5ad"))

4.1 Two pre-processing workflows

There are two different pre-processing workflows that I use. The first one is borrowed from spatially-unaware scRNA-seq. The main advantage of this approach is that it is fast, the normalization matrix can remain sparse, the values are easy to understand (e.g., counts of gene X per 10,000 or some transformation of such), and often times for WTX data provides the basis for good clustering of cells in downstream UMAP. Moreover, Scanpy and Seurat each have built-in methods for these steps which make it relatively low friction. However, there are cases were this standard (“legacy”) workflow provides unexpected results. One reason is that total counts-based normalization might have unstable variance¹ such that high-expressing housekeeping genes might have high variance and disproportionately effect downstream dimensional reduction and clustering.

The second, newer workflow that I have been using is based on normalization using Pearson Residuals (PR). The approach itself is not new and scanpy has a built-in method for it (i.e., scanpy.experimental.pp.normalize_pearson_residuals). Until recently the main disadvantage for using PR is that it doesn’t scale well to large number of cells that are typical in CosMx SMI¹ and it creates a large, dense normalization matrix that can make analysis and read/writing slower. Fortunately, in many cases I do not need to compute and store such a dense matrix. For example, Dan McGuire recently created the R package scPearsonPCA that can estimate the Principle Components (PCs) of the PR normalized data without needing to convert and use very dense matrices (which we will show below). While this workflow involves converting data between python and R, I generally find the clusters that are derived from such data are comparable to the standard workflow or superior and so I recommend this workflow overall.

In this chapter, I’ll run both the standard and recommended workflows on the colon cancer dataset to provide code for both approaches. I will also do a deep-dive on the choice of UMAP parameters. This particular dataset is a case where the results are more-or-less comparable and I’ll pick the PR data to use throughout the rest of the colon analysis.

Let’s begin. Both options begin the same way: computing the highly variable genes (HVGs). I have found that the pearson_residulas method to detect HVGs (not to be confused with PR normalization) provides stable results for a variety of CosMx SMI WTX experiments.

Python Code

adata.layers["counts"] = adata.X.copy()

sc.experimental.pp.highly_variable_genes(
    adata,
    flavor='pearson_residuals',
    n_top_genes=4000,
    layer='counts'
)

1: the pearson_residuals ‘flavor’ implement in this function is not the same as the pearson residuals normalization method (see below).
2: generally 3000-5000 is a good starting point here.

Alternative Approach

The pre-processing workflow presented here can be considered a good overall starting point towards understanding the structure of the data. As such I have omitted more complex workflows and tasks such as integrating morphology-based and/or protein-based PCs into the analysis. For a more involved workflow, see our upcoming WTX paper.

The tabs below show both of these workflows. In each workflow, I set the umap parameters to be identical (40 PCs, 15 nearest neighbors, spread = 2 and minimum distance = 0.02). While these are fixed here, see the box Choosing UMAP Parameters below to see how these parameters effect the UMAP visual and the Leiden cluster assignments.

Workflow: Total counts-based normalization > log transformation > PCA

The procedure below adapts a standard scanpy pipeline. We’ll first normalize each cell the median total counts and then log tranform the data. With this approach I find that scaling the log1p data tends to generate more distinct and biologically relevant clusters so I add that step. Thus, we’ll focus our analysis on the most variable genes found in the dataset. After subsetting down to only these 4000 HVGs, we’ll scale each gene, run PCA, and put these HVG-derivied PCs into the full annData object. On this dataset (466820 cells), this whole process took about a minute.

Python Code

sc.pp.normalize_total(adata)
sc.pp.log1p(adata) # -> adata.X
adata.layers['TC'] = adata.X.copy()

# Subset to HVGs and get PCs
adata_hvg = adata[:, adata.var.highly_variable].copy()
sc.pp.scale(adata_hvg) # -> adata.X
sc.tl.pca(adata_hvg, svd_solver='auto') 
adata.obsm['X_pca_TC'] = adata_hvg.obsm['X_pca']

Run UMAP and Leiden with the top 40 PCs.

Python Code

sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40, metric = 'cosine',
  use_rep='X_pca_TC', key_added='neighbors_TC')

UMAP = sc.tl.umap(adata, min_dist=0.02, spread=2, neighbors_key='neighbors_TC', copy=True)

adata.obsm['umap_TC'] = UMAP.obsm['X_umap']
adata.uns['umap_TC_params'] = UMAP.uns['umap']

sc.tl.leiden(adata, resolution=0.5, 
  key_added='leiden_tc', flavor='igraph', 
  n_iterations=2, neighbors_key='neighbors_TC')

Workflow: Py to R > scPearsonPCA

The scPearsonPCA package estimates the PCs of the PR data without computing and storing a dense normalized matrix. Since it is written in R, we’ll add it to this workflow via reticulate after converting a few objects.

Python Code

total_counts_per_cell = adata.layers['counts'].sum(axis=1)
total_counts_per_cell = np.asarray(total_counts_per_cell).flatten()

total_counts_per_target = adata.layers['counts'].sum(axis=0)
total_counts_per_target = np.asarray(total_counts_per_target).flatten()

target_frequencies = total_counts_per_target / total_counts_per_target.sum()

adata_hvg = adata[:, adata.var.highly_variable].copy()
hvgs_counts = adata_hvg.layers['counts'].copy().astype(np.int64)

Convert relevant data to R.

R Code

library(Matrix)
hvgs_counts <- py$hvgs_counts
rownames(hvgs_counts) <- py$adata_hvg$obs_names$to_list()
colnames(hvgs_counts) <- py$adata_hvg$var_names$to_list()

# the approach we use below expects a sparse matrix with genes (rows) by cells (columns)
t_hvgs_counts <- Matrix::t(hvgs_counts)
t_hvgs_counts <- as(t_hvgs_counts, "CsparseMatrix")

Install the scPearsonPCA package (and remotes), if needed.

R Code

remotes::install_github("Nanostring-Biostats/CosMx-Analysis-Scratch-Space",
                         subdir = "_code/scPearsonPCA", ref = "Main")

Run Seurat-style PCA using scPearsonPCA.

R Code

library(scPearsonPCA)

total_counts_per_cell <- py$total_counts_per_cell

target_frequencies <- py$target_frequencies
names(target_frequencies) <- py$adata$var_names$to_list()

pcaobj <- sparse_quasipoisson_pca_seurat(
              x = t_hvgs_counts,
              totalcounts = total_counts_per_cell,
              grate = target_frequencies[rownames(t_hvgs_counts)],
              scale.max = 10,
              do.scale = TRUE,
              do.center = TRUE,
              ncores = 5
          )
pca <- pcaobj$reduction.data@cell.embeddings

Add these PCs to the annData object in Python. Run UMAP and Leiden with the top 40 PCs.

Python Code

adata.obsm["X_pca_pr"] = r.pca

sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40, metric = 'cosine',
  use_rep='X_pca_pr', key_added='neighbors_pr')

UMAP = sc.tl.umap(adata, min_dist=0.02, spread=2, neighbors_key='neighbors_pr', copy=True)
adata.obsm['umap_pr'] = UMAP.obsm['X_umap']
adata.uns['umap_pr_params'] = UMAP.uns['umap']

sc.tl.leiden(adata, resolution=0.5, 
  key_added='leiden_pr', flavor='igraph', 
  n_iterations=2, neighbors_key='neighbors_pr')

Save the annData object.

Python Code

adata.write_h5ad(
  os.path.join(r.analysis_dir, "anndata-2-leiden.h5ad"),
  compression=hdf5plugin.FILTERS["zstd"],
  compression_opts=hdf5plugin.Zstd(clevel=5).filter_options
)

4.2 Visualizations

Now that we have the UMAP embeddings and the Leiden cluster assignments in the annData object, we can plot the results of the two methods and compare. The easiest way to plot the Leiden clusters in UMAP space is with scnapy’s built-in method, sc.pl.umap; however, since we have non-typical obsm keys, we’ll use the more generic sc.pl.embedding method.

Python Code

import matplotlib.pyplot as plt
sc.settings.figdir = r.pp_dir
save_dir = r.pp_dir

adata.obs['log10_nCount_RNA'] = np.log10(adata.obs['nCount_RNA'] + 1)
adata.obs['qcflag'] = adata.obs['qcflag'].astype('category')

fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(15, 10))

sc.pl.embedding(adata,
                basis='umap_TC',
                color='leiden_tc',
                title='Total Counts (leiden_tc)',
                palette=sc.pl.palettes.default_20,
                ax=ax1,
                show=False)

sc.pl.embedding(adata,
                basis='umap_pr',
                color='leiden_pr',
                title='Pearson Residuals (leiden_pr)',
                palette=sc.pl.palettes.default_20,
                ax=ax2,
                show=False)
                
sc.pl.embedding(adata,
                basis='umap_TC',
                color='log10_nCount_RNA',
                title='Total Counts (log10_nCount_RNA)',
                ax=ax3,
                show=False)

sc.pl.embedding(adata,
                basis='umap_pr',
                color='log10_nCount_RNA',
                title='Pearson Residuals (log10_nCount_RNA)',
                ax=ax4,
                show=False)

sc.pl.embedding(adata,
                basis='umap_TC',
                color='qcflag',
                title='Total Counts (FOV QC Flag)',
                palette=sc.pl.palettes.default_20,
                ax=ax5,
                show=False)

sc.pl.embedding(adata,
                basis='umap_pr',
                color='qcflag',
                title='Pearson Residuals (FOV QC Flag)',
                palette=sc.pl.palettes.default_20,
                ax=ax6,
                show=False)

plt.tight_layout()
filename = "umap_umap_compare.png"
save_path = os.path.join(save_dir, filename)
plt.savefig(save_path)

Figure 4.1: UMAP with Leiden clusters from total counts normalization (left) and scPearsonPCA PCs (right). Note that a given cluster label in one plot is not the same as that label in the other plot. Top: Leiden clusters. Middle: cells colored by their total counts. Bottom: cells colored by their QC flag (0 = no flag; 32 = FOV flag; no other flaggeed cells were processed).

Both normalization methods demonstrate effective cluster separation (Figure 4.1). When evaluating these visualizations, it is important to watch for artifacts, particularly sub-clustering driven primarily by total transcript counts, which can signal suboptimal normalization. Additionally, because we retained cells flagged by FOV QC (flag 32), we must verify that these cells do not aggregate into spurious “satellite” clusters which could indicate technical artifacts rather than biological signal. In this dataset, while FOV QC-flagged cells are present in the upper right cluster (Leiden 5 in the total counts workflow and Leiden 11 in the Pearson Residuals workflow), they do not form a distinct artifactual group. Instead, they appear integrated with the dominant cell population, suggesting their distribution is driven by the biology of that region – a hypothesis we will elucidate further in ?sec-celltyping.

We can compare which combination of Leiden clusters a given cell was assigned to:

import pandas as pd
import seaborn as sns

crosstab = pd.crosstab(
    adata.obs['leiden_tc'],
    adata.obs['leiden_pr']
)

crosstab_norm = crosstab.div(crosstab.sum(axis=1), axis=0)

plt.figure(figsize=(12, 10))
sns.heatmap(
    crosstab_norm,
    annot=True,
    fmt=".2f", cmap="viridis",
    linewidths=.5
)

plt.title("Comparison of Leiden clusters derived from TC or PR workflows", fontsize=16)
plt.ylabel("Total Counts (leiden_tc)", fontsize=12)
plt.xlabel("Pearson Residuals (leiden_pr)", fontsize=12)
filename = "leiden_compare_crosstab.png"
save_path = os.path.join(save_dir, filename)
plt.savefig(save_path)

Figure 4.2: Proportion of cells classified with the pure scanpy-based approach compared to the scPearsonPCA-based appraoch.

Figure 4.2 shows a relationship between the clusters derived from our two normalization approaches. Most (15 of 16) Leiden clusters with the total counts approach have more than 85% of cells that are found in a single Leiden cluster for the Pearson Residuals method (e.g., 95% of tc_3 cells mapped to pr_1). One cluster in the total counts approach (tc_1) “split” amongst PR clusters (primarily clusters pr_5, pr_6, and pr_8).

So which is better? With this particular dataset the two approaches are comparable. For the majority of my CosMx SMI WTX analyses I have found the Pearson Residual-based normalization to provide more meaningful biological clusters consistently across datasets so that might be a good overall recommendation – especially if you are working within the R ecosystem. If you are looking for a quick analysis, then the total counts-based approach may work in many cases.

Choosing UMAP Parameters

4.3 Choosing UMAP Parameters

UMAP helps visualize how groups of cells relate to each other in high-dimensional space. While there isn’t one “correct” UMAP layout, some embeddings are more effective than others at revealing underlying biological structure. We have seen in this chapter how having choice in normalization alone (via PCA) can alter the shape of a UMAP plot even when all other parameters are identical (Figure 4.1).

This section explores how commonly adjusted parameters and inputs other than the choice of normalization influence the resulting UMAP visualization, aiming to build intuition for how to tune these parameters effectively in your dataset.

Let’s systematically vary five key factors:

The quality filtering applied to the input cells. Using only high-quality cells often leads to clearer, tighter clusters representing core biological populations. If filtering was too stringent then entire cell populations might be unrepresented.
The number of Principal Components (PCs) used. Using a lower number will concetrate on the strongest axes of variation at the cost of reducing any distinction between cell subtypes if that distinction is found in higher PCs. Choosing a higher number of PCs can potentially resolve finer subtypes but also risks incorporating noise, which might slightly blur cluster boundaries or create small, potentially spurious groupings.
The number of nearest neighbors (n_neighbors). Lower n_neighbors tends to emphasize local cluster separation while higher n_neigbhors tends to form broader clusters.
The distance metric (metric). Defines the rule for calculating “closeness” between cells. Euclidean measures straight-line distance while cosine measures the angle between gene expression profiles. Since changing this alters which cells are considered “neighbors,” the parameter can have large impacts on the global shape. Euclidean is the default many packages but cosine often provides more biologically meaningful clusters for CosMx SMI data.
The minimum distance parameter (min_dist). Controls how tightly packed points are within a cluster. Higher values tend to make more diffuse clusters.
The spread parameter (spread). Controls the separation between clusters. Lower spread tends to bring clusters closer together.

This list isn’t exhaustive, but varying these while keeping others constant will illustrate their core effects. To make this exploration computationally feasible, we will use a representative subsample of 50,000 cells from our quality-flagged dataset. And to make visualization smooth, we’ll only plot at most 500 cells.

Click the radio buttons below to see what effect these parameter have on the UMAP visualization. Some of these changes are gradual while some dramatically alter the shape and configuration of the UMAP embeddings. You’ll note that filtering out the poor quality cells doesn’t merely remove the cells from the UMAP; instead, the inclusion of poorer-quality data can effect the entire UMAP topology. I have also included the Leiden clusters for these cells. The number of PCs and the number of neighbors effect Leiden cluster cell assignment while spread and min_dist do not.

Encouragingly, in this dataset there doesn’t appear to be any UMAP clusters that appear to be made up of cells that were flagged in our FOV QC analysis (Section 3.1). If this was the case, it would suggest that we would want to remove the cells in the effected FOVs.

viewof qc_level = Inputs.radio(["no QC filter", "include FOV QC cells", "filter all flags"], {label: "QC Level", value: "include FOV QC cells"})
viewof n_pcs = Inputs.radio([10, 20, 30, 40, 50], {label: "PCs", value: 30})
viewof n_neighbors = Inputs.radio([15, 30, 50], {label: "Neighbors", value: 30})
viewof metric = Inputs.radio(["euclidean", "cosine"], {label: "Metric", value: "cosine"})
viewof mdist = Inputs.radio([0.01, 0.05, 0.2, 0.5], {label: "Min. distance", value: 0.2})
viewof spread = Inputs.radio([0.5, 1.0, 2.0], {label: "Spread", value: 2.0})
viewof colorby = Inputs.radio(["QC", "Leiden"], {label: "Color by", value: "QC"})

qcFlagLegend = {
  try {
    const uniqueFlagsWithZero = Object.keys(flagColors).map(Number).sort((a,b)=> a-b);
    const tableRows = uniqueFlagsWithZero.map(flagValue => {
      const flagColor = flagColors[flagValue]; // Use static map
      const circles = flagNames.map(name => {
         const mask = qcMasks[name];
         const isSet = (flagValue & mask) > 0;
         const circleFill = isSet ? flagColor : "#e0e0e0";
         const circleSvg = `<svg width="15" height="15"><circle cx="7.5" cy="7.5" r="6" fill="${circleFill}" stroke="${isSet ? 'black' : '#ccc'}" stroke-width="1"></circle></svg>`;
         return `<td style="text-align: center;">${circleSvg}</td>`;
      }).join('');
      const rowHeader = `<th scope="row" style="text-align: right; padding-right: 10px; white-space: nowrap;">
                           <span style="display: inline-block; width: 12px; height: 12px; background-color: ${flagColor}; border: 1px solid #555; margin-right: 5px; vertical-align: middle;"></span>
                           ${flagValue === 0 ? 'Passed' : flagValue}
                         </th>`;
      return `<tr>${rowHeader}${circles}</tr>`;
    }).join('');
    const headerCells = flagNames.map(name =>
      `<th style="height: 100px; text-align: center; vertical-align: bottom; padding-bottom: 5px; font-size: 0.8em; white-space: nowrap;">
         <span style="writing-mode: vertical-lr; transform: rotate(180deg);">${name.replace(/_/g, ' ')}</span>
       </th>`
    ).join('');
    const tableHeader = `<thead><tr><th style="vertical-align: bottom;">Flag Value</th>${headerCells}</tr></thead>`;
    return html`<table style="border-collapse: collapse; margin-top: 10px; font-size: 0.9em; table-layout: fixed;">
                   ${tableHeader}<tbody>${tableRows}</tbody>
                 </table>`;
   } catch (error) {
     console.error("Error generating QC Legend:", error);
     return html`<div>Error generating QC Legend</div>`;
   }
}

leidenScaleAndLegend = {
  let scale, legend;
  try { // Add try...catch
    // Check if query_data is ready and has 'leiden'
    if (!query_data || query_data.length === 0 || !query_data[0].hasOwnProperty('leiden')) {
        console.warn("Leiden data (query_data) not ready for legend/scale.");
        scale = d3.scaleOrdinal().domain(["0"]).range(["grey"]); // Placeholder scale
        legend = html`<div><ul style="list-style: none; padding-left: 0; margin-top: 10px; font-size: 0.9em;">
                        <li style="margin-bottom: 2px;"><span style="display: inline-block; width: 12px; height: 12px; background-color: grey; border: 1px solid #555; margin-right: 5px; vertical-align: middle;"></span> Cluster 0</li>
                      </ul>(Loading...)</div>`; // placeholder
    } else {
        const uniqueClusters = Array.from(new Set(query_data.map(d => d.leiden))).sort((a, b) => a - b);
        scale = d3.scaleOrdinal(d3.schemeTableau10).domain(uniqueClusters);

        const legendItems = uniqueClusters.map(cluster => {
           const color = scale(cluster);
           const clusterLabel = String(cluster);
           return `<li style="margin-bottom: 2px;"><span style="display: inline-block; width: 12px; height: 12px; background-color: ${color}; border: 1px solid #555; margin-right: 5px; vertical-align: middle;"></span> Cluster ${clusterLabel}</li>`;
        }).join('');
        legend = html`<ul style="list-style: none; padding-left: 0; margin-top: 10px; font-size: 0.9em;">${legendItems}</ul>`;
    }
  } catch(error) {
     console.error("Error generating Leiden Legend:", error);
     scale = () => 'grey';
     legend = html`<div>Error generating Leiden Legend</div>`;
  }
  return { scale, legend };
}

leidenColorScale = leidenScaleAndLegend.scale
leidenLegendHtml = leidenScaleAndLegend.legend

d3 = require("d3@7")

width = 500
height = 500
marginTop = 20
marginRight = 20
marginBottom = 30
marginLeft = 40

xScale = d3.scaleLinear().domain([0, 1]).range([marginLeft, width - marginRight])
yScale = d3.scaleLinear().domain([0, 1]).range([height - marginBottom, marginTop])

svg = d3.create("svg")
    .attr("width", width)
    .attr("height", height)
    .attr("viewBox", [0, 0, width, height])
    .attr("style", "max-width: 100%; height: auto;");

xAxis = d3.axisBottom(xScale)
    .ticks(width / 80)
    .tickSizeOuter(0)
    .tickSizeInner(-(height - marginTop - marginBottom));

yAxis = d3.axisLeft(yScale)
    .ticks(height / 40)
    .tickSizeOuter(0)
    .tickSizeInner(-(width - marginLeft - marginRight));

gx = svg.append("g")
    .attr("class", "x-axis")
    .attr("transform", `translate(0,${height - marginBottom})`)
    .call(xAxis);

gy = svg.append("g")
    .attr("class", "y-axis")
    .attr("transform", `translate(${marginLeft},0)`)
    .call(yAxis);

svg.selectAll(".tick line")
   .attr("stroke-opacity", 0.1);

svg.append("text")
    .attr("x", width - marginRight)
    .attr("y", height - marginBottom - 4)
    .attr("fill", "currentColor")
    .attr("text-anchor", "end")
    .text("UMAP 1 →");

svg.append("text")
    .attr("x", marginLeft + 4)
    .attr("y", marginTop)
    .attr("fill", "currentColor")
    .attr("text-anchor", "start")
    .attr("dominant-baseline", "hanging")
    .text("↑ UMAP 2");

-
svg.append("clipPath")
    .attr("id", "clip")
  .append("rect")
    .attr("x", marginLeft)
    .attr("y", marginTop)
    .attr("width", width - marginLeft - marginRight)
    .attr("height", height - marginTop - marginBottom);

pointGroup = svg.append("g")
    .attr("fill", "black")
    .attr("clip-path", "url(#clip)");

{ // Reactive block
  const updateVisualization = () => {
    const duration = 750;
    
    const currentScale = colorby === "QC" ? qcColorScale : leidenColorScale;

    if (!currentScale) { 
        console.error("Scale missing"); 
        return; 
    }
    
    if (!query_data || query_data.length === 0) {
       console.warn("No data available for plotting.");
       pointGroup.selectAll("circle").transition().duration(duration).attr("opacity", 0).remove();
       return; 
    }

    const xExtent = d3.extent(query_data, d => d.umap_1);
    const yExtent = d3.extent(query_data, d => d.umap_2);

    const xPad = (xExtent[1] - xExtent[0]) * 0.05;
    const yPad = (yExtent[1] - yExtent[0]) * 0.05;

    xScale.domain([xExtent[0] - xPad || -1, xExtent[1] + xPad || 1]);
    yScale.domain([yExtent[0] - yPad || -1, yExtent[1] + yPad || 1]);

    svg.select(".x-axis")
      .transition().duration(duration)
      .call(xAxis)
      .on("start", () => {
         svg.selectAll(".x-axis .tick line").attr("stroke-opacity", 0.1);
      });

    svg.select(".y-axis")
      .transition().duration(duration)
      .call(yAxis)
      .on("start", () => {
         svg.selectAll(".y-axis .tick line").attr("stroke-opacity", 0.1);
      });

    pointGroup.selectAll("circle")
      .data(query_data, d => d.cell_id)
      .join(
        enter => enter.append("circle")
            .attr("cx", d => xScale(d.umap_1)) 
            .attr("cy", d => yScale(d.umap_2))
            .attr("r", 2.5)
            .attr("fill", d => {
                try {
                    const valueToColor = colorby === "QC" ? qcFlagMap.get(d.cell_id) : d.leiden;
                    return currentScale(valueToColor) || 'grey';
                } catch(e) { return 'grey'; }
            }) 
            .attr("opacity", 0)
            .call(enter => enter.transition().duration(duration).attr("opacity", 0.8)),
        
        update => update
            .call(update => update.transition().duration(duration)
                .attr("cx", d => xScale(d.umap_1))
                .attr("cy", d => yScale(d.umap_2))
                .attr("fill", d => { 
                    try {
                        const valueToColor = colorby === "QC" ? qcFlagMap.get(d.cell_id) : d.leiden;
                        return currentScale(valueToColor) || 'grey';
                    } catch(e) { return 'grey'; }
                })
                .attr("opacity", 0.8)
             ),
        
        exit => exit
            .call(exit => exit.transition().duration(duration).attr("opacity", 0).remove())
      );
  };
  
  try { 
    updateVisualization();
  } catch(error) {
    console.error("Error during D3 updateVisualization:", error);
  }

  return svg.node(); 
}

In the dot plot on the bottom left provides a QC flag legend. For example, QC flag ‘5’ represents cells that were flagged for having low counts and low features whereas QC flag ‘96’ represents cells that wee flagged for FOV QC as wells as low SBR.

If you want to further explore this subset, you can examine the query_data and qc_flags_data dataframes below.

For the remainder of this chapter, I will focus on the PR-based method.

I find it’s helpful to:

Have more control of the plotting aesthetics and label multiple sub-clusters of a given Leiden cluster for clarity and
isolate the Leiden clusters and visualize them in both UMAP space and physical (XY) space to see if there are regions in the tissue that might provide clues about the cell types.

Instead of the default plotDots colors, let’s define our own.

R Code

# define colors semi-manually instead of using the default colors
column <- 'leiden_pr'
column_levels <- levels(py$adata$obs[[column]])
n_levels <- length(column_levels)

if(n_levels<27){
  colors_use <- pals::alphabet(n_levels)
} else if(n < 53){
  colors_use <- c(pals::alphabet(26), pals::alphabet2(n_levels-26))
} else {
  stop("consider fewer groups")
}

names(colors_use) <- column_levels

# saving colors for later
results_list[['leiden_global_names']] <- names(colors_use)
results_list[['leiden_global_colors']] <- as.character(colors_use)
saveRDS(results_list, results_list_file)

Plot the cells in XY space and UMAP space and color based on leiden_pr.

R Code

group_prefix <- "leiden"

pp_assets_dir <- file.path(analysis_asset_dir, "preprocessing")

# XY (global only)
plotDots(py$adata, color_by='leiden_pr',
              plot_global = TRUE,
              facet_by_group = FALSE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001
                  ),
                  scale_bar_params = list(
                    location = c(5, 0),
                    width = 2,
                    n = 3,
                    height = 0.1,
                    scale_colors = c("black", "grey30"),
                    label_nudge_y = -0.3
                  ),
                  directory = pp_assets_dir,
                  fileType = "png",
                  dpi = 200,
                  width = 8, 
                  height = 8,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("X (mm)"),
                ylab("Y (mm)"), 
                coord_fixed(),
                scale_color_manual(values = colors_use),
                theme(legend.position = c(0.8, 0.4)),
                guides(color = guide_legend(
                  title="Leiden (PR)",
                  override.aes = list(size = 3) ) )
              )
              )

# XY (facets only)
plotDots(py$adata, color_by='leiden_pr',
              plot_global = FALSE,
              facet_by_group = TRUE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001
                  ),
                  scale_bar_params = list(
                    location = c(5, 0),
                    width = 2,
                    n = 3,
                    height = 0.1,
                    scale_colors = c("black", "grey30"),
                    label_nudge_y = -0.3
                  ),
                  directory = pp_assets_dir,
                  fileType = "png",
                  dpi = 100,
                  width = 5,
                  height = 5,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("X (mm)"),
                ylab("Y (mm)"), 
                coord_fixed(),
                scale_color_manual(values = colors_use),
                theme(legend.position = c(0.8, 0.4)),
                guides(color = guide_legend(
                  title="Leiden (PR)",
                  override.aes = list(size = 3) ) )
              )
              )

# UMAP (global only)
plotDots(py$adata, 
              obsm_key = "umap_pr",
              color_by='leiden_pr',
              plot_global = TRUE,
              facet_by_group = FALSE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001, alpha=0.1
                  ),
                  geom_label_params = list(
                    size = 4
                  ),
                  labels_on_plot = data.frame(),
                  directory = pp_assets_dir,
                  fileType = "png",
                  dpi = 200,
                  width = 8,
                  height = 8,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("UMAP 1"),
                ylab("UMAP 2"), 
                coord_fixed(),
                scale_color_manual(values = colors_use),
                # guides(color = guide_legend(
                #   title="Cell Type",
                #   override.aes = list(size = 3) ) ), 
                theme(legend.position = "none")
              )
            )

# UMAP (facets only)
plotDots(py$adata, 
              obsm_key = "umap_pr",
              color_by='leiden_pr',
              plot_global = FALSE,
              facet_by_group = TRUE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001, alpha=0.1
                  ),
                  geom_label_params = list(
                    size = 2
                  ),
                  labels_on_plot = data.frame(),
                  directory = pp_assets_dir,
                  fileType = "png",
                  dpi = 100,
                  width = 5,
                  height = 5,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("UMAP 1"),
                ylab("UMAP 2"), 
                coord_fixed(),
                scale_color_manual(values = colors_use),
                guides(color = guide_legend(
                  title="Cell Type",
                  override.aes = list(size = 3) ) )
              )
            )

1: in the plotting below, we will use this name to parse out files.

Here are the overall (global) plots.

Figure 4.3: UMAP with Leiden (PR-based) cells.

Figure 4.4: XY with Leiden (PR-based) cells.

Here are the individual plots² we created organized by Leiden cluster. From these pairs of plots one can quickly see how some Leiden clusters are spatially restricted. For example, Leiden 0 consists of cells in the top part of the tissue (i.e., the non-tumoral epithelium). Similarly, cluster 12 occupies the left-most region of the tissue.

4.4 Conclusion

And with that, we’ve navigated the essential pre-processing steps. Starting with our quality-controlled cell data, we applied normalization to account for technical variations, identified the most informative genes, and used PCA to reduce the data’s dimensionality. Building upon this foundation, we generated UMAP embeddings for visualization and performed Leiden clustering to partition the cells into distinct groups based on their expression profiles.

The interactive exploration highlighted a critical point: parameters matter. Choices made during normalization, dimensionality reduction, and clustering significantly influence the final UMAP layout and cluster assignments. There isn’t a single “correct” set of parameters, but understanding their effects allows you to tailor the analysis to best reveal the biological structures relevant to your specific questions.

The resulting AnnData object, now enriched with normalized data layers, PCA coordinates, UMAP embeddings, and Leiden cluster labels, is primed for the next crucial phase: assigning biological meaning to these patterns. In the following chapters we will delve into grouping cells based on their spatial domain and their cell types.

when I tried using the sc.experimental.pp.normalize_pearson_residuals method on this dataset, the OS crashed (>200 GBs of RAM).↩︎
To see how to generate this type of output, click the </> icon on the top right of this chapter and then click view source.↩︎

--- subtitle: Pre-processing – From Counts to Clusters author: - name: Evelyn Metzger orcid: 0000-0002-4074-9003 affiliations: - ref: bsb - ref: eveilyeverafter execute: eval: true freeze: auto message: true warning: false self-contained: false code-fold: false code-tools: true code-annotations: hover engine: knitr prefer-html: true format: live-html webr: packages: - dplyr - duckdb repos: - https://r-lib.r-universe.dev resources: - assets/interactives/umap_explore.parquet - assets/interactives/umap_explore_qc_flags.parquet --- {{< include ./_extensions/r-wasm/live/_knitr.qmd >}} # Normalization, Reduction, and Clustering {#sec-preprocessing} ```{r} #| label: Preamble #| eval: true #| echo: true #| message: false #| code-fold: true #| code-summary: "R code" source("./preamble.R") reticulate::source_python("./preamble.py") analysis_dir <- file.path(getwd(), "analysis_results") input_dir <- file.path(analysis_dir, "input_files") output_dir <- file.path(analysis_dir, "output_files") analysis_asset_dir <- "./assets/analysis_results" # <1> qc_dir <- file.path(output_dir, "qc") if(!dir.exists(qc_dir)){ dir.create(qc_dir, recursive = TRUE) } pp_dir <- file.path(output_dir, "processing") if(!dir.exists(pp_dir)){ dir.create(pp_dir, recursive = TRUE) } results_list_file = file.path(analysis_dir, "results_list.rds") if(!file.exists(results_list_file)){ results_list <- list() saveRDS(results_list, results_list_file) } else { results_list <- readRDS(results_list_file) } source("./helpers.R") # contains the plotDots function ``` Our quality control steps in @sec-quality-processing identified the analyzable cells, but these their raw counts need further processing to reveal biological patterns. This chapter covers the essential pre-processing steps to make the data interpretable. First, normalization adjusts counts to account for technical factors like cell area (_i.e._, larger cells typically have more counts), enabling fair comparisons between cells. Second, dimensionality reduction (using PCA and UMAP) helps reduce the total number of dimensions in the data (almost 19,000) to a manageable representation that captures key biological variation and allows visualization. Finally, clustering leverages this reduced space to group cells with similar expression profiles, which can be a foundation for downstream analyses like cell typing. By the end of this chapter, we will have transformed the filtered count matrix into a clustered, low-dimensional dataset and classified cells into groups which, taken together, form our foundation for biological interpretation and spatial analysis. I often get asked "What is the best parameter combination for CosMx data?" and I hope by the end of this chapter you're able to understand how parameter adjustments impact many of these processing steps. Let's begin by loading the data. ```{python} #| label: load-anndata #| eval: false #| code-summary: "Python Code" #| message: true #| warning: false #| code-fold: show adata = ad.read_h5ad(os.path.join(r.analysis_dir, "anndata-1-qc-filtered.h5ad")) ``` ## Two pre-processing workflows There are two different pre-processing workflows that I use. The first one is borrowed from spatially-unaware scRNA-seq. The main advantage of this approach is that it is fast, the normalization matrix can remain sparse, the values are easy to understand (_e.g.,_ counts of gene X per 10,000 or some transformation of such), and often times for WTX data provides the basis for good clustering of cells in downstream UMAP. Moreover, Scanpy and Seurat each have built-in methods for these steps which make it relatively low friction. However, there are cases were this standard ("legacy") workflow provides unexpected results. One reason is that total counts-based normalization might have unstable variance @Hafemeister2019 such that high-expressing housekeeping genes might have high variance and disproportionately effect downstream dimensional reduction and clustering. The second, newer workflow that I have been using is based on normalization using Pearson Residuals (PR). The approach itself is not new and scanpy has a built-in method for it (_i.e._, `scanpy.experimental.pp.normalize_pearson_residuals`). Until recently the main disadvantage for using PR is that it doesn't scale well to large number of cells that are typical in CosMx SMI[^1] and it creates a large, dense normalization matrix that can make analysis and read/writing slower. Fortunately, in many cases I do not need to compute and store such a dense matrix. For example, Dan McGuire recently created the R package [scPearsonPCA](https://nanostring-biostats.github.io/CosMx-Analysis-Scratch-Space/posts/pearsonpca) that can estimate the Principle Components (PCs) of the PR normalized data without needing to convert and use very dense matrices (which we will show below). While this workflow involves converting data between python and R, I generally find the clusters that are derived from such data are comparable to the standard workflow or superior and so I recommend this workflow overall. In this chapter, I'll run both the standard and recommended workflows on the colon cancer dataset to provide code for both approaches. I will also do a deep-dive on the choice of UMAP parameters. This particular dataset is a case where the results are more-or-less comparable and I'll pick the PR data to use throughout the rest of the colon analysis. Let's begin. Both options begin the same way: computing the highly variable genes (HVGs). I have found that the `pearson_residulas` method to detect HVGs (not to be confused with PR normalization) provides stable results for a variety of CosMx SMI WTX experiments. ```{python} #| label: RNA-preprocessing #| eval: false #| code-summary: "Python Code" #| message: false #| warning: false #| code-fold: show adata.layers["counts"] = adata.X.copy() sc.experimental.pp.highly_variable_genes( # <1> adata, flavor='pearson_residuals', n_top_genes=4000, # <2> layer='counts' ) ``` 1. the pearson_residuals 'flavor' implement in this function is not the same as the pearson residuals _normalization method_ (see below). 2. generally 3000-5000 is a good starting point here. ::: {.column-margin} ::: {.otherapproachesbox title="Alternative Approach"} The pre-processing workflow presented here can be considered a good overall starting point towards understanding the structure of the data. As such I have omitted more complex workflows and tasks such as integrating morphology-based and/or protein-based PCs into the analysis. For a more involved workflow, see our upcoming WTX paper. ::: ::: The tabs below show both of these workflows. In each workflow, I set the umap parameters to be identical (40 PCs, 15 nearest neighbors, spread = 2 and minimum distance = 0.02). While these are fixed here, see the box `Choosing UMAP Parameters` below to see how these parameters effect the UMAP visual and the Leiden cluster assignments. ::: {.panel-tabset} ### Total Counts Workflow {#sec-tc-workflow} > Workflow: Total counts-based normalization > log transformation > PCA The procedure below adapts a standard `scanpy` [pipeline](https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html#normalization). We'll first normalize each cell the median total counts and then log tranform the data. With this approach I find that scaling the log1p data tends to generate more distinct and biologically relevant clusters so I add that step. Thus, we'll focus our analysis on the most variable genes found in the dataset. After subsetting down to only these 4000 HVGs, we'll scale each gene, run PCA, and put these HVG-derivied PCs into the full annData object. On this dataset (`r results_list$n_cells_after_qc` cells), this whole process took about a minute. ```{python} #| label: TC-1 #| eval: false #| code-summary: "Python Code" #| message: false #| warning: false #| code-fold: show sc.pp.normalize_total(adata) sc.pp.log1p(adata) # -> adata.X adata.layers['TC'] = adata.X.copy() # Subset to HVGs and get PCs adata_hvg = adata[:, adata.var.highly_variable].copy() sc.pp.scale(adata_hvg) # -> adata.X sc.tl.pca(adata_hvg, svd_solver='auto') adata.obsm['X_pca_TC'] = adata_hvg.obsm['X_pca'] ``` Run UMAP and Leiden with the top 40 PCs. ```{python} #| label: TC-2 #| eval: false #| code-summary: "Python Code" #| message: false #| warning: false #| code-fold: show sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40, metric = 'cosine', use_rep='X_pca_TC', key_added='neighbors_TC') UMAP = sc.tl.umap(adata, min_dist=0.02, spread=2, neighbors_key='neighbors_TC', copy=True) adata.obsm['umap_TC'] = UMAP.obsm['X_umap'] adata.uns['umap_TC_params'] = UMAP.uns['umap'] sc.tl.leiden(adata, resolution=0.5, key_added='leiden_tc', flavor='igraph', n_iterations=2, neighbors_key='neighbors_TC') ``` ### Pearson Residuals Workflow > Workflow: Py to R > scPearsonPCA The `scPearsonPCA` package estimates the PCs of the PR data without computing and storing a dense normalized matrix. Since it is written in R, we'll add it to this workflow via `reticulate` after converting a few objects. ```{python} #| label: PR-1 #| eval: false #| code-summary: "Python Code" #| message: false #| warning: false #| code-fold: show total_counts_per_cell = adata.layers['counts'].sum(axis=1) total_counts_per_cell = np.asarray(total_counts_per_cell).flatten() total_counts_per_target = adata.layers['counts'].sum(axis=0) total_counts_per_target = np.asarray(total_counts_per_target).flatten() target_frequencies = total_counts_per_target / total_counts_per_target.sum() adata_hvg = adata[:, adata.var.highly_variable].copy() hvgs_counts = adata_hvg.layers['counts'].copy().astype(np.int64) ``` Convert relevant data to R. ```{r} #| label: PR-2 #| eval: false #| code-summary: "R Code" #| message: true #| warning: false #| code-fold: show library(Matrix) hvgs_counts <- py$hvgs_counts rownames(hvgs_counts) <- py$adata_hvg$obs_names$to_list() colnames(hvgs_counts) <- py$adata_hvg$var_names$to_list() # the approach we use below expects a sparse matrix with genes (rows) by cells (columns) t_hvgs_counts <- Matrix::t(hvgs_counts) t_hvgs_counts <- as(t_hvgs_counts, "CsparseMatrix") ``` Install the scPearsonPCA package (and remotes), if needed. ```{r} #| label: PR-3 #| eval: false #| code-summary: "R Code" #| message: true #| warning: false #| code-fold: show remotes::install_github("Nanostring-Biostats/CosMx-Analysis-Scratch-Space", subdir = "_code/scPearsonPCA", ref = "Main") ``` Run Seurat-style PCA using scPearsonPCA. ```{r} #| label: PR-4 #| eval: false #| code-summary: "R Code" #| message: true #| warning: false #| code-fold: show library(scPearsonPCA) total_counts_per_cell <- py$total_counts_per_cell target_frequencies <- py$target_frequencies names(target_frequencies) <- py$adata$var_names$to_list() pcaobj <- sparse_quasipoisson_pca_seurat( x = t_hvgs_counts, totalcounts = total_counts_per_cell, grate = target_frequencies[rownames(t_hvgs_counts)], scale.max = 10, do.scale = TRUE, do.center = TRUE, ncores = 5 ) pca <- pcaobj$reduction.data@cell.embeddings ``` Add these PCs to the annData object in Python. Run UMAP and Leiden with the top 40 PCs. ```{python} #| label: PR-6 #| eval: false #| code-summary: "Python Code" #| message: false #| warning: false #| code-fold: show adata.obsm["X_pca_pr"] = r.pca sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40, metric = 'cosine', use_rep='X_pca_pr', key_added='neighbors_pr') UMAP = sc.tl.umap(adata, min_dist=0.02, spread=2, neighbors_key='neighbors_pr', copy=True) adata.obsm['umap_pr'] = UMAP.obsm['X_umap'] adata.uns['umap_pr_params'] = UMAP.uns['umap'] sc.tl.leiden(adata, resolution=0.5, key_added='leiden_pr', flavor='igraph', n_iterations=2, neighbors_key='neighbors_pr') ``` ::: Save the annData object. ```{python} #| label: save-anndata #| eval: false #| code-summary: "Python Code" #| message: false #| warning: false #| code-fold: show adata.write_h5ad( os.path.join(r.analysis_dir, "anndata-2-leiden.h5ad"), compression=hdf5plugin.FILTERS["zstd"], compression_opts=hdf5plugin.Zstd(clevel=5).filter_options ) ``` ## Visualizations Now that we have the UMAP embeddings and the Leiden cluster assignments in the annData object, we can plot the results of the two methods and compare. The easiest way to plot the Leiden clusters in UMAP space is with scnapy's built-in method, [`sc.pl.umap`](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.umap.html); however, since we have non-typical `obsm` keys, we'll use the more generic [`sc.pl.embedding`](https://scanpy.readthedocs.io/en/latest/api/generated/scanpy.pl.embedding.html) method. ```{python} #| label: make-umap-figs #| code-summary: "Python Code" #| code-fold: show #| message: false #| warning: false #| echo: true #| eval: false import matplotlib.pyplot as plt sc.settings.figdir = r.pp_dir save_dir = r.pp_dir adata.obs['log10_nCount_RNA'] = np.log10(adata.obs['nCount_RNA'] + 1) adata.obs['qcflag'] = adata.obs['qcflag'].astype('category') fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(15, 10)) sc.pl.embedding(adata, basis='umap_TC', color='leiden_tc', title='Total Counts (leiden_tc)', palette=sc.pl.palettes.default_20, ax=ax1, show=False) sc.pl.embedding(adata, basis='umap_pr', color='leiden_pr', title='Pearson Residuals (leiden_pr)', palette=sc.pl.palettes.default_20, ax=ax2, show=False) sc.pl.embedding(adata, basis='umap_TC', color='log10_nCount_RNA', title='Total Counts (log10_nCount_RNA)', ax=ax3, show=False) sc.pl.embedding(adata, basis='umap_pr', color='log10_nCount_RNA', title='Pearson Residuals (log10_nCount_RNA)', ax=ax4, show=False) sc.pl.embedding(adata, basis='umap_TC', color='qcflag', title='Total Counts (FOV QC Flag)', palette=sc.pl.palettes.default_20, ax=ax5, show=False) sc.pl.embedding(adata, basis='umap_pr', color='qcflag', title='Pearson Residuals (FOV QC Flag)', palette=sc.pl.palettes.default_20, ax=ax6, show=False) plt.tight_layout() filename = "umap_umap_compare.png" save_path = os.path.join(save_dir, filename) plt.savefig(save_path) ``` ```{r} #| label: fig-umap-leiden-simple #| message: false #| warning: false #| echo: false #| fig.width: 8 #| fig.height: 12 #| fig-cap: "UMAP with Leiden clusters from total counts normalization (left) and scPearsonPCA PCs (right). Note that a given cluster label in one plot is not the same as that label in the other plot. Top: Leiden clusters. Middle: cells colored by their total counts. Bottom: cells colored by their QC flag (0 = no flag; 32 = FOV flag; no other flaggeed cells were processed)." #| eval: true render(file.path(analysis_asset_dir, "preprocessing"), "umap_umap_compare.png", pp_dir, overwrite=TRUE) ``` Both normalization methods demonstrate effective cluster separation (@fig-umap-leiden-simple). When evaluating these visualizations, it is important to watch for artifacts, particularly sub-clustering driven primarily by total transcript counts, which can signal suboptimal normalization. Additionally, because we retained cells flagged by FOV QC (flag 32), we must verify that these cells do not aggregate into spurious "satellite" clusters which could indicate technical artifacts rather than biological signal. In this dataset, while FOV QC-flagged cells are present in the upper right cluster (Leiden 5 in the total counts workflow and Leiden 11 in the Pearson Residuals workflow), they do not form a distinct artifactual group. Instead, they appear integrated with the dominant cell population, suggesting their distribution is driven by the biology of that region -- a hypothesis we will elucidate further in @sec-celltyping. We can compare which combination of Leiden clusters a given cell was assigned to: ```{python} #| echo: true #| eval: false import pandas as pd import seaborn as sns crosstab = pd.crosstab( adata.obs['leiden_tc'], adata.obs['leiden_pr'] ) crosstab_norm = crosstab.div(crosstab.sum(axis=1), axis=0) plt.figure(figsize=(12, 10)) sns.heatmap( crosstab_norm, annot=True, fmt=".2f", cmap="viridis", linewidths=.5 ) plt.title("Comparison of Leiden clusters derived from TC or PR workflows", fontsize=16) plt.ylabel("Total Counts (leiden_tc)", fontsize=12) plt.xlabel("Pearson Residuals (leiden_pr)", fontsize=12) filename = "leiden_compare_crosstab.png" save_path = os.path.join(save_dir, filename) plt.savefig(save_path) ``` ```{r} #| label: fig-umap-leiden-compare #| message: false #| warning: false #| echo: false #| fig.width: 12 #| fig.height: 12 #| fig-cap: "Proportion of cells classified with the pure scanpy-based approach compared to the scPearsonPCA-based appraoch." #| eval: true render(file.path(analysis_asset_dir, "preprocessing"), "leiden_compare_crosstab.png", pp_dir, overwrite=TRUE) ``` @fig-umap-leiden-compare shows a relationship between the clusters derived from our two normalization approaches. Most (15 of 16) Leiden clusters with the total counts approach have more than 85% of cells that are found in a single Leiden cluster for the Pearson Residuals method (e.g., 95% of tc_3 cells mapped to pr_1). One cluster in the total counts approach (tc_1) "split" amongst PR clusters (primarily clusters pr_5, pr_6, and pr_8). So which is _better_? With this particular dataset the two approaches are comparable. For the majority of my CosMx SMI WTX analyses I have found the Pearson Residual-based normalization to provide more meaningful biological clusters consistently across datasets so that might be a good overall recommendation -- especially if you are working within the R ecosystem. If you are looking for a quick analysis, then the total counts-based approach may work in many cases. :::{.column-body-outset} :::{.noteworthybox title="Choosing UMAP Parameters" collapse="true"} ## Choosing UMAP Parameters UMAP helps visualize how groups of cells relate to each other in high-dimensional space. While there isn't one "correct" UMAP layout, some embeddings are more effective than others at revealing underlying biological structure. We have seen in this chapter how having choice in normalization alone (via PCA) can alter the shape of a UMAP plot even when all other parameters are identical (@fig-umap-leiden-simple). This section explores how commonly adjusted parameters and inputs other than the choice of normalization influence the resulting UMAP visualization, aiming to build intuition for how to tune these parameters effectively in your dataset. Let's systematically vary five key factors: - **The quality filtering applied to the input cells.** Using only high-quality cells often leads to clearer, tighter clusters representing core biological populations. If filtering was too stringent then entire cell populations might be unrepresented. - **The number of Principal Components (PCs) used.** Using a lower number will concetrate on the strongest axes of variation at the cost of reducing any distinction between cell subtypes if that distinction is found in higher PCs. Choosing a higher number of PCs can potentially resolve finer subtypes but also risks incorporating noise, which might slightly blur cluster boundaries or create small, potentially spurious groupings. - **The number of nearest neighbors (n_neighbors).** Lower `n_neighbors` tends to emphasize local cluster separation while higher `n_neigbhors` tends to form broader clusters. - **The distance metric (metric).** Defines the rule for calculating "closeness" between cells. Euclidean measures straight-line distance while cosine measures the angle between gene expression profiles. Since changing this alters which cells are considered "neighbors," the parameter can have large impacts on the global shape. Euclidean is the default many packages but cosine often provides more biologically meaningful clusters for CosMx SMI data. - **The minimum distance parameter (min_dist).** Controls how tightly packed points are within a cluster. Higher values tend to make more diffuse clusters. - **The spread parameter (spread).** Controls the separation between clusters. Lower `spread` tends to bring clusters closer together. This list isn't exhaustive, but varying these while keeping others constant will illustrate their core effects. To make this exploration computationally feasible, we will use a representative subsample of 50,000 cells from our quality-flagged dataset. And to make visualization smooth, we'll only plot at most 500 cells. Click the radio buttons below to see what effect these parameter have on the UMAP visualization. Some of these changes are gradual while some dramatically alter the shape and configuration of the UMAP embeddings. You'll note that filtering out the poor quality cells doesn't merely remove the cells from the UMAP; instead, the inclusion of poorer-quality data can effect the entire UMAP topology. I have also included the Leiden clusters for these cells. The number of PCs and the number of neighbors effect Leiden cluster cell assignment while `spread` and `min_dist` do not. Encouragingly, in this dataset there doesn't appear to be any UMAP clusters that appear to be made up of cells that were flagged in our FOV QC analysis (@sec-fov-qc). If this _was_ the case, it would suggest that we would want to remove the cells in the effected FOVs. ```{python} #| eval: false #| echo: false # This section is for the example only -- not part of a regular analysis import itertools import pandas as pd import numpy as np import pyarrow as pa import pyarrow.parquet as pq import scanpy as sc import os bdata = sc.read_h5ad(os.path.join(r.analysis_dir, "anndata-1-qc-flagged.h5ad")) n_subsample = 50000 random_seed = 98103 sc.pp.subsample(bdata, n_obs=n_subsample, random_state=random_seed) # Define Parameters qc_levels = { "no QC filter": None, "include FOV QC cells": [0, 32], "filter all flags": [0] } # Graph Parameters (affect the topology/connections) graph_params = list(itertools.product( [10, 20, 30, 40, 50], # n_pcs [15, 30, 50], # n_neighbors ['euclidean', 'cosine'] # metric )) # Layout Parameters (affect only the visualization) layout_params = list(itertools.product( [0.01, 0.05, 0.2, 0.5], # min_dist [0.5, 1.0, 2.0] # spread )) all_embeddings_list = [] for qc_level, qc_flags in qc_levels.items(): print(f"--- Starting QC Level: {qc_level} ---") # Create base object for this QC level if qc_level != "no QC filter": adata_qc = bdata[bdata.obs['qcflag'].isin(qc_flags), :].copy() else: adata_qc = bdata.copy() # Preprocessing adata_qc.layers["counts"] = adata_qc.X.copy() sc.pp.normalize_total(adata_qc) sc.pp.log1p(adata_qc) sc.experimental.pp.highly_variable_genes( adata_qc, flavor='pearson_residuals', n_top_genes=4000, layer='counts' ) adata_qc = adata_qc[:, adata_qc.var.highly_variable].copy() sc.pp.scale(adata_qc) sc.tl.pca(adata_qc, svd_solver='auto') for n_pcs, n_neighbors, metric in graph_params: print(f"----- Starting graph_params: {n_pcs} {n_neighbors} {metric} ---") adata_graph = adata_qc.copy() try: sc.pp.neighbors( adata_graph, n_pcs=n_pcs, n_neighbors=n_neighbors, use_rep='X_pca', metric=metric, random_state=random_seed ) for min_dist, spread in layout_params: param_string = f"qc_{qc_level}_pcs_{n_pcs}_nn_{n_neighbors}_met_{metric}_md_{min_dist}_sp_{spread}" print(f"Running: {param_string}") sc.tl.umap(adata_graph, min_dist=min_dist, spread=spread, random_state=random_seed) sc.tl.leiden(adata_graph, resolution=0.5, key_added='leiden', flavor='igraph', n_iterations=2) embedding_df = pd.DataFrame( adata_graph.obsm['X_umap'], index=adata_graph.obs.index, columns=['umap_1', 'umap_2'] ) # Add metadata embedding_df['leiden'] = adata_graph.obs['leiden'] embedding_df['qc'] = qc_level embedding_df['n_pcs'] = n_pcs embedding_df['n_neighbors'] = n_neighbors embedding_df['metric'] = metric embedding_df['min_dist'] = min_dist embedding_df['spread'] = spread embedding_df = embedding_df.reset_index().rename(columns={'index': 'cell_id'}) all_embeddings_list.append(embedding_df) except Exception as e: print(f"!!! Error params ({n_pcs}, {n_neighbors}, {metric}): {e}") df = pd.concat(all_embeddings_list, ignore_index=True) n_subsubsample = 500 random_seed = 98103 umap_output_file = f"./assets/interactives/umap_explore{str(n_subsubsample)}.parquet" qc_flags_output_file = f"./assets/interactives/umap_explore_qc_flags{str(n_subsubsample)}.parquet" np.random.seed(random_seed) unique_cell_ids = df['cell_id'].unique() cell_ids_pick = np.random.choice( unique_cell_ids, size=n_subsubsample, replace=False ) cell_ids_pick_set = set(cell_ids_pick) df_filtered = df[df['cell_id'].isin(cell_ids_pick_set)].copy() pq.write_table(pa.Table.from_pandas(df_filtered), umap_output_file) pq.write_table(pa.Table.from_pandas(df_filtered), "./assets/interactives/umap_explore.parquet") bmeta = bdata.obs.loc[bdata.obs.index.isin(cell_ids_pick_set), ['cell_ID', 'qcflag']].copy() bmeta['cell_id'] = bmeta.index bmeta_filtered = bmeta.loc[bmeta['qcflag'] != 0, ['cell_id', 'qcflag']].reset_index(drop=True) pq.write_table(pa.Table.from_pandas(bmeta_filtered), qc_flags_output_file) pq.write_table(pa.Table.from_pandas(bmeta_filtered), "./assets/interactives/umap_explore_qc_flags.parquet") ``` :::: {.columns} ::: {.column width="40%"} ```{webr} #| echo: false #| eval: true #| edit: false #| define: #| - qc_flags_data # This block fetches the QC # flag codes for the cell ids (numeric) so that it can be merged in the plots # and tables. qc_flags_file <- "assets/interactives/umap_explore_qc_flags.parquet" qc_flags_data <- data.frame(cell_id = character(), qcflag = integer()) con <- tryCatch({ dbConnect(duckdb()) }, error = function(e) { print(e); NULL }) if (!is.null(con)) { if (file.exists(qc_flags_file)) { # print("Loading QC flags data...") qc_flags_data <- tryCatch({ dbGetQuery(con, paste0("SELECT * FROM '", qc_flags_file, "'")) }, error = function(e) { # print("Error reading QC flags parquet:") print(e) qc_flags_data # Return default empty }) # print(paste("Loaded", nrow(qc_flags_data), "QC flag entries.")) } else { print(paste("QC flags file not found:", qc_flags_file)) } dbDisconnect(con, shutdown = TRUE) } ``` ```{ojs} //| eval: true //| echo: false viewof qc_level = Inputs.radio(["no QC filter", "include FOV QC cells", "filter all flags"], {label: "QC Level", value: "include FOV QC cells"}) viewof n_pcs = Inputs.radio([10, 20, 30, 40, 50], {label: "PCs", value: 30}) viewof n_neighbors = Inputs.radio([15, 30, 50], {label: "Neighbors", value: 30}) viewof metric = Inputs.radio(["euclidean", "cosine"], {label: "Metric", value: "cosine"}) viewof mdist = Inputs.radio([0.01, 0.05, 0.2, 0.5], {label: "Min. distance", value: 0.2}) viewof spread = Inputs.radio([0.5, 1.0, 2.0], {label: "Spread", value: 2.0}) viewof colorby = Inputs.radio(["QC", "Leiden"], {label: "Color by", value: "QC"}) ``` ```{webr} #| echo: false #| eval: true #| edit: false #| input: #| - qc_level #| - n_pcs #| - n_neighbors #| - metric #| - mdist #| - spread #| - colorby #| define: #| - query_data # connection to an in-memory DuckDB database. con <- dbConnect(duckdb()) # build the SQL query query <- paste0( "SELECT * FROM 'assets/interactives/umap_explore.parquet' WHERE qc = '", qc_level, "' AND n_pcs = '", n_pcs, "' AND n_neighbors = '", n_neighbors, "' AND metric = '", metric, "' AND min_dist = '", mdist, "' AND spread = '", spread, "'" ) query_data <- tryCatch({ dbGetQuery(con, query) }, error = function(e) { print("--- SQL Query Error ---") print("Query that failed:") print(query) print("Error message:") print(e) data.frame( cell_id = character(), umap_1 = numeric(), umap_2 = numeric() ) }) dbDisconnect(con, shutdown = TRUE) ```  ```{ojs} //| echo: false //| eval: true //| output: false // Define the meaning of each bit qcMasks = ({ LOW_COUNTS: 1 << 0, HIGH_COUNTS: 1 << 1, LOW_FEAT: 1 << 2, HIGH_FEAT: 1 << 3, LOW_SPLIT: 1 << 4, FOV_QC: 1 << 5, LOW_SBR: 1 << 6 }) flagNames = Object.keys(qcMasks).sort((a, b) => qcMasks[a] - qcMasks[b]); qcFlagMap = new Map(qc_flags_data.map(d => [d.cell_id, d.qcflag])); flagColors = { const uniqueFlagsInData = Array.from(new Set(qc_flags_data.map(d => d.qcflag))).sort((a, b) => a - b); const uniqueFlagsWithZero = Array.from(new Set([0, ...uniqueFlagsInData])).sort((a,b)=> a-b); const colors = d3.schemeTableau10; const map = {}; let colorIndex = 0; uniqueFlagsWithZero.forEach(flagValue => { if (flagValue === 0) { map[flagValue] = "black"; } else { map[flagValue] = colors[colorIndex++ % colors.length]; } }); return map; } qcColorScale = (flagValue) => flagColors[flagValue ?? 0]; ``` ::: {.columns} ::: {.column width="70%"}  ```{ojs} //| echo: false //| eval: true //| output: true qcFlagLegend = { try { const uniqueFlagsWithZero = Object.keys(flagColors).map(Number).sort((a,b)=> a-b); const tableRows = uniqueFlagsWithZero.map(flagValue => { const flagColor = flagColors[flagValue]; // Use static map const circles = flagNames.map(name => { const mask = qcMasks[name]; const isSet = (flagValue & mask) > 0; const circleFill = isSet ? flagColor : "#e0e0e0"; const circleSvg = `<svg width="15" height="15"><circle cx="7.5" cy="7.5" r="6" fill="${circleFill}" stroke="${isSet ? 'black' : '#ccc'}" stroke-width="1"></circle></svg>`; return `<td style="text-align: center;">${circleSvg}</td>`; }).join(''); const rowHeader = `<th scope="row" style="text-align: right; padding-right: 10px; white-space: nowrap;"> <span style="display: inline-block; width: 12px; height: 12px; background-color: ${flagColor}; border: 1px solid #555; margin-right: 5px; vertical-align: middle;"></span> ${flagValue === 0 ? 'Passed' : flagValue} </th>`; return `<tr>${rowHeader}${circles}</tr>`; }).join(''); const headerCells = flagNames.map(name => `<th style="height: 100px; text-align: center; vertical-align: bottom; padding-bottom: 5px; font-size: 0.8em; white-space: nowrap;"> <span style="writing-mode: vertical-lr; transform: rotate(180deg);">${name.replace(/_/g, ' ')}</span> </th>` ).join(''); const tableHeader = `<thead><tr><th style="vertical-align: bottom;">Flag Value</th>${headerCells}</tr></thead>`; return html`<table style="border-collapse: collapse; margin-top: 10px; font-size: 0.9em; table-layout: fixed;"> ${tableHeader}<tbody>${tableRows}</tbody> </table>`; } catch (error) { console.error("Error generating QC Legend:", error); return html`<div>Error generating QC Legend</div>`; } } ``` ::: ::: {.column width="3%"}  ::: ::: {.column width="27%"} ```{ojs} //| echo: false //| eval: true //| output: true leidenScaleAndLegend = { let scale, legend; try { // Add try...catch // Check if query_data is ready and has 'leiden' if (!query_data || query_data.length === 0 || !query_data[0].hasOwnProperty('leiden')) { console.warn("Leiden data (query_data) not ready for legend/scale."); scale = d3.scaleOrdinal().domain(["0"]).range(["grey"]); // Placeholder scale legend = html`<div><ul style="list-style: none; padding-left: 0; margin-top: 10px; font-size: 0.9em;"> <li style="margin-bottom: 2px;"><span style="display: inline-block; width: 12px; height: 12px; background-color: grey; border: 1px solid #555; margin-right: 5px; vertical-align: middle;"></span> Cluster 0</li> </ul>(Loading...)</div>`; // placeholder } else { const uniqueClusters = Array.from(new Set(query_data.map(d => d.leiden))).sort((a, b) => a - b); scale = d3.scaleOrdinal(d3.schemeTableau10).domain(uniqueClusters); const legendItems = uniqueClusters.map(cluster => { const color = scale(cluster); const clusterLabel = String(cluster); return `<li style="margin-bottom: 2px;"><span style="display: inline-block; width: 12px; height: 12px; background-color: ${color}; border: 1px solid #555; margin-right: 5px; vertical-align: middle;"></span> Cluster ${clusterLabel}</li>`; }).join(''); legend = html`<ul style="list-style: none; padding-left: 0; margin-top: 10px; font-size: 0.9em;">${legendItems}</ul>`; } } catch(error) { console.error("Error generating Leiden Legend:", error); scale = () => 'grey'; legend = html`<div>Error generating Leiden Legend</div>`; } return { scale, legend }; } leidenColorScale = leidenScaleAndLegend.scale leidenLegendHtml = leidenScaleAndLegend.legend ``` ::: ::: ::: ::: {.column width="5%"} ::: ::: {.column width="55%"} ```{ojs} //| echo: false //| eval: true //| message: false //| output: false d3 = require("d3@7") width = 500 height = 500 marginTop = 20 marginRight = 20 marginBottom = 30 marginLeft = 40 xScale = d3.scaleLinear().domain([0, 1]).range([marginLeft, width - marginRight]) yScale = d3.scaleLinear().domain([0, 1]).range([height - marginBottom, marginTop]) svg = d3.create("svg") .attr("width", width) .attr("height", height) .attr("viewBox", [0, 0, width, height]) .attr("style", "max-width: 100%; height: auto;"); xAxis = d3.axisBottom(xScale) .ticks(width / 80) .tickSizeOuter(0) .tickSizeInner(-(height - marginTop - marginBottom)); yAxis = d3.axisLeft(yScale) .ticks(height / 40) .tickSizeOuter(0) .tickSizeInner(-(width - marginLeft - marginRight)); gx = svg.append("g") .attr("class", "x-axis") .attr("transform", `translate(0,${height - marginBottom})`) .call(xAxis); gy = svg.append("g") .attr("class", "y-axis") .attr("transform", `translate(${marginLeft},0)`) .call(yAxis); svg.selectAll(".tick line") .attr("stroke-opacity", 0.1); svg.append("text") .attr("x", width - marginRight) .attr("y", height - marginBottom - 4) .attr("fill", "currentColor") .attr("text-anchor", "end") .text("UMAP 1 →"); svg.append("text") .attr("x", marginLeft + 4) .attr("y", marginTop) .attr("fill", "currentColor") .attr("text-anchor", "start") .attr("dominant-baseline", "hanging") .text("↑ UMAP 2"); - svg.append("clipPath") .attr("id", "clip") .append("rect") .attr("x", marginLeft) .attr("y", marginTop) .attr("width", width - marginLeft - marginRight) .attr("height", height - marginTop - marginBottom); pointGroup = svg.append("g") .attr("fill", "black") .attr("clip-path", "url(#clip)"); ``` ```{ojs} //| echo: false //| eval: true //| message: false { // Reactive block const updateVisualization = () => { const duration = 750; const currentScale = colorby === "QC" ? qcColorScale : leidenColorScale; if (!currentScale) { console.error("Scale missing"); return; } if (!query_data || query_data.length === 0) { console.warn("No data available for plotting."); pointGroup.selectAll("circle").transition().duration(duration).attr("opacity", 0).remove(); return; } const xExtent = d3.extent(query_data, d => d.umap_1); const yExtent = d3.extent(query_data, d => d.umap_2); const xPad = (xExtent[1] - xExtent[0]) * 0.05; const yPad = (yExtent[1] - yExtent[0]) * 0.05; xScale.domain([xExtent[0] - xPad || -1, xExtent[1] + xPad || 1]); yScale.domain([yExtent[0] - yPad || -1, yExtent[1] + yPad || 1]); svg.select(".x-axis") .transition().duration(duration) .call(xAxis) .on("start", () => { svg.selectAll(".x-axis .tick line").attr("stroke-opacity", 0.1); }); svg.select(".y-axis") .transition().duration(duration) .call(yAxis) .on("start", () => { svg.selectAll(".y-axis .tick line").attr("stroke-opacity", 0.1); }); pointGroup.selectAll("circle") .data(query_data, d => d.cell_id) .join( enter => enter.append("circle") .attr("cx", d => xScale(d.umap_1)) .attr("cy", d => yScale(d.umap_2)) .attr("r", 2.5) .attr("fill", d => { try { const valueToColor = colorby === "QC" ? qcFlagMap.get(d.cell_id) : d.leiden; return currentScale(valueToColor) || 'grey'; } catch(e) { return 'grey'; } }) .attr("opacity", 0) .call(enter => enter.transition().duration(duration).attr("opacity", 0.8)), update => update .call(update => update.transition().duration(duration) .attr("cx", d => xScale(d.umap_1)) .attr("cy", d => yScale(d.umap_2)) .attr("fill", d => { try { const valueToColor = colorby === "QC" ? qcFlagMap.get(d.cell_id) : d.leiden; return currentScale(valueToColor) || 'grey'; } catch(e) { return 'grey'; } }) .attr("opacity", 0.8) ), exit => exit .call(exit => exit.transition().duration(duration).attr("opacity", 0).remove()) ); }; try { updateVisualization(); } catch(error) { console.error("Error during D3 updateVisualization:", error); } return svg.node(); } ``` In the dot plot on the bottom left provides a QC flag legend. For example, QC flag '5' represents cells that were flagged for having low counts and low features whereas QC flag '96' represents cells that wee flagged for FOV QC as wells as low SBR. ::: :::: If you want to further explore this subset, you can examine the `query_data` and `qc_flags_data` dataframes below. ```{webr} #| autorun: true #| input: #| - query_data #| - qc_flags_data head(query_data) head(qc_flags_data) ``` ::: ::: For the remainder of this chapter, I will focus on the PR-based method. I find it's helpful to: 1. Have more control of the plotting aesthetics and label multiple sub-clusters of a given Leiden cluster for clarity and 2. isolate the Leiden clusters and visualize them in both UMAP space and physical (XY) space to see if there are regions in the tissue that might provide clues about the cell types. Instead of the default `plotDots` colors, let's define our own. ```{r} #| label: V-1 #| eval: false #| code-summary: "R Code" #| code-fold: true # define colors semi-manually instead of using the default colors column <- 'leiden_pr' column_levels <- levels(py$adata$obs[[column]]) n_levels <- length(column_levels) if(n_levels<27){ colors_use <- pals::alphabet(n_levels) } else if(n < 53){ colors_use <- c(pals::alphabet(26), pals::alphabet2(n_levels-26)) } else { stop("consider fewer groups") } names(colors_use) <- column_levels # saving colors for later results_list[['leiden_global_names']] <- names(colors_use) results_list[['leiden_global_colors']] <- as.character(colors_use) saveRDS(results_list, results_list_file) ``` Plot the cells in XY space and UMAP space and color based on `leiden_pr`. ```{r} #| label: V-2 #| eval: false #| code-summary: "R Code" #| code-fold: true group_prefix <- "leiden" # <1> pp_assets_dir <- file.path(analysis_asset_dir, "preprocessing") # XY (global only) plotDots(py$adata, color_by='leiden_pr', plot_global = TRUE, facet_by_group = FALSE, additional_plot_parameters = list( geom_point_params = list( size=0.001 ), scale_bar_params = list( location = c(5, 0), width = 2, n = 3, height = 0.1, scale_colors = c("black", "grey30"), label_nudge_y = -0.3 ), directory = pp_assets_dir, fileType = "png", dpi = 200, width = 8, height = 8, prefix=group_prefix ), additional_ggplot_layers = list( theme_bw(), xlab("X (mm)"), ylab("Y (mm)"), coord_fixed(), scale_color_manual(values = colors_use), theme(legend.position = c(0.8, 0.4)), guides(color = guide_legend( title="Leiden (PR)", override.aes = list(size = 3) ) ) ) ) # XY (facets only) plotDots(py$adata, color_by='leiden_pr', plot_global = FALSE, facet_by_group = TRUE, additional_plot_parameters = list( geom_point_params = list( size=0.001 ), scale_bar_params = list( location = c(5, 0), width = 2, n = 3, height = 0.1, scale_colors = c("black", "grey30"), label_nudge_y = -0.3 ), directory = pp_assets_dir, fileType = "png", dpi = 100, width = 5, height = 5, prefix=group_prefix ), additional_ggplot_layers = list( theme_bw(), xlab("X (mm)"), ylab("Y (mm)"), coord_fixed(), scale_color_manual(values = colors_use), theme(legend.position = c(0.8, 0.4)), guides(color = guide_legend( title="Leiden (PR)", override.aes = list(size = 3) ) ) ) ) # UMAP (global only) plotDots(py$adata, obsm_key = "umap_pr", color_by='leiden_pr', plot_global = TRUE, facet_by_group = FALSE, additional_plot_parameters = list( geom_point_params = list( size=0.001, alpha=0.1 ), geom_label_params = list( size = 4 ), labels_on_plot = data.frame(), directory = pp_assets_dir, fileType = "png", dpi = 200, width = 8, height = 8, prefix=group_prefix ), additional_ggplot_layers = list( theme_bw(), xlab("UMAP 1"), ylab("UMAP 2"), coord_fixed(), scale_color_manual(values = colors_use), # guides(color = guide_legend( # title="Cell Type", # override.aes = list(size = 3) ) ), theme(legend.position = "none") ) ) # UMAP (facets only) plotDots(py$adata, obsm_key = "umap_pr", color_by='leiden_pr', plot_global = FALSE, facet_by_group = TRUE, additional_plot_parameters = list( geom_point_params = list( size=0.001, alpha=0.1 ), geom_label_params = list( size = 2 ), labels_on_plot = data.frame(), directory = pp_assets_dir, fileType = "png", dpi = 100, width = 5, height = 5, prefix=group_prefix ), additional_ggplot_layers = list( theme_bw(), xlab("UMAP 1"), ylab("UMAP 2"), coord_fixed(), scale_color_manual(values = colors_use), guides(color = guide_legend( title="Cell Type", override.aes = list(size = 3) ) ) ) ) ``` 1. in the plotting below, we will use this name to parse out files. Here are the overall (global) plots. ::: {.panel-tabset} #### Leiden - UMAP ```{r} #| label: fig-umap-leiden #| message: false #| warning: false #| echo: false #| fig.width: 6 #| fig.height: 5 #| fig-cap: "UMAP with Leiden (PR-based) cells." #| eval: true knitr::include_graphics(file.path(analysis_asset_dir, "preprocessing", "leiden__umap_pr__S0__plot.png")) ``` #### Leiden - XY ```{r} #| label: fig-xy-celltype-broad #| message: false #| warning: false #| echo: false #| fig.width: 6 #| fig.height: 5 #| fig-cap: "XY with Leiden (PR-based) cells." #| eval: true knitr::include_graphics(file.path(analysis_asset_dir, "preprocessing", "leiden__spatial__S0__plot.png")) ``` ::: Here are the individual plots[^2] we created organized by Leiden cluster. From these pairs of plots one can quickly see how some Leiden clusters are spatially restricted. For example, Leiden 0 consists of cells in the top part of the tissue (_i.e._, the non-tumoral epithelium). Similarly, cluster 12 occupies the left-most region of the tissue. ::: {.panel-tabset} ```{r} #| output: asis #| eval: true #| code-fold: true #| echo: false fig_dir <- file.path(analysis_asset_dir, "preprocessing") group_prefix <- "leiden" slide_id <- "S0" xy_files <- Sys.glob(file.path(fig_dir, paste0(group_prefix, "__spatial__", slide_id, "__facet_*png"))) org_order <- as.numeric(gsub(".png", "", unlist(lapply(strsplit(xy_files, split="_facet_"), "[[", 2L)))) xy_files <- xy_files[match(sort(org_order), org_order)] umap_files <- Sys.glob(file.path(fig_dir, paste0(group_prefix, "__umap_pr__", slide_id, "__facet_*png"))) org_order <- as.numeric(gsub(".png", "", unlist(lapply(strsplit(umap_files, split="_facet_"), "[[", 2L)))) umap_files <- umap_files[match(sort(org_order), org_order)] res <- purrr::map2_chr(xy_files, umap_files, \(current_xy_file, current_umap_file){ knitr::knit_child(text = c( "### `r gsub('.png', '', strsplit(current_umap_file, split='__facet_')[[1]][2])`", "", ":::{columns}", "", "::: {.column width='40%'}", "", "```{r eval=TRUE}", "#| echo: false", "knitr::include_graphics(current_xy_file)", "```", "", ":::", "", "::: {.column width='10%'}", "", ":::", "", "::: {.column width='40%'}", "", "```{r eval=TRUE}", "#| echo: false", "knitr::include_graphics(current_umap_file)", "```", ":::", "", ":::", "", "", "" ), envir = environment(), quiet = TRUE) }) cat(res, sep = '\n') ``` ::: ## Conclusion And with that, we've navigated the essential pre-processing steps. Starting with our quality-controlled cell data, we applied normalization to account for technical variations, identified the most informative genes, and used PCA to reduce the data's dimensionality. Building upon this foundation, we generated UMAP embeddings for visualization and performed Leiden clustering to partition the cells into distinct groups based on their expression profiles. The interactive exploration highlighted a critical point: parameters matter. Choices made during normalization, dimensionality reduction, and clustering significantly influence the final UMAP layout and cluster assignments. There isn't a single "correct" set of parameters, but understanding their effects allows you to tailor the analysis to best reveal the biological structures relevant to your specific questions. The resulting AnnData object, now enriched with normalized data layers, PCA coordinates, UMAP embeddings, and Leiden cluster labels, is primed for the next crucial phase: assigning biological meaning to these patterns. In the following chapters we will delve into grouping cells based on their spatial domain and their cell types. [^1]: when I tried using the `sc.experimental.pp.normalize_pearson_residuals` method on this dataset, the OS crashed (>200 GBs of RAM). [^2]: To see how to generate this type of output, click the `</>` icon on the top right of this chapter and then click `view source`.