Classifying cells into types

Author
Affiliations

Evelyn Metzger

Bruker Spatial Biology

Github: eveilyeverafter

Okay, let’s chart the course for this crucial phase of our analysis expedition: cell typing. This is where we assign identities to the cell clusters discovered in the previous chapter, transforming abstract groups into recognizable biological entities like T cells, macrophages, or tumor cells.

If pre-processing was about plotting our trajectory through high-dimensional space, cell typing is akin to navigating a dense asteroid field. It’s a critical step, but one fraught with potential obstacles. Assigning definitive labels can be challenging, often requiring careful consideration of marker genes and existing biological knowledge. What constitutes a “correct” cell type label can even be subjective, depending entirely on the resolution needed for your specific biological questions. Are you aiming to distinguish broad categories like “immune” versus “tumor,” or do you need to pinpoint specific T cell subtypes? Like adjusting a microscope’s focus, the level of detail required dictates the approach. Navigating this field requires careful maneuvering and the right instruments. Autopilot hasn’t been invented just yet and so the best recommendation need manual input.

One common hazard is attempting to directly apply analysis pipelines designed for spatially-dissociated scRNA-seq. While scRNA-seq vignettes offer valuable starting points, CosMx SMI data, generated by direct in situ hybridization, has fundamentally different characteristics. Simply copy-pasting scRNA-seq methods without adaptation can lead to suboptimal or incomplete cell type assignments. Instead, we must employ or adapt methods that leverage the strengths of high-plex, direct hybridization data to achieve reliable annotations. In doing so it’s possible to achieve the most comprehensive and biologically-rich cell typing and sub-typing possible in spatial transcriptomics.

It’s also crucial to remember that cell typing, while essential, isn’t the final destination of our journey. It’s a powerful tool for organizing our cellular map, allowing us to ask spatially-aware questions. However, other organizational frameworks, such as classifying cells based on their spatial neighborhood or “niche,” are equally important for understanding the tissue’s complex ecosystem. Ultimately, cell typing is foundational: it establishes the structural anatomy of the dataset, paving the way for us to investigate its functional physiology in the next chapters.

In this chapter, we will navigate the cell typing process in the colon dataset. The main goal is to generate cell type labels that are “close” while also being approachable. I’ll begin (Section 6.1) by surveying some of the spatially unaware methods and then summarize the state of the direct hybridization-inspired approaches. Then I will guide you through the specific, hybrid workflow used for this colon dataset. We will start with the broad Leiden clusters defined in the previous chapter, refine them based on spatial observations, and assign primary identities using marker gene analysis. Finally, for the more complex immune populations where standard clustering often struggles, we will employ HieraType, a probabilistic method explicitly designed to handle the specific characteristics of CosMx SMI data.

R code
source("./preamble.R")
reticulate::source_python("./preamble.py")
analysis_dir <- file.path(getwd(), "analysis_results")
input_dir <- file.path(analysis_dir, "input_files")
output_dir <- file.path(analysis_dir, "output_files")
analysis_asset_dir <- "./assets/analysis_results"
ct_dir <- file.path(output_dir, "ct")
if(!dir.exists(ct_dir)){
  dir.create(ct_dir, recursive = TRUE)
}
results_list_file = file.path(analysis_dir, "results_list.rds")
if(!file.exists(results_list_file)){
  results_list <- list()
  saveRDS(results_list, results_list_file)
} else {
  results_list <- readRDS(results_list_file)  
}

library(scales)
library(Matrix)
source("./helpers.R")

6.1 Cell Typing Approaches

6.1.1 Non-spatial Cell Typing Approaches

Before discussing spatially-aware or technology-specific approaches, let’s briefly cover common methods designed for non-spatial, single-cell RNA sequencing analysis. These techniques primarily leverage the gene expression matrix (adata.X and normalized layers) and the cluster assignments (e.g., adata.obs['leiden']) derived in the previous chapter. Many excellent vignettes exist for these methods in popular frameworks like scanpy (Python) and Seurat (R), providing a rough “10,000-foot view” of the data’s cellular composition.

Cluster Marker Gene Identification (Leiden/Louvain): This is often a foundational step in scRNA-seq. After obtaining clusters (like the Leiden clusters from Chapter 4), differential expression tests are performed between each cluster and all other cells (or between pairs of clusters). This identifies genes significantly upregulated within a specific cluster. Tools like scanpy’s rank_genes_groups or Seurat’s FindMarkers are standard. The resulting lists of marker genes are then manually compared against known canonical markers from literature or cell atlases to assign putative cell type identities (e.g., observing high CD3E, CD8A suggests a T cell cluster; high EPCAM, KRT19 suggests epithelial cells). While effective, this relies heavily on prior biological knowledge and the quality of the initial clustering. It also assumes that transcripts found in the cell do not arise from any spatial co-localization or imperfect cell-cell segmentation boundaries that can arise when assigning transcripts to cells.

Nested / Hierarchical Clustering Analysis: Sometimes, broad initial clusters (like “Immune Cells” or “Fibroblasts”) contain sub-populations that are meaningful for your analysis. A common refinement is then to subset the data to include only cells from a specific Leiden cluster (or set of clusters) and then re-run the clustering process within that subset. Identifying marker genes for these sub-clusters can reveal finer cell types or states (e.g., distinguishing CD4+ from CD8+ T cells, or identifying different fibroblast subtypes. This iterative approach allows for a hierarchical understanding of cell identity; however, the imperfect segmentation noted above is amplified in this approach. Consider a workflow where one uses HVGs to create an initial set of “Elementary” clusters and uses a marker gene-based approach to label these broad clusters. When subsetting down to, say, fibroblasts and rerunning the above HVG -> Leiden -> marker gene workflow, one may get non-fibroblast marker genes returned. The reason for this is twofold. First the general fibroblast marker genes are no longer as differentially expressed in the subset of Fibroblasts as they were in the full data and 2) if there are cell segmentation errors of fibroblast sub-populations that are spatially-associated with unrelated cell types – like plasma cells – the HVG algorithm may identify spurious gene markers as highly variable. The result would be markers like JCHAIN in fibroblasts that happen to be located near a germinal center, tertiary lymphoid structure, or similar. Thus, while this nested approach can be valuable, it should be used with caution. For more information on this overlap concept, see Dan McGuire’s blog post on smiDE.

Reference-Based Annotation: Instead of relying solely on de novo marker finding, many methods leverage existing annotated single-cell datasets (atlases) as references. These algorithms typically project the query data onto the reference or use statistical models trained on the reference to directly assign cell type labels to your individual cells or clusters. Popular tools include:

  • scanpy.tl.ingest: Integrates query data onto an existing annotated AnnData object.

  • Seurat’s FindTransferAnchors and TransferData: A widely used method in the R ecosystem.

  • InSituType’s fully supervised method: InSituType – while designed for spatial data – can be used as a fully supervised classifier. I’m including it here since many reference datasets and atlases at the time of writing are based on scRNA-seq data.

6.1.2 Spatially-Informed and Hybridization-Aware Methods

Unlike standard scRNA-seq workflows, these approaches leverage the unique features of in situ data: specifically the spatial coordinates, the negative probe background, and the available protein data.

Background-Aware Modeling. Methods like InSituType explicitly model the background noise using the negative control probes. This allows for more accurate classification of cells with low transcript counts that might be discarded or misclassified by standard scRNA-seq tools

Spatial-Molecular Joint Inference. Tools like Banksy utilize the expression of a cell’s physical neighbors to inform its identity. This leverages the biological reality that cells of the same type often cluster spatially, helping to smooth out technical noise in individual cells.

Multi-modal Integration. CosMx SMI often includes protein markers (e.g., CD45, PanCK) in the form of IF markers or – more recently – with true RNA and protein same-slide multiomics. “Dual-mode” typing strategies use these robust protein signals to define major lineages (e.g., “This cell is definitely CD45+ Immune”) before using the RNA data to determine the specific subtype (e.g., “It is a CD8+ T Cell”). Similarly, HieraType is a new R package for supervised classification that can combine multi-omics information to cell type. HieraType also classifies RNA-only data (as we’ll see below).

Reference-Based Annotation: Reference profiles based on CosMx SMI can be used to more quickly cell type new datasets. This approach can be much faster than cell typing de novo but requires such annotations to be available.

6.2 Cell Type Determination

Just like there are many paths through an asteroid field, there are many ways to navigate cell typing. I find the following approach makes a good starting place for any analysis as it’s relatively fast while still allowing for human intervention.

  1. Use Leiden clusters identified in Chapter 4 as the basis. 1.1. Optional: refine Leiden clusters based on discovered biology. For example, there may be some Leiden clusters that makes sense to merge together or some evidence that splitting a Leiden cluster into two or splitting a given Leiden cluster into multiple sub-clusters based on the spatial layout of cells.
  2. Run HieraType to classify cells and use its integration function to merge its supervised classifications with the unsupervised Leiden clusters.
  3. Of the remaining unclassified cells, use marker genes to infer the most likely cell type.

6.2.1 Step 1: Read Leiden clusters

Load the data and make a copy of the leiden_pr metadata column that we’ll use.

Python Code
adata = ad.read_h5ad(os.path.join(r.analysis_dir, "anndata-2-leiden.h5ad"))

cluster_renaming = {}
initial = "C"
for name in adata.obs['leiden_pr'].unique():
  cluster_renaming[name] = f"{initial}-{name}"
adata.obs['leiden'] = adata.obs['leiden_pr'].map(cluster_renaming)

6.2.2 Step 2: Run HieraType

Install the HieraType package (and remotes), if needed. Then extract the lists of marker genes.

R Code
install.packages("mvtnorm")

remotes::install_github("Nanostring-Biostats/CosMx-Analysis-Scratch-Space",
                         subdir = "_code/HieraType", ref = "Main")

library(HieraType)

markerg <- unlist(lapply(c(
  HieraType::markerslist_l1
  ,HieraType::markerslist_immune
  ,HieraType::markerslist_cd8tminor
  ,HieraType::markerslist_cd4tminor
  ,HieraType::markerslist_tcellmajor
), "[[", "predictors"))

marker_genes <- as.character(markerg)

Convert data to R.

Python Code
total_counts_per_cell = adata.layers['counts'].sum(axis=1)
total_counts_per_cell = np.asarray(total_counts_per_cell).flatten()

total_counts_per_target = adata.layers['counts'].sum(axis=0)
total_counts_per_target = np.asarray(total_counts_per_target).flatten()

target_frequencies = total_counts_per_target / total_counts_per_target.sum()

# Include all HVGs and marker genes (even if not in the HVGs)
marker_genes = r.marker_genes
marker_mask = adata.var.index.isin(marker_genes)
highly_variable_mask = adata.var['highly_variable'].to_numpy()
adata.var['hieratype_target'] = (highly_variable_mask | marker_mask)

adata_ht = adata[:, adata.var.hieratype_target].copy()
ht_counts = adata_ht.layers['counts'].copy().astype(np.int64)
1
In order to use this matrix in R, we need to convert the 32 bit integer counts into 64 bit integers.

Follow this tutorial to refine Leiden clusters with immune cells.

R Code
ht_counts <- py$ht_counts
rownames(ht_counts) <- py$adata_ht$obs_names$to_list()
colnames(ht_counts) <- py$adata_ht$var_names$to_list()
ht_counts <- as(ht_counts, "CsparseMatrix")

pca_embeddings <- py$adata_ht$obsm['X_pca_pr'] # n_cells x n_pcs

target_frequencies <- py$target_frequencies 
names(target_frequencies) <- py$adata$var_names$to_list()
target_frequencies <- target_frequencies[colnames(ht_counts)] 

total_counts_per_cell <- py$total_counts_per_cell

# Assumes this was based on cosine.
sim_graph <- py$adata_ht$obsp['neighbors_pr_connectivities']
sim_graph <- as(sim_graph, "CsparseMatrix")
rownames(sim_graph) <- colnames(sim_graph) <- rownames(ht_counts)

Run the pipeline.

R Code
pipeline <- HieraType::make_pipeline(
  markerslists = list(
    "l1" = HieraType::markerslist_multiomic_l1
    ,"l2" = HieraType::markerslist_multiomic_immune
    ,"lt" = HieraType::markerslist_multiomic_tcellmajor
    ,"lt4minor" = HieraType::markerslist_multiomic_cd4tminor
    ,"lt8minor" = HieraType::markerslist_multiomic_cd8tminor
  )
  ,priors = list(
    "l2" = "l1"
    ,"lt" = "l2"
    ,"lt4minor" = "lt"
    ,"lt8minor" = "lt"
  )
  ,priors_category = list(
    "l2" = "immune"
    ,"lt" = "tcell"
    ,"lt4minor" = "cd4t"
    ,"lt8minor" = "cd8t"
  )
)

rnactobj <- HieraType::run_pipeline(
  pipeline = pipeline,
  counts_matrix = ht_counts,
  adjacency_matrix = sim_graph[rownames(ht_counts), rownames(ht_counts)],
  totalcounts = total_counts_per_cell,
  gene_wise_frequency = target_frequencies
)
saveRDS(rnactobj, file=file.path(ct_dir, "rnactobj.rds"))

Merge the Leiden cluster calls with the HieraType results. There are several ways we could do this but for this tutorial we’ll use HieraType’s built-in celltype_label_integration function using the default parameters. celltype_label_integration integrates the immune celltype annotations from HieraType with unsupervised cluster labels. The goal of this step is to enable granular detection of novel or tissue-specific cell types (unsupervised), along with well known immune cell types (HieraType). The logic for merging these annotations works like this:

  1. All cells which are annotated with an immune-celltype HieraType label keep their label.
  2. All cells with a non-immune HieraType label take the unsupervised cluster label.
  3. Unsupervised clusters that had a high rate of cells taking immune labels (say 90%), and make up only a small proportion of all cells (say 1%) after step 1 are ‘dissolved’. These dissoved cells are assigned the most common cluster label (could be immune or unsupervised) amongst most the similar neighboring cells. Here we used the neighbor cells from the adjacency matrix that was used to run HieraType.
R Code
cell_type_dt <- rnactobj$post_probs$l1
cell_type_dt <- rename(cell_type_dt, cell_id = cell_ID,
                       Hieratype_celltype_fine = celltype_granular)

to_merge <- py$adata$obs %>% select(leiden)

cell_type_dt$Leiden <- to_merge$leiden[match(rownames(to_merge), cell_type_dt$cell_id)]

cell_type_dt <- celltype_label_integration(
  metadata = cell_type_dt, 
  adjacency_mat = sim_graph[rownames(ht_counts), rownames(ht_counts)],
  cellid_colname = 'cell_id',
  supervised_colname = 'Hieratype_celltype_fine',
  unsupervised_colname = 'Leiden')

cell_type_dt <- rename(cell_type_dt, merged = celltype)

saveRDS(cell_type_dt, file=file.path(ct_dir, "cell_type_dt.rds"))
1
cell_id here is the row name from the anndata object (as opposed to the c_i_j_k cell_ID notation that is sometimes used).

Save the new calumns back to the anndata object.

Python Code
cell_type_df = r.cell_type_dt

adata.obs = adata.obs.join(cell_type_df[['cell_id', 'Hieratype_celltype_fine', 'merged']].set_index('cell_id'))

6.2.3 Step 3: Classify remaining cells using marker genes

For the cells still assigned a Leiden cluster label, run a marker gene analysis to determine the dominant cell type per cluster. We’ll subset larger clusters down to 10,000 cells to speed up the computation. Since scanpy’s tl.rank_genes_groups function needs normalizaed data, we’ll use the log1p-transformed matrix that we generated in Chapter 4.

bdata = adata[adata.obs['leiden']==adata.obs['merged'], :].copy()

random_seed = 98103
np.random.seed(random_seed)

target_cells_per_cluster = 10000
cells_to_keep_indices = []
cluster_counts = bdata.obs['merged'].value_counts()

for cluster_label in cluster_counts.index:
    cluster_cells = bdata.obs_names[bdata.obs['merged'] == cluster_label]
    num_cells_in_cluster = len(cluster_cells)

    if num_cells_in_cluster > target_cells_per_cluster:
        print(f"Cluster '{cluster_label}' has {num_cells_in_cluster} cells, subsampling to {target_cells_per_cluster}.")
        subsampled_cells = np.random.choice(
            cluster_cells,
            size=target_cells_per_cluster,
            replace=False
        )
        cells_to_keep_indices.extend(subsampled_cells)
    else:
      cells_to_keep_indices.extend(cluster_cells)


bdata_subsampled = bdata[cells_to_keep_indices, :].copy()

bdata_subsampled.X = bdata_subsampled.layers['TC'].copy()
1
If you do not have the log1p-transformed total counts layer ‘TC’ saved, run sc.pp.normalize_total and sc.pp.log1p (see ?sec-tc-workflow).

Run rank_genes_groups and save results.

ct_dir = r.ct_dir
sc.settings.figdir = ct_dir

sc.tl.rank_genes_groups(bdata_subsampled, 'leiden', method='wilcoxon',
  solver='saga', max_iter=200, pts=True) # ~an hour using logreg
sc.pl.rank_genes_groups(bdata_subsampled, n_genes=8, sharey=True, fontsize=10, save="_marker_gene_ranks_global.png")
sc.pl.rank_genes_groups_dotplot(bdata_subsampled, n_genes=8, save="_marker_gene_dotplot_global.png")

filename = os.path.join(r.analysis_dir, "anndata-5-subset-marker-genes.h5ad")

bdata_subsampled.write_h5ad(
  filename,
  compression=hdf5plugin.FILTERS["zstd"],
  compression_opts=hdf5plugin.Zstd(clevel=5).filter_options
)

The next task is determine from the marker gene results the dominant cell types that are likely found within each Leiden cluster. While some Leiden clusters will have a high proportion of cells that are of the same type or state, keep in mind that other Leiden clusters might be composites of cell types. The tools we have for this exercise are:

  • visual aids from the marker gene results and
  • compare with external sources. The literature and subject matter expertise is the gold-standard for this.
Figure 6.1: Marker gene ranks by Leiden cluster (global).
Figure 6.2: Marker gene dot plot by Leiden cluster (global).

Examine the clusters by extracting the top 25 markers and their relative scores and comparing to the literature. Large Language Models can provide supportive information here but it’s always recommended to use LLMs with caution and to verify any of the results. In other words, LLMs can be fast but at the cost of accuracy. In the example below, I created an LLM prompt that includes the top marker gene statistics. Let’s see what it generates.

marker_dict, marker_df = format_ranked_genes(bdata_subsampled, 25)

csv_file_path = os.path.join(ct_dir, "colon_cancer_top_marker_genes_global.csv")
marker_df.to_csv(csv_file_path, index=True)

print(f"I am analyzing a CosMx SMI dataset of from Colon cancer and have run Leiden clustering.\nI have already classified several of the immune cells within the larger dataset and now I would like your help in predicting the most like cell type given the marker gene profile from this subset of data.\nBelow is a dictionary of results where the key represents the Leiden cluster\nname and its corresponding item is a list of ordered tuples where each tuple\nprovides the marker name and the relative score for that marker.\n\n Please provide a table with the columns that represent 1. inferred cell type 2. your confidence in that cell type 3. up to five markers from the list that are key to your result and 4. a brief justification of your result. Please present the results in a markdown table format to make it easier to copy and paste.\n\n{marker_dict}")
Table 6.1: Estimating most likely cell type from Leiden clusters’ marker gene rankings. Note: predictions were generated with the aid of an LLM.
Cluster Inferred Cell Type Confidence Top 5 Key Markers Brief Justification
C-0 Epithelial (Goblet/Secretory) High PIGR, MUC2, EPCAM, KRT8, CLDN7 Expresses broad epithelial markers (EPCAM, KRT8) alongside specific secretory/mucosal markers (PIGR, MUC2) typical of normal or well-differentiated colon epithelium.
C-3 Plasma Cells (IgG) High IGHG1, IGKC, MZB1, JCHAIN, XBP1 Defined by very high expression of Immunoglobulin G heavy chains (IGHG1) and plasma cell maturation factors (MZB1, JCHAIN).
C-4 CAF High MMP14, MMP2, FN1, COL1A1, SPARC Spatially coincident with tumor. The high levels of matrix metalloproteinases (MMP14, MMP2) and Fibronectin (FN1) identify these as the activated myofibroblasts driving tumor desmoplasia.
C-5 Unclassified Low MPV17L, ZNF91, DUX4, CXCL8, USP17L Lacks clear lineage markers (No EPCAM, CD45, or COL1A1).
C-6 Immune (Mixed) Moderate KIR3DL1, CD79A, CSF2, CD53, RAG1 Contains conflicting a mix of immune markers: NK cell (KIR3DL1) and B cell (CD79A) markers appear together. Likely represents lymphoid aggregates.
C-7 Endothelial Cells High PECAM1, VWF, PLVAP, EGFL7, SHANK3 Distinct expression of vascular endothelium markers, including PECAM1 (CD31) and VWF. PLVAP suggests fenestrated capillaries common in the gut.
C-8 Plasma Cells (IgA) High IGHA1, JCHAIN, IGKC, GRP, MZB1 Similar to C-3 but distinguished by the specific expression of IgA (IGHA1), the dominant isotype in mucosal immunity of the colon.
C-9 Tumor Epithelial (Invasive) High CEACAM5, MMP7, LCN2, KRT8, CLDN4 Malignant epithelial profile (EPCAM, KRT8) with high expression of tumor markers (CEACAM5/CEA) and invasion/stress markers (MMP7, LCN2).
C-10 Structural Fibroblasts High SFRP2, MGP, FBLN1, C3, CXCL12 Spatially coincident with muscle (SMC). The profile (SFRP2+, MGP+, FBLN1+) matches “Adventitial” or “Universal” fibroblasts that act as the structural scaffold for muscle layers and vessels.
C-11 Tumor Epithelial (High Cycling) High RPS19, CEACAM5, EPCAM, RPS18, PIGR Epithelial tumor cells (CEACAM5) dominated by ribosomal proteins (RPS/RPL), indicating high protein synthesis and rapid cell cycling.
C-12 Smooth Muscle Cells High MYH11, ACTG2, DES, CNN1, MYLK Expression of contractile proteins (MYH11, ACTG2) and Desmin (DES) identifies these as mural smooth muscle cells (muscularis mucosae/propria).
C-13 Enteric Glia / Schwann Cells High CDH19, PMP22, SCN7A, LGI4, CLU Presence of peripheral nervous system markers (PMP22, CDH19, SCN7A) identifies these as supportive glial cells of the Enteric Nervous System.
C-14 Pericytes High NOTCH3, PDGFRB, MCAM, COL4A1, RGS5 Expresses pericyte markers (PDGFRB, MCAM, NOTCH3) and basement membrane collagen (COL4A1), distinguishing them from the broader fibroblast clusters.

Following a little back and forth and helping provide the LLM with additional spatial context, we have the marker gene results in Table 6.1, we have most of the structural cell types identified from the remaining clusters. Cluster C-5 lacks a clear and dominant lineage signature. This cluster also occupies roughly the center of the UMAP where poorer quality cells sometimes occupy. For the purposes of this tutorial and the downstream questions, we’ll leave those cells unclassified and label them as “undetermined”. Similarly, cluster C-6 showed two distinct lienage markers for NK cells (KIR3DL1) and B cells (CD79A). Since these markers are unlikely to be present in the same cell it is likely that C-6 comprises a composite of NK and lymphoid cell types. Interestingly, HieraType did not classify the cells within as either NK or B cell. Since our primary interest are on tumor and CAFs, we’ll not unravel this cluster any further and instead label it as a mix of cell types.

From the HieraType-derived merged data, map these new cell type labels.

remap = {
    'C-0': 'epithelial',
    'C-3': 'plasma_IgG',
    'C-4': 'CAF',
    'C-5': 'undetermined',
    'C-6': 'mix_NK_Lymphoid',
    'C-7': 'endothelial',
    'C-8': 'plasma_IgA',
    'C-9': 'tumor',
    'C-10': 'fibroblast',
    'C-11': 'tumor_cyling',
    'C-12': 'smc',
    'C-13': 'glial',
    'C-14': 'pericyte'
}


adata.obs['celltype_fine'] = adata.obs['merged'].astype(str).replace(remap)

Often we want to plot or work with broader classifications of cell types. The code below combines, for example, all the various T cell subtypes into a single group.

Python Code
broad_map = {
    'cd8_tem': 'tcell',
    'cd4_treg': 'tcell',
    'cd4_tem': 'tcell',
    'cd8_cytotoxic': 'tcell',
    'cd8_naive': 'tcell',
    'cd8_exhausted': 'tcell',
    'cd4_tcm': 'tcell',
    'cd4_th2': 'tcell',
    'cd4_th17': 'tcell',
    'cd4_naive': 'tcell',
    'cd4_th1': 'tcell',
    'cd8_tcm': 'tcell',

    'bcell': 'bcell',
    'plasma_IgA': 'plasma',
    'plasma_IgG': 'plasma',

    'tumor': 'tumor',
    'tumor_cyling': 'tumor',
    'epithelial': 'epithelial',

    'CAF': 'CAF',
    'fibroblast': 'fibroblast',
    'smc': 'smc',
    'pericyte': 'pericyte',
    'endothelial': 'endothelial',
    'glial': 'glial',

    'macrophage': 'macrophage',
    'monocyte': 'monocyte',
    'dendritic': 'dendritic',
    'mast': 'mast',
    'neutrophil': 'neutrophil',
    'nk': 'nk',
    'mix_NK_Lymphoid': 'mix_NK_Lymphoid',

    'undetermined': 'undetermined'
}

adata.obs['celltype_broad'] = adata.obs['celltype_fine'].map(broad_map)

Which provides 18 broad cell types and 32 fine cell types.

Save the results to disk.

R Code
filename = os.path.join(r.analysis_dir, "anndata-6-celltypes.h5ad")
adata.write_h5ad(
  filename,
  compression=hdf5plugin.FILTERS["zstd"],
  compression_opts=hdf5plugin.Zstd(clevel=5).filter_options
)

6.3 Cell Type Visualization

It is useful at this point to visualize the distributions of cells in a variety of ways to get an understanding of their composition. We can visualize these cell types in UMAP space and XY space.

6.3.1 Broad cell types

Instead of the default colors in plotDots, let’s create manual ones.

R Code
# define colors semi-manually instead of using the default colors
column <- 'celltype_broad'
column_levels <- levels(py$adata$obs[[column]])
n_levels <- length(column_levels)

if(n_levels<27){
  colors_use <- pals::alphabet(n_levels)
} else if(n < 53){
  colors_use <- c(pals::alphabet(26), pals::alphabet2(n_levels-26))
} else {
  stop("consider fewer groups")
}

names(colors_use) <- column_levels

colors_use['mast'] = "#FFCC99"
colors_use['tumor'] = "#C20088"
colors_use['undetermined'] = "#808080"
colors_use['mix_NK_Lymphoid'] = "#8080FF"


# saving colors for later
results_list[['celltype_broad_names']] <- names(colors_use)
results_list[['celltype_broad_colors']] <- as.character(colors_use)
saveRDS(results_list, results_list_file)

Plot the cells in XY space and UMAP space and color based on celltype_broad.

R Code
group_prefix <- "broad"

ct_assets_dir <- file.path(analysis_asset_dir, "ct")

# XY (global only)
plotDots(py$adata, color_by='celltype_broad',
              plot_global = TRUE,
              facet_by_group = FALSE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001
                  ),
                  scale_bar_params = list(
                    location = c(5, 0),
                    width = 2,
                    n = 3,
                    height = 0.1,
                    scale_colors = c("black", "grey30"),
                    label_nudge_y = -0.3
                  ),
                  directory = ct_assets_dir,
                  fileType = "png",
                  dpi = 200,
                  width = 8, 
                  height = 8,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("X (mm)"),
                ylab("Y (mm)"), 
                coord_fixed(),
                scale_color_manual(values = colors_use),
                theme(legend.position = c(0.8, 0.4)),
                guides(color = guide_legend(
                  title="Cell Type",
                  override.aes = list(size = 3) ) )
              )
              )

# XY (facets only)
plotDots(py$adata, color_by='celltype_broad',
              plot_global = FALSE,
              facet_by_group = TRUE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001
                  ),
                  scale_bar_params = list(
                    location = c(5, 0),
                    width = 2,
                    n = 3,
                    height = 0.1,
                    scale_colors = c("black", "grey30"),
                    label_nudge_y = -0.3
                  ),
                  directory = ct_assets_dir,
                  fileType = "png",
                  dpi = 100,
                  width = 5,
                  height = 5,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("X (mm)"),
                ylab("Y (mm)"), 
                coord_fixed(),
                scale_color_manual(values = colors_use),
                theme(legend.position = c(0.8, 0.4)),
                guides(color = guide_legend(
                  title="Cell Type",
                  override.aes = list(size = 3) ) )
              )
              )

# UMAP (global only)
plotDots(py$adata, 
              obsm_key = "umap_pr",
              color_by='celltype_broad',
              plot_global = TRUE,
              facet_by_group = FALSE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001, alpha=0.1
                  ),
                  geom_label_params = list(
                    size = 2
                  ),
                  labels_on_plot = data.frame(),
                  directory = ct_assets_dir,
                  fileType = "png",
                  dpi = 200,
                  width = 8,
                  height = 8,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("UMAP 1"),
                ylab("UMAP 2"), 
                coord_fixed(),
                scale_color_manual(values = colors_use),
                # guides(color = guide_legend(
                #   title="Cell Type",
                #   override.aes = list(size = 3) ) ), 
                theme(legend.position = "none")
              )
            )

# UMAP (facets only)
plotDots(py$adata, 
              obsm_key = "umap_pr",
              color_by='celltype_broad',
              plot_global = FALSE,
              facet_by_group = TRUE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001, alpha=0.1
                  ),
                  geom_label_params = list(
                    size = 2
                  ),
                  labels_on_plot = data.frame(),
                  directory = ct_assets_dir,
                  fileType = "png",
                  dpi = 100,
                  width = 5,
                  height = 5,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("UMAP 1"),
                ylab("UMAP 2"), 
                coord_fixed(),
                scale_color_manual(values = colors_use),
                guides(color = guide_legend(
                  title="Cell Type",
                  override.aes = list(size = 3) ) )
              )
            )
1
in the plotting below, we will use this name to parse out files.

Here are the overall (global) plots.

Figure 6.3: UMAP with broad cell type in the full dataset.
Figure 6.4: XY with broad cell type in the full dataset.

And here are the facetted plots.

6.3.2 Fine cell types

This section performs a similar routine as above but with the fine cell type classifications.

For the fine cell types, we’ll create colors that are various shades of the parent type.

R Code
broad_map <- unlist(py$broad_map)

generate_fine_colors <- function(mapping, broad_colors, mode = "shades") {
  groups <- split(names(mapping), mapping)
  fine_colors <- c()
  for (broad_name in names(groups)) {
    subtypes <- groups[[broad_name]]
    n <- length(subtypes)
    base_col <- broad_colors[broad_name]
    if (n == 1) {
      cols <- base_col
      names(cols) <- subtypes
    } else {
      if (mode == "shades") {
        ramp_func <- colorRampPalette(c(
          adjustcolor(base_col, transform=diag(c(1,1,1,1))*1.4), # Lighten
          base_col,
          adjustcolor(base_col, transform=diag(c(1,1,1,1))*0.6)  # Darken
        ))
        cols <- ramp_func(n)
      } else {
        ramp_func <- colorRampPalette(c("#FFFFFF", base_col, "#000000"))
        cols <- ramp_func(n + 2)[2:(n + 1)]
      }
      names(cols) <- subtypes
    }
    fine_colors <- c(fine_colors, cols)
  }
  return(fine_colors)
}

fine_colors <- generate_fine_colors(broad_map, colors_use)

results_list[['celltype_fine_names']] <- names(fine_colors)
results_list[['celltype_fine_colors']] <- as.character(fine_colors)

saveRDS(results_list, results_list_file)
R Code
group_prefix <- "fine"

ct_assets_dir <- file.path(analysis_asset_dir, "ct")

# XY (global only)
plotDots(py$adata, color_by='celltype_fine',
              plot_global = TRUE,
              facet_by_group = FALSE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001
                  ),
                  scale_bar_params = list(
                    location = c(5, 0),
                    width = 2,
                    n = 3,
                    height = 0.1,
                    scale_colors = c("black", "grey30"),
                    label_nudge_y = -0.3
                  ),
                  directory = ct_assets_dir,
                  fileType = "png",
                  dpi = 200,
                  width = 8, 
                  height = 8,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("X (mm)"),
                ylab("Y (mm)"), 
                coord_fixed(),
                scale_color_manual(values = fine_colors),
                theme(legend.position = c(0.8, 0.4)),
                guides(color = guide_legend(
                  title="Cell Type",
                  override.aes = list(size = 3) ) )
              )
              )

# XY (facets only)
plotDots(py$adata, color_by='celltype_fine',
              plot_global = FALSE,
              facet_by_group = TRUE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001
                  ),
                  scale_bar_params = list(
                    location = c(5, 0),
                    width = 2,
                    n = 3,
                    height = 0.1,
                    scale_colors = c("black", "grey30"),
                    label_nudge_y = -0.3
                  ),
                  directory = ct_assets_dir,
                  fileType = "png",
                  dpi = 100,
                  width = 5,
                  height = 5,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("X (mm)"),
                ylab("Y (mm)"), 
                coord_fixed(),
                scale_color_manual(values = fine_colors),
                theme(legend.position = c(0.8, 0.4)),
                guides(color = guide_legend(
                  title="Cell Type",
                  override.aes = list(size = 3) ) )
              )
              )

# UMAP (global only)
plotDots(py$adata, 
              obsm_key = "umap_pr",
              color_by='celltype_fine',
              plot_global = TRUE,
              facet_by_group = FALSE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001, alpha=0.1
                  ),
                  geom_label_params = list(
                    size = 2
                  ),
                  labels_on_plot = data.frame(),
                  directory = ct_assets_dir,
                  fileType = "png",
                  dpi = 200,
                  width = 8,
                  height = 8,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("UMAP 1"),
                ylab("UMAP 2"), 
                coord_fixed(),
                scale_color_manual(values = fine_colors),
                # guides(color = guide_legend(
                #   title="Cell Type",
                #   override.aes = list(size = 3) ) ), 
                theme(legend.position = "none")
              )
            )

# UMAP (facets only)
plotDots(py$adata, 
              obsm_key = "umap_pr",
              color_by='celltype_fine',
              plot_global = FALSE,
              facet_by_group = TRUE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001, alpha=0.1
                  ),
                  geom_label_params = list(
                    size = 2
                  ),
                  labels_on_plot = data.frame(),
                  directory = ct_assets_dir,
                  fileType = "png",
                  dpi = 100,
                  width = 5,
                  height = 5,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("UMAP 1"),
                ylab("UMAP 2"), 
                coord_fixed(),
                scale_color_manual(values = fine_colors),
                guides(color = guide_legend(
                  title="Cell Type",
                  override.aes = list(size = 3) ) )
              )
            )
Figure 6.5: UMAP with fine cell type labels.
Figure 6.6: XY with fine cell type labels.

And here are the facetted plots.

Note that within the SMC XY plots we can see an “outline” that appears to be the muscularis mucosa.

6.4 Integration with Spatial Domains

Recall in Chapter 5 that we used novae to define spatial domains. Now that we have cell type information, let’s look at the composition of cell types within each domain and assign a biological function to them.

Since we only need the metadata associated with the spatial domain results, we’ll only load the obs.

Python Code
focal_domain_resolution = 'novae_domains_6'

if 'adata_domains' not in dir():
  filename = os.path.join(r.analysis_dir, "anndata-3-novae.h5ad")
  adata_domains = ad.read_h5ad(filename, backed='r')

meta_domains = adata_domains.obs.copy()
columns_to_copy = [focal_domain_resolution]

adata.obs[columns_to_copy] = meta_domains[columns_to_copy]

Create a contingency table for the domain and cell type assignments.

R Code
focal_domain_resolution <- py$focal_domain_resolution
focal_celltype_resolution <- "celltype_fine"

df <- py$adata$obs %>% select(
    all_of(focal_domain_resolution), 
    all_of(focal_celltype_resolution))

plot_data <- df %>% 
  group_by(.data[[focal_domain_resolution]], 
           .data[[focal_celltype_resolution]]) %>% 
  summarise(count = n()) %>% 
  mutate(proportion = count / sum(count))

# Create the geom_tile plot
p <- ggplot(plot_data, aes(x = .data[[focal_domain_resolution]],
                           y = .data[[focal_celltype_resolution]], 
                           fill = proportion)) +
  geom_tile(color = "white") +
  geom_text(aes(label = paste0(round(100*proportion, 3), "%")), color = "black", size=2) +
  scale_fill_gradient(low = "white", high = "dodgerblue") +  # Set color scale
  labs(y = "Cell Type", x = "Spatial Domain", fill = "Proportion") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

ggsave(p, filename = file.path(ct_dir, "contingency_table.png"), 
       width=5, height=7)
write.table(plot_data, "./tmp.csv", col.names=T, row.names=F, quote=F)
Figure 6.7: The proportion of a given cell type found in each domain. For example, domain D1015 is comprised of 65.3% smooth muscle cells (SMCs).

From this information and the spatial layout of domains we can assign a function names Table 6.2. Just looking at the composition, the differences between three of the domains (D1011, D1015, and D1016) appear to have varying degrees of tumor, cycling tumor, and CAFs. D1015 has relatively more CAFs so let’s call that a Desmoplastic Stroma domain and let’s label D1016, which has the highest concentration of tumor cells, the Tumor Core. D1011 has abundant tumor (cycling, specifically) and also 12% B cell and 20% (normal) epithelial cells. Spatially, D1011 occurs throughout lymphoid structures within the epithelium and around the tumor region (invasive margin).

Table 6.2: Functional names associated with each novae domain.
Domain Inferred Biological Identity Top 3 Cell Types (Proportion)
D1000 Normal Mucosa 1. epithelial (53.4%), 2. plasma_IgA (17.2%), 3. mix_NK_Lymphoid (6.8%)
D1003 Muscularis Layer 1. smc (65.3%), 2. fibroblast (7.7%), 3. macrophage (6.2%)
D1011 Invasive Margin (Immune-Rich) 1. epithelial (20.4%), 2. tumor_cyling (18.5%), 3. bcell (12.0%)
D1015 Desmoplastic Stroma 1. CAF (34.7%), 2. tumor_cyling (11.6%), 3. monocyte (8.3%)
D1016 Tumor Core 1. tumor_cyling (31.8%), 2. CAF (22.2%), 3. tumor (15.4%)
D1017 Heterogeneous 1. tumor_cyling (16.1%), 2. undetermined (12.2%), 3. epithelial (12.1%)

Let’s add a new column in the metadata with these values.

Python Code
domain_map = {
    'D1000': 'Normal Mucosa',
    'D1003': 'Muscularis Layer',
    'D1011': 'Invasive Margin',
    'D1015': 'Desmoplastic Stroma',
    'D1016': 'Tumor Core',
    'D1017': 'Heterogeneous'
}

adata.obs['annotated_domain'] = adata.obs[focal_domain_resolution].map(domain_map)

filename = os.path.join(r.analysis_dir, "anndata-7-domain_annotations.h5ad")
adata.write_h5ad(
  filename,
  compression=hdf5plugin.FILTERS["zstd"],
  compression_opts=hdf5plugin.Zstd(clevel=5).filter_options
)

And while were at it, let’s plot the domains in XY space using plotDots with the new labels.

R Code
domain_colors <- results_list[['novae_model_1_domain_colors']]
names(domain_colors) <- results_list[['novae_model_1_domain_names']]
  
domain_map <- py$domain_map
domain_map <- data.frame('domain_id' = names(domain_map), 
                         'domain_description'=unlist(domain_map))

domain_colors <- domain_colors[match(domain_map$domain_id, names(domain_colors))]
domain_map$domain_color <- domain_colors

results_list[['domain_colors']] <- domain_map
saveRDS(results_list, results_list_file)

colors_use <- domain_map$domain_color
names(colors_use) <- domain_map$domain_description

group_prefix <- "annotated_domain"

plotDots(py$adata, color_by='annotated_domain',
              plot_global = TRUE,
              facet_by_group = FALSE,
              additional_plot_parameters = list(
                  geom_point_params = list(
                    size=0.001
                  ),
                  scale_bar_params = list(
                    location = c(5, 0),
                    width = 2,
                    n = 3,
                    height = 0.1,
                    scale_colors = c("black", "grey30"),
                    label_nudge_y = -0.3
                  ),
                  directory = ct_assets_dir,
                  fileType = "png",
                  dpi = 200,
                  width = 8, 
                  height = 8,
                  prefix=group_prefix
                ),
              additional_ggplot_layers = list(
                theme_bw(),
                xlab("X (mm)"),
                ylab("Y (mm)"), 
                coord_fixed(),
                scale_color_manual(values = colors_use),
                theme(legend.position = c(0.8, 0.4)),
                guides(color = guide_legend(
                  title="Spatial Domain",
                  override.aes = list(size = 3) ) )
              )
              )
Figure 6.8: XY with spatial domains.

6.5 Conclusion

We have now mapped the cellular landscape of our tissue. By combining broad unsupervised clustering and the specialized HieraType algorithm, we have moved from a matrix of counts to a spatially annotated map. Then we used this information to create functional names to the domains that we found in Chapter 5.

It is important to recognize that this cell typing workflow is a foundational draft. Depending on your specific research questions, such as defining specialized epithelial cell types or mapping complex stromal niches, more advanced methods may be required.

With our cells now labeled with a type and a domain, we are ready to ask “tertiary” questions. At the moment, this includes an analysis on pathway enrichment but more analysis chapters are in the works.