Skip to contents

core wrapper for resegmentation pipeline using transcript score matrix derived from external reference profiles and preset cutoffs

Usage

fastReseg_perFOV_full_process(
  score_GeneMatrix,
  transcript_df,
  transID_coln = "UMI_transID",
  transGene_coln = "target",
  cellID_coln = "UMI_cellID",
  spatLocs_colns = c("x", "y", "z"),
  extracellular_cellID = NULL,
  flagModel_TransNum_cutoff = 50,
  flagCell_lrtest_cutoff = 5,
  svmClass_score_cutoff = -2,
  svm_args = list(kernel = "radial", scale = FALSE, gamma = 0.4),
  molecular_distance_cutoff = 2.7,
  cellular_distance_cutoff = NULL,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum = NULL,
  groupTranscripts_method = c("dbscan", "delaunay"),
  spatialMergeCheck_method = c("leidenCut", "geometryDiff"),
  cutoff_spatialMerge = 0.5,
  leiden_config = list(objective_function = "CPM", resolution_parameter = 1, beta = 0.01,
    n_iterations = 200),
  config_spatNW_transcript = NULL,
  return_intermediates = TRUE,
  return_perCellData = TRUE,
  includeAllRefGenes = FALSE,
  seed_process = NULL
)

Arguments

score_GeneMatrix

the gene x cell-type matrix of log-like score of gene in each cell type

transcript_df

the data.frame for each transcript with columns for transcript_id, target or gene name, original cell_id, spatial coordinates.

transID_coln

the column name of transcript_ID in transcript_df

transGene_coln

the column name of target or gene name in transcript_df

cellID_coln

the column name of cell_ID in transcript_df

spatLocs_colns

column names for 1st, 2nd and optional 3rd dimension of spatial coordinates in transcript_df

extracellular_cellID

a vector of cell_ID for extracellular transcripts which would be removed from the resegmention pipeline (default = NULL)

flagModel_TransNum_cutoff

the cutoff of transcript number to do spatial modeling for identification of wrongly segmented cells (default = 50)

flagCell_lrtest_cutoff

the cutoff of lrtest_nlog10P to identify putative wrongly segemented cells with strong spatial dependency in transcript score profile

svmClass_score_cutoff

the cutoff of transcript score to separate between high and low score transcripts in SVM (default = -2)

svm_args

a list of arguments to pass to svm function for identifying low-score transcript groups in space, typically involve kernel, gamma, scale

molecular_distance_cutoff

maximum molecule-to-molecule distance within connected transcript group, same unit as input spatial coordinate (default = 2.7 micron). If set to NULL, the pipeline would first randomly choose no more than 2500 cells from up to 10 random picked ROIs with search radius to be 5 times of cellular_distance_cutoff, and then calculate the minimal molecular distance between picked cells. The pipeline would further use the 5 times of 90% quantile of minimal molecular distance as molecular_distance_cutoff. This calculation is slow and is not recommended for large transcript data.frame.

cellular_distance_cutoff

maximum cell-to-cell distance in x, y between the center of query cells to the center of neighbor cells with direct contact, same unit as input spatial coordinate. Default = NULL to use the 2 times of average 2D cell diameter.

score_baseline

a named vector of score baseline under each cell type listed in score_GeneMatrix such that per cell transcript score higher than the baseline is required to call a cell type of high enough confidence

lowerCutoff_transNum

a named vector of transcript number cutoff under each cell type such that higher than the cutoff is required to keep query cell as it is

higherCutoff_transNum

a named vector of transcript number cutoff under each cell type such that lower than the cutoff is required to keep query cell as it is when there is neighbor cell of consistent cell type.

groupTranscripts_method

use either "dbscan" or "delaunay method" to group transcripts in space (default = "dbscan")

spatialMergeCheck_method

use either "leidenCut" (in 2D or 3D) or "geometryDiff" (in 2D only) method to determine whether a cell pair merging event is allowed in space (default = "leidenCut")

cutoff_spatialMerge

spatial constraint on a valid merging event between two source transcript groups, default = 0.5 for 50% cutoff, set to 0 to skip spatial constraint evaluation for merging. For spatialMergeCheck_method = "leidenCut", this is the minimal percentage of transcripts shared membership between query cell and neighbor cells in leiden clustering results for a valid merging event. For spatialMergeCheck_method = "geometryDiff", this is the maximum percentage of white space change upon merging of query cell and neighbor cell for a valid merging event.

leiden_config

(leidenCut) a list of configuration to pass to reticulate and igraph::cluster_leiden function, including objective_function, resolution_parameter, beta, n_iterations.

config_spatNW_transcript

configuration list to create spatial network at transcript level, see manual for createSpatialDelaunayNW_from_spatLocs for more details, set to NULL to use default config (default = NULL)

return_intermediates

flag to return intermediate outputs, including data.frame for spatial modeling statistics of each cell

return_perCellData

flag to return gene x cell count matrix and per cell DF with updated mean spatial coordinates and new cell type

includeAllRefGenes

flag to include all genes in score_GeneMatrix in the returned updated_perCellExprs with missing genes of value 0 (default = FALSE)

seed_process

seed for per FOV processing, used in transcript error detection and correction steps, default = NULL to skip the seed

Value

a list

modStats_ToFlagCells

a data.frame for spatial modeling statistics of each cell, output of score_cell_segmentation_error function, return when return_intermediates = TRUE

groupDF_ToFlagTrans

data.frame for the group assignment of transcripts within putative wrongly segmented cells, merged output of flag_bad_transcripts and groupTranscripts_Delaunay or groupTranscripts_dbscan functions, return when return_intermediates = TRUE

neighborhoodDF_ToReseg

a data.frame for neighborhood environment of low-score transcript groups, output of get_neighborhood_content function, return when return_intermediates = TRUE

reseg_actions

a list of 4 elements describing how the resegmenation would be performed on original transcript_df by the group assignment of transcripts listed in groupDF_ToFlagTrans, output of decide_ReSegment_Operations function, return when return_intermediates = TRUE

updated_transDF

the updated transcript_df with updated_cellID and updated_celltype column based on reseg_full_converter

updated_perCellDT

a per cell data.table with mean spatial coordinates, new cell type and resegmentation action after resegmentation, return when return_perCellData = TRUE

updated_perCellExprs

a gene x cell count sparse matrix for updated transcript data.frame after resegmentation, return when return_perCellData = TRUE

Details

The pipeline would score each transcript based on the provided cell type-specific reference profiles, evaluate the goodness-of-fit of each transcript within original cell segment, identify the low-score transcript groups within cells that has strong spatial dependency in transcript score profile, evaluate the neighborhood environment of low-score transcript groups and perform resegmentation actions including triming to extracellular space, merging to neighbor cell or labeling as new cell.

To account for genes missing in score_GeneMatrix but present in input transcript data.frame, genes in ctrl_genes would be assigned with goodness-of-fit score equal to svmClass_score_cutoff for all cell types to minimize the impact of those genes on the identification of low-score transcript groups via SVM. To avoid significant interference from those ctrl_genes, it's recommended to have total counts of those genes below 1% of total counts of all genes in each cell.

Examples

data(example_refProfiles)
data(mini_transcriptDF)
data(example_baselineCT)
#' cell_ID for extracellualr transcripts
extracellular_cellID <- mini_transcriptDF[which(mini_transcriptDF$CellId ==0), 'cell_ID'] 
score_baseline <- example_baselineCT[['span_score']][,"25%"]
lowerCutoff_transNum  <- example_baselineCT[['span_transNum']][,"25%"]
higherCutoff_transNum  <- example_baselineCT[['span_transNum']][,"50%"]

# calculate log-likelihood of each gene under each cell type and center the 
# score matrix on per gene basis
score_GeneMatrix <- scoreGenesInRef(genes = intersect(unique(mini_transcriptDF[["target"]]), 
                                                      rownames(example_refProfiles)),
                                    ref_profiles = pmax(example_refProfiles, 1e-5))
                                    
# case 1: run with default methods: "dbscan" for transcript grouping, 
# "leidenCut" for merging check 
final_res1 <- fastReseg_perFOV_full_process(score_GeneMatrix= score_GeneMatrix,
                                           transcript_df = mini_transcriptDF,
                                           extracellular_cellID = extracellular_cellID,
                                           molecular_distance_cutoff = 2.7,
                                           cellular_distance_cutoff = 25,
                                           score_baseline = score_baseline,
                                           lowerCutoff_transNum = lowerCutoff_transNum,
                                           higherCutoff_transNum= higherCutoff_transNum)
#> 3 Dimension of spaital coordinates are provided.
#> Perform leiden clustering at resolution_parameter = 1.000.
#> Create Delanay network when config$method is NULL.
#> Name the spatial network based on method as `Delaunay_network` when config$name is NULL.
#> Use cellular_distance_cutoff = 25.0000 for searching of neighbor cells.
#> Use `molecular_distance_cutoff` = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> 960 unique genes are found in `transcript_df`.
#> 960 unique genes are shared by the provided `score_GeneMatrix` that contains cluster profiles of 960 genes for 22 clusters.
#> Found 756783 transcript records and 1375 cells in input `transcript_df`.
#> Use the providied `molecular_distance_cutoff` = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 25.0000 for searching of neighbor cells.
#> Found 960 common genes among transcript_df and score_GeneMatrix. 
#> Found 1375 cells and assigned cell type based on the provided `refProfiles` cluster profiles.
#> Run linear regreassion in 3 Dimension.
#> Warning: Below model_cutoff = 50, skip 37 cells with fewer transcripts. Move forward with remaining 1338 cells.
#> 373 cells, 0.2788 of all evaluated cells, are flagged for resegmentation with lrtest_nlog10P > 5.0.
#> Run SVM in 3 Dimension.
#> Found 373 common cells and 960 common genes among chosen_cells, transcript_df, and score_GeneMatrix. 
#> Warning: Below model_cutoff = 50, skip 0 cells with fewer transcripts. Move forward with remaining 373 cells.
#> Warning: Skip 0 cells with all transcripts in same class given `score_cutoff = -2`. Move forward with remaining 373 cells.
#> Remove 0 cells with raw transcript score all in same class based on cutoff -2.00 when running spatial SVM model.
#> Do spatial network analysis in 3 Dimension.
#> 10652 chosen_transcripts are found in common cells.
#> SVM spatial model further identified 17 cells with transcript score all in same class, exclude from transcript group analysis.
#> Found 960 common genes among transcript_df and score_GeneMatrix. 
#> Perform leiden clustering at resolution_parameter = 1.000.
#> Use neighbor_distance_xy = 25.0000 for searching of neighbor cells.
#> Use distance_cutoff = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use first 2D for searching cell neighborhood, but all 3 Dimension to identify direct neighbors based on molecular distance.
#> Found 2289 common cells and 960 common genes among transcript_df, cell_networkDT, and score_GeneMatrix. 
#> 914 chosen_cells are found in common cells.
#> Use `leidenCut` method to evaluate putative merging event in space. 
#> Perform leiden clustering at resolution_parameter = 1.000.
#> A valid merging event must have query cell with 0.500 transcript shared same membership as neighbor cell of consistent cell type. 
#> Run delanuay network in 3 Dimension.
#> Perform ledien clustering on 59 potential merging events. 
#> (`c_1_2_1339_g3`, `c_1_2_1403_g5`) cell pair with all 7 transcripts in same z plane, run 2D network analysis.
#> (`c_1_2_1403_g5`, `c_1_2_1339_g3`) cell pair with all 7 transcripts in same z plane, run 2D network analysis.
#> (`c_1_2_1731_g2`, `c_1_2_1792_g6`) cell pair with all 5 transcripts in same z plane, run 2D network analysis.
#> (`c_1_2_1792_g6`, `c_1_2_1731_g2`) cell pair with all 5 transcripts in same z plane, run 2D network analysis.
#> (`c_1_2_921_g5`, `c_1_2_941_g1`) cell pair with all 15 transcripts in same z plane, run 2D network analysis.
#> (`c_1_2_941_g1`, `c_1_2_921_g5`) cell pair with all 15 transcripts in same z plane, run 2D network analysis.
#> Found 915 common cells and 960 common genes among `names(reseg_full_converter)`, `transcript_df`, and `score_GeneMatrix`. 


# case 2: run with alternative methods: "delaunay" for transcript grouping, 
# "geometryDiff" for merging check 
final_res2 <- fastReseg_perFOV_full_process(score_GeneMatrix= score_GeneMatrix,
                                           transcript_df = mini_transcriptDF,
                                           extracellular_cellID = extracellular_cellID,
                                           molecular_distance_cutoff = 2.7,
                                           cellular_distance_cutoff = 25,
                                           score_baseline = score_baseline,
                                           lowerCutoff_transNum = lowerCutoff_transNum,
                                           higherCutoff_transNum= higherCutoff_transNum,
                                           groupTranscripts_method = "delaunay",
                                           spatialMergeCheck_method = "geometryDiff")
#> 3 Dimension of spaital coordinates are provided.
#> Create Delanay network when config$method is NULL.
#> Name the spatial network based on method as `Delaunay_network` when config$name is NULL.
#> Use cellular_distance_cutoff = 25.0000 for searching of neighbor cells.
#> Use `molecular_distance_cutoff` = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> 960 unique genes are found in `transcript_df`.
#> 960 unique genes are shared by the provided `score_GeneMatrix` that contains cluster profiles of 960 genes for 22 clusters.
#> Found 756783 transcript records and 1375 cells in input `transcript_df`.
#> Use the providied `molecular_distance_cutoff` = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 25.0000 for searching of neighbor cells.
#> Found 960 common genes among transcript_df and score_GeneMatrix. 
#> Found 1375 cells and assigned cell type based on the provided `refProfiles` cluster profiles.
#> Run linear regreassion in 3 Dimension.
#> Warning: Below model_cutoff = 50, skip 37 cells with fewer transcripts. Move forward with remaining 1338 cells.
#> 373 cells, 0.2788 of all evaluated cells, are flagged for resegmentation with lrtest_nlog10P > 5.0.
#> Run SVM in 3 Dimension.
#> Found 373 common cells and 960 common genes among chosen_cells, transcript_df, and score_GeneMatrix. 
#> Warning: Below model_cutoff = 50, skip 0 cells with fewer transcripts. Move forward with remaining 373 cells.
#> Warning: Skip 0 cells with all transcripts in same class given `score_cutoff = -2`. Move forward with remaining 373 cells.
#> Remove 0 cells with raw transcript score all in same class based on cutoff -2.00 when running spatial SVM model.
#> Do spatial network analysis in 3 Dimension.
#> 10652 chosen_transcripts are found in common cells.
#> c_1_2_1172 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1184 with all 4 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_1211 with all 7 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_1403 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1412 with all 7 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_1436 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1585 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1620 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1625 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1674 with all 5 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_1689 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1732 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1736 with all 12 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_1736 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1801 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1830 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1887 has 2 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_1930 with all 11 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_1968 with all 5 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_2157 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_2279 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_2466 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_2503 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_2549 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_2566 with all 6 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_2590 with all 12 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_2786 with all 5 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_417 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_4201 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_4364 with all 13 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_4421 with all 4 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> c_1_2_477 has 2 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_562 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_712 has 1 solo transcripts, flag based on distance cutoff = 2.7000.
#> c_1_2_941 with all 9 flagged transcripts of unique transcripts in same z plane, run 2D network analysis.
#> SVM spatial model further identified 17 cells with transcript score all in same class, exclude from transcript group analysis.
#> Found 960 common genes among transcript_df and score_GeneMatrix. 
#> Use neighbor_distance_xy = 25.0000 for searching of neighbor cells.
#> Use distance_cutoff = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use first 2D for searching cell neighborhood, but all 3 Dimension to identify direct neighbors based on molecular distance.
#> Found 2292 common cells and 960 common genes among transcript_df, cell_networkDT, and score_GeneMatrix. 
#> 917 chosen_cells are found in common cells.
#> Use `geometryDiff` method to evaluate putative merging event in space. 
#> A valid merging event to neighbor cell of consistent cell type must have no more than 0.500 area change in white space upon merging with respect to the concave area of either source cells. 
#> Perform geometry analysis in 2D for potential merging events despite the provided 3 Dimension data.
#> Perform geometry analysis on 127 potential merging events. 
#> Found 922 common cells and 960 common genes among `names(reseg_full_converter)`, `transcript_df`, and `score_GeneMatrix`. 


# case 3: run with default "dbscan" for transcript grouping, but apply no 
# spatial constraint on merging
final_res3 <- fastReseg_perFOV_full_process(score_GeneMatrix= score_GeneMatrix,
                                            transcript_df = mini_transcriptDF,
                                            extracellular_cellID = extracellular_cellID,
                                            molecular_distance_cutoff = 2.7,
                                            cellular_distance_cutoff = 25,
                                            score_baseline = score_baseline,
                                            lowerCutoff_transNum = lowerCutoff_transNum,
                                            higherCutoff_transNum= higherCutoff_transNum,
                                            cutoff_spatialMerge = 0)
#> 3 Dimension of spaital coordinates are provided.
#> Use cellular_distance_cutoff = 25.0000 for searching of neighbor cells.
#> Use `molecular_distance_cutoff` = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> 960 unique genes are found in `transcript_df`.
#> 960 unique genes are shared by the provided `score_GeneMatrix` that contains cluster profiles of 960 genes for 22 clusters.
#> Found 756783 transcript records and 1375 cells in input `transcript_df`.
#> Use the providied `molecular_distance_cutoff` = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 25.0000 for searching of neighbor cells.
#> Found 960 common genes among transcript_df and score_GeneMatrix. 
#> Found 1375 cells and assigned cell type based on the provided `refProfiles` cluster profiles.
#> Run linear regreassion in 3 Dimension.
#> Warning: Below model_cutoff = 50, skip 37 cells with fewer transcripts. Move forward with remaining 1338 cells.
#> 373 cells, 0.2788 of all evaluated cells, are flagged for resegmentation with lrtest_nlog10P > 5.0.
#> Run SVM in 3 Dimension.
#> Found 373 common cells and 960 common genes among chosen_cells, transcript_df, and score_GeneMatrix. 
#> Warning: Below model_cutoff = 50, skip 0 cells with fewer transcripts. Move forward with remaining 373 cells.
#> Warning: Skip 0 cells with all transcripts in same class given `score_cutoff = -2`. Move forward with remaining 373 cells.
#> Remove 0 cells with raw transcript score all in same class based on cutoff -2.00 when running spatial SVM model.
#> Do spatial network analysis in 3 Dimension.
#> 10652 chosen_transcripts are found in common cells.
#> SVM spatial model further identified 17 cells with transcript score all in same class, exclude from transcript group analysis.
#> Found 960 common genes among transcript_df and score_GeneMatrix. 
#> Use neighbor_distance_xy = 25.0000 for searching of neighbor cells.
#> Use distance_cutoff = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use first 2D for searching cell neighborhood, but all 3 Dimension to identify direct neighbors based on molecular distance.
#> Found 2289 common cells and 960 common genes among transcript_df, cell_networkDT, and score_GeneMatrix. 
#> 914 chosen_cells are found in common cells.
#> The provided `cutoff_spatialMerge = 0`, no spatial evaluation would be done for potential cell merging events. 
#> Found 928 common cells and 960 common genes among `names(reseg_full_converter)`, `transcript_df`, and `score_GeneMatrix`.