fastReseg_flag_all_errors — fastReseg_flag_all

Wrapper to process multiple files of one dataset for segmentation error detection at the transcript level. The function reformats each individual transcript data.frame to have unique IDs and a global coordinate system and saves it to disk, then scores each cell for segmentation error and flags transcripts that have low goodness-of-fit to their current cells.

Usage

fastReseg_flag_all_errors(
  counts,
  clust = NULL,
  refProfiles = NULL,
  transDF_fileInfo = NULL,
  filepath_coln = "file_path",
  prefix_colns = c("slide", "fov"),
  fovOffset_colns = c("stage_X", "stage_Y"),
  pixel_size = 0.18,
  zstep_size = 0.8,
  transcript_df = NULL,
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = "CellId",
  spatLocs_colns = c("x", "y", "z"),
  extracellular_cellID = NULL,
  flagModel_TransNum_cutoff = 50,
  flagCell_lrtest_cutoff = 5,
  svmClass_score_cutoff = -2,
  svm_args = list(kernel = "radial", scale = FALSE, gamma = 0.4),
  path_to_output = "reSeg_res",
  transDF_export_option = c(1, 2, 0),
  return_trimmed_perCell = FALSE,
  combine_extra = FALSE,
  ctrl_genes = NULL,
  seed_transError = NULL,
  percentCores = 0.75
)

Arguments

counts: Counts matrix for entire data set, cells X genes.
clust: Vector of cluster assignments for each cell in counts; when NULL, each cell is automatically assigned to the cluster with the maximum transcript score given the provided refProfiles
refProfiles: A matrix of cluster profiles, genes X clusters, default = NULL to use external cluster assignments
transDF_fileInfo: a data.frame with one row per file of per-FOV transcript data.frame, within which the coordinates and CellId are unique; columns include the file path of each per-FOV transcript data.frame file and annotation columns such as slide and fov, which are used as prefixes when creating unique cell_ID across the entire data set; when NULL, use the provided transcript_df directly
filepath_coln: the column name in transDF_fileInfo that holds the file path of each per-FOV transcript data.frame
prefix_colns: the column names of annotation in transDF_fileInfo, to be added to the CellId as prefix when creating unique cell_ID for the entire data set; set to NULL to use the original transID_coln or cellID_coln
fovOffset_colns: the column names of coordinate offsets in the 1st and 2nd dimension for each per-FOV transcript data.frame in transDF_fileInfo, in units of micron. Notice that some assays like SMI have XY axes swapped between stage and each FOV, such that fovOffset_colns should be c("stage_Y", "stage_X").
pixel_size: the micrometer size of image pixel listed in 1st and 2nd dimension of spatLocs_colns of each transcript_df
zstep_size: the micrometer size of z-step for the optional 3rd dimension of spatLocs_colns of each transcript_df
transcript_df: the data.frame of transcript level information with unique CellId, default = NULL to read from the transDF_fileInfo
transID_coln: the column name of transcript_ID in transcript_df, default = NULL to use row index of transcript in each transcript_df; when prefix_colns != NULL, unique transcript_id would be generated from prefix_colns and transID_coln in each transcript_df
transGene_coln: the column name of target or gene name in transcript_df
cellID_coln: the column name of cell_ID in transcript_df; when prefix_colns != NULL, unique cell_ID would be generated from prefix_colns and cellID_coln in each transcript_df
spatLocs_colns: column names for 1st, 2nd and optional 3rd dimension of spatial coordinates in transcript_df
extracellular_cellID: a vector of cell_ID for extracellular transcripts which would be removed from the resegmentation pipeline (default = NULL)
flagModel_TransNum_cutoff: the cutoff of transcript number to do spatial modeling for identification of wrongly segmented cells (default = 50)
flagCell_lrtest_cutoff: the cutoff of lrtest_nlog10P to identify putative wrongly segmented cells with strong spatial dependency in transcript score profile
svmClass_score_cutoff: the cutoff of transcript score to separate between high and low score transcripts in SVM (default = -2)
svm_args: a list of arguments to pass to the svm function for identifying low-score transcript groups in space, typically including kernel, gamma, scale
path_to_output: the file path to the output folder; the directory is created by the function if it does not exist. Three outputs are saved as individual csv files for each FOV: flagged_transDF, the reformatted transcript data.frame with transcripts of low goodness-of-fit flagged by SVM_class = 0; modStats_ToFlagCells, the per-cell evaluation output of segmentation error; and classDF_ToFlagTrans, the class assignment of transcripts within each flagged cell.
transDF_export_option: option on how to export updated transcript_df, 0 for no export, 1 for write to path_to_output in disk as csv for each FOV, 2 for return to function as list (default = 1)
return_trimmed_perCell: flag to return a gene x cell count sparse matrix where all putative contaminating transcripts are trimmed (default = FALSE)
combine_extra: flag to combine original extracellular transcripts back to the flagged transcript data.frame. (default = FALSE)
ctrl_genes: a vector of control genes that are present in input transcript data.frame but not present in counts or refProfiles; the ctrl_genes would be included in FastReseg analysis. (default = NULL)
seed_transError: seed for transcript error detection step, default = NULL to skip the seed
percentCores: percent of cores to use for parallel processing (0-1] (default = 0.75)
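
To make the roles of prefix_colns, fovOffset_colns, pixel_size and zstep_size concrete, the base R sketch below converts a toy per-FOV table to global micron coordinates and builds data-set-wide cell IDs. The helper `to_global()`, the toy values and the underscore-joined ID format are assumptions for illustration only; the ID string that the package actually generates may differ.

```r
# Toy per-FOV transcript table: pixel coordinates in x/y, z-step index in z.
fov_df <- data.frame(
  CellId = c(1, 1, 2),
  target = c("GeneA", "GeneB", "GeneA"),
  x = c(100, 150, 900),
  y = c(200, 220, 50),
  z = c(0, 1, 2)
)

# Hypothetical helper (not part of FastReseg): scale pixel and z-step units to
# micron, shift x/y by the FOV offset, and prefix the per-FOV cell ID with the
# annotation values so that IDs are unique across FOVs.
to_global <- function(df, prefix, offset_um, pixel_size = 0.18, zstep_size = 0.8) {
  out <- df
  out$x <- df$x * pixel_size + offset_um[1]
  out$y <- df$y * pixel_size + offset_um[2]
  out$z <- df$z * zstep_size
  prefix_str <- paste(prefix, collapse = "_")
  out$global_cellID <- paste(prefix_str, df$CellId, sep = "_")
  out
}

# Offsets are given in the order of fovOffset_colns, here c(stage_Y, stage_X)
# to mimic an assay whose stage and FOV axes are swapped.
global_df <- to_global(fov_df,
                       prefix = c(slide = 1, fov = 2),
                       offset_um = c(81, -2701))
print(global_df)
```

The same CellId value from two different FOVs ends up with different prefixes, which is why the per-FOV files only need IDs that are unique within each file.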

Value

a list

refProfiles: a genes * clusters matrix of cluster-specific reference profiles used in the resegmentation pipeline
baselineData: a list of two matrices in cluster * percentile format for the cluster-specific percentile distribution of per cell value; span_score is for the average per molecule transcript tLLR score of each cell, span_transNum is for the transcript number of each cell.
ctrl_genes: a vector of control genes whose transcript scores are set to a fixed value for all cell types, returned when ctrl_genes is not NULL.
combined_modStats_ToFlagCells: a data.frame for spatial modeling statistics of each cell for all cells in the data set, output of score_cell_segmentation_error function
combined_flaggedCells: a list in which each element is a vector of UMI_cellID for cells flagged for potential cell segmentation errors within each FOV
trimmed_perCellExprs: a gene x cell count sparse matrix where all putative contaminating transcripts are trimmed, returned when return_trimmed_perCell = TRUE
flagged_transDF_list: a list of per-FOV transcript data.frame with flagging information in the SVM_class column, returned when transDF_export_option = 2
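
As a rough illustration of how the SVM_class flag relates to a trimmed per-cell expression matrix, the base R sketch below drops flagged transcripts from a toy table and tabulates the rest into a gene x cell count table. The toy data, the use of 1 for unflagged transcripts and the dense `table()` output are assumptions; this is not the package's own implementation, which returns a sparse matrix.

```r
# Toy stand-in for one element of `flagged_transDF_list`; only the columns
# needed here are included, and all values are made up.
flagged_df <- data.frame(
  UMI_cellID = c("c1", "c1", "c1", "c2", "c2"),
  target     = c("GeneA", "GeneB", "GeneA", "GeneA", "GeneB"),
  SVM_class  = c(1, 0, 1, 1, 1)
)

# Keep transcripts that were not flagged (SVM_class = 0 marks low goodness-of-fit).
kept <- flagged_df[flagged_df$SVM_class != 0, ]

# Gene x cell count table of the retained transcripts.
trimmed_counts <- table(gene = kept$target, cell = kept$UMI_cellID)
print(trimmed_counts)
```

Here the single flagged GeneB transcript in cell c1 is removed, so that cell contributes no GeneB counts after trimming.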

Details

The function first estimates the mean profile for each cell cluster from the provided cell x gene count matrix and cluster assignment for the entire data set, and uses these estimated cluster-specific profiles as reference profiles when refProfiles is not provided. For each transcript data.frame, the function scores each transcript based on the cell type-specific reference profiles, evaluates the goodness-of-fit of each transcript within its original cell segment, and identifies the low-score transcript groups within cells that have strong spatial dependency in their transcript score profile. When transDF_export_option = 1, the function saves the per-FOV outputs as individual files in the path_to_output directory; flagged_transDF, modStats_ToFlagCells and classDF_ToFlagTrans are each saved as a csv file.

flagged_transDF: a transcript data.frame for each FOV, with columns for unique IDs of transcripts UMI_transID and cells UMI_cellID, for the global coordinate system x, y, z, and for the goodness-of-fit in the original cell segment SVM_class; the original per-FOV cell ID and pixel/index-based coordinate systems are saved under columns CellId, pixel_x, pixel_y, idx_z
modStats_ToFlagCells: a data.frame for spatial modeling statistics of each cell, output of score_cell_segmentation_error function
classDF_ToFlagTrans: a data.frame for the class assignment of transcripts within putative wrongly segmented cells, output of flag_bad_transcripts function
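
The first step described above, estimating a mean profile per cluster from a cell x gene count matrix and a cluster assignment, can be sketched in base R as follows. The toy matrix and cluster labels are made up, and the plain per-cluster mean is an assumption; the package may apply additional normalization when it derives reference profiles.

```r
# Toy cells x genes count matrix and one cluster label per cell.
counts <- matrix(
  c(4, 0, 2,
    6, 2, 0,
    0, 8, 2,
    0, 4, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("cell", 1:4), c("GeneA", "GeneB", "GeneC"))
)
clust <- c("a", "a", "b", "b")

# Sum counts within each cluster (clusters x genes), then divide each row by
# the number of cells in that cluster to get the per-cluster mean.
cluster_sums  <- rowsum(counts, group = clust)
cluster_sizes <- as.vector(table(clust)[rownames(cluster_sums)])
mean_profiles <- t(cluster_sums / cluster_sizes)  # genes x clusters

print(mean_profiles)
```

The result has the genes x clusters orientation that refProfiles expects.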

To account for genes missing in refProfiles but present in input transcript data.frame, genes in ctrl_genes would be assigned with goodness-of-fit score equal to svmClass_score_cutoff for all cell types to minimize the impact of those genes on the identification of low-score transcript groups via SVM. To avoid significant interference from those ctrl_genes, it's recommended to have total counts of those genes below 1% of total counts of all genes in each cell.
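
One way to check the 1% recommendation before running the pipeline is to compute, for each cell, the fraction of transcripts that belong to ctrl_genes. The base R sketch below does this on a toy transcript table; the gene name `NegPrb1` and the data are hypothetical, and the check itself is a suggestion rather than something the function performs.

```r
# Toy transcript table: cell c1 has 200 transcripts, cell c2 has 50.
transcript_df <- data.frame(
  UMI_cellID = rep(c("c1", "c2"), times = c(200, 50)),
  target = c(rep("GeneA", 199), "NegPrb1",
             rep("GeneB", 48), rep("NegPrb1", 2))
)
ctrl_genes <- "NegPrb1"  # hypothetical control gene name

# Per-cell fraction of transcripts that come from control genes.
ctrl_fraction <- tapply(transcript_df$target %in% ctrl_genes,
                        transcript_df$UMI_cellID,
                        mean)

# Cells at or above the recommended 1% limit.
cells_over_limit <- names(ctrl_fraction)[ctrl_fraction >= 0.01]
print(ctrl_fraction)
print(cells_over_limit)
```

Cells listed in `cells_over_limit` are the ones where control genes could noticeably influence the SVM step.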

Examples

data("mini_transcriptDF")
data("ori_RawExprs")
data("example_refProfiles")
data("example_baselineCT")
# cell_ID for extracellular transcripts
extracellular_cellID <- mini_transcriptDF[which(mini_transcriptDF$CellId == 0), 'cell_ID']

# case #1: provide `transcript_df` directly,
# do auto cluster assignment of each cell based on gene expression matrix, 
# `counts`, and cluster-specific reference profiles, `refProfiles`
res1 <- fastReseg_flag_all_errors(counts = ori_RawExprs,
                                  clust = NULL,
                                  refProfiles = example_refProfiles,
                                  pixel_size = 1,
                                  zstep_size = 1,
                                  transcript_df = mini_transcriptDF,
                                  transID_coln = "UMI_transID",
                                  transGene_coln = "target",
                                  cellID_coln = "UMI_cellID",
                                  spatLocs_colns = c("x","y","z"),
                                  extracellular_cellID = extracellular_cellID,
                                  path_to_output = "res1f_directDF")
#> Per-FOV outputs including transcript data.frame with flagging information would be exported to disk at `path_to_output = 'res1f_directDF'`.
#> Found 960 common genes among `refProfiles` and `counts`. 
#> No common cell types/clusters found between `clust` and `refProfiles`.
#> Perform cluster assignment based on maximum transcript score given the provided `refProfiles`.
#> Use the providied `molecular_distance_cutoff` = 1.0000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 10.0000 for searching of neighbor cells.
#> 3 Dimension of spaital coordinates are provided.
#> A single `transcript_df` is provided with unique `cellID_coln` = UMI_cellID and `transID_coln` = UMI_transID (use row idx if NULL).
#> 
#> ##############
#> Processing file `1`: NA
#> 
#> 
#> ##------ Wed Mar 19 01:14:32 2025 ------##
#> Found 960 common genes among transcript_df and score_GeneMatrix. 
#> Found 1375 cells and assigned cell type based on the provided `refProfiles` cluster profiles.
#> Run linear regreassion in 3 Dimension.
#> Warning: Below model_cutoff = 50, skip 37 cells with fewer transcripts. Move forward with remaining 1338 cells.
#> 373 cells, 0.2788 of all evaluated cells, are flagged for resegmentation with lrtest_nlog10P > 5.0.
#> Run SVM in 3 Dimension.
#> Found 373 common cells and 960 common genes among chosen_cells, transcript_df, and score_GeneMatrix. 
#> Warning: Below model_cutoff = 50, skip 0 cells with fewer transcripts. Move forward with remaining 373 cells.
#> Warning: Skip 0 cells with all transcripts in same class given `score_cutoff = -2`. Move forward with remaining 373 cells.
#> Remove 0 cells with raw transcript score all in same class based on cutoff -2.00 when running spatial SVM model.

# case #2: provide file paths to per FOV transcript data files and specify 
# the spatial offset for each FOV,
# do auto-calculation of cluster-specific reference profiles from gene 
# expression matrix, `counts`, and cluster assignment of each cell, `clust`.
data("example_CellGeneExpr")
data("example_clust")

# the example individual transcript files are stored under the `extdata` directory of this package
# update your path accordingly
# Notice that some assays like SMI have XY axes swapped between stage and each FOV;
# coordinates for each FOV should have units in micron
dataDir <- system.file("extdata", package = "FastReseg")
fileInfo_DF <- data.frame(
  file_path = fs::path(dataDir,
                       c("Run4104_FOV001__complete_code_cell_target_call_coord.csv",
                         "Run4104_FOV002__complete_code_cell_target_call_coord.csv")),
  slide = c(1, 1),
  fov = c(1,2),
  stage_X = 1000*c(5.13, -2.701),
  stage_Y = 1000*c(-0.452, 0.081))

res2 <- fastReseg_flag_all_errors(counts = example_CellGeneExpr,
                                  clust = example_clust,
                                  refProfiles = NULL,
                                  transDF_fileInfo = fileInfo_DF,
                                  filepath_coln = 'file_path',
                                  prefix_colns = c('slide','fov'),
                                  
                                  # match XY axes between stage and each FOV
                                  fovOffset_colns = c('stage_Y','stage_X'), 
                                  # 0.18 micron per pixel in transcript data
                                  pixel_size = 0.18, 
                                  # 0.8 micron per z step in transcript data
                                  zstep_size = 0.8, 
                                  
                                  transcript_df = NULL,
                                  
                                  # row index as transcript_id
                                  transID_coln = NULL, 
                                  
                                  transGene_coln = "target",
                                  cellID_coln = "CellId",
                                  spatLocs_colns = c("x","y","z"),
                                  
                                  # CellId = 0 means extracellular transcripts in raw data
                                  extracellular_cellID = c(0), 
                                  
                                  path_to_output = "res2f_multiFiles")
#> Per-FOV outputs including transcript data.frame with flagging information would be exported to disk at `path_to_output = 'res2f_multiFiles'`.
#> Found 960 common genes among `refProfiles` and `counts`. 
#> Use the providied `molecular_distance_cutoff` = 1.0000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 10.0000 for searching of neighbor cells.
#> 3 Dimension of spaital coordinates are provided.
#> 2 individual per FOV files are provided in `transDF_fileInfo`, use the 1st file to calculate distance cutoffs
#> `transID_coln` and `cellID_coln` of each per FOV transcript_df would be re-named based on `prefix_colns` = `slide`,`fov`.