Wrapper to process multiple files of one dataset for segmentation error detection at the transcript level. The function reformats each individual transcript data.frame to have unique IDs and a global coordinate system and saves it to disk, then scores each cell for segmentation error and flags transcripts that have low goodness-of-fit to their current cells.
Usage
fastReseg_flag_all_errors(
  counts,
  clust = NULL,
  refProfiles = NULL,
  transDF_fileInfo = NULL,
  filepath_coln = "file_path",
  prefix_colns = c("slide", "fov"),
  fovOffset_colns = c("stage_X", "stage_Y"),
  pixel_size = 0.18,
  zstep_size = 0.8,
  transcript_df = NULL,
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = "CellId",
  spatLocs_colns = c("x", "y", "z"),
  extracellular_cellID = NULL,
  flagModel_TransNum_cutoff = 50,
  flagCell_lrtest_cutoff = 5,
  svmClass_score_cutoff = -2,
  svm_args = list(kernel = "radial", scale = FALSE, gamma = 0.4),
  path_to_output = "reSeg_res",
  transDF_export_option = c(1, 2, 0),
  return_trimmed_perCell = FALSE,
  combine_extra = FALSE,
  ctrl_genes = NULL,
  seed_transError = NULL,
  percentCores = 0.75
)
Arguments
- counts
Counts matrix for entire data set, cells X genes.
- clust
Vector of cluster assignments for each cell in counts; when NULL, each cell is automatically assigned to a cluster based on the maximum transcript score given the provided refProfiles
- refProfiles
A matrix of cluster profiles, genes X clusters, default = NULL to use external cluster assignments
- transDF_fileInfo
a data.frame with one row per file of per-FOV transcript data.frame, within which the coordinates and CellId are unique; columns include the file path of each per-FOV transcript data.frame file and annotation columns such as slide and fov, which are used as prefixes when creating unique cell_ID across the entire data set; when NULL, use the provided transcript_df directly
- filepath_coln
the name of the column in transDF_fileInfo that holds the file path of each per-FOV transcript data.frame
- prefix_colns
the names of the annotation columns in transDF_fileInfo to be added to the CellId as a prefix when creating unique cell_ID for the entire data set; set to NULL to use the original transID_coln or cellID_coln
- fovOffset_colns
the column names of the coordinate offsets in the 1st and 2nd dimension for each per-FOV transcript data.frame in transDF_fileInfo, in microns. Note that some assays, such as SMI, have the XY axes swapped between the stage and each FOV, such that fovOffset_colns should be c("stage_Y", "stage_X").
- pixel_size
the size of an image pixel in micrometers, applied to the 1st and 2nd dimension of spatLocs_colns of each transcript_df
- zstep_size
the size of a z-step in micrometers, applied to the optional 3rd dimension of spatLocs_colns of each transcript_df
- transcript_df
the data.frame of transcript-level information with unique CellId; default = NULL to read from transDF_fileInfo
- transID_coln
the column name of transcript_ID in transcript_df; default = NULL to use the row index of each transcript in each transcript_df; when prefix_colns != NULL, a unique transcript_id would be generated from prefix_colns and transID_coln in each transcript_df
- transGene_coln
the column name of target or gene name in
transcript_df
- cellID_coln
the column name of cell_ID in transcript_df; when prefix_colns != NULL, a unique cell_ID would be generated from prefix_colns and cellID_coln in each transcript_df
- spatLocs_colns
column names for 1st, 2nd and optional 3rd dimension of spatial coordinates in
transcript_df
- extracellular_cellID
a vector of cell_ID for extracellular transcripts, which would be removed from the resegmentation pipeline (default = NULL)
- flagModel_TransNum_cutoff
the cutoff of transcript number to do spatial modeling for identification of wrongly segmented cells (default = 50)
- flagCell_lrtest_cutoff
the cutoff of lrtest_nlog10P to identify putative wrongly segmented cells with strong spatial dependency in transcript score profile (default = 5)
- svmClass_score_cutoff
the cutoff of transcript score to separate between high and low score transcripts in SVM (default = -2)
- svm_args
a list of arguments to pass to the svm function for identifying low-score transcript groups in space, typically kernel, gamma and scale
- path_to_output
the file path to the output folder; the directory is created by the function if it does not exist. flagged_transDF (the reformatted transcript data.frame, with transcripts of low goodness-of-fit flagged by SVM_class = 0), modStats_ToFlagCells (the per-cell evaluation output of segmentation error) and classDF_ToFlagTrans (the class assignment of transcripts within each flagged cell) are saved as individual csv files for each FOV.
- transDF_export_option
option for how to export the updated transcript_df: 0 for no export, 1 to write to path_to_output on disk as a csv for each FOV, 2 to return it from the function as a list (default = 1)
- return_trimmed_perCell
flag to return a gene x cell count sparse matrix where all putative contaminating transcripts are trimmed (default = FALSE)
- combine_extra
flag to combine original extracellular transcripts back to the flagged transcript data.frame. (default = FALSE)
- ctrl_genes
a vector of control genes that are present in the input transcript data.frame but not present in counts or refProfiles; the ctrl_genes would be included in FastReseg analysis (default = NULL)
- seed_transError
seed for transcript error detection step, default = NULL to skip the seed
- percentCores
fraction of cores to use for parallel processing, in the range (0, 1] (default = 0.75)
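The interplay of prefix_colns, fovOffset_colns, pixel_size and zstep_size can be sketched in base R as below. This is only an illustration under stated assumptions, not the package's implementation: the helper to_global, the toy tables, the "c_"/"t_" ID format and the choice to add (rather than subtract) the stage offset are all hypothetical.

```r
# Hypothetical toy inputs: two per-FOV transcript tables in pixel / z-step units
fov_tables <- list(
  data.frame(CellId = c(1, 2), x = c(100, 200), y = c(50, 60), z = c(0, 3)),
  data.frame(CellId = c(1, 3), x = c(10, 20), y = c(5, 6), z = c(1, 2))
)
# One row per FOV file, with annotation columns and stage offsets in microns
file_info <- data.frame(
  slide = c(1, 1), fov = c(1, 2),
  stage_X = c(5130, -2701), stage_Y = c(-452, 81)
)

# Hypothetical helper: convert one per-FOV table to unique IDs and micron coordinates
to_global <- function(df, info, pixel_size = 0.18, zstep_size = 0.8,
                      prefix_colns = c("slide", "fov"),
                      fovOffset_colns = c("stage_X", "stage_Y")) {
  prefix <- paste(unlist(info[prefix_colns]), collapse = "_")
  data.frame(
    # ID format is an assumption made for this sketch
    UMI_cellID = paste("c", prefix, df$CellId, sep = "_"),
    UMI_transID = paste("t", prefix, seq_len(nrow(df)), sep = "_"),
    # pixel -> micron, then shift by the FOV offset (sign convention assumed)
    x = df$x * pixel_size + info[[fovOffset_colns[1]]],
    y = df$y * pixel_size + info[[fovOffset_colns[2]]],
    z = df$z * zstep_size,
    # keep the original per-FOV values alongside the global ones
    CellId = df$CellId, pixel_x = df$x, pixel_y = df$y, idx_z = df$z,
    stringsAsFactors = FALSE
  )
}

global_list <- lapply(seq_along(fov_tables), function(i) {
  to_global(fov_tables[[i]], file_info[i, ])
})
global_df <- do.call(rbind, global_list)
```

CellId 1 occurs in both toy FOVs, yet the prefixed UMI_cellID values stay distinct, which is the reason for supplying prefix_colns when per-FOV IDs are reused.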
Value
a list
- refProfiles
a genes * clusters matrix of cluster-specific reference profiles used in the resegmentation pipeline
- baselineData
a list of two matrices in cluster * percentile format for the cluster-specific percentile distribution of per-cell values; span_score is for the average per-molecule transcript tLLR score of each cell, span_transNum is for the transcript number of each cell.
- ctrl_genes
a vector of control genes whose transcript scores are set to a fixed value for all cell types, returned when ctrl_genes is not NULL.
- combined_modStats_ToFlagCells
a data.frame of spatial modeling statistics for all cells in the data set, output of the score_cell_segmentation_error function
- combined_flaggedCells
a list in which each element is a vector of UMI_cellID for cells flagged for potential cell segmentation errors within each FOV
- trimmed_perCellExprs
a gene x cell count sparse matrix where all putative contaminating transcripts are trimmed, returned when return_trimmed_perCell = TRUE
- flagged_transDF_list
a list of per-FOV transcript data.frame with flagging information in the SVM_class column, returned when transDF_export_option = 2
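As a rough illustration of how combined_flaggedCells relates to combined_modStats_ToFlagCells, the two cutoffs can be applied to a per-cell statistics table as sketched below. The table is mock data, and transcript_num is a hypothetical column name; only lrtest_nlog10P and UMI_cellID are named in this documentation.

```r
# Mock per-cell statistics; `transcript_num` is a hypothetical column name
modStats <- data.frame(
  UMI_cellID = c("c_1_1_1", "c_1_1_2", "c_1_1_3", "c_1_1_4"),
  transcript_num = c(120, 30, 75, 200),
  lrtest_nlog10P = c(8.2, 12.0, 1.4, 5.0),
  stringsAsFactors = FALSE
)
flagModel_TransNum_cutoff <- 50
flagCell_lrtest_cutoff <- 5

# Cells with fewer transcripts than the cutoff are skipped from spatial modeling
evaluated <- modStats[modStats$transcript_num >= flagModel_TransNum_cutoff, ]
# Cells whose lrtest_nlog10P exceeds the cutoff are flagged
flagged_cells <- evaluated$UMI_cellID[
  evaluated$lrtest_nlog10P > flagCell_lrtest_cutoff
]
```

In this mock table the second cell has a high lrtest_nlog10P but too few transcripts to be evaluated, and the fourth cell sits exactly at the cutoff, so only the first cell is flagged.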
Details
When refProfiles is not provided, the function first estimates the mean profile of each cell cluster from the provided cell x gene count matrix and the cluster assignment for the entire data set, and uses these estimated cluster-specific profiles as the reference profiles.
For each transcript data.frame, the function then scores each transcript based on the cell type-specific reference profiles, evaluates the goodness-of-fit of each transcript within its original cell segment, and identifies the low-score transcript groups within cells that have strong spatial dependency in their transcript score profile.
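The first two steps can be sketched in base R on toy data. This is a minimal sketch, not the package's code: the toy counts and clust are invented, and a plain multinomial log-likelihood with a small pseudocount is swapped in as a stand-in for the package's transcript score.

```r
# Hypothetical toy inputs: 4 cells x 3 genes, two clusters
counts <- matrix(c(10, 0, 2,
                    8, 1, 3,
                    0, 9, 1,
                    1, 7, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("cell", 1:4),
                                 c("geneA", "geneB", "geneC")))
clust <- c("a", "a", "b", "b")

# Step 1: mean profile per cluster; rowsum() adds up the cells of each cluster
cluster_sums <- rowsum(counts, group = clust)
cluster_sizes <- as.vector(table(clust)[rownames(cluster_sums)])
mean_profiles <- t(cluster_sums / cluster_sizes)  # genes x clusters

# Step 2 (stand-in score): convert profiles to gene frequencies and pick,
# for each cell, the cluster with the highest multinomial log-likelihood
pseudo <- mean_profiles + 1e-3
freq <- sweep(pseudo, 2, colSums(pseudo), "/")
loglik <- counts %*% log(freq)  # cells x clusters
assigned <- colnames(loglik)[max.col(loglik, ties.method = "first")]
```

The transposition matters because counts is cells x genes while refProfiles is documented as genes x clusters.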
When transDF_export_option = 1, the function saves each per-FOV output as an individual file in the path_to_output directory; flagged_transDF, modStats_ToFlagCells and classDF_ToFlagTrans are each saved as a csv file.
- flagged_transDF
a transcript data.frame for each FOV, with columns for the unique IDs of transcripts (UMI_transID) and cells (UMI_cellID), for the global coordinate system (x, y, z), and for the goodness-of-fit in the original cell segment (SVM_class); the original per-FOV cell ID and the pixel/index-based coordinate system are saved under the columns CellId, pixel_x, pixel_y, idx_z
- modStats_ToFlagCells
a data.frame for spatial modeling statistics of each cell, output of the score_cell_segmentation_error function
- classDF_ToFlagTrans
a data.frame for the class assignment of transcripts within putative wrongly segmented cells, output of the flag_bad_transcripts function
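Given a flagged transcript data.frame, a trimmed per-cell expression matrix of the kind described for return_trimmed_perCell could be rebuilt roughly as follows. This is a sketch under assumptions: the rows are mock data, target is assumed to be the gene column, and a dense base R matrix stands in for the sparse matrix that the function returns.

```r
# Mock flagged transcript table; SVM_class = 0 marks low goodness-of-fit
flagged_transDF <- data.frame(
  UMI_cellID = c("c_1_1_1", "c_1_1_1", "c_1_1_1", "c_1_1_2", "c_1_1_2"),
  target = c("geneA", "geneA", "geneB", "geneB", "geneC"),
  SVM_class = c(1, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Drop the flagged transcripts, then count the remaining ones per gene and cell
kept <- flagged_transDF[flagged_transDF$SVM_class != 0, ]
trimmed_counts <- unclass(table(kept$target, kept$UMI_cellID))  # gene x cell
```

The single geneB transcript of the first cell is flagged, so that cell's geneB count drops to zero while its two geneA transcripts are kept.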
To account for genes that are missing from refProfiles but present in the input transcript data.frame, genes in ctrl_genes are assigned a goodness-of-fit score equal to svmClass_score_cutoff for all cell types, which minimizes the impact of those genes on the identification of low-score transcript groups via SVM. To avoid significant interference from those ctrl_genes, it is recommended to keep the total counts of those genes below 1% of the total counts of all genes in each cell.
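The 1% recommendation can be checked before running the function with a few lines of base R. The sketch below uses mock transcripts and a hypothetical control gene name, NegPrb1; it is a pre-flight check written for illustration and is not part of FastReseg.

```r
# Mock transcript table: cell 1 has 200 transcripts, cell 2 has 100
transcript_df <- data.frame(
  CellId = c(rep(1, 200), rep(2, 100)),
  target = c(rep("geneA", 199), "NegPrb1",
             rep("geneB", 95), rep("NegPrb1", 5)),
  stringsAsFactors = FALSE
)
ctrl_genes <- "NegPrb1"  # hypothetical control gene name

# Fraction of each cell's transcripts that come from control genes
is_ctrl <- transcript_df$target %in% ctrl_genes
ctrl_fraction <- tapply(is_ctrl, transcript_df$CellId, mean)

# Cells at or above the recommended 1% limit
cells_over_limit <- names(ctrl_fraction)[ctrl_fraction >= 0.01]
```

Cells listed in cells_over_limit are the ones where control genes may interfere with the SVM step, so they are worth inspecting before interpreting their flagged transcripts.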
Examples
data("mini_transcriptDF")
data("ori_RawExprs")
data("example_refProfiles")
data("example_baselineCT")
# cell_ID for extracellular transcripts
extracellular_cellID <- mini_transcriptDF[which(mini_transcriptDF$CellId == 0), 'cell_ID']
# case #1: provide `transcript_df` directly,
# do auto cluster assignment of each cell based on gene expression matrix,
# `counts`, and cluster-specific reference profiles, `refProfiles`
res1 <- fastReseg_flag_all_errors(counts = ori_RawExprs,
                                  clust = NULL,
                                  refProfiles = example_refProfiles,
                                  pixel_size = 1,
                                  zstep_size = 1,
                                  transcript_df = mini_transcriptDF,
                                  transID_coln = "UMI_transID",
                                  transGene_coln = "target",
                                  cellID_coln = "UMI_cellID",
                                  spatLocs_colns = c("x","y","z"),
                                  extracellular_cellID = extracellular_cellID,
                                  path_to_output = "res1f_directDF")
#> Per-FOV outputs including transcript data.frame with flagging information would be exported to disk at `path_to_output = 'res1f_directDF'`.
#> Found 960 common genes among `refProfiles` and `counts`.
#> No common cell types/clusters found between `clust` and `refProfiles`.
#> Perform cluster assignment based on maximum transcript score given the provided `refProfiles`.
#> Use the providied `molecular_distance_cutoff` = 1.0000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 10.0000 for searching of neighbor cells.
#> 3 Dimension of spaital coordinates are provided.
#> A single `transcript_df` is provided with unique `cellID_coln` = UMI_cellID and `transID_coln` = UMI_transID (use row idx if NULL).
#>
#> ##############
#> Processing file `1`: NA
#>
#>
#> ##------ Fri May 24 10:03:42 2024 ------##
#> Found 960 common genes among transcript_df and score_GeneMatrix.
#> Found 1375 cells and assigned cell type based on the provided `refProfiles` cluster profiles.
#> Run linear regreassion in 3 Dimension.
#> Warning: Below model_cutoff = 50, skip 37 cells with fewer transcripts. Move forward with remaining 1338 cells.
#> 373 cells, 0.2788 of all evaluated cells, are flagged for resegmentation with lrtest_nlog10P > 5.0.
#> Run SVM in 3 Dimension.
#> Found 373 common cells and 960 common genes among chosen_cells, transcript_df, and score_GeneMatrix.
#> Warning: Below model_cutoff = 50, skip 0 cells with fewer transcripts. Move forward with remaining 373 cells.
#> Warning: Skip 0 cells with all transcripts in same class given `score_cutoff = -2`. Move forward with remaining 373 cells.
#> Remove 0 cells with raw transcript score all in same class based on cutoff -2.00 when running spatial SVM model.
# case #2: provide file paths to per FOV transcript data files and specify
# the spatial offset for each FOV,
# do auto-calculation of cluster-specific reference profiles from gene
# expression matrix, `counts`, and cluster assignment of each cell, `clust`.
data("example_CellGeneExpr")
data("example_clust")
# the example individual transcript files are stored under the `extdata` directory of this package
# update your path accordingly
# Note that some assays, like SMI, have XY axes swapped between stage and each FOV;
# coordinates for each FOV should have units in micron
dataDir <- system.file("extdata", package = "FastReseg")
fileInfo_DF <- data.frame(
  file_path = fs::path(dataDir,
                       c("Run4104_FOV001__complete_code_cell_target_call_coord.csv",
                         "Run4104_FOV002__complete_code_cell_target_call_coord.csv")),
  slide = c(1, 1),
  fov = c(1, 2),
  stage_X = 1000 * c(5.13, -2.701),
  stage_Y = 1000 * c(-0.452, 0.081))
res2 <- fastReseg_flag_all_errors(counts = example_CellGeneExpr,
                                  clust = example_clust,
                                  refProfiles = NULL,
                                  transDF_fileInfo = fileInfo_DF,
                                  filepath_coln = 'file_path',
                                  prefix_colns = c('slide','fov'),
                                  # match XY axes between stage and each FOV
                                  fovOffset_colns = c('stage_Y','stage_X'),
                                  # 0.18 micron per pixel in transcript data
                                  pixel_size = 0.18,
                                  # 0.8 micron per z step in transcript data
                                  zstep_size = 0.8,
                                  transcript_df = NULL,
                                  # row index as transcript_id
                                  transID_coln = NULL,
                                  transGene_coln = "target",
                                  cellID_coln = "CellId",
                                  spatLocs_colns = c("x","y","z"),
                                  # CellId = 0 means extracellular transcripts in raw data
                                  extracellular_cellID = c(0),
                                  path_to_output = "res2f_multiFiles")
#> Per-FOV outputs including transcript data.frame with flagging information would be exported to disk at `path_to_output = 'res2f_multiFiles'`.
#> Found 960 common genes among `refProfiles` and `counts`.
#> Use the providied `molecular_distance_cutoff` = 1.0000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 10.0000 for searching of neighbor cells.
#> 3 Dimension of spaital coordinates are provided.
#> 2 individual per FOV files are provided in `transDF_fileInfo`, use the 1st file to calculate distance cutoffs
#> `transID_coln` and `cellID_coln` of each per FOV transcript_df would be re-named based on `prefix_colns` = `slide`,`fov`.