runPreprocess • FastReseg

Modular wrapper to get baseline data and cutoffs from the entire dataset.

Usage

runPreprocess(
  counts,
  clust = NULL,
  refProfiles = NULL,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum = NULL,
  imputeFlag_missingCTs = TRUE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = 2.7,
  cellular_distance_cutoff = NULL,
  transcript_df = NULL,
  transDF_fileInfo = NULL,
  filepath_coln = "file_path",
  prefix_colns = c("slide", "fov"),
  fovOffset_colns = c("stage_X", "stage_Y"),
  pixel_size = 0.18,
  zstep_size = 0.8,
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = "CellId",
  spatLocs_colns = c("x", "y", "z"),
  extracellular_cellID = NULL
)

Arguments

counts: Counts matrix for entire dataset, cells X genes.
clust: Vector of cluster assignments for each cell in counts; when NULL, each cell is automatically assigned to a cluster based on its maximum transcript score given the provided refProfiles.
refProfiles: A matrix of cluster profiles, genes X clusters, default = NULL to use external cluster assignments. Of note, when refProfiles != NULL, genes unique to counts but missing in refProfiles would be omitted from downstream analysis.
score_baseline: a named vector of score baseline under each cell type listed in refProfiles such that per cell transcript score higher than the baseline is required to call a cell type of high enough confidence; default = NULL to calculate from counts and refProfiles
lowerCutoff_transNum: a named vector of transcript number cutoff under each cell type such that higher than the cutoff is required to keep query cell as it is; default = NULL to calculate from counts and refProfiles
higherCutoff_transNum: a named vector of transcript number cutoff under each cell type such that lower than the cutoff is required to keep query cell as it is when there is neighbor cell of consistent cell type; default = NULL to calculate from counts and refProfiles
imputeFlag_missingCTs: flag to impute score_baseline, lowerCutoff_transNum, and higherCutoff_transNum for cell types present in refProfiles but missing in the provided transcript data files or the provided baseline and cutoffs; when TRUE, the median values of existing cell types are used as the values for missing cell types.
ctrl_genes: a vector of control genes that are present in the input transcript data.frame but not in refProfiles and are expected to have no cell type dependency, e.g. negative control probes; the ctrl_genes are included in the FastReseg analysis (default = NULL).
svmClass_score_cutoff: the cutoff of transcript score to separate between high and low score transcripts in SVM, used as the score values for ctrl_genes (default = -2)
molecular_distance_cutoff: maximum molecule-to-molecule distance within a connected transcript group, in microns (default = 2.7 micron). If set to NULL, the pipeline first randomly chooses no more than 2500 cells from up to 10 randomly picked ROIs, each with a search radius of 5 times cellular_distance_cutoff, and then calculates the minimal molecular distance between the picked cells. The pipeline then uses 5 times the 90% quantile of the minimal molecular distance as molecular_distance_cutoff. This calculation is slow and is not recommended for a large transcript data.frame.
cellular_distance_cutoff: maximum cell-to-cell distance in x, y between the center of a query cell and the centers of neighbor cells in direct contact, in microns. Default = NULL to use 2 times the average 2D cell diameter.
transcript_df: the data.frame of transcript level information with unique CellId, default = NULL to read from the transDF_fileInfo
transDF_fileInfo: a data.frame with one row per file, each file holding a per FOV transcript data.frame within which the coordinates and CellId are unique; columns include the file path of each per FOV transcript data.frame file and annotation columns like slide and fov, which are used as prefixes when creating unique cell_ID across the entire dataset; when NULL, the provided transcript_df is used directly
filepath_coln: the column name in transDF_fileInfo that holds the file path of each per FOV transcript data.frame
prefix_colns: the column names of annotation in transDF_fileInfo, to be added to the CellId as prefix when creating unique cell_ID for the entire dataset; set to NULL to use the original transID_coln or cellID_coln
fovOffset_colns: the column names of coordinate offsets in the 1st and 2nd dimension for each per FOV transcript data.frame in transDF_fileInfo, in microns. Note that some assays, like SMI, have the XY axes swapped between the stage and each FOV, such that fovOffset_colns should be c("stage_Y", "stage_X").
pixel_size: the micrometer size of image pixel listed in 1st and 2nd dimension of spatLocs_colns of each transcript_df
zstep_size: the micrometer size of z-step for the optional 3rd dimension of spatLocs_colns of each transcript_df
transID_coln: the column name of transcript_ID in transcript_df, default = NULL to use row index of transcript in each transcript_df; when prefix_colns != NULL, unique transcript_id would be generated from prefix_colns and transID_coln in each transcript_df
transGene_coln: the column name of target or gene name in transcript_df
cellID_coln: the column name of cell_ID in transcript_df; when prefix_colns != NULL, unique cell_ID would be generated from prefix_colns and cellID_coln in each transcript_df
spatLocs_colns: column names for 1st, 2nd and optional 3rd dimension of spatial coordinates in transcript_df
extracellular_cellID: a vector of cell_ID for extracellular transcripts, which are removed from the resegmentation pipeline (default = NULL)
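
The two distance cutoffs described above reduce to simple summary statistics once the per-cell inputs are available. The following is a minimal sketch of that arithmetic only, assuming the minimal molecular distances and the 2D cell diameters have already been measured; the vectors `min_mol_dist` and `cell_diameter` are hypothetical stand-ins, and the ROI sampling and neighbor search that FastReseg performs are not reproduced here.

```r
# Hypothetical input: minimal molecule-to-molecule distance (micron) for each
# sampled cell. Simulated values, not taken from FastReseg.
set.seed(1)
min_mol_dist <- stats::rexp(500, rate = 10)

# molecular_distance_cutoff = NULL case:
# 5 times the 90% quantile of the minimal molecular distance.
q90 <- stats::quantile(min_mol_dist, probs = 0.9, names = FALSE)
molecular_distance_cutoff <- 5 * q90

# Hypothetical input: 2D diameters (micron) of a few cells.
cell_diameter <- c(8.2, 10.5, 9.1, 11.3)

# cellular_distance_cutoff = NULL case:
# 2 times the average 2D cell diameter.
cellular_distance_cutoff <- 2 * mean(cell_diameter)
```

Because the molecular cutoff scales with the upper tail of the distance distribution, a few sparsely populated cells can raise it noticeably; passing a known value avoids both that sensitivity and the slow estimation step.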

Value

a nested list

clust: vector of cluster assignments for each cell in counts, used in calculating baselineData
refProfiles: a genes X clusters matrix of cluster-specific reference profiles to use in the resegmentation pipeline
baselineData: a list of two matrices in cluster X percentile format giving the cluster-specific percentile distribution of per cell values; span_score is for the average per molecule transcript tLLR score of each cell, span_transNum is for the transcript number of each cell.
cutoffs_list: a list of cutoffs to use in the resegmentation pipeline, including score_baseline, lowerCutoff_transNum, higherCutoff_transNum, cellular_distance_cutoff, molecular_distance_cutoff
ctrl_genes: a vector of control genes whose transcript scores are set to a fixed value for all cell types; returned when ctrl_genes is not NULL.
score_GeneMatrix: a gene x cell-type score matrix to use in the resegmentation pipeline; the scores for ctrl_genes are set to be the same as svmClass_score_cutoff
processed_1st_transDF: a list of 2 elements for the intracellular and extracellular transcript data.frame of the processed outcomes of the 1st transcript file
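
To make the relationship between refProfiles, score_GeneMatrix, ctrl_genes, and clust more concrete, here is a small sketch. It assumes a log-likelihood style score that is shifted so each gene's best cell type scores 0; that scoring rule, the toy matrices, and the probe names `NegPrb1` and `NegPrb2` are illustrative assumptions, not the package's confirmed internals. The parts taken from the descriptions above are that control genes receive the fixed value svmClass_score_cutoff and that clusters are assigned by maximum transcript score.

```r
# Hypothetical reference profiles: 3 genes x 2 cell types.
refProfiles <- matrix(c(10, 1, 5,
                        1, 20, 5),
                      nrow = 3,
                      dimnames = list(c("geneA", "geneB", "geneC"),
                                      c("ct1", "ct2")))

# Assumed scoring rule: log of the column-normalized profile, shifted per gene
# so that the best-matching cell type has score 0 and all others are negative.
prof <- sweep(refProfiles, 2, colSums(refProfiles), "/")
loglik <- log(prof)
score_GeneMatrix <- loglik - apply(loglik, 1, max)

# Control genes get the same fixed score under every cell type.
ctrl_genes <- c("NegPrb1", "NegPrb2")
svmClass_score_cutoff <- -2
ctrl_scores <- matrix(svmClass_score_cutoff,
                      nrow = length(ctrl_genes),
                      ncol = ncol(score_GeneMatrix),
                      dimnames = list(ctrl_genes, colnames(score_GeneMatrix)))
score_GeneMatrix <- rbind(score_GeneMatrix, ctrl_scores)

# Hypothetical counts: 2 cells x 5 genes, columns ordered like the score rows.
counts <- matrix(c(8, 0, 2, 1, 0,
                   0, 9, 1, 0, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("cell1", "cell2"),
                                 rownames(score_GeneMatrix)))

# Total transcript score of each cell under each cell type (cells x cell types),
# then pick the cell type with the maximum score.
cell_scores <- counts %*% score_GeneMatrix
clust <- colnames(cell_scores)[max.col(cell_scores, ties.method = "first")]
names(clust) <- rownames(counts)
```

Since the control genes contribute the same amount to every cell type, they shift a cell's total score without changing which cell type wins.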

The cutoffs_list is a list containing

score_baseline: a named vector of score baseline under each cell type listed in refProfiles such that per cell transcript score higher than the baseline is required to call a cell type of high enough confidence.
lowerCutoff_transNum: a named vector of transcript number cutoff under each cell type such that higher than the cutoff is required to keep query cell as it is.
higherCutoff_transNum: a named vector of transcript number cutoff under each cell type such that lower than the cutoff is required to keep query cell as it is when there is neighbor cell of consistent cell type.
cellular_distance_cutoff: maximum cell-to-cell distance in x, y between the center of query cells to the center of neighbor cells with direct contact, unit in micron.
molecular_distance_cutoff: maximum molecule-to-molecule distance within connected transcript group, unit in micron.
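
When imputeFlag_missingCTs is TRUE, cell types that appear in refProfiles but have no baseline or cutoff receive the median of the existing cell types. The sketch below illustrates that rule for a single named vector; the helper `impute_missing_cts` and the cell type names are hypothetical and only mirror the documented behavior.

```r
# Hypothetical helper: fill in cell types missing from a named cutoff vector
# with the median of the cell types that do have a value.
impute_missing_cts <- function(cutoff, all_cts) {
  out <- stats::setNames(rep(NA_real_, length(all_cts)), all_cts)
  known <- intersect(names(cutoff), all_cts)
  out[known] <- cutoff[known]
  out[is.na(out)] <- stats::median(cutoff[known])
  out
}

# Baselines are known for three cell types; ct4 exists only in refProfiles.
score_baseline <- c(ct1 = -0.8, ct2 = -1.2, ct3 = -1.0)
all_cts <- c("ct1", "ct2", "ct3", "ct4")

imputed <- impute_missing_cts(score_baseline, all_cts)
```

The same helper would apply unchanged to lowerCutoff_transNum and higherCutoff_transNum, since all three are named vectors keyed by cell type.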

Examples

 
data("mini_transcriptDF")
data("example_CellGeneExpr")
data("example_clust")
data("example_refProfiles")
# cell_ID for extracellular transcripts
extracellular_cellID <- mini_transcriptDF[which(mini_transcriptDF$CellId == 0), 'cell_ID']

# case 1: use `clust` and `transcript_df` directly, with known distance cutoffs
prep_res1 <- runPreprocess(
  counts = example_CellGeneExpr,
  clust = example_clust,
  refProfiles = NULL,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum= NULL,
  imputeFlag_missingCTs = FALSE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = 2.7,
  cellular_distance_cutoff = 20,
  transcript_df = mini_transcriptDF, 
  transDF_fileInfo = NULL, 
  pixel_size = 0.18,
  zstep_size = 0.8, 
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = 'CellId',
  spatLocs_colns = c('x','y','z'),
  extracellular_cellID = 0 
)
#> Found 960 common genes among `refProfiles` and `counts`. 
#> Use the providied `molecular_distance_cutoff` = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 20.0000 for searching of neighbor cells.

# case 2: use `refProfiles` to get `clust`, use `transcript_df` directly, 
# unknown distance cutoffs
prep_res2 <- runPreprocess(
  counts = example_CellGeneExpr,
  clust = NULL,
  refProfiles = example_refProfiles,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum= NULL,
  
  # impute for cell types missing in provided 'transcript_df' 
  imputeFlag_missingCTs = TRUE, 
  
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = NULL,
  cellular_distance_cutoff = NULL,
  transcript_df = mini_transcriptDF, 
  transDF_fileInfo = NULL, 
  pixel_size = 0.18,
  zstep_size = 0.8, 
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = 'CellId',
  spatLocs_colns = c('x','y','z'),
  extracellular_cellID = 0 
)
#> Found 960 common genes among `refProfiles` and `counts`. 
#> No common cell types/clusters found between `clust` and `refProfiles`.
#> Perform cluster assignment based on maximum transcript score given the provided `refProfiles`.
#> Extract distance cutoff from first input transcript data.
#> 3 Dimension of spaital coordinates are provided.
#> 
#> Use 2 times of average 2D cell diameter as cellular_distance_cutoff = 4.3628 for searching of neighbor cells.
#> Identified 3D coordinates with variance. 
#> Distribution of minimal molecular distance between 1375 cells: 0, 0.01, 0.03, 0.04, 0.05, 0.07, 0.09, 0.11, 0.14, 0.18, 1.24, at quantile = 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
#> Use 5 times of 90% quantile of minimal 3D molecular distance between picked cells as `molecular_distance_cutoff` = 0.9003 for defining direct neighbor cells.
#> Use `molecular_distance_cutoff` = 0.9003 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use cellular_distance_cutoff = 4.3628 for searching of neighbor cells.

# case 3: provide both `refProfiles` and `clust`, use transDF_fileInfo for 
# multi-files, no known molecular distance cutoffs
dataDir <- system.file("extdata", package = "FastReseg")
transDF_fileInfo <- data.frame(
  file_path = fs::path(dataDir,
                       c("Run4104_FOV001__complete_code_cell_target_call_coord.csv",
                         "Run4104_FOV002__complete_code_cell_target_call_coord.csv")),
  slide = c(1, 1),
  fov = c(1,2),
  stage_X = 1000*c(5.13, -2.701),
  stage_Y = 1000*c(-0.452, 0.081))
prep_res3 <- runPreprocess(
  counts = example_CellGeneExpr,
  clust = example_clust,
  refProfiles = example_refProfiles,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum= NULL,
  imputeFlag_missingCTs = TRUE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = NULL,
  cellular_distance_cutoff = 20,
  transcript_df = NULL, 
  transDF_fileInfo = transDF_fileInfo, 
  filepath_coln = 'file_path', 
  prefix_colns = c('slide','fov'), 
  fovOffset_colns = c('stage_X','stage_Y'), 
  pixel_size = 0.18,
  zstep_size = 0.8, 
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = 'CellId',
  spatLocs_colns = c('x','y','z'),
  extracellular_cellID = 0 
)
#> Found 960 common genes among `refProfiles` and `counts`. 
#> Extract distance cutoff from first input transcript data.
#> 3 Dimension of spaital coordinates are provided.
#> 2 individual per FOV files are provided in `transDF_fileInfo`, use the 1st file to calculate distance cutoffs
#> `transID_coln` and `cellID_coln` of each per FOV transcript_df would be re-named based on `prefix_colns` = `slide`,`fov`.
#> Use 2 times of average 2D cell diameter as cellular_distance_cutoff = 19.1980 for searching of neighbor cells.
#> Identified 3D coordinates with variance. 
#> Distribution of minimal molecular distance between 385 cells: 0, 0.16, 0.22, 0.29, 0.36, 0.44, 0.52, 0.63, 0.76, 0.84, 8.01, at quantile = 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
#> Use 5 times of 90% quantile of minimal 3D molecular distance between picked cells as `molecular_distance_cutoff` = 4.2191 for defining direct neighbor cells.
#> Use `molecular_distance_cutoff` = 4.2191 for defining direct neighbor cells based on molecule-to-molecule distance.
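
For the multi-file case, pixel_size, zstep_size, fovOffset_colns, and prefix_colns together describe how per FOV coordinates and cell IDs map onto the whole dataset. The following sketch shows one plausible reading of that mapping: scale pixel coordinates to microns, add the per FOV stage offset, and prefix the cell ID with slide and fov. The helper `to_global`, the toy table, and the underscore-joined ID format are assumptions for illustration; the exact arithmetic and ID format used by FastReseg may differ.

```r
# Hypothetical per FOV transcript table with x, y in pixels and z in z-steps.
fov_df <- data.frame(
  CellId = c(1, 1, 2),
  x = c(100, 150, 400),
  y = c(50, 60, 300),
  z = c(0, 1, 2),
  target = c("geneA", "geneB", "geneA")
)

# One row of a transDF_fileInfo-like table, with offsets in microns.
file_info <- list(slide = 1, fov = 2, stage_X = -2701, stage_Y = 81)

to_global <- function(df, info, pixel_size, zstep_size,
                      fovOffset_colns = c("stage_X", "stage_Y")) {
  # Pixel to micron, then shift by the offset for the 1st and 2nd dimension.
  df$x <- df$x * pixel_size + info[[fovOffset_colns[1]]]
  df$y <- df$y * pixel_size + info[[fovOffset_colns[2]]]
  # z-step index to micron.
  df$z <- df$z * zstep_size
  # Assumed ID format: slide, fov and original CellId joined by underscores.
  df$cell_ID <- paste(info$slide, info$fov, df$CellId, sep = "_")
  df
}

global_df <- to_global(fov_df, file_info, pixel_size = 0.18, zstep_size = 0.8)
```

For assays with swapped stage axes, passing `fovOffset_colns = c("stage_Y", "stage_X")` to the helper applies the Y offset to the 1st dimension and the X offset to the 2nd.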