Modular wrapper to get baseline data and cutoffs from the entire dataset

Usage

runPreprocess(
  counts,
  clust = NULL,
  refProfiles = NULL,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum = NULL,
  imputeFlag_missingCTs = TRUE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = 2.7,
  cellular_distance_cutoff = NULL,
  transcript_df = NULL,
  transDF_fileInfo = NULL,
  filepath_coln = "file_path",
  prefix_colns = c("slide", "fov"),
  fovOffset_colns = c("stage_X", "stage_Y"),
  pixel_size = 0.18,
  zstep_size = 0.8,
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = "CellId",
  spatLocs_colns = c("x", "y", "z"),
  extracellular_cellID = NULL
)

Arguments

counts

Counts matrix for entire dataset, cells X genes.

clust

Vector of cluster assignments for each cell in counts. When NULL, each cell is automatically assigned to the cluster with the maximum transcript score given the provided refProfiles.

refProfiles

A matrix of cluster profiles, genes X clusters; default = NULL to use external cluster assignments. Note that when refProfiles != NULL, genes present in counts but missing in refProfiles are omitted from downstream analysis.

score_baseline

a named vector of score baselines, one per cell type listed in refProfiles; a cell's per cell transcript score must be higher than the baseline for the cell type call to be of high enough confidence; default = NULL to calculate from counts and refProfiles

lowerCutoff_transNum

a named vector of transcript number cutoffs, one per cell type; a query cell with a transcript number higher than the cutoff is kept as it is; default = NULL to calculate from counts and refProfiles

higherCutoff_transNum

a named vector of transcript number cutoffs, one per cell type; a query cell with a transcript number lower than the cutoff is kept as it is when it has a neighbor cell of consistent cell type; default = NULL to calculate from counts and refProfiles

imputeFlag_missingCTs

flag to impute score_baseline, lowerCutoff_transNum, and higherCutoff_transNum for cell types present in refProfiles but missing in the provided transcript data files or in the provided baseline and cutoffs; when TRUE, the median values of the existing cell types are used as the values for the missing cell types.

ctrl_genes

a vector of control genes that are present in the input transcript data.frame but not in refProfiles and are expected to have no cell type dependency, e.g. negative control probes; the ctrl_genes are included in FastReseg analysis. (default = NULL)

svmClass_score_cutoff

the cutoff of transcript score to separate between high and low score transcripts in SVM, used as the score values for ctrl_genes (default = -2)

molecular_distance_cutoff

maximum molecule-to-molecule distance within a connected transcript group, in microns (default = 2.7 micron). If set to NULL, the pipeline first randomly chooses no more than 2500 cells from up to 10 randomly picked ROIs, with a search radius of 5 times cellular_distance_cutoff, and then calculates the minimal molecular distance between the picked cells. The pipeline then uses 5 times the 90% quantile of the minimal molecular distance as molecular_distance_cutoff. This calculation is slow and is not recommended for a large transcript data.frame.

cellular_distance_cutoff

maximum cell-to-cell distance in x, y between the center of a query cell and the centers of neighbor cells with direct contact, in microns. Default = NULL to use 2 times the average 2D cell diameter.

transcript_df

the data.frame of transcript level information with unique CellId, default = NULL to read from the transDF_fileInfo

transDF_fileInfo

a data.frame with one row per file of per FOV transcript data.frame, within which the coordinates and CellId are unique; columns include the file path of each per FOV transcript data.frame file and annotation columns like slide and fov, which are used as prefix when creating unique cell_ID across the entire dataset; when NULL, the provided transcript_df is used directly

filepath_coln

the column name in transDF_fileInfo that holds the file path of each individual per FOV transcript data.frame

prefix_colns

the column names of annotation in transDF_fileInfo, to be added to the CellId as prefix when creating unique cell_ID for the entire dataset; set to NULL to use the original transID_coln or cellID_coln

fovOffset_colns

the column names of coordinate offsets in 1st and 2nd dimension for each per FOV transcript data.frame in transDF_fileInfo, in microns. Note that some assays, like SMI, have XY axes swapped between stage and each FOV, such that fovOffset_colns should be c("stage_Y", "stage_X").

pixel_size

the size of an image pixel in micrometers, used for the 1st and 2nd dimension of spatLocs_colns of each transcript_df

zstep_size

the size of a z-step in micrometers, used for the optional 3rd dimension of spatLocs_colns of each transcript_df

transID_coln

the column name of transcript_ID in transcript_df, default = NULL to use row index of transcript in each transcript_df; when prefix_colns != NULL, unique transcript_id would be generated from prefix_colns and transID_coln in each transcript_df

transGene_coln

the column name of target or gene name in transcript_df

cellID_coln

the column name of cell_ID in transcript_df; when prefix_colns != NULL, unique cell_ID would be generated from prefix_colns and cellID_coln in each transcript_df

spatLocs_colns

column names for 1st, 2nd and optional 3rd dimension of spatial coordinates in transcript_df

extracellular_cellID

a vector of cell_ID for extracellular transcripts, which are removed from the resegmentation pipeline (default = NULL)
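
The coordinate and ID arguments above (pixel_size, zstep_size, fovOffset_colns, prefix_colns, cellID_coln) can be illustrated on a toy data.frame. The following is a minimal sketch, not FastReseg's internal code: it assumes pixel coordinates are scaled to microns before the per FOV stage offsets are added, and it assumes an underscore-joined prefix for the unique cell ID. The order of operations and the exact ID format used by the package may differ. The helper name prep_one_fov, the column unique_cell_ID, and all toy values are hypothetical.

```r
# Hypothetical helper: illustrates unit conversion and ID prefixing only.
prep_one_fov <- function(trans_df, file_info_row,
                         prefix_colns = c("slide", "fov"),
                         fovOffset_colns = c("stage_X", "stage_Y"),
                         pixel_size = 0.18, zstep_size = 0.8,
                         cellID_coln = "CellId",
                         spatLocs_colns = c("x", "y", "z")) {
  # pixel -> micron for the 1st and 2nd dimension, then shift by the FOV offset
  # (assumed order of operations)
  for (i in 1:2) {
    trans_df[[spatLocs_colns[i]]] <-
      trans_df[[spatLocs_colns[i]]] * pixel_size +
      file_info_row[[fovOffset_colns[i]]]
  }
  # z-step index -> micron for the optional 3rd dimension
  if (length(spatLocs_colns) == 3) {
    trans_df[[spatLocs_colns[3]]] <- trans_df[[spatLocs_colns[3]]] * zstep_size
  }
  # assumed ID format: <slide>_<fov>_<CellId>
  prefix <- paste(unlist(file_info_row[prefix_colns]), collapse = "_")
  trans_df$unique_cell_ID <- paste(prefix, trans_df[[cellID_coln]], sep = "_")
  trans_df
}

# toy transcript data in pixel / z-step units
toy_df <- data.frame(x = c(100, 200), y = c(50, 150), z = c(1, 2),
                     CellId = c(1, 2), target = c("GeneA", "GeneB"))
# toy file info row with stage offsets in microns
toy_info <- data.frame(slide = 1, fov = 2, stage_X = 1000, stage_Y = -500)

toy_out <- prep_one_fov(toy_df, toy_info[1, ])
toy_out
```

For an assay with swapped stage axes, the same sketch would be called with fovOffset_colns = c("stage_Y", "stage_X").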

Value

a nested list

clust

vector of cluster assignments for each cell in counts, used in calculating baselineData

refProfiles

a genes X clusters matrix of cluster-specific reference profiles to use in the resegmentation pipeline

baselineData

a list of two matrices in cluster X percentile format for the cluster-specific percentile distribution of per cell values; span_score is for the average per molecule transcript tLLR score of each cell, span_transNum is for the transcript number of each cell.

cutoffs_list

a list of cutoffs to use in the resegmentation pipeline, including score_baseline, lowerCutoff_transNum, higherCutoff_transNum, cellular_distance_cutoff, molecular_distance_cutoff

ctrl_genes

a vector of control genes whose transcript scores are set to a fixed value for all cell types; returned when ctrl_genes is not NULL.

score_GeneMatrix

a gene x cell-type score matrix to use in the resegmentation pipeline; the scores for ctrl_genes are set to be the same as svmClass_score_cutoff

processed_1st_transDF

a list of 2 elements for the intracellular and extracellular transcript data.frame of the processed outcomes of the 1st transcript file
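
To make the cluster X percentile layout of baselineData concrete, here is a minimal sketch on toy values. It only shows the shape of a matrix like span_transNum; the percentile grid (0% to 100% in steps of 25%), the toy numbers, and the variable names toy_transNum and toy_clust are assumptions for illustration, and the package may use a different grid.

```r
# Toy per cell transcript numbers and cluster labels (illustrative values only)
toy_transNum <- c(10, 20, 30, 40, 100, 200, 300)
toy_clust    <- c("a", "a", "a", "a", "b", "b", "b")

probs <- seq(0, 1, by = 0.25)  # assumed percentile grid

# quantile() per cluster gives percentiles in rows; transpose so that
# each row is a cluster and each column is a percentile
span_transNum <- t(sapply(split(toy_transNum, toy_clust), quantile, probs = probs))
span_transNum
```

A matrix for span_score would follow the same pattern, with the average per molecule transcript score of each cell in place of the transcript number.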

The cutoffs_list is a list containing

score_baseline

a named vector of score baselines, one per cell type listed in refProfiles; a cell's per cell transcript score must be higher than the baseline for the cell type call to be of high enough confidence.

lowerCutoff_transNum

a named vector of transcript number cutoffs, one per cell type; a query cell with a transcript number higher than the cutoff is kept as it is.

higherCutoff_transNum

a named vector of transcript number cutoffs, one per cell type; a query cell with a transcript number lower than the cutoff is kept as it is when it has a neighbor cell of consistent cell type.

cellular_distance_cutoff

maximum cell-to-cell distance in x, y between the center of a query cell and the centers of neighbor cells with direct contact, in microns.

molecular_distance_cutoff

maximum molecule-to-molecule distance within a connected transcript group, in microns.
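
Two of the rules described on this page are simple enough to sketch directly: the automatic molecular_distance_cutoff (5 times the 90% quantile of the minimal molecular distance between picked cells) and the median imputation controlled by imputeFlag_missingCTs. The code below is a minimal illustration on made-up numbers, not the package's implementation; toy_min_dist, the cell type names, and the helper impute_missing are hypothetical, and the sampling of cells and ROIs is left out entirely.

```r
# --- molecular_distance_cutoff: 5 x 90% quantile of minimal distances ---
# made-up minimal molecule-to-molecule distances between picked cells, in micron
toy_min_dist <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30,
                  0.35, 0.40, 0.45, 0.50, 3.00)
toy_molecular_cutoff <- 5 * unname(quantile(toy_min_dist, probs = 0.9))

# --- median imputation for cell types missing from a named cutoff vector ---
# hypothetical helper: fills missing cell types with the median of existing ones
impute_missing <- function(cutoff, all_cell_types) {
  missing_cts <- setdiff(all_cell_types, names(cutoff))
  cutoff[missing_cts] <- median(cutoff)
  cutoff[all_cell_types]  # return in the order of all_cell_types
}

toy_score_baseline <- c(Tcell = -0.5, Bcell = -0.3, Macrophage = -0.1)
toy_imputed <- impute_missing(
  toy_score_baseline,
  c("Tcell", "Bcell", "Macrophage", "Fibroblast")
)
toy_imputed
```

The messages printed in case 2 of the Examples are consistent with the first rule: the reported 90% quantile is about 0.18 and the reported molecular_distance_cutoff is 0.9003.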

Examples

 
data("mini_transcriptDF")
data("example_CellGeneExpr")
data("example_clust")
data("example_refProfiles")
# cell_ID for extracellular transcripts
extracellular_cellID <- mini_transcriptDF[which(mini_transcriptDF$CellId == 0), 'cell_ID']

# case 1: use `clust` and `transcript_df` directly, with known distance cutoffs
prep_res1 <- runPreprocess(
  counts = example_CellGeneExpr,
  clust = example_clust,
  refProfiles = NULL,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum= NULL,
  imputeFlag_missingCTs = FALSE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = 2.7,
  cellular_distance_cutoff = 20,
  transcript_df = mini_transcriptDF, 
  transDF_fileInfo = NULL, 
  pixel_size = 0.18,
  zstep_size = 0.8, 
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = 'CellId',
  spatLocs_colns = c('x','y','z'),
  extracellular_cellID = 0 
)
#> Found 960 common genes among `refProfiles` and `counts`. 
#> Use the providied `molecular_distance_cutoff` = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 20.0000 for searching of neighbor cells.

# case 2: use `refProfiles` to get `clust`, use `transcript_df` directly, 
# unknown distance cutoffs
prep_res2 <- runPreprocess(
  counts = example_CellGeneExpr,
  clust = NULL,
  refProfiles = example_refProfiles,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum= NULL,
  
  # impute for cell types missing in provided 'transcript_df' 
  imputeFlag_missingCTs = TRUE, 
  
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = NULL,
  cellular_distance_cutoff = NULL,
  transcript_df = mini_transcriptDF, 
  transDF_fileInfo = NULL, 
  pixel_size = 0.18,
  zstep_size = 0.8, 
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = 'CellId',
  spatLocs_colns = c('x','y','z'),
  extracellular_cellID = 0 
)
#> Found 960 common genes among `refProfiles` and `counts`. 
#> No common cell types/clusters found between `clust` and `refProfiles`.
#> Perform cluster assignment based on maximum transcript score given the provided `refProfiles`.
#> Extract distance cutoff from first input transcript data.
#> 3 Dimension of spaital coordinates are provided.
#> 
#> Use 2 times of average 2D cell diameter as cellular_distance_cutoff = 4.3628 for searching of neighbor cells.
#> Identified 3D coordinates with variance. 
#> Distribution of minimal molecular distance between 1375 cells: 0, 0.01, 0.03, 0.04, 0.05, 0.07, 0.09, 0.11, 0.14, 0.18, 1.24, at quantile = 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
#> Use 5 times of 90% quantile of minimal 3D molecular distance between picked cells as `molecular_distance_cutoff` = 0.9003 for defining direct neighbor cells.
#> Use `molecular_distance_cutoff` = 0.9003 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use cellular_distance_cutoff = 4.3628 for searching of neighbor cells.

# case 3: provide both `refProfiles` and `clust`, use transDF_fileInfo for 
# multi-files, no known molecular distance cutoffs
dataDir <- system.file("extdata", package = "FastReseg")
transDF_fileInfo <- data.frame(
  file_path = fs::path(dataDir,
                       c("Run4104_FOV001__complete_code_cell_target_call_coord.csv",
                         "Run4104_FOV002__complete_code_cell_target_call_coord.csv")),
  slide = c(1, 1),
  fov = c(1,2),
  stage_X = 1000*c(5.13, -2.701),
  stage_Y = 1000*c(-0.452, 0.081))
prep_res3 <- runPreprocess(
  counts = example_CellGeneExpr,
  clust = example_clust,
  refProfiles = example_refProfiles,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum= NULL,
  imputeFlag_missingCTs = TRUE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = NULL,
  cellular_distance_cutoff = 20,
  transcript_df = NULL, 
  transDF_fileInfo = transDF_fileInfo, 
  filepath_coln = 'file_path', 
  prefix_colns = c('slide','fov'), 
  fovOffset_colns = c('stage_X','stage_Y'), 
  pixel_size = 0.18,
  zstep_size = 0.8, 
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = 'CellId',
  spatLocs_colns = c('x','y','z'),
  extracellular_cellID = 0 
)
#> Found 960 common genes among `refProfiles` and `counts`. 
#> Extract distance cutoff from first input transcript data.
#> 3 Dimension of spaital coordinates are provided.
#> 2 individual per FOV files are provided in `transDF_fileInfo`, use the 1st file to calculate distance cutoffs
#> `transID_coln` and `cellID_coln` of each per FOV transcript_df would be re-named based on `prefix_colns` = `slide`,`fov`.
#> Use 2 times of average 2D cell diameter as cellular_distance_cutoff = 19.1980 for searching of neighbor cells.
#> Identified 3D coordinates with variance. 
#> Distribution of minimal molecular distance between 385 cells: 0, 0.16, 0.22, 0.29, 0.36, 0.44, 0.52, 0.63, 0.76, 0.84, 8.01, at quantile = 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
#> Use 5 times of 90% quantile of minimal 3D molecular distance between picked cells as `molecular_distance_cutoff` = 4.2191 for defining direct neighbor cells.
#> Use `molecular_distance_cutoff` = 4.2191 for defining direct neighbor cells based on molecule-to-molecule distance.