Modular wrapper to get baseline data and cutoffs from the entire dataset
Usage
runPreprocess(
  counts,
  clust = NULL,
  refProfiles = NULL,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum = NULL,
  imputeFlag_missingCTs = TRUE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = 2.7,
  cellular_distance_cutoff = NULL,
  transcript_df = NULL,
  transDF_fileInfo = NULL,
  filepath_coln = "file_path",
  prefix_colns = c("slide", "fov"),
  fovOffset_colns = c("stage_X", "stage_Y"),
  pixel_size = 0.18,
  zstep_size = 0.8,
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = "CellId",
  spatLocs_colns = c("x", "y", "z"),
  extracellular_cellID = NULL
)
Arguments
- counts
Counts matrix for entire dataset, cells X genes.
- clust
Vector of cluster assignments for each cell in counts; when NULL, each cell is automatically assigned to a cluster based on its maximum transcript score given the provided refProfiles.
- refProfiles
A matrix of cluster profiles, genes X clusters; default = NULL to use external cluster assignments. Of note, when refProfiles != NULL, genes present in counts but missing in refProfiles are omitted from downstream analysis.
- score_baseline
a named vector of score baseline under each cell type listed in refProfiles, such that a per-cell transcript score higher than the baseline is required to call a cell type with high enough confidence; default = NULL to calculate from counts and refProfiles
- lowerCutoff_transNum
a named vector of transcript number cutoff under each cell type, such that a transcript number higher than the cutoff is required to keep the query cell as it is; default = NULL to calculate from counts and refProfiles
- higherCutoff_transNum
a named vector of transcript number cutoff under each cell type, such that a transcript number lower than the cutoff is required to keep the query cell as it is when there is a neighbor cell of consistent cell type; default = NULL to calculate from counts and refProfiles
- imputeFlag_missingCTs
flag to impute score_baseline, lowerCutoff_transNum and higherCutoff_transNum for cell types present in refProfiles but missing in the provided transcript data files or the provided baseline and cutoffs; when TRUE, the median values of existing cell types are used as the values for missing cell types.
- ctrl_genes
a vector of control genes that are present in the input transcript data.frame but not in refProfiles and are expected to have no cell type dependency, e.g. negative control probes; the ctrl_genes would be included in FastReseg analysis (default = NULL).
- svmClass_score_cutoff
the cutoff of transcript score to separate between high and low score transcripts in SVM, used as the score values for ctrl_genes (default = -2)
- molecular_distance_cutoff
maximum molecule-to-molecule distance within a connected transcript group, unit in micron (default = 2.7 micron). If set to NULL, the pipeline first randomly chooses no more than 2500 cells from up to 10 randomly picked ROIs, with a search radius of 5 times cellular_distance_cutoff, and then calculates the minimal molecular distance between picked cells. The pipeline then uses 5 times the 90% quantile of the minimal molecular distance as molecular_distance_cutoff. This calculation is slow and is not recommended for a large transcript data.frame.
- cellular_distance_cutoff
maximum cell-to-cell distance in x, y between the center of query cells and the center of neighbor cells with direct contact, unit in micron. Default = NULL to use 2 times the average 2D cell diameter.
- transcript_df
the data.frame of transcript-level information with unique CellId; default = NULL to read from transDF_fileInfo
- transDF_fileInfo
a data.frame with one row per individual file of per-FOV transcript data.frame, within which the coordinates and CellId are unique; columns include the file path of each per-FOV transcript data.frame file and annotation columns like slide and fov, to be used as prefix when creating unique cell_ID across the entire dataset; when NULL, use the provided transcript_df directly
- filepath_coln
the column name for the file path of each individual per-FOV transcript data.frame file in transDF_fileInfo
- prefix_colns
the column names of annotation in transDF_fileInfo, to be added to the CellId as prefix when creating unique cell_ID for the entire dataset; set to NULL to use the original transID_coln or cellID_coln
- fovOffset_colns
the column names of coordinate offsets in the 1st and 2nd dimension for each per-FOV transcript data.frame in transDF_fileInfo, unit in micron. Notice that some assays like SMI have XY axes swapped between stage and each FOV, such that fovOffset_colns should be c("stage_Y", "stage_X").
- pixel_size
the micrometer size of an image pixel, for the 1st and 2nd dimension of spatLocs_colns of each transcript_df
- zstep_size
the micrometer size of a z-step, for the optional 3rd dimension of spatLocs_colns of each transcript_df
- transID_coln
the column name of transcript_ID in transcript_df; default = NULL to use the row index of each transcript in each transcript_df; when prefix_colns != NULL, a unique transcript_id is generated from prefix_colns and transID_coln in each transcript_df
- transGene_coln
the column name of target or gene name in transcript_df
- cellID_coln
the column name of cell_ID in transcript_df; when prefix_colns != NULL, a unique cell_ID is generated from prefix_colns and cellID_coln in each transcript_df
- spatLocs_colns
column names for the 1st, 2nd and optional 3rd dimension of spatial coordinates in transcript_df
- extracellular_cellID
a vector of cell_ID for extracellular transcripts, which are removed from the resegmentation pipeline (default = NULL)
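The arguments pixel_size, zstep_size, fovOffset_colns and prefix_colns together describe how per-FOV pixel coordinates and cell IDs become dataset-wide values. The sketch below illustrates one way that could work; it is not FastReseg's internal code. The helper name to_global_coords, the simple addition of the offsets, and the underscore-joined ID format are all assumptions made for illustration.

```r
# Hypothetical helper (not part of FastReseg): convert per-FOV pixel
# coordinates to global micron coordinates and build unique cell IDs.
to_global_coords <- function(df, prefix_vals, fov_offset,
                             pixel_size = 0.18, zstep_size = 0.8,
                             cellID_coln = "CellId",
                             spatLocs_colns = c("x", "y", "z")) {
  out <- df
  # 1st and 2nd dimension: pixels -> microns, then shift by the FOV offset
  out[[spatLocs_colns[1]]] <- df[[spatLocs_colns[1]]] * pixel_size + fov_offset[1]
  out[[spatLocs_colns[2]]] <- df[[spatLocs_colns[2]]] * pixel_size + fov_offset[2]
  # optional 3rd dimension: z-steps -> microns
  if (length(spatLocs_colns) >= 3) {
    out[[spatLocs_colns[3]]] <- df[[spatLocs_colns[3]]] * zstep_size
  }
  # assumed ID format: prefix values and original cell ID joined by "_"
  out$cell_ID <- paste(paste(prefix_vals, collapse = "_"),
                       df[[cellID_coln]], sep = "_")
  out
}

# toy per-FOV table with coordinates in pixels and z-steps
fov_df <- data.frame(x = c(100, 200), y = c(50, 60), z = c(1, 2),
                     CellId = c(1, 2))
res <- to_global_coords(fov_df,
                        prefix_vals = c(slide = 1, fov = 2),
                        fov_offset = c(-2701, 81))
print(res)
```

For an assay with swapped stage axes, the same sketch would simply receive the two offsets in the opposite order, mirroring fovOffset_colns = c("stage_Y", "stage_X").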
Value
a nested list
- clust
vector of cluster assignments for each cell in counts, used in calculating baselineData
- refProfiles
a genes X clusters matrix of cluster-specific reference profiles to use in the resegmentation pipeline
- baselineData
a list of two matrices in cluster X percentile format for the cluster-specific percentile distribution of per-cell values; span_score is for the average per-molecule transcript tLLR score of each cell, span_transNum is for the transcript number of each cell.
- cutoffs_list
a list of cutoffs to use in the resegmentation pipeline, including score_baseline, lowerCutoff_transNum, higherCutoff_transNum, cellular_distance_cutoff and molecular_distance_cutoff
- ctrl_genes
a vector of control genes whose transcript scores are set to a fixed value for all cell types; returned when ctrl_genes is not NULL.
- score_GeneMatrix
a gene x cell-type score matrix to use in the resegmentation pipeline; the scores for ctrl_genes are set to be the same as svmClass_score_cutoff
- processed_1st_transDF
a list of 2 elements for the intracellular and extracellular transcript data.frame of the processed outcomes of the 1st transcript file
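A cluster X percentile matrix such as span_transNum can be pictured with base R alone. The following is an illustrative sketch, not FastReseg's internal code: the toy vectors trans_num and toy_clust and the chosen percentiles are assumptions, and the percentiles actually used by the package may differ.

```r
# Toy per-cell transcript numbers and their cluster labels (made-up data)
trans_num <- c(10, 20, 30, 40, 50, 5, 15, 25)
toy_clust <- c("a", "a", "a", "a", "a", "b", "b", "b")

# Assumed set of percentiles for the illustration
probs <- c(0, 0.25, 0.5, 0.75, 1)

# One row per cluster, one column per percentile
span_transNum <- t(sapply(split(trans_num, toy_clust),
                          stats::quantile, probs = probs))
print(span_transNum)
```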
The cutoffs_list is a list containing
- score_baseline
a named vector of score baseline under each cell type listed in refProfiles, such that a per-cell transcript score higher than the baseline is required to call a cell type with high enough confidence.
- lowerCutoff_transNum
a named vector of transcript number cutoff under each cell type, such that a transcript number higher than the cutoff is required to keep the query cell as it is.
- higherCutoff_transNum
a named vector of transcript number cutoff under each cell type, such that a transcript number lower than the cutoff is required to keep the query cell as it is when there is a neighbor cell of consistent cell type.
- cellular_distance_cutoff
maximum cell-to-cell distance in x, y between the center of query cells and the center of neighbor cells with direct contact, unit in micron.
- molecular_distance_cutoff
maximum molecule-to-molecule distance within connected transcript group, unit in micron.
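When imputeFlag_missingCTs = TRUE, cell types missing from score_baseline, lowerCutoff_transNum or higherCutoff_transNum receive the median of the existing cell types. A minimal sketch of that rule on one named vector follows; the helper name impute_missing_cts and the toy values are hypothetical, and the package's own implementation may handle edge cases differently.

```r
# Hypothetical helper (not part of FastReseg): fill in cell types that are
# absent from a named cutoff vector with the median of the existing values.
impute_missing_cts <- function(cutoffs, all_cts) {
  missing_cts <- setdiff(all_cts, names(cutoffs))
  if (length(missing_cts) > 0) {
    fill <- rep(stats::median(cutoffs), length(missing_cts))
    names(fill) <- missing_cts
    cutoffs <- c(cutoffs, fill)
  }
  # return in the order of the full cell type list
  cutoffs[all_cts]
}

# toy baseline with cell type "d" missing
toy_baseline <- c(a = -1.2, b = -0.8, c = -1.0)
imputed <- impute_missing_cts(toy_baseline, c("a", "b", "c", "d"))
print(imputed)
```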
Examples
data("mini_transcriptDF")
data("example_CellGeneExpr")
data("example_clust")
data("example_refProfiles")
# cell_ID for extracellular transcripts
extracellular_cellID <- mini_transcriptDF[which(mini_transcriptDF$CellId == 0), 'cell_ID']
# case 1: use `clust` and `transcript_df` directly, with known distance cutoffs
prep_res1 <- runPreprocess(
  counts = example_CellGeneExpr,
  clust = example_clust,
  refProfiles = NULL,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum = NULL,
  imputeFlag_missingCTs = FALSE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = 2.7,
  cellular_distance_cutoff = 20,
  transcript_df = mini_transcriptDF,
  transDF_fileInfo = NULL,
  pixel_size = 0.18,
  zstep_size = 0.8,
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = 'CellId',
  spatLocs_colns = c('x','y','z'),
  extracellular_cellID = 0
)
#> Found 960 common genes among `refProfiles` and `counts`.
#> Use the providied `molecular_distance_cutoff` = 2.7000 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use the providied `cellular_distance_cutoff` = 20.0000 for searching of neighbor cells.
# case 2: use `refProfiles` to get `clust`, use `transcript_df` directly,
# unknown distance cutoffs
prep_res2 <- runPreprocess(
  counts = example_CellGeneExpr,
  clust = NULL,
  refProfiles = example_refProfiles,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum = NULL,
  # impute for cell types missing in provided 'transcript_df'
  imputeFlag_missingCTs = TRUE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = NULL,
  cellular_distance_cutoff = NULL,
  transcript_df = mini_transcriptDF,
  transDF_fileInfo = NULL,
  pixel_size = 0.18,
  zstep_size = 0.8,
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = 'CellId',
  spatLocs_colns = c('x','y','z'),
  extracellular_cellID = 0
)
#> Found 960 common genes among `refProfiles` and `counts`.
#> No common cell types/clusters found between `clust` and `refProfiles`.
#> Perform cluster assignment based on maximum transcript score given the provided `refProfiles`.
#> Extract distance cutoff from first input transcript data.
#> 3 Dimension of spaital coordinates are provided.
#>
#> Use 2 times of average 2D cell diameter as cellular_distance_cutoff = 4.3628 for searching of neighbor cells.
#> Identified 3D coordinates with variance.
#> Distribution of minimal molecular distance between 1375 cells: 0, 0.01, 0.03, 0.04, 0.05, 0.07, 0.09, 0.11, 0.14, 0.18, 1.24, at quantile = 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
#> Use 5 times of 90% quantile of minimal 3D molecular distance between picked cells as `molecular_distance_cutoff` = 0.9003 for defining direct neighbor cells.
#> Use `molecular_distance_cutoff` = 0.9003 for defining direct neighbor cells based on molecule-to-molecule distance.
#> Use cellular_distance_cutoff = 4.3628 for searching of neighbor cells.
# case 3: provide both `refProfiles` and `clust`, use transDF_fileInfo for
# multi-files, no known molecular distance cutoffs
dataDir <- system.file("extdata", package = "FastReseg")
transDF_fileInfo <- data.frame(
  file_path = fs::path(dataDir,
                       c("Run4104_FOV001__complete_code_cell_target_call_coord.csv",
                         "Run4104_FOV002__complete_code_cell_target_call_coord.csv")),
  slide = c(1, 1),
  fov = c(1, 2),
  stage_X = 1000 * c(5.13, -2.701),
  stage_Y = 1000 * c(-0.452, 0.081))
prep_res3 <- runPreprocess(
  counts = example_CellGeneExpr,
  clust = example_clust,
  refProfiles = example_refProfiles,
  score_baseline = NULL,
  lowerCutoff_transNum = NULL,
  higherCutoff_transNum = NULL,
  imputeFlag_missingCTs = TRUE,
  ctrl_genes = NULL,
  svmClass_score_cutoff = -2,
  molecular_distance_cutoff = NULL,
  cellular_distance_cutoff = 20,
  transcript_df = NULL,
  transDF_fileInfo = transDF_fileInfo,
  filepath_coln = 'file_path',
  prefix_colns = c('slide','fov'),
  fovOffset_colns = c('stage_X','stage_Y'),
  pixel_size = 0.18,
  zstep_size = 0.8,
  transID_coln = NULL,
  transGene_coln = "target",
  cellID_coln = 'CellId',
  spatLocs_colns = c('x','y','z'),
  extracellular_cellID = 0
)
#> Found 960 common genes among `refProfiles` and `counts`.
#> Extract distance cutoff from first input transcript data.
#> 3 Dimension of spaital coordinates are provided.
#> 2 individual per FOV files are provided in `transDF_fileInfo`, use the 1st file to calculate distance cutoffs
#> `transID_coln` and `cellID_coln` of each per FOV transcript_df would be re-named based on `prefix_colns` = `slide`,`fov`.
#> Use 2 times of average 2D cell diameter as cellular_distance_cutoff = 19.1980 for searching of neighbor cells.
#> Identified 3D coordinates with variance.
#> Distribution of minimal molecular distance between 385 cells: 0, 0.16, 0.22, 0.29, 0.36, 0.44, 0.52, 0.63, 0.76, 0.84, 8.01, at quantile = 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
#> Use 5 times of 90% quantile of minimal 3D molecular distance between picked cells as `molecular_distance_cutoff` = 4.2191 for defining direct neighbor cells.
#> Use `molecular_distance_cutoff` = 4.2191 for defining direct neighbor cells based on molecule-to-molecule distance.
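The messages in cases 2 and 3 report molecular_distance_cutoff as 5 times the 90% quantile of the minimal molecular distance between picked cells. The sketch below shows only that final arithmetic on made-up distances. The vector min_dist is hypothetical data, and the way FastReseg samples cells and measures the distances is not reproduced here.

```r
# Made-up minimal molecule-to-molecule distances between cell pairs (micron)
min_dist <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)

# 90% quantile of the minimal distances
q90 <- stats::quantile(min_dist, probs = 0.9, names = FALSE)

# cutoff = 5 times the 90% quantile, following the rule quoted in the messages
toy_molecular_distance_cutoff <- 5 * q90
print(toy_molecular_distance_cutoff)
```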