Predictions from Splicing Data
RNAForecaster requires transcriptomics data from the same cell at two different points in time. The best way to acquire this sort of information from standard single cell RNA-seq protocols is to leverage incidental capture of introns to determine whether a transcript is spliced or unspliced. In general, it is a reasonable assumption that spliced transcripts were produced more recently than unspliced, thus giving some temporal information on the sequenced transcripts.
Unfortunately, the spliced and unspliced count matrices thus generated cannot be considered two equivalent time points as the count matrices come from very different distributions, which prevents us from being able to straightforwardly predict future expression states from one time point to the next. We could attempt to handle this problem using RNA velocity style assumptions about the splicing and degradation rates to update our predicted expression levels at each time point. However, with RNAForecaster this process will propagate too much error for the resulting predictions to be of value.
Instead, we make a single prediction about the direction of regulation, which can be compared to predictions under perturbations, allowing us to uncover key genes relating to the biological process of interest.
Here we will use a small subset of murine pancreatic development data published by Bastidas-Ponce et al. (2019) to demonstrate how RNAForecaster is used with splicing data.
#read in data
using DelimitedFiles
splicedSub = readdlm("data/pancSplicedExampleData.csv", ',')
unsplicedSub = readdlm("data/pancUnsplicedExampleData.csv", ',')
geneNamesSub = readdlm("data/pancGeneNamesExampleData.csv", ',')
geneNamesSub = string.(geneNamesSub)[:,1]
#data must be log normalized and converted to Float32
splicedSub = Float32.(log1p.(splicedSub))
unsplicedSub = Float32.(log1p.(unsplicedSub))
#train RNAForecaster
trainedModel = trainRNAForecaster(splicedSub, unsplicedSub);
So we now have a neural network that can predicted unspliced counts based on spliced counts. We can now perturb each gene to see how the predictions of unspliced counts change. This can help uncover regulatory relationships and important genes underlying the system's biology.
Here we will simulate a KO of each gene of the 10 genes in the data from the initial conditions of 25 random cells.
KOEffects = perturbEffectPredictions(trainedModel[1], splicedSub, 25);
We can search these results for statistically significant effects.
findSigRegulation(KOEffects, geneNamesSub)
We can also specify specific genes and specific perturbations.
perturbEffects = perturbEffectPredictions(trainedModel[1], splicedSub, 25,
perturbGenes = geneNamesSub[[2,5]], geneNames = geneNamesSub,
perturbLevels = [1.0f0, 1.5f0]);
RNAForecaster.perturbEffectPredictions
— FunctionperturbEffectPredictions(trainedNetwork, splicedData::Matrix{Float32}, nCells::Int; perturbGenes::Vector{String} = Vector{String}(undef, 0), geneNames::Vector{String} = Vector{String}(undef,0), seed::Int=123)
Based on spliced/unspliced counts, predict the immediate transcriptomic effect of any or all single gene perturbations. Outputs a tuple containing the cells used for prediction, the expression predictions, the gene wise differences, and the cell-wise euclidean distances for each perturbation.
This function is capable of running on multiple parallel processes using Distributed.jl. Call addprocs(n) before running the function to add parallel workers, where n is the number of additional processes desired.
Required Arguments
- trainedNetwork - trained neuralODE from trainRNAForecaster
- splicedData - log normalized spliced counts matrix. Must be in Float32 format
- nCells - how many cells from the data should be used for prediction of perturb effect.
Higher values will increase computational time required.
Optional Arguments
- perturbGenes - list of genes to simulate a perturbation of. By default all genes are used
- geneNames - if providing a subset of the genes to perturb, a vector of gene names to
match against, in the order of splicedData
- perturbLevels - list of perturbation levels to use for each perturbed gene. By default
all genes are set to zero, simulating a KO.
- seed - Random seed for reproducibility on the cells chosen for prediction
Convenience functions are provided to search the results of the perturbations
RNAForecaster.totalPerturbImpact
— FunctiontotalPerturbImpact(perturbData, geneNames::Vector{String})
Function to yield a sorted data frame of the size of the predicted effect of a perturbation on the cellular transcriptome. Intended to serve as a measure of more or less impactful gene perturbations.
Required Arguments
- perturbData - results from perturbEffectPredictions function
- geneNames - vector of gene names in the order of the input expression data.
Should only include perturbed genes
RNAForecaster.genePerturbExpressionChanges
— FunctiongenePerturbExpressionChanges(perturbData, geneNames::Vector{String}, perturbGene::String; genesperturbd::Vector{String} = geneNames)
Function to get a sorted data frame of the predicted effect of a gene perturb on all other genes.
Required Arguments
- perturbData - results from perturbEffectPredictions function
- geneNames - vector of gene names in the order of the input expression data
- perturbGene - a gene name to query the predicted perturb effect on expression
Optional Arguments
- genesPerturbed - If less than all the gene perturbs were performed, the ordered names of the perturb genes must be supplied
RNAForecaster.geneResponseToPerturb
— FunctiongeneResponseToPerturb(perturbData, geneNames::Vector{String}, geneOfInterest::String; genesPerturbed::Vector{String} = geneNames)
Function to get a sorted data frame of the predicted effect of all other gene perturbations on a particular gene of interest.
Required Arguments
- perturbData - results from perturbEffectPredictions function
- geneNames - vector of gene names in the order of the input expression data
- geneOfInterest - a gene name to query
For example, we can get a list of the genes, and the size of the impact of the KO on the rest of the genes' expression, measured in euclidean distance, using
totalPerturbImpact(KOEffects, geneNamesSub)