ATACCoGAPS can be installed from GitHub

#installing required Bioconductor packages
BiocManager::install(c("GenomicRanges", "projectR", "TFBSTools", "GeneOverlap", "msigdbr", "motifmatchr", "chromVAR", "GenomicFeatures", "IRanges", "fgsea", "rGREAT", "Homo.sapiens", "Mus.musculus"))

#installing latest version of CoGAPS algorithm
devtools::install_github("FertigLab/CoGAPS")

#installing ATACCoGAPS
devtools::install_github("FertigLab/ATACCoGAPS")

Attach the ATACCoGAPS package, which attaches CoGAPS as a dependency

library(ATACCoGAPS)

To outline the ATACCoGAPS pipeline, we will use as an example data set single-cell ATAC sequencing data containing 10 cell lines, published by Schep et al, 2017. The data was downloaded from GEO accession number GSE99172 and preprocessed using dataSubsetBySparsity() from the ATACCoGAPS package to remove cells and peaks with more than 99% sparsity (more than 99% zeroes). For the code used in preprocessing this data, see: https://github.com/rossinerbe/ATACCoGAPS-Analysis-Code/blob/master/Schep_data_analysis/main_analysis_peak_summarization.Rmd

data("schepFilteredData")
data("schepCelltypes")
data("schepFilteredPeaks")

We use these data to set the hyperparameters of the CoGAPS algorithm. Here we tell CoGAPS to find 7 patterns in 10000 iterations of the algorithm. We use the singleCell and sparseOptimization methods as our data are sparse single-cell data. We run the algorithm distributed across the genome since we have more genomic features than cells (if it was the opposite we would set the distributed pattern to “single-cell”). We then input the peak and cell type information to be returned as part of our result object. Finally, we set distributed parameters so the algorithm will run in parallel across 9 cores.

params <- CogapsParams(nPatterns=7, nIterations=10000, seed=42, singleCell=TRUE, sparseOptimization=TRUE, distributed="genome-wide", geneNames=schepFilteredPeaks, sampleNames=as.character(schepCelltypes))

## Warning in CogapsParams(nPatterns = 7, nIterations = 10000, seed = 42,
## singleCell = TRUE, : singleCell has been deprecated, this parameter will be
## ignored

params <- setDistributedParams(params, nSets=9)

## setting distributed parameters - call this again if you change nPatterns

params

## -- Standard Parameters --
## nPatterns            7 
## nIterations          10000 
## seed                 42 
## sparseOptimization   TRUE 
## distributed          genome-wide 
## 
## -- Sparsity Parameters --
## alpha          0.01 
## maxGibbsMass   100 
## 
## -- Distributed CoGAPS Parameters -- 
## nSets          9 
## cut            7 
## minNS          5 
## maxNS          14 
## 
## 90300 gene names provided
## first gene name: chr1-237588-238087 
## 
## 1392 sample names provided
## first sample name: Fibroblasts

We now call CoGAPS via the R function. CoGAPS is a Bayesian Non-Negative Matrix Factorization algorithm (Fertig et al, 2010). It factorizes count matrices from RNA or epigenetic sequencing data and returns patterns which distinguish both features and samples, allowing for the discovery of regulatory differences between samples. In the case of scATAC-seq our features are usually peaks and our samples are indvidual cells.

It is generally not recommended to run CoGAPS locally as it is quite computationally intensive and usually requires at least 3-6 hours runtime for most single-cell data sets, even on powerful servers. The code below is used only as an example, and to speed up the generation of this document, will not be run here. We will instead use the example CoGAPS result included in the ATACCoGAPS package which was run on the Batch servers of the AWS cloud.

schepCogapsResult <- GWCoGAPS(data = schepFilteredData, params = params, nThreads = 9)

Loading in the pre-computed CoGAPS result

data("schepCogapsResult")

Pattern Matrix Visualization

The first quick visualization of CoGAPS results is generally plotting the Pattern Matrix (the output matrix which is patterns x cells). These plots allow us to determine which patterns differentiate which cell types.

We can either plot each pattern indvidually

#colors to plot by
col <- c("salmon", 'aquamarine', 'aquamarine3', "aquamarine4", "darkorchid1", 'limegreen', "magenta", "orangered", "darkgreen", "darkred", "red3", "darkorange")


cgapsPlot(cgaps_result = schepCogapsResult, sample.classifier = schepCelltypes, cols = col, ylab = "Pattern Weight")

Or all together in a heatmap

heatmapPatternMatrix(cgaps_result = schepCogapsResult, sample.classifier = schepCelltypes, cellCols = col, col = viridis::magma(n=9))

We can note which patterns differentiate which cell types (for example that pattern 1 seems to be defining the K562 Erythroleukmia Cell Line). If any patterns are unclear, such as pattern 7, we can perform a Wilcoxon Rank Sum test to determine which cell types are most significantly associated with the pattern.

#get the pattern Matrix
patMatrix <- getSampleFactors(schepCogapsResult)
#perform a pairwise Wilcoxon test
pairwise.wilcox.test(patMatrix[,7], schepCelltypes, p.adjust.method = "BH")

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  patMatrix[, 7] and schepCelltypes 
## 
##                      Fibroblasts LCL 1   LCL 2 Rep1 LCL 2 Rep2 ESCs   
## LCL 1                1.6e-07     -       -          -          -      
## LCL 2 Rep1           0.0254      1.5e-10 -          -          -      
## LCL 2 Rep2           0.9310      1.9e-05 0.0677     -          -      
## ESCs                 0.4005      6.8e-08 0.0088     0.9878     -      
## HL60 Leukemia        < 2e-16     < 2e-16 < 2e-16    < 2e-16    < 2e-16
## K562 Erythroleukemia 2.0e-05     1.3e-05 4.4e-13    0.0108     0.0003 
## LMPP                 < 2e-16     < 2e-16 < 2e-16    < 2e-16    < 2e-16
## Monocyte             < 2e-16     < 2e-16 < 2e-16    < 2e-16    < 2e-16
## AML Patient 070      < 2e-16     < 2e-16 < 2e-16    < 2e-16    < 2e-16
## AML Patient 353      < 2e-16     < 2e-16 < 2e-16    < 2e-16    < 2e-16
## Erythroblast         4.2e-12     0.5636  4.1e-16    1.3e-07    3.2e-13
##                      HL60 Leukemia K562 Erythroleukemia LMPP    Monocyte
## LCL 1                -             -                    -       -       
## LCL 2 Rep1           -             -                    -       -       
## LCL 2 Rep2           -             -                    -       -       
## ESCs                 -             -                    -       -       
## HL60 Leukemia        -             -                    -       -       
## K562 Erythroleukemia < 2e-16       -                    -       -       
## LMPP                 0.3480        < 2e-16              -       -       
## Monocyte             6.4e-12       < 2e-16              6.1e-14 -       
## AML Patient 070      0.1182        < 2e-16              0.3378  9.1e-15 
## AML Patient 353      0.0028        < 2e-16              0.0034  < 2e-16 
## Erythroblast         < 2e-16       1.6e-11              < 2e-16 < 2e-16 
##                      AML Patient 070 AML Patient 353
## LCL 1                -               -              
## LCL 2 Rep1           -               -              
## LCL 2 Rep2           -               -              
## ESCs                 -               -              
## HL60 Leukemia        -               -              
## K562 Erythroleukemia -               -              
## LMPP                 -               -              
## Monocyte             -               -              
## AML Patient 070      -               -              
## AML Patient 353      0.0729          -              
## Erythroblast         < 2e-16         < 2e-16        
## 
## P value adjustment method: BH

We see that pattern 7 is most strongly associated with the monocytes in the data.

If we do not have pre-established cell annotations, we can cluster cells by pattern association.

cellClass <- patternMarkerCellClassifier(schepCogapsResult)
cellClasses <- cellClass$cellClassifier

heatmapPatternMatrix(schepCogapsResult, as.factor(cellClasses), col = viridis::magma(n=9))

Finding Regulatory Differences between Cell Types

Now that we know which patterns distinguish which cell types, we can look at those same patterns in the amplitude matrix (peaks by patterns) to determine which peaks are differentially accessible between the patterns and thus which peaks are differentially accessible between the cell types.

We can use the patternMarker Statistic (Stein-O’Brien et al, 2017) to find which peaks are most differentially accessible. To show the degree of differentiation, we can plot the 50 most pattern differentiating peaks for each pattern from the original data.

heatmapPatternMarkers(cgaps_result = schepCogapsResult, atac_data = schepFilteredData, celltypes = schepCelltypes, numregions = 50, colColors = col, col = viridis::plasma(n = 2))

The differentially accessible peaks we find distinguish the cell types we see in the pattern Matrix. In patterns 6 and 7 it seems to distinguish those cell types even better than the pattern Matrix does. This visualization allows us to see the biological differences between cell types CoGAPS is identifying.

Pathway Based Analysis

To make use of this differential accessibility data, one option is to try to find genes that fall within these peaks and determine whether the accessibility of certain groups of genes suggests differential pathway activation.

data("schepGranges")

#loading TxDb of human genes
library(Homo.sapiens)

#find genes known to fall within thresholded patternMarker peaks for each pattern
genes <- genePatternMatch(cogapsResult = schepCogapsResult, generanges = schepGranges, genome = Homo.sapiens)

## [1] Number of peaks used for each pattern:
## Pattern 1 Pattern 2 Pattern 3 Pattern 4 Pattern 5 Pattern 6 Pattern 7 
##      7886      9752     12973      9115     10577     29088     10909

#download hallmark pathways using msigdbr
library(dplyr)

pathways <- msigdbr::msigdbr(species = "Homo sapiens", category =
                             "H") %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()

#match these pattern Gene sets to hallmark pathways, using an adjusted p-value threshold of 0.001.
matchedPathways <- pathwayMatch(gene_list = genes, pathways = pathways, p_threshold = 0.001)

lapply(matchedPathways, function(x) {x[4]})

## [[1]]
## [[1]]$summaryTable
##                            pathway       PValue
## 5         HALLMARK_HEME_METABOLISM 1.348010e-16
## 6         HALLMARK_MITOTIC_SPINDLE 8.585937e-07
## 8 HALLMARK_TNFA_SIGNALING_VIA_NFKB 3.155078e-05
## 3               HALLMARK_APOPTOSIS 4.977247e-05
## 1            HALLMARK_ADIPOGENESIS 5.795468e-05
## 4 HALLMARK_ESTROGEN_RESPONSE_EARLY 5.795468e-05
## 7      HALLMARK_TGF_BETA_SIGNALING 7.693160e-05
## 2         HALLMARK_APICAL_JUNCTION 1.043982e-04
## 
## 
## [[2]]
## [[2]]$summaryTable
##                             pathway       PValue
## 12          HALLMARK_UV_RESPONSE_DN 5.798672e-12
## 6      HALLMARK_IL2_STAT5_SIGNALING 9.539725e-12
## 4          HALLMARK_HEME_METABOLISM 6.511603e-11
## 9         HALLMARK_MTORC1_SIGNALING 1.284655e-08
## 5                  HALLMARK_HYPOXIA 2.936508e-07
## 8          HALLMARK_MITOTIC_SPINDLE 9.333264e-07
## 1             HALLMARK_ADIPOGENESIS 2.501268e-06
## 2        HALLMARK_ANDROGEN_RESPONSE 1.427122e-05
## 11 HALLMARK_TNFA_SIGNALING_VIA_NFKB 1.789366e-05
## 10       HALLMARK_PROTEIN_SECRETION 1.821590e-05
## 7    HALLMARK_INFLAMMATORY_RESPONSE 6.032164e-05
## 3  HALLMARK_ESTROGEN_RESPONSE_EARLY 1.076035e-04
## 
## 
## [[3]]
## [[3]]$summaryTable
##                               pathway       PValue
## 15   HALLMARK_TNFA_SIGNALING_VIA_NFKB 3.868026e-18
## 7        HALLMARK_IL2_STAT5_SIGNALING 5.130824e-16
## 16            HALLMARK_UV_RESPONSE_DN 4.574162e-14
## 9      HALLMARK_INFLAMMATORY_RESPONSE 8.532207e-12
## 11         HALLMARK_KRAS_SIGNALING_UP 2.259654e-11
## 1        HALLMARK_ALLOGRAFT_REJECTION 1.489475e-10
## 10 HALLMARK_INTERFERON_GAMMA_RESPONSE 1.489475e-10
## 3                  HALLMARK_APOPTOSIS 3.766399e-08
## 8    HALLMARK_IL6_JAK_STAT3_SIGNALING 4.507967e-08
## 5    HALLMARK_ESTROGEN_RESPONSE_EARLY 4.381401e-06
## 12           HALLMARK_MITOTIC_SPINDLE 1.195938e-05
## 2            HALLMARK_APICAL_JUNCTION 1.587116e-05
## 14        HALLMARK_TGF_BETA_SIGNALING 5.198330e-05
## 13          HALLMARK_MTORC1_SIGNALING 5.321764e-05
## 4                 HALLMARK_COMPLEMENT 9.468292e-05
## 6                    HALLMARK_HYPOXIA 2.830334e-04
## 
## 
## [[4]]
## [[4]]$summaryTable
##                                       pathway       PValue
## 6  HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 8.529163e-36
## 16                    HALLMARK_UV_RESPONSE_DN 1.051232e-25
## 15           HALLMARK_TNFA_SIGNALING_VIA_NFKB 3.232202e-15
## 9                HALLMARK_IL2_STAT5_SIGNALING 3.003042e-14
## 8                            HALLMARK_HYPOXIA 2.573234e-13
## 7            HALLMARK_ESTROGEN_RESPONSE_EARLY 7.308530e-13
## 3                    HALLMARK_APICAL_JUNCTION 2.523192e-10
## 13                        HALLMARK_MYOGENESIS 3.588584e-09
## 4                          HALLMARK_APOPTOSIS 8.742780e-09
## 12                   HALLMARK_MITOTIC_SPINDLE 1.372606e-08
## 5                         HALLMARK_COMPLEMENT 8.851363e-07
## 10             HALLMARK_INFLAMMATORY_RESPONSE 8.851363e-07
## 2                       HALLMARK_ANGIOGENESIS 2.587624e-06
## 11                 HALLMARK_KRAS_SIGNALING_UP 6.999488e-06
## 14                HALLMARK_TGF_BETA_SIGNALING 5.751367e-05
## 1                  HALLMARK_ANDROGEN_RESPONSE 1.415445e-04
## 
## 
## [[5]]
## [[5]]$summaryTable
##                                       pathway       PValue
## 1                    HALLMARK_APICAL_JUNCTION 1.388378e-13
## 2  HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 3.301153e-12
## 3            HALLMARK_ESTROGEN_RESPONSE_EARLY 2.436182e-11
## 10                    HALLMARK_UV_RESPONSE_DN 3.808226e-10
## 9            HALLMARK_TNFA_SIGNALING_VIA_NFKB 5.766965e-09
## 6                HALLMARK_IL2_STAT5_SIGNALING 3.051722e-07
## 8                         HALLMARK_MYOGENESIS 1.291245e-06
## 7                  HALLMARK_KRAS_SIGNALING_UP 2.576170e-06
## 5                            HALLMARK_HYPOXIA 5.036962e-06
## 4                 HALLMARK_HEDGEHOG_SIGNALING 1.402250e-04
## 
## 
## [[6]]
## [[6]]$summaryTable
##                                       pathway       PValue
## 20           HALLMARK_TNFA_SIGNALING_VIA_NFKB 3.085790e-13
## 11             HALLMARK_INFLAMMATORY_RESPONSE 5.627103e-10
## 10               HALLMARK_IL2_STAT5_SIGNALING 1.693213e-09
## 13                   HALLMARK_MITOTIC_SPINDLE 2.173996e-09
## 12                 HALLMARK_KRAS_SIGNALING_UP 3.659294e-08
## 14                  HALLMARK_MTORC1_SIGNALING 9.468201e-08
## 21                    HALLMARK_UV_RESPONSE_DN 1.216962e-07
## 5            HALLMARK_ESTROGEN_RESPONSE_EARLY 5.707008e-07
## 6                     HALLMARK_G2M_CHECKPOINT 5.707008e-07
## 19                HALLMARK_TGF_BETA_SIGNALING 1.275552e-06
## 18           HALLMARK_PI3K_AKT_MTOR_SIGNALING 3.213048e-06
## 1                       HALLMARK_ADIPOGENESIS 6.576505e-06
## 4  HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 6.576505e-06
## 17                       HALLMARK_P53_PATHWAY 6.576505e-06
## 3                        HALLMARK_E2F_TARGETS 5.713666e-05
## 22        HALLMARK_WNT_BETA_CATENIN_SIGNALING 7.632301e-05
## 15                    HALLMARK_MYC_TARGETS_V1 1.106565e-04
## 2                          HALLMARK_APOPTOSIS 1.450926e-04
## 7                         HALLMARK_GLYCOLYSIS 2.082222e-04
## 9                            HALLMARK_HYPOXIA 2.082222e-04
## 8                    HALLMARK_HEME_METABOLISM 3.809065e-04
## 16                        HALLMARK_MYOGENESIS 3.809065e-04
## 
## 
## [[7]]
## [[7]]$summaryTable
##                               pathway       PValue
## 17   HALLMARK_TNFA_SIGNALING_VIA_NFKB 9.869770e-27
## 10     HALLMARK_INFLAMMATORY_RESPONSE 4.117782e-22
## 8        HALLMARK_IL2_STAT5_SIGNALING 1.122470e-09
## 18            HALLMARK_UV_RESPONSE_DN 1.417200e-09
## 11 HALLMARK_INTERFERON_GAMMA_RESPONSE 1.422270e-08
## 7                    HALLMARK_HYPOXIA 3.184648e-08
## 3                  HALLMARK_APOPTOSIS 9.020828e-08
## 16        HALLMARK_TGF_BETA_SIGNALING 1.071393e-07
## 12         HALLMARK_KRAS_SIGNALING_UP 1.503725e-07
## 13           HALLMARK_MITOTIC_SPINDLE 4.678593e-07
## 4                 HALLMARK_COMPLEMENT 6.558180e-07
## 5    HALLMARK_ESTROGEN_RESPONSE_EARLY 6.558180e-07
## 1        HALLMARK_ALLOGRAFT_REJECTION 1.329669e-06
## 9    HALLMARK_IL6_JAK_STAT3_SIGNALING 2.381951e-06
## 15               HALLMARK_P53_PATHWAY 5.153906e-06
## 2          HALLMARK_ANDROGEN_RESPONSE 9.740774e-06
## 19            HALLMARK_UV_RESPONSE_UP 5.921498e-05
## 6     HALLMARK_ESTROGEN_RESPONSE_LATE 3.212447e-04
## 14          HALLMARK_MTORC1_SIGNALING 3.212447e-04

Several patterns do not return Hallmark pathways at this level of significance, but those that do seem logical in the cell types those patterns differentiate.

Of particular note, we find the Epithelial Mesenchymal Transition pathway to be strongly associated with Fibroblasts, which is known to be the classical wound healing pathway in Fibroblasts. Additionally, monocytes are most strongly associated with the Hallmark Inflammatory Response, as we would expect for inflammatory cells.

Motif/Transcription Factor Based Analysis

The other way we can use differential peak information is to match to DNA motifs and known Transcription Factor binding at those motifs.

motifResults = simpleMotifTFMatch(cogapsResult = schepCogapsResult, generanges = schepGranges, organism = "Homo sapiens", genome = "hg19", motifsPerRegion = 1)

## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3

## [1] Number of peaks used for each pattern:
## Pattern 1 Pattern 2 Pattern 3 Pattern 4 Pattern 5 Pattern 6 Pattern 7 
##      7886      9752     12973      9115     10577     29088     10909

We can get a summary of TF binding, generally having more confidence in those TFs that have higher numbers of motifs at which the same TF could bind.

#showing motif results for monocytes
motifResults$tfMatchSummary[[7]]

##             (Other)              ZNF263                IRF1 
##                1620                1063                 621 
##                CTCF                SPIC                 SP2 
##                 503                 371                 353 
##                EGR1        STAT1::STAT2               FOXP1 
##                 328                 295                 240 
##                SPI1               RREB1           MAF::NFE2 
##                 231                 208                 161 
##               PRDM1               MEF2C               CEBPA 
##                 160                 150                 139 
##                 SP1                 SP4                NRF1 
##                 128                 125                 115 
##              POU2F2          JUN(var.2)                KLF5 
##                 113                  96                  96 
##                 JUN                 YY1               FOSL1 
##                  90                  88                  84 
##               STAT1                ATF4               NR2C2 
##                  80                  76                  76 
##               ESRRB                CDX2              ZBTB18 
##                  75                  74                  74 
##               HNF4G               TBX15                NFE2 
##                  72                  72                  70 
##               MEF2A                NFYA           BATF::JUN 
##                  68                  65                  63 
##                RELA                PAX5 SMAD2::SMAD3::SMAD4 
##                  63                  62                  62 
##               FOSL2         JUND(var.2)                NFYB 
##                  61                  61                  60 
##                E2F6                JUND                USF2 
##                  58                  56                  56 
##               PLAG1               INSM1          EWSR1-FLI1 
##                  55                  53                  52 
##               FOXB1               NR2F1               STAT3 
##                  52                  52                  52 
##                ZEB1               FOXA1               RUNX3 
##                  50                  49                  48 
##                MAFF               FOXH1                MAFK 
##                  47                  46                  46 
##               NFIL3          RARA::RXRA          NFIC::TLX1 
##                  46                  46                  45 
##                REST              TCF7L2                DUX4 
##                  45                  45                  44 
##                 FOS               FOXP2               GATA2 
##                  43                  41                  41 
##         RORA(var.2)                 TEF                ESR2 
##                  41                  41                  40 
##               FOXC2       TFAP2B(var.2)       TFAP2B(var.3) 
##                  40                  40                  40 
##               NFKB2               NHLH1                RFX3 
##                  39                  39                  38 
##                 MSC              POU1F1                LEF1 
##                  37                  36                  35 
##                HSF4                IRF2               PROP1 
##                  33                  33                  32 
##                RFX5          TAL1::TCF3               FOXF2 
##                  32                  32                  31 
##         GATA1::TAL1                HSF1               KLF16 
##                  31                  31                  31 
##               MEF2B              SREBF2                PBX1 
##                  31                  31                  30 
##              POU4F1                EBF1                IRF7 
##                  30                  29                  29 
##               KLF14              POU4F3               TFAP4 
##                  29                  29                  29 
##                ELF3              POU3F3               GATA3 
##                  28                  28                  27 
##                JDP2 
##                  27

The entrez gene summary is returned for all TFs found in matching, allowing us to easily check whether a TF seems like a plausible regulatory factor in a given cell type. For example, if we want to take a look at the function of IRF1, given that it has the most available binding sites in the monocyte associated pattern we can call

motifResults$tfDescriptions[[7]][which(motifResults$tfDescriptions[[7]][,2]=="IRF1"), 1]

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     summary
## 1: The protein encoded by this gene is a transcriptional regulator and tumor suppressor, serving as an activator of genes involved in both innate and acquired immune responses. The encoded protein activates the transcription of genes involved in the body's response to viruses and bacteria, playing a role in cell proliferation, apoptosis, the immune response, and DNA damage response. This protein represses the transcription of several other genes. As a tumor suppressor, it both suppresses tumor cell growth and stimulates an immune response against tumor cells. Defects in this gene have been associated with gastric cancer, myelogenous leukemia, and lung cancer. [provided by RefSeq, Aug 2017].  Publication Note:  This RefSeq record includes a subset of the publications that are available for this gene. Please see the Gene record to access additional publications.  ##Evidence-Data-START## Transcript exon combination :: AK314025.1, SRR3476690.44046.1 [ECO:0000332] RNAseq introns              :: mixed/partial sample support SAMEA1965299, SAMEA1966682 [ECO:0000350] ##Evidence-Data-END##  ##RefSeq-Attributes-START## MANE Ensembl match     :: ENST00000245414.9/ ENSP00000245414.4 RefSeq Select criteria :: based on conservation, expression, longest protein ##RefSeq-Attributes-END##

This description indicates the immune relevance of IRF1, and that it may have activity in monocytes.

We can also examine the accessibility of the TF itself in a cell type of interest in order to gather information on whether the TF is expressed to bind at accessible sites. We’ll test the accessibility of IRF1 in monocytes as an example.

#get peaks overlapping with the gene
IRF1peaks <- geneAccessibility("IRF1", schepGranges, schepFilteredData, Homo.sapiens)

## 'select()' returned 1:1 mapping between keys and columns

#make binary accessibility matrix
binaryMatrix <- (schepFilteredData > 0) + 0

#find accessibility of those peaks relative to others among monocyte cells
foldAccessibility(peaksAccessibility = IRF1peaks$IRF1, cellTypeList = schepCelltypes, cellType = "Monocyte", binaryMatrix = binaryMatrix)

## [1] 2.324261

IRF1 is 2.3 times more accessible than average in Monocytes, indicating it is likely active in these cells

Transfer Learning with ProjectR

To determine if the patterns we have identified with CoGAPS appear in other data sets we can apply transfer learning between ATAC datasets using projectR (Stein-O’Brein et al, 2017). projectR allows us to project patterns learned on one data set into another.

This can be useful for validating the generality and biological relevance of patterns, determining if learned signatures appear in other datasets without needing to run CoGAPS again, or simply to learn more regulatory information by combining patterns learned on different data sets.

To demonstrate, we will project from the Schep data into a set of scATAC data published by Buenrostro et al, 2018, containing a number of hematopoietic lineage cells.

#getting count matrix - peaks x cells
repmis::source_data("https://github.com/FertigLab/ATACCoGAPS/blob/master/BuenrostroFinalSubsetData.Rdata?raw=true")

## [1] "BuenrostroFinalSubsetData"

#getting GRanges for peaks
repmis::source_data("https://github.com/FertigLab/ATACCoGAPS/blob/master/BuenrostroGRanges.Rdata?raw=true")

## [1] "BuenrostroGRanges"

#getting celltypes
repmis::source_data("https://github.com/FertigLab/ATACCoGAPS/blob/master/BuenrostroCellTypes.Rdata?raw=true")

## [1] "BuenrostroCellTypes"

To transfer patterns between the two data sets, we have to find which peaks overlap between data sets because we can only project onto that overlapping subset. To do this we employ a wrapper function around projectR which automatically maps overlapping peaks together.

projectRResults <- ATACTransferLearning(newData = BuenrostroFinalSubsetData, CoGAPSResult = schepCogapsResult, originalPeaks = schepFilteredPeaks, originalGranges = schepGranges,newGranges = BuenrostroGRanges)

## [1] "62387 row names matched between data and loadings"
## [1] "Updated dimension of data: 62387 1331"

We can then plot the output patterns to see how well they transfer into the target data

cgapsPlot(t(projectRResults$projection), as.factor(BuenrostroCellTypes), matrix = TRUE)

We see in pattern1 that there is correspondence between the Erythroleukemia pattern and Megakaryocyte-Erythrocyte Progenitors, which makes some sense as we would expect there to be some similiarities between Erythrocyte progneitors and Erythroleukemia.

In the B-cell derived LCL pattern (pattern3) we see strong activation of Common Lymphoid progenitors and Dendritic cells.

In pattern 7 (monocyte) we see strongest signal in the moncytes in the target data set. This may be difficult to determine visually, to confirm we can perform a Wilcoxon Rank Sum test.

pairwise.wilcox.test(projectRResults$projection[7,], BuenrostroCellTypes, p.adjust.method = "BH")

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  projectRResults$projection[7, ] and BuenrostroCellTypes 
## 
##      CLP     CMP     GMP     HSC     LMPP    MEP     mono    MPP    
## CMP  0.00019 -       -       -       -       -       -       -      
## GMP  5.1e-05 0.79073 -       -       -       -       -       -      
## HSC  0.02140 0.00032 0.00017 -       -       -       -       -      
## LMPP 5.3e-05 0.79073 0.94658 0.00100 -       -       -       -      
## MEP  0.12105 0.01257 0.00959 0.79073 0.02485 -       -       -      
## mono 1.3e-11 7.3e-15 3.8e-15 < 2e-16 7.8e-15 6.6e-11 -       -      
## MPP  0.39297 1.7e-12 2.4e-13 2.2e-07 1.4e-11 0.00084 < 2e-16 -      
## pDC  1.4e-06 0.00408 0.00356 6.2e-07 0.00356 0.00063 7.2e-06 1.8e-12
## UNK  6.8e-05 0.23074 0.26183 0.00110 0.31518 0.01539 6.2e-07 1.3e-08
##      pDC    
## CMP  -      
## GMP  -      
## HSC  -      
## LMPP -      
## MEP  -      
## mono -      
## MPP  -      
## pDC  -      
## UNK  0.23074
## 
## P value adjustment method: BH

and make a boxplot of the pattern weight of each cell type

boxplot(projectRResults$projection[7,] ~ as.factor(BuenrostroCellTypes), xlab = "Cell Type", ylab = "Pattern Weight")

And we observe that the monocytes in the target dataset are most siginificantly associated with the monocyte pattern.

Session Info