Learning ChIPseeker

library(GenomicFeatures)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(org.Hs.eg.db)
library(clusterProfiler)
library(ReactomePA)
library(ChIPseeker)

Introduction

ChIPseeker is an R package for annotating, comparison and visualizationChIP-seq(Chromatin immunoprecipitation high-throughput sequencing) data analysis.
It supports annotating ChIP peaks and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions.Comparison of ChIP peak profiles and annotation are also supported. Moreover, it supports evaluating significant overlap among ChIP-seq datasets.
Currently, ChIPseeker contains 17,000 bed file information from GEO database. These datasets can be downloaded and compare with user’s own data to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation.

Sample Datasets

files <- getSampleFiles()
peak <- readPeakFile(files[[4]])
peak
## GRanges object with 1331 ranges and 2 metadata columns:
##          seqnames                 ranges strand |             V4        V5
##             <Rle>              <IRanges>  <Rle> |       <factor> <numeric>
##      [1]     chr1     [ 815093,  817883]      * |    MACS_peak_1    295.76
##      [2]     chr1     [1243288, 1244338]      * |    MACS_peak_2     63.19
##      [3]     chr1     [2979977, 2981228]      * |    MACS_peak_3    100.16
##      [4]     chr1     [3566182, 3567876]      * |    MACS_peak_4    558.89
##      [5]     chr1     [3816546, 3818111]      * |    MACS_peak_5     57.57
##      ...      ...                    ...    ... .            ...       ...
##   [1327]     chrX [135244783, 135245821]      * | MACS_peak_1327     55.54
##   [1328]     chrX [139171964, 139173506]      * | MACS_peak_1328    270.19
##   [1329]     chrX [139583954, 139586126]      * | MACS_peak_1329    918.73
##   [1330]     chrX [139592002, 139593238]      * | MACS_peak_1330    210.88
##   [1331]     chrY [ 13845134,  13845777]      * | MACS_peak_1331     58.39
##   -
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

ChIP peaks coverage plot

Finding the peak locations over the whole genome.

covplot(peak, weightCol="V5")

covplot(peak, weightCol="V5", chrs=c("chr1", "chr2"), xlim=c(5e7, 6e7))

Profile of ChIP peaks binding to TSS regions

First of all, for calculating the profile of ChIP peaks binding to TSS regions, we should prepare the TSS regions, which are defined as the flanking sequence of the TSS sites. Then align the peaks that are mapping to these regions, and generate the tagMatrix.

In the above code, tagMatrix is not restricted to TSS regions. The regions can be other types that defined by the user.

## tagMatrix is a Large matrix which contains about 4000000 elements
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
tagMatrix <- getTagMatrix(peak, windows=promoter)
tagHeatmap(tagMatrix, xlim=c(-3000, 3000), color="red")
peakHeatmap(files[[4]], TxDb=txdb, upstream=3000, downstream=3000, color="red")
## >> preparing promoter regions...  2018-04-18 21:34:24 
## >> preparing tag matrix...        2018-04-18 21:34:24 
## >> generating figure...       2018-04-18 21:34:40
Heatmap of ChIP peaks binding to TSS regions

Heatmap of ChIP peaks binding to TSS regions

## >> done...            2018-04-18 21:34:41

Average Profile of ChIP peaks binding to TSS region

plotAvgProf(tagMatrix, xlim=c(-3000, 3000),
            xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

#==
# plotAvgProf2(files[[4]], TxDb=txdb, upstream=3000, downstream=3000,
#              xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

Confidence interval estimated by bootstrap method is also supported for characterizing ChIP binding profiles.

plotAvgProf(tagMatrix, xlim=c(-3000, 3000), conf = 0.95, resample = 1000)
## >> Running bootstrapping for tag matrix...        2018-04-18 21:35:25

## Profile of ChIP peaks binding to start site of Exon/Intron

getBioRegion(TxDb = NULL, upstream = 1000, downstream = 1000,
  by = "gene")
## Warning in loadTxDb(TxDb): >> TxDb is not specified, use
## 'TxDb.Hsapiens.UCSC.hg19.knownGene' by default...
## GRanges object with 22801 ranges and 0 metadata columns:
##           seqnames                 ranges strand
##              <Rle>              <IRanges>  <Rle>
##       [1]    chr19 [ 58873214,  58875214]      -
##       [2]     chr8 [ 18247755,  18249755]      +
##       [3]    chr20 [ 43279376,  43281376]      -
##       [4]    chr18 [ 25756445,  25758445]      -
##       [5]     chr1 [244005886, 244007886]      -
##       ...      ...                    ...    ...
##   [22797]     chr9 [115094944, 115096944]      -
##   [22798]    chr21 [ 35735323,  35737323]      +
##   [22799]    chr22 [ 19108967,  19110967]      -
##   [22800]     chr6 [ 90538619,  90540619]      +
##   [22801]    chr22 [ 50963905,  50965905]      -
##   -
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths

Peak Annotation

# tssRegion: transcription start site
# 
peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Hs.eg.db")
## >> loading peak file...               2018-04-18 21:35:29 
## >> preparing features information...      2018-04-18 21:35:29 
## >> identifying nearest features...        2018-04-18 21:35:45 
## >> calculating distance from peak to TSS...   2018-04-18 21:35:45 
## >> assigning genomic annotation...        2018-04-18 21:35:45 
## >> adding gene annotation...          2018-04-18 21:36:13 
## >> assigning chromosome lengths           2018-04-18 21:36:13 
## >> done...                    2018-04-18 21:36:13
peakAnno
## Annotated peaks generated by ChIPseeker
## 1331/1331  peaks were annotated
## Genomic Annotation Summary:
##               Feature  Frequency
## 9    Promoter (<=1kb) 48.1592787
## 10   Promoter (1-2kb)  6.0856499
## 11   Promoter (2-3kb)  4.5078888
## 4              5' UTR  0.3005259
## 3              3' UTR  2.1036814
## 1            1st Exon  0.6010518
## 7          Other Exon  2.4042074
## 2          1st Intron  3.6814425
## 8        Other Intron  4.2824944
## 6  Downstream (<=3kb)  0.9767092
## 5   Distal Intergenic 26.8970699
head(as.GRanges(peakAnno))
## GRanges object with 6 ranges and 14 metadata columns:
##       seqnames             ranges strand |          V4        V5
##          <Rle>          <IRanges>  <Rle> |    <factor> <numeric>
##   [1]     chr1 [ 815093,  817883]      * | MACS_peak_1    295.76
##   [2]     chr1 [1243288, 1244338]      * | MACS_peak_2     63.19
##   [3]     chr1 [2979977, 2981228]      * | MACS_peak_3    100.16
##   [4]     chr1 [3566182, 3567876]      * | MACS_peak_4    558.89
##   [5]     chr1 [3816546, 3818111]      * | MACS_peak_5     57.57
##   [6]     chr1 [6304865, 6305704]      * | MACS_peak_6     54.57
##             annotation   geneChr geneStart   geneEnd geneLength geneStrand
##            <character> <integer> <integer> <integer>  <integer>  <integer>
##   [1] Promoter (2-3kb)         1    803451    812182       8732          2
##   [2] Promoter (<=1kb)         1   1243994   1247057       3064          1
##   [3] Promoter (<=1kb)         1   2976181   2980350       4170          2
##   [4] Promoter (<=1kb)         1   3547331   3566671      19341          2
##   [5] Promoter (<=1kb)         1   3816968   3832011      15044          1
##   [6] Promoter (<=1kb)         1   6304252   6305638       1387          1
##            geneId transcriptId distanceToTSS         ENSEMBL      SYMBOL
##       <character>  <character>     <numeric>     <character> <character>
##   [1]      284593   uc001abt.4         -2911 ENSG00000230368      FAM41C
##   [2]      126789   uc001aed.3             0 ENSG00000169972       PUSL1
##   [3]      440556   uc001aka.3             0 ENSG00000177133   LINC00982
##   [4]       49856   uc001ako.3             0 ENSG00000116213      WRAP73
##   [5]   100133612   uc001alg.3             0 ENSG00000236423   LINC01134
##   [6]      390992   uc009vly.2           613 ENSG00000173673        HES3
##                                          GENENAME
##                                       <character>
##   [1] family with sequence similarity 41 member C
##   [2]             pseudouridylate synthase-like 1
##   [3]  long intergenic non-protein coding RNA 982
##   [4]     WD repeat containing, antisense to TP73
##   [5] long intergenic non-protein coding RNA 1134
##   [6]      hes family bHLH transcription factor 3
##   -
##   seqinfo: 24 sequences from hg19 genome
head(as.data.frame(peakAnno))
##   seqnames   start     end width strand          V4     V5
## 1     chr1  815093  817883  2791      * MACS_peak_1 295.76
## 2     chr1 1243288 1244338  1051      * MACS_peak_2  63.19
## 3     chr1 2979977 2981228  1252      * MACS_peak_3 100.16
## 4     chr1 3566182 3567876  1695      * MACS_peak_4 558.89
## 5     chr1 3816546 3818111  1566      * MACS_peak_5  57.57
## 6     chr1 6304865 6305704   840      * MACS_peak_6  54.57
##         annotation geneChr geneStart geneEnd geneLength geneStrand
## 1 Promoter (2-3kb)       1    803451  812182       8732          2
## 2 Promoter (<=1kb)       1   1243994 1247057       3064          1
## 3 Promoter (<=1kb)       1   2976181 2980350       4170          2
## 4 Promoter (<=1kb)       1   3547331 3566671      19341          2
## 5 Promoter (<=1kb)       1   3816968 3832011      15044          1
## 6 Promoter (<=1kb)       1   6304252 6305638       1387          1
##      geneId transcriptId distanceToTSS         ENSEMBL    SYMBOL
## 1    284593   uc001abt.4         -2911 ENSG00000230368    FAM41C
## 2    126789   uc001aed.3             0 ENSG00000169972     PUSL1
## 3    440556   uc001aka.3             0 ENSG00000177133 LINC00982
## 4     49856   uc001ako.3             0 ENSG00000116213    WRAP73
## 5 100133612   uc001alg.3             0 ENSG00000236423 LINC01134
## 6    390992   uc009vly.2           613 ENSG00000173673      HES3
##                                      GENENAME
## 1 family with sequence similarity 41 member C
## 2             pseudouridylate synthase-like 1
## 3  long intergenic non-protein coding RNA 982
## 4     WD repeat containing, antisense to TP73
## 5 long intergenic non-protein coding RNA 1134
## 6      hes family bHLH transcription factor 3

annotatePeak report detail information when the annotation is Exon or Intron, for instance “Exon (uc002sbe.3/9736, exon 69 of 80)”, means that the peak is overlap with an Exon of transcript uc002sbe.3, and the corresponding Entrez gene ID is 9736 (Transcripts that belong to the same gene ID may differ in splice events), and this overlaped exon is the 69th exon of the 80 exons that this transcript uc002sbe.3 prossess.

Parameter annoDb is optional, if provided, extra columns including SYMBOL, GENENAME, ENSEMBL/ENTREZID will be added. The geneId column in annotation output will be consistent with the geneID in TxDb. If it is ENTREZID, ENSEMBL will be added if annoDb is provided, while if it is ENSEMBL ID, ENTREZID will be added.

Visualize Genomic Annotation

plotAnnoBar(peakAnno)
Genomic Annotation by barplot

Genomic Annotation by barplot

vennpie(peakAnno)
Genomic Annotation by vennpie

Genomic Annotation by vennpie

upsetplot(peakAnno, vennpie=TRUE)

The distance from the peak (binding site) to the TSS of the nearest gene is calculated by annotatePeak and reported in the output. We provide plotDistToTSS to calculate the percentage of binding sites upstream and downstream from the TSS of the nearest genes, and visualize the distribution.

plotDistToTSS(peakAnno,
              title="Distribution of transcription factor-binding loci\nrelative to TSS")

Functional enrichment analysis

Perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating biological knowledge provided by biological ontologies. For instance, Gene Ontology (GO) annotates genes to biological processes, molecular functions, and cellular components in a directed acyclic graph structure, Kyoto Encyclopedia of Genes and Genomes (KEGG) annotates genes to pathways, Disease Ontology (DO) annotates genes with human disease association, and Reactome10 annotates gene to pathways and reactions.

ChIPseeker provides a function, seq2gene, for linking genomc regions to genes in a many-to-many mapping. It consider host gene (exon/intron), promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function is designed to link both coding and non-coding genomic regions to coding genes and facilitate functional analysis.

pathway1 <- enrichPathway(as.data.frame(peakAnno)$geneId)
head(pathway1, 2)
##                          ID
## R-HSA-186712   R-HSA-186712
## R-HSA-3769402 R-HSA-3769402
##                                                            Description
## R-HSA-186712                       Regulation of beta-cell development
## R-HSA-3769402 Deactivation of the beta-catenin transactivating complex
##               GeneRatio  BgRatio       pvalue     p.adjust       qvalue
## R-HSA-186712     12/453 33/10513 5.512889e-09 4.724546e-06 4.724546e-06
## R-HSA-3769402     9/453 42/10513 5.923251e-05 2.538113e-02 2.538113e-02
##                                                                          geneID
## R-HSA-186712  2494/5080/3651/3175/6928/390874/3642/4821/4825/2255/222546/389692
## R-HSA-3769402                   607/55553/6662/6657/51176/83595/64321/7088/6658
##               Count
## R-HSA-186712     12
## R-HSA-3769402     9
gene <- seq2gene(peak, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathway2 <- enrichPathway(gene)
head(pathway2, 2)
##                          ID                           Description
## R-HSA-186712   R-HSA-186712   Regulation of beta-cell development
## R-HSA-8943724 R-HSA-8943724 Regulation of PTEN gene transcription
##               GeneRatio  BgRatio       pvalue     p.adjust       qvalue
## R-HSA-186712     10/368 33/10513 1.101985e-07 8.881999e-05 8.676682e-05
## R-HSA-8943724    10/368 61/10513 4.477029e-05 1.235530e-02 1.206969e-02
##                                                                   geneID
## R-HSA-186712       2494/3651/4821/4825/2255/222546/389692/5080/6928/3642
## R-HSA-8943724 64121/100532731/23466/2122/7101/57167/648/84733/57332/8535
##               Count
## R-HSA-186712     10
## R-HSA-8943724    10
dotplot(pathway2)

ChIP peak data set comparison

Profile of several ChIP peak data binding to TSS region

Function plotAvgProf and tagHeatmap can accept a list of tagMatrix and visualize profile or heatmap among several ChIP experiments, while plotAvgProf2 and peakHeatmap can accept a list of bed files and perform the same task in one step.

## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrixList <- lapply(files, getTagMatrix, windows=promoter)
##
## to speed up the compilation of this vigenette, we load a precaculated tagMatrixList
data("tagMatrixList")
plotAvgProf(tagMatrixList[1:3], xlim=c(-3000, 3000))

plotAvgProf(tagMatrixList, xlim=c(-3000, 3000), conf=0.95,resample=500, facet="row")
## >> Running bootstrapping for tag matrix...        2018-04-18 21:36:38 
## >> Running bootstrapping for tag matrix...        2018-04-18 21:36:42 
## >> Running bootstrapping for tag matrix...        2018-04-18 21:36:46 
## >> Running bootstrapping for tag matrix...        2018-04-18 21:37:28 
## >> Running bootstrapping for tag matrix...        2018-04-18 21:38:22

tagHeatmap(tagMatrixList, xlim=c(-3000, 3000), color=NULL)

## ChIP peak annotation comparision

peakAnnoList <- sapply(files, annotatePeak, TxDb=txdb,
                       tssRegion=c(-3000, 3000), verbose=FALSE)
plotAnnoBar(peakAnnoList)

R function plotDistToTSS can use to comparing distance to TSS profiles among ChIPseq data.

plotDistToTSS(peakAnnoList)

Functional profiles comparison

genes = lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
names(genes) = sub("_", "\n", names(genes))
compKEGG <- compareCluster(geneCluster   = genes,
                         fun           = "enrichKEGG",
                         pvalueCutoff  = 0.05,
                         pAdjustMethod = "BH")
plot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")

Overlap of peaks and annotated genes

To compare the overlap peaks of replicate experiments or from different experiments, peak2GRanges can read peak file and stored in GRanges object. Several files can be read simultaneously using lapply, and then passed to vennplot to calculate their overlap and draw venn plot.

genes= lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
vennplot(genes)
Overlap of annotated genes

Overlap of annotated genes

Statistical testing of ChIP seq overlap

Overlap is very important, if two ChIP experiment by two different proteins overlap in a large fraction of their peaks, they may cooperative in regulation. Calculating the overlap is only touch the surface.

Shuffle genome coordination

p <- GRanges(seqnames=c("chr1", "chr3"),
             ranges=IRanges(start=c(1, 100), end=c(50, 130)))
## Randomly permute the genomic locations of ChIP peaks defined in a genome which stored in TxDb object
shuffle(p, TxDb=txdb)
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames                 ranges strand
##          <Rle>              <IRanges>  <Rle>
##   [1]     chr1 [153088833, 153088882]      *
##   [2]     chr3 [183428672, 183428702]      *
##   -
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Peak overlap enrichment analysis

With the ease of this shuffle method, we can generate thousands of random ChIP data and calculate the background null distribution of the overlap among ChIP data sets.

enrichPeakOverlap(queryPeak     = files[[5]],
                  targetPeak    = unlist(files[1:4]),
                  TxDb          = txdb,
                  pAdjustMethod = "BH",
                  nShuffle      = 50,
                  chainFile     = NULL,
                  verbose       = FALSE)
##                                                       qSample
## ARmo_0M    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## ARmo_1nM   GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## CBX6_BF    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
##                                                       tSample qLen tLen
## ARmo_0M                       GSM1174480_ARmo_0M_peaks.bed.gz 1663  812
## ARmo_1nM                     GSM1174481_ARmo_1nM_peaks.bed.gz 1663 2296
## ARmo_100nM                 GSM1174482_ARmo_100nM_peaks.bed.gz 1663 1359
## CBX6_BF    GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663 1331
##            N_OL     pvalue   p.adjust
## ARmo_0M       0 0.80392157 0.80392157
## ARmo_1nM      8 0.21568627 0.43137255
## ARmo_100nM    3 0.43137255 0.57516340
## CBX6_BF     968 0.01960784 0.07843137

Parameter queryPeak is the query ChIP data, while targetPeak is bed file name or a vector of bed file names from comparison; nShuffle is the number to shuffle the peaks in targetPeak. To speed up the compilation of this vignettes, we only set nShuffle to 50 as an example for only demonstration. User should set the number to 1000 or above for more robust result. Parameter chainFile are chain file name for mapping the targetPeak to the genome version consistent with queryPeak when their genome version are different. This creat the possibility of comparison among different genome version and cross species.

In the output, qSample is the name of queryPeak and qLen is the the number of peaks in queryPeak. N_OL is the number of overlap between queryPeak and targetPeak.

Data Mining with ChIP seq data deposited in GEO

getGEOspecies()
##                                               species  Freq
## 1                                       Aedes aegypti    11
## 2                                            Anabaena     6
## 3                                 Anolis carolinensis     5
## 4                                   Anopheles gambiae     2
## 5                                      Apis mellifera     5
## 6                           Apis mellifera scutellata     1
## 7                                  Arabidopsis lyrata     4
## 8                                Arabidopsis thaliana   209
## 9                                Atelerix albiventris     2
## 10                                         Bos taurus    28
## 11                          Branchiostoma lanceolatum    62
## 12                                      Brassica rapa     8
## 13                             Caenorhabditis elegans   164
## 14                                   Candida albicans    25
## 15                               Candida dubliniensis    20
## 16                             Canis lupus familiaris     7
## 17                          Chlamydomonas reinhardtii    51
## 18                               Chlorocebus aethiops     2
## 19                                 Cleome hassleriana     1
## 20                                      Columba livia     6
## 21                                  Crassostrea gigas     1
## 22                            Cryptococcus neoformans    51
## 23                                    Cyprinus carpio    40
## 24                                        Danio rerio   164
## 25                                 Drosophila busckii     1
## 26                            Drosophila melanogaster   734
## 27            Drosophila melanogaster;\tSindbis virus     3
## 28                                 Drosophila miranda     2
## 29                           Drosophila pseudoobscura     7
## 30                                Drosophila simulans    12
## 31                                 Drosophila virilis    26
## 32                              Drosophila willistoni     1
## 33                                  Drosophila yakuba     8
## 34                                     Equus caballus     1
## 35                                   Escherichia coli    13
## 36                           Escherichia coli BW25113     4
## 37                              Escherichia coli K-12     2
## 38          Escherichia coli str. K-12 substr. MG1655     8
## 39                                      Gallus gallus    55
## 40                       Geobacter sulfurreducens PCA     3
## 41                                    Gorilla gorilla     2
## 42                                  Histophilus somni     1
## 43                                       Homo sapiens 10083
## 44                 Homo sapiens;\tHuman herpesvirus 8     6
## 45                               Human herpesvirus 6B     2
## 46                                Human herpesvirus 8     6
## 47                                Larimichthys crocea     7
## 48                             Legionella pneumophila     5
## 49                             Leishmania amazonensis     4
## 50                                   Leishmania major     2
## 51                   Leishmania major strain Friedlin     4
## 52                              Leishmania tarentolae    15
## 53                                     Macaca mulatta   116
## 54                              Monodelphis domestica     8
## 55                         Moraxella catarrhalis O35E     6
## 56                                       Mus musculus  7596
## 57                         Mus musculus x Mus spretus     1
## 58                         Mycobacterium tuberculosis     2
## 59                                    Myotis brandtii    15
## 60                              Naumovozyma castellii     1
## 61                             Nematostella vectensis    23
## 62                              Oreochromis niloticus     1
## 63                           Ornithorhynchus anatinus     5
## 64                                       Oryza sativa    23
## 65                                    Oryzias latipes     2
## 66                                    Pan troglodytes    93
## 67                                       Papio anubis     1
## 68                              Plasmodium falciparum   105
## 69                          Plasmodium falciparum 3D7    29
## 70                          Pseudomonas putida KT2440     2
## 71                                 Pseudozyma aphidis    11
## 72                                Pyrococcus furiosus     4
## 73                                  Rattus norvegicus    57
## 74                         Rhodopseudomonas palustris     6
## 75                  Rhodopseudomonas palustris CGA009     3
## 76                           Saccharomyces cerevisiae   586
## 77 Saccharomyces cerevisiae x Saccharomyces paradoxus    16
## 78            Saccharomyces cerevisiae;\tMus musculus    12
## 79                         Saccharomyces kudriavzevii     1
## 80                            Saccharomyces paradoxus     8
## 81                               Saccharomyces uvarum     1
## 82                      Schizosaccharomyces japonicus     2
## 83                          Schizosaccharomyces pombe   130
## 84                             Schmidtea mediterranea     7
## 85                                    Sorghum bicolor     2
## 86                      Streptomyces coelicolor A3(2)     6
## 87                                         Sus scrofa    23
## 88                                Taeniopygia guttata     1
## 89                                   Tupaia chinensis     7
## 90                                    Vibrio cholerae     8
## 91                      Xenopus (Silurana) tropicalis    62
## 92                                     Xenopus laevis    10
## 93                                           Zea mays    63
getGEOspecies()[grep("^A.*\\st.*", getGEOspecies()[["species"]]),]
##                species Freq
## 8 Arabidopsis thaliana  209
getGEOgenomeVersion()
##                         organism genomeVersion Freq
## 1            Anolis carolinensis       anoCar2    5
## 2                     Bos taurus       bosTau4    2
## 3                     Bos taurus       bosTau6   24
## 4                     Bos taurus       bosTau7    2
## 5         Caenorhabditis elegans          ce10    4
## 6         Caenorhabditis elegans           ce6   64
## 7                    Danio rerio       danRer6    6
## 8                    Danio rerio       danRer7   61
## 9        Drosophila melanogaster           dm3  502
## 10                 Gallus gallus       galGal3   32
## 11                 Gallus gallus       galGal4   15
## 12                  Homo sapiens          hg18 2512
## 13                  Homo sapiens          hg19 6876
## 14                  Homo sapiens          hg38   43
## 15                  Mus musculus          mm10  214
## 16                  Mus musculus           mm8  507
## 17                  Mus musculus           mm9 6289
## 18         Monodelphis domestica       monDom5    8
## 19               Pan troglodytes       panTro3   48
## 20               Pan troglodytes       panTro4   42
## 21                Macaca mulatta       rheMac2   81
## 22                Macaca mulatta       rheMac3   31
## 23             Rattus norvegicus           rn5    3
## 24      Saccharomyces cerevisiae       sacCer2  141
## 25      Saccharomyces cerevisiae       sacCer3  227
## 26                    Sus scrofa       susScr2   17
## 27 Xenopus (Silurana) tropicalis       xenTro3    3
# If simplify is set to FALSE, extra information including source_name, extract_protocol, description, data_processing and submission_date will be incorporated.
hg19 <- getGEOInfo(genome="hg19", simplify=TRUE)
head(hg19)
##     series_id        gsm     organism
## 111  GSE16256  GSM521889 Homo sapiens
## 112  GSE16256  GSM521887 Homo sapiens
## 113  GSE16256  GSM521883 Homo sapiens
## 114  GSE16256 GSM1010966 Homo sapiens
## 115  GSE16256  GSM896166 Homo sapiens
## 116  GSE16256  GSM910577 Homo sapiens
##                                                                                                       title
## 111          Reference Epigenome: ChIP-Seq Analysis of H3K27me3 in IMR90 Cells; renlab.H3K27me3.IMR90-02.01
## 112            Reference Epigenome: ChIP-Seq Analysis of H3K27ac in IMR90 Cells; renlab.H3K27ac.IMR90-03.01
## 113            Reference Epigenome: ChIP-Seq Analysis of H3K14ac in IMR90 Cells; renlab.H3K14ac.IMR90-02.01
## 114                      polyA RNA sequencing of STL003 Pancreas Cultured Cells; polyA-RNA-seq_STL003PA_r1a
## 115          Reference Epigenome: ChIP-Seq Analysis of H4K8ac in hESC H1 Cells; renlab.H4K8ac.hESC.H1.01.01
## 116 Reference Epigenome: ChIP-Seq Analysis of H3K4me1 in Human Spleen Tissue; renlab.H3K4me1.STL003SX.01.01
##                                                                                                     supplementary_file
## 111         ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521889/suppl/GSM521889_UCSD.IMR90.H3K27me3.SK05.bed.gz
## 112          ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521887/suppl/GSM521887_UCSD.IMR90.H3K27ac.YL58.bed.gz
## 113          ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521883/suppl/GSM521883_UCSD.IMR90.H3K14ac.SK17.bed.gz
## 114 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1010nnn/GSM1010966/suppl/GSM1010966_UCSD.Pancreas.mRNA-Seq.STL003.bed.gz
## 115              ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM896nnn/GSM896166/suppl/GSM896166_UCSD.H1.H4K8ac.AY17.bed.gz
## 116       ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM910nnn/GSM910577/suppl/GSM910577_UCSD.Spleen.H3K4me1.STL003.bed.gz
##     genomeVersion pubmed_id
## 111          hg19  19829295
## 112          hg19  19829295
## 113          hg19  19829295
## 114          hg19  19829295
## 115          hg19  19829295
## 116          hg19  19829295

ChIPseeker provide function downloadGEObedFiles to download all the bed files of a particular genome.

# downloadGEObedFiles(genome="hg19", destDir="hg19")

# gsm <- hg19$gsm[sample(nrow(hg19), 10)]
# downloadGSMbedFiles(gsm, destDir="hg19")

After download the bed files from GEO, we can pass them to enrichPeakOverlap for testing the significant of overlap. Parameter targetPeak can be the folder, e.g. hg19, that containing bed files. enrichPeakOverlap will parse the folder and compare all the bed files. It is possible to test the overlap with bed files that are mapping to different genome or different genome versions, enrichPeakOverlap provide a parameter chainFile that can pass a chain file and liftOver the targetPeak to the genome version consistent with queryPeak.
Signifcant overlap can be use to generate hypothesis of cooperative regulation.By mining the data deposited in GEO, we can identify some putative complex or interacted regulators in gene expression regulation or chromsome remodelling for further validation.

REFERENCES

  • browseVignettes("ChIPseeker")
Author: Joaxin
Link: https://u.pinsflora.xyz/Reverie/html/bioc_ChIPseeker/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.

Comment
avatar
Joaxin
When I close my eyes, I'm in my own ♡cean 🐟aves.
Follow Me
Announcement
感謝訪問本站,若喜歡請收藏 ^_^