Convert peaks file to GRanges object

Question

ChIPpeakAnno, MACS format annotation

Khademul Islam ▴ 30

Hi,

I have just installed the latest ChIPpeakAnno and tried the example code and data, but I got the error below. I get the same error with my own data. How can I solve this?

One more question: when it annotates peaks to the nearest TSS, does it use the summit or the start position from the MACS file?

https://bioconductor.org/packages/devel/bioc/vignettes/ChIPpeakAnno/inst/doc/ChIPpeakAnno.html

macs <- system.file("extdata", "MACS_peaks.xls", package="ChIPpeakAnno")

macsOutput <- toGRanges(macs, format="MACS")

duplicated or NA names found. Rename all the names by numbers.

Many thanks,

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 24 (Workstation Edition)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel grid stats graphics grDevices utils
[8] datasets methods base

other attached packages:
[1] EnsDb.Hsapiens.v75_2.1.0 ensembldb_1.6.2          GenomicFeatures_1.26.0
[4] AnnotationDbi_1.36.0     Biobase_2.34.0           ChIPpeakAnno_3.8.9
[7] VennDiagram_1.6.17       futile.logger_1.4.3      GenomicRanges_1.26.1
[10] GenomeInfoDb_1.10.1      Biostrings_2.42.1        XVector_0.14.0
[13] IRanges_2.8.1            S4Vectors_0.12.1         BiocGenerics_0.20.0

ADD COMMENT • link updated 6.2 years ago by Julie Zhu ★ 4.3k • written 8.0 years ago by Khademul Islam ▴ 30

Julie Zhu ★ 4.3k

Lucy,

You mentioned that you downloaded the annotation file in GTF format from Ensembl. If so, calling toGRanges with format = "GFF" is not correct, because the GTF format differs from the GFF format. Without changing your code, you could download the annotation file in GFF format instead. Alternatively, assuming that you are interested in the human gene annotation, you can use the following code to get the annotation.

library(EnsDb.Hsapiens.v86)

annoData <- toGRanges(EnsDb.Hsapiens.v86, feature="gene")

Best regards, Julie

ADD COMMENT • link 6.2 years ago Julie Zhu ★ 4.3k

Thank you Julie.

I wasn't sure whether I could use the GFF option as Ensembl states that "The GTF (General Transfer Format) is identical to GFF version 2" https://www.ensembl.org/info/website/upload/gff.html

I have a matched RNA-seq dataset for which I used the Ensembl GTF file for annotation, so I would like to use the exact same annotation version for my peak data. If I download the equivalent GFF file, does this contain all of the same information as the GTF file?

ADD REPLY • link 6.2 years ago Lucy ▴ 60

Lucy,

Thanks for the clarification!

Could you please post a few lines of the gtf annotation you used for analyzing your RNA-seq dataset? Thanks!

Best regards,

Julie

ADD REPLY • link 6.2 years ago Julie Zhu ★ 4.3k

#!genome-build GRCh38.p12
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.27
#!genebuild-last-updated 2018-07
chr1    havana  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
chr1    havana  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
chr1    havana  exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
chr1    havana  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
chr1    havana  exon    13221   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";
chr1    havana  transcript  12010   13670   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; tag "basic"; transcript_support_level "NA";
chr1    havana  exon    12010   12057   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; exon_id "ENSE00001948541"; exon_version "1"; tag "basic"; transcript_support_level "NA";
chr1    havana  exon    12179   12227   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; exon_id "ENSE00001671638"; exon_version "2"; tag "basic"; transcript_support_level "NA";

ADD REPLY • link 6.2 years ago Lucy ▴ 60
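The ninth column of the GTF records above stores attributes as `key "value";` pairs, whereas GFF3 stores them as `key=value;` pairs, which is one reason a reader written for one format may fail to pick up feature names from the other. The following is a minimal base R sketch of how a GTF attribute string can be split into named values. The helper name `parse_gtf_attributes` is hypothetical and is not part of ChIPpeakAnno, and the sketch assumes that no value contains a semicolon.

```r
# One attribute string copied from the gene record posted above.
attr_field <- paste0(
  'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; ',
  'gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";'
)

# Hypothetical helper: turn 'key "value"; key "value";' into a named vector.
# Assumes values never contain ";" themselves.
parse_gtf_attributes <- function(x) {
  # Split on ";" and drop surrounding whitespace and empty pieces.
  parts <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]
  # Each remaining piece looks like: key "value"
  keys <- sub('^(\\S+)\\s+.*$', "\\1", parts)
  values <- sub('^\\S+\\s+"?([^"]*)"?$', "\\1", parts)
  setNames(values, keys)
}

attrs <- parse_gtf_attributes(attr_field)
attrs[["gene_id"]]
attrs[["gene_name"]]
```

A named vector like this makes it easy to check that `gene_id` is present and unique per gene before using it as the feature name.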

Sorry that isn't very easy to read! I would be happy to send you the file if it is easier.

ADD REPLY • link 6.2 years ago Lucy ▴ 60

Lucy,

Please send me the GTF file (julie.zhu@umassmed.edu). Thanks!

BTW, I just noticed that you continued an old thread that is about the MACS format. Could you please start a new thread titled "ChIPpeakAnno::toGRanges GTF format" instead, to make future searches easier? Thanks!

Best,

Julie

ADD REPLY • link 6.2 years ago Julie Zhu ★ 4.3k

Julie Zhu ★ 4.3k

Lucy, please try the following code snippet to import the GTF file hg38_200000.gtf.

library(refGenome)

gtf = ensemblGenome()

read.gtf(gtf, filename = "hg38_200000.gtf")

genes = gtf@ev$genes[ ,c("geneid","genename", "start", "end", "strand", "seqid")]

annoData <- toGRanges(genes, format="others", colNames=c("names", "gene_name", "start", "end", "strand", "space"))

# Convert peaks file to GRanges object

peaks <- toGRanges("peaks_counts.bed", format="BED", header=FALSE)

peaks <- peaks[width(peaks) >0]

annotatedPeak <- annotatePeakInBatch(myPeakList=peaks, AnnotationData=annoData, ignore.strand=TRUE)

Best regards, Julie

ADD COMMENT • link 6.2 years ago Julie Zhu ★ 4.3k
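If refGenome is not available to you, one possible alternative for the import step is `import()` from the Bioconductor package rtracklayer, which reads a GTF file into a GRanges object. The sketch below is an assumption-laden illustration rather than the approach given in the answer above: it writes two records modelled on the GTF lines in this thread to a temporary file so that it can run on its own, and it assumes rtracklayer is installed.

```r
library(rtracklayer)  # Bioconductor package; assumed to be installed

# Two tab-separated records modelled on the GTF lines posted in this thread.
gtf_lines <- c(
  paste("chr1", "havana", "gene", "11869", "14409", ".", "+", ".",
        'gene_id "ENSG00000223972"; gene_name "DDX11L1";', sep = "\t"),
  paste("chr1", "havana", "exon", "11869", "12227", ".", "+", ".",
        'gene_id "ENSG00000223972"; gene_name "DDX11L1"; exon_number "1";',
        sep = "\t")
)
gtf_file <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_file)

gtf <- import(gtf_file)            # GRanges with one range per GTF record
genes <- gtf[gtf$type == "gene"]   # keep gene-level records only
names(genes) <- genes$gene_id      # give every range a unique, non-NA name

genes
```

With a real file, `gtf_file` would simply be the path to the GTF. A GRanges object with unique names like `genes` should be usable as the `AnnotationData` argument, but that is worth confirming against `?annotatePeakInBatch` for the installed version.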

score 2 · Accepted Answer · 2017-03-30

Ou, Jianhong ★ 1.3k

Hi,

Thanks for selecting ChIPpeakAnno as your annotation tool.

Regarding your first question: that is a warning, not an error, and I am considering changing it to a message. It tells you that the function could not find peak names, or that there are duplicated peak names, so the toGRanges function automatically assigns a name to each peak.
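To illustrate what "rename all the names by numbers" amounts to, here is a minimal base R sketch. The helper `rename_if_needed` is hypothetical and only mimics the idea; it is not the package's internal code, and the exact names that toGRanges generates may differ.

```r
# Hypothetical helper, for illustration only (not ChIPpeakAnno internals):
# if any name is NA or duplicated, replace all names with running numbers.
rename_if_needed <- function(peak_names) {
  needs_rename <- anyNA(peak_names) || anyDuplicated(peak_names) > 0
  if (needs_rename) {
    # Zero-pad to a common width so the new names sort in peak order.
    width <- nchar(length(peak_names))
    peak_names <- formatC(seq_along(peak_names), width = width, flag = "0")
  }
  peak_names
}

rename_if_needed(c("MACS_peak_1", "MACS_peak_1", NA))  # names get replaced
rename_if_needed(c("peakA", "peakB"))                  # names are kept
```

The peak coordinates are untouched in either case; only the labels change.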

By default, when it annotates peaks to the nearest TSS, it uses the peak start position to calculate the distance.
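The difference between measuring from the peak start and from the summit can be seen in a small base R sketch. The peak coordinates and TSS below are made up, the `summit` column is assumed to be an offset from `start` (as in MACS 1.x output; check your own file), and the distance is signed as peak position minus TSS for a plus-strand gene purely for illustration.

```r
# Made-up peaks; "summit" is assumed to be an offset from "start".
peaks <- data.frame(
  chr    = c("chr1", "chr1"),
  start  = c(1000, 5000),
  end    = c(1400, 5600),
  summit = c(350, 100)
)
tss <- 1500  # hypothetical TSS of a plus-strand gene

# Absolute summit coordinate of each peak.
summit_pos <- peaks$start + peaks$summit

# Signed distance to the TSS, measured from the start or from the summit.
dist_from_start  <- peaks$start - tss
dist_from_summit <- summit_pos - tss

data.frame(peaks, summit_pos, dist_from_start, dist_from_summit)
```

annotatePeakInBatch has a PeakLocForDistance argument that controls which peak position is used; see ?annotatePeakInBatch for the values your version accepts. To measure from summits, one option is to shrink each peak to its summit coordinate before annotating.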

Let me know if you still have any questions.

ADD COMMENT • link 8.0 years ago Ou, Jianhong ★ 1.3k

Hi,

I am trying to make a custom annotation file to use with ChIPpeakAnno. I am starting with an Ensembl GTF file. The following command gives the error: duplicated or NA names found. Rename all the names by numbers.

annoData <- toGRanges(gff, format="GFF")

Which part of the GTF file does it not like?

If I run annotatePeakInBatch using this file:

annotatedPeak <- annotatePeakInBatch(myPeakList=peaks, AnnotationData=annoData, ignore.strand=TRUE)

I get the error:

Error in `rownames<-`(tmp, value = c("(-73.9,5e+03]", "(5e+03,9.99e+03]", : invalid rownames length
In addition: Warning message:
In annotatePeakInBatch(myPeakList = peaks, AnnotationData = annoData, : not all the seqnames of myPeakList is in the AnnotationData.

Could someone please explain what this means and what I need to change?

Thank you!

ADD REPLY • link 6.2 years ago Lucy ▴ 60
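The warning "not all the seqnames of myPeakList is in the AnnotationData" indicates that some chromosome names in the peaks have no counterpart in the annotation. A common cause is a naming mismatch such as "chr1" versus "1". The base R sketch below shows one way to spot this; the name vectors are made up, and with GRanges objects they could be obtained with seqlevels(peaks) and seqlevels(annoData).

```r
# Made-up chromosome names, for illustration only.
peak_chr <- c("chr1", "chr2", "chrX")
anno_chr <- c("1", "2", "X", "MT")

# Names used by the peaks that the annotation does not contain.
missing_before <- setdiff(peak_chr, anno_chr)
missing_before

# Naive harmonisation: add a "chr" prefix where it is absent.
# Note: this does not map "MT" to "chrM".
anno_chr_prefixed <- ifelse(startsWith(anno_chr, "chr"),
                            anno_chr, paste0("chr", anno_chr))
missing_after <- setdiff(peak_chr, anno_chr_prefixed)
missing_after
```

For GRanges objects, the GenomeInfoDb function seqlevelsStyle() offers a more robust way to switch between naming styles than prefixing by hand.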

Hi,

You mentioned that you downloaded the annotation file in GTF format from Ensembl. If so, calling toGRanges with format = "GFF" is not correct, because the GTF format differs from the GFF format. Without changing your code, could you please download the annotation file in GFF format instead? Alternatively, assuming that you are interested in the human gene annotation, you can use the following code to get the annotation.

library(EnsDb.Hsapiens.v86)

annoData <- toGRanges(EnsDb.Hsapiens.v86, feature="gene")

Best regards, Julie

ADD REPLY • link 6.2 years ago Julie Zhu ★ 4.3k