The GDC legacy archive has been retired, so I downloaded the TCGA methylation data (450K) in hg38 using TCGAbiolinks. I'd like to use DMRcate to find DMRs, but the problem is that the annotation is hg19 (IlluminaHumanMethylation450kanno.ilmn12.hg19). I am trying to modify the "cpg.annotate" function, but got stuck on the "makeGenomicRatioSetFromMatrix" function, which I want to use with my homemade hg38 annotation for the 450K array (based on the files at http://zwdzwd.github.io/InfiniumAnnotation). Specifically, I don't know how to modify the part of makeGenomicRatioSetFromMatrix shown below so that it uses my annotation rather than the supplied value of "ilmn12.hg19". I also tried to build a GenomicRatioSet directly from my homemade annotation, but failed.
out <- GenomicRatioSet(gr = gr[ind2, ], Beta = NULL,
M = mat[ind1, , drop = FALSE], CN = NULL, colData = pData,
annotation = c(array = array, annotation = annotation),
preprocessMethod = preprocessing)
So the question is: how should I apply "cpg.annotate" to TCGA methylation data (450K) in hg38?
Another point of confusion: I see that "DMR.plot" has a "genome" option that can be set to "hg38" (https://www.bioconductor.org/packages/devel/bioc/manuals/DMRcate/man/DMRcate.pdf). Is "hg38" only meant for EPICv2?
See a related question here https://www.biostars.org/p/9587144/ .
Thanks a lot!
Hi Xiaofei,
Thanks for this. If your data is from 450K, you'll have to call your DMRs in hg19, and then lift the DMR ranges over to hg38 post-hoc. DMRcate is one-to-one with regards to platform -> reference, since it follows the Illumina-provided annotation.
450K: IlluminaHumanMethylation450kanno.ilmn12.hg19
EPICv1: IlluminaHumanMethylationEPICanno.ilm10b4.hg19
EPICv2: IlluminaHumanMethylationEPICv2anno.20a1.hg38
cpg.annotate() isn't built for customised/homemade annotations, but you're more than welcome to fork the repository (https://github.com/timpeters82/DMRcate-devel/) and adapt it for your own needs.
Cheers, Tim
Also, re DMR.plot(), yes the understanding is that all DMRs from EPICv2 should be plotted in hg38. I've left that implied for the user since the same function uses sequencing data but if it gets too confusing I'll force the annotations from array data in a future commit.
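For the post-hoc liftover mentioned above, a minimal sketch might look like the following. It assumes the DMRs are already available as a GRanges object in hg19 with UCSC-style chromosome names ("chr1", ...), and that an uncompressed hg19-to-hg38 chain file (for example UCSC's hg19ToHg38.over.chain) has been downloaded locally. The function name and both argument names are hypothetical; only rtracklayer's import.chain() and liftOver() are real calls.

```r
# Sketch only: lift hg19 DMR ranges over to hg38 after DMR calling.
# `dmrs_hg19`  : assumed GRanges of DMRs with UCSC-style seqnames ("chr1", ...)
# `chain_path` : assumed path to an uncompressed hg19 -> hg38 chain file
lift_dmrs_to_hg38 <- function(dmrs_hg19, chain_path) {
  chain <- rtracklayer::import.chain(chain_path)
  # liftOver() returns a GRangesList with one element per input DMR.
  # An element can be empty (DMR did not map) or hold several ranges
  # (DMR was split across the new assembly).
  rtracklayer::liftOver(dmrs_hg19, chain)
}
```

With the result stored as `lifted`, `lengths(lifted) == 0` flags DMRs that failed to map, and `unlist(lifted)` flattens everything into a single GRanges. Split DMRs probably deserve a manual look before they are merged back into one interval.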
Hi Tim, Thanks for your reply! Yes, my data is 450K, and it is TCGA methylation data downloaded with TCGAbiolinks. The problem is that I can only get the data in hg38 because the GDC legacy archive has been retired. I don't know how to retrieve it in hg19 anymore. Best, Xiaofei
Hi Xiaofei,
Is it possible to generate a matrix of beta values or M-values with rownames as probe IDs from this data? If so, cpg.annotate() will automatically reannotate them to hg19 and you can proceed as usual.
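As a rough illustration of the kind of input meant here, the sketch below builds a toy beta matrix with probe IDs as rownames and converts it to M-values. The values, sample names, NA handling and the clamping constant are all made up for the example; with real TCGA data the matrix would come from the downloaded object instead.

```r
# Toy beta matrix: rows are Illumina probe IDs, columns are samples.
# All numbers and sample names are invented for illustration.
beta <- matrix(
  c(0.10, 0.80, 0.50, NA,
    0.20, 0.70, 0.55, 0.90,
    0.15, 0.75, 0.45, 0.95),
  nrow = 4,
  dimnames = list(
    c("cg00000029", "cg00000108", "cg00000109", "cg00000165"),
    c("S1", "S2", "S3")
  )
)

# Drop probes with a missing value in any sample
# (one simple option; imputation would be another).
beta <- beta[stats::complete.cases(beta), , drop = FALSE]

# Clamp betas away from 0 and 1 so the logit stays finite,
# then convert to M-values: M = log2(beta / (1 - beta)).
eps <- 1e-6
beta_clamped <- pmin(pmax(beta, eps), 1 - eps)
m_values <- log2(beta_clamped / (1 - beta_clamped))
```

Only the rownames matter for annotation, so any genomic coordinates attached to the downloaded data can simply be left behind at this step.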
Cheers, Tim
Thanks for your reply!
I am in a similar situation. I have a matrix of beta values with probe IDs as row names (annotated to hg38, since it comes from the harmonized TCGA dataset). I am using cpg.annotate with arraytype="450K". When I try to plot with DMR.plot using hg38 it does not work, but it works with hg19. So, if I read Tim's suggestion correctly, does it mean that I should use hg19 with DMR.plot, because cpg.annotate automatically re-annotates to hg19? Thank you! Prakash
Yes. If you have the Illumina IDs, then all the annotation data that cpg.annotate uses will be based on hg19. It doesn't matter if TCGA lifted the data over to hg38, because you aren't using their location data, you are using Illumina's location data.
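Put together, a 450K run that stays in hg19 throughout might be sketched as below. The wrapper name, `m_values` (probe-by-sample M-value matrix with Illumina IDs as rownames) and `group` (a two-level factor of sample labels) are hypothetical, and lambda = 1000 with C = 2 are just commonly used settings rather than a recommendation for this dataset.

```r
# Sketch only: call DMRs on 450K data and extract them as hg19 ranges.
# `m_values` : assumed M-value matrix, rownames = Illumina probe IDs
# `group`    : assumed two-level factor, one entry per column of `m_values`
call_dmrs_450k <- function(m_values, group) {
  design <- stats::model.matrix(~ group)
  annotated <- DMRcate::cpg.annotate(
    datatype = "array", object = m_values, what = "M",
    arraytype = "450K", analysis.type = "differential",
    design = design, coef = 2
  )
  dmrs <- DMRcate::dmrcate(annotated, lambda = 1000, C = 2)
  # Probe locations come from the Illumina 450K annotation,
  # so the resulting ranges are hg19 coordinates.
  DMRcate::extractRanges(dmrs, genome = "hg19")
}
```

The ranges returned here are the ones to pass to DMR.plot with genome = "hg19".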