Excluding probes during methylation analysis

0

Entering edit mode

VSM ▴ 120

@vsm-6019

Last seen 6.9 years ago

United Kingdom

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

• 1.6k views

ADD COMMENT • link updated 11.8 years ago by Tim Triche ★ 4.2k • written 11.8 years ago by VSM ▴ 120

0

Entering edit mode

Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 4.6 years ago

United States

http://www.bioconductor.org/packages/2.12/data/annotation/html/FDb.UCS C.snp137common.hg19.html, http://www.bioconductor.org/packages/2.13/data/annotation/html/FDb.UCS C.snp135common.hg19.html , or http://www.bioconductor.org/packages/release/data/annotation/html/SNPl ocs.Hsapiens.dbSNP.20120608.html may be handy. The first two can simply be overlapped, but are 'common' (MAF > 0.01) SNPs only. If you want all of the SNPs that have been submitted to dbSNP, you need the SNPlocs package. The UCSC snp13[5|7]common packages are compiled from newer builds of dbSNP than the manifest, which had some bizarre inclusions (SNPs which are > 1bp 3' to the targeted locus, for example) when we looked. I personally screen out common SNPs that overlap the targeted or extension base, using the most recent build available to me, but that's just my preference. There are arguments to be made for SNPs anywhere from 1 to 49 bases 5' to the target based on melting temperature of the oligos, and there are arguments to be made for genotyping all of your subjects and screening individually for SNPs. Anyways, it is straightforward to dump out the probes that get hit by common SNPs: library(FDb.UCSC.snp137common.hg19) commonSNPs <- features(FDb.UCSC.snp137common.hg19) ## load the data: a SummarizedExperiment is like an eSet, but with a GRanges describing the features my.SE <- readRDS('my.SummarizedExperiment.rds') dimmy.SE) ## [1] 485577 11 ## mask common SNPs that overlap the targeted CpG (or CpH, or SNP) site my.SE.noCpgSNPs <- my.SE[ countOverlapsmy.SE, commonSNPs) < 1, ] dim(my.SE.noCpgSNPs) ## [1] 468211 11 ## retain only CpG probes, and only those that do not overlap a common SNP my.SE.noCpgSnps.onlyCpGs <- my.SE.noCpgSNPs[which(substr(rownames(my.SE.noCpgSNPs),1,2)== 'cg') , ] dim(my.SE.noCpgSnps.onlyCpGs) ## [1] 465130 11 I prefer to work on SummarizedExperiments (hence the .SE), as it makes life a bit easier; it also happens to be the parent class for GenomicMethylSet, GenomicRatioSet, etc. in minfi, so the steps are the same for those. Working on genomic coordinates is (almost?) universally preferable in this respect. YMMV... On Fri, Jul 5, 2013 at 9:31 AM, Victoria Svinti < victoria.svinti@igmm.ed.ac.uk> wrote: > Hi there, > > I decided to post after searching the forums for a few days, in hope that > somebody can point me in the right direction. > > I am analysing a 450k methylation array to look for differentially > methylated sites, and got as far as having normalised data. Various > resources suggest that I need to drop probes with know SNPs residing in the > sequence, microsattelites, those that anneal to multiple genomic locations > etc. > > I have looked into the FDb.InfiniumMethylation.hg19 package (get450k), but > I don't see the annotation regarding SNPs (could be due to my unfamiliarity > with GRanges). I finally have acquired a list of these from the GEO, > Illumina GPL13534, but wonder if it's outdated and if there is a better way > of doing this. > > Does someone know of a good/any tutorial for this workflow? > > Many thanks, > Victoria > > -- > Victoria Svinti > Colon Cancer Genetics Group > MRC Human Genetics Unit, IGMM > University of Edinburgh, Western General Hospital, > Crewe Road, Edinburgh, EH4 2XU > > > > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD COMMENT • link 11.8 years ago Tim Triche ★ 4.2k

0

Entering edit mode

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

ADD REPLY • link 11.8 years ago VSM ▴ 120

0

Entering edit mode

> I'm working with methyLumiM or MethyLumiSet objects, is there a way to convert them to a SummarizedExperiment object? If you are using a recent version of 'methylumi' (2.7.5 is ideal), you can coerce them: my.SE <- as(my.MethyLumiSet, 'SummarizedExperiment') You can also coerce them to minfi-style data structures if you wish: my.RGChannelSet <- as(my.MethyLumiSet, 'RGChannelSet') ## only works if you have all of the output from a methylumIDAT() run my.gmSet <- mapToGenome(my.MethyLumiSet) ## gmSet is a subclass of SummarizedExperiment and essentially behaves the same The coercion functions deserve better documentation than what I've currently written; I'll add some as soon as I have a chance. > Should I only be looking at probes in CpGs? That depends a great deal on your experimental design. If you care about non-CpG methylation, you should look at the ch probes too. The rs (SNP) probes can also be helpful in some circumstances, such as matching against SNP arrays, exome sequences, etc. but it does not make sense to include those in your methylation analysis directly (i.e., don't run tests on SNPs or CpG SNPs!). This latter point has come about in regards to minfi's default summarization functions, which (IMHO) are more elegant than those in methylumi, but on occasion surprising to the user. In my opinion, the most sensible processing pipeline would/will combine minfi and illuminaio's speed with the existing methylumi defaults for background processing and SNP probe handling, emitting a SummarizedExperiment-based container for downstream genomic analysis. GenomicMethylSet and GenomicRatioSet are both subclasses of SummarizedExperiment, so the latter problem is long since solved. The former problem is not so much a problem as an issue of time to write small code patches. > How do I go about finding probes that map to repetitive regions, RepeatMasker is a useful tool for this. I will ask if we can't make the TCGA hm450/hm27 masks public, since (after internal discussion) sensible defaults for probe masking were arrived at, and some computationally obnoxious bits are now water under the bridge for TCGA. > or map to multiple regions in the genome? For this, you need to use either the probe sequence (including degenerate probes, for Infinium II designs) or the forward design target and align the probes. BowTie, BWA, or Rsubread, for example, can handle this task if you just write out the probe sequences as a FASTA file. However, it's not necessarily a good use of your time; other groups have already done this, e.g. http://www.epigeneticsandchromatin.com/content/6/1/4, aka http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL16304 Additional FDb-style or minfi-style annotation packages would be a more sensible method to represent the above additions, and in fact both might be trivial enough for me to do them in the near future (I would like to update FDb.InfiniumMethylation.hg19, and I'd like to become more familiar with the guts of minfi-style annotation packages, in the event those are a better choice for future development). > Do I use the resulting SummarisedExperiment object, or the get450k() in FDb.InfiniumMethylation.hg19, or ..? Either will work, since get450k() is used by setAs('MethyLumiSet', 'SummarizedExperiment') to map the probes. Likewise, all of the previously mentioned approaches work for minfi's GenomicMethylSet and GenomicRatioSet classes. Kasper was receptive to input regarding the downstream representation of minfi results as SummarizedExperiments, which is a big help in methylation array analysis. (You can represent RRBS, eRRBS, WGBS, or RNAseq results as SummarizedExperiments, too, which makes integration much easier) The important part, IMHO, is that you get all of the power of the GenomicRanges infrastructure for "free" when you go this route. > found your previous posts on using the GPL13534 helpful, though it seems like things changed since then. Not so much, but the FDb packages attempt to automate some of the irritations related to using GPL entries. The latter are still a good starting point for in-house annotations, transforming GEO datasets into SummarizedExperiment/GenomicMethylSet/GenomicRatioSet objects, or any number of other manipulations (e.g. turning existing expression data into SummarizedExperiments, mapping against RNAseq data, ...). My only real concern is the veracity of some of the annotations, whether it be the target assembly, the symbols (Illumina's GPL entries have a smattering of "Excel genes", as one of my coworkers discovered one day), or their provenance. If you can verify the sequence target(s) and the resulting coordinates in a given genomic assembly, most of the remaining annotations are trivially accomplished using GenomicRanges, Biostrings, VariantAnnotation, and the like. Plus then you know where they came from! It's not an issue of mistrusting vendors, so much as it is an issue of "doveryai no proveryai" -- trust, but verify. That way, there there are no nasty surprises. This is also why I include a script to rebuild the packages from the manifest coordinates (using GEOquery!), verifying the target sequence mappings along the way, in the FDb.InfiniumMethylation packages. Similarly, for the snp135common and snp137common packages, the manipulations required to turn UCSC's curiously-indexed SNP results into SINGLE nucleotide hits (!) are included in inst/build/. I don't expect anyone else to trust me without verification, either. Best, --t On Sat, Jul 6, 2013 at 2:22 AM, Victoria Svinti < victoria.svinti@igmm.ed.ac.uk> wrote: > Hi Tim, > > Thank you for the very helpful pointers. > > I have some other questions remaining ... > > I'm working with methyLumiM or MethyLumiSet objects, is there a way to > convert them to a SummarizedExperiment object? > > Should I only be looking at probes in CpGs? > > How do I go about finding probes that map to repetitive regions, or map to > multiple regions in the genome? Do I use the resulting SummarisedExperiment > object, or the get450k() in FDb.InfiniumMethylation.hg19, or ..? > > Sorry for all the questions, hopefully this would be useful to other > people in the future. I found your previous posts on using the GPL13534 > helpful, though it seems like things changed since then. > > Thanks again, > Victoria > > > -- > *Victoria Svinti* > Colon Cancer Genetics Group > MRC Human Genetics Unit, IGMM > University of Edinburgh, Western General Hospital, > Crewe Road, Edinburgh, EH4 2XU > > > > On 5 Jul 2013, at 23:57, "Tim Triche, Jr." <tim.triche@gmail.com> wrote: > > > http://www.bioconductor.org/packages/2.12/data/annotation/html/FDb.U CSC.snp137common.hg19.html, > > http://www.bioconductor.org/packages/2.13/data/annotation/html/FDb.U CSC.snp135common.hg19.html > , > or > http://www.bioconductor.org/packages/release/data/annotation/html/SN Plocs.Hsapiens.dbSNP.20120608.html may > be handy. The first two can simply be overlapped, but are 'common' (MAF > > 0.01) SNPs only. If you want all of the SNPs that have been submitted to > dbSNP, you need the SNPlocs package. > > The UCSC snp13[5|7]common packages are compiled from newer builds of dbSNP > than the manifest, which had some bizarre inclusions (SNPs which are > 1bp > 3' to the targeted locus, for example) when we looked. I personally screen > out common SNPs that overlap the targeted or extension base, using the most > recent build available to me, but that's just my preference. There are > arguments to be made for SNPs anywhere from 1 to 49 bases 5' to the target > based on melting temperature of the oligos, and there are arguments to be > made for genotyping all of your subjects and screening individually for > SNPs. > > Anyways, it is straightforward to dump out the probes that get hit by > common SNPs: > > library(FDb.UCSC.snp137common.hg19) > commonSNPs <- features(FDb.UCSC.snp137common.hg19) > > ## load the data: a SummarizedExperiment is like an eSet, but with a > GRanges describing the features > my.SE <- readRDS('my.SummarizedExperiment.rds') > dimmy.SE) > ## [1] 485577 11 > > ## mask common SNPs that overlap the targeted CpG (or CpH, or SNP) site > my.SE.noCpgSNPs <- my.SE[ countOverlapsmy.SE, commonSNPs) < 1, ] > dim(my.SE.noCpgSNPs) > ## [1] 468211 11 > > ## retain only CpG probes, and only those that do not overlap a common SNP > my.SE.noCpgSnps.onlyCpGs > <- my.SE.noCpgSNPs[which(substr(rownames(my.SE.noCpgSNPs),1,2)== 'cg') , ] > dim(my.SE.noCpgSnps.onlyCpGs) > ## [1] 465130 11 > > I prefer to work on SummarizedExperiments (hence the .SE), as it makes > life a bit easier; it also happens to be the parent class for > GenomicMethylSet, GenomicRatioSet, etc. in minfi, so the steps are the same > for those. Working on genomic coordinates is (almost?) universally > preferable in this respect. > > YMMV... > > > > > On Fri, Jul 5, 2013 at 9:31 AM, Victoria Svinti < > victoria.svinti@igmm.ed.ac.uk> wrote: > >> Hi there, >> >> I decided to post after searching the forums for a few days, in hope that >> somebody can point me in the right direction. >> >> I am analysing a 450k methylation array to look for differentially >> methylated sites, and got as far as having normalised data. Various >> resources suggest that I need to drop probes with know SNPs residing in the >> sequence, microsattelites, those that anneal to multiple genomic locations >> etc. >> >> I have looked into the FDb.InfiniumMethylation.hg19 package (get450k), >> but I don't see the annotation regarding SNPs (could be due to my >> unfamiliarity with GRanges). I finally have acquired a list of these from >> the GEO, Illumina GPL13534, but wonder if it's outdated and if there is a >> better way of doing this. >> >> Does someone know of a good/any tutorial for this workflow? >> >> Many thanks, >> Victoria >> >> -- >> Victoria Svinti >> Colon Cancer Genetics Group >> MRC Human Genetics Unit, IGMM >> University of Edinburgh, Western General Hospital, >> Crewe Road, Edinburgh, EH4 2XU >> >> >> >> The University of Edinburgh is a charitable body, registered in >> Scotland, with registration number SC005336. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > > > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 11.8 years ago Tim Triche ★ 4.2k

0

Entering edit mode

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

ADD REPLY • link 11.8 years ago VSM ▴ 120

Login before adding your answer.