SNP6 data to VCF

Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

Hi, all.

I'm a little rusty on my oligo array software tools. I'm interested in taking Affymetrix SNP6 data to VCF format. To do that, I am going to need to:

1. Call SNPs
2. Determine strand and reference allele for each SNP on the array
3. Assign the correct alleles to each SNP for each sample
4. Write out the VCF file with the correct genotypes (on the positive strand, reference allele correctly specified)

What is the best way to do steps 1-3? I'll deal with step 4 since I don't think that has been implemented directly.

Thanks,
Sean

SNP oligo ASSIGN • 3.9k views

ADD COMMENT • link updated 6.2 years ago by giulio.genovese • 0 • written 12.8 years ago by Sean Davis 21k

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 9 weeks ago

United States

On Mon, Feb 13, 2012 at 2:13 PM, Sean Davis <sdavis2@mail.nih.gov> wrote:

> 1. Call SNPs
> 2. Determine strand and reference allele for each SNP on the array
> 3. Assign the correct alleles to each SNP for each sample

For 2 and 3, pd.genomewidesnp.6 has the metadata:

    > con = pd.genomewidesnp.6@getdb()
    > dbListTables(con)
     [1] "featureSet"        "featureSetCNV"     "fragmentLength"
     [4] "fragmentLengthCNV" "pmfeature"         "pmfeatureCNV"
     [7] "sequence"          "sequenceCNV"       "sqlite_stat1"
    [10] "table_info"
    > ss = dbGetQuery(con, "select * from featureSet limit 5")
    > ss
      fsetid    man_fsetid affy_snp_id dbsnp_rs_id chrom physical_pos strand
    1      1 SNP_A-2131660          NA   rs2887286     1      1156131      0
    2      2 SNP_A-1967418          NA   rs1496555     1      2234251      0
    3      3 SNP_A-1969580          NA  rs41477744     1      2329564      0
    4      4 SNP_A-4263484          NA   rs3890745     1      2553624      0
    5      5 SNP_A-1978185          NA  rs10492936     1      2936870      1
      cytoband allele_a allele_b
    1   p36.33        C        T
    2   p36.33        A        G
    3   p36.32        A        G
    4   p36.32        C        T
    5   p36.32        C        T

> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
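Steps 2 and 3 could be sketched in base R on a table with the same columns as the featureSet rows shown above. This is only an illustration under stated assumptions: it assumes that strand == 1 marks alleles reported on the reverse strand (check this coding against the Affymetrix annotation before relying on it), and the helpers complement_base and to_plus_strand are hypothetical names, not part of any package. Deciding which of the two alleles is the reference still requires a lookup in the genome sequence, which this sketch does not attempt.

```r
# Toy table mirroring two of the featureSet rows shown above.
fs <- data.frame(
  man_fsetid   = c("SNP_A-2131660", "SNP_A-1978185"),
  chrom        = c("1", "1"),
  physical_pos = c(1156131L, 2936870L),
  strand       = c(0L, 1L),
  allele_a     = c("C", "C"),
  allele_b     = c("T", "T"),
  stringsAsFactors = FALSE
)

# Complement single-base alleles (hypothetical helper).
complement_base <- function(x) chartr("ACGT", "TGCA", x)

# ASSUMPTION: strand == reverse_code means the alleles are on the reverse strand.
to_plus_strand <- function(fs, reverse_code = 1L) {
  rev <- !is.na(fs$strand) & fs$strand == reverse_code
  fs$allele_a_plus <- ifelse(rev, complement_base(fs$allele_a), fs$allele_a)
  fs$allele_b_plus <- ifelse(rev, complement_base(fs$allele_b), fs$allele_b)
  fs
}

fs_plus <- to_plus_strand(fs)
print(fs_plus[, c("man_fsetid", "strand", "allele_a_plus", "allele_b_plus")])
```

Keeping the original allele_a and allele_b columns alongside the flipped ones makes it easy to audit the conversion later.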

ADD COMMENT • link 12.8 years ago Vincent J. Carey, Jr. 6.7k

On Mon, Feb 13, 2012 at 2:28 PM, Vincent Carey <stvjc at channing.harvard.edu> wrote:

> for 2 and 3 pd.genomewidesnp.6 has the metadata

Told you I was rusty. Thanks, Vince.

Sean

ADD REPLY • link 12.8 years ago Sean Davis 21k

Allow me to suggest using, at least for now, the crlmm package to call the genotypes on SNP 6.0 (also for SNP 5.0, in case you also have data on that platform). Its implementation has significant improvements over our initial crlmm implementation (present in oligo).

benilton

On 13 February 2012 19:36, Sean Davis <sdavis2 at mail.nih.gov> wrote:

> Told you I was rusty. Thanks, Vince.
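crlmm reports genotype calls as integer codes (1 = AA, 2 = AB, 3 = BB). As a step toward the VCF output, those codes could be mapped to GT strings once it is known which array allele matches the reference. The sketch below is base R only; calls_to_gt and its a_is_ref argument are hypothetical, and it assumes every SNP has exactly one allele (A or B) equal to the reference, which will not hold for all probe sets.

```r
# calls: SNP x sample integer matrix coded 1 = AA, 2 = AB, 3 = BB (NA = no call).
# a_is_ref: logical vector, TRUE where allele A is the reference allele.
calls_to_gt <- function(calls, a_is_ref) {
  stopifnot(nrow(calls) == length(a_is_ref))
  gt_a_ref <- c("0/0", "0/1", "1/1")  # allele A is REF, allele B is ALT
  gt_b_ref <- c("1/1", "0/1", "0/0")  # allele B is REF, allele A is ALT
  gt <- matrix("./.", nrow(calls), ncol(calls), dimnames = dimnames(calls))
  for (i in seq_len(nrow(calls))) {
    lookup <- if (a_is_ref[i]) gt_a_ref else gt_b_ref
    ok <- !is.na(calls[i, ])          # missing calls stay "./."
    gt[i, ok] <- lookup[calls[i, ok]]
  }
  gt
}

# Toy input: two SNPs, three samples.
calls <- matrix(c(1L, 2L, 3L,
                  NA, 3L, 1L),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("snp1", "snp2"), c("s1", "s2", "s3")))
gt <- calls_to_gt(calls, a_is_ref = c(TRUE, FALSE))
print(gt)
```

The resulting matrix has one row per SNP and one column per sample, which matches the layout of the genotype columns in a VCF body.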

ADD REPLY • link 12.8 years ago Benilton Carvalho ★ 4.3k

Hi,

I collected dozens of breast cancer GEO datasets (same platform, Affy U133Plus2) and wonder if there is a way to normalize these datasets so I can compare gene expression levels across all of them, even though they are from different labs. I am thinking about either running RMA on all the datasets together and then SVA to correct for batch effects, or running RMA dataset by dataset followed by mean-scaling. Does either of these make sense? Or what is the best approach? Any suggestions?

Thanks a lot for the help!

Ying

ADD REPLY • link 12.8 years ago ying chen ▴ 340

Ying,

You might consider fRMA:

McCall MN, Bolstad BM, and Irizarry RA* (2010). Frozen Robust Multi-Array Analysis (fRMA), Biostatistics, 11(2):242-253.
http://bioconductor.org/packages/release/bioc/html/frma.html

This preprocessing algorithm was designed to handle such multi-batch analyses.

Best,
Matt

--
Matthew N McCall, PhD
112 Arvine Heights
Rochester, NY 14611
Cell: 202-222-5880

ADD REPLY • link 12.8 years ago Matthew McCall ▴ 830

Dear Matt,

I just went to the link you provided and took a look at fRMA. It is a very useful package. May I ask two questions regarding fRMA?

1. Does fRMA have precomputed vectors for Affy's Human Exon arrays to estimate both gene-level and exon-level data?
2. Can the random effect model in fRMA handle two random batch effects which are known in my exon array data?

Thanks,
Shirley

ADD REPLY • link 12.8 years ago shirley zhang ★ 1.0k

Shirley,

Thanks for the kind remark. As for your questions:

1. I have computed preliminary frozen parameters for the Affy HuEx arrays. At the time they were made (about a year ago), there wasn't quite enough public data, which is why I call them "preliminary."
2. I'm assuming you mean the 2 batches are known. If so, then running frma in random_effect mode on each batch separately will produce the desired result.

Best,
Matt
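Running frma separately on each known batch could be organized roughly as in the sketch below. It assumes an AffyBatch object and one batch label per array; frma_by_batch and its frma_fun argument are hypothetical conveniences, introduced so the splitting logic can be tried with a stand-in function when the frma package is not installed. The summarize = "random_effect" setting is the frma option discussed in this thread.

```r
# Split arrays by batch label and preprocess each batch on its own.
# frma_fun defaults to frma::frma, which is only looked up if no
# replacement function is supplied (R evaluates default arguments lazily).
frma_by_batch <- function(abatch, batch, frma_fun = frma::frma) {
  idx_by_batch <- split(seq_along(batch), batch)
  lapply(idx_by_batch, function(idx) {
    # abatch[, idx] selects arrays (columns) belonging to this batch.
    frma_fun(abatch[, idx], summarize = "random_effect")
  })
}

# Stand-in demo: a plain matrix plays the AffyBatch and a stub plays frma.
toy <- matrix(1:12, nrow = 3, dimnames = list(NULL, paste0("array", 1:4)))
stub <- function(x, summarize) colnames(x)
res <- frma_by_batch(toy, batch = c("lab1", "lab2", "lab1", "lab2"),
                     frma_fun = stub)
print(res)
```

With real data the call would be frma_by_batch(abatch, batch), returning one preprocessed object per batch; how best to combine them afterwards is left open here.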

ADD REPLY • link 12.8 years ago Matthew McCall ▴ 830

Hi Matt,

Thanks a lot for the suggestion. I read the papers and think frma is perfect for my task. But I still have a few questions:

1) For multiple arrays, the only summarization method is random_effect, right?
2) In your frmaTools paper (BMC Bioinformatics) you mentioned that the latest version of the frma package has the option to use the version 13 Entrez Gene probe annotation (section 3.2 Alternative CDF). But I could not find any way to apply this option in the manual frma.pdf (Feb 14, 2012) downloaded from the Bioconductor frma page. Is this option still available?
3) Is the data file installed automatically when I install the frma package, or do I need to install it myself, e.g. with biocLite("hgu133plus2frmavecs")?
4) When you built the reference distribution for U133Plus2, did you pay attention to the experimental protocol used for each sample, such as the starting RNA type (total RNA or mRNA) and the amount of total RNA used (~5ug or 10-100ng)? Does it make sense to run SVA after frma to correct for possible batch effects due to the different protocols used?

Thanks,
Ying

> Date: Tue, 14 Feb 2012 17:17:48 -0500
> Subject: Re: [BioC] Best way to normalize GEO gene expression datasets from different labs/sources?

ADD REPLY • link 12.8 years ago ying chen ▴ 340

Ying,

1. For multiple arrays, you have 2 options; both use RMA background correction and quantile normalization to a fixed reference distribution. The difference is in the summarization. The default summarization will treat each array individually -- subtracting the frozen "global" probe-effect and down-weighting probes that show high between- or within-batch residual variance. Alternatively, if you know the batches present in your data, you can preprocess each batch separately using the random_effect summarization. This will allow a batch-specific change in the global probe-effect (the random effect in the model) for each batch in your data set. Often the two methods will give you very similar results.

2. Yes, this option is still available. The frma function uses the frozen parameter vectors that correspond to the cdfname of your AffyBatch object. So if you read in the CEL file data with an alternative CDF, frma will attempt to load the corresponding frmavecs data package.

3. You need to install the data package you would like to use via biocLite.

4. I don't believe so. I definitely think that it is worthwhile to examine the preprocessed data for batch effects. fRMA is designed to address a very specific type of batch-effect -- changes in probe behavior between batches. There are certainly other ways in which batch-effects manifest themselves that methods such as SVA are designed to address.

Hope this helps.

Best,
Matt
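The idea of quantile normalizing each array to a fixed reference distribution can be illustrated in a few lines of base R. This is not frma's implementation: normalize_to_reference is a hypothetical helper, the reference here is simulated rather than a frozen one shipped in a frmavecs package, and ties are broken by order of appearance instead of being averaged.

```r
# Map one array's intensities onto a fixed reference distribution:
# the k-th smallest intensity receives the k-th smallest reference value.
normalize_to_reference <- function(x, reference) {
  stopifnot(length(x) == length(reference))
  ref_sorted <- sort(reference)
  ref_sorted[rank(x, ties.method = "first")]
}

set.seed(1)
reference <- rnorm(1000, mean = 8, sd = 2)  # stand-in for a frozen reference
array1 <- rnorm(1000, mean = 7, sd = 1)     # simulated log2 intensities
array2 <- rnorm(1000, mean = 9, sd = 3)

norm1 <- normalize_to_reference(array1, reference)
norm2 <- normalize_to_reference(array2, reference)

# Both arrays now share the reference distribution, and each was
# processed without looking at the other.
print(all.equal(sort(norm1), sort(reference)))
print(all.equal(sort(norm2), sort(norm1)))
```

Because the reference is fixed in advance, an array processed today and one processed next year end up on the same scale, which is what makes this approach attractive for data pooled from many labs.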

ADD REPLY • link 12.8 years ago Matthew McCall ▴ 830

Hi Matt,

Thanks a lot for the quick reply. I am still a little bit confused about how to apply the alternative CDF.

> 2. Yes, this option is still available. The frma function uses the
> frozen parameter vectors that correspond to the cdfname of your
> AffyBatch object. So if you read in the CEL file data with an
> alternative CDF, frma will attempt to load the corresponding frmavecs
> data package.

1) You mentioned above "read in the CEL file data with an alternative CDF", but I do not see it as an option for frma(). Do you mean I need to put the alternative CDF file in the same directory as my CEL files?
2) What is the name of the alternative CDF you have available? Do I need to convert the alternative CDF from text format to binary format?
3) Is there a separate data file for the alternative CDF, or is it in the same data file, hgu133plus2frmavecs?
4) Is there an alternative CDF option for the U133A chip?

Thanks!
Ying

PS: I am sorry for the weird format of my post. I really do not know why it comes out like this.

ADD REPLY • link 12.8 years ago ying chen ▴ 340

giulio.genovese • 0

@giuliogenovese-17235

Last seen 6.2 years ago

I wrote affy2vcf, a bcftools plugin that is capable of generating VCF files out of the output of the Affymetrix apt-probeset-genotype command line tool.

ADD COMMENT • link 6.2 years ago giulio.genovese • 0
