Gene filtering for RNA-seq data


Guest User ★ 13k


I am writing to inquire about independent filtering for my large RNA-seq dataset. I have raw gene counts for around 55,000 genes from 91 libraries/samples, consisting of 3 biological replicates for 4 different genotypes on germinating seeds. I am currently working on differential expression and, subsequently, transcriptomic network analysis for these samples. Before performing any of these analyses, I'd like to perform independent filtering on my data to increase detection power for differentially expressed genes.

I will be using your DESeq2 package (version 1.2.5) for my filtering and differential expression analysis.

Based on a recommendation by a statistician, I have decided to perform the following steps:

1) Fit a negative binomial GLM with genotype & time effects across all samples for all genes that have nonzero counts in at least one sample
2) Filter weakly expressed genes (for example using a filter like the one implemented in HTSFilter)
3) Adjust p-values for genes passing the filter to correct for multiple testing

While the DESeq2 package is nicely written, I am not a statistician and I am still a little unclear on a few things, mainly the workflow for my analysis. Based on my understanding of the DESeq2 documentation, I should be doing the following (in chronological order):

1. First, perform a differential expression (dds function) on my raw gene counts for library size normalization. This step will fit my data to a negative binomial generalized linear model with genotype & time effects across all samples for all genes that have nonzero counts in at least one sample.
2. Second, use the result I obtain from step 1 to go through the independent filtering step using the filter_p function from the genefilter package.
3. Third, use the result from step 2 to further filter weakly expressed genes using the HTSFilter package.
4. Finally, adjust p-values for genes passing the filter to correct for multiple testing. I am not entirely sure how to do this. Can I perform this step using the DESeq2 package?

Furthermore, does DESeq2 take care of PCR duplicate artifacts?

-- output of sessionInfo(): none

Normalization GO Network HTSFilter DESeq2 • 4.4k views

updated 11.4 years ago by Michael Love 43k • written 11.4 years ago by Guest User ★ 13k


Michael Love 43k


Hi Yoong,

On Sat, Nov 16, 2013 at 7:42 PM, FeiYian Yoong [guest] <guest@bioconductor.org> wrote:

> I am writing to inquire about independent filtering for my large RNA-seq
> dataset. I have around 55,000 genes (raw gene counts) RNA sequencing data
> from 91 libraries/samples, consisting of 3 biological replicates for 4
> different genotypes on germinating seeds. I am currently working on
> differential expression and subsequently transcriptomic network analysis
> for these samples. Before performing any of these analyses, I'd like to
> perform an independent filtering for my data to increase detection power
> for differentially expressed genes.

Independent filtering does not have to occur before differential expression analysis. As implemented in DESeq2 version >= 1.2.0, independent filtering is automatically performed by the results() function, after the DESeq() function has been called. Please see the vignette for the details on the default workflow, which includes independent filtering.

> I will be using your DESeq2 package (version 1.2.5) for my filtering and
> differential expression analysis.
>
> Based on recommendation by a statistician, I have decided to perform the
> following steps:
>
> 1) Fit a negative binomial GLM with genotype & time effects across all
> samples for all genes that have nonzero counts in at least one sample
>
> 2) Filter weakly expressed genes (for example using a filter like the one
> implemented in HTSFilter)

Filtering genes by mean normalized count is automatically done by the results() function.

> 3) Adjust p-values for genes passing the filter to correct for multiple
> testing

This is also automatically done by the results() function.

> While the DESeq2 package was nicely written, since I am not a
> statistician, I am still a little bit unclear on a few things. Hence, I
> would like to clarify a few things with you, mainly the workflow for my
> analysis. Based on my understanding from what's written in DESeq2 package,
> I should be doing the following (in chronological order):
>
> 1. First, perform a differential expression (dds function) on my raw gene
> counts for library size normalization. This step will fit my data to a
> negative binomial generalized linear model with genotypes & time effects
> across all samples for all genes that have nonzero counts in at least one
> sample.

The DESeq() function estimates size factors for library size normalization, estimates the dispersion, then fits a negative binomial generalized linear model.

> 2. Second, use the result I obtain from step 1 to go through independent
> filtering step using filter_p function from genefilter package.

The results() function calls the filtered_p function from the genefilter package (the function you refer to as filter_p). This maximizes the number of adjusted p-values that fall below a given value alpha (which defaults to 0.1) by excluding genes with a low mean normalized count over all samples.

> 3. Third, use the result from step 2 to filter weakly expressed genes
> further more using HTSFilter package.

This is the same as step 2.

> 4. Finally, adjust p-values for genes passing the filter to correct for
> multiple testing. I am not entirely sure how to do this. Can I perform this
> step using DESeq2 package?

This is also covered by step 2.

> Furthermore, does DESeq2 take care of PCR duplicate artifacts?

No, DESeq2 does not take care of PCR duplicate artifacts, because it begins with a summarized count table. If you feel that some of the samples might have a problem with many duplicate reads stacking up due to PCR, you might consider filtering these upstream of DESeq2. I don't know off the top of my head which are the best tools for this task.

Mike
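To make the idea in the answer above concrete, here is a minimal Python sketch of independent filtering followed by multiple-testing adjustment. It is an illustration of the general technique, not DESeq2's or genefilter's actual code. The function names `bh_adjust` and `independent_filter`, the toy data, and the fixed list of candidate thresholds are all assumptions made for this example. The adjustment shown is Benjamini-Hochberg, a common choice for this step.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def independent_filter(mean_counts, pvalues, thresholds, alpha=0.1):
    """Choose the mean-count threshold that maximizes rejections at alpha.

    Returns (best_threshold, padj); padj is None for genes that were
    filtered out. Ties are resolved in favor of the earlier threshold.
    """
    best = None
    for theta in thresholds:
        keep = [i for i, mc in enumerate(mean_counts) if mc >= theta]
        adj = bh_adjust([pvalues[i] for i in keep])
        n_rejected = sum(p < alpha for p in adj)
        if best is None or n_rejected > best[0]:
            best = (n_rejected, theta, keep, adj)
    _, theta, keep, adj = best
    padj = [None] * len(pvalues)
    for i, p in zip(keep, adj):
        padj[i] = p
    return theta, padj


# Toy data (hypothetical): three weakly expressed genes with large p-values.
mean_counts = [0.5, 1.0, 2.0, 50, 80, 120, 300, 500]
pvalues = [0.9, 0.6, 0.7, 0.004, 0.03, 0.2, 0.001, 0.06]

theta, padj = independent_filter(mean_counts, pvalues, thresholds=[0.0, 10.0])
print("chosen threshold:", theta)
print("adjusted p-values:", padj)
```

In this toy example, dropping the three low-count genes shrinks the multiple-testing burden, so one more gene falls below alpha = 0.1 than when all eight genes are adjusted together. The filter statistic (mean count) is used only to decide which genes to test, which is why the filtering can happen after the p-values have been computed.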

ADD COMMENT • link 11.4 years ago Michael Love 43k


re: PCR dupes: picard's MarkDuplicates is fairly well accepted for this purpose.
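For readers unfamiliar with what duplicate marking means, the following Python sketch shows the general idea in a heavily simplified form: reads that align to the same chromosome, start position, and strand are grouped, and all but the highest-quality read in each group are flagged. This is not Picard's implementation. The `mark_duplicates` function, the dictionary keys, and the example reads are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Each read is a dict with hypothetical keys:
    name, chrom, pos, strand, qual.
    """
    groups = defaultdict(list)
    for read in reads:
        # Reads sharing chromosome, start position and strand form a group.
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the highest-quality read; on ties, the first one seen is kept.
        best = max(group, key=lambda r: r["qual"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


# Hypothetical aligned reads.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual": 35},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual": 30},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "qual": 30},
    {"name": "r4", "chrom": "chr2", "pos": 250, "strand": "+", "qual": 20},
    {"name": "r5", "chrom": "chr2", "pos": 250, "strand": "+", "qual": 40},
]
print(sorted(mark_duplicates(reads)))
```

Any such marking or removal happens on the aligned reads, before they are summarized into the count table that DESeq2 takes as input.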

ADD REPLY • link 11.4 years ago Tim Triche ★ 4.2k
