Hello All,
I am preparing to run normalization and differential expression on my RNA-seq raw count data. Before running limma for batch correction (samples were collected from 16 tissue sites) and DESeq2 (normalization and DE analysis), I want to add a gene filtering step. I started with ~60,000 genes and have already removed genes with zero counts in 50% or more of the samples, which reduced the set to 35,458 genes. This still seems a bit high. Downstream I will be running DESeq2 and WGCNA, and I want to make sure the genes I keep are robust, but I am not confident about the best approach for further filtering.
I have visualized the data with a histogram, which shows a bimodal distribution, and with a PCA plot. Can you offer any suggestions for filtering? Is this mainly a technical or a biological question? On the technical side, I was told that a bimodal distribution is typical for RNA-seq data and that the left bump corresponds to noise. Biologically, however, when I pulled out some genes and plotted their expression across samples as boxplots, the patterns made sense given my subtype grouping.
Are there any suggestions for filtering? I have seen people use this before DESeq2:
# Pre-filtering: remove rows (genes) with low counts
# Calculate total read counts per gene
total_counts <- rowSums(counts(dds))
# Keep genes with at least 10 total read counts
dds_filtered <- dds[total_counts >= 10, ]
However, this seems arbitrary and not data-specific. I am not sure how to find a suitable value in the literature. My data are from high-grade serous ovarian cancer. After filtering, I plan to batch correct with limma before running DESeq2.
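For comparison, one variant I have come across ties the threshold to the study design: keep a gene only if it reaches a minimum count in at least as many samples as the smallest group. Below is a minimal sketch of that idea in base R on simulated counts. The names `counts_mat`, `group`, `min_count`, and `smallest_group` are hypothetical, the group sizes and the cutoff of 10 are placeholders, and none of this is tuned to my real data.

```r
# Minimal sketch on simulated data (assumption: not my real dataset).
set.seed(1)
n_genes   <- 1000
n_samples <- 12

# Hypothetical subtype labels with unequal group sizes (3, 4, 5)
group <- factor(rep(c("A", "B", "C"), times = c(3, 4, 5)))

# Simulate negative binomial counts with a gene-specific mean
gene_means <- rexp(n_genes, rate = 1 / 20)
counts_mat <- matrix(
  rnbinom(n_genes * n_samples,
          mu = rep(gene_means, times = n_samples), size = 2),
  nrow = n_genes, ncol = n_samples,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("s", seq_len(n_samples)))
)

# Keep genes with >= min_count reads in at least as many samples
# as the smallest group, instead of thresholding the row total
min_count      <- 10                 # placeholder cutoff
smallest_group <- min(table(group))  # 3 in this simulation
keep <- rowSums(counts_mat >= min_count) >= smallest_group
counts_filtered <- counts_mat[keep, , drop = FALSE]

cat(sum(keep), "of", n_genes, "genes kept\n")
```

With a DESeqDataSet, I assume the same logic would read `keep <- rowSums(counts(dds) >= 10) >= smallest_group` followed by `dds <- dds[keep, ]`. This at least makes the cutoff depend on the group sizes, although the count of 10 is still a choice.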
Any suggestions would be great!
Thank you for pointing me in the right direction. I was overthinking it.
Kaylin