Hi,
I'm using featureCounts() to generate the count matrix of my RNAseq experiment and edgeR to do the DGE analysis.
Now, during the filtering step I am removing low-count genes like this:
fc <- featureCounts(..)
counttable <- fc$counts
y <- DGEList(counts=counttable)
keep <- rowSums(cpm(y) > 1) >= 4 # Keep genes only if they have more than x count per million (calculated to require at least 6 reads on average) AND which are found in at least x replicates per group (which is the minimal amount of samples in a condition)
y <- y[keep, keep.lib.sizes=FALSE] # Re-calculate library sizes after count reductions
Out of 53k+ entries, only around 14k remain after that. But in those 14k there are still genes I want to remove from the analysis, such as mtRNAs, noncoding RNAs etc..
I've tried using biomaRt or editing my annotation GTF file to narrow it down to protein_coding genes, but it didn't work so far.
Is there a way to remove genes that aren't protein_coding? A nifty way to subset my GTF file (Mus_musculus.GRCm38.91.gtf).
I had no luck in bash using grep "protein_coding" ...
EDIT: I have tried to limit my annotation .GTF file to only protein_coding lines using: $ awk '/protein_coding/ { print }' Mus_musculus.GRCm38.91.gtf > test.gtf
The test.gtf generated is ~60MB smaller and a grep mtRNA returns no hits, but a grep lincRNA returns some, but those lincRNAs have protein_coding in their biotype. I will try using this filtered GTF file in featureCounts.
featureCounts:
Load annotation file //Annotation_files/test.gtf || || Features : 719464 || || Meta-features : 22073 || || Chromosomes/contigs : 41
So the number of meta-features was reduced from ~50k to 22k (hopefully all protein_coding annotations).
It does depend of course on what gene annotation system you are using. It would help potential responders to explain up-front that you are using Ensembl rather than Entrez Gene Ids.