Hello,
I am working on quantifying RNASeq data at the exon level, which is intended for use in the DEXSeq package. I followed the documentation provided in the DEXSeq tutorial (http://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.pdf, page 23), as well as the commands on Subread to DEXSeq helper code (https://github.com/vivekbhr/Subread_to_DEXSeq).
First to produce the GTF File, I used the following code:
GTF="/Reference_Data/GRCh37/gtf/Homo_sapiens.GRCh37.87.gtf" python2 dexseq_prepare_annotation2.py -f Homo_sapiens.GRCh37.87_collapsed.gtf $GTF \ Homo_sapiens.GRCh37.87_collapsed.gff
Then I ran subread on the command line as submitted jobs on our shared HPC. We use the SLURM job scheduler.
#Program:featureCounts v1.5.3; Command: featureCounts -p -B -s 2 -T 16 -Q 10 -f -O -F GTF -a Homo_sapiens.GRCh37.87_collapsed.gtf -t exon -g gene_id \ -o sample_withJunctionsOnGenome_dupsFlagged_stranded.exon.cts \ sample_withJunctionsOnGenome_dupsFlagged.bam
I have two questions:
1) Why would I have exons for a gene that are only a few base-pairs long? I have some entries, in the example output below, where the length is calculated as only 1 base-pair. This does not seem representative of the true expression of a real gene, which would not have exons of such short length. Could it be there was some error in the commands run above or in the way the GTF was produced?
2) The number of assigned features for this quantification was particularly low. I have two sets of RNASeq data; 1) created with Poly-A enrichment in the library prep and 2) produced with the ribodepletion (RBS) library protocol. Both sets had paired-end, strand-specific, 75bp reads, and were processed through identical bioinformatic pipeline from raw FASTQ to BAM. I did check that the orientation of the paired reads were the same for both library preparation protocols. It was these BAMs that I input to Subread featurecounts() function.
For my poly-A library, I was getting on average 60% of my total reads assigned to features using the same commands as above, and same GTF reference.
For my RBS library, I am getting on average 30-40% of my total reads assigned to features, again using the same commands as above and same GTF reference. The vast majority of reads, from the summary files produced by Subread and then summarized into one figure by MultiQC, for all samples were classified as "Unassigned_NoFeatures".
I was thinking that this might make sense due to many additional non-protein coding RNAs that the Ribodepletion protocol will sequence. Has anyone seen this type of difference before? Again, should I be optimizing my commands for ribodeplete RNASeq that I am not aware of?
Thank you so much,
Jenny
Example of my output for one sample is below:
Geneid | Chr | Start | End | Strand | Length | Sample_withJunctionsOnGenome_dupsFlagged.bam |
ENSG00000223972 | 1 | 11869 | 11871 | + | 3 | 0 |
ENSG00000223972 | 1 | 11872 | 11873 | + | 2 | 0 |
ENSG00000223972 | 1 | 11874 | 12009 | + | 136 | 1 |
ENSG00000223972 | 1 | 12010 | 12057 | + | 48 | 2 |
ENSG00000223972 | 1 | 12058 | 12178 | + | 121 | 3 |
ENSG00000223972 | 1 | 12179 | 12227 | + | 49 | 9 |
ENSG00000223972 | 1 | 12595 | 12612 | + | 18 | 1 |
ENSG00000223972 | 1 | 12613 | 12697 | + | 85 | 20 |
ENSG00000223972 | 1 | 12698 | 12721 | + | 24 | 10 |
ENSG00000223972 | 1 | 12975 | 13052 | + | 78 | 1 |
ENSG00000223972 | 1 | 13221 | 13224 | + | 4 | 3 |
ENSG00000223972 | 1 | 13225 | 13374 | + | 150 | 3 |
ENSG00000223972 | 1 | 13375 | 13402 | + | 28 | 0 |
ENSG00000223972 | 1 | 13403 | 13452 | + | 50 | 10 |
ENSG00000223972 | 1 | 13453 | 13655 | + | 203 | 7 |
ENSG00000223972 | 1 | 13656 | 13660 | + | 5 | 1 |
ENSG00000223972 | 1 | 13661 | 13670 | + | 10 | 1 |
ENSG00000223972 | 1 | 13671 | 14409 | + | 739 | 1 |
ENSG00000223972 | 1 | 14410 | 14412 | + | 3 | 0 |
ENSG00000227232 | 1 | 14363 | 14403 | - | 41 | 0 |
ENSG00000227232 | 1 | 14404 | 14410 | - | 7 | 0 |
ENSG00000227232 | 1 | 14411 | 14501 | - | 91 | 9 |
ENSG00000227232 | 1 | 14502 | 14502 | - | 1 | 8 |
ENSG00000227232 | 1 | 14503 | 14829 | - | 327 | 51 |