Hi, I have some paired-end RNA samples with similar number of sequenced reads. However, one of them had low number of assigned reads and high number of Unassigned_NoFeatures reads.
Assigned 8,752,014 Unassigned_Ambiguity 4,523,883 Unassigned_MultiMapping 0 Unassigned_NoFeatures 49,769,215 Unassigned_Unmapped 6,918,731 Unassigned_MappingQuality 0 Unassigned_FragmentLength 0 Unassigned_Chimera 0 Unassigned_Secondary 19,635,817 Unassigned_Nonjunction 0 Unassigned_Duplicate 0
This category means “The fragment mapped to a region that is not annotated in the annotation file”. To confirm it, I selected 100K reads marked “Unassigned_NoFeatures” and get their mapped coordinates from BAM file using "bedtools bamtobed" 2.24.0. They were then overlapped with all exons annotated in the GTF file used to run featureCounts using "bedtools intersect" 2.25.0 (I somehow used different versions, which doesn't look affecting the analysis though). These overlapped but not assigned reads should be due to their strands (could there be any other reasons?).
The strands of overlapping reads and exons were then checked. I found when mapping quality >3, all strands are forward strands, which doesn't meet with the setting (reverse strand) used to run featureCounts. So these reads should be unsigned. However, there were some overlaps meeting the setting. Interestingly, all of them had mapping quality 3 (mapping was done by STAR). I wonder how these reads were also not assigned. Two of them are listed below (output of bedtools intersect).
chr1 630896 630987 MG00HS09:702:C78TLACXX:7:1101:10090:91329/1 3 - chr1 630895 630958 ENSG00000276171.1 . + ENSEMBL exon . gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; exon_number 1; exon_id "ENSE00003717538.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1 630865 630966 MG00HS09:702:C78TLACXX:7:1101:10090:91329/2 3 + chr1 630895 630958 ENSG00000276171.1 . + ENSEMBL exon . gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; exon_number 1; exon_id "ENSE00003717538.1"; level 3; tag "basic"; transcript_support_level "NA";
Note that the above is a read pair. First 6 columns are from BAM file (converted to BED format by "bedtools bamtobed"), latter columns are from GTF file. Read 1 was reversely ("-" in column 6) mapped to a forward gene and read 2 forwardly ("+") mapped to the same gene. This looks to be a perfect reverse strand. So why they were not assigned? Or is there anything wrong?
Such "strange" unassigned reads are the minority (110) anyway. Most unassigned reads (6170) did have forward strands. Except for the above "1-+" and "2++", other strange unassigned reads had "1+-" and "2--". Correctly unassigned reads were 1++, 1--, 2+- and 2-+. The command I used for featureCounts was:
isPairedEnd=TRUE, strandSpecific=2, countPrimaryAlignmentsOnly=TRUE, annot.ext='xxx.gtf', isGTFAnnotationFile=TRUE
BTW, this sample has highest ambiguity of strand: reverse strand ~56% (forward strand ~33%), while for all other samples reverse strands are higher than 70%, as inferred by program “infer_experiment.py” in package RSeQC, which may contribute to its high unassigning rate.
I extracted the two lines in the BAM file and list below. Note they are shown in "inline" format (you can scroll horizontally to see the rest of the whole lines). Same format was also used for the "bedtools intersect" output above (so there are 10 GTF fields after the beginning 6 converted BAM fields).
MG00HS09:702:C78TLACXX:7:1101:10090:91329 419 chr1 630866 3 101M = 630897 122 CAGCTAAGCACCCTAATCAACTGGCTTCAATCTACTTCTCCCGCCGCCGGGAAAAAAGGCGGGAGAAGCCCCGGCAGGTTTGAAGCTGCTTCTTCGAATTT @CCFFFFFHHGHHJIJJIJIIIGHIJIIIIJIJIJJJIIFIFIJJDHIJHE?BAEEB@DABDD@BBDDCDDDDDD5>B?CDBDDDDDDDDDCACC@<8>AD NH:i:2 HI:i:2 AS:i:176 NM:i:0 MD:Z:101
MG00HS09:702:C78TLACXX:7:1101:10090:91329 339 chr1 630897 3 10S91M = 630866 -122 GGTCTTCAACCTCCTCCCCCCCCCGCGGGAAAAAAAGGGGGGAGAAGCCCCGGCAGGTTTGAAGCTGCTTCTTCGAATTTGCAATTCAATATGAAAATCAC ##############################?AA;5'9,'E?)77JJIJJIIJIDJJJIIJJJJIJHEJJIJJJIIJIIIJJJJJJJIJHHHHFFDEDFBBB NH:i:2 HI:i:2 AS:i:176 NM:i:7 MD:Z:2A2T1T3G4C2G8C62