Question

Reads in Unassigned_NoFeatures category produced by featureCounts

3

Entering edit mode

Likai Mao ▴ 40

@likai-mao-9454

Last seen 8.7 years ago

QIMR, Brisbane, Australia

Hi, I have some paired-end RNA samples with similar number of sequenced reads. However, one of them had low number of assigned reads and high number of Unassigned_NoFeatures reads.

Assigned    8,752,014
Unassigned_Ambiguity    4,523,883
Unassigned_MultiMapping    0
Unassigned_NoFeatures    49,769,215
Unassigned_Unmapped    6,918,731
Unassigned_MappingQuality    0
Unassigned_FragmentLength    0
Unassigned_Chimera    0
Unassigned_Secondary    19,635,817
Unassigned_Nonjunction    0
Unassigned_Duplicate    0

This category means “The fragment mapped to a region that is not annotated in the annotation file”. To confirm it, I selected 100K reads marked “Unassigned_NoFeatures” and get their mapped coordinates from BAM file using "bedtools bamtobed" 2.24.0. They were then overlapped with all exons annotated in the GTF file used to run featureCounts using "bedtools intersect" 2.25.0 (I somehow used different versions, which doesn't look affecting the analysis though). These overlapped but not assigned reads should be due to their strands (could there be any other reasons?).

The strands of overlapping reads and exons were then checked. I found when mapping quality >3, all strands are forward strands, which doesn't meet with the setting (reverse strand) used to run featureCounts. So these reads should be unsigned. However, there were some overlaps meeting the setting. Interestingly, all of them had mapping quality 3 (mapping was done by STAR). I wonder how these reads were also not assigned. Two of them are listed below (output of bedtools intersect).

chr1 630896 630987 MG00HS09:702:C78TLACXX:7:1101:10090:91329/1 3 - chr1 630895 630958 ENSG00000276171.1 . + ENSEMBL exon . gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; exon_number 1; exon_id "ENSE00003717538.1"; level 3; tag "basic"; transcript_support_level "NA"; chr1 630865 630966 MG00HS09:702:C78TLACXX:7:1101:10090:91329/2 3 + chr1 630895 630958 ENSG00000276171.1 . + ENSEMBL exon . gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; exon_number 1; exon_id "ENSE00003717538.1"; level 3; tag "basic"; transcript_support_level "NA";

Note that the above is a read pair. First 6 columns are from BAM file (converted to BED format by "bedtools bamtobed"), latter columns are from GTF file. Read 1 was reversely ("-" in column 6) mapped to a forward gene and read 2 forwardly ("+") mapped to the same gene. This looks to be a perfect reverse strand. So why they were not assigned? Or is there anything wrong?

Such "strange" unassigned reads are the minority (110) anyway. Most unassigned reads (6170) did have forward strands. Except for the above "1-+" and "2++", other strange unassigned reads had "1+-" and "2--". Correctly unassigned reads were 1++, 1--, 2+- and 2-+. The command I used for featureCounts was:

isPairedEnd=TRUE, strandSpecific=2, countPrimaryAlignmentsOnly=TRUE, annot.ext='xxx.gtf', isGTFAnnotationFile=TRUE

BTW, this sample has highest ambiguity of strand: reverse strand ~56% (forward strand ~33%), while for all other samples reverse strands are higher than 70%, as inferred by program “infer_experiment.py” in package RSeQC, which may contribute to its high unassigning rate.

rsubread • 7.9k views

ADD COMMENT • link updated 8.7 years ago by Wei Shi ★ 3.6k • written 8.7 years ago by Likai Mao ▴ 40

score 0 · Answer 1 · 2016-03-09

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 months ago

Australia/Melbourne/Olivia Newton-John …

Can you provide CIGAR info for the two reads included in your post? Your second read mapped to a 102bp region but your first read mapped to a shorter region of 92bp. This makes me wonder if there is a gap in your second read that may result in this read being unassigned to chr1:630895. Your first read does not overlap this coordinate.

ADD COMMENT • link 8.7 years ago Wei Shi ★ 3.6k

0

Entering edit mode

I extracted the two lines in the BAM file and list below. Note they are shown in "inline" format (you can scroll horizontally to see the rest of the whole lines). Same format was also used for the "bedtools intersect" output above (so there are 10 GTF fields after the beginning 6 converted BAM fields).

MG00HS09:702:C78TLACXX:7:1101:10090:91329 419 chr1 630866 3 101M = 630897 122 CAGCTAAGCACCCTAATCAACTGGCTTCAATCTACTTCTCCCGCCGCCGGGAAAAAAGGCGGGAGAAGCCCCGGCAGGTTTGAAGCTGCTTCTTCGAATTT @CCFFFFFHHGHHJIJJIJIIIGHIJIIIIJIJIJJJIIFIFIJJDHIJHE?BAEEB@DABDD@BBDDCDDDDDD5>B?CDBDDDDDDDDDCACC@<8>AD NH:i:2 HI:i:2 AS:i:176 NM:i:0 MD:Z:101

MG00HS09:702:C78TLACXX:7:1101:10090:91329 339 chr1 630897 3 10S91M = 630866 -122 GGTCTTCAACCTCCTCCCCCCCCCGCGGGAAAAAAAGGGGGGAGAAGCCCCGGCAGGTTTGAAGCTGCTTCTTCGAATTTGCAATTCAATATGAAAATCAC ##############################?AA;5'9,'E?)77JJIJJIIJIDJJJIIJJJJIJHEJJIJJJIIJIIIJJJJJJJIJHHHHFFDEDFBBB NH:i:2 HI:i:2 AS:i:176 NM:i:7 MD:Z:2A2T1T3G4C2G8C62

ADD REPLY • link 8.7 years ago Likai Mao ▴ 40

score 0 · Answer 2 · 2016-03-10

Thanks for providing further information. The first read in the pair indeed overlaps with the exon. But this read (and the other read in the pair) was marked as a secondary alignment. Since you counted primary alignments only ('countPrimaryAlignmentsOnly' was set to 'TRUE' in your command), featureCounts would not count it. Could you let us know how you extracted reads from the 'Unassigned_NoFeatures' category? What is the version of Rsubread you use?

score 0 · Answer 3 · 2016-03-14

0

Entering edit mode

Likai Mao ▴ 40

@likai-mao-9454

Last seen 8.7 years ago

QIMR, Brisbane, Australia

I actually used Subread (1.5.0 p1) to create the counts this time. I posted here just because I am used to it. The command line was same as the R version above:

featureCounts -R -p --primary -a xxx.gtf -s 2 -o count.txt

I extracted 'Unassigned_NoFeatures' lines from the output file (xxx.bam.featureCounts) produced by option '-R'. To find the overlaps between exons and rnofeature eads, both GTF and BAM files were converted to BED format using 'gtf2bed' (a program of BEDOPS) and 'bedtools bamtobed', respectively. BAM converted BED file was filtered using a Perl script to keep only lines with 'Unassigned_NoFeatures' reads. The exon and nofeature read overlaps were then obtained by command "bedtools intersect -a bam.bed -b gtf.bed -wa -wb". Please feel free to let me know if you need further details.

ADD COMMENT • link 8.7 years ago Likai Mao ▴ 40

0

Entering edit mode

Thanks Likai. We do need to have more info to work out what went wrong. Could you please provide the records in your gtf annotation file that overlaps the read pair of concern? Could you also let us know which genome you mapped your data to?

ADD REPLY • link 8.7 years ago Wei Shi ★ 3.6k

0

Entering edit mode

No worries. The related records in GTF file is as below.

chr1    ENSEMBL transcript      630896  630958  .       +       .       gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1    ENSEMBL exon    630896  630958  .       +       .       gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; exon_number 1; exon_id "ENSE00003717538.1"; level 3; tag "basic"; transcript_support_level "NA";

The GTF is from GENCODE (v24). The reference genome is also from GENCODE (GRCh38.p5) for best compatibility.

ADD REPLY • link 8.7 years ago Likai Mao ▴ 40

score 0 · Answer 4 · 2016-03-16

The read pair of concern ("MG00HS09:702:C78TLACXX:7:1101:10090:91329") has two reported alignments - a primary one and a secondary one (which is shown in your post). The "xxx.bam.featureCounts" file includes both the primary one and the secondary one and these two alignments have identical read name. When you looked for the "Unassigned_NoFeature" alignments, the one you got was the primary alignment (the secondary alignment was in the category of "Unassigned_Secondary"). However when you searched this read in your bam file, you got both primary and secondary alignments and then you found that the secondary alignment overlap with that exon but this secondary alignment wasn't the one you found originally from the "xxx.bam.featureCounts" file.

If you search the name of this read pair ("MG00HS09:702:C78TLACXX:7:1101:10090:91329") in your "xxx.bam.featureCounts" file, you should find two records - one is in the "Unassigned_NoFeature" category and the other in the "Unassigned_Secondary" category. The alignment you were investigating was the one assigned to the "Unassigned_Secondary" category, which overlaps with your exon but was not counted because you only counted primary alignments.