Question

Feature Counts - potential bug with outfile.jcounts

chris2.a.white • 0

@b9ada64c

Last seen 3 months ago

Australia

Greetings to those familiar with Feature Counts,

We have one RNA sequencing, paired-end, 150bp file for which we estimate gene and junction counts using Feature Counts. We run Feature Counts in Linux.

Feature Counts version: 2.0.8

Feature Counts code:

$FCOUNTS -a GCF_000001405.40_GRCh38.p14_genomic.gtf -G GCF_000001405.40_GRCh38.p14_genomic.fna  -T 8 -J -M -s 0 -p --countReadPairs --largestOverlap -o outfile infile.bam

We have documented the following instance in a junction counts output file (outfile.jcounts) :

Genes:

PNKD TMBIM1

Situation (see attached picture in IGV):

The PNKD gene overlaps the TMBIM1 gene and all of its transcripts. All TMBIM1 transcripts lie within a PNKD intron. All TMBIM1 junction reads (~1000) are shared between PNKD and TMBIM1 across multiple intron junctions. In outfile.jcounts, PNKD is listed as the primary gene, yet it has no exons in this region. When we remove the PNKD gene from the GTF and recount with Feature Counts, the reads are assigned to TMBIM1 as the primary gene with NA as the secondary gene, as they should be. A check of gene counts using exon as the feature correctly assigns counts to genes, so this appears to be only a junction-counting issue. The .gtf and .fna are the RefSeq 1405.40 files (as noted above) and are untouched. We use this as a simple verification of raw junction counts before analysing PSI. As you will also note, the reverse also occurs: PNKD exon-exon junctions share read counts with a partially overlapping gene, CATIP-AS2. We have encountered other cases and believe this may be widespread.

Here is the outfile.jcounts of that region:

PNKD    NA  NC_000002.12    218270602   +   NC_000002.12    218271381   +   17
PNKD    NA  NC_000002.12    218270602   +   NC_000002.12    218271424   +   4
PNKD    NA  NC_000002.12    218271549   +   NC_000002.12    218272570   +   14
PNKD    TMBIM1  NC_000002.12    218275621   -   NC_000002.12    218276026   -   74
PNKD    TMBIM1  NC_000002.12    218276079   -   NC_000002.12    218277004   -   46
PNKD    TMBIM1  NC_000002.12    218277099   -   NC_000002.12    218277366   -   90
PNKD    TMBIM1  NC_000002.12    218277453   -   NC_000002.12    218277633   -   96
PNKD    TMBIM1  NC_000002.12    218277670   -   NC_000002.12    218277935   -   73
PNKD    TMBIM1  NC_000002.12    218277974   -   NC_000002.12    218278515   -   123
PNKD    TMBIM1  NC_000002.12    218277992   -   NC_000002.12    218278515   -   1
PNKD    TMBIM1  NC_000002.12    218278118   -   NC_000002.12    218278515   -   2
PNKD    TMBIM1  NC_000002.12    218278565   -   NC_000002.12    218279038   -   128
PNKD    TMBIM1  NC_000002.12    218279091   -   NC_000002.12    218279289   -   101
PNKD    TMBIM1  NC_000002.12    218279091   -   NC_000002.12    218280026   -   6
PNKD    TMBIM1  NC_000002.12    218279353   -   NC_000002.12    218280026   -   109
PNKD    TMBIM1,MIR6513  NC_000002.12    218280126   -   NC_000002.12    218281940   -   26
PNKD    TMBIM1,MIR6513  NC_000002.12    218280126   NA  NC_000002.12    218292466   NA  34
PNKD    TMBIM1  NC_000002.12    218282181   -   NC_000002.12    218284039   -   1
PNKD    TMBIM1  NC_000002.12    218282181   -   NC_000002.12    218285763   -   1
PNKD    TMBIM1  NC_000002.12    218282181   -   NC_000002.12    218287207   -   1
PNKD    TMBIM1  NC_000002.12    218282181   NA  NC_000002.12    218292466   NA  71
PNKD    LOC105373881    NC_000002.12    218316364   +   NC_000002.12    218385147   +   3
PNKD    LOC105373880,LOC105373881   NC_000002.12    218316883   +   NC_000002.12    218317955   +   2
PNKD    CATIP-AS2   NC_000002.12    218323428   NA  NC_000002.12    218341591   NA  5
PNKD    CATIP-AS2   NC_000002.12    218323431   +   NC_000002.12    218339783   +   52
PNKD    CATIP-AS2   NC_000002.12    218339898   +   NC_000002.12    218340029   +   27
PNKD    CATIP-AS2   NC_000002.12    218340141   +   NC_000002.12    218340728   +   8
PNKD    CATIP-AS2   NC_000002.12    218340786   +   NC_000002.12    218341534   +   5
PNKD    CATIP-AS2   NC_000002.12    218341626   +   NC_000002.12    218341981   +   12
PNKD    CATIP-AS2   NC_000002.12    218342144   +   NC_000002.12    218343500   +   12
PNKD    CATIP-AS2   NC_000002.12    218343586   +   NC_000002.12    218344455   +   15
PNKD    CATIP-AS2   NC_000002.12    218344570   +   NC_000002.12    218344808   +   28

Has anyone encountered this before and know where things have gone wrong? Prof Smyth suggested we post here instead of Biostars, to hear some constructive feedback from the Subread authors.

Thanks in advance, Chris.

[Attached images: PNKD_TMBIM1, PNKD_TMBIM1_closeup]

featurecounts Rsubread • 1.3k views

ADD COMMENT • link 4 months ago • updated 3 months ago chris2.a.white • 0

When you use featureCounts to count junctions with the -J option, it doesn't check whether the junction exactly matches the exon boundaries of any gene. Instead, it looks at the whole gene region (from the first base to the last base of each gene). If the junction falls anywhere within the region of a gene, it is assigned to that gene.

ADD REPLY • link 4 months ago Yang Liao ▴ 450

Hi Yang Liao, thanks for the note.

We use Feature Counts extensively for gene-level counting (exon features).

Unfortunately, though, based on your comments I disagree with how the function is defined and how it works. I'll wait for other people's professional opinions before commenting on this further.

I feel users should not have to go and check whether the splice junctions indeed exist in every primary gene of the list (now that I know it is not tracking junction counts as per the annotation).

Further, there are possibly 15 000 occurrences of this in this one BAM file aligned to RefSeq 1405.40. GENCODE has similar overlapping genes on the same strand and on the opposite strand.

Also in the example provided, the junction data rows correctly annotate the strand and coordinates of TMBIM1 (strand negative) in site 1 and site 2, not PNKD (strand positive). The only column recording any information relating to PNKD is the primary gene column.

Other programs annotate and count the splices correctly, using the two exons, their corresponding transcript IDs, and the transcript IDs' corresponding gene IDs to build the junction (never using the gene start and end coordinates as a bucket for any junctions located in the region).
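For illustration, a minimal Python sketch of that exon-based approach: it groups GTF exon lines by transcript and takes each pair of consecutive exons as one annotated junction. The function name `annotated_junctions` and the toy GTF lines are hypothetical, and the coordinate convention (last base of the left exon, first base of the right exon, in GTF 1-based coordinates) is an assumption that may need an offset before comparing with positions in outfile.jcounts.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def annotated_junctions(gtf_lines):
    """Map (chrom, site1, site2, strand) -> set of gene_ids, from GTF exon lines."""
    exons = defaultdict(list)  # (gene, transcript, chrom, strand) -> [(start, end)]
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue
        key = (attrs["gene_id"], attrs["transcript_id"], fields[0], fields[6])
        exons[key].append((int(fields[3]), int(fields[4])))

    junctions = defaultdict(set)
    for (gene, _transcript, chrom, strand), intervals in exons.items():
        intervals.sort()  # genomic order, so site1 < site2 on either strand
        for (_, left_end), (right_start, _) in zip(intervals, intervals[1:]):
            junctions[(chrom, left_end, right_start, strand)].add(gene)
    return dict(junctions)

# Toy example: two exons of one hypothetical transcript give one junction.
toy_gtf = [
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "A"; transcript_id "A.1";',
    'chr1\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "A"; transcript_id "A.1";',
]
for junction, genes in sorted(annotated_junctions(toy_gtf).items()):
    print(junction, sorted(genes))
```

Because the result is keyed by junction rather than by gene region, a gene that merely spans the junction without having exons there never appears in the gene set.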

While it was extremely convenient and useful to use -J and obtain junction counts at the same time as our gene counts, now that I know it is incorrect, I will no longer be able to use this feature. We almost submitted this data to external reviewers and it would have been incorrect. However, we counted the junctions with several programs and documented the discrepancy.

Having said all of the above, we hope the moderators agree there is an error here that should be addressed, because having this feature in FeatureCounts is extremely useful.

Thanks for considering, Chris.

ADD REPLY • link 4 months ago chris2.a.white • 0

score 2 · Answer 1 · 2024-12-12

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 months ago

Australia/Melbourne

We will further develop the junction quantification module to support the use of user-provided annotation data for annotating detected junctions. We aim to complete the development and testing before the next bioc release.

ADD COMMENT • link 3 months ago Wei Shi ★ 3.6k

Hi Gordon Smyth and Wei Shi,

Firstly, thank you for offering your time to develop the junction quantification module further. This is great news! Looking forward to trying the next version.

Gordon Smyth, I hear your point regarding the novel junctions at the aligner level, and agree. All aligners, I guess, theoretically detect junctions de novo, regardless of whether an annotated junction file is provided (which gives them a preference of where to search), but the junction will map to where it maps most correctly.

When counting at the exon feature level and requesting a junction output file, using a GTF of annotated gene and exon lines to map reads, it is assumed that a read spanning two annotated exons would correspond to the appropriate transcript/gene junction reported. As mentioned, still tracking the novel junctions is a must and is part of other splice-counting software (multiple columns and/or files are used to separate novel from annotated junctions).

An amazing functionality would be the proximity output column as described earlier, so keen to hear your feedback on that : "a novel junction to annotated junction proximity output column. So if the novel junction read is n bases (set by the user) away from the annotated splice then report as "novel", but, with an annotation proximity value. And, the ability to collapse the counts into the annotated junction (another flag as set by the user)."
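To make the request concrete, here is a rough Python sketch of how such a proximity column and collapse flag could behave. It is not an existing featureCounts option: the function names, the `max_dist` and `collapse` parameters, and the distance measure (the larger of the two splice-site offsets) are all assumptions chosen for illustration. Junctions whose strand is reported as NA would never match in this sketch, since strands are compared exactly.

```python
def nearest_annotated(junction, annotated, max_dist):
    """Closest annotated junction on the same chromosome and strand.

    Returns (annotated_junction, distance), or (None, None) if nothing lies
    within max_dist bases. Distance is the larger of the two site offsets.
    """
    chrom, site1, site2, strand = junction
    best, best_dist = None, None
    for candidate in sorted(annotated):  # sorted for deterministic tie-breaks
        c_chrom, c_site1, c_site2, c_strand = candidate
        if c_chrom != chrom or c_strand != strand:
            continue
        dist = max(abs(c_site1 - site1), abs(c_site2 - site2))
        if dist <= max_dist and (best_dist is None or dist < best_dist):
            best, best_dist = candidate, dist
    return best, best_dist

def annotate_counts(counts, annotated, max_dist=3, collapse=False):
    """Return {junction: (status, proximity, count)}.

    Exact matches are "annotated" with proximity 0. Near misses stay "novel"
    with a proximity value, unless collapse=True, in which case their counts
    are added to the nearby annotated junction.
    """
    out = {}
    for junction, n in counts.items():
        hit, dist = nearest_annotated(junction, annotated, max_dist)
        if hit is not None and (dist == 0 or collapse):
            key, status, dist = hit, "annotated", 0
        else:
            key, status = junction, "novel"
        previous = out.get(key)
        out[key] = (status, dist, n + (previous[2] if previous else 0))
    return out

# Toy data (hypothetical coordinates and counts).
annotated = {("chr1", 200, 300, "+")}
counts = {("chr1", 200, 300, "+"): 10, ("chr1", 201, 300, "+"): 2}
print(annotate_counts(counts, annotated, max_dist=3))
print(annotate_counts(counts, annotated, max_dist=3, collapse=True))
```

Keeping the collapse step optional means the raw novel junctions stay visible by default, and merging near misses into annotated junctions remains an explicit user choice.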

One last issue we had prior to this post: whilst analysing ambiguous reads, we found instances where the upcoming changes to the junction counting approach would help remove the ambiguity between overlapping genes (please see the attached IGV image). It is obvious that the 21 reads map to a CASZ1 exon-exon boundary on the negative strand but fail to be assigned as counts to CASZ1 because of the positive-strand overlap with a single exon/transcript, ENSG00000272078.

Thanks again for your time, Chris

[Attached image: casz1_ambiguous]

ADD REPLY • link 3 months ago chris2.a.white • 0

score 1 · Answer 2 · 2024-12-11

The behavior that you have documented is expected and is explained in the featureCounts documentation.

The first thing to understand is that featureCounts detects junction reads de novo, without reference to the GTF annotation. This is a very powerful facility, because it allows featureCounts to find novel exon-exon junctions that are not annotated. Experience shows that novel transcripts and junctions that are not in any GTF file are empirically very common in RNA-seq data. However, annotating novel junctions is not straightforward and doing that comprehensively is somewhat beyond the current scope of featureCounts.

In general, featureCounts has no way of knowing for sure which gene an exon-exon junction belongs to. The junction might be novel and it might originate from exons or transcripts that are not in the GTF file. So featureCounts currently takes the very simple approach of listing all the possible genes that the junction might potentially belong to. Specifically, it lists all the genes for which the junction falls within the genomic region, from TSS to TES, of that gene.

Each junction consists of two genomic positions ("splice sites"). If there is only one gene containing both splice sites, then that gene is specified as the "Primary Gene" and all other candidates are listed as "Secondary Genes". If there is only one gene containing either of the two splice sites then, again, that gene is specified as the "Primary Gene" and all other candidates are listed as "Secondary Genes". If there are two or more genes that contain both splice sites, then featureCounts makes an arbitrary choice -- it chooses the candidate gene with the first-occurring genomic region to be the "Primary Gene". This behaviour is explained pretty clearly in the documentation, which says:

When the primary and secondary genes overlap same number of splice sites, the gene with the smallest leftmost base position is selected as the primary gene.
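The rule described above can be sketched in a few lines of Python. This is an illustration of the documented behaviour as described in this thread, not the featureCounts source code: the function name `assign_genes`, the gene names, and the coordinates are hypothetical, and the ordering of the secondary genes is an assumption.

```python
def assign_genes(site1, site2, genes):
    """Pick primary and secondary genes for one junction.

    site1, site2: (chrom, position) of the two splice sites.
    genes: {name: (chrom, start, end)} giving each gene's full region.
    Returns (primary, [secondary, ...]), or (None, []) if no gene contains a site.
    """
    hits = {}
    for name, (chrom, start, end) in genes.items():
        n_sites = sum(
            1 for s_chrom, pos in (site1, site2)
            if s_chrom == chrom and start <= pos <= end
        )
        if n_sites:
            hits[name] = (n_sites, start)
    if not hits:
        return None, []
    # More splice sites covered wins; ties go to the smallest leftmost position.
    ranked = sorted(hits, key=lambda g: (-hits[g][0], hits[g][1], g))
    return ranked[0], ranked[1:]

# Toy layout: INNER lies entirely inside OUTER, as in a nested-gene situation.
genes = {
    "OUTER": ("chr1", 1000, 9000),
    "INNER": ("chr1", 4000, 5000),
    "EDGE": ("chr1", 8500, 12000),
}
print(assign_genes(("chr1", 4100), ("chr1", 4800), genes))
print(assign_genes(("chr1", 8900), ("chr1", 11000), genes))
```

In the first call both OUTER and INNER contain both splice sites, so the gene whose region starts further left (OUTER) becomes the primary gene even though the junction sits inside INNER. That is the same pattern as the PNKD and TMBIM1 rows in the question.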

Now you may say, featureCounts could check whether the detected junction matches one of the annotated exon-exon junctions in the GTF file, because that might resolve which candidate gene is the correct Primary Gene, and I would agree with you. I have been thinking about this for a couple of years, but it is not as straightforward as you might think. What if the junction is just one base different from an annotated junction? Is that acceptable variation? Perhaps we might associate the junction with the annotated junction, on the same strand, that it is closest to? But such a strategy is far from foolproof and involves more programming. So, for the time being, featureCounts is taking a simple and sometimes arbitrary choice that is nevertheless clearly documented.
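As a sketch of the exact-match idea only, the following Python shows one way a post-processing step could re-rank the candidate genes when a junction matches an annotated exon-exon junction, falling back to the region-based order otherwise. Everything here is hypothetical: the function name `resolve_primary`, the status labels, and the assumption that a lookup of annotated junctions per gene has already been built from the GTF file. It makes no attempt to handle the one-base-off case raised above.

```python
def resolve_primary(junction, candidates, annotated):
    """Re-rank candidate genes using exact matches to annotated junctions.

    junction: (chrom, site1, site2, strand).
    candidates: gene names in region-based order (current primary gene first).
    annotated: {junction: set of gene names annotated with that junction}.
    Returns (primary, secondary_list, status).
    """
    if not candidates:
        return None, [], "unassigned"
    supported = annotated.get(junction, set())
    matches = [g for g in candidates if g in supported]
    if len(matches) == 1:
        # Exactly one candidate gene annotates this junction: promote it.
        primary = matches[0]
        return primary, [g for g in candidates if g != primary], "annotated"
    # No match, or several genes share the junction: keep the region-based order.
    status = "ambiguous" if matches else "region"
    return candidates[0], list(candidates[1:]), status

# Toy data: the junction is annotated only in the nested gene INNER.
annotated = {("chr1", 4100, 4800, "-"): {"INNER"}}
print(resolve_primary(("chr1", 4100, 4800, "-"), ["OUTER", "INNER"], annotated))
print(resolve_primary(("chr1", 4101, 4800, "-"), ["OUTER", "INNER"], annotated))
```

The second call shows the limitation discussed above: a junction that differs by a single base gets no exact match, so the arbitrary region-based choice is kept.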

The behavior is not a bug or an error, because the software behaves as documented.