featureCounts - paired-end data
1
0
Entering edit mode
@samuel-collombet-6574
Last seen 7.6 years ago
France

Hi,

In single-end mode, does feature counts use the the template length (col9 in sam file) to check for an overlap with the feature, or the read length?

I have paired-end data with no unmated reads, I allow multimatchs, so I would have assume that feature count use the template length, my count in in single-end mode should be exactly twice than my count in paired-end, but it does not seems to be the case...

I would like to use the single end mode because my bam is quite big, and already sorted by coordinates, and I takes much more time to ask featureCounts to re-order the reads to use paire end mode (and I have a lot of bam to count, onto various features, so it is taking too long, and I cannot keep the bam sorted by names on my disk because of storage problems...).

featurecounts rsubread • 20k views
ADD COMMENT
0
Entering edit mode

Have you read the featureCounts webpage or paper? They both seem to answer your questions.

ADD REPLY
0
Entering edit mode

Hi James,

I did read it but was not sure to understand ; I re-read it after your message, does:

"A read is said to overlap a feature if at least one read base is found to overlap the feature. For paired-end data, a fragment (or template) is said to overlap a feature if any of the two reads from that fragment is found to overlap the feature" (webpage)

and

"featureCounts preforms precise read assignment by comparing mapping location of every base in the read or fragment with the genomic region spanned by each feature" (paper)

mean that it take into account the read himself, and not the fragment length? (now that you tell me the answer is there I guess it is so, but I was not sure before...sorry)

ADD REPLY
1
Entering edit mode

Hi Samuel,

It's just a logic problem, and I think you are over thinking it. Let's say you have paired end data, and it looks like this:

XXXXXXXXXXXXXX              XXXXXXXXXXXX  <- these are your pairs

          *******************                           <- this is an exon

That read pair overlaps the exon of that gene, so if you are using them as paired ends, you get a single count. However, if you say they are single ends, you still get one count. Now consider

XXXXXXXXXXXXXXX             XXXXXXXXXXXXXXXXXX <- paired reads

   ***********************                     *****************  <- exons from same gene

In this case you get one count from paired-end mode, and two counts from single end mode.

XXXXXXXXXXXXXXXX              XXXXXXXXXXXXXX  <- paired reads

*************                      +++++++++++++    <- exons from two different genes

In this situation, counting in paired end mode gives you zero counts, but in single end mode gives you TWO counts, one for each gene.

XXXXXXXXXXXX                          XXXXXXXXXXXXXX <- paired reads

                          *************                                   <- exon

What do we get here? Zero counts! Regardless of how you count, because the reads don't overlap the exon. You could argue that this should count as one count for this gene (and it would if you aligned against the entire extent of each gene, from 5' UTR to 3' UTR, rather than the exonic regions).

You would be surprised to see how many more counts you get if you summarize based on the overlap within the gene body, rather than just the exons - there is apparently lots of nascent mRNA floating about, that hasn't had the introns excised yet. The same is true for the Affy arrays that have intronic controls that are supposed to measure background - these controls have surprisingly high binding. But I'm getting off on a tangent here...

ADD REPLY
0
Entering edit mode

Hi James,

I do see your point and I do see why my question does not make sense for you :) I should have mention that I am not working with RNAseq data (ie counting reads on exons of genes) but with ChIPseq data, counting fragments of IP on enriched regions. That's why I am more interested in the fragment as one entity and not as two read mates paired together. Maybe I should rather transform my reads data as single fragments data and count those, that would solve the problem...

ADD REPLY
1
Entering edit mode

If you are doing ChIP-Seq, then I doubt you want to be using featureCounts anyway. Instead you should probably be using the csaw package.

ADD REPLY
0
Entering edit mode

Ho Thanks a lot, I did not know about this package!! I was using DESeq with counts on peaks (with an home made variant of the pipeline developed in the DiffBind package), but this look actually much better than my tinkering!

ADD REPLY
0
Entering edit mode

Your explanation makes it very clear how the counting is performed. I think though that the way the option is named and the way its description is worded, makes it not entirely clear what precisely is counted and what isn't. "count read pairs" could be reasonably inferred to mean mapped read pairs, and the user is left to guess what happens when only one partner of a pair maps to a feature. The description does not clear up this ambiguity. And in particular, when users whose goal is not differential expression, are trying to leverage the basic counting functionality, it's those use cases where the ambiguity of the option name/description is more problematic. My 2¢.

ADD REPLY
1
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 18 days ago
Australia/Melbourne

For paired end reads, you should count read pairs (fragments) rather than reads because counting fragments will give you more accurate counts.

There are several reasons why you cannot get the fragment counts by simply dividing the counts you got from counting reads by two. One reason is that a fragment with two mapped reads will give you two counts when you count reads, but a fragment with only one mapped read will only contribute one count (this fragment should get 1 count in fragment counting but it ended up with 0.5 count when you count reads instead of fragments).  Another reason is that some reads may be found to be overlapping with more than one gene and therefore were not counted, but the corresponding fragments may be counted because the ambiguity was solved by longer fragment length.

 

 

ADD COMMENT
0
Entering edit mode

Thanks Wei Shi, However I still would like to know what is the region considered to be covered by a read/fragment :)

Concerning you answer, as I mention I do not have un-mated pairs (I use STAR for mapping, which consider the pair as one sequence to map, so he does not give un-mated reads from paired end).
For your second point, although I agree that taking into account the pair will change the assignment of the reads, I do not see how your example is possible: a single mate reads is not assign because overlapping 2 features, but the pair is there is no more ambiguity? Is that possible? I see how the opposite is possible (single reads is unambiguously assign to one feature, but the pair is not because the 2 mates overlap 2 different features), but not this scenario.
Anyway if the program consider the fragment length in the SAM 9th column, paired end or not does not change anything as I have only mated pairs, and I am not taking the strand into account, so there are always 2 fragment overlaping the exact same region :)

ADD REPLY
0
Entering edit mode

Firstly, it sounds strange to me that you used STAR to map your chip-seq data since STAR is designed for mapping RNA-seq data.

Un-mated pairs are those pairs which have one or both ends unmapped or have both ends mapped but mapped to a distance much greater than the nominal fragment length (or mapped to different chromosomes). Are you sure you did not have any un-mated read pairs in your mapping results? I have never seen such mapping results.

A problem with using fragment length to assign fragments to features is that the reported fragment lengths in the mapping results could be very big, meaning that they may span more than one feature. As a result of that, the fragments will be incorrectly assigned to all of their overlapping features or not assigned at all (depending on the setting).

For each read pair, featureCounts checks if each of the two reads in the pair can be unambiguously assigned. As long as one read is found to be able to be unambiguously assigned, featureCounts will take the whole read pair as a unambiguously assigned fragment. This is the reason why read pairs are more likely to be assigned unambiguously than reads in featureCounts.

What you really want to have might be to extend the reads to their full fragment length before assigning them to features. This is a better way to take into account fragment length in your read assignment. The options "readExtension5" and "readExtension3" in featureCounts can help you to do this. But the csaw package is certainly a good alternative.

 

ADD REPLY
0
Entering edit mode

For the use of STAR for ChIPseq (or genomic DNA in general), I think I am not the only one who tried (see https://groups.google.com/forum/#!topic/rna-star/E_mKqm9jDm0). I compared STAR mapping to bowtie2 and got quite similar results (in therm of mapping efficiency, signal  correlation and enriched regions). So i use it as it is much faster. But if you have some examples of bad results, it would be great to have your feedback!

for the readExtension5/3, you have to give a fixed length, right? My problem is that I would like to use the fragment length, wich is different for each read... 
At the end in this case, it is just easier (and probably better) to use csaw, but thanks for all the clarifications :)

 

ADD REPLY
0
Entering edit mode

Hey Wei shi,

May I ask you something related to featureCounts?

I have only paired end reads (RNAseq) mapping to its genome by bowtie2, and I use featureCounts, too. Could you help me understand my code :

featureCounts -t gene -g "Id" (the rest default, not specified)

Does this code make sure that paired-end reads matching to a gene only count 1? Or 2?

I am not sure should I specify -p to count fragment not reads, and could you help me understand if the following code give the same output as the above one?

featureCounts -p -t gene -g "Id" (the rest default, not specified)

I am wondering since I only have paired reads these two codes should be both OK? right?

Last question: when should I specify strandness? does strandSpecific=2 or 1 give half of count than not specified strandness(0)?

I am very confused about this, and I am not sure if what I understand is right or wrong. I hope to get some confirmation or correction from you.

Thanks a lot. Yanfang

ADD REPLY
0
Entering edit mode

You need to specify -p and --countReadPairs to count fragments instead of reads. The value provided to the -t option is typically exon. Regarding stranded counting, you only need to use it when your data were generated from using a stranded sequencing protocol. Reads are always given a full count when stranded counting is performed.

ADD REPLY
0
Entering edit mode

Thanks Wei Shi. when is -t gene reccomanded? In my bacteria gff file, I do not see exon, but in my fungi gff file, I do. I wonder can I just set -t gene for both bacteria and fungi? I really appreciate your help. Best regards, Yanfang

ADD REPLY
0
Entering edit mode

If there is no exons in your annotation file then using gene is probably allright.

ADD REPLY
0
Entering edit mode

Hi Dr. Wei Shi,

I checked the SubreadUsersGuide.pdf on github, and found in section 6.3, there is a code example, here I quoted:

Summarize paired-end reads and count fragments (instead of reads):
featureCounts -p -a annotation.gtf -t exon -g gene id -o counts.txt mapping_results_PE.bam

It seems the manual's code example doesn't have a --countReadPairs tag, does the manual miss this tag?

I also wonder if this --countReadPairs is always recommended for counting Pair-end sequencing data, is there any plan to make it a default option for the featureCounts software?

Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 537 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6