Question

DiffBind full library size not reliable

0

Entering edit mode

Marianna ▴ 20

@7cc5052f

Last seen 7 days ago

Italy

Dear colleagues,

I'm using DiffBind for the first time and I'm encountering some issues, one in particular is quite weird.

I've got a simple experiment, 3 conditions, 4 replicates each. I've obtained peaks using MACS2 (-- broad function).


   SampleID Condition Replicate
1        S1   control         1
2        S2   control         2
3        S3   control         3
4        S4   control         4
5        S5   flu_low         1
6        S6   flu_low         2
7        S7   flu_low         3
8        S8   flu_low         4
9        S9  flu_high         1
10      S10  flu_high         2
11      S11  flu_high         3
12      S12  flu_high         4

After doing the dba.count and looking at the dba object I realized that the number of reads per sample is extraordinarily high (at least two times the actual library size, which was around 60M SE). How is it possible?

Is this linked to the fact that I retained multi-mappers reads?

Any advice would be greatly appreciated.

Best

Marianna

DiffBind Reads MACS2 FullLibSize • 948 views

ADD COMMENT • link updated 8 months ago by Rory Stark ★ 5.2k • written 8 months ago by Marianna ▴ 20

0

Entering edit mode

It's unclear what you're asking. Please always provide some minimal code, data examples or plots to illustrate the problem. Generally though, from what I know and have seen over the last years, multimappers are usually not included in the analysis. If this is the issue here is unclear. Are you referring to raw counts or normalized counts?

ADD REPLY • link 8 months ago ATpoint ★ 4.8k

0

Entering edit mode

Thank you ATpoint for your reply.

I'll try to describe the issue in more detail.

I've mapped my SE reads to the ref genome using bowtie2 --end-to-end -k 6.
I've sorted and indexed the reads using samtools
I've marked duplicates using Picard
I've called peaks using MACS2 (with the --broad function)

Below is the code used for DiffBind

>samples<-read.csv("samples.csv")
>samples
>flu <- dba(sampleSheet=samples)
>flu

12 Samples, 12455 sites in matrix (15815 total):
ID Condition Replicate Intervals
1   S1   control         1      8771
2   S2   control         2      8937
3   S3   control         3      9608
4   S4   control         4      6338
5   S5   flu_low         1      9127
6   S6   flu_low         2      7849
7   S7   flu_low         3      9081
8   S8   flu_low         4      7916
9   S9  flu_high         1     10432
10 S10  flu_high         2      9284
11 S11  flu_high         3      8311
12 S12  flu_high         4     11119

>flu_counts <- dba.count(flu)
>flu_counts

12 Samples, 12374 sites in matrix:
ID Condition Replicate     Reads FRiP
1   S1   control         1 137864693 0.06
2   S2   control         2 111003187 0.09
3   S3   control         3 142181269 0.07
4   S4   control         4 110355671 0.07
5   S5   flu_low         1 166460540 0.05
6   S6   flu_low         2 128119574 0.06
7   S7   flu_low         3 164164658 0.05
8   S8   flu_low         4 133877017 0.06
9   S9  flu_high         1 169378646 0.05
10 S10  flu_high         2 162547426 0.07
11 S11  flu_high         3 173471350 0.05
12 S12  flu_high         4 196066616 0.05

Now the weird thing is that it seems there are, for instance in sample S1, more than 130M reads. This is not possible as for this sample there mere 60M raw reads! Do you think that running bowtie2 with -k argument can be a problem? As setting -k 6 the sam file will retain reads mapping to a maximum of 6 different locations.

Thank you

Best

Marianna

ADD REPLY • link 8 months ago Marianna ▴ 20

0

Entering edit mode

I see no code, and you did not answer whether it's about raw or normalized counts.

ADD REPLY • link 8 months ago ATpoint ★ 4.8k

0

Entering edit mode

Sorry, I accidentally replied before completing the answer. I'm referring to raw counts

ADD REPLY • link 8 months ago Marianna ▴ 20

0

Entering edit mode

Yes, this could potentially be due to multimappers, since (iirc) -k 6 outputs up to 6 locations per multimapper. My recommendation is to remove any multimappers, for example by a MAPQ filter, e.g. samtools view -q 20. Note, this is my recommendation, I am not the DiffBind maintainer. Multimappers are usually never considered due to their uncertainty.

ADD REPLY • link 8 months ago ATpoint ★ 4.8k

0

Entering edit mode

Thank you ATpoint!

I tried by filtering with samtools view -q 42, to keep only uniquely mapping reads but the results were very similar to the one reported above. I'm afraid of the -k in bowtie2, so I'm trying to run it without -k, to completely get rid of the multimappers.

Thank you

Marianna

ADD REPLY • link 8 months ago Marianna ▴ 20

0

Entering edit mode

I am not exactly sure how the k parameter works, I never use it, but I agree, realignment without it and filter for MAPQ > 0 makes sense. Maybe not 42, that is strict, maybe use something like 20.

ADD REPLY • link 8 months ago ATpoint ★ 4.8k

0

Entering edit mode

DiffBind should already be filtering out multi-mapped reads using the mapQCth parameter in dba.count(). This is set to mapQCth=15 by default.

ADD REPLY • link 8 months ago Rory Stark ★ 5.2k