Question

Getting normalized read counts from region counts

0

Entering edit mode

saadmurtazakhan ▴ 10

@saadmurtazakhan-6782

Last seen 12 months ago

United States

Hi I was wondering if you could let me know as to what would be the correct way to get TMM normalized region counts.

Here is the code. Kindly let me know which is the correct way to do it :-

library(csaw)
options(echo=TRUE) # if you want see commands in output file
args <- commandArgs(trailingOnly = TRUE)
if(length(args) < 2) {
  args <- c("--help")
}
cell_type <- as.character(unlist(args[1]))
histone_name <- as.character(unlist(args[2]))
bam.files <- unlist(args[3:length(args)])
TSS1000_counts <-  paste("TSS1000_",histone_name,"_",cell_type,".txt",sep='')
TSS_1000_human <- read.table("human_TSS1000bp.txt",sep="\t",header=F)
colnames(TSS_1000_human) <- c("chr","start","end","strand","ensembl_id")
TSS_1000_human_granges <- with(TSS_1000_human, GRanges(chr, IRanges(start, end), strand, id=ensembl_id))

param <- readParam(minq=10)

frag.len <- 36
reg_counts_1000 <- regionCounts(bam.files,TSS_1000_human_granges,ext=frag.len,param=param)

head(assay(reg_counts_1000))

binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
normfacs <- normOffsets(binned)
test_normalize_counts <- (assay(reg_counts_1000))/normfacs
TSS1000_TMM <- asDGEList(reg_counts_1000, norm.factors=normfacs)

regionCounts TMM csaw • 2.4k views

ADD COMMENT • link 8.8 years ago saadmurtazakhan ▴ 10

Gordon Smyth · Answer 1 · 2016-06-10

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 6 hours ago

The city by the bay

The concept of "normalized counts" isn't something that's supported by edgeR. Once you scale the counts by an arbitrary normalization factor, they aren't interpretable as counts anymore. We prefer to calculate normalized expression values (or in this case, binding values), which can be computed as CPMs. All you would have to do is:

cpm(TSS1000_TMM)

... and you'll get a matrix of CPMs that you can use for whatever relative quantification you want to do. If you must have count data, it's far better to incorporate the normalization information into your modelling procedure, e.g., as GLM offsets. This avoids screwing up the mean-variance relationship when small counts get scaled up (inflating the Poisson noise) or large counts get scaled down (leading to underdispersion).

P.S. Your test_normalize_counts doesn't make any sense. You're dividing by the wrong thing (you need to use the effective library size, not just the normalization factor) and you're not performing the divisions column-wise. Use the cpm command instead.

P.P.S. You use a fragment length of 36 bp, which I presume is your read length. You can achieve the same effect by setting ext=NA, which might be better as it ensures your code will still work on another data set with different read lengths.

ADD COMMENT • link 8.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

In the previous post (C: Normalization of chip-seq data in manually specified regions using CSAW) you had mentioned that I should calculate normalization factors and use the promoter counts to get the fold changes. You also suggested that I should use fold changes which had a significant p-value (I assume I would use edgeR over here to calculate p-value). I was wondering that in some cases the fold-change with significant p-values would be too low. In that case can I just use all the values with normalized foldchange for what I set out to do?

PS: another question (also perhaps the last one :)). How do you get the seq_ids along with the region counts from regionCounts since they are already present as metadata in Granges object?

ADD REPLY • link updated 8.8 years ago by Gordon Smyth 52k • written 8.8 years ago by saadmurtazakhan ▴ 10

1

Entering edit mode

Yes, I think you could, as your differential entropy calculations don't depend on differential binding. It's not surprising that you don't get a lot of significant regions, because only have one ChIP and input per species.

As for your other question, I'm not sure what you mean by the seq_ids. Do you mean your ENSEMBL identities? In so, you should be able to get them with rowRanges(reg_counts_1000)$id.

P.S. If you're going to post additional comments on this thread, add the csaw tag otherwise I won't get notified.

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Actually I have 2 replicates for both Chip and input for each species. Thanks for all the help hopefully this would be the last of the comment for this post.

Cute cat BTW!

ADD REPLY • link 8.8 years ago saadmurtazakhan ▴ 10

0

Entering edit mode

Does the order of bam files matter if I am calculating fold changes against input. Should it be group1(histone) followed by group2(input) or otherwise? I know we can always multiply by -1 in each case. I was confused since according to edgeR manual control samples should follow treated in columns. Let me know if I am wrong.

ADD REPLY • link 8.8 years ago saadmurtazakhan ▴ 10

1

Entering edit mode

There is no requirement for any specific ordering of the BAM files, as long as the order of files that you use for counting matches up to the ordering of ChIP/input factor used to construct the design matrix.

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k