Hi I was wondering if you could let me know as to what would be the correct way to get TMM normalized region counts.
Here is the code. Kindly let me know which is the correct way to do it :-
library(csaw) options(echo=TRUE) # if you want see commands in output file args <- commandArgs(trailingOnly = TRUE) if(length(args) < 2) { args <- c("--help") } cell_type <- as.character(unlist(args[1])) histone_name <- as.character(unlist(args[2])) bam.files <- unlist(args[3:length(args)]) TSS1000_counts <- paste("TSS1000_",histone_name,"_",cell_type,".txt",sep='') TSS_1000_human <- read.table("human_TSS1000bp.txt",sep="\t",header=F) colnames(TSS_1000_human) <- c("chr","start","end","strand","ensembl_id") TSS_1000_human_granges <- with(TSS_1000_human, GRanges(chr, IRanges(start, end), strand, id=ensembl_id)) param <- readParam(minq=10) frag.len <- 36 reg_counts_1000 <- regionCounts(bam.files,TSS_1000_human_granges,ext=frag.len,param=param) head(assay(reg_counts_1000)) binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param) normfacs <- normOffsets(binned) test_normalize_counts <- (assay(reg_counts_1000))/normfacs TSS1000_TMM <- asDGEList(reg_counts_1000, norm.factors=normfacs)
In the previous post (C: Normalization of chip-seq data in manually specified regions using CSAW) you had mentioned that I should calculate normalization factors and use the promoter counts to get the fold changes. You also suggested that I should use fold changes which had a significant p-value (I assume I would use edgeR over here to calculate p-value). I was wondering that in some cases the fold-change with significant p-values would be too low. In that case can I just use all the values with normalized foldchange for what I set out to do?
PS: another question (also perhaps the last one :)). How do you get the seq_ids along with the region counts from regionCounts since they are already present as metadata in Granges object?
Yes, I think you could, as your differential entropy calculations don't depend on differential binding. It's not surprising that you don't get a lot of significant regions, because only have one ChIP and input per species.
As for your other question, I'm not sure what you mean by the
seq_ids
. Do you mean your ENSEMBL identities? In so, you should be able to get them withrowRanges(reg_counts_1000)$id
.P.S. If you're going to post additional comments on this thread, add the
csaw
tag otherwise I won't get notified.Actually I have 2 replicates for both Chip and input for each species. Thanks for all the help hopefully this would be the last of the comment for this post.
Cute cat BTW!
Does the order of bam files matter if I am calculating fold changes against input. Should it be group1(histone) followed by group2(input) or otherwise? I know we can always multiply by -1 in each case. I was confused since according to edgeR manual control samples should follow treated in columns. Let me know if I am wrong.
There is no requirement for any specific ordering of the BAM files, as long as the order of files that you use for counting matches up to the ordering of ChIP/input factor used to construct the design matrix.