Hi to the experts,
First of all, I would like to thank you for the efforts you put in to the Bioconductor team. Those efforts make the lives of struggling bioinformatics students like me much easy.
This is a very naive inquiry about an issue I face with counting features using summarizeOverlaps() function.
I have an analysis on stranded, paired end RNA-Seq data of transgenic Arabidopsis. A snippet of my code used is below.
> metadata <- as.data.frame(read.table("bamFiles/metadata.txt",
+ header=TRUE))
> metadata$Code <- paste(metadata$Tissue,
+ metadata$Treatment, metadata$Genotype,
+ sep="_")
> metadata$ExpGroup <- paste(substr(metadata$Tissue, 1, 1),
+ metadata$Genotype,
+ substr(metadata$Treatment, 1, 1),
+ substr(metadata$Time, 1, 1), sep='-')
> metadata$biolrep <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,
+ 10,10,11,11,12,12,13,13,14,14,15,15,
+ 16,16,17,17,18,18,19,19,20,20,21,21,
+ 22,22,23,23,24,24,25,25,26,26,27,27,
+ 28,28,29,29,30,30,31,31)
> #extract out the names of the BAM files
> #to the bamFls variable (character variable)
> bamFls <- list.files(path="bamFiles/",
+ pattern="bam$", full=TRUE)
> #ordering according to the order of metadata
> bamFls_ordered <- bamFls[match(metadata$Filename,
+ (sub("\\..*", "", basename(bamFls))))]
> #Assigning names to each data in bamFls_ordred using a
> #section of its basename
> names(bamFls_ordered) <- sub("\\..*", "",
+ basename(bamFls_ordered))
> source("https://bioconductor.org/biocLite.R")
> biocLite(c("GenomicAlignments", “GenomicFeatures”))
> library("GenomicAlignments")
> library("GenomicFeatures")
> txdb <- makeTxDbFromGFF(
+ file="TAIR10_GFF3_genes.gff",
+ format="gff",
+ dataSource="TAIR",
+ organism="Arabidopsis thaliana")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
This is where the issue begins. I managed to count the reads for each gene the first time without an issue using the following code.
> gnModel <- transcriptsBy(txdb, "gene")
> options(mc.cores=1)
> ine_counts <- summarizeOverlaps(gnModel,
+ bamFls_ordered, ignore.strand=FALSE,
+ singleEnd=FALSE,
+ mode="Union")
This seem to function properly. However, the first attempt to create the count matrix gave me an “acceptable” count matrix (completed in December 2015) with reads spanning up to ~1200 for the overexpressed transgene (attached the counts barplot as 1.png).
As I had a little time on my hand until the second dataset arrives, I tried to reproduce the data (in Jan 2016) using the same codes I used previously. However, now the count matrix looked very different compared to the first attempt (has very low numbers). I have attached the barplot for the counts of my overexpressed transgene from this count matrix as 2.png.
I tried the following to see whether it is a problem with my gff file, bam files, metadata file, etc.
> counter <- function(fl, gnModel) {
+ aln <- readGAlignments(fl)
+ strand(aln) <- "*" # for strand-blind sample prep protocol
+ hits <- countOverlaps(aln, gnModel)
+ counts <- countOverlaps(gnModel, aln[hits==1])
+ names(counts) <- names(gnModel)
+ counts
+ }
> counts <- sapply(bamFls_ordered, counter, gnModel)
And the resulting count matrix was much similar to the initial count matrix I produced (attached the barplot for counts for the overexpressed transgene as 3.png). Of course there are higher numbers as the double counting is allowed.
Therefore, I concluded that (I might be wrong) it has got to do something with the summarizeOverlaps() function or the GenomicFetures() package. I tried looking at the information from updates. However, I was unable to locate anything related to this. I prefer to use the summarizeOverlaps() function, as it lets me define whether my data is stranded/unstranded and paired/single end. And also the double counting is avoided.
If you find a free time, please advise me on where I might be going wrong, or please direct me to a place where I can find a solution. I pondered on it for nearly 2 weeks now, and cannot seem to find a way out. I’m sure it is a very naïve issue that has just slipped out of my sight. I would really appreciate your help.
Thanks you so much for your time.
Yours sincerely,
Shani A.
Hi Valerie,
Thank you for your response. However, maybe I wasn't clear. I have no issue in getting different results for countOverlaps() and summarizeOverlaps() functions. You are correct on the fact that I wasn't using the same criteria for both. Even if I did, I believe they should not give the same count values for samples due to the difference in the underlying methodology.
My issue lies in not being able to reproduce the summarizeOverlaps() count matrix the second time I ran it. For an example, if you can see in the 1.png and 2.png, I took the counts from all the samples for the over-expressed gene in my experiment. Since we have double checked using RT-PCR, I know that over-expressed gene is expressed largely in transgenic plants. It is quite obvious in 1.png (this is the time I ran summarizeOverlaps() on my samples for the first time). But surprisingly the second time I tried to reproduced the data using the same script, and same samples, the reads for the over-expressed transgene have reduced from maximum ~1200 (from 1.png) to ~14 (this is evident in 2.png).
My question is, what could be really causing this non-reproducability, while all my BAM files, paths, gff file and the metadata file are correct. Given that I get maximum ~2000 counts for the over-expressed gene from countOverlaps() function, I believe it might have got to do something with the...umm, the summarizeOverlaps() function or the GenomicFetures package. Because when I checked, every other thing is a constant. I'm using GenomicFeatures 1.22.4 on R version 3.2.1 (2015-06-18) on x86_64-pc-linux-gnu platform.
I therefore, kindly would like to know whether an update for summarizeOverlaps() function or genomicFetures() package could cause this?
I'm sorry, if I am not clear again. Please let me know I can elaborate more. In the meantime, I will redo the summarizeOverlaps() and countOverlaps() with a single BAM file.
Thanks heaps.
Shani.
If the bam files and TxDb have not changed then results from summarizeOverlaps() should not have changed. No, there have not been changes to summarizeOverlaps() or makeTxDbFromGFF() that should cause this type of discrepancy. Both of the GenomicAlignments and GenomicFeatues packages are under active development but changes made should not cause this type of reduction in overlaps.
It is difficult to help when I can't reproduce the inconsistent results. You say you ran the same script with the same bam files with R 3.2.1.
- what were the versions of GenomicFeatures and GenomicAlignments for the two runs?
- did you save the GFF/TxDb from that first run or did you have to re-download the GFF and re-generate the TxDb for the second run?
If you know the different package versions (first run vs second) and have the TxDb(s) I can try to reproduce this on one of your bam files.
Valerie
Hi Valerie,
Thanks for your reply. .
The versions I used for first run was GenomicAlignments 1.4.2 but for the second run 1.6.3. For GenomicFeatures first run was in 1.20.6 and for the next 1.22.12. New versions were used for countOverlaps() as well.
I saved the TxDb from the first run and loaded it back to use for the second run (this loaded TxDb was used for the countOverlaps() function as well). I have added the TxDb in sqlite format here. The BAM file can be found here.
Thanks a lot for your help.
Shani.
Thanks. I will take a look. (sorry for misspelling your name in that first post!).
Thanks Valerie.
No issue with the name. It is not quite a familiar name anyways... Looking forward to your update on the data.
Thanks,
Shani
I've done a run with the current release, sessionInfo() below.
There were some warnings related to differently named seqlevels. This may not be relevant to your gene of interest but I thought it worth mentioning.
You can take a closer look at the seqlevels with
seqlevels(txdb)
andseqlevels(BamFile(bf))
.I'm not sure which genes in the txdb (names start with AT*) correspond to "B32-R-Transgenic-S-3" in the figure so here I've included all genes with counts above 1000:
Let me know which gene(s) in the txdb correspond to the over-expressed gene of interest and I'll re-run with the earlier package versions.
Valerie
editing to add sessionInfo():
> sessionInfo()
R version 3.2.3 Patched (2016-01-04 r69860)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Fedora 22 (Twenty Two)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3
[3] GenomicAlignments_1.6.3 Rsamtools_1.22.0
[5] Biostrings_2.38.4 XVector_0.10.0
[7] SummarizedExperiment_1.0.2 Biobase_2.30.0
[9] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3
[11] IRanges_2.4.7 S4Vectors_0.8.11
[13] BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0 BiocParallel_1.4.3 tools_3.2.3
[4] DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1
[7] rtracklayer_1.30.2 futile.options_1.0.0 bitops_1.0-6
[10] RCurl_1.95-4.7 biomaRt_2.26.1 RSQLite_1.0.0
[13] XML_3.98-1.3
Thanks a lot for this Valerie.
The gene over-expressed would be AT2G25090. I checked the same with my data for the two runs of summerizeOverlaps(). ..._1 stands for the first run and ..._2 is the second run.
As you can see, you have reproduced the results from my second run which is very different to the first run for the same sample. This is still so confusing. Can I please know whether there is a new normalization step taking place within the summerizeOverlaps() function now that could cause this?
Thanks a lot.
Shani.
Yes, I understand the code above was equivalent to your second run. I wanted information about the gene before running the older versions so I could use a ScanBamParm to focus on the region of interest - you'll see this in the code below.
The older package versions (release April-Oct 2015: GenomicFeatures 1.20.6, GenomicAlignments 1.4.2) produce the exact same results as the current release (Oct 2015 - April 2016: GenomicFeatures 1.22.13, GenomicAlignments 1.6.3). The difference I see is in the value of 'ignore.strand'. With ignore.strand=TRUE counts are 11 and with ignore.strand=FALSE counts are 1249. These numbers correspond to what I see in 1.png and 2.png.
Valerie
Yes Valerie, you are right. The values correspond with the results from the two runs as well.
However, I used
ignore.strand = FALSE
in both instances. This is because my data are paired end reads.Thank you,
Shani.
This is now a non-reproducible error. The sessionInfo() I show above demonstrates that ignore.strand=FALSE produces 11 counts with the old and new package versions. If you can download a fresh R 3.2 and install the appropriate Bioconductor 3.1 packages and show that using ignore.strand=FALSE produces 1249 counts (show your sessionInfo()) I will look into this further. Otherwise I'm considering this issue closed.
The best way to install the old packages is into a fresh R 3.2 install get BiocInstaller from
svn co https://hedgehog.fhcrc.org/bioconductor/branches/RELEASE_3_1/madman/Rpacks/BiocInstaller
and install with R CMD INSTALL BiocInstaller. Then from within an R session with this fresh install use biocLite() to get GenomicFeatures and GenomicAlignments.
Valerie