Hello,
I'm running SGSeq on a set of bam files that have reads extracted from a region on chromosome 22 (29190548,29196560). When I run SGSseq on my bam files I always end up with read counts of 0 across all the features. I have sorted and indexed my bam files and the chromosome name is the same in my granges object and bam files. I will be grateful for any explanation for what I'm doing wrong.
I have my code below.
Thanking You,
Vakul
f1 = list.files(path= "/media/trantor/9E76CA7076CA492B/out/sort",pattern = "bam$",full.names = FALSE)
f2 = list.files(path= "/media/trantor/9E76CA7076CA492B/out/sort",pattern = "bam$",full.names = TRUE)
files = cbind(f1,f2)
files = as.data.frame(files)
colnames(files) = c("sample_name","file_bam")
files[,1] = as.character(files[,1])
files[,2] = as.character(files[,2])
#getting summary of information about bam files required by the package
sc = getBamInfo(files,cores =4)
#GRanges object of the region of intere
gr = GRanges(seqnames = "chr22",ranges = IRanges(29190548,29196560),strand = "*")
#TxDb Object
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
txdb = keepSeqlevels(txdb,"chr22")
#extracting features
txf <- convertToTxFeatures(txdb)
txf_f <- txf[txf %over% gr]
sgfc <- analyzeFeatures(sc, features = txf_f)
Session Info
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] XVector_0.8.0 TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
[3] GenomicFeatures_1.20.6 AnnotationDbi_1.30.1
[5] SGSeq_1.2.2 DEXSeq_1.14.2
[7] DESeq2_1.8.2 RcppArmadillo_0.6.500.4.0
[9] Rcpp_0.12.3 GenomicRanges_1.20.8
[11] GenomeInfoDb_1.4.3 IRanges_2.2.9
[13] S4Vectors_0.6.6 Biobase_2.28.0
[15] BiocGenerics_0.14.0 BiocParallel_1.2.22
loaded via a namespace (and not attached):
[1] genefilter_1.50.0 statmod_1.4.24 locfit_1.5-9.1 splines_3.2.1
[5] lattice_0.20-33 colorspace_1.2-6 rtracklayer_1.28.10 survival_2.38-3
[9] XML_3.98-1.4 foreign_0.8-66 DBI_0.3.1 RColorBrewer_1.1-2
[13] lambda.r_1.1.7 plyr_1.8.3 stringr_1.0.0 zlibbioc_1.14.0
[17] Biostrings_2.36.4 munsell_0.4.3 gtable_0.2.0 futile.logger_1.4.1
[21] hwriter_1.3.2 latticeExtra_0.6-28 geneplotter_1.46.0 biomaRt_2.24.1
[25] acepack_1.3-3.3 xtable_1.8-2 scales_0.4.0 Hmisc_3.17-2
[29] annotate_1.46.1 Rsamtools_1.20.5 gridExtra_2.2.1 ggplot2_2.1.0
[33] stringi_1.0-1 numDeriv_2014.2-1 grid_3.2.1 tools_3.2.1
[37] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.8 RSQLite_1.0.0
[41] Formula_1.2-1 cluster_2.0.3 futile.options_1.0.0 ABSOLUTE_1.0.6
[45] rpart_4.1-10 igraph_1.0.1 GenomicAlignments_1.4.2 nnet_7.3-12
Hi Leonard,
My BAM files contain reads that map to the genomic region corresponding to XBP1, I extracted them using samtools. I'll check for the features you mentioned in my BAM files and update my Bioc version.
If you could provide me a way to send you a couple of my BAM files I will be grateful if you could take a look at them.
Thanking You,
Vakul
Hi Leonard,
My BAM files contain reads that map to the genomic region corresponding to XBP1, I extracted them using samtools. I'll check for the features you mentioned in my BAM files and update my Bioc version.
If you could provide me a way to send you a couple of my BAM files I will be grateful if you could take a look at them.
Thanking You,
Vakul
It’s probably easiest if you can send test files via dropbox or as an email attachment (assuming they are not too large if they only contain reads for XBP1). You can find my email address on the SGSeq landing page.
Thank you,
I mailed you a text file of the reads from one of my BAM files.
Thanks for providing the test file. It looks like SGSeq, and the Bioc functions that SGSeq uses to read the data, cannot pair the mate reads in your file, because values in the QNAME field have suffix /1 and /2 (i.e the QNAME for mate reads are not identical). When stripping off the suffix everything works fine. Looking at the SAM specifications and online forums, it seems like the QNAME should be identical for mate reads. So the easiest solution would be to remove the suffixes in your file, or maybe your alignment program has an option that can remove them during read mapping? I hope this helps.
BamFile()
has argumentsqnamePrefixEnd
,qnameSuffixStart
, which can be used to strip the prefix. The idea would be to support input asBamFile()
/BamFileList()
; I don't know if that's what SGSeq does.Hi Martin, thanks for the comments. Yes I saw BamFile() has these arguments, which should make it straightforward to strip suffixes. However I don’t want to hard-code this behavior. So supporting it will require changes in the API of several high-level functions. Probably not a good idea to do this in the release version, but I can do it in devel. Vakul, if you are happy to work with bioc-devel, or wait until the next release, this is an option. I can let you know when I have implemented it.
I'll make the changes to my BAM files and give it another go.
Thank you for the help.
Also It will be awesome if you let me know about the changes Thanx!
Based on Martin’s suggestion I made the following change in SGSeq 1.5.9 (available via svn now). In the ‘sample_info’ data frame, column ‘file_bam’ can now be either a character vector as previously, or it can be a
BamFileList
object.If you can switch to bioc-devel, this allows you to use your current BAM files. All you have to do is change the input in your example as follows:
Then run
getBamInfo()
andanalyzeFeatures()
and you should get the correct counts.Thank you,
I'll give it a shot!
Hi I got the R_devel and also found the svn page for SGSeq (https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/SGSeq/), I'm however not familiar with working on the bioc-devel I will be grateful on a pointer to how I can get the package and run it.
Thank you.
Hi, in case you haven't seen it already, there are instructions on
https://www.bioconductor.org/install/
In order to switch to bioc-devel, run function useDevel() in the BiocInstaller package, then you can use biocLite() for installing packages as usual. You don't normally have to checkout packages via SVN. When developers commit changes for a package, the latest version is not available via biocLite() immediately (it usually becomes available on the following day, depending on the time that changes were committed to the repository). If you don't want to wait until then, you can check out the most recent version via SVN immediately. However, at this point SGSeq 1.5.9 is also available via biocLite().
Thank you that was really informative.
I managed to obtain and run SGSeq 1.5.9., The code runs perfectly for one sample but fails when I input multiple BAM files. Is it possible for me to use one of the apply family of functions and coerce it into the required datastructure?
files$file_bam = BamFileList(files$file_bam, asMates = TRUE, qnameSuffixStart = "/")
> sc = getBamInfo(files[1,])
TCGA-3C-AAAU-01A.bam complete.
> sc = getBamInfo(files)
TCGA-3C-AAAU-01A.bam complete.
TCGA-3C-AALK-01A.bam complete.
TCGA-4H-AAAK-01A.bam complete.
TCGA-5L-AAT0-01A.bam complete.
TCGA-5L-AAT1-01A.bam complete.
TCGA-5T-A9QA-01A.bam complete.
Error in DataFrame(df, check.names = FALSE) :
cannot coerce class "list" to a DataFrame
Sorry - this is a bug I introduced when adding support for BamFileLists. Please try again with SGSeq 1.5.10 (you can obtain it via svn now or biocLite() tomorrow). If you have any other problems please let me know.
To answer your question - the 'sample_info' input is just a data.frame or DataFrame object. So you can run getBamInfo() for individual samples and then create the full data frame yourself.
Thank you for all the help Leonard, SGSeq work like a charm now.