Hi,
This is my first time with RNA-Seq analysis, so I might be doing something wrong! I got some bam files from a collaborator that I need to process. I followed the pipeline, but I seem to be getting 0 counts! The warnings given are "The 2 combined objects have no sequence levels in common." What do I need to do?
I followed the bioconductor pipeline https://www.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html
Here is my code and screen output:
*********************
library("Rsamtools")
library("GenomicFeatures")
library("GenomicAlignments")
library("BiocParallel")
register(MulticoreParam(1))
refdir <- "/home/xxxx/ref_databases/rna_gtf/"
refgtf <- paste0(refdir,"Homo_sapiens.GRCh38.90.gtf") # downloaded from ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens
> txdb <- makeTxDbFromGFF(refgtf, format="gtf")
Import genomic features from the file as a GRanges object ...
OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(type, phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
> ebg <- exonsBy(txdb, by="gene")
> ebg
GRangesList object of length 58302:
$ENSG00000000003
GRanges object with 20 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] X [100627109, 100629986] - | 676055 ENSE00003730948
[2] X [100628670, 100629986] - | 676056 ENSE00001459322
[3] X [100630759, 100630866] - | 676057 ENSE00000868868
[4] X [100632063, 100632068] - | 676058 ENSE00003731560
[5] X [100632485, 100632568] - | 676059 ENSE00000401072
... ... ... ... . ... ...
[16] X [100635558, 100635746] - | 676070 ENSE00003733424
[17] X [100636191, 100636689] - | 676071 ENSE00001886883
[18] X [100636608, 100636806] - | 676072 ENSE00001855382
[19] X [100636793, 100637104] - | 676073 ENSE00001863395
[20] X [100639945, 100639991] - | 676074 ENSE00001828996
...
<58301 more elements>
-------
seqinfo: 47 sequences (1 circular) from an unspecified genome; no seqlengths
>
> bamfiles <- BamFileList(filenames[1:2], yieldSize=1000000)
>
> se <- summarizeOverlaps(features=ebg,
+ reads=bamfiles,
+ mode="Union",
+ singleEnd=FALSE,
+ ignore.strand=TRUE,
+ fragments=TRUE )
There were 28 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: call dbDisconnect() when finished working with a connection
2: In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
3: In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
>
> print(se)
class: RangedSummarizedExperiment
dim: 58302 2
metadata(0):
assays(1): counts
rownames(58302): ENSG00000000003 ENSG00000000005 ... ENSG00000284747
ENSG00000284748
rowData names(0):
colnames(2):
run1798_lane12_read_indexN701-S517=BR-337-All_Aligned_Reads.bam
run1798_lane12_read_indexN702-S502=BR-101-All_Aligned_Reads.bam
colData names(0):
>
>
>
>
> sum(assay(se))
[1] 0
*****************************
How/What do I troubleshoot? The gtf file seems ok (each line has ENS ids, coordinates etc.).
thanks!
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux 7.3 (Nitrogen)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BiocParallel_1.12.0 Rsubread_1.28.0
[3] GenomicAlignments_1.14.1 SummarizedExperiment_1.8.0
[5] DelayedArray_0.4.1 matrixStats_0.52.2
[7] GenomicFeatures_1.30.0 AnnotationDbi_1.40.0
[9] Biobase_2.38.0 Rsamtools_1.30.0
[11] Biostrings_2.46.0 XVector_0.18.0
[13] GenomicRanges_1.30.0 GenomeInfoDb_1.14.0
[15] IRanges_2.12.0 S4Vectors_0.16.0
[17] BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 compiler_3.4.2 prettyunits_1.0.2
[4] bitops_1.0-6 tools_3.4.2 zlibbioc_1.24.0
[7] progress_1.1.2 biomaRt_2.34.0 digest_0.6.12
[10] bit_1.1-12 lattice_0.20-35 RSQLite_2.0
[13] memoise_1.1.0 tibble_1.3.4 pkgconfig_2.0.1
[16] rlang_0.1.4 Matrix_1.2-11 DBI_0.7
[19] GenomeInfoDbData_0.99.1 rtracklayer_1.38.0 stringr_1.2.0
[22] grid_3.4.2 bit64_0.9-7 R6_2.2.2
[25] XML_3.98-1.9 RMySQL_0.10.13 blob_1.1.0
[28] magrittr_1.5 assertthat_0.2.0 stringi_1.1.6
[31] RCurl_1.95-4.8
Hi,
Have you checked that the chromosome names in your bam files corresponds to the annotations you are using? If there is mismatches in chromosome name between annotations and bams (most often either 1 or chr1 for chromosome 1) you might get 0 counts.
Thanks thokall. You were right!