Hi everyone,
I'm analyzing my first paired-end RNA-seq data in R/BioConductor and I'm having some trouble. I've given detail below but please let me know if I need to give more information. I'd be really grateful for any help or pointers.
I downloaded the D. melanogaster reference genome (dm6, the most recent version) from the UCSC server, and used SAMtools and Picard to create a reference genome .fa file. I carried out my alignment to this reference file with R BioC 'QuasR' package, command qAlign, and this gave me .bam files for all my samples. I used:
proj <- qAlign(sampleFile,genomeFile,cacheDir=results)
with sampleFile and genomeFile providing the paths to the sample and reference .fa files, respectively. This step seemed to go fine.
Now I am trying to generate read counts with R BioC 'GenomicFeatures' package. Following the package notes and 'how to' guide, I used:
txdb <- makeTxDbFromGFF("dmel-all-r6.05.gff.gz",format="gff")
exbygene <- exonsBy(txdb, by="gene")
I get a warning message for makeTxDbFromGFF, stating that:
Warning messages: 1: In matchCircularity(seqlevels(gr), circ_seqs) : None of the strings in your circ_seqs argument match your seqnames. 2: In makeTxDbFromGRanges(gr, metadata = metadata) : The following transcripts were dropped because their exon ranks could not be inferred (either because the exons are not on the same chromosome/strand or because they are not separated by introns): FBtr0084079, FBtr0084080, FBtr0084081, FBtr0084082, FBtr0084083, FBtr0084084, FBtr0084085, FBtr0307759, FBtr0307760 3: In .reject_transcripts(bad_tx, because) : The following transcripts were rejected because they have CDSs that cannot be mapped to an exon: FBtr0100857, FBtr0100861, FBtr0100863, FBtr0100866, FBtr0100868, FBtr0100870, FBtr0100883, FBtr0433498, FBtr0433502
The exonsBy command runs fine though, so I tried to continue with qCount in 'QuasR' with this code:
anycounts <- qCount(proj, txdb, reportLevel="gene", orientation="any")
and I get the following error:
Error in qCount(proj, txdb, reportLevel = "gene", orientation = "any") : sequence levels in 'query' not found in alignment files: 2L, 2R, 3L, 3R, 4, X, Y, Unmapped_Scaffold_8_D1580_D1567, 211000022278158, 211000022278279, 211000022278282, 211000022278298, 211000022278307, 211000022278309, 211000022278436, 211000022278449, 211000022278498, 211000022278522, 211000022278603, 211000022278604, 211000022278664, 211000022278724, 211000022278750, 211000022278760, 211000022278875, 211000022278877, 211000022278878, 211000022278879, 211000022278880, 211000022278985, 211000022279055, 211000022279108, 211000022279132, 211000022279134, 211000022279165, 211000022279188, 211000022279222, 211000022279264, 211000022279342, 211000022279392, 211000022279446, 211000022279528, 211000022279529, 211000022279531, 211000022279555, 211000022279681, 211000022279708, 211000022280133, 211000022280328, 211000022280341, 211000022280347, 211000022280481, 211000022280494, 211000022280645, 211000022280703, mitochondrion_genome, rDNA
As far as my beginner knowledge can tell, the reference I used to create .bam files is not matching up with the reference I am using to count reads against. Is this the case?
If so, I understand why that would be a problem, but I don't understand how the problem has occurred, nor how to fix it. As far as I can tell, I used the same reference for both of these steps (in the form of the .fa file for alignment and the .gff to make the gene model). Can anyone help me solve this one?
Thanks in advance for any help,
Fiona
sessionInfo() R version 3.2.0 (2015-04-16) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: Scientific Linux release 6.6 (Carbon) attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] ShortRead_1.26.0 GenomicAlignments_1.4.1 BiocParallel_1.2.3 [4] Rsamtools_1.20.4 Biostrings_2.36.1 XVector_0.8.0 [7] rtracklayer_1.28.4 edgeR_3.10.2 limma_3.24.10 [10] DESeq_1.20.0 lattice_0.20-31 locfit_1.5-9.1 [13] GenomicFeatures_1.20.1 AnnotationDbi_1.30.1 Biobase_2.28.0 [16] QuasR_1.8.3 Rbowtie_1.8.0 GenomicRanges_1.20.5 [19] GenomeInfoDb_1.4.0 IRanges_2.2.4 S4Vectors_0.6.0 [22] BiocGenerics_0.14.0 loaded via a namespace (and not attached): [1] BiocInstaller_1.18.3 RColorBrewer_1.1-2 futile.logger_1.4.1 [4] bitops_1.0-6 futile.options_1.0.0 tools_3.2.0 [7] zlibbioc_1.14.0 biomaRt_2.24.0 annotate_1.46.0 [10] RSQLite_1.0.0 BSgenome_1.36.0 DBI_0.3.1 [13] genefilter_1.50.0 hwriter_1.3.2 grid_3.2.0 [16] XML_3.98-1.2 survival_2.38-2 latticeExtra_0.6-26 [19] geneplotter_1.46.0 lambda.r_1.1.7 splines_3.2.0 [22] xtable_1.7-4 RCurl_1.95-4.6
Thank you for the code and for the explanation! This makes sense, and the code fixes the problem.