RNA-seq analysis with R/BioC: alignment and read counts
3
0
Entering edit mode
Fiona ▴ 70
@fiona-5790
Last seen 8.8 years ago
United Kingdom

Hi everyone,

I'm analyzing my first paired-end RNA-seq data in R/BioConductor and I'm having some trouble. I've given detail below but please let me know if I need to give more information. I'd be really grateful for any help or pointers.

I downloaded the D. melanogaster reference genome (dm6, the most recent version) from the UCSC server, and used SAMtools and Picard to create a reference genome .fa file. I carried out my alignment to this reference file with R BioC 'QuasR' package, command qAlign, and this gave me .bam files for all my samples. I used:

proj <- qAlign(sampleFile,genomeFile,cacheDir=results)

with sampleFile and genomeFile providing the paths to the sample and reference .fa files, respectively. This step seemed to go fine.

Now I am trying to generate read counts with R BioC 'GenomicFeatures' package. Following the package notes and 'how to' guide, I used:

txdb <- makeTxDbFromGFF("dmel-all-r6.05.gff.gz",format="gff")

exbygene <- exonsBy(txdb, by="gene")

I get a warning message for makeTxDbFromGFF, stating that:

Warning messages:
1: In matchCircularity(seqlevels(gr), circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  FBtr0084079, FBtr0084080, FBtr0084081, FBtr0084082, FBtr0084083,
  FBtr0084084, FBtr0084085, FBtr0307759, FBtr0307760
3: In .reject_transcripts(bad_tx, because) :
  The following transcripts were rejected because they have CDSs that
  cannot be mapped to an exon: FBtr0100857, FBtr0100861, FBtr0100863,
  FBtr0100866, FBtr0100868, FBtr0100870, FBtr0100883, FBtr0433498,
  FBtr0433502

The exonsBy command runs fine though, so I tried to continue with qCount in 'QuasR' with this code:

anycounts <- qCount(proj, txdb, reportLevel="gene", orientation="any")

and I get the following error:

Error in qCount(proj, txdb, reportLevel = "gene", orientation = "any") :
  sequence levels in 'query' not found in alignment files: 2L, 2R, 3L, 3R, 4, X, Y, Unmapped_Scaffold_8_D1580_D1567, 211000022278158, 211000022278279, 211000022278282, 211000022278298, 211000022278307, 211000022278309, 211000022278436, 211000022278449, 211000022278498, 211000022278522, 211000022278603, 211000022278604, 211000022278664, 211000022278724, 211000022278750, 211000022278760, 211000022278875, 211000022278877, 211000022278878, 211000022278879, 211000022278880, 211000022278985, 211000022279055, 211000022279108, 211000022279132, 211000022279134, 211000022279165, 211000022279188, 211000022279222, 211000022279264, 211000022279342, 211000022279392, 211000022279446, 211000022279528, 211000022279529, 211000022279531, 211000022279555, 211000022279681, 211000022279708, 211000022280133, 211000022280328, 211000022280341, 211000022280347, 211000022280481, 211000022280494, 211000022280645, 211000022280703, mitochondrion_genome, rDNA


As far as my beginner knowledge can tell, the reference I used to create .bam files is not matching up with the reference I am using to count reads against. Is this the case?

If so, I understand why that would be a problem, but I don't understand how the problem has occurred, nor how to fix it. As far as I can tell, I used the same reference for both of these steps (in the form of the .fa file for alignment and the .gff to make the gene model). Can anyone help me solve this one?

Thanks in advance for any help,

Fiona

sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Scientific Linux release 6.6 (Carbon)

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] ShortRead_1.26.0        GenomicAlignments_1.4.1 BiocParallel_1.2.3     
 [4] Rsamtools_1.20.4        Biostrings_2.36.1       XVector_0.8.0          
 [7] rtracklayer_1.28.4      edgeR_3.10.2            limma_3.24.10          
[10] DESeq_1.20.0            lattice_0.20-31         locfit_1.5-9.1         
[13] GenomicFeatures_1.20.1  AnnotationDbi_1.30.1    Biobase_2.28.0         
[16] QuasR_1.8.3             Rbowtie_1.8.0           GenomicRanges_1.20.5   
[19] GenomeInfoDb_1.4.0      IRanges_2.2.4           S4Vectors_0.6.0        
[22] BiocGenerics_0.14.0    

loaded via a namespace (and not attached):
 [1] BiocInstaller_1.18.3 RColorBrewer_1.1-2   futile.logger_1.4.1
 [4] bitops_1.0-6         futile.options_1.0.0 tools_3.2.0         
 [7] zlibbioc_1.14.0      biomaRt_2.24.0       annotate_1.46.0     
[10] RSQLite_1.0.0        BSgenome_1.36.0      DBI_0.3.1           
[13] genefilter_1.50.0    hwriter_1.3.2        grid_3.2.0          
[16] XML_3.98-1.2         survival_2.38-2      latticeExtra_0.6-26
[19] geneplotter_1.46.0   lambda.r_1.1.7       splines_3.2.0       
[22] xtable_1.7-4         RCurl_1.95-4.6     
drosophila melanogaster quasR genomicfeatures genomicranges alignment • 3.6k views
ADD COMMENT
0
Entering edit mode
Sonali Arora ▴ 390
@sonali-arora-6563
Last seen 8.7 years ago
United States

Hi Fiona, 

My understanding from the information that you've given is, that the seqlevelsStyle of the GFF files is not UCSC and that is giving rise
to these issues. I suggest using the following steps 

a) read in the GFF file as a GRanges
b) read in one of the bam files as a GenomicAlignments. 
c) check if both of them have the same chromosome naming convention ?
d) convert everything to have same naming convention
e) then create the txdb object. 

I have shown some code below - but have used test files which you can 
replace with your own GFF and BAM files

Test GFF file

library(rtracklayer)
test_path <- system.file("tests", package = "rtracklayer")
test_gff3 <- file.path(test_path, "genes.gff3")


Test Bam file

library(pasillaBamSubset)
fl <- untreated3_chr4()   # filename as a character


Here is an example of how to read in a GFF file as a GRanges - 

library(rtracklayer)
gff_gr <- import(test_gff3, format="gff")
seqlevelsStyle(gff_gr)

Read in your BAM files into R with  - 

library(GenomicAlignments) 
bamData <- readGAlignmentsList(fl)
seqlevelsStyle(bamData) 

Checking step: 

if(seqlevelsStyle(galist1)==seqlevelsStyle(gff_gr))
{
    # create the txdb
    txdb <- makeTxdbFromGRanges(gff_gr) # should work!
} 


This is how you change the chromosome naming style to "UCSC"

library(GenomeInfoDb)
seqlevelsStyle(gff_gr) <- "UCSC"

then run 

txdb <- makeTxdbFromGRanges(gff_gr)


Hope that helps!
Sonali. 

ADD COMMENT
0
Entering edit mode

Thank you for the code and for the explanation! This makes sense, and the code fixes the problem.

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia

Hi Fiona,

I don't want to confuse you with more alternatives, but there is a complete start to finish case study of analysing D. melanogaster RNA-seq data in the last chapter of the limma User's Guide:

 http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf

This example uses featureCounts() in the Rsubread package to count the reads. It may be worth considering.

ADD COMMENT
0
Entering edit mode

Thanks very much for the suggestion - I didn't realize limma could be used for RNAseq as well as microarray analysis, so this will definitely be worth me trying out.

ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 2 hours ago
Seattle, WA, United States

Hi Fiona,

I suppose you got dmel-all-r6.05.gff.gz from FlyBase. AFAIK it contains annotations based on reference genome BDGP Release 6 which is indeed the same assembly as dm6 from UCSC. However, as Sonali suggested, UCSC doesn't use the same chromosome naming convention as FlyBase. An easy way to address this is to grab the gene/transcript/exon models from UCSC. That is, just replace:

txdb <- makeTxDbFromGFF("dmel-all-r6.05.gff.gz", format="gff")

with

txdb <- makeTxDbFromUCSC(genome="dm6", tablename="refGene")

See ?makeTxDbFromUCSC for more information.

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Thanks for pointing this out - I didn't know about the makeTxDbFromUCSC function, and it fixes the problem.

ADD REPLY

Login before adding your answer.

Traffic: 665 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6