Entering edit mode
Richard Friedman
★
2.0k
@richard-friedman-513
Last seen 10.3 years ago
Dear Bioconductor List,
I am working through the easyRNASeq human sequence use case
with a difference. I am using Bowtie/Tophat/Cufflinks aligned reads
and transcripts. It works fine for getting counts of exons,
transcripts, and
genes for the genes.gtf file from the UCSD assembly. The difficulty
comes in when I try to get counts for novel exons, genes and
transcripts.
To get this, I am uses the merged.gtf file which is generated by
cuffmerge
rather than the genes.gtf file from UCSD. I get Exon counts just fine
(although I lose the old Ensembl Exon symbols). What I can't get is
transcript
counts. Here are excerpts from my session:
> library(easyRNASeq)
Loading required package: parallel
Loading required package: genomeIntervals
Loading required package: intervals
Loading required package: BiocGenerics
Attaching package: BiocGenerics
The following object(s) are masked from package:stats:
xtabs
The following object(s) are masked from package:base:
anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find,
get, intersect,
lapply, Map, mapply, mget, order, paste, pmax, pmax.int, pmin,
pmin.int, Position,
rbind, Reduce, rep.int, rownames, sapply, setdiff, table, tapply,
union, unique
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite
Bioconductor, see 'citation("Biobase")', and for packages
'citation("pkgname")'.
Loading required package: biomaRt
Loading required package: edgeR
Loading required package: limma
Loading required package: Biostrings
Loading required package: IRanges
Attaching package: IRanges
The following object(s) are masked from package:intervals:
expand, reduce
Attaching package: Biostrings
The following object(s) are masked from package:intervals:
type
Loading required package: BSgenome
Loading required package: GenomicRanges
Loading required package: DESeq
Loading required package: locfit
locfit 1.5-8 2012-04-25
Attaching package: locfit
The following object(s) are masked from package:GenomicRanges:
left, right
Loading required package: lattice
Attaching package: DESeq
The following object(s) are masked from package:limma:
plotMA
Loading required package: Rsamtools
Loading required package: ShortRead
Loading required package: latticeExtra
Loading required package: RColorBrewer
Warning messages:
1: replacing previous import coerce when loading intervals
2: replacing previous import initialize when loading intervals
> library(BSgenome.Hsapiens.UCSC.hg19)
> library(Rsamtools)
>
> chr.sizes=seqlengths(Hsapiens)
> chr.sizes
chr1 chr2 chr3
chr4
249250621 243199373 198022430
191154276
chr5 chr6 chr7
chr8
180915260 171115067 159138663
146364022
chr9 chr10 chr11
chr12
141213431 135534747 135006516
133851895
chr13 chr14 chr15
chr16
115169878 107349540 102531392
90354753
chr17 chr18 chr19
chr20
81195210 78077248 59128983
63025520
chr21 chr22 chrX
chrY
48129895 51304566 155270560
59373566
chrM chr1_gl000191_random chr1_gl000192_random
chr4_ctg9_hap1
16571 106433 547496
590426
chr4_gl000193_random chr4_gl000194_random chr6_apd_hap1
chr6_cox_hap2
189789 191469 4622290
4795371
chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5
chr6_qbl_hap6
4610396 4683263 4833398
4611984
chr6_ssto_hap7 chr7_gl000195_random chr8_gl000196_random
chr8_gl000197_random
4928567 182896 38914
37175
chr9_gl000198_random chr9_gl000199_random chr9_gl000200_random
chr9_gl000201_random
90085 169874 187035
36148
chr11_gl000202_random chr17_ctg5_hap1 chr17_gl000203_random
chr17_gl000204_random
40103 1680828 37498
81310
chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random
chr19_gl000208_random
174588 41001 4262
92689
chr19_gl000209_random chr21_gl000210_random chrUn_gl000211
chrUn_gl000212
159169 27682 166566
186858
chrUn_gl000213 chrUn_gl000214 chrUn_gl000215
chrUn_gl000216
164239 137718 172545
172294
chrUn_gl000217 chrUn_gl000218 chrUn_gl000219
chrUn_gl000220
172149 161147 179198
161802
chrUn_gl000221 chrUn_gl000222 chrUn_gl000223
chrUn_gl000224
155397 186861 180455
179693
chrUn_gl000225 chrUn_gl000226 chrUn_gl000227
chrUn_gl000228
211173 15008 128374
129120
chrUn_gl000229 chrUn_gl000230 chrUn_gl000231
chrUn_gl000232
19913 43691 27386
40652
chrUn_gl000233 chrUn_gl000234 chrUn_gl000235
chrUn_gl000236
45941 40531 34474
41934
chrUn_gl000237 chrUn_gl000238 chrUn_gl000239
chrUn_gl000240
45867 39939 33824
41933
chrUn_gl000241 chrUn_gl000242 chrUn_gl000243
chrUn_gl000244
42152 43523 43341
39929
chrUn_gl000245 chrUn_gl000246 chrUn_gl000247
chrUn_gl000248
36651 38154 36422
39786
chrUn_gl000249
38502
countrabtex <- easyRNASeq(filesDirectory=getwd(),
+ organism="Hsapiens",
chr.sizes=chr.sizes,
+ readLength=58L,
+ annotationMethod="gtf",
+ annotationFile="merged.gtf",
+ count="exons",
+ gapped=TRUE,
+ filenames=bamfiles[1])
Checking arguments...
Fetching annotations...
Read 454815 records
Summarizing counts...
Processing SRR490224.bam
Preparing output
Warning messages:
1: In .Method(..., deparse.level = deparse.level) :
number of columns of result is not a multiple of vector length (arg
1)
2: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens",
chr.sizes = chr.sizes, :
There are 229182 features/exons defined in your annotation that
overlap! This implies that some reads will be counted more than once!
Is that really what you want?
3: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens",
chr.sizes = chr.sizes, :
You enforce UCSC chromosome conventions, however the provided
annotation is not compliant. Correcting it.
4: In fetchCoverage(rnaSeq, format = format, filename = filename,
filter = filter, :
You enforce UCSC chromosome conventions, however the provided
alignments are not compliant. Correcting it.
5: In fetchCoverage(rnaSeq, format = format, filename = filename,
filter = filter, :
The read length stored in the object (probably provided as
argument): 58
is not the same as the ones: 58, 57, 56, 55, 54, 53, 52, 51, 50, 49,
48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32,
31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15,
14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 determined from the
file: /Documents/clients/Phyllis/learning_easyRNAseq/SRR490224.bam
Updating the readLength slot.
6: In fetchCoverage(rnaSeq, format = format, filename = filename,
filter = filter, :
Not all the chromosome names in your chromosome size list
'chr.sizes' are present in your read file(s) (aln or bam).
7: In fetchCoverage(rnaSeq, format = format, filename = filename,
filter = filter, :
The available chromosomes in both your read file(s) (aln or bam) and
'chr.sizes' list were restricted to their common term.
These are: chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16,
chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5,
chr6, chr7, chr8, chr9, chrM, chrX, chrY.
> write.csv(countrabtex,"countrabtexmerged.csv" )
THE OUTPUT FILE CONTAINED AN EXON COUNT LIST.
EXONS WERE LABELLED "XLOC".
QUESTION: IS THERE A WAY TO GET THE ENSEMBL NUMBER FOR
KNOWN EXONS AND RESERVE "XLOC" FOR NEW ONES?
NOW FOR TRANSCRIPTS:
> RnaSeq <- easyRNASeq(filesDirectory=getwd(),
+ organism="Hsapiens",
chr.sizes=chr.sizes,
+ readLength=58L,
+ annotationMethod="gtf",
+ annotationFile="merged.gtf",
+ count="exons",
+ gapped=TRUE,
+ filenames=bamfiles[1],
+ outputFormat="RNAseq")
Checking arguments...
Fetching annotations...
Read 454815 records
Summarizing counts...
Processing SRR490224.bam
Preparing output
Warning messages:
1: In .Method(..., deparse.level = deparse.level) :
number of columns of result is not a multiple of vector length (arg
1)
2: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens",
chr.sizes = chr.sizes, :
There are 229182 features/exons defined in your annotation that
overlap! This implies that some reads will be counted more than once!
Is that really what you want?
3: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens",
chr.sizes = chr.sizes, :
You enforce UCSC chromosome conventions, however the provided
annotation is not compliant. Correcting it.
4: In fetchCoverage(rnaSeq, format = format, filename = filename,
filter = filter, :
You enforce UCSC chromosome conventions, however the provided
alignments are not compliant. Correcting it.
5: In fetchCoverage(rnaSeq, format = format, filename = filename,
filter = filter, :
The read length stored in the object (probably provided as
argument): 58
is not the same as the ones: 58, 57, 56, 55, 54, 53, 52, 51, 50, 49,
48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32,
31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15,
14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 determined from the
file: /Documents/clients/Phyllis/learning_easyRNAseq/SRR490224.bam
Updating the readLength slot.
6: In fetchCoverage(rnaSeq, format = format, filename = filename,
filter = filter, :
Not all the chromosome names in your chromosome size list
'chr.sizes' are present in your read file(s) (aln or bam).
7: In fetchCoverage(rnaSeq, format = format, filename = filename,
filter = filter, :
The available chromosomes in both your read file(s) (aln or bam) and
'chr.sizes' list were restricted to their common term.
These are: chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16,
chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5,
chr6, chr7, chr8, chr9, chrM, chrX, chrY.
> class(RnaSeq)
[1] "RNAseq"
attr(,"package")
[1] "easyRNASeq"
>
> RnaSeqTranscripts<-transcriptCounts(RnaSeq)
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 easyRNASeq_1.4.2
[3] ShortRead_1.16.3 latticeExtra_0.6-24
[5] RColorBrewer_1.0-5 Rsamtools_1.10.2
[7] DESeq_1.10.1 lattice_0.20-10
[9] locfit_1.5-8 BSgenome_1.26.1
[11] GenomicRanges_1.10.5 Biostrings_2.26.2
[13] IRanges_1.16.4 edgeR_3.0.7
[15] limma_3.14.3 biomaRt_2.14.0
[17] Biobase_2.18.0 genomeIntervals_1.14.0
[19] BiocGenerics_0.4.0 intervals_0.13.3
loaded via a namespace (and not attached):
[1] annotate_1.36.0 AnnotationDbi_1.20.3 bitops_1.0-4.2
DBI_0.2-5
[5] genefilter_1.40.0 geneplotter_1.36.0 grid_2.15.2
hwriter_1.3
[9] RCurl_1.95-3 RSQLite_0.11.2 splines_2.15.2
stats4_2.15.2
[13] survival_2.37-2 XML_3.95-0.1 xtable_1.7-0
zlibbioc_1.4.0
>
QUESTION: IS THERE ANY WAY TO GET TRANSCRIPT COUNTS
INCLUDING NEW TRANSCRIPTS ASSEMBLED BY CUFFMERGE?
QUESTION: DO YOU HAVE ANY SUGGESTIONS FOR ANOTHER WAY OF GETTING
COUNTS OF NOVEL TRANSCRIPTS ASSEMBLED BY CUFFMERGE?
CUFFDIFF GIVES NON-INTEGRAL TRANSCRIPT NUMBERS SO I THINK THAT IT DOES
SOME SORT OF NORMALIZATION WHICH I DON'T WANT.
Thanks and best wishes,
Rich
Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)/
Columbia Initiative in Systems Biology
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman@cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/
"Complex numbers! Ha! Ha! There is nothing weirder
than imaginary numbers. Architects don't need to know
complex numbers. Whenever I get a negative root for
an area, I throw it out. And don't talk to me about
quaternions. I am not going into computer animation."
-Rose Friedman, age 16
[[alternative HTML version deleted]]