Question

Error in strsplit() : non-character argument - using DEXSeq

0

Entering edit mode

sakura.nussbaum ▴ 10

@sakuranussbaum-10160

Last seen 9.0 years ago

Dear all,

I am a new user of DEXSeq and I would need some help concerning an error message that I get.

When I call the DEXSeqDataSetFromHTSeq, I obtain the following error:

> library(DEXSeq)
> dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData=sampleTable, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile)
Error in strsplit(rownames(dcounts), ":") : non-character argument

Here is what countFiles and sampleTable look like:

> countFiles
[1] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_1.txt"   
[2] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_2.txt"   
[3] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_3.txt"   
[4] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_1.txt"   
[5] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_2.txt"   
[6] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_3.txt"   
[7] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_1.txt"    
[8] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_2.txt"    
[9] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_3.txt"    
[10] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_4.txt"    
[11] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_1.txt"     
[12] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_2.txt"     
[13] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_4.txt"     
[14] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_1.txt"     
[15] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_4.txt"     
[16] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_5.txt"     
[17] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_6.txt"     
[18] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_10_1.txt" 
[19] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_10_2.txt" 
[20] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_1.txt" 
[21] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_2.txt" 
[22] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_3.txt" 
[23] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_1.txt"
[24] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_2.txt"
[25] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_3.txt"
[26] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_1.txt"
[27] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_2.txt"
[28] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_3.txt"
> sampleTable
          condition    libType
basal1        basal single-end
basal2        basal single-end
basal3        basal single-end
bulge1        bulge single-end
bulge2        bulge single-end
bulge3        bulge single-end
cdh31          cdh3 single-end
cdh32          cdh3 single-end
cdh33          cdh3 single-end
cdh34          cdh3 single-end
gli11          gli1 single-end
gli12          gli1 single-end
gli14          gli1 single-end
lef11          lef1 single-end
lef14          lef1 single-end
lef15          lef1 single-end
lef16          lef1 single-end
lgr6101        lgr6 single-end
lgr6102        lgr6 single-end
lgr6401        lgr6 single-end
lgr6402        lgr6 single-end
lgr6403        lgr6 single-end
pdgfra101    pdgfra single-end
pdgfra102    pdgfra single-end
pdgfra103    pdgfra single-end
pdgfra401    pdgfra single-end
pdgfra402    pdgfra single-end
pdgfra403    pdgfra single-end

I guess this is correct. However, when looking at the count files, I have the impression that the .txt look weird. Here is the first one (sample: basal_1) as an example. This is all that is contained in the file:

_ambiguous   0
_ambiguous_readpair_position   0
_empty   9651846
_lowaqual   846199
_notaligned   421258

I created this file with this command (and successfully installed pysam before):

$ python /Library/Frameworks/R.framework/Versions/3.2/Resources/library/DEXSeq/python_scripts/dexseq_count.py -f bam mm10.genes.gff Sample_Basal_1_159/accepted_hits_no_rRNA.bam counts/basal_1.txt

Is this output normal with dexseq_count.py? If not, does anyone know what could be the problem? I tried with different options, but as my data are not paired-end and strand specific, I should use it like this.

Here is my sessionInfo() output:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DEXSeq_1.16.10             DESeq2_1.10.1             
 [3] RcppArmadillo_0.6.700.3.0  Rcpp_0.12.4               
 [5] SummarizedExperiment_1.0.2 GenomicRanges_1.22.4      
 [7] GenomeInfoDb_1.6.3         IRanges_2.4.8             
 [9] S4Vectors_0.8.11           Biobase_2.30.0            
[11] BiocGenerics_0.16.1        BiocParallel_1.4.3        

loaded via a namespace (and not attached):
 [1] genefilter_1.52.1    statmod_1.4.24       locfit_1.5-9.1      
 [4] splines_3.2.3        lattice_0.20-33      colorspace_1.2-6    
 [7] survival_2.39-2      XML_3.98-1.4         foreign_0.8-66      
[10] DBI_0.3.1            RColorBrewer_1.1-2   lambda.r_1.1.7      
[13] plyr_1.8.3           stringr_1.0.0        zlibbioc_1.16.0     
[16] Biostrings_2.38.4    munsell_0.4.3        gtable_0.2.0        
[19] futile.logger_1.4.1  hwriter_1.3.2        latticeExtra_0.6-28 
[22] geneplotter_1.48.0   biomaRt_2.26.1       AnnotationDbi_1.32.3
[25] acepack_1.3-3.3      xtable_1.8-2         scales_0.4.0        
[28] Hmisc_3.17-3         annotate_1.48.0      XVector_0.10.0      
[31] Rsamtools_1.22.0     gridExtra_2.2.1      ggplot2_2.1.0       
[34] stringi_1.0-1        grid_3.2.3           tools_3.2.3         
[37] bitops_1.0-6         magrittr_1.5         RCurl_1.95-4.8      
[40] RSQLite_1.0.0        Formula_1.2-1        cluster_2.0.4       
[43] futile.options_1.0.0 Matrix_1.2-5         rsconnect_0.4.2.1   
[46] rpart_4.1-10         nnet_7.3-12

dexseq_count DEXSeqDataSetFromHTSeq DEXSeq strsplit • 3.6k views

ADD COMMENT • link updated 9.0 years ago by aditi ▴ 20 • written 9.0 years ago by sakura.nussbaum ▴ 10

0

Entering edit mode

Hi Sakura,

Can you post the first lines of your annotation file?

mm10.genes.gff

Alejandro

ADD REPLY • link 9.0 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hello, of course, here they are:

chr1   unknown   gene   3214482   3671498   .   -   .   ID=Xkr4;Name=NM_001011874
chr1   unknown   exon   3214482   3216968   .   -   .   Parent=Xkr4
chr1   unknown   stop_codon   3216022   3216024   .   -   .   Parent=Xkr4
chr1   unknown   CDS   3216025   3216968   .   -   2   Parent=Xkr4
chr1   unknown   CDS   3421702   3421901   .   -   1   Parent=Xkr4
chr1   unknown   exon   3421702   3421901   .   -   .   Parent=Xkr4
chr1   unknown   CDS   3670552   3671348   .   -   0   Parent=Xkr4
chr1   unknown   exon   3670552   3671498   .   -   .   Parent=Xkr4
chr1   unknown   start_codon   3671346   3671348   .   -   .   Parent=Xkr4
chr1   unknown   gene   4290846   4293012   .   -   .   ID=Rp1;Name=NM_001195662

ADD REPLY • link 9.0 years ago sakura.nussbaum ▴ 10

0

Entering edit mode

Hi Sakura,

You file is not a gtf file, it is a gff file. You need a gtf file, as the ones provided by ensembl, and preprocess it using the script dexseq_prepare_annotation.py. Please read the DEXSeq vignette for more details.

Alejandro

ADD REPLY • link 9.0 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Yes, sorry you are right, this was a gff file that I obtained previously by converting the gtf.

So I realised the problem happened when preprocessing the gtf. Here are the 10 first lines of gtf file:

chr1   unknown   exon   3214482   3216968   .   -   .   gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1   unknown   stop_codon   3216022   3216024   .   -   .   gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1   unknown   CDS   3216025   3216968   .   -   2   gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1   unknown   CDS   3421702   3421901   .   -   1   gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1   unknown   exon   3421702   3421901   .   -   .   gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1   unknown   CDS   3670552   3671348   .   -   0   gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1   unknown   exon   3670552   3671498   .   -   .   gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1   unknown   start_codon   3671346   3671348   .   -   .   gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1   unknown   exon   4290846   4293012   .   -   .   gene_id "Rp1"; gene_name "Rp1"; p_id "P16188"; transcript_id "NM_001195662"; tss_id "TSS5760";
chr1   unknown   stop_codon   4292981   4292983   .   -   .   gene_id "Rp1"; gene_name "Rp1"; p_id "P16188"; transcript_id "NM_001195662"; tss_id "TSS5760";

When I run the script like this:

python /Library/Frameworks/R.framework/Versions/3.2/Resources/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py mm10.genes.gtf mm10.genes.gff

The output gives an empty gff file. That is why my count file is also only with empty counts...

ADD REPLY • link 9.0 years ago sakura.nussbaum ▴ 10

0

Entering edit mode

Hi Sakura,

Could you send me your gtf file so I could have a closer look?

Alejandro

ADD REPLY • link 9.0 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Here it is:

https://www.dropbox.com/s/m2zhchpc6vf5nmo/mm10.genes.gtf?dl=0

ADD REPLY • link 9.0 years ago sakura.nussbaum ▴ 10

0

Entering edit mode

Hi Sakura,

I did get an error message while running your same line of code:

Traceback (most recent call last): File "/Users/reyes/Work/Rpcks/DEXSeq/inst/python_scripts/dexseq_prepare_annotation.py", line 129, in <module> raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) ) ValueError: Same gene name found on two chromosomes: <GenomicFeature: exonic_part 'Eno1+Gm5506' at chr18: 48046732 -> 48048383 (strand '+')>, <GenomicFeature: exonic_part 'Eno1+Gm5506' at chr4: 150237196 -> 150237323 (strand '+')>

The error message should be self-explanatory.

Alejandro

ADD REPLY • link 9.0 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hello,

Thanks for the reply, I reinstalled DEXSeq and indeed obtained the same error, as it happened to me a few days ago. At this point I had tried to follow the advices given here: DEXSeq: problem with dexseq_prepare_annotation.py and this started to give me empty files. But now that I tried again, it seems to be better, so it was probably an error that came from the first time I tried to do modifications to the python script.

Thank you again for your help!!

Sakura

ADD REPLY • link 9.0 years ago sakura.nussbaum ▴ 10

score 0 · Answer 1 · 2016-04-27

0

Entering edit mode

aditi ▴ 20

@aditi-9925

Last seen 6.7 years ago

Indian Institute of Science,Bangalore, …

Hi Sakura,

Your count files seem to be full of empty counts(9million seems too high). If your library is strand specific I'm guessing that you must have taken care of the defining the library type during the alignment (firststrand or secondstrand if you used tophat, that is). Also, with some library prep kits like NEB ultra-directional kit and Illumina TruSeq kits the you need to specify strandedness as 'reverse' and not just 'yes' ('yes' is the default). You can also check for the kind of strandedness using IGV. Using the wrong information on strandedness usually gives you empty counts. Also, I'm hoping that the annotation file that you used for alignment is the same as the one that you used to create your gff.

Hope this helps!

ADD COMMENT • link 9.0 years ago aditi ▴ 20

0

Entering edit mode

Thank you for your answer!

It was indeed (one of) the problem(s). I put -s reverse and now I have 1 million empty counts. However, it still seems high. Do you think it can come from somewhere else?

Best,
Sakura

ADD REPLY • link 9.0 years ago sakura.nussbaum ▴ 10

0

Entering edit mode

Hi Sakura,

How many reads do you have per library?
In my experience, it is not that uncommon to get around ~10 to ~15% of "empty counts".

Alejandro

ADD REPLY • link 9.0 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hi Alejandro,

Indeed, I have around 10mio per library. So it seems ok!
Thank you again,

Sakura

ADD REPLY • link 9.0 years ago sakura.nussbaum ▴ 10