Question

Error in dbCreator function at PGA package

0

Entering edit mode

gabrielw • 0

@gabrielw-15924

Last seen 5.0 years ago

Hello all I just tried to use with my data PGA package and i had an error in dbCreator:

library(BSgenome.Hsapiens.UCSC.hg19)
dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,bedFile=bedfile,annotation_path=annotation_path,outfile_name=outfile_name,genome=Hsapiens,outdir=outfile_path)
Error in y[, z] : subscript out of bounds

this is the vcf header:

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)">
##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls">
##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">
#CHROM  POS ID  REF ALT QUAL
chr1    10043516    .   A   G   0.01
chr1    10044378    .   A   G   0.01

the gtf header:

iarc@iarc-H8QG6:/New_data/pancreatic/new_data$ head  -20 new_panc/merged_asm/merged.gtf 
1   Cufflinks   exon    852260  857889  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
1   Cufflinks   exon    857991  858025  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
1   Cufflinks   exon    878110  878438  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
1   Cufflinks   exon    878633  878757  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";

bedfile header:

1   12697   13221   JUNC_0  0   +   12697   13221   255,0,0 2   50,50   0,575
1   14829   14970   JUNC_1  1   -   14829   14970   255,0,0 2   50,50   0,192
1   14829   15021   JUNC_2  0   -   14829   15021   255,0,0 2   50,50   0,243
1   15012   25233   JUNC_3  0   +   15012   25233   255,0,0 2   50,50   0,10272
1   15038   15796   JUNC_4  3   -   15038   15796   255,0,0 2   50,50   0,809

annotation prepared with PrepareAnnotationRefseq2

any thoughts?

PGA Proteogenomics • 903 views

ADD COMMENT • link updated 5.0 years ago by Kevin Blighe ★ 4.0k • written 5.0 years ago by gabrielw • 0

Kevin Blighe · Answer 1 · 2020-04-28

0

Entering edit mode

Kevin Blighe ★ 4.0k

@kevin

Last seen 5 weeks ago

Republic of Ireland

I re-formatted your post by highlighting the code chunks and pressing the 101 010 button... to improve its aesthetics.

The likely cause is that your contigs ('chromosomes') are labeled differently: in your VCF, they have a 'chr' prefix; in your GTF and BED, they have no prefix.

I suppose that you have different choices to make about which contig style you wish to use. For example, adding a 'chr' prefix to your GTF and BED is as simple as:

awk '{print "chr"$0}' merged.gtf
*et cetera*

On the other hand, removing the 'chr' prefix from the VCF is as simple as:

sed 's/^chr//' merged.vcf

That sed command will automatically ignore the VCF header because we match the start of each line via the caret (^).

Kevin

ADD COMMENT • link 5.0 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Thank you for the patience to format it! and it passed straight from me this incompatibility of chr within files!! lol Thanks I fixed merged gtf and the bed file with chr but still I am getting the same error!

iarc@iarc-H8QG6:/New_data/pancreatic/new_data$ head merged.gtf
chr1    Cufflinks   exon    852260  857889  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
chr1    Cufflinks   exon    857991  858025  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
chr1    Cufflinks   exon    878110  878438  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1    Cufflinks   exon    878633  878757  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1    Cufflinks   exon    879078  879188  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1    Cufflinks   exon    879288  880180  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "4"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1    Cufflinks   exon    880422  880526  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "5"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1    Cufflinks   exon    880898  881033  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "6"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1    Cufflinks   exon    881527  881666  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "7"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1    Cufflinks   exon    881782  881925  .   +   .   gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "8"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
iarc@iarc-H8QG6:/New_data/pancreatic/new_data$ head junctions.bed 
chr1    12697   13221   JUNC_0  0   +   12697   13221   255,0,0 2   50,50   0,575

ADD REPLY • link updated 5.0 years ago by Kevin Blighe ★ 4.0k • written 5.0 years ago by gabrielw • 0

0

Entering edit mode

No problem. Okay, the next problem is that your VCF is corrupt and is not valid. A VCF should have the following mandatory headers:

1. #CHROM
2. POS
3. ID
4. REF
5. ALT
6. QUAL
7. FILTER
8. INFO

You need to go back a few steps and determine how this corrupt VCF was created

ADD REPLY • link 5.0 years ago Kevin Blighe ★ 4.0k