Question

Issues with SGSeq: analyzeFeatures() returns errors on BAM files

1

Entering edit mode

Sylvain Foisy ▴ 80

@sylvain-foisy-5539

Last seen 5.6 years ago

Canada

Hi,

I am trying to run SGSeq over my data and I have issues with the analyzeFeatures method. I prepared both my data with getBamInfo() and built my annotations with importTranscripts() method using the GTF file that I used for alignment with tophat 2. At the analyzeFeatures() step, I get this:

> sgfc_ncbi<-analyzeFeatures(data.r,features=sgfdb,cores=8)

Obtain counts...

Erreur :

Error in getSGFeatureCountsPerSample for MONOCYTES_169377:

Error in value[[3L]](cond) : 'scanBam' failed:

record: 0

error: 0

file: /path/toward/myBamFiles/accepted_hits_properly_paired.bam

index: /path/toward/myBamFiles/accepted_hits_properly_paired.bam

And so on for all my samples... These are paired-end BAM files, as checked via getBamInfo() and they do have the XS tag so I am at a lost right now.

Just in case:

> sessionInfo()

R version 3.2.0 (2015-04-16)

Platform: x86_64-unknown-linux-gnu (64-bit)

Running under: CentOS release 6.5 (Final)

locale:

[1] LC_CTYPE=fr_CA.UTF-8 LC_NUMERIC=C

[3] LC_TIME=fr_CA.UTF-8 LC_COLLATE=fr_CA.UTF-8

[5] LC_MONETARY=fr_CA.UTF-8 LC_MESSAGES=fr_CA.UTF-8

[7] LC_PAPER=fr_CA.UTF-8 LC_NAME=C

[9] LC_ADDRESS=C LC_TELEPHONE=C

[11] LC_MEASUREMENT=fr_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] stats4 parallel stats graphics grDevices utils datasets

[8] methods base

other attached packages:

[1] SGSeq_1.4.3 GenomicRanges_1.22.4 GenomeInfoDb_1.6.3

[4] IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1

loaded via a namespace (and not attached):

[1] igraph_1.0.1 AnnotationDbi_1.32.3

[3] XVector_0.10.0 magrittr_1.5

[5] zlibbioc_1.16.0 GenomicAlignments_1.6.3

[7] BiocParallel_1.4.3 tools_3.2.0

[9] SummarizedExperiment_1.0.2 Biobase_2.30.0

[11] DBI_0.4-1 lambda.r_1.1.9

[13] futile.logger_1.4.3 rtracklayer_1.30.4

[15] futile.options_1.0.0 bitops_1.0-6

[17] RUnit_0.4.31 biomaRt_2.26.1

[19] RCurl_1.95-4.8 RSQLite_1.0.0

[21] GenomicFeatures_1.22.13 Biostrings_2.38.4

[23] Rsamtools_1.22.0 XML_3.98-1.4

Any idea?

Best regards

Sylvain Foisy, Ph. D.

Montreal Heart Institute

Montreal, Qc

sgseq • 3.7k views

ADD COMMENT • link updated 5.3 years ago by anjanhazra ▴ 10 • written 8.7 years ago by Sylvain Foisy ▴ 80

score 2 · Answer 1 · 2016-08-05

2

Entering edit mode

Leonard Goldstein ▴ 260

@leonard-goldstein-6845

Last seen 10 months ago

Australia

Hi Sylvain,

I don't know for sure without seeing your data, but one situation that leads to this type of error is if the annotation and BAM files have incompatible chromosome names i.e. one might follow UCSC naming conventions (e.g. "chr1") and the other NCBI conventions (e.g. "1"). Can you check that these are compatible? Also note that you are working with an outdated Bioconductor release (although this should be unrelated to your problem).

Best regards

Leonard

ADD COMMENT • link 8.7 years ago Leonard Goldstein ▴ 260

0

Entering edit mode

Hi Leonard,

Yes take 1, indeed my alignements were made with annotations from Illumina's iGenome data from NCBI; I used "seqlevelsStyle(txdb) <- "NCBI"" as specified elsewhere to go over this problem. Yes, take 2, you are right it ain't the latest but as of now, I am still stuck using this version for a while... I am trying the same code on a newer version on my Mac.

Is SGSeq using the BAI file created via samtools? I ask this because of the line "index: " which is pointing toward the BAM file, not its index... What is the scanBam method actually doing that might fail?

Best regards

S

ADD REPLY • link 8.7 years ago Sylvain Foisy ▴ 80

0

Entering edit mode

Hi Leonard,

Ok, I looked into running elsewhere and well, I have errors of a different kind. I build a smaller version of my analysis (2 BAM per condition with only 2 conditions) and using the latest R/Bioconductor on my MacBook. Using the same pipeline, I now get these:

Obtain counts...
Erreur :
Error in getSGFeatureCountsPerSample for MONOCYTES_169377:
Error in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file), :
seqlevels(param) not in BAM header:
seqlevels: ‘chr1_GL383518v1_alt’, ‘chr1_GL383519v1_alt’, ‘chr1_GL383520v2_alt’, ‘chr1_KI270759v1_alt’, ‘chr1_KI270761v1_alt’, ‘chr1_KI270762v1_alt’, ‘chr1_KI270763v1_alt’, ‘chr1_KI270765v1_alt’, ‘chr1_KI270766v1_alt’, ‘chr1_KI270892v1_alt’, ‘chr2_GL383521v1_alt’, ‘chr2_GL383522v1_alt’, ‘chr2_GL582966v2_alt’, ‘chr2_KI270767v1_alt’, ‘chr2_KI270768v1_alt’, ‘chr2_KI270769v1_alt’, ‘chr2_KI270770v1_alt’, ‘chr2_KI270772v1_alt’, ‘chr2_KI270773v1_alt’, ‘chr2_KI270774v1_alt’, ‘chr2_KI270776v1_alt’, ‘chr2_KI270893v1_alt’, ‘chr2_KI270894v1_alt’, ‘chr3_GL383526v1_alt’, ‘chr3_JH636055v2_alt’, ‘chr3_KI270777v1_alt’, ‘chr3_KI270779v1_alt’, ‘chr3_KI270780v1_alt’, ‘chr3_KI270782v1_alt’, ‘chr3_KI270783'

This tells you something interesting? FYI, I did the original alignments using tophat 2 with the Illumina's iGenomes human GRCh38-based indexes and GTF files.

Best regards

S

ADD REPLY • link 8.7 years ago Sylvain Foisy ▴ 80

1

Entering edit mode

Hi Sylvain,

Yes the more recent Bioconductor installation gives a more informative error message. It tells you that the errors are due to chromosome names in your annotation that cannot be found in the BAM file header. If you remove the problematic chromosomes from your annotation, this should fix the problem, e.g. for your data you can use something like

> seqlevels_in_bam <- seqlevels(BamFile(data.r$file_bam[1])) > sgfdb2 <- keepSeqlevels(sgfdb, seqlevels_in_bam)

and then run analyzeFeatures() using sgfdb2.

Regarding your earlier question, the scanBam() function has an argument 'index' which can be used to specify the name of the index file. By default this is simply the name of the BAM file with extension '.bai' appended.

I hope this answers your questions, let me know if you run into any other problems.

Best wishes

Leonard

ADD REPLY • link 8.7 years ago Leonard Goldstein ▴ 260

0

Entering edit mode

Hi Leonard,

Success!! Your last inputs made the thing work ;-) I guess that I will have to make some preprocessing first on this criteria.

Thanks for all your help. Can't promise I won't need additional help in a few hours/days ;-)

S

ADD REPLY • link 8.7 years ago Sylvain Foisy ▴ 80

score 2 · Answer 2 · 2017-09-28

Just had a similar problem but figured it out thanks to this post! `seqlevels` in bam files did not match `hsapiens_gene_ensembl` annotations.

If using Ensembl annotation

> hsaEnsembl <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl")
> txf <- convertToTxFeatures(hsaEnsembl)
> Gene <- genes(edb, filter = list(GenenameFilter("CD19"), GeneBiotypeFilter("protein_coding")))
> txf_CD19 <- txf[txf %over% Gene]

But the chromosomes do not match

> seqlevels(BamFile(si$file_bam[1]))
[1] "1"          "2"          "3"          "4"          "5"          "6"          "7"          "8"          "9"          "10"      
[11] "11"         "12"         "13"         "14"         "15"         "16"         "17"         "18"         "19"         "20"      
[21] "21"         "22"         "X"          "Y"          "MT"         "GL000191.1" "GL000192.1" "GL000193.1" "GL000194.1" "GL000195.1"
[31] "GL000196.1" "GL000197.1" "GL000198.1" "GL000199.1" "GL000200.1" "GL000201.1" "GL000202.1" "GL000203.1" "GL000204.1" "GL000205.1"
[41] "GL000206.1" "GL000207.1" "GL000208.1" "GL000209.1" "GL000210.1" "GL000211.1" "GL000212.1" "GL000213.1" "GL000214.1" "GL000215.1"
[51] "GL000216.1" "GL000217.1" "GL000218.1" "GL000219.1" "GL000220.1" "GL000221.1" "GL000222.1" "GL000223.1" "GL000224.1" "GL000225.1"
[61] "GL000226.1" "GL000227.1" "GL000228.1" "GL000229.1" "GL000230.1" "GL000231.1" "GL000232.1" "GL000233.1" "GL000234.1" "GL000235.1"
[71] "GL000236.1" "GL000237.1" "GL000238.1" "GL000239.1" "GL000240.1" "GL000241.1" "GL000242.1" "GL000243.1" "GL000244.1" "GL000245.1"
[81] "GL000246.1" "GL000247.1" "GL000248.1" "GL000249.1"

> seqlevels(txf_CD19)[grepl("GL", seqlevels(txf_CD19))]
[1] "GL000008.2" "GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000208.1" "GL000213.1" "GL000214.1" "GL000216.2" "GL000218.1"
[11] "GL000219.1" "GL000220.1" "GL000221.1" "GL000224.1" "GL000225.1" "GL000226.1"

Just keep only chromosomes 1-22,X,Y,MT

> seqlevels_in_bam <- seqlevels(BamFile(si$file_bam[1]))[1:25]
> txf2 <- keepSeqlevels(txf_CD19, seqlevels_in_bam)
>  sgfc_CD19 <- analyzeFeatures(si, features = txf2)

score 0 · Answer 3 · 2019-12-20

I followed the instructions privided by Dr. Leonard to overcome the similar issue and got this. Can you please help out to get rid of this error?

sgfWRKY332 <- keepSeqlevels(sgfWRKY33, seqlevelsinbam) Error in keepSeqlevels(sgfWRKY33, seqlevelsinbam) : invalid seqlevels: Scaffold2, Scaffold4, Scaffold5, Scaffold7, Scaffold8, Scaffold9, Scaffold11, Scaffold12, Scaffold13, Scaffold14, Scaffold16, Scaffold17, Scaffold18, Scaffold22, Scaffold23, Scaffold26, Scaffold27, Scaffold30, Scaffold31, Scaffold33, Scaffold35, Scaffold36, Scaffold38, Scaffold39, Scaffold41, Scaffold42, Scaffold43, Scaffold44, Scaffold45, Scaffold46, Scaffold47, Scaffold48, Scaffold51, Scaffold52, Scaffold54, Scaffold56, Scaffold57, Scaffold58, Scaffold60, Scaffold61, Scaffold62, Scaffold64, Scaffold65, Scaffold66, Scaffold67, Scaffold70, Scaffold71, Scaffold72, Scaffold73, Scaffold74, Scaffold76, Scaffold77, Scaffold78, Scaffold79, Scaffold80, Scaffold81, Scaffold82, Scaffold83, Scaffold85, Scaffold86, Scaffold87, Scaffold88, Scaffold89, Scaffold90, Scaffold91, Scaffold92, Scaffold93, Scaffold94, Scaffold95, Scaffold96, Scaffold97, Scaffold98, Scaffold99, Scaffold100, Scaffold102, Scaffold103, Scaffold104, Scaffold105, Scaffold106, Scaff