
Question

Error in Create switchAnalyzeRlist


bpshtiwan ▴ 10

@bpshtiwan-14942


Iraq

Hello everyone,

I am using IsoformSwitchAnalyzeR and I am stuck at the "Create switchAnalyzeRlist" step. According to the error message, it looks like I have a problem with my GTF file. I would really appreciate any help with finding a GTF file that contains the haplotype information mentioned here: "You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf"

Here is the error output:

""" For mor

Create switchAnalyzeRlist

aSwitchList <- importRdata(
    isoformCountMatrix   = Quant$counts,
    isoformRepExpression = Quant$abundance,
    designMatrix         = myDesign,
    isoformExonAnnoation = "../../Kallisto/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.gtf",
    isoformNtFasta       = "../../Kallisto/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.cdna.all.fa",
    fixStringTieAnnotationProblem = TRUE,
    showProgress = FALSE
)

Step 1 of 7: Checking data...
Step 2 of 7: Obtaining annotation...
    importing GTF (this may take a while)...
Error in importRdata(isoformCountMatrix = Quant$counts, isoformRepExpression = Quant$abundance, :
  The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925).
  Either isforoms found in the annotation are not quantifed or vise versa.
  Specifically:
  44937 isoforms were quantified.
  85704 isoforms are annotated.
  Only 0 overlap.
  44937 isoforms quantifed had no corresponding annoation

This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.

If there is no overlap (as in zero or close) there are two options:
  1) The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
  2) It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
  Examples from expression matrix are : ENSGALT00010031805.1, ENSGALT00010003870.1, ENSGALT00010064047.1
  Examples of annoation are : XM_046899296.1, XR_005859820.2, XM_015278322.4
  Examples of isoforms which were only found im the quantification are : ENSGALT00010063466.1, ENSGALT00010031621.1, ENSGALT00010000926.1

If there is a large overlap but still far from complete there are 3 possibilites:
  1) The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
  2) If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf
  3) One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)
  4) It is somthing to do with how a subset of the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
"""

AlternativeSplicing RNASeq IsoformSwitchAnalyzeR gtf • 2.6k views

ADD COMMENT • link written 2.7 years ago by bpshtiwan ▴ 10

score 2 · Accepted Answer · 2022-08-15


k.vitting.seerup ▴ 120

@kvittingseerup-7956


European Union

Looks like you are supplying the wrong annotation as a GTF file.

If you look at the examples in the error message, the IDs in the GTF annotation are called "XM_*" whereas the quantified transcripts are called "ENSGA...".
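A mismatch like this can be confirmed by comparing the two ID sets directly. The following is a minimal base-R sketch, not part of IsoformSwitchAnalyzeR: `compare_isoform_ids` is a hypothetical helper name, and the example IDs are the ones printed in the error message above.

```r
# Hypothetical helper (not an IsoformSwitchAnalyzeR function):
# summarise how well two sets of isoform IDs agree.
compare_isoform_ids <- function(quant_ids, anno_ids) {
  quant_ids <- unique(quant_ids)
  anno_ids  <- unique(anno_ids)
  n_overlap <- length(intersect(quant_ids, anno_ids))
  n_union   <- length(union(quant_ids, anno_ids))
  list(
    n_quantified = length(quant_ids),
    n_annotated  = length(anno_ids),
    n_overlap    = n_overlap,
    # Jaccard similarity: shared IDs divided by all distinct IDs
    jaccard      = if (n_union > 0) n_overlap / n_union else NA_real_,
    # ID prefix = everything before the first digit, e.g. "ENSGALT" or "XM_"
    quant_prefix = table(sub("[0-9].*$", "", quant_ids)),
    anno_prefix  = table(sub("[0-9].*$", "", anno_ids))
  )
}

# Example IDs copied from the error message
quant_ids <- c("ENSGALT00010031805.1", "ENSGALT00010003870.1", "ENSGALT00010064047.1")
anno_ids  <- c("XM_046899296.1", "XR_005859820.2", "XM_015278322.4")

res <- compare_isoform_ids(quant_ids, anno_ids)
res$n_overlap           # number of shared IDs
names(res$quant_prefix) # prefixes seen in the quantification
names(res$anno_prefix)  # prefixes seen in the annotation
```

With real data the two vectors would come from the quantification object and the imported GTF. Where exactly the IDs live (for example an `isoform_id` column in `Quant$counts`) depends on how `Quant` was created, so treat that as an assumption to check.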

Cheers

Kristoffer

ADD COMMENT • link 2.6 years ago k.vitting.seerup ▴ 120


Definitely, you are right. I checked the reference files (cDNA FASTA and GTF) and the IDs are different, even though I downloaded both the GTF and the cDNA FASTA file from the same source.

Kindly, how can I solve this problem? Would it work if I changed "ENSGA" to "XM_" in the FASTA file?

Best regards


!genome-build-accession NCBI_Assembly:GCF_016699485.2

!annotation-source NCBI Gallus gallus Annotation Release 106

NC_052532.1 Gnomon gene 13550 19518 . - . gene_id "LOC124418406"; transcript_id ""; db_xref "GeneID:124418406"; gbkey "Gene"; gene "LOC124418406"; gene_biotype "lncRNA";
NC_052532.1 Gnomon transcript 13550 19518 . - . gene_id "LOC124418406"; transcript_id "XR_006939951.1"; db_xref "GeneID:124418406"; experiment "COORDINATES: cap analysis [ECO:0007248]"; gbkey "ncRNA"; gene "LOC124418406"; model_evidence "Supporting evidence includes similarity to: 1 long SRA read, and 100% coverage of the annotated genomic feature by RNAseq alignments"; product "uncharacterized LOC124418406"; transcript_biotype "lnc_RNA";
NC_052532.1 Gnomon exon 19484 19518 . - . gene_id "LOC124418406"; transcript_id "XR_006939951.1"; db_xref "GeneID:124418406"; experiment "COORDINATES: cap analysis [ECO:0007248]"; gene "LOC124418406"; model_evidence "Supporting evidence includes similarity to: 1 long SRA read, and 100% coverage of the annotated genomic feature by RNAseq alignments"; product "uncharacterized LOC124418406"; transcript_biotype "lnc_RNA"; exon_number "1";

These are the first lines of the FASTA file:

>ENSGALT00010000007.1 cdna primary_assembly:bGalGal1.mat.broiler.GRCg7b:MT:2824:3798:1 gene:ENSGALG00010000007.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ND1 description:NADH dehydrogenase subunit 1 [Source:NCBI gene (formerly Entrezgene);Acc:63549479]
ATGACCCTGCCCACCCTAACAAACCTTCTAATCATAACCTTATCCTATATTCTCCCCATC
CTAATCGCCGTGGCCTTCTTAACACTTGTAGAACGAAAAATCCTAAGCTACATGCAGGCC
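The two excerpts show where each file stores its transcript ID: the GTF keeps it in the `transcript_id "..."` attribute, and the FASTA keeps it as the first token of the header line. As a rough sketch in base R (both function names are hypothetical, and the GTF line is shortened from the excerpt), the IDs can be pulled out like this:

```r
# Hypothetical helpers for illustration; neither is part of any package.

# Transcript ID from a FASTA header: text up to the first space, without ">".
fasta_ids <- function(header_lines) {
  sub("^>?([^ ]+).*$", "\\1", header_lines)
}

# Transcript ID from a GTF line: the value of transcript_id "...".
# Returns NA where the attribute is absent or empty (as on "gene" rows).
gtf_transcript_ids <- function(gtf_lines) {
  has_id <- grepl('transcript_id "[^"]+"', gtf_lines)
  ids <- rep(NA_character_, length(gtf_lines))
  ids[has_id] <- sub('.*transcript_id "([^"]+)".*', "\\1", gtf_lines[has_id])
  ids
}

# Header copied (shortened) from the FASTA excerpt
fa_header <- ">ENSGALT00010000007.1 cdna primary_assembly:bGalGal1.mat.broiler.GRCg7b:MT:2824:3798:1"

# Lines shortened from the GTF excerpt
gtf_lines <- c(
  'NC_052532.1 Gnomon gene 13550 19518 . - . gene_id "LOC124418406"; transcript_id "";',
  'NC_052532.1 Gnomon exon 19484 19518 . - . gene_id "LOC124418406"; transcript_id "XR_006939951.1"; exon_number "1";'
)

fasta_ids(fa_header)
gtf_transcript_ids(gtf_lines)
```

Renaming the prefix in the FASTA file would not make these two ID schemes refer to the same transcript models, so the sketch is only meant for checking which scheme a file uses.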

ADD REPLY • link 2.6 years ago bpshtiwan ▴ 10


I just double checked and the fasta files found here should work. Please also take a look at this if you want to use Ensembl.

ADD REPLY • link 2.6 years ago k.vitting.seerup ▴ 120


OK, have you looked at the cDNA FASTA file for chicken (maternal broiler), Gallus gallus? I am using Kallisto for alignment and quantification, so I need to use the cDNA FASTA.

ADD REPLY • link 2.6 years ago bpshtiwan ▴ 10


Oh sorry, I missed the species. Yes, they also match for chicken :-) (both being called "ENSGALT..."). The other file, the one with the "XM_"/"XR_" IDs, is from NCBI.

ADD REPLY • link 2.6 years ago k.vitting.seerup ▴ 120


Great, now the IDs are the same. But I still get the error listing the four possibilities when using IsoformSwitchAnalyzeR. Kindly, any idea where the problem is?

Create switchAnalyzeRlist

aSwitchList <- importRdata(
    isoformCountMatrix   = Quant$counts,
    isoformRepExpression = Quant$abundance,
    designMatrix         = myDesign,
    isoformExonAnnoation = "../../Kallisto/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.107.chr.gtf.gz",
    isoformNtFasta       = "../../Kallisto/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.cdna.all.fa.gz",
    removeNonConvensionalChr = TRUE,
    fixStringTieAnnotationProblem = TRUE,
    ignoreAfterBar = TRUE,
    ignoreAfterSpace = TRUE,
    ignoreAfterPeriod = TRUE
)


Step 2 of 7: Obtaining annotation...
    importing GTF (this may take a while)...
Error in importRdata(isoformCountMatrix = Quant$counts, isoformRepExpression = Quant$abundance, :
  The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925).
  Either isforoms found in the annotation are not quantifed or vise versa.
  Specifically:
  44937 isoforms were quantified.
  44362 isoforms are annotated.
  Only 44362 overlap.
  575 isoforms quantifed had no corresponding annoation

This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.

If there is no overlap (as in zero or close) there are two options:
  1) The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
  2) It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
  Examples from expression matrix are : ENSGALT00010007474, ENSGALT00010044269, ENSGALT00010044286
  Examples of annoation are : ENSGALT00010021236, ENSGALT00010047725, ENSGALT00010061756
  Examples of isoforms which were only found im the quantification are : ENSGALT00010004235, ENSGALT00010003686, ENSGALT00010000320

If there is a large overlap but still far from complete there are 3 possibilites:
  1) The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
  2) If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf
  3) One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)
  4) It is somthing to do with how a subset of the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.

ADD REPLY • link 2.6 years ago bpshtiwan ▴ 10


That sounds like the annotation you supplied is a different version from the one you used for the quantification?
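One way to narrow this down is to look at where the 575 unmatched isoforms sit. The sketch below is base R only and rests on assumptions: `missing_isoform_regions` is a hypothetical helper, the header layout is taken from the FASTA excerpt earlier in the thread, and the second header (ID and scaffold name) is made up purely for illustration. If the unmatched isoforms turn out to lie on sequences that are absent from the `.chr` GTF, that would point to possibility 2 or 3 in the error message rather than to a version mismatch.

```r
# Hypothetical helper: tabulate the sequence region of quantified isoforms
# that have no match in the annotation.
# Assumes Ensembl-style cDNA headers whose third token looks like
# "primary_assembly:<assembly>:<region>:<start>:<end>:<strand>".
missing_isoform_regions <- function(fasta_headers, quant_ids, anno_ids) {
  strip_version <- function(x) sub("\\.[0-9]+$", "", x)
  header_id <- strip_version(sub("^>?([^ ]+).*$", "\\1", fasta_headers))
  location  <- sub("^[^ ]+ [^ ]+ ([^ ]+).*$", "\\1", fasta_headers)
  region    <- vapply(strsplit(location, ":", fixed = TRUE),
                      function(p) p[3], character(1))
  missing   <- setdiff(strip_version(quant_ids), strip_version(anno_ids))
  table(region[header_id %in% missing])
}

headers <- c(
  # copied (shortened) from the FASTA excerpt in this thread
  ">ENSGALT00010000007.1 cdna primary_assembly:bGalGal1.mat.broiler.GRCg7b:MT:2824:3798:1",
  # made-up ID and scaffold name, for illustration only
  ">ENSGALT00019999999.1 cdna primary_assembly:bGalGal1.mat.broiler.GRCg7b:SCAFFOLD_X:1:100:1"
)
quant_ids <- c("ENSGALT00010000007.1", "ENSGALT00019999999.1")
anno_ids  <- c("ENSGALT00010000007")

res <- missing_isoform_regions(headers, quant_ids, anno_ids)
res
```

On real data, `fasta_headers` could be read with something like `grep("^>", readLines(gzfile(path)), value = TRUE)`, where `path` points at the cDNA FASTA used for the Kallisto index.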

ADD REPLY • link 2.6 years ago k.vitting.seerup ▴ 120