DESeq2 - extracting results - Gnomon gene model
1
0
Entering edit mode
sieminsk • 0
@3fc834aa
Last seen 3.3 years ago
Canada

I am working with a non-model organism for which there are 2 versions of the genome. I would like to use the newer version as it has a better resolution and completeness. There is no Ensembl gene model for this version of the genome. I have instead used Gnomon gene model included with the genome version deposited at NCBI. I was able to perform DESeq analysis. My questions is with regards to extracting fasta sequences from the original genome based on the results.

The results obtained from DESeq use the following gene ID format:

gene-NC_027757.2:10030825..10032141
gene-NC_027757.2:10036650..10043774
gene-NC_027757.2:10047118..10049010
gene-NC_027757.2:10076606..10077880
gene-NC_027757.2:10099173..10101111

The GFF file associated with the Gnomon gene model has the following entry format (per line):

NC_027757.2     Gnomon  gene    45762   45986   .       -       .       ID=geneNC_027757.2:45762..45986;description=gene.37187;gbkey=Gene;gene_biotype=protein_coding

The problem lies with the fact that the results format for each gene corresponds to the 9th column of the GFF file and is embedded with other information within this column. I am wondering if there is package such as Biostrings that allows me to retrieve the DNA sequences from the genome based on the Gnomon gene ID format. I have attempted to do this in Biostrings however, from what I can understand unless the "gene-NC_027757.2:10030825..10032141" is used as the gene name rather than being in the 9th column of the GFF file the Biostrings can't retrieve the corresponding sequences.

Are there packages or code chunk that you might be able to suggest which would allow me to use the results from DESeq2 to retrieve the corresponding genomic sequences within Bioconductor?

R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] stringr_1.4.0               genefilter_1.72.0          
 [3] ggplot2_3.3.2               PoiClaClu_1.0.2.1          
 [5] RColorBrewer_1.1-2          pheatmap_1.0.12            
 [7] DESeq2_1.30.0               BiocParallel_1.24.1        
 [9] GenomicAlignments_1.26.0    SummarizedExperiment_1.20.0
[11] MatrixGenerics_1.2.0        matrixStats_0.57.0         
[13] GenomicFeatures_1.42.1      AnnotationDbi_1.52.0       
[15] Biobase_2.50.0              Rsamtools_2.6.0            
[17] Biostrings_2.58.0           XVector_0.30.0             
[19] GenomicRanges_1.42.0        GenomeInfoDb_1.26.1        
[21] IRanges_2.24.0              S4Vectors_0.28.0           
[23] BiocGenerics_0.36.0        

loaded via a namespace (and not attached):
 [1] httr_1.4.2             bit64_4.0.5            splines_4.0.4         
 [4] assertthat_0.2.1       askpass_1.1            BiocFileCache_1.14.0  
 [7] blob_1.2.1             GenomeInfoDbData_1.2.4 yaml_2.2.1            
[10] progress_1.2.2         pillar_1.4.7           RSQLite_2.2.1         
[13] lattice_0.20-41        glue_1.4.2             digest_0.6.27         
[16] colorspace_2.0-0       Matrix_1.3-2           XML_3.99-0.5          
[19] pkgconfig_2.0.3        biomaRt_2.46.0         zlibbioc_1.36.0       
[22] purrr_0.3.4            xtable_1.8-4           scales_1.1.1          
[25] tibble_3.0.4           openssl_1.4.3          annotate_1.68.0       
[28] generics_0.1.0         ellipsis_0.3.1         withr_2.3.0           
[31] survival_3.2-7         magrittr_2.0.1         crayon_1.3.4          
[34] memoise_1.1.0          xml2_1.3.2             tools_4.0.4           
[37] prettyunits_1.1.1      hms_0.5.3              lifecycle_0.2.0       
[40] locfit_1.5-9.4         munsell_0.5.0          DelayedArray_0.16.0   
[43] compiler_4.0.4         tinytex_0.27           rlang_0.4.8           
[46] grid_4.0.4             RCurl_1.98-1.2         rstudioapi_0.13       
[49] rappdirs_0.3.1         bitops_1.0-6           gtable_0.3.0          
[52] DBI_1.1.0              curl_4.3               R6_2.5.0              
[55] dplyr_1.0.2            rtracklayer_1.50.0     bit_4.0.4             
[58] stringi_1.5.3          Rcpp_1.0.5             vctrs_0.3.5           
[61] geneplotter_1.68.0     dbplyr_2.0.0           tidyselect_1.1.0
DESeq2 Gnomon Biostrings • 1.2k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

I am not sure I completely understand your question. There doesn't seem to be anything particularly useful in the 9th (actually 8th by my count) column.

> library(rtracklayer)
> download.file("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/686/985/GCF_000686985.2_Bra_napus_v2.0/GCF_000686985.2_Bra_napus_v2.0_genomic.gff.gz", "GCF_000686985.2_Bra_napus_v2.0_genomic.gff.gz")
> gr <- import("GCF_000686985.2_Bra_napus_v2.0_genomic.gff.gz")
> gr
GRanges object with 1794744 ranges and 60 metadata columns:
               seqnames      ranges strand |   source            type     score
                  <Rle>   <IRanges>  <Rle> | <factor>        <factor> <numeric>
        [1] NC_027757.2  1-35822559      + |   RefSeq          region        NA
        [2] NC_027757.2   6637-9236      - |   Gnomon          gene          NA
        [3] NC_027757.2   6637-9236      - |   Gnomon          mRNA          NA
        [4] NC_027757.2   9092-9236      - |   Gnomon          exon          NA
        [5] NC_027757.2   8813-8994      - |   Gnomon          exon          NA
        ...         ...         ...    ... .      ...             ...       ...
  [1794740] NC_004946.1   6765-7175      + |   RefSeq gene                   NA
  [1794741] NC_004946.1   6765-7175      + |   RefSeq CDS                    NA
  [1794742] NC_004946.1  8175-11411      - |   RefSeq gene                   NA
  [1794743] NC_004946.1  8175-11411      - |   RefSeq CDS                    NA
  [1794744] NC_004946.1 11314-11640      + |   RefSeq inverted_repeat        NA
                phase                     ID
            <integer>            <character>
        [1]      <NA> NC_027757.2:1..35822..
        [2]      <NA>      gene-LOC106345161
        [3]      <NA>     rna-XM_013814943.2
        [4]      <NA>  exon-XM_013814943.2-1
        [5]      <NA>  exon-XM_013814943.2-2
        ...       ...                    ...
  [1794740]      <NA>        gene-BrnaMpl_p5
  [1794741]         0        cds-NP_862327.1
  [1794742]      <NA>        gene-BrnaMpl_p6
  [1794743]         0        cds-NP_862328.1
  [1794744]      <NA> id-NC_004946.1:11314..

<snip>

This GRanges object is what you would need to get the sequences (note that the RefSeq IDs in the seqnames column correspond to the chromosomes). You would have to do some mapping from the row.names of your DESeq2, results to extract the chromosome and positions for those genes, but that's just a simple text parsing exercise.

Now you need a BSgenome object, which you can make using the BSgenome package. I'm not going to go into that, except to note that there is a vignette, and that there are some example BSgenomePkgSeeds in that package:

> dir(system.file("extdata/GentlemanLab", package="BSgenome"))
  [1] "BSgenome.Amellifera.BeeBase.assembly4-seed"    
  [2] "BSgenome.Amellifera.BeeBase.assembly4-tools"   
  [3] "BSgenome.Amellifera.UCSC.apiMel2-seed"         
  [4] "BSgenome.Amellifera.UCSC.apiMel2.masked-seed"  
  [5] "BSgenome.Athaliana.TAIR.04232008-seed"         
  [6] "BSgenome.Athaliana.TAIR.TAIR10.1-seed"         
  [7] "BSgenome.Athaliana.TAIR.TAIR10.1-tools"        
  [8] "BSgenome.Athaliana.TAIR.TAIR9-seed"            
  [9] "BSgenome.Athaliana.TAIR.TAIR9-tools"    

<snip>

And one of the A thaliana examples might be useful, being plants and all.

> readLines(dir(system.file("extdata/GentlemanLab", package="BSgenome"), full.names = TRUE)[6])
 [1] "Package: BSgenome.Athaliana.TAIR.TAIR10.1"                                                                                                                               
 [2] "Title: Full genome sequences for Arabidopsis thaliana (TAIR10.1)"                                                                                                        
 [3] "Description: Full genome sequences for Arabidopsis thaliana as provided by TAIR (TAIR10.1, RefSeq assembly accession: GCF_000001735.4) and stored in Biostrings objects."
 [4] "Version: 1.5.0"                                                                                                                                                          
 [5] "organism: Arabidopsis thaliana"                                                                                                                                          
 [6] "common_name: Thale cress"                                                                                                                                                
 [7] "genome: TAIR10.1"                                                                                                                                                        
 [8] "provider: TAIR"                                                                                                                                                          
 [9] "release_date: 2018/03/15"                                                                                                                                                
[10] "source_url: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001735.4"                                                                                                       
[11] "organism_biocview: Arabidopsis_thaliana"                                                                                                                                 
[12] "BSgenomeObjname: Athaliana"                                                                                                                                              
[13] "SrcDataFiles: GCF_000001735.4_TAIR10.1_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/"                           
[14] "PkgExamples: genome[[\"1\"]]"                                                                                                                                            
[15] "seqs_srcdir: /home/hpages/BSgenomeForge/srcdata/BSgenome.Athaliana.TAIR.TAIR10.1/seqs"                                                                                   
[16] "ondisk_seq_format: rds"

Which you could copy and adjust to suit. And then to get the sequences you would just use subsets of your GRanges object of interest to get the sequences, using getSeq in BSgenome.

Anyway, that's a lot, but you do work with a non-model organism so you are +/- on your own. So give it a shot and let us know if you get jammed up somewhere and have a specific question.

ADD COMMENT
0
Entering edit mode

Oh, and I should note that the 'GRanges` object can be subsetted, as it does include things like the whole extent of each chromosome.

> gr[gr$type == "region"]
GRanges object with 1471 ranges and 60 metadata columns:
               seqnames     ranges strand |   source     type     score
                  <Rle>  <IRanges>  <Rle> | <factor> <factor> <numeric>
     [1]    NC_027757.2 1-35822559      + |   RefSeq   region        NA
     [2]    NC_027758.2 1-35332830      + |   RefSeq   region        NA
     [3]    NC_027759.2 1-49139117      + |   RefSeq   region        NA
     [4]    NC_027760.2 1-23556786      + |   RefSeq   region        NA
     [5]    NC_027761.2 1-31473442      + |   RefSeq   region        NA
     ...            ...        ...    ... .      ...      ...       ...
  [1467] NW_019170001.1     1-1001      + |   RefSeq   region        NA
  [1468] NW_019170002.1      1-723      + |   RefSeq   region        NA
  [1469]    NC_008285.1   1-221853      + |   RefSeq   region        NA
  [1470]    NC_016734.1   1-152860      + |   RefSeq   region        NA
  [1471]    NC_004946.1    1-11640      + |   RefSeq   region        NA
             phase                     ID          Dbxref           Name
         <integer>            <character> <CharacterList>    <character>
     [1]      <NA> NC_027757.2:1..35822..      taxon:3708             A1
     [2]      <NA> NC_027758.2:1..35332..      taxon:3708             A2
     [3]      <NA> NC_027759.2:1..49139..      taxon:3708             A3
     [4]      <NA> NC_027760.2:1..23556..      taxon:3708             A4
     [5]      <NA> NC_027761.2:1..31473..      taxon:3708             A5
     ...       ...                    ...             ...            ...
  [1467]      <NA> NW_019170001.1:1..1001      taxon:3708        Unknown
  [1468]      <NA>  NW_019170002.1:1..723      taxon:3708        Unknown
  [1469]      <NA>  NC_008285.1:1..221853      taxon:3708             MT
  [1470]      <NA>  NC_016734.1:1..152860      taxon:3708           Pltd
  [1471]      <NA>   NC_004946.1:1..11640      taxon:3708 linear plasmid
          chromosome     cultivar       gbkey        genome    mol_type
         <character>  <character> <character>   <character> <character>
     [1]          A1         ZS11         Src    chromosome genomic DNA
     [2]          A2         ZS11         Src    chromosome genomic DNA
     [3]          A3         ZS11         Src    chromosome genomic DNA
     [4]          A4         ZS11         Src    chromosome genomic DNA
     [5]          A5         ZS11         Src    chromosome genomic DNA
     ...         ...          ...         ...           ...         ...
  [1467]     Unknown         ZS11         Src       genomic genomic DNA
  [1468]     Unknown         ZS11         Src       genomic genomic DNA
  [1469]        <NA>       Westar         Src mitochondrion genomic DNA
  [1470]        <NA>         <NA>         Src   chloroplast genomic DNA
  [1471]        <NA> Isuzu-natane         Src mitochondrion genomic DNA

Which is boring, so you could do

> gr[!gr$type %in% "region"]
GRanges object with 1793273 ranges and 60 metadata columns:
               seqnames      ranges strand |   source            type     score
                  <Rle>   <IRanges>  <Rle> | <factor>        <factor> <numeric>
        [1] NC_027757.2   6637-9236      - |   Gnomon            gene        NA
        [2] NC_027757.2   6637-9236      - |   Gnomon            mRNA        NA
        [3] NC_027757.2   9092-9236      - |   Gnomon            exon        NA
        [4] NC_027757.2   8813-8994      - |   Gnomon            exon        NA
        [5] NC_027757.2   8688-8744      - |   Gnomon            exon        NA
        ...         ...         ...    ... .      ...             ...       ...
  [1793269] NC_004946.1   6765-7175      + |   RefSeq gene                   NA
  [1793270] NC_004946.1   6765-7175      + |   RefSeq CDS                    NA
  [1793271] NC_004946.1  8175-11411      - |   RefSeq gene                   NA
  [1793272] NC_004946.1  8175-11411      - |   RefSeq CDS                    NA
  [1793273] NC_004946.1 11314-11640      + |   RefSeq inverted_repeat        NA

And like that.

ADD REPLY
0
Entering edit mode

Hello,

I have a follow-up question. I have followed your instructions. I am attempting to forge the genome for Bnapus using the BSGenomeForge function.

My seed file (in dcf format) is as follows:

Package: BSgenome.Bnapus.NCBI.Bra_napus_v2.0

Title: Full genome sequences for Brassica napus (Bra_napus_v2.0)

Description: Full genome sequences for Brassica napus as provided by NCBI (v2.0, RefSeq assembly accession: GCF_000686985.2)

Version: 1.0.0

organism: Brassica napus

common_name: Rape

genome: Bra_napus_v2.0

provider: NCBI

release_date: 2017/09/22

source_url: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Brassica_napus/latest_assembly_versions/GCF_000686985.2_Bra_napus_v2.0/

organism_biocview: Brassica_napus

seqnames: c("GCF_000686985.2_Bra_napus_v2.0_genomic.fa")

BSgenomeObjname: Bnapus

SrcDataFiles: GCF_000686985.2_Bra_napus_v2.0_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Brassica_napus/latest_assembly_versions/GCF_000686985.2_Bra_napus_v2.0/GCF_000686985.2_Bra_napus_v2.0_genomic.fna.gz

PkgExamples: genome[["1"]]

seqs_srcdir: /home/edytas/scratch/bnapus_genome/BSgenomeForge/seqs_srcdir

ondisk_seq_format: fa

When running the following command: forgeBSgenomeDataPkg("/home/edytas/scratch/bnapus_genome/BSgenomeForge/seqs_srcdir/bnapus_seed.dcf")

I receive the following error: Error in .make_Seqinfo_from_genome(genome) : "Bra_napus_v2.0" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)

I have checked and this genome is not currently registered with GenomeInfoDb however I feel that this error has something to do with my seed file. Would you be able to take a look and let me know you see any glaring errors?

ADD REPLY

Login before adding your answer.

Traffic: 436 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6