Question

Help using rtracklayer for gene ID extraction

0

Entering edit mode

Aichatou • 0

@7e63d4bf

Last seen 3 months ago

United States

Greetings,

I ran genome-wide association studies for wheat quality traits in wheat (Triticum aestivum) which returned a few significant regions of the genome. I would like to extract the genes within those regions using the version 2.1 of the wheat reference genome, using the "rtracklayer". I do not know where to start, I consulted the package documentation and I still don't understand it.

Please advise.

Best regards.

rtracklayer • 708 views

ADD COMMENT • link written 3 months ago by Aichatou • 0

score 0 · Answer 1 · 2024-12-12

You don't need rtracklayer for this (not directly anyway), but instead you need txdbmaker which you can use to make a TxDb object that contains the gene locations, etc. There are (at least) two annotation services that have gene data for wheat; NCBI and Ensembl. I imagine MSU has one as well, so it depends on what genome you generated your SNP data from. Ideally you stick with the same annotation service because they all do things slightly differently, and trying to map things between any of them is a fool's errand. I generally use NCBI annotations, but they have a (terrible, IMO) habit of naming chromosomes using RefSeq IDs, and I don't know an easy way of changing the chromosomes, so I will show you how to use Ensembl instead.

> library(txdbmaker)
>> txdb <- makeTxDbFromGFF("https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/gff3/triticum_aestivum/Triticum_aestivum.IWGSC.60.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/gff3/triticum_aestivum/Triticum_aestivum.IWGSC.60.gff3.gz'
Content type 'application/x-gzip' length 24463854 bytes (23.3 MB)
downloaded 23.3 MB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_exons_from_GRanges(exon_IDX, gr, mcols0, tx_IDX, feature = "exon",  :
  68 exons couldn't be linked to a
  transcript so were dropped
  (showing only the first 6):
  seqid     start       end strand
1    1A 326329197 326329304      +
2    1A 366114101 366114204      +
3    1A 403039760 403039877      -
4    1B 357916232 357916339      -
5    1B 395058787 395058890      +
6    1B 432758094 432758212      +
    ID               Name
1 <NA> ENSRNA050014038-E1
2 <NA> ENSRNA050013952-E1
3 <NA> ENSRNA050013948-E1
4 <NA> ENSRNA050017309-E1
5 <NA> ENSRNA050019912-E1
6 <NA> ENSRNA050022692-E1
                         Parent
1 transcript:ENSRNA050014038-T1
2 transcript:ENSRNA050013952-T1
3 transcript:ENSRNA050013948-T1
4 transcript:ENSRNA050017309-T1
5 transcript:ENSRNA050019912-T1
6 transcript:ENSRNA050022692-T1
  Parent_type
1        <NA>
2        <NA>
3        <NA>
4        <NA>
5        <NA>
6        <NA>
> genes(txdb)
GRanges object with 120676 ranges and 1 metadata column:
                    seqnames
                       <Rle>
    ENSRNA050007810       3A
    ENSRNA050007817       3A
    ENSRNA050007820       3A
    ENSRNA050007823       3A
    ENSRNA050007826       3A
                ...      ...
  TraesCSU02G272776       Un
  TraesCSU02G272800       Un
  TraesCSU02G272807       Un
  TraesCSU02G272900       Un
  TraesCSU02G272997       Un
                                 ranges
                              <IRanges>
    ENSRNA050007810     8594746-8594817
    ENSRNA050007817     8598700-8598771
    ENSRNA050007820   19771253-19771326
    ENSRNA050007823   27007038-27007108
    ENSRNA050007826   32161321-32161392
                ...                 ...
  TraesCSU02G272776 478410340-478410699
  TraesCSU02G272800 478934578-478934940
  TraesCSU02G272807 478901536-478902039
  TraesCSU02G272900 479850130-479850488
  TraesCSU02G272997 480840068-480840121
                    strand |
                     <Rle> |
    ENSRNA050007810      + |
    ENSRNA050007817      + |
    ENSRNA050007820      + |
    ENSRNA050007823      + |
    ENSRNA050007826      + |
                ...    ... .
  TraesCSU02G272776      - |
  TraesCSU02G272800      + |
  TraesCSU02G272807      - |
  TraesCSU02G272900      + |
  TraesCSU02G272997      + |
                              gene_id
                          <character>
    ENSRNA050007810   ENSRNA050007810
    ENSRNA050007817   ENSRNA050007817
    ENSRNA050007820   ENSRNA050007820
    ENSRNA050007823   ENSRNA050007823
    ENSRNA050007826   ENSRNA050007826
                ...               ...
  TraesCSU02G272776 TraesCSU02G272776
  TraesCSU02G272800 TraesCSU02G272800
  TraesCSU02G272807 TraesCSU02G272807
  TraesCSU02G272900 TraesCSU02G272900
  TraesCSU02G272997 TraesCSU02G272997
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths

Now is where you have to take over. I showed you how to get a GRanges of the gene locations. You can do all sorts of things with these objects and it's well beyond the scope of a support site post to cover them. You can read the vignettes for GenomicFeatures and GenomicRanges and maybe GenomicAlignments.