GenomicFeatures and custom genomes?
3
1
Entering edit mode
david.rinker ▴ 10
@davidrinker-13538
Last seen 7.2 years ago

Hi I want to use GenomicFeatures to extract upstream sequences from a genome using gene_IDs

My organism (Anopheles melas) has an assembled genome (fasta) and a gtf file.

It seems like these are all the files I should need, however I am not finding it easy to figure out on how to load these two data types for genomicFeatures to use? So far all I have figured out that I need to make a TxDb from gff; I have no idea what to do with the fasta file.

Could someone (BRIEFLY) tell me which packages and objects I need to use and how they all mesh together? 

I have read over the manual as well as the Rsamtools manual but am more confused than anything.

Thank you

genomicfeatures • 1.8k views
ADD COMMENT
2
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

Depending on your OS, i'd suggest creating a '2bit' (rtracklayer, TwoBitFile) or indexed fasta (FaFile, Rsamtools) file.

For the gff, I think you can rtracklayer::import() it and get something lighter weight but still useful, or create the TxDb for greater flexibility.

The glue is Biostrings::getSeq(), which will take the 2bit or fasta file as first argument and genomic ranges as second returning the sequences corresponding to the ranges.

ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 5 days ago
Seattle, WA, United States

Hi,

Using extractUpstreamSeqs() from the GenomicFeatures package:

library(Biostrings)
genome <- readDNAStringSet(fasta_file)

library(GenomicFeatures)
txdb <- makeTxDbFromGFF(gtf_file)
extractUpstreamSeqs(genome, txdb)

If you have a 2-bit file instead of FASTA, replace genome <- readDNAStringSet(fasta_file) with:

library(rtracklayer)
genome <- TwoBitFile(twobit_file)

The 2-bit format supports very efficient random access so this solution avoids loading the full genome sequences in memory. Note however that this won't work if the sequences contain IUAPAC ambiguity letters other than N (the 2-bit format does not support them).

More information and examples in ?extractUpstreamSeqs.

Cheers,

H.

ADD COMMENT
1
Entering edit mode

Just realizing now that extractUpstreamSeqs() doesn't take a DNAStringSet object, sorry. We should add this at some point. So for now you could either use:

library(Rsamtools)
genome <- FaFile(fasta_file)

as Martin suggested, or convert your FASTA file to 2-bit format with something like:

export(readDNAStringSet(fasta_file), "mygenome.2bit")

Note that using FaFile on a compressed FASTA file has been reported to be unreliable on Windows in the past.

Hope this helps,

H.

ADD REPLY
0
Entering edit mode
david.rinker ▴ 10
@davidrinker-13538
Last seen 7.2 years ago

Ok, guess this question is a bust. 

If the developers are reading, it might be a useful addition to the manual and/or vignette to outline this process. More and more "non-model" organisms are being sequenced and analyzed--those users are inevitably going to want to do what I'm currently trying to do. 

If I can figure this out I'll post my summary here to hopefully enlighten others. 

ADD COMMENT

Login before adding your answer.

Traffic: 492 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6