Hi,
Please be aware that some familiarity with R is highly recommended before you can efficiently use Bioconductor.
It's not clear to me whether you want to extract the upstream and downstream regions (i.e. their coordinates with respect to the reference genome) or the genomic sequences (DNA) corresponding to these regions. If you want the latter you'll need to have access to the genomic sequences of your organism. I'm assuming you want the latter. The genomic sequences are typically made available by the annotation provider as FASTA or 2-bit files. They need to match exactly the assembly that the annotations in your GTF file are based on.
First, use makeTxDbFromGFF()
to import your GTF file as a TxDb object:
txdb <- makeTxDbFromGFF("path/to/GTF/file")
Note that makeTxDbFromGFF()
only imports gene, transcript, exon, and CDS locations. Strictly speaking that's all what is needed to represent gene models. This means that even if your GTF file had entries for UTRs, makeTxDbFromGFF()
would ignore them (UTRs can easily be inferred from exons and CDS, which is exactly what fiveUTRsByTranscript(txdb)
and threeUTRsByTranscript(txdb)
do for you).
Then to extract the 150 bp upstream sequences (with respect to the genes):
genes <- genes(txdb)
gene_upstream150_seqs <- extractUpstreamSeqs(x, genes, width=150)
where x
is typically a BSgenome object containing the genomic sequences of your assembly, or a TwoBitFile or FaFile object (see ?extractUpstreamSeqs
for more information). Making a BSgenome data package for your organism is a whole topic per se so I won't describe it here. See the "How to forge a BSgenome data package" vignette in the BSgenome software package if you are interested.
gene_upstream150_seqs
will be a named DNAStringSet object parallel to genes
i.e. it will contain one upstream sequence per gene in genes
. Also the names on it will be the same as the names on genes
. In addition it will carry some metadata of interest that you can get with mcols(gene_upstream150_seqs)
.
Unfortunately we don't have an extractDownstreamSeqs()
function (was never requested, should be easy to add though), so a little bit more work is required in order to extract the 150 bp downstream sequences:
gene_downstream150 <- trim(suppressWarnings(flank(genes, width=150, start=FALSE)))
gene_downstream150_seqs <- getSeq(x, gene_downstream150)
mcols(gene_downstream150_seqs) <- mcols(gene_upstream150_seqs)
gene_downstream150_seqs
will also be a named DNAStringSet object parallel to genes
.
Finally to extract the CDS sequences:
cds <- cdsBy(txdb, by="tx")
cds_seqs <- extractTranscriptSeqs(x, cds)
cds_seqs
will again be a named DNAStringSet object but this time it won't necessarily be parallel to genes
. That's because a given gene can have zero or more than one CDS sequence. The names on cds_seqs
will be the internal ids of the corresponding transcripts. The corresponding transcript names and gene ids can be put on the object as metadata columns with:
tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
mapping <- mcols(tx)
mapping$gene_id <- as.character(mapping$gene_id)
rownames(mapping) <- mapping$tx_id
mcols(cds_seqs) <- mapping[names(cds_seqs), ]
Hope this will help get you started.
Cheers,
H.
You can try this, but probably won't work for cds.