Create Db for exon starts/ends using rtracklayer
0
0
Entering edit mode
@mehmet-ilyas-cosacak-9020
Last seen 6.8 years ago
Germany/Dresden/ CRTD - DZNE

Hi,

I am trying to generate a data.frame with columns: gene_id, gene_name, seqnames,strand,start,end,exon_starts,exon_ends using Danio rerio GTF file from ensembl.

I have written an R script as below, but it takes long time to create the data.frame that I am interested. Is there a quicker way?

library("rtracklayer")
get_gene_features <- function(filename, output_file){
    ngtf <- import(filename)
    ngtf_df <- as.data.frame(ngtf)
    ngtf <- NA
    len_ngtf_df <- sum(ngtf_df$type == "gene", na.rm = TRUE)
    geneFeatureDf <- subset(ngtf_df[,c(10,12,1,5,2,3,7)], ngtf_df$type == "gene")
    geneFeatureDf["exon_start"] <- NA
    geneFeatureDf["exon_end"] <- NA
    for (i in 1:length(geneFeatureDf$gene_id)){
        geneid <- geneFeatureDf[i,]$gene_id
        geneFeatureDf[i,8] <- paste(subset(ngtf_df$start[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]],ngtf_df$gene_id[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]] == geneid & ngtf_df$type[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]] == "exon"),collapse = ",")
        geneFeatureDf[i,9] <- paste(subset(ngtf_df$end[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]],ngtf_df$gene_id[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]] == geneid & ngtf_df$type[rownames(geneFeatureDf)[i]:rownames(geneFeatureDf)[i+1]] == "exon"),collapse = ",")
        write.table(geneFeatureDf[i,],file = output_file, append = TRUE,sep = "\t", col.names = FALSE, row.names = TRUE, na = "NA")
    }
    j = length(geneFeatureDf)-1
    geneid <- geneFeatureDf[j+1,]$gene_id
    geneFeatureDf[j+1,8] <- paste(subset(ngtf_df$start[rownames(geneFeatureDf)[j],],ngtf_df$gene_id[rownames(geneFeatureDf)[j],] == geneid & ngtf_df$type[rownames(geneFeatureDf)[j],] == "exon"),collapse = ",")
    geneFeatureDf[j+1,9] <- paste(subset(ngtf_df$end[rownames(geneFeatureDf)[j],],ngtf_df$gene_id[rownames(geneFeatureDf)[j],] == geneid & ngtf_df$type[rownames(geneFeatureDf)[j],] == "exon"),collapse = ",")
    write.table(geneFeatureDf[j+1,],file = output_file, append = TRUE,sep = "\t", col.names = FALSE, row.names = TRUE, na = "NA")
    return(geneFeatureDf)
}

geneFeatureDf <- get_gene_features("Danio_rerio.GRCz10.82.gtf","geneFeatureDf.tab")

 

an example of the output is below:

","    "gene_id"    "gene_name"    "chromosome"    "strand"    "start"    "end"    "type"    "exon_start"    "exon_end"

"1"    "ENSDARG00000104632"    "si:ch73-252i11.3"    "4"    "-"    6733    52120    "gene"    "52002,48613,13755,9547,6733,52002,48613"    "52120,48740,13811,9620,7380,52061,48740"
"11"    "ENSDARG00000100660"    "ZC3HAV1"    "4"    "+"    16217    28312    "gene"    "16217,18669,19012,20698,20832,22332,22924,23103,23513,24516,25485,20691,20832,22332,22924,23103,23513,24516,25485"    "16549,18804,19372,20748,20958,22618,23020,23178,23643,24667,26320,20748,20958,22618,23020,23178,23643,24667,28312"
 

 

thanks in advance.

rtracklayer gtf • 1.3k views
ADD COMMENT
1
Entering edit mode

Hi Mehmet,

Sounds like it would be easier to just get that data.frame out of a TxDb object that you could first make with the makeTxDbFromGFF() or makeTxDbFromBiomart() functions from the GenomicFeatures package. Any reason you want to extract the data.frame directly from the GRanges object you get by importing the GTF file with rtracklayer::import()? This object tends to be hard to work with. Working with the TxDb object is generally much simpler.

H.

ADD REPLY
0
Entering edit mode

Hi Herve,

thank you very much for the reply.

I tried makeTxDbFromGFF() and it worked !!!

Thanks again.

ilyas. 

ADD REPLY

Login before adding your answer.

Traffic: 995 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6