Question

goseq after edgeR

0

Entering edit mode

mictadlo ▴ 10

@mictadlo-10885

Last seen 5.3 years ago

I have followed this tutorial for edgeR. On this side it appears I could run goseq when I have created the following files:

Differentially expressed gene file but edgeR does not appear to produce a tabular file with gene names in the first column, and TRUE or FALSE in the last column. How is it possible to generate this kind of file?
Gene length file for length bias correction: I have a gff3 file and found (here)[http://seqanswers.com/forums/archive/index.php/t-39797.html] this code

#!/usr/bin/env Rscript
library(GenomicRanges)
library(rtracklayer)
GTFfile = "some_annotation_file.GTF"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
calc_length <- function(x) {
sum(elementMetadata(x)$widths)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")

Should the ouput file contain any header and gene id assioated with the length?
import.gff has a argument genome="GRCm38.71" but my genome is a non model species. Should I provide the file path to the assembly?

Gene category file: I could obtain a mapping of gene id to gene ontology out of the below interproscan file:

sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    198    261    1.4E-15    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021
sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    7    91    1.1E-25    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021

Any advice what would be the best way to generarate the 3 above files for goseq?

Thank you in advance.

edger goseq • 1.8k views

ADD COMMENT • link updated 7.6 years ago by Aaron Lun ★ 28k • written 7.6 years ago by mictadlo ▴ 10

score 0 · Answer 1 · 2017-09-11

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 1 hour ago

The city by the bay

Rather than going through some web-based interface, the easiest approach to running a goseq analysis is to just install the goseq package via biocLite and follow the instructions in the user's guide. The instructions in Section 5 for non-native organisms and the case study in Section 6 should answer most of your questions.

ADD COMMENT • link 7.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

I definitely would prefer to run goseq directly in R. As suggested, I read section 5 and I have still some questions how to implement the necessary code structures for goseq:

5.1 Length data format: The length data must be formatted as a numeric vector, of the same length as the main named vector specifying gene names/DE genes. Each entry should give the length of the corresponding gene in bp. If length data is unavailable for some genes, that entry should be set to NA.

I found the following code:


#!/usr/bin/env Rscript
library(GenomicRanges)
library(rtracklayer)
GTFfile = "some_annotation_file.GTF"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
calc_length <- function(x) {
sum(elementMetadata(x)$widths)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")

What should I set for genome argument inside import.gff or should I provide the file path to the non model species assembly?
How to change the above code to satisfy section 5.1?

5.2 Category mapping format: The mapping between category names and genes should be given as a data frame with two columns. One column should contain the gene IDs and the other the name of an associated category. As the mapping between categories and genes is usually many-to-many, this data frame will usually have multiple rows with the same gene name and category name. Alternatively, mappings between genes and categories can be given as a list. The names of list entries should be gene IDs and the entries themselves should be a vector of category names to which the gene ID corresponds.

I could obtain a mapping of gene id to gene ontology out of the below interproscan file:

sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    198    261    1.4E-15    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021
sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    7    91    1.1E-25    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021

What would be the best way to read that above file and put the data in correct structure?

Thank you in advance.

ADD REPLY • link 7.6 years ago mictadlo ▴ 10

0

Entering edit mode

The genome= argument is optional, it isn't needed to import a GTF file.
See width.
See read.table.

ADD REPLY • link 7.6 years ago Aaron Lun ★ 28k