I have followed this tutorial for edgeR. On this side it appears I could run goseq
when I have created the following files:
Differentially expressed gene file but edgeR does not appear to produce a tabular file with gene names in the first column, and TRUE or FALSE in the last column. How is it possible to generate this kind of file?
Gene length file for length bias correction: I have a gff3 file and found (here)[http://seqanswers.com/forums/archive/index.php/t-39797.html] this code
#!/usr/bin/env Rscript
library(GenomicRanges)
library(rtracklayer)
GTFfile = "some_annotation_file.GTF"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
calc_length <- function(x) {
sum(elementMetadata(x)$widths)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")
- Should the ouput file contain any header and gene id assioated with the length?
import.gff
has a argumentgenome="GRCm38.71"
but my genome is a non model species. Should I provide the file path to the assembly?
- Gene category file: I could obtain a mapping of gene id to gene ontology out of the below interproscan file:
sp0000006-mRNA-1 edf5c2bb6341fe44b3da447099a5b2df 282 Pfam PF03083 Sugar efflux transporter for intercellular exchange 198 261 1.4E-15 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021
sp0000006-mRNA-1 edf5c2bb6341fe44b3da447099a5b2df 282 Pfam PF03083 Sugar efflux transporter for intercellular exchange 7 91 1.1E-25 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021
Any advice what would be the best way to generarate the 3 above files for goseq
?
Thank you in advance.
I definitely would prefer to run
goseq
directly in R. As suggested, I read section 5 and I have still some questions how to implement the necessary code structures forgoseq
:5.1 Length data format: The length data must be formatted as a numeric vector, of the same length as the main named vector specifying gene names/DE genes. Each entry should give the length of the corresponding gene in bp. If length data is unavailable for some genes, that entry should be set to NA.
I found the following code:
What should I set for
genome
argument insideimport.gff
or should I provide the file path to the non model species assembly?How to change the above code to satisfy section 5.1?
5.2 Category mapping format: The mapping between category names and genes should be given as a data frame with two columns. One column should contain the gene IDs and the other the name of an associated category. As the mapping between categories and genes is usually many-to-many, this data frame will usually have multiple rows with the same gene name and category name. Alternatively, mappings between genes and categories can be given as a list. The names of list entries should be gene IDs and the entries themselves should be a vector of category names to which the gene ID corresponds.
Thank you in advance.
genome=
argument is optional, it isn't needed to import a GTF file.width
.read.table
.