I am trying to use DESeq2 to identify differentially expressed genes across different rodent species as part of a larger project on rodent evolution. I have carried out alignment using STAR, and generated my counts with HTSeq. I am only interested in one-to-one orthologues that span all of the rodent species I am working with, so these count files have been amended to show the mouse orthologue gene ID for their respective genes, but the format of the file remains the same, with ID on the left and count on the right.
Here is what I have at the moment:
library(DESeq2)
directory = "/Users/emma/Desktop/Differential Expression Analysis/Data"
# Only pick up the orthologue count files in that folder
sampleFiles = list.files(directory, pattern = "Ortho\\.txt$")
# Strip the file suffix to get sample names such as "SRR594397_1_Mouse"
sampleName = unlist(strsplit(sampleFiles, "Ortho.txt", fixed = TRUE))
# Split on underscores; the third piece is the species
condition = strsplit(sampleName, "_", fixed = TRUE)
species = sapply(condition, "[[", 3)
fileInfo = data.frame(sampleName, sampleFiles, species)
ddsHTSeq = DESeqDataSetFromHTSeqCount(sampleTable = fileInfo, directory = directory, design = ~ species)
Which gives me the following fileInfo table:
          sampleName                sampleFiles species
1  SRR594397_1_Mouse SRR594397_1_MouseOrtho.txt   Mouse
2  SRR594397_2_Mouse SRR594397_2_MouseOrtho.txt   Mouse
3  SRR594405_1_Mouse SRR594405_1_MouseOrtho.txt   Mouse
4  SRR594405_2_Mouse SRR594405_2_MouseOrtho.txt   Mouse
5  SRR594414_1_Mouse SRR594414_1_MouseOrtho.txt   Mouse
6  SRR594414_2_Mouse SRR594414_2_MouseOrtho.txt   Mouse
7    SRR594423_1_Rat   SRR594423_1_RatOrtho.txt     Rat
8    SRR594423_2_Rat   SRR594423_2_RatOrtho.txt     Rat
9    SRR594432_1_Rat   SRR594432_1_RatOrtho.txt     Rat
10   SRR594432_2_Rat   SRR594432_2_RatOrtho.txt     Rat
11   SRR594441_1_Rat   SRR594441_1_RatOrtho.txt     Rat
12   SRR594441_2_Rat   SRR594441_2_RatOrtho.txt     Rat
But I get the following error when I try to create ddsHTSeq:
Error in Ops.factor(a$V1, l[[1]]$V1) :
level sets of factors are different
In addition: Warning messages:
1: In `==.default`(a$V1, l[[1]]$V1) :
longer object length is not a multiple of shorter object length
2: In is.na(e1) | is.na(e2) :
longer object length is not a multiple of shorter object length
3: In `==.default`(a$V1, l[[1]]$V1) :
longer object length is not a multiple of shorter object length
4: In is.na(e1) | is.na(e2) :
longer object length is not a multiple of shorter object length
5: In `==.default`(a$V1, l[[1]]$V1) :
longer object length is not a multiple of shorter object length
6: In is.na(e1) | is.na(e2) :
longer object length is not a multiple of shorter object length
I have tried a number of approaches, and read many documents and forum posts online, but can't seem to get past this stage. So I am trying to work out whether there is a problem with what I am doing within RStudio, or whether there is a bigger problem that I am missing in my experimental design. Being new to DESeq2 and bioinformatics in general, I am in way over my head! Any help understanding what to do with this error would be very much appreciated.
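For anyone hitting the same error: the messages above (comparisons on `a$V1` and `l[[1]]$V1` with mismatched lengths and factor levels) suggest that the first column of the count files is being compared against the first file, and that the files do not all list the same gene IDs in the same order. A minimal sketch for checking that outside DESeq2 is below. `checkCountFiles` is a hypothetical helper written for illustration, not a DESeq2 function, and it assumes tab-separated count files with no header, ID in the first column and count in the second.

```r
# Hypothetical helper (not part of DESeq2): compare the gene ID column of
# every count file against the first file in the list.
checkCountFiles <- function(directory, files) {
  # Read only the ID column of each file; assumes tab-separated, no header
  ids <- lapply(files, function(f) {
    tab <- read.table(file.path(directory, f), sep = "\t", quote = "",
                      comment.char = "", stringsAsFactors = FALSE)
    as.character(tab$V1)
  })
  data.frame(
    file        = files,
    # Number of rows (gene IDs plus any summary rows) in each file
    nGenes      = unname(vapply(ids, length, integer(1))),
    # IDs that occur more than once, e.g. after mapping to orthologues
    nDuplicated = unname(vapply(ids, function(x) sum(duplicated(x)), integer(1))),
    # TRUE only if the IDs and their order match the first file exactly
    sameAsFirst = unname(vapply(ids, function(x) identical(x, ids[[1]]), logical(1))),
    stringsAsFactors = FALSE
  )
}
```

With the objects from the question, this would be called as `checkCountFiles(directory, sampleFiles)`. If `nGenes` differs between files or `sameAsFirst` is FALSE for some of them, one option is to restrict every file to the shared set of IDs, in the same order, before loading them.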
Thank you for your help. I did what you suggested, and everything is working fine.