Hi I'm having an issue with DESeq2 results. I'm working with 2 cell lines from different patients and originally I created a .csv excel file with both cell lines in it and then contrasted the results from the DESEq Analysis:
setwd("C:/Users/lchen/Documents/transfer.lilly/TMZ.052215")
sample=read.csv("sample.csv", header=T)
#Get the paths to the file
#Make sure the sample names in the sample.csv match up with the end path text
path=paste(getwd(),"/",sample$sample,".sorted.rmdup.bam",sep="")
library(GenomicRanges)
#Clump the bam files into a BamFileList with the same samples as listed in the .csv
list=BamFileList(path, index=character())
library(GenomicFeatures)
#Obtain Reference genome from UCSC
Txdb=makeTxDbFromUCSC(
genome="hg19",
tablename="ensGene",
transcript_ids=NULL,
circ_seqs=DEFAULT_CIRC_SEQS,
url="http://genome.ucsc.edu/cgi-bin/",
goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
miRBaseBuild=NA)
#Sort the exons of the Txdb by gene
exons<-exonsBy(Txdb, by="gene")
#SummarizeOverlaps to count unique reads
library(BiocParallel)
se <- summarizeOverlaps(features=exons, reads=list,
mode="Union",
singleEnd=FALSE,
ignore.strand=FALSE,
fragments=TRUE, BPPARAM = SerialParam())
#Add column as listed in sample.csv
colData(se)$group= sample$group
#Run DESeq2 with variables
library(DESeq2)
dds<-DESeqDataSet(se, design=~group)
dds<-DESeq(dds)
#Results that contrast the Control and Treated groups
res<-results(dds, contrast=c("group", "HF2303.Control", "HF2303.TMZ"))
#Match Gene Names to the Ensembl IDs from the Txdb
res$ensembl <- sapply( strsplit( rownames(res), split="\\+" ), "[", 1 )
library( "biomaRt" )
ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" )
genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
filters = "ensembl_gene_id",
values = res$ensembl,
mart = ensembl )
idx <- match( res$ensembl, genemap$ensembl_gene_id )
res$entrez <- genemap$entrezgene[ idx ]
res$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
#Create a MAPlot and Dispersion Estimate Plot
plotMA(res, ylim=c(-1, 1), main="DESeq2")
plotDispEsts(dds)
#Filter by subsetting results for padj<0.1 and log2FoldChange>=2
res=as.data.frame(res)
sig=subset(res, padj<=0.1)
sig=subset(res, abs(log2FoldChange)>=2)
#Save sig and gene list as csv file
write.csv(sig, "sig.HF2303.csv")
gene=unique(sig$hgnc_symbol)
gene=gene[!is.na(gene)]
write.csv(gene, "gene.HF2303.csv")
The code runs and gives me about 81 significant genes
However, when I separate the 2 cell lines into separate 2 .csv excel files (one for each cell line) and run the same code, I only get about 5 significant genes. At first I thought it would be better to work the cell lines separately, but I didn't think that it would make a huge difference if only the cell lines were contrasted in the results, so I'm confused why I am getting different results.
Here is a code of when I separated the 2 cell lines into 2 excel files.
setwd("C:/Users/lchen/Documents/transfer.lilly/TMZ.052215")
sample=read.csv("HF2927.csv", header=T)
#Get the paths to the file
#Make sure the sample names in the sample.csv match up with the end path text
path=paste(getwd(),"/",sample$sample,".HF2927.sorted.rmdup.bam",sep="")
library(GenomicRanges)
#Clump the bam files into a BamFileList with the same samples as listed in the .csv
list=BamFileList(path, index=character())
library(GenomicFeatures)
#Obtain Reference genome from UCSC
Txdb=makeTxDbFromUCSC(
genome="hg19",
tablename="ensGene",
transcript_ids=NULL,
circ_seqs=DEFAULT_CIRC_SEQS,
url="http://genome.ucsc.edu/cgi-bin/",
goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
miRBaseBuild=NA)
#Sort the exons of the Txdb by gene
exons<-exonsBy(Txdb, by="gene")
#SummarizeOverlaps to count unique reads
library(BiocParallel)
se <- summarizeOverlaps(features=exons, reads=list,
mode="Union",
singleEnd=FALSE,
ignore.strand=FALSE,
fragments=TRUE, BPPARAM = SerialParam())
#Add column as listed in sample.csv
colData(se)$group= sample$group
#Run DESeq2 with variables
library(DESeq2)
dds<-DESeqDataSet(se, design=~group)
dds<-DESeq(dds)
#Results that contrast the Control and Treated groups
res<-results(dds, contrast=c("group", "Control", "TMZ"))
#Match Gene Names to the Ensembl IDs from the Txdb
res$ensembl <- sapply( strsplit( rownames(res), split="\\+" ), "[", 1 )
library( "biomaRt" )
ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" )
genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
filters = "ensembl_gene_id",
values = res$ensembl,
mart = ensembl )
idx <- match( res$ensembl, genemap$ensembl_gene_id )
res$entrez <- genemap$entrezgene[ idx ]
res$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
#Create a MAPlot and Dispersion Estimate Plot
plotMA(res, ylim=c(-1, 1), main="DESeq2")
plotDispEsts(dds)
#Filter by subsetting results for padj<0.1 and log2FoldChange>=2
res=as.data.frame(res)
sig=subset(res, padj<=0.1)
sig=subset(res, abs(log2FoldChange)>=2)
#Save sig and gene list as csv file
write.csv(sig, "sig.HF2927.csv")
gene=unique(sig$hgnc_symbol)
gene=gene[!is.na(gene)]
write.csv(gene, "gene.HF2927.csv")
Any help or tips as to why would be very appreciated! Thank you!
Sorry for the confusion, but yes that's what I'm working with. Basically, I have two tumor cell lines each from a different patient, and each one of the tumor cell lines has a control and treated group, so as you were saying that would be 4 groups all together when I created a .csv file with both tumor cell lines. I'm trying to look at the changes in gene expression that occur in each tumor cell line between untreated and those treated with Treatment A.
That's when i tried to just work with the two tumor cell lines separately. Is there a way to contrast the tumor cell lines when working with samples from all the groups? I've been trying to edit the design when creating a DESeqDataSet but i've been getting an error saying that "-- standard model matrices are used for factors with two levels and an interaction,
where the main effects are for the reference level of other factors.
see the 'Interactions' section of the vignette for more details: vignette('DESeq2')"
Here's the code i've been working with:
> exons<-exonsBy(Txdb, by="gene")
> library(BiocParallel)
> se <- summarizeOverlaps(features=exons, reads=list,
+ mode="Union",
+ singleEnd=FALSE,
+ ignore.strand=FALSE,
+ fragments=TRUE, BPPARAM = SerialParam())
> colData(se)=DataFrame(sample)
> library(DESeq2)
Loading required package: Rcpp
Loading required package: RcppArmadillo
> dds<-DESeqDataSet(se, design=~Cell + Treat + Cell:Treat)
> #Relevel Treat column so that Control is first
> dds$Treat <- relevel(dds$Treat, "Cont")
> #Run DESeq pipeline
> dds<-DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- standard model matrices are used for factors with two levels and an interaction,
where the main effects are for the reference level of other factors.
see the 'Interactions' section of the vignette for more details: vignette('DESeq2')
answer posted on the new post:
A: Running DESeqDataSet with 4 factors?