How to add pseudocount to DESeq2 dds object
1
0
Entering edit mode
Jitendra ▴ 10
@nabiyogesh-11718
Last seen 4 months ago
United Kingdom

Hi,

I am getting error while running DESeq2, as some of the samples contain 0 so I want to add pseudo-count to dds so that I can able to run it without any error but I am not sure how should I add 1 to dds?

    dds <- DESeqDataSetFromMatrix(countData=count,colData=colData,design=~Season)
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    dds <- DESeq(dds, test="Wald")
    estimating size factors
    Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  :
      every gene contains at least one zero, cannot compute log geometric means

After adding pseudocount I am thinking to run further code like this:

dds<-estimateSizeFactors(dds, type = 'iterate')
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
deseq2 r • 5.3k views
ADD COMMENT
0
Entering edit mode
swbarnes2 ★ 1.4k
@swbarnes2-14086
Last seen 19 hours ago
San Diego

It's not normal for the kind of bulk seq DESeq was designed for to have zeroes in every single gene. Are you sure that you don't have a couple of failed samples included?

ADD COMMENT
0
Entering edit mode

Hi

I used the below code to check if samples with all 0; and it shows no samples with all 0.

all.zero <- apply(count, 2, function(x) all(x==0))
    > all.zero
        PN0086D.1.S1     PN0086D.2.S2     PN0086D.3.S3     PN0086D.4.S4
               FALSE            FALSE            FALSE            FALSE
        PN0086D.5.S5     PN0086D.6.S6     PN0086D.7.S7     PN0086D.8.S8
               FALSE            FALSE

Thanks

ADD REPLY
1
Entering edit mode

The problem is that you have no genes with all positive counts. This is strange for bulk RNA-seq, and worth exploring.

To answer your question at the outset, we don't have any way to add pseudo-counts. I'd recommend to figure out what is going on though with all genes having a zero.

ADD REPLY
0
Entering edit mode

Hi Michael,

This is microbial amplicon sequencing data for a marker genes and we collected soil samples from different location so it can be possible that no rep.seq with all positive count.

sorry for this confusion!

many thanks

ADD REPLY
0
Entering edit mode

Hi,

I came across this thread while exploring different pre-filtering criteria and was curious about your statement in regards to observing no genes with all positive counts. I've inherited an RNA-seq project that sat around for a while before I started in my new role where indeed, there are no genes with all positive raw counts. Long story short, there were strange results in other analyses that were done before I took over the project, so I went back to the raw counts to perform sample QC, plotted the total number of reads for each sample, and found that there are some differences between sequencing runs. Specifically, samples were sequenced across three runs and the median number of reads for all transcripts across samples for one run was ~ 100 million, while the median numbers of reads for the other two were ~ 30 million.

I understand that there's essentially no way of knowing for sure if the transcripts or genes with 0 observed counts are due to lack of expression at the time of sampling or technical artifacts like sequencing depth. However, given the differences in sequencing depth I mentioned above, it seems plausible that the 0 counts are due to technical differences between runs. Do you recommend an approach for teasing out what's going on with genes? My initial thought was to create a series of bar plots with the number of transcripts or genes that have greater than X number of 0's across samples and coloring by sequencing run, with the idea being that if it this effect was due to differences in sequencing depth, I'd see a greater number of 0-count transcripts or genes in the runs with lower median number of reads.

Thanks in advance for your time.

Sincerely, Stan

UPDATE: I went ahead with my initial idea and plotted the total number of genes from each sequencing run with > 5, 10, 15, and 20 samples with 0 counts for each gene. I hope that description makes sense. Interestingly, in all of the plots the sequencing run with the highest median number of reads had the most genes with 0 counts. I would have expected the opposite if the 0 counts was driven predominantly by sequencing depth. I'm curious to know if you have any additional ideas to explore the lack of genes with all positive counts.

ADD REPLY
1
Entering edit mode

But even with low sequencing depth, it would be an indication of a problem beside sequencing depth that all genes have a zero. Typically there are at the least hundreds or thousands of genes with high counts across all samples. I’d look into other QC aspects like offered by MultiQC to find a subset of samples to analyze.

ADD REPLY
0
Entering edit mode

Thanks for your reply. I do have many genes with high counts across samples. As a follow-up to my last comment, almost all genes (have at least a single 0 count for both the two low-depth runs, with fewer for the high-depth run (but still ~ 25,000). I'll take a look at MulitQC to find which types of problems all genes having a 0 could indicate.

ADD REPLY
0
Entering edit mode

If these questionable samples have 0s for genes where other samples have high counts, where do those samples have their counts? Maybe they are failed samples with a lot of ribosomal content?

ADD REPLY
0
Entering edit mode

I looked at the genes with 0 counts in some samples and higher counts in others and there seems to be a pattern driven by sequencing run. Across many genes, there are 0's (or very low counts, < 10) in samples from the first sequencing run, then higher counts (hundreds or thousands) in samples from the other two runs. There is a relatively even split in the gene biotype as well, split predominantly between protein-coding genes and processed pseudogenes. This may be linked with another strange result I've found- there are substantially more mapped processed pseudogenes in one sequencing run relative to others, although I'm not totally sure why.

Thanks for the ribosomal content suggestion. I'll have to go back to the FASTQC reports and check for evidence of contamination.

ADD REPLY
0
Entering edit mode

Yeah, what I meant to say was, if you have a sample that gives a 0 for a gene that nearly all other samples give a high count, does this sample have high counts elsewhere?

One way to do this kind of analysis is with PCA. If these samples are outliers in PCA, then the loadings will tell you which genes distinguish those samples. Suppose the sample is outlying in PC1 in the positive direction. You would then look for large positive values for the PC1 loading vector.

ADD REPLY
0
Entering edit mode

In most cases, yes- a sample that gives a 0 for a gene has high counts elsewhere.

I actually performed a PCA early on for my sample-level QC when I took over the project and found a group that was substantially different from all of the others. I used the vst algorithm to transform the counts and then extracted just a single timepoint of interest. Almost 80% of the variance in that timepoint is along PC1 and the samples cluster almost perfectly by sequencing run (see attached plot). This is another issue I've been going down a rabbit hole trying to solve but unfortunately, batch and timepoint/cohort are perfectly confounded for nearly all samples so I'm not sure how much I can do.

Along with that, I took the top 10 genes loaded onto PC1 and found that the odd-balls had relatively high counts for these genes (in some cases, over 100,000), while all other samples had 0 counts. As a bit more background for all of this, I'm working with a dataset comparing gene expression from tumor biopsies at different timepoints. The timepoint in the plot is baseline, where I would expect (barring any extreme variability) that everything should look relatively similar.

PCA baseline

ADD REPLY
0
Entering edit mode

What are those genes?

ADD REPLY
0
Entering edit mode

The top genes loaded onto PC1 in the positive direction (in descending order) are USP17L6P, HRNR, MUC12, CT47A12, PRR20A, MUC17, three SPATA31 genes and one other. I combed through the Human Protein Atlas and generally can't find anything that would explain why I'm observing positive counts for these genes ONLY in a single sequencing run at a given timepoint. There was another gene loaded on PC1 in this direction, DSPP, that is associated with dentin mineralization, and certainly not something I'd expect to see in my tissue type.

I also looked at the counts for these genes at the two other timepoints and they're nearly all zero (with the exception of low counts for one sample). So, this is leading me to believe there's an issue unique to just the baseline timepoint.

Have you seen results like this before? I'm wondering if this is an indication of potential contamination in the library prep.

ADD REPLY
1
Entering edit mode

Those are not recognizable to me, but I'd ask around on that point. Seems like this batch should be set aside.

ADD REPLY
0
Entering edit mode

At this point, I agree, I think the most sensible conclusion would be to set the batch for this timepoint aside. It's unfortunate, as this particular batch is critical to answering a central research question for this project, but better to discover this now rather than later.

ADD REPLY
0
Entering edit mode

It was not my suggestion that you have one sample with all zeros. However, you might have 2 or more failed samples which, between them, have zeros in every single gene. If you have this, adding pseudocounts is not the answer. The answer is to get rid of the bad samples. If you have very atypical samples that really do have zeros in every gene, don't just blindly add pseducounts to force the data through a workflow that is not designed for your kind of data. You'll have to stop and consider if a different workflow would be more appropriate for your experiment.

ADD REPLY

Login before adding your answer.

Traffic: 881 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6