More generally, how should we calculate differential expression of genes (or transcripts) based on long-read nanopore data?
Is there some option I can activate to get our long reads to work better with DESeq2 (or any other tool)?
My core problem is that the Log2FC values reported by DESeq2 do not reflect the values I calculate myself by taking counts of genome-mapped reads for a particular transcript and computing an unadjusted fold change. In extreme cases there's a difference of about 4 log2 units (i.e. DESeq2 reports a Log2FC of 8, whereas I calculate it as about 4). In addition, zero counts are being inconsistently scaled by the rlog function; here's an example:
| Gene | Raw counts (Sample 1) | Raw counts (Sample 2) | rlog counts (Sample 1) | rlog counts (Sample 2) | L2FC |
|------|------|------|------|------|------|
| Gstk1 | 0 0 0 | 5 36 17 3 | 0.38 0.02 0.26 | 2.36 3.50 3.03 1.50 | 4.69 |
| Psmb8 | 0 0 0 | 12 16 64 5 | 1.43 0.92 1.26 | 4.04 3.26 5.19 2.70 | 5.16 |
| Sumo3 | 0 0 0 | 14 8 9 4 | 1.74 1.22 1.57 | 4.36 2.78 3.21 2.73 | 5.23 |
| Mgp | 0 0 0 | 1 66 24 16 | 1.72 1.18 1.54 | 2.30 4.98 4.23 4.03 | 5.54 |
| Ccl2 | 0 0 0 | 197 37 33 6 | 1.73 1.13 1.53 | 7.68 4.39 4.66 3.09 | 7.67 |
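For reference, this is the kind of unadjusted calculation I mean, using the Gstk1 counts above (the 0.5 pseudocount is just an arbitrary way of handling the all-zero group):

## Unadjusted log2 fold change for Gstk1 from the raw counts above;
## a 0.5 pseudocount stops the zero group from dividing by zero
log2((mean(c(5, 36, 17, 3)) + 0.5) / (mean(c(0, 0, 0)) + 0.5));
## ~4.98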
Why do the zero values have different adjusted levels, both for different genes, and within replicates of the same sample for the same gene?
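I suspect library-size normalisation is part of this; here's a self-contained toy example (entirely synthetic data) where identical zero counts get slightly different rlog values once the samples have different size factors:

library(DESeq2);
set.seed(42);
## 100 genes x 6 samples of synthetic negative binomial counts
toy.mat <- matrix(rnbinom(600, mu=20, size=2), ncol=6);
toy.mat[,4:6] <- toy.mat[,4:6] * 3; ## inflate library sizes for samples 4-6
toy.mat[1,] <- c(0, 0, 0, 0, 36, 17); ## zeros spread across both groups
toy.dds <- DESeqDataSetFromMatrix(toy.mat,
                                  data.frame(group=factor(rep(1:2, each=3))),
                                  ~ group);
assay(rlog(toy.dds))[1,]; ## the four zeros map to four different values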
I have the following workflow for processing nanopore data (more details here):
- Demultiplex reads by barcode
- Orient reads to be relative to transcription direction
- Map to transcriptome using LAST*
- Convert unique barcode/transcript/direction tuples into a count table (sketched below)
* I currently use LAST due to its high specificity, but am considering whether minimap2 (or something else) would be a better option
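The count table step is essentially a group-and-count; here's a rough sketch in R (the input file name and column layout are made up, and my actual parsing of LAST output is messier):

## Hypothetical input: one row per mapped read, giving barcode (sample),
## transcript, and read direction
hits.df <- read.delim("read_assignments.tsv",
                      col.names=c("barcode", "transcript", "dir"));
hits.df$tdir <- paste0(hits.df$transcript, hits.df$dir);
## Cross-tabulate into a transcript+direction x sample count matrix
count.mat <- unclass(xtabs(~ tdir + barcode, data=hits.df));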
After this, I get a bit lost, because there doesn't seem to be an established method for analysing long-read data. Long-read data from nanopore sequencing tends to be very specific (e.g. a read count of 1 or 2 is a good indication that the transcript is actually present and expressed), but not very sensitive (i.e. read counts are typically quite low for each transcript). We probably have substantial batch effects, because our sequencing runs are a few months apart, and Oxford Nanopore usually changes at least one of the sample preparation protocol, reagent kit, basecalling software, or flow cell in that time frame.
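My assumption is that the sequencing run could be modelled as an ordinary additive factor in the design, along the lines of the following (the Run and Group column names here are hypothetical):

## Hypothetical: batch as an extra term ahead of the factor of interest
dds <- DESeqDataSetFromMatrix(count.mat, meta.df, ~ Run + Group);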
Is it okay to analyse our data using DESeq2? Is there something specific that I should be doing for low-coverage long-read data (e.g. zero-value inflation)?
A count table and metadata file can be found here and here respectively [note: unpublished / draft data; subject to change without notice].
Edit: After Michael Love's comment about rlog vs VST, I did a VST transformation on the counts instead, which I'm much more comfortable with. This still produced positive transformed values for zero counts, but at least the zeros were consistently scaled to the same transformed number, which made them easy to filter out. I subtracted the minimum from the transformed values, then rescaled so that the 99th percentile of the non-zero values lined up with that of the raw counts, which ended up with something that seemed a reasonable approximation of the raw read counts. Finally, I set all the zero values to -2 (so that they would be rounded to 0 in non-log space); a code sketch of this follows the table below:
| Gene | Raw counts (Sample 1) | Raw counts (Sample 2) | Scaled VST counts (Sample 1) | Scaled VST counts (Sample 2) | L2FC |
|------|------|------|------|------|------|
| Gstk1 | 0 0 0 | 5 36 17 3 | -2 -2 -2 | 2.52 3.68 3.13 1.58 | 4.69 |
| Psmb8 | 0 0 0 | 12 16 64 5 | -2 -2 -2 | 3.66 2.61 5.22 2.01 | 5.16 |
| Sumo3 | 0 0 0 | 14 8 9 4 | -2 -2 -2 | 3.89 1.9 2.37 1.81 | 5.23 |
| Mgp | 0 0 0 | 1 66 24 16 | -2 -2 -2 | 1.18 4.65 3.61 3.36 | 5.54 |
| Ccl2 | 0 0 0 | 197 37 33 6 | -2 -2 -2 | 8.81 3.72 4.1 2.18 | 7.67 |
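For what it's worth, that rescaling boils down to something like this (using the dds and count.mat objects from my script below; matching the 99th percentile against log2 raw counts is my own choice of target):

vst.mat <- assay(vst(dds));        ## variance-stabilising transformation
zeros <- count.mat == 0;           ## zeros all map to the same VST value
vst.mat <- vst.mat - min(vst.mat); ## shift so the minimum is 0
## Rescale so the 99th percentile of non-zero values matches log2 raw counts
vst.mat <- vst.mat * (quantile(log2(count.mat[!zeros]), 0.99) /
                      quantile(vst.mat[!zeros], 0.99));
vst.mat[zeros] <- -2;              ## zeros round to 0 in non-log space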
Here's a snippet of my current DESeq2 processing script, which processes a merged count table [tdir = transcript + strand direction]:
library(DESeq2); ## DESeqDataSetFromMatrix, DESeq, results, lfcShrink, rlog, assay
library(dplyr);  ## as.tbl, left_join

analysisDate <- format(Sys.Date(), "%Y-%b-%d");
countDate <- "2019-Apr-18";
count.df <- read.csv(sprintf("raw_counts_%s.csv", countDate),
stringsAsFactors=FALSE);
## Re-scan header to avoid replacement of characters with '.'
colnames(count.df) <- scan(sprintf("raw_counts_%s.csv", countDate),
what="character", nlines=1,
sep=",");
count.cols <- setdiff(colnames(count.df),
c("transcript", "Chr", "Strand", "Start", "End",
"Description", "Gene", "dir", "tdir"));
...
count.mat <- as.matrix(count.df[,count.cols]);
count.mat[is.na(count.mat)] <- 0;
rownames(count.mat) <- count.df$tdir;
## Only include transcripts with total counts >= 2
count.mat <- count.mat[rowSums(count.mat) >= 2,];
...
meta.df <- read.csv("metadata.csv");
excluded.factors <- "Treatment"; ## Factors to exclude from statistical model
## Identify factors for the statistical model from the metadata file
factorNames <- setdiff(colnames(meta.df),
                       c("SampleID", "Label", "Replicate", "Notes",
                         excluded.factors));
## Convert to differential expression structure
dds <- DESeqDataSetFromMatrix(count.mat, meta.df,
as.formula(paste0("~ ",paste(factorNames,collapse=" + "))));
## Run differential expression tests
dds <- DESeq(dds);
...
## Collect up comparisons to make
resultList <- NULL;
for(fi in factorNames){
vn <- unique(metasub.df[,fi]);
vn <- vn[order(-xtfrm(vn))];
if(length(vn) == 1){
next;
}
for(fai in seq(1, length(vn)-1)){
for(fbi in seq(fai+1, length(vn))){
cat(sprintf("%s: %s vs %s\n", fi, vn[fai], vn[fbi]));
resultList <- c(resultList, list(c(fi, as.character(vn[fai]), as.character(vn[fbi]))));
}
}
}
## Generate base count table
countsub.df <- subset(count.df, tdir %in% rownames(count.mat));
dds.withCounts.tbl <- as.tbl(countsub.df);
## Generate DE results, add to base table
for(rn in resultList){
print(rn);
results.df <-
if(l2FCShrink){
as.data.frame(lfcShrink(dds, contrast=rn, type = "ashr"));
} else {
as.data.frame(results(dds, contrast=rn));
}
results.df$log2FoldChange <- round(results.df$log2FoldChange, 2);
results.df$pvalue <- signif(results.df$pvalue, 3);
results.df$padj <- signif(results.df$padj, 3);
rn.label <- paste(rn, collapse="-");
results.tbl <- as.tbl(results.df[, c("log2FoldChange", "pvalue", "padj")]);
colnames(results.tbl) <- paste0(c("L2FC.","pval.", "padj."), rn.label);
results.tbl$tdir <- rownames(results.df);
dds.withCounts.tbl <-
left_join(dds.withCounts.tbl, results.tbl, by="tdir");
}
## Replace original count array with normalised log2 count array
dds.counts <- assay(rlog(dds));
dds.withCounts.tbl[,colnames(dds.counts)] <- round(dds.counts,2);
## Exclude any zero/negative total counts after regularised log transform
dds.withCounts.tbl <-
dds.withCounts.tbl[rowSums(dds.counts, na.rm=TRUE) > 0,];
## Replace count column names with labels from metadata file
colnames(dds.withCounts.tbl)[match(meta.df$SampleID,
colnames(dds.withCounts.tbl))] <-
as.character(meta.df$Label);
## Write out to a file
write.csv(dds.withCounts.tbl, row.names=FALSE,
sprintf("DE_normalised_%s.csv", analysisDate));
You can try out the pipeline at https://github.com/nanoporetech/pipeline-transcriptome-de, which is loosely based on the Swimming Downstream workflow. It uses minimap2 for mapping, salmon for quantification, and edgeR quasi-likelihood for differential gene expression. Also, I would not orient the reads in this case, as it is not necessary and might just discard some reads.
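The edgeR quasi-likelihood part boils down to the standard steps; roughly the following (the counts matrix and group factor names here are placeholders):

library(edgeR);
y <- DGEList(counts=count.mat, group=group);
y <- calcNormFactors(y);        ## TMM normalisation
design <- model.matrix(~ group);
y <- estimateDisp(y, design);   ## common/trended/tagwise dispersions
fit <- glmQLFit(y, design);     ## quasi-likelihood GLM fit
qlf <- glmQLFTest(fit, coef=2); ## test the group effect
topTags(qlf);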
Read orientation is necessary for the discoveries we're carrying out.
The most relevant for us is negative strand expression on the mitochondrial genome in unannotated regions. I appreciate that this is mostly not going to be picked up by mapping to a transcriptome, and am on the lookout for alternative ways to do mapping of unannotated regions.
It doesn't make sense to me to discard that information, given that it is a strand-specific sequencing process.