Question

EdgeR diffSpliceDGE: Obtaining exon-level counts and normalisation step

0

Entering edit mode

oakhamwolf • 0

@oakhamwolf-24374

Last seen 4.1 years ago

UK

Hi all,

I'm looking for some help when analysing RNA-Seq data for differential exon usage/splicing. I've previously used edgeR for DE analysis and was happy to see diffSpliceDGE as part of that package. What I don't know is:

how to obtain exon level counts
whether there is a normalisation step involved like there is in a DGE pipeline (i.e. TMM)

For the first point, I have BAM files for the samples I'd like to analyse that were mapped using STAR to the GRCh38 reference. Is there a way to extract exon-level counts from these files? In case it matters I'm analysing data from an exon skipping experiment and so will be looking at differential splicing for our gene of interest as well as any off-target effects.

For the second point, I've been through the example on the documents page (https://rdrr.io/bioc/edgeR/man/diffSpliceDGE.html) and while that made perfect sense I noticed there was no normalisation step used. How, if at all, should the exon level count data be normalised prior to analysis?

Many thanks for your help and time.

edger normalization differential exon usage splicing • 2.8k views

ADD COMMENT • link updated 4.1 years ago by Yunshun Chen ▴ 890 • written 4.1 years ago by oakhamwolf • 0

score 2 · Accepted Answer · 2020-10-15

2

Entering edit mode

Yunshun Chen ▴ 890

@yunshun-chen-5451

Last seen 1 day ago

Australia

how to obtain exon level counts

One way to obtain exon level counts is to run featureCounts of the Rsubread package with useMetaFeatures = FALSE.

whether there is a normalisation step involved like there is in a DGE pipeline (i.e. TMM)

The standard filtering and normalization steps for the gene-level analysis can also be applied to the exon-level. There is a case study on a differential splicing analysis in the edgeR user's guide that describes the details of the analysis pipeline.

ADD COMMENT • link 4.1 years ago Yunshun Chen ▴ 890

0

Entering edit mode

Thank you for your reply. I had an old copy of the edgeR manual and hadn't seen the new case studies. Excellent! I've been through the Case Study on the Pasilla knockdown, which looks exactly what I am after. Although I have a couple of questions.

I've read in my BAM files and successfully generated exon level counts as you described and as per the manual. However when I get to the creation of the DGEList object, I get an error.

Here's my code and the resulting error:

fc <- featureCounts(files=unlist(bam_files), annot.ext=ens_gtf, isGTFAnnotationFile=TRUE,
                    GTF.featureType="exon", GTF.attrType="gene_id", useMetaFeatures=FALSE,
                    allowMultiOverlap=TRUE)
y.all <- DGEList(counts=fc$counts, genes=fc$annotation)

error:

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘ENSG00000000003’, .......

My counts data looks like this:

head(fc)

    $counts
                         RNAseq2_58      RNAseq2_59      RNAseq2_60      RNAseq2_61 ...
    ENSG00000223972               9               5               9              17 ...
    ENSG00000223972               4               4               8               6 ...
    ENSG00000223972              54              57              52              78 ...
    ENSG00000223972               2               4               6               9 ...

annotation looks like this:

head(fc$annotation)
           GeneID Chr Start   End Strand Length
1 ENSG00000223972   1 11869 12227      +    359
2 ENSG00000223972   1 12613 12721      +    109
3 ENSG00000223972   1 13221 14409      +   1189

As these are exon-level counts, I'd expect the same ID appearing on multiple rows. Any ideas why there's an error?

In the Pasilla example, I'm trying to understand the design matrix to account for the batch effect of sequencing type and date.

I see how the use of Batch <- factor(c(1,3,4,1,3,4)) is trying to account for both with sequencing type being c('SE','PE','PE','SE','PE','PE') but the date is more complex. It looks to me that N1 is run over two dates, N3 is run over two dates and N4 is run on one date, it's the D samples where I'm confused, D1 is run over 4 different days but D3 and D4 are both run on the same day. Given that D3 and D4 are also both PE, should it be Batch <- factor(c(1,3,4,1,3,3)) or something similar?

Many thanks again for your help.

ADD REPLY • link 4.1 years ago oakhamwolf • 0

0

Entering edit mode

The error is caused by having duplicate row.names, which is not allowed. To get rid of this error, you can simply drop the row.names of the count matrix before creating the DGEList object.

counts=fc$counts
rownames(counts) <- NULL
y.all <- DGEList(counts=counts, genes=fc$annotation)

The exon information is still preserved in y.all$genes.

I don't see anything wrong with the design. There are 6 RNA-seq libraries from 3 individual samples under 2 conditions. It is a very classic paired two group comparison. The design matrix accounts for the batch effect between different samples. It doesn't take into account the sequencing type/date.

ADD REPLY • link 4.1 years ago Yunshun Chen ▴ 890

0

Entering edit mode

Thank you again. Regarding the design matrix, I guess my confusion comes from the manual saying

The MDS plot shows clear separation of the Pasilla down vs normal samples, but also a batch effect associated with sequencing type and date.

and then:

To account for the batch effect observed from the MDS plot, we create a design matrix as follows

So I couldn't work out how the Batch <- factor(c(1,3,4,1,3,4)) accounted for the largest source of variation in this data, the sequencing type. To clarify, is that accounted for simply by the fact that the N* and D* samples are paired and also paired in terms of their sequencing type and so the batch effect seen from sequencing type can be accounted for by simply providing a paired design? Would it be redundant to add another level in there specifically for sequencing type e.g. SeqType <- c(0,1,1,0,1,1) and then use design <- model.matrix(~ Batch + SeqType + Pasilla)?

Apologies if this is obvious, I'm just trying to learn and get it right, so thank you for you help :) .

ADD REPLY • link 4.1 years ago oakhamwolf • 0

0

Entering edit mode

The Batch is nested within your SeqType. It would be redundant to add SeqType in the design matrix when Batch is already included.

ADD REPLY • link 4.1 years ago Yunshun Chen ▴ 890

0

Entering edit mode

So to be clear on the nesting, the 1 group in the SeqType is further split into 3,4 within the Batch and so the use of SeqType is redundant as it's not adding any more information than just including Batch. If that's correct then thank you, you have been very helpful indeed and I'll stop pestering you!