Question

How should I handle read counts derived from SGSeq when they were passed to limma/edgeR

1

Entering edit mode

zliu ▴ 10

@zliu-23226

Last seen 4.3 years ago

I am planning to do alternative splicing analyses with SGSeq and limma/edgeR, which was suggested in SGSeq vignette. But I could not find any literature to do it this way. So, I am trying to figure it out how to do it, and I am using a Chr22 RNA-seq data from an experiment of our lab to demonstrate it.

I obtained the SGVariant Count object, and got the count and variant information as shown below. I created a DGEList with these data. When I observed these data, I found one EventID always corresponds to 2 or more variantIDs. And these 2 variantIDs have the same genomic ranges. However, they have very different read counts. Usually, one variant has very low counts, and the other has fair counts. My understanding is that the counts from different variants, which are according to different paths of a splice graph. Am I right?

When I used these data to do alternative splicing analyses with limma/edgeR, I could not decide how to handle these data. First, the low count variants are always filtered out. Also, to my understanding, the diffSplice() function compared an exon with other exons in the same gene to determine whether that exon is differentially expressed. However, in this count table, the exon information for a single gene is incomplete, although there are many variant information inside one gene. I tried to use "gene_id" and "EventID" showed in the anno data frame to define geneid in the diffSplice() function, and they gave different results. But when I check the results with IGV, none of them seems correct.

Another way is to treat every variant as a single gene. However, in this way, without the whole gene correction, we cannot distinguish differential variant expression from differential gene expression.

So, I wonder what is the count-read engine behind? Is it possible to do read counting of every exon of the whole genes using the same engine? This may be quite important for the alternative splicing quantification analysis.

Anyone has done this before? Any suggestions would be helpful. Thanks!

I didn't encounter any errors, so I did not include the session information here.

> ## Get count table
> counts <- counts(sgvc_pred)
> head(counts, n =30)
      shCTRL1 shCTRL2 shCTRL3 shPRPF8_1_1 shPRPF8_1_2 shPRPF8_1_3
 [1,]       0       0       0           3           1           4
 [2,]       0       0       0           0           0           1
 [3,]       0       0       0           1           4           3
 [4,]       0       0       0           0           0           1
 [5,]       0       0       0           1           0           0
 [6,]       0       0       0           1           0           0
 [7,]       2       2       4           0           8           2
 [8,]       2       3      16           8          10          11
 [9,]       0       0       0           0           1           0
[10,]       0       3       0           2           0           2
[11,]       0       3       0           1           1           5
[12,]       0       0       2           2           2           2
[13,]       0       0       0           1           0           0
[14,]       7       5       9          10          13          35
[15,]       1       0       1           2           0           3
[16,]       0       3       1           1           0           3
[17,]     104     117      99          62          91          88
[18,]      13      12       9           4           9           3
[19,]       1       0       0           0           0           0
[20,]      14       4      18           5          14           9
[21,]       0       2       0           0           0           0
[22,]     203     220     189         114         183         161
[23,]     152     163     147          82         149         127
[24,]       1       4       5           0           0           1
[25,]       0       0       0           0           0           2
[26,]      48      30      48          13          36          36
[27,]       0       0       0           1           0           0
[28,]      30      19      22           6          13          15
[29,]       1       0       0           0           0           0
[30,]      19      18      14           5          12           5
> ## Obtain geneName, txName, featureID, geneID, ect. for exon annotations.
> gene.name <- sapply(sapply(geneName(sgvc_pred), unlist), function(x) {paste(x,collapse = "+")})
> seqnames<- sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 2)
> starts <- as.numeric(sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 3))
> strands <- sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 4)
> ends <- as.numeric(sapply(stri_split_fixed(to(sgvc_pred),":"), "[[", 3))
> 
> anno <- data.frame(VariantID = variantID(sgvc_pred), 
+                    Chr = seqnames, 
+                    Start = starts,
+                    End = ends,
+                    Strand = strands,
+                    Length = abs(starts - ends),
+                    gene_id = geneID(sgvc_pred),
+                    gene_name = gene.name,
+                    type = variantName(sgvc_pred),
+                    EventID = eventID(sgvc_pred))
> 

> head(anno, n = 30)
   VariantID Chr    Start      End Strand Length gene_id       gene_name           type EventID
1         39  22 11910219 11910989      +    770       8                  8_13_1/2_A3SS      13
2         40  22 11910219 11910989      +    770       8                  8_13_2/2_A3SS      13
3         41  22 11910219 11911025      +    806       8                  8_14_1/2_A3SS      14
4         42  22 11910219 11911025      +    806       8                  8_14_2/2_A3SS      14
5         43  22 11910219 11913040      +   2821       8                    8_15_1/2_SE      15
6         44  22 11910219 11913040      +   2821       8                    8_15_2/2_SE      15
7         88  22 11916034 11922926      +   6892      10                 10_7_1/2_OTHER      31
8         89  22 11916034 11922926      +   6892      10                 10_7_2/2_OTHER      31
9         90  22 11917076 11918800      +   1724      11                 11_1_1/2_OTHER      32
10        91  22 11917076 11918800      +   1724      11                 11_1_2/2_OTHER      32
11        95  22 11932586 11933775      +   1189       8                 8_25_1/2_OTHER      33
12        96  22 11932586 11933775      +   1189       8                 8_25_2/2_OTHER      33
13        97  22 11937533 11938702      +   1169       8                 8_26_1/2_OTHER      34
14        98  22 11937533 11938702      +   1169       8                 8_26_2/2_OTHER      34
15        99  22 15777331 15778010      +    679      21 ENSG00000225255  21_1_1/2_A5SS      35
16       100  22 15777331 15778010      +    679      21 ENSG00000225255  21_1_2/2_A5SS      35
17       247  22 15787282 15788552      +   1270      22 ENSG00000206195   22_17_1/2_RI      52
18       253  22 15787282 15788552      +   1270      22 ENSG00000206195   22_17_2/2_RI      52
19       254  22 15787415 15787444      +     29      22 ENSG00000206195   22_20_1/2_RI      55
20       255  22 15787415 15787444      +     29      22 ENSG00000206195   22_20_2/2_RI      55
21       279  22 15788633 15788668      +     35      22 ENSG00000206195   22_28_1/2_RI      63
22       280  22 15788633 15788668      +     35      22 ENSG00000206195   22_28_2/2_RI      63
23       281  22 15788699 15788820      +    121      22 ENSG00000206195   22_29_1/2_RI      64
24       282  22 15788699 15788820      +    121      22 ENSG00000206195   22_29_2/2_RI      64
25       492  22 15816282 15816370      +     88      22                   22_61_1/2_RI      99
26       493  22 15816282 15816370      +     88      22                   22_61_2/2_RI      99
27       494  22 15816566 15816697      +    131      26                    26_1_1/2_RI     100
28       495  22 15816566 15816697      +    131      26                    26_1_2/2_RI     100
29       496  22 15816589 15816654      +     65      26                    26_2_1/2_RI     101
30       497  22 15816589 15816654      +     65      26                    26_2_2/2_RI     101

SGSeq limma edgeR • 1.6k views

ADD COMMENT • link updated 5 months ago by Sara • 0 • written 4.4 years ago by zliu ▴ 10

score 1 · Answer 1 · 2020-10-13

1

Entering edit mode

Leonard Goldstein ▴ 260

@leonard-goldstein-6845

Last seen 8 months ago

Australia

The SGVariantCounts object contains read counts for splice variants. Each splice variant belongs to a splice event, which is made up of two or more variants. The vignette section on differential splicing is only brief – the idea is to test for differential variant usage within a splice event, analogous to testing for differential exon usage within a gene. For this type of analysis diffSplice arguments geneid and exonid correspond to SGSeq event_id and variant_id, respectively. For a reference that uses DEXSeq, please see Srinivasan et al. Cell Reports 2016 https://www.nature.com/articles/ncomms11295.

ADD COMMENT • link 4.4 years ago Leonard Goldstein ▴ 260

0

Entering edit mode

Thanks for your answer, Dr. Goldstein!

ADD REPLY • link 4.3 years ago zliu ▴ 10

0

Entering edit mode

Dear Dr. Goldstein,

I was checking the paper that you mentioned. From the DEXSeq normally we have Padj values and log2foldchanges. In this paper, they used the final result from DEXSeq and in addition, they showed delta PSI for each event as they mentioned: "The variant frequencies are what is called PSI ("percent spliced-in") by some authors. We also provide the statistic deltaVF, which means "delta variant frequency": the difference in variant frequencies between the two groups (called delta-PSI by some authors). This statistic may be more easily interpreted than the fold-change statistic."

But I am not sure how to calculate psi per group and get the delta psi; then use it in the final result from DEXSeq for each event and variant.

here after getting analyzeVariants

sgvc_pred <- analyzeVariants(sgfc_pred)

I've got the Variantfrequency:

variantFreq(sgvc_pred)
       N1   N2        N3        N4         T1         T2        T3         T4
[1,] 0.88 0.56 0.6153846 0.5925926 0.93333333 0.90196078 0.8484848 0.95833333
[2,] 0.12 0.44 0.3846154 0.4074074 0.06666667 0.09803922 0.1515152 0.04166667

variantName(sgvc_pred)
[1] "1_1_1/2_SE" "1_1_2/2_SE"

then I set the psi from the variant frequency as below:

psi <- variantFreq(sgvc_pred)

> psi
       N1   N2        N3        N4         T1         T2        T3         T4
[1,] 0.88 0.56 0.6153846 0.5925926 0.93333333 0.90196078 0.8484848 0.95833333
[2,] 0.12 0.44 0.3846154 0.4074074 0.06666667 0.09803922 0.1515152 0.04166667

then I set the variantNames as the row names for psi:

rownames(psi) <- variantName(sgvc_pred)
> psi
             N1   N2        N3        N4         T1         T2        T3         T4
1_1_1/2_SE 0.88 0.56 0.6153846 0.5925926 0.93333333 0.90196078 0.8484848 0.95833333
1_1_2/2_SE 0.12 0.44 0.3846154 0.4074074 0.06666667 0.09803922 0.1515152 0.04166667

but then here is my problem, I have a single psi value for every event and each sample. But how to calculate the psi for controls and cases; to have one psi value for the control group, and one for the case group. consider:

controls are: N1, N2,N3,N4 cases are: T1,T2,T3,T4

Now, I have psi values for each event per sample, how to calculate to have one psi value per group (control and patient).

For this variant 79791_1_1/2_SE, my goal is to present something like below:

79791_1_1/2_SE :       psi(controls)= 0   ;  psi(cases)=1     ;    dpsi(cases-control)=1

should I get the mean of individual psi values for N1, N2,N3,N4 samples and say the mean is the psi value for the control group? I am not sure if it is the right way.

we have the psi values for each event and each sample but at the end, the goal is to have one psi value for each variant per group, how do we calculate it?

Thank you in advance, Looking forward to hearing from you!

ADD REPLY • link 5 months ago Sara • 0

score 0 · Answer 2 · 2020-10-09

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 12 hours ago

WEHI, Melbourne, Australia

I am not familiar at all with SGSeq and I don't know, for example, why it returns multiple rows for the same exon. If you use diffSplice in limma you could set geneid = gene_id and leave exonid unset. The SGSeq authors would need to chime in here if anything else is required.

ADD COMMENT • link 4.4 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks for your response, Dr. Smyth. It was a typo, I used "gene_id" and "EventID" in the anno data frame to define geneid in the diffSplice() function. I corrected it. Appreciate it!

ADD REPLY • link 4.4 years ago zliu ▴ 10