I am planning to do alternative splicing analyses with SGSeq and limma/edgeR, which was suggested in SGSeq vignette. But I could not find any literature to do it this way. So, I am trying to figure it out how to do it, and I am using a Chr22 RNA-seq data from an experiment of our lab to demonstrate it.
I obtained the SGVariant Count object, and got the count and variant information as shown below. I created a DGEList with these data. When I observed these data, I found one EventID always corresponds to 2 or more variantIDs. And these 2 variantIDs have the same genomic ranges. However, they have very different read counts. Usually, one variant has very low counts, and the other has fair counts. My understanding is that the counts from different variants, which are according to different paths of a splice graph. Am I right?
When I used these data to do alternative splicing analyses with limma/edgeR, I could not decide how to handle these data. First, the low count variants are always filtered out. Also, to my understanding, the diffSplice() function compared an exon with other exons in the same gene to determine whether that exon is differentially expressed. However, in this count table, the exon information for a single gene is incomplete, although there are many variant information inside one gene. I tried to use "gene_id" and "EventID" showed in the anno data frame to define geneid in the diffSplice() function, and they gave different results. But when I check the results with IGV, none of them seems correct.
Another way is to treat every variant as a single gene. However, in this way, without the whole gene correction, we cannot distinguish differential variant expression from differential gene expression.
So, I wonder what is the count-read engine behind? Is it possible to do read counting of every exon of the whole genes using the same engine? This may be quite important for the alternative splicing quantification analysis.
Anyone has done this before? Any suggestions would be helpful. Thanks!
I didn't encounter any errors, so I did not include the session information here.
> ## Get count table
> counts <- counts(sgvc_pred)
> head(counts, n =30)
shCTRL1 shCTRL2 shCTRL3 shPRPF8_1_1 shPRPF8_1_2 shPRPF8_1_3
[1,] 0 0 0 3 1 4
[2,] 0 0 0 0 0 1
[3,] 0 0 0 1 4 3
[4,] 0 0 0 0 0 1
[5,] 0 0 0 1 0 0
[6,] 0 0 0 1 0 0
[7,] 2 2 4 0 8 2
[8,] 2 3 16 8 10 11
[9,] 0 0 0 0 1 0
[10,] 0 3 0 2 0 2
[11,] 0 3 0 1 1 5
[12,] 0 0 2 2 2 2
[13,] 0 0 0 1 0 0
[14,] 7 5 9 10 13 35
[15,] 1 0 1 2 0 3
[16,] 0 3 1 1 0 3
[17,] 104 117 99 62 91 88
[18,] 13 12 9 4 9 3
[19,] 1 0 0 0 0 0
[20,] 14 4 18 5 14 9
[21,] 0 2 0 0 0 0
[22,] 203 220 189 114 183 161
[23,] 152 163 147 82 149 127
[24,] 1 4 5 0 0 1
[25,] 0 0 0 0 0 2
[26,] 48 30 48 13 36 36
[27,] 0 0 0 1 0 0
[28,] 30 19 22 6 13 15
[29,] 1 0 0 0 0 0
[30,] 19 18 14 5 12 5
> ## Obtain geneName, txName, featureID, geneID, ect. for exon annotations.
> gene.name <- sapply(sapply(geneName(sgvc_pred), unlist), function(x) {paste(x,collapse = "+")})
> seqnames<- sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 2)
> starts <- as.numeric(sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 3))
> strands <- sapply(stri_split_fixed(from(sgvc_pred),":"), "[[", 4)
> ends <- as.numeric(sapply(stri_split_fixed(to(sgvc_pred),":"), "[[", 3))
>
> anno <- data.frame(VariantID = variantID(sgvc_pred),
+ Chr = seqnames,
+ Start = starts,
+ End = ends,
+ Strand = strands,
+ Length = abs(starts - ends),
+ gene_id = geneID(sgvc_pred),
+ gene_name = gene.name,
+ type = variantName(sgvc_pred),
+ EventID = eventID(sgvc_pred))
>
> head(anno, n = 30)
VariantID Chr Start End Strand Length gene_id gene_name type EventID
1 39 22 11910219 11910989 + 770 8 8_13_1/2_A3SS 13
2 40 22 11910219 11910989 + 770 8 8_13_2/2_A3SS 13
3 41 22 11910219 11911025 + 806 8 8_14_1/2_A3SS 14
4 42 22 11910219 11911025 + 806 8 8_14_2/2_A3SS 14
5 43 22 11910219 11913040 + 2821 8 8_15_1/2_SE 15
6 44 22 11910219 11913040 + 2821 8 8_15_2/2_SE 15
7 88 22 11916034 11922926 + 6892 10 10_7_1/2_OTHER 31
8 89 22 11916034 11922926 + 6892 10 10_7_2/2_OTHER 31
9 90 22 11917076 11918800 + 1724 11 11_1_1/2_OTHER 32
10 91 22 11917076 11918800 + 1724 11 11_1_2/2_OTHER 32
11 95 22 11932586 11933775 + 1189 8 8_25_1/2_OTHER 33
12 96 22 11932586 11933775 + 1189 8 8_25_2/2_OTHER 33
13 97 22 11937533 11938702 + 1169 8 8_26_1/2_OTHER 34
14 98 22 11937533 11938702 + 1169 8 8_26_2/2_OTHER 34
15 99 22 15777331 15778010 + 679 21 ENSG00000225255 21_1_1/2_A5SS 35
16 100 22 15777331 15778010 + 679 21 ENSG00000225255 21_1_2/2_A5SS 35
17 247 22 15787282 15788552 + 1270 22 ENSG00000206195 22_17_1/2_RI 52
18 253 22 15787282 15788552 + 1270 22 ENSG00000206195 22_17_2/2_RI 52
19 254 22 15787415 15787444 + 29 22 ENSG00000206195 22_20_1/2_RI 55
20 255 22 15787415 15787444 + 29 22 ENSG00000206195 22_20_2/2_RI 55
21 279 22 15788633 15788668 + 35 22 ENSG00000206195 22_28_1/2_RI 63
22 280 22 15788633 15788668 + 35 22 ENSG00000206195 22_28_2/2_RI 63
23 281 22 15788699 15788820 + 121 22 ENSG00000206195 22_29_1/2_RI 64
24 282 22 15788699 15788820 + 121 22 ENSG00000206195 22_29_2/2_RI 64
25 492 22 15816282 15816370 + 88 22 22_61_1/2_RI 99
26 493 22 15816282 15816370 + 88 22 22_61_2/2_RI 99
27 494 22 15816566 15816697 + 131 26 26_1_1/2_RI 100
28 495 22 15816566 15816697 + 131 26 26_1_2/2_RI 100
29 496 22 15816589 15816654 + 65 26 26_2_1/2_RI 101
30 497 22 15816589 15816654 + 65 26 26_2_2/2_RI 101
Thanks for your answer, Dr. Goldstein!
Dear Dr. Goldstein,
I was checking the paper that you mentioned. From the DEXSeq normally we have Padj values and log2foldchanges. In this paper, they used the final result from DEXSeq and in addition, they showed delta PSI for each event as they mentioned: "The variant frequencies are what is called PSI ("percent spliced-in") by some authors. We also provide the statistic deltaVF, which means "delta variant frequency": the difference in variant frequencies between the two groups (called delta-PSI by some authors). This statistic may be more easily interpreted than the fold-change statistic."
But I am not sure how to calculate psi per group and get the delta psi; then use it in the final result from DEXSeq for each event and variant.
here after getting analyzeVariants
I've got the Variantfrequency:
then I set the psi from the variant frequency as below:
then I set the variantNames as the row names for psi:
but then here is my problem, I have a single psi value for every event and each sample. But how to calculate the psi for controls and cases; to have one psi value for the control group, and one for the case group. consider:
controls are: N1, N2,N3,N4 cases are: T1,T2,T3,T4
Now, I have psi values for each event per sample, how to calculate to have one psi value per group (control and patient).
For this variant 79791_1_1/2_SE, my goal is to present something like below:
should I get the mean of individual psi values for N1, N2,N3,N4 samples and say the mean is the psi value for the control group? I am not sure if it is the right way.
we have the psi values for each event and each sample but at the end, the goal is to have one psi value for each variant per group, how do we calculate it?
Thank you in advance, Looking forward to hearing from you!