Question

Unexpected results while analyzing a subset of genes in GEO RNAseq dataset containing TPM values

0

Entering edit mode

atakanekiz ▴ 30

@atakanekiz-15874

Last seen 11 months ago

Turkey

Hello BioC community,

Let me set the stage for my question first. We generated a list of 200 genes that correlate with certain clinical parameters using human RNAseq data. I'd like to analyze the expression patterns of these genes in published GEO datasets. One of the GEO datasets I'm interested in analyzing involves RNAseq of sorted human PBMCs (GSE107011). There are 30 different types of immune cells in this dataset with multiple biological replicates (127 samples total). The data I downloaded using GEOquery package are in TPM format as reported by Kallisto (somehow I can't find the raw data). I can generate heatmaps like this to visualize the gene expression patterns (samples in rows, genes in columns):

I would like to analyze the significantly different genes between data subsets and generate a summary table. I used limma package before for routine DE analyses, but I'm trying to develop more of a targeted analysis approach to investigate the ~200 genes (I don't care much about the other genes in the dataset for now). I performed the analysis, but the results I'm getting do not make sense to me.

These are the steps I performed, please let me know if my thinking is erroneous at any point.

Prepare expression data and sample metadata

# Read data
dat <- read.delim(gzfile("GSE107011/GSE107011_Processed_data_TPM.txt.gz"), check.names = F)

dat[1:5,1:5]
##                     DZQV_CD8_naive DZQV_CD8_CM DZQV_CD8_EM DZQV_CD8_TE
## 1 ENSG00000223972.5       0.214321     0.00000  0.00421414  0.00953342
## 2 ENSG00000227232.5       4.759630     3.49280  2.64877000  3.72232000
## 3 ENSG00000278267.1       0.000000     2.61518  2.45976000  0.00000000
## 4 ENSG00000243485.5       0.000000     0.00000  0.00000000  0.00000000
## 5 ENSG00000284332.1       0.000000     0.00000  0.00000000  0.00000000

# Do some processing including:
### Averaging counts from multiple transcripts matching the same gene
### Log2 transforming the linear TPM values
### Make a ggplot-friendly composite dataframe that contains info from sample metadata and gene expression (will be useful later on for graphing)

expdat[1:5,1:5]
##           sample                     cell_type gender  DNAJC11   CDK11A
## 1 DZQV_CD8_naive             Naive_CD8_T_cells      M 5.609300 5.799777
## 2    DZQV_CD8_CM     Central_memory_CD8_T_cell      M 5.547006 6.204563
## 3    DZQV_CD8_EM   Effector_memory_CD8_T_cells      M 5.778402 6.187289
## 4    DZQV_CD8_TE Terminal_effector_CD8_T_cells      M 5.186984 6.465826
## 5      DZQV_MAIT                    MAIT_cells      M 5.251571 6.152436

Subset the dataset to contain the genes of interest

I'm thinking that this will reduce the number of multiple comparisons (by discarding uninteresting genes), giving the analysis more power to detect differential expression.

# genes object is a character vector containing the names of genes-of-interest
genes[1:5]
## "PPP3CB_AS1"  "IL10RB_AS1"  "PPP1R26_AS1" "RRN3P2"      "AC099343.3" 

# Subset the data frame to contain info from genes and cell_type (group variable from sample metadata)
expsubset <- expdat[, colnames(expdat) %in% c("cell_type", genes)]

dim(expsubset)
## 127 200

# Groups in comparison
groups <- expsubset$cell_type

groups[1:5]
## [1] Naive_CD8_T_cells           Central_memory_CD8_T_cell   Effector_memory_CD8_T_cells
## 30 Levels: Central_memory_CD8_T_cell Classical_monocytes ... Vd2_gd_T_cells

# Design for DE analyses
design <- model.matrix(~0 + groups)
colnames(design) <- gsub("^groups", "", colnames(design))

# Extract expression data from data subset (discard cell_type column) and transform so samples are in columns
de_df <- as.data.frame(t(expsubset[, colnames(expsubset) != "cell_type"]))

rownames(de_df)[1:5]
## [1] "LINC00337"  "PIK3CD_AS2" "MATN1_AS1"  "LINC01778"  "AL353622.1"
colnames(de_df)[1:5]
## [1] "V1" "V2" "V3" "V4" "V5"

# Reassign column names. There will be duplicate column names because of biological replicates
colnames(de_df) <- expsubset$cell_type

colnames(de_df)[1:4]
## [1] "Naive_CD8_T_cells"             "Central_memory_CD8_T_cell"     "Effector_memory_CD8_T_cells"   "Terminal_effector_CD8_T_cells"

Limma analysis

# Create contrasts matrix
cont.matrix <- makeContrasts(Th1_vs_Nai4 = Th1_cells- Naive_CD4_T_cells,
                              Nai4_vs_Th1 = Naive_CD4_T_cells- Th1_cells, levels=colnames(design))  

# Output of cont matrix
cont.matrix
##                                Contrasts
##   Levels                          Th1_vs_Nai4 Nai4_vs_Th1
##   Central_memory_CD8_T_cell               0           0
## ...
##  Naive_CD4_T_cells                      -1           1
##  Naive_CD8_T_cells                       0           0
##  Natural_killer_cells                    0           0
##  Th1.Th17_cells                          0           0
##  Th1_cells                               1          -1
##  Th17_cells                              0           0
##  Th2_cells                               0           0
##  Vd2_gd_T_cells                          0           0             


# Linear fit
lmfit <- lmFit(de_df, design)
lmfit <- eBayes(lmfit, trend=T)

Tabulate results

topTable(lmfit, coef = which(colnames(cont.matrix2)=="Th1_vs_Nai4"), 
               p.value = 0.01, number = 4)
##               logFC  AveExpr        t      P.Value    adj.P.Val        B
## EBLN3P     6.960910 6.450265 48.99890 1.308402e-72 2.420544e-70 153.2290
## PSMB8_AS1  7.144795 7.284830 44.25446 2.626655e-68 2.429656e-66 143.9306
## PCED1B_AS1 8.126577 6.757791 40.95170 4.656414e-65 2.871455e-63 136.8323
## ITGB2_AS1  6.614067 5.535769 39.23635 2.807513e-63 1.298475e-61 132.9173

topTable(lmfit, coef = which(colnames(cont.matrix2)=="Nai4_vs_Th1"), 
               p.value = 0.01, number = 4)
##              logFC  AveExpr        t      P.Value    adj.P.Val        B
## PSMB8_AS1 7.103732 7.284830 44.00011 4.588982e-68 8.489617e-66 143.4033
## EBLN3P    5.927469 6.450265 41.72436 7.720048e-66 7.141045e-64 138.5431
## ITGB2_AS1 5.778358 5.535769 34.27871 1.008373e-57 6.218298e-56 120.6026
## VIM_AS1   6.128814 5.022750 32.85078 5.375385e-56 2.486116e-54 116.7481

The problem

The genes listed to be differentially expressed between two groups are pretty much the same regardless of which groups are compared or which direction I set up the comparisons (i.e. grp1-grp2 vs grp2-grp1). This doesn't make sense. For instance, take EBLN3P, which seems to be always "upregulated". When I plot the expression in a scatter plot, there is clearly no appreciable change between Th1 and Naive CD4 groups:

library(ggpubr)

ggdotchart(expdat, "cell_type", "EBLN3P", 
           color="gender", 
           sort="desc", add="segment", legend="top", rotate=T)

When I calculate the mean expression value, these "upregulated" genes are among the highest in the data frame:

meanvec <- sapply(expdat[, !colnames(expdat) %in% c("cell_type", "gender", "sample") & colnames(expdat) %in% lncs$gene_name], mean)

head(sort(meanvec, decreasing = T), 10)

## PSMB8_AS1 PCED1B_AS1     EBLN3P       HCP5 AL390728.6  LINC_PINT  ITGB2_AS1      
##   7.284830   6.757791   6.450265   6.195957   5.891494   5.762214   5.535769

What am I doing wrong here?

R limma TPM rnaseq differential expression • 2.3k views

ADD COMMENT • link updated 5.6 years ago by Steve Lianoglou ★ 13k • written 5.6 years ago by atakanekiz ▴ 30

score 3 · Accepted Answer · 2019-10-02

3

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 17 days ago

United States

I've got a few points / comments.

The first one I think will address the problem you are asking about directly. The biggest error here is that you have defined a contrast matrix, but you aren't using it. I'll take bits and pieces from your code, hopefully it makes sense.

cont.matrix <- makeContrasts(Th1_vs_Nai4 = Th1_cells- Naive_CD4_T_cells,
                             Nai4_vs_Th1 = Naive_CD4_T_cells- Th1_cells,
                             levels=colnames(design))
lmfit <- lmFit(de_df, design)
lmfit <- contrasts.fit(lmfit, cont.matrix)   # **you forgot this part**
lmfit <- eBayes(lmfit, trend = TRUE)

res <- topTable(lmfit, "Th1_vs_Nai4", n = Inf)

Testing the other contrast you have should just give you the same results as the first, but with the sign switched for the logFC.

Some other things you can consider:

I don't get what the "Averaging counts from multiple transcripts matching the same gene" step is in your pre-processing mojo. Looks like these reads are already rolled up into genes: I don't think you should be doing whatever it is else that you're doing here with this TPM data.
I'd suggest taking the raw data, and converting that right into an EList. Your $targets data.frame looks like it would just be the first three columns of your expdata data.frame you have already.
Shouldn't it rather be colnames(de_def) <- expdat$sample?
You might do better to include more of your data in the analysis, and only focus on the genes that you want in the final step. After I create the EList, I'd then only remove lowly expressed genes. I'd then do all the limma mojo on that and only subset down to genes you care about after your call to topTable. You can then readjust those p-values like so:
```
res.all <- topTable(lmfit, "Th1_vs_Nai4", n = Inf)
res <- res[rownames(res) %in% genes_I_care_about,]
res$adj.P.Value <- p.adjust(res$P.Value, "BH")
```
Having more data (genes) to work with is typically better for your eBayes step and should give you more (legitimate) power.

I think that's all I got for now ...

ADD COMMENT • link 5.6 years ago Steve Lianoglou ★ 13k

1

Entering edit mode

Re-inforcing point 4. All expressed genes should be input to eBayes, not just a few genes you are interested in. If there are only a few genes of interest, then subset the fitted model object input to topTable:

topTable(lmfit[mysubset,], coef="Th1_vs_Nai4")

Note also my comments on TPMs: https://support.bioconductor.org/p/98820/

ADD REPLY • link 5.6 years ago Gordon Smyth 52k

0

Entering edit mode

@Gordon, thanks so much for the input. I read that thread prior to my analysis but I will go back, and read it again with a fresh pair of eyes. I'd appreciate your insights on the follow-up questions I added in the comments.

Best, Atakan

ADD REPLY • link 5.6 years ago atakanekiz ▴ 30

0

Entering edit mode

Thanks for the insights! I got it to work now and it makes sense. I'd appreciate if you can help me understand two couple more points about this analysis (see below)

First to comment on your points:

1) The data contained transcript IDs which may correlate with different isoforms of the same host gene. I could have kept them as is and convert it to the human-readable gene IDs at the endpoint. I guess I was trying to avoid situations where two transcripts may have quite different expression patterns even though they are coming from the same gene. Although this is interesting, it was beyond the scope of my analysis and by putting these transcripts into the same bins, I wanted to look at gene-level changes as opposed to transcript level changes.

3) I agree it is more appropriate to name them that way. I was thinking to keep the group assignments as the column names to ensure correct ordering during my analyses. But I realize that the column naming isn't important as long as as the order of the groups and the data matrix correspond to each other. Am I right?

Now the questions...

1. About subsetting the dataset.

I had two lines of reasoning when I was subsetting my data:

i) the data is in TPM format (which is already normalized) so I don't need to use information from the entire dataset for calculating normalization factors etc. Therefore I thought that discarding the uninteresting genes will not impact the analysis.

ii) Since I know the gene list I want to examine and there are a lot more uninteresting genes than interesting ones, I thought removing these excess genes will shrink the multiple comparison corrections and will yield higher significance results. I realize this approach is more prone to false positives. Since I'm at an early exploratory phase of the project I wanted to see any trends however weak they may be.

This makes intuitive sense to me but your answers suggest that I may be wrong in one/both of these assumptions. Can you please help me understand why is it better to retain all the genes in the analysis in my case?

2. About the selection of low expression cutoff in TPM data

I know that the best approach is working with raw data, but sometimes it is not possible (especially for reanalysis of published datasets). When I used the filterByExpr function with no other arguments as indicated in vignette, most of the genes (98%) were discarded leaving me with 1300 or so genes. Then I decided to In use min.count=1 argument for low expression filtering of TPM data and ~60% of the genes were retained (32052 genes). What is a good cutoff for this kind of prenormalized data?

Thanks!

ADD REPLY • link 5.6 years ago atakanekiz ▴ 30

1

Entering edit mode

The data contained transcript IDs

No, it doesn't. IDs starting with "ENSG" are Ensembl gene IDs. IDs starting with "ENST" would be Ensembl transcript IDs.

I thought that discarding the uninteresting genes will not impact the analysis.

It does impact the analysis. The eBayes function needs to estimate variance heterogeneity across the whole genome in order to determine how much empirical Bayes moderation to apply.

I thought removing these excess genes will shrink the multiple comparison corrections and will yield higher significance results.

Indeed it does. That is why I advised you to subset the data entered to topTable(), which is the function that does the multiple testing adjustment. Subsetting for other functions has no benefit.

I realize this approach is more prone to false positives.

No, actually it isn't.

I used the filterByExpr function

You can't. filterByExpr is for counts, not TPMs. It isn't my responsibility to tell you how to filter TPMs, since I don't recommend them in the first place. But you can get an idea of whether the filtering is effective by examining the mean-variance trend from plotSA(lmfit).

ADD REPLY • link 5.6 years ago Gordon Smyth 52k

0

Entering edit mode

It makes sense, thanks!

Just out of curiosity... When I skip the contrasts.fit step what is actually being calculated giving the same output when I call the topTable function? Since the reported "upregulated" genes are always the highly expressed ones, it feels like the genes are compared against the others in the data frame in a sample-agnostic manner?

ADD REPLY • link 5.6 years ago atakanekiz ▴ 30

0

Entering edit mode

No that is not what is happening. edgeR always gives you the DE list corresponding to the coefficient in the design matrix that you specified. In this case the coefficient is just the log-expression of a particular group, so you will get genes that are highly expressed in that group. That is why you must always use contrast.fit if you remove the intercept by using 0+ when you call model.matrix.

edgeR's behaviour is explained in the documentation.

ADD REPLY • link 5.6 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks for the insights!

ADD REPLY • link 5.6 years ago atakanekiz ▴ 30