Hello BioC community,
Let me set the stage for my question first. We generated a list of 200 genes that correlate with certain clinical parameters using human RNAseq data. I'd like to analyze the expression patterns of these genes in published GEO datasets. One of the GEO datasets I'm interested in analyzing involves RNAseq of sorted human PBMCs (GSE107011). There are 30 different types of immune cells in this dataset with multiple biological replicates (127 samples total). The data I downloaded using GEOquery
package are in TPM format as reported by Kallisto (somehow I can't find the raw data). I can generate heatmaps like this to visualize the gene expression patterns (samples in rows, genes in columns):
I would like to analyze the significantly different genes between data subsets and generate a summary table. I used limma
package before for routine DE analyses, but I'm trying to develop more of a targeted analysis approach to investigate the ~200 genes (I don't care much about the other genes in the dataset for now). I performed the analysis, but the results I'm getting do not make sense to me.
These are the steps I performed, please let me know if my thinking is erroneous at any point.
Prepare expression data and sample metadata
# Read data
dat <- read.delim(gzfile("GSE107011/GSE107011_Processed_data_TPM.txt.gz"), check.names = F)
dat[1:5,1:5]
## DZQV_CD8_naive DZQV_CD8_CM DZQV_CD8_EM DZQV_CD8_TE
## 1 ENSG00000223972.5 0.214321 0.00000 0.00421414 0.00953342
## 2 ENSG00000227232.5 4.759630 3.49280 2.64877000 3.72232000
## 3 ENSG00000278267.1 0.000000 2.61518 2.45976000 0.00000000
## 4 ENSG00000243485.5 0.000000 0.00000 0.00000000 0.00000000
## 5 ENSG00000284332.1 0.000000 0.00000 0.00000000 0.00000000
# Do some processing including:
### Averaging counts from multiple transcripts matching the same gene
### Log2 transforming the linear TPM values
### Make a ggplot-friendly composite dataframe that contains info from sample metadata and gene expression (will be useful later on for graphing)
expdat[1:5,1:5]
## sample cell_type gender DNAJC11 CDK11A
## 1 DZQV_CD8_naive Naive_CD8_T_cells M 5.609300 5.799777
## 2 DZQV_CD8_CM Central_memory_CD8_T_cell M 5.547006 6.204563
## 3 DZQV_CD8_EM Effector_memory_CD8_T_cells M 5.778402 6.187289
## 4 DZQV_CD8_TE Terminal_effector_CD8_T_cells M 5.186984 6.465826
## 5 DZQV_MAIT MAIT_cells M 5.251571 6.152436
Subset the dataset to contain the genes of interest
I'm thinking that this will reduce the number of multiple comparisons (by discarding uninteresting genes), giving the analysis more power to detect differential expression.
# genes object is a character vector containing the names of genes-of-interest
genes[1:5]
## "PPP3CB_AS1" "IL10RB_AS1" "PPP1R26_AS1" "RRN3P2" "AC099343.3"
# Subset the data frame to contain info from genes and cell_type (group variable from sample metadata)
expsubset <- expdat[, colnames(expdat) %in% c("cell_type", genes)]
dim(expsubset)
## 127 200
# Groups in comparison
groups <- expsubset$cell_type
groups[1:5]
## [1] Naive_CD8_T_cells Central_memory_CD8_T_cell Effector_memory_CD8_T_cells
## 30 Levels: Central_memory_CD8_T_cell Classical_monocytes ... Vd2_gd_T_cells
# Design for DE analyses
design <- model.matrix(~0 + groups)
colnames(design) <- gsub("^groups", "", colnames(design))
# Extract expression data from data subset (discard cell_type column) and transform so samples are in columns
de_df <- as.data.frame(t(expsubset[, colnames(expsubset) != "cell_type"]))
rownames(de_df)[1:5]
## [1] "LINC00337" "PIK3CD_AS2" "MATN1_AS1" "LINC01778" "AL353622.1"
colnames(de_df)[1:5]
## [1] "V1" "V2" "V3" "V4" "V5"
# Reassign column names. There will be duplicate column names because of biological replicates
colnames(de_df) <- expsubset$cell_type
colnames(de_df)[1:4]
## [1] "Naive_CD8_T_cells" "Central_memory_CD8_T_cell" "Effector_memory_CD8_T_cells" "Terminal_effector_CD8_T_cells"
Limma analysis
# Create contrasts matrix
cont.matrix <- makeContrasts(Th1_vs_Nai4 = Th1_cells- Naive_CD4_T_cells,
Nai4_vs_Th1 = Naive_CD4_T_cells- Th1_cells, levels=colnames(design))
# Output of cont matrix
cont.matrix
## Contrasts
## Levels Th1_vs_Nai4 Nai4_vs_Th1
## Central_memory_CD8_T_cell 0 0
## ...
## Naive_CD4_T_cells -1 1
## Naive_CD8_T_cells 0 0
## Natural_killer_cells 0 0
## Th1.Th17_cells 0 0
## Th1_cells 1 -1
## Th17_cells 0 0
## Th2_cells 0 0
## Vd2_gd_T_cells 0 0
# Linear fit
lmfit <- lmFit(de_df, design)
lmfit <- eBayes(lmfit, trend=T)
Tabulate results
topTable(lmfit, coef = which(colnames(cont.matrix2)=="Th1_vs_Nai4"),
p.value = 0.01, number = 4)
## logFC AveExpr t P.Value adj.P.Val B
## EBLN3P 6.960910 6.450265 48.99890 1.308402e-72 2.420544e-70 153.2290
## PSMB8_AS1 7.144795 7.284830 44.25446 2.626655e-68 2.429656e-66 143.9306
## PCED1B_AS1 8.126577 6.757791 40.95170 4.656414e-65 2.871455e-63 136.8323
## ITGB2_AS1 6.614067 5.535769 39.23635 2.807513e-63 1.298475e-61 132.9173
topTable(lmfit, coef = which(colnames(cont.matrix2)=="Nai4_vs_Th1"),
p.value = 0.01, number = 4)
## logFC AveExpr t P.Value adj.P.Val B
## PSMB8_AS1 7.103732 7.284830 44.00011 4.588982e-68 8.489617e-66 143.4033
## EBLN3P 5.927469 6.450265 41.72436 7.720048e-66 7.141045e-64 138.5431
## ITGB2_AS1 5.778358 5.535769 34.27871 1.008373e-57 6.218298e-56 120.6026
## VIM_AS1 6.128814 5.022750 32.85078 5.375385e-56 2.486116e-54 116.7481
The problem
The genes listed to be differentially expressed between two groups are pretty much the same regardless of which groups are compared or which direction I set up the comparisons (i.e. grp1-grp2 vs grp2-grp1). This doesn't make sense. For instance, take EBLN3P
, which seems to be always "upregulated". When I plot the expression in a scatter plot, there is clearly no appreciable change between Th1
and Naive CD4
groups:
library(ggpubr)
ggdotchart(expdat, "cell_type", "EBLN3P",
color="gender",
sort="desc", add="segment", legend="top", rotate=T)
When I calculate the mean expression value, these "upregulated" genes are among the highest in the data frame:
meanvec <- sapply(expdat[, !colnames(expdat) %in% c("cell_type", "gender", "sample") & colnames(expdat) %in% lncs$gene_name], mean)
head(sort(meanvec, decreasing = T), 10)
## PSMB8_AS1 PCED1B_AS1 EBLN3P HCP5 AL390728.6 LINC_PINT ITGB2_AS1
## 7.284830 6.757791 6.450265 6.195957 5.891494 5.762214 5.535769
What am I doing wrong here?
Re-inforcing point 4. All expressed genes should be input to
eBayes
, not just a few genes you are interested in. If there are only a few genes of interest, then subset the fitted model object input to topTable:Note also my comments on TPMs: https://support.bioconductor.org/p/98820/
@Gordon, thanks so much for the input. I read that thread prior to my analysis but I will go back, and read it again with a fresh pair of eyes. I'd appreciate your insights on the follow-up questions I added in the comments.
Best, Atakan
Thanks for the insights! I got it to work now and it makes sense. I'd appreciate if you can help me understand two couple more points about this analysis (see below)
First to comment on your points:
1) The data contained transcript IDs which may correlate with different isoforms of the same host gene. I could have kept them as is and convert it to the human-readable gene IDs at the endpoint. I guess I was trying to avoid situations where two transcripts may have quite different expression patterns even though they are coming from the same gene. Although this is interesting, it was beyond the scope of my analysis and by putting these transcripts into the same bins, I wanted to look at gene-level changes as opposed to transcript level changes.
3) I agree it is more appropriate to name them that way. I was thinking to keep the group assignments as the column names to ensure correct ordering during my analyses. But I realize that the column naming isn't important as long as as the order of the groups and the data matrix correspond to each other. Am I right?
Now the questions...
1. About subsetting the dataset.
I had two lines of reasoning when I was subsetting my data:
i) the data is in TPM format (which is already normalized) so I don't need to use information from the entire dataset for calculating normalization factors etc. Therefore I thought that discarding the uninteresting genes will not impact the analysis.
ii) Since I know the gene list I want to examine and there are a lot more uninteresting genes than interesting ones, I thought removing these excess genes will shrink the multiple comparison corrections and will yield higher significance results. I realize this approach is more prone to false positives. Since I'm at an early exploratory phase of the project I wanted to see any trends however weak they may be.
This makes intuitive sense to me but your answers suggest that I may be wrong in one/both of these assumptions. Can you please help me understand why is it better to retain all the genes in the analysis in my case?
2. About the selection of low expression cutoff in TPM data
I know that the best approach is working with raw data, but sometimes it is not possible (especially for reanalysis of published datasets). When I used the
filterByExpr
function with no other arguments as indicated in vignette, most of the genes (98%) were discarded leaving me with 1300 or so genes. Then I decided to In usemin.count=1
argument for low expression filtering of TPM data and ~60% of the genes were retained (32052 genes). What is a good cutoff for this kind of prenormalized data?Thanks!
No, it doesn't. IDs starting with "ENSG" are Ensembl gene IDs. IDs starting with "ENST" would be Ensembl transcript IDs.
It does impact the analysis. The
eBayes
function needs to estimate variance heterogeneity across the whole genome in order to determine how much empirical Bayes moderation to apply.Indeed it does. That is why I advised you to subset the data entered to
topTable()
, which is the function that does the multiple testing adjustment. Subsetting for other functions has no benefit.No, actually it isn't.
You can't.
filterByExpr
is for counts, not TPMs. It isn't my responsibility to tell you how to filter TPMs, since I don't recommend them in the first place. But you can get an idea of whether the filtering is effective by examining the mean-variance trend fromplotSA(lmfit)
.It makes sense, thanks!
Just out of curiosity... When I skip the
contrasts.fit
step what is actually being calculated giving the same output when I call thetopTable
function? Since the reported "upregulated" genes are always the highly expressed ones, it feels like the genes are compared against the others in the data frame in a sample-agnostic manner?No that is not what is happening. edgeR always gives you the DE list corresponding to the coefficient in the design matrix that you specified. In this case the coefficient is just the log-expression of a particular group, so you will get genes that are highly expressed in that group. That is why you must always use
contrast.fit
if you remove the intercept by using0+
when you callmodel.matrix
.edgeR's behaviour is explained in the documentation.
Thanks for the insights!