Question

HTA 2.0 and Transcript level analysis

0

Entering edit mode

giroudpaul ▴ 40

@giroudpaul-10031

Last seen 5.0 years ago

France

Dear Bioconductor members

I am asked to analyze my lab's HTA2.0 data once again, but this time to find information on specific transcript variants.

Do you know if it is possible to annotate an Eset obtained via : data.rma = oligo::rma(data, target="probeset")? For what I understand, this apply RMA to the probeset, without summarizing in transcriptclusters. And can we perform DE analysis on transcripts ?

For instance, how do I know which probes will be of interest for this kind of transcript : https://www.ncbi.nlm.nih.gov/nuccore/XM_011529084.2

I read in an another thread (Use junction probesets (JUC) to detect alternative splicing on Affymetrix human HTA 2.0 arrays? that practically no one was interested in transcript analysis from microarray data, explaining the lack of tools to perform this task. Is it really so ? I find this quite surprising.

Thank you for your help,
Paul

hta2.0 transcripts • 3.1k views

ADD COMMENT • link 7.3 years ago giroudpaul ▴ 40

score 2 · Accepted Answer · 2017-08-21

2

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 2 days ago

United States

You can annotate at the probeset level. Just use the hta20probeset.db package, which contains whatever annotation Affy makes available.

As for performing DE analysis on transcripts, there are, so far as I know, no Bioconductor packages for this. There is Affy's TAC, which does something that they claim is useful, so you might take a look at that.

ADD COMMENT • link 7.3 years ago James W. MacDonald 67k

0

Entering edit mode

Hi again,

So I annotated my dataset just to have a look, and it seems that the hta20probeset.db does not really differentiate between transcripts as it maps all refseq transcripts for one gene to each probes corresponding to this gene.

For example:
hta.dat = rma(hta.dat, target="probeset")
ProbesTr <- select(hta20probeset.db, keys=keys(hta20probeset.db), columns = c("SYMBOL", "REFSEQ"))
ProbesTr[ProbesTr$PROBEID %in% "JUC17000798.hg.1",]

Returns :

                 PROBEID SYMBOL       REFSEQ
4984845 JUC17000798.hg.1   CD68 NM_001040059
4984846 JUC17000798.hg.1   CD68    NM_001251
4984847 JUC17000798.hg.1   CD68 NP_001035148
4984848 JUC17000798.hg.1   CD68    NP_001242

Corresponding to two transcript variants and two precursors
I suppose this is what makes hta.dat <- annotateEset(hta.dat, hta20probeset.db) fails (R crashes).

Is this normal ? Is there a trick that I missed ? Or is it simply because affymetrix did not give the correct annotations for its probes.

ADD REPLY • link 7.3 years ago giroudpaul ▴ 40

0

Entering edit mode

Ah, you know what? The way these annotation packages work is completely anathema to how this array is supposed to work. The ChipDb package only has one real table, which maps the probeset ID to the Entrez Gene ID, and any other lookups will be farmed out to the org.Hs.eg.db package.

Since all previous Affy arrays (with the notable exception of the Exon ST arrays) were based on gene level expression, this was fine. But for these arrays it doesn't really work. For example the JUC probe you mention is only intended to measure two of the four transcripts for CD68:

grep JUC17000798.hg.1 HTA-2_0.na35.hg19.probeset.csv | cut -d, -f 12 | sed 's|///|\n|g'

 uc002ghu.3 // --- // 100 // 0 // --- // 1
 uc002ghv.3 // --- // 100 // 0 // --- // 1
 NM_001040059 // --- // 100 // 0 // --- // 1
 NM_001251 // --- // 100 // 0 // --- // 1
 BC015557 // --- // 100 // 0 // --- // 1
 ENST00000250092 // --- // 100 // 0 // --- // 1
 ENST00000380498 // --- // 100 // 0 // --- // 1
 OTTHUMT00000226949 // --- // 100 // 0 // --- // 1

But when we make the annotation package we map the RefSeq ID to its Entrez Gene ID

> dbGetQuery(hta20probeset_dbconn(), "select * from probes where probe_id='JUC17000798.hg.1';")
          probe_id gene_id is_multiple
1 JUC17000798.hg.1     968           0

And any mapping from that probeset ID to anything else is based on a join to that table, so is by definition gene based, not transcript based.

So as they stand, the probeset.db packages are probably not that useful. OTOH, it's a fairly rare thing for people to want to do what you want to do, so I don't know if it's worth putting limited resources towards making a special ChipDb type that is transcript rather than gene centric.

ADD REPLY • link 7.3 years ago James W. MacDonald 67k

0

Entering edit mode

I should also mention that annotateEset works for me, so I cannot reproduce any crash. You also have access to all the annotation that Affy provides, which you can get using the getNetAffx function in oligo.

> z <- getNetAffx(dat)

> pData(z)["JUC17000798.hg.1","mrnaassignment"]
[1] "uc002ghu.3 // --- // 100 // 0 // --- // 1 /// uc002ghv.3 // --- // 100 // 0 // --- // 1 /// NM_001040059 // --- // 100 // 0 // --- // 1 /// NM_001251 // --- // 100 // 0 // --- // 1 /// BC015557 // --- // 100 // 0 // --- // 1 /// ENST00000250092 // --- // 100 // 0 // --- // 1 /// ENST00000380498 // --- // 100 // 0 // --- // 1 /// OTTHUMT00000226949 // --- // 100 // 0 // --- // 1"

Which you can parse to get all the transcripts that each probeset is supposed to interrogate.

But this turns out, for this gene, to be exactly what you already would get from the hta20probeset.db package:

> toget <- select(hta20probeset.db, "CD68", "PROBEID", "SYMBOL")$PROBEID
'select()' returned 1:many mapping between keys and columns
> anno <- pData(z)[toget,"mrnaassignment"]
> lapply(strsplit(anno, " /// "), function(x) grep("^NM_", sapply(strsplit(x, " // "), "[", 1), value = TRUE))
[[1]]
[1] "NM_001040059" "NM_001251"   

[[2]]
[1] "NM_001040059" "NM_001251"   

[[3]]
[1] "NM_001040059" "NM_001251"   

[[4]]
[1] "NM_001040059" "NM_001251"   

[[5]]
[1] "NM_001040059" "NM_001251"   

[[6]]
[1] "NM_001040059" "NM_001251"   

[[7]]
[1] "NM_001040059" "NM_001251"   

[[8]]
[1] "NM_001040059" "NM_001251"   

[[9]]
[1] "NM_001040059" "NM_001251"   

[[10]]
[1] "NM_001040059" "NM_001251"   

[[11]]
[1] "NM_001040059" "NM_001251"   

[[12]]
[1] "NM_001040059" "NM_001251"   

[[13]]
[1] "NM_001040059" "NM_001251"   

[[14]]
[1] "NM_001040059" "NM_001251"   

[[15]]
[1] "NM_001040059" "NM_001251"   

[[16]]
[1] "NM_001040059" "NM_001251"   

[[17]]
[1] "NM_001040059" "NM_001251"   

[[18]]
[1] "NM_001040059" "NM_001251"   

[[19]]
[1] "NM_001040059" "NM_001251"

So apparently all of the PSR and JUC probesets for this gene measure just two of the known transcripts? I don't have time to explore further, but this should give you some hints as to how you can proceed on your own.

ADD REPLY • link 7.3 years ago James W. MacDonald 67k

0

Entering edit mode

Thank you very much for these explanations. As I suspected, it had something to do with how the ChipDB package is built.

Thank you for your time,

Paul

ADD REPLY • link 7.3 years ago giroudpaul ▴ 40

0

Entering edit mode

I think you missed my point. At least for this particular probeset, we are supplying EXACTLY what Affy supplies, which is the same exact data for all of the probesets. So unless this is an outlier, it's not clear that Affy supply the required data to do any transcript work anyway.

ADD REPLY • link 7.3 years ago James W. MacDonald 67k

0

Entering edit mode

"The ChipDb package only has one real table, which maps the probeset ID to the Entrez Gene ID, and any other lookups will be farmed out to the org.Hs.eg.db package."

This is what I refered to when saying this. As you said, Affymetrix is not providing Transcript detailed information for the ChipDB package of it's HTA chips (but they sell you them telling that its great to look at transcripts). Well, anyway, I cannot do that (which I will explain my colleague). Only solution left is to look at Affymetrix TAC software.

ADD REPLY • link 7.3 years ago giroudpaul ▴ 40