Question

Annotating HTA 2.0 genes

0

Entering edit mode

statistuta • 0

@statistuta-7023

Last seen 10.1 years ago

United States

Hello..I have a list of .CEL files from HTA 2.0 chips. I created an ExpressionSet object by reading them with the R package "oligo". This object has data on 70523 features (genes) and 48 samples. What is best way to annotate these genes so that I can create a "GenomicRanges" object with the following attributes: Chromosome, Transcript_start, Transcript end ?

microarray • 7.0k views

ADD COMMENT • link 10.1 years ago statistuta • 0

score 0 · Answer 1 · 2014-11-11

> library(pd.hta.2.0)

> load(paste0(path.package("pd.hta.2.0"), "/extdata/netaffxTranscript.rda"))

> annot <- pData(netaffxTranscript)

> annot <- annot[!annot$seqname %in% c("", NA, "NA"),]

> annot$strand[annot$strand == ""] <- "*"

> pd.hta.gr <- GRanges(annot$seqname, IRanges(annot$start, annot$stop), annot$strand)

EDIT: added names

> namespd.hta.gr) <- annot$transcriptclusterid
> pd.hta.gr
GRanges object with 67758 ranges and 0 metadata columns:
                                 seqnames           ranges strand
                                    <Rle>        <IRanges>  <Rle>
           TC01000001.hg.1           chr1 [ 11869,  14409]      +
           TC01000002.hg.1           chr1 [ 29554,  31109]      +
           TC01000003.hg.1           chr1 [ 69091,  70008]      +
           TC01000004.hg.1           chr1 [160446, 161525]      +
           TC01000005.hg.1           chr1 [317811, 328581]      +
                       ...            ...              ...    ...
  TCUn_gl000212000002.hg.0 chrUn_gl000212 [ 65184,  66597]      +
  TCUn_gl000219000001.hg.0 chrUn_gl000219 [ 45397,  57412]      +
  TCUn_gl000222000004.hg.0 chrUn_gl000222 [122791, 122814]      -
  TCUn_gl000223000002.hg.0 chrUn_gl000223 [   351,    377]      -
  TCUn_gl000229000001.hg.0 chrUn_gl000229 [ 16405,  16424]      +
  -------
  seqinfo: 60 sequences from an unspecified genome; no seqlengths

score 0 · Answer 2 · 2014-11-12

0

Entering edit mode

statistuta • 0

@statistuta-7023

Last seen 10.1 years ago

United States

Sincere thanks for your reply!…But this gives the annotation for 67758 transcripts, while I have the expression data on 70523 transcripts. How can I get the annotation for the remaining transcripts?

ADD COMMENT • link 10.1 years ago statistuta • 0

0

Entering edit mode

The short answer is that you can't. If you re-read the code in my previous answer, you will note this line:

> annot <- annot[!annot$seqname %in% c("", NA, "NA"),]

which removes all the data where the seqname (chromosome) is missing. It turns out there are 2995 probesets for which there is no chromosome in the annotation file. If you don't have a chromosome, how will you add it to the GRanges object? If we look a bit closer at these probesets, they don't have start/end information either:

> annot.noseqname <- annot[annot$seqname == "", 1:6]
> head(annot.noseqname)
           transcriptclusterid probesetid seqname strand start stop
2824546_st          2824546_st 2824546_st                    0    0
2824549_st          2824549_st 2824549_st                    0    0
2824551_st          2824551_st 2824551_st                    0    0
2824554_st          2824554_st 2824554_st                    0    0
2827992_st          2827992_st 2827992_st                    0    0
2827995_st          2827995_st 2827995_st                    0    0
> table(annot.noseqname$start)
   0
2995
> table(annot.noseqname$stop)
   0
2995

And if we go to netaffx to see what these probesets are supposed to measure, we either return no results, or we get redirected to some random page. Try querying for 2824546_st at netaffx, or better yet, one of these PSR probesets:

> tail(annot.noseqname)
transcriptclusterid                probesetid
PSR6_ssto_hap7002446.hg.1 PSR6_ssto_hap7002446.hg.1
PSR6_ssto_hap7002447.hg.1 PSR6_ssto_hap7002447.hg.1
PSR6_ssto_hap7002454.hg.1 PSR6_ssto_hap7002454.hg.1
PSR6_ssto_hap7002455.hg.1 PSR6_ssto_hap7002455.hg.1
PSR6_ssto_hap7002456.hg.1 PSR6_ssto_hap7002456.hg.1
PSR6_ssto_hap7002459.hg.1 PSR6_ssto_hap7002459.hg.1
                          seqname strand start stop
PSR6_ssto_hap7002446.hg.1                    0    0
PSR6_ssto_hap7002447.hg.1                    0    0
PSR6_ssto_hap7002454.hg.1                    0    0
PSR6_ssto_hap7002455.hg.1                    0    0
PSR6_ssto_hap7002456.hg.1                    0    0
PSR6_ssto_hap7002459.hg.1                    0    0

So long story short, you cannot get the position information for all of the probesets, because some of them have no such information.

ADD REPLY • link 10.1 years ago James W. MacDonald 67k

0

Entering edit mode

The 2995 probesets in question are various types of controls, and are not considered main content for gene level analysis.

ADD REPLY • link 9.6 years ago Jason Hubbard ▴ 20

score 0 · Answer 3 · 2014-11-13

0

Entering edit mode

statistuta • 0

@statistuta-7023

Last seen 10.1 years ago

United States

Thank you very much!

ADD COMMENT • link 10.1 years ago statistuta • 0

0

Entering edit mode

Hello James,

I have a similar situation. I have read all my 30 CEL file into R and Normalized them using Oligo package. How can I use pd.hta.2.0 package to annotate my data ? I have been searching for codes. From what I have seen some people have tried to build their on library but I am not sure if is that a necessary step to annotate my data?

thanks in advance.

Best,

Hoss

ADD REPLY • link 7.8 years ago Seymoo • 0

0

Entering edit mode

library(affycoretools)

eset <- annotateEset(eset, pd.hta.2.0)

OR

eset <- annotateEset(eset, hta20transcriptcluster.db)

If you want to use the annotation db package. See ?annotateEset for more information.

ADD REPLY • link 7.8 years ago James W. MacDonald 67k