Annotating HTA 2.0 genes
3
0
Entering edit mode
statistuta • 0
@statistuta-7023
Last seen 10.1 years ago
United States

Hello..I have a list of .CEL files from HTA 2.0 chips. I created an ExpressionSet object by reading them with the R package "oligo". This object has data on 70523 features (genes) and 48 samples. What is best way to annotate these genes so that I can create a "GenomicRanges" object with the following attributes: Chromosome, Transcript_start, Transcript end ?

microarray • 7.0k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 5 hours ago
United States
> library(pd.hta.2.0)

> load(paste0(path.package("pd.hta.2.0"), "/extdata/netaffxTranscript.rda"))

> annot <- pData(netaffxTranscript)

> annot <- annot[!annot$seqname %in% c("", NA, "NA"),]

> annot$strand[annot$strand == ""] <- "*"

> pd.hta.gr <- GRanges(annot$seqname, IRanges(annot$start, annot$stop), annot$strand)

EDIT: added names

> namespd.hta.gr) <- annot$transcriptclusterid
> pd.hta.gr
GRanges object with 67758 ranges and 0 metadata columns:
                                 seqnames           ranges strand
                                    <Rle>        <IRanges>  <Rle>
           TC01000001.hg.1           chr1 [ 11869,  14409]      +
           TC01000002.hg.1           chr1 [ 29554,  31109]      +
           TC01000003.hg.1           chr1 [ 69091,  70008]      +
           TC01000004.hg.1           chr1 [160446, 161525]      +
           TC01000005.hg.1           chr1 [317811, 328581]      +
                       ...            ...              ...    ...
  TCUn_gl000212000002.hg.0 chrUn_gl000212 [ 65184,  66597]      +
  TCUn_gl000219000001.hg.0 chrUn_gl000219 [ 45397,  57412]      +
  TCUn_gl000222000004.hg.0 chrUn_gl000222 [122791, 122814]      -
  TCUn_gl000223000002.hg.0 chrUn_gl000223 [   351,    377]      -
  TCUn_gl000229000001.hg.0 chrUn_gl000229 [ 16405,  16424]      +
  -------
  seqinfo: 60 sequences from an unspecified genome; no seqlengths

 

ADD COMMENT
0
Entering edit mode
statistuta • 0
@statistuta-7023
Last seen 10.1 years ago
United States

Sincere thanks for your reply!…But this gives the annotation for 67758 transcripts, while I have the expression data on 70523 transcripts. How can I get the annotation for the remaining transcripts?

ADD COMMENT
0
Entering edit mode

The short answer is that you can't. If you re-read the code in my previous answer, you will note this line:

> annot <- annot[!annot$seqname %in% c("", NA, "NA"),]

which removes all the data where the seqname (chromosome) is missing. It turns out there are 2995 probesets for which there is no chromosome in the annotation file. If you don't have a chromosome, how will you add it to the GRanges object? If we look a bit closer at these probesets, they don't have start/end information either:

> annot.noseqname <- annot[annot$seqname == "", 1:6]
> head(annot.noseqname)
           transcriptclusterid probesetid seqname strand start stop
2824546_st          2824546_st 2824546_st                    0    0
2824549_st          2824549_st 2824549_st                    0    0
2824551_st          2824551_st 2824551_st                    0    0
2824554_st          2824554_st 2824554_st                    0    0
2827992_st          2827992_st 2827992_st                    0    0
2827995_st          2827995_st 2827995_st                    0    0
> table(annot.noseqname$start)
   0
2995
> table(annot.noseqname$stop)
   0
2995

And if we go to netaffx to see what these probesets are supposed to measure, we either return no results, or we get redirected to some random page. Try querying for 2824546_st at netaffx, or better yet, one of these PSR probesets:

> tail(annot.noseqname)
transcriptclusterid                probesetid
PSR6_ssto_hap7002446.hg.1 PSR6_ssto_hap7002446.hg.1
PSR6_ssto_hap7002447.hg.1 PSR6_ssto_hap7002447.hg.1
PSR6_ssto_hap7002454.hg.1 PSR6_ssto_hap7002454.hg.1
PSR6_ssto_hap7002455.hg.1 PSR6_ssto_hap7002455.hg.1
PSR6_ssto_hap7002456.hg.1 PSR6_ssto_hap7002456.hg.1
PSR6_ssto_hap7002459.hg.1 PSR6_ssto_hap7002459.hg.1
                          seqname strand start stop
PSR6_ssto_hap7002446.hg.1                    0    0
PSR6_ssto_hap7002447.hg.1                    0    0
PSR6_ssto_hap7002454.hg.1                    0    0
PSR6_ssto_hap7002455.hg.1                    0    0
PSR6_ssto_hap7002456.hg.1                    0    0
PSR6_ssto_hap7002459.hg.1                    0    0

So long story short, you cannot get the position information for all of the probesets, because some of them have no such information.

 

ADD REPLY
0
Entering edit mode

The 2995 probesets in question are various types of controls, and are not considered main content for gene level analysis.

ADD REPLY
0
Entering edit mode
statistuta • 0
@statistuta-7023
Last seen 10.1 years ago
United States

Thank you very much!

ADD COMMENT
0
Entering edit mode

Hello James,

I have a similar situation. I have read all my 30 CEL file into R and Normalized them using Oligo package. How can I use pd.hta.2.0 package to annotate my data ? I have been searching for codes. From what I have seen some people have tried to build their on library but I am not sure if is that a necessary step to annotate my data?

 

thanks in advance.

Best,

Hoss

ADD REPLY
0
Entering edit mode
library(affycoretools)

eset <- annotateEset(eset, pd.hta.2.0)

OR

eset <- annotateEset(eset, hta20transcriptcluster.db)

If you want to use the annotation db package. See ?annotateEset for more information.

ADD REPLY

Login before adding your answer.

Traffic: 611 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6