Entering edit mode
Polikepahad, Sumanth
▴
10
@polikepahad-sumanth-5111
Last seen 10.4 years ago
Hi,
I have analyzed my deep sequencing data with DESeq and successfully
generated a heatmap showing differentially expressed miRNAs by using
gplots heatmap.2. However, I want to put different colors to the fonts
of miRNA names that are shown on y-axis, depending on whether they are
up or down regulated. For example, I want to show the names of all
down-regulated miRNAs in blue and up-regulated in Red. How do I do
this? I guess I must use if else statement, but not sure if its the
right way to do it. I am very new to R and I really appreciate any
help.
Thanks in advance.
________________________________________
From: bioconductor-bounces@r-project.org [bioconductor-
bounces@r-project.org] On Behalf Of bioconductor-request@r-project.org
[bioconductor-request@r-project.org]
Sent: Friday, February 10, 2012 5:00 AM
To: bioconductor at r-project.org
Subject: Bioconductor Digest, Vol 108, Issue 10
Send Bioconductor mailing list submissions to
bioconductor at r-project.org
To subscribe or unsubscribe via the World Wide Web, visit
https://stat.ethz.ch/mailman/listinfo/bioconductor
or, via email, send a message with subject or body 'help' to
bioconductor-request at r-project.org
You can reach the person managing the list at
bioconductor-owner at r-project.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioconductor digest..."
Today's Topics:
1. Re: Limma: questions about data pre-processing (Vladimir
Krasikov)
2. Re: DESeq and transcript-wise analysis (Nicolas Delhomme)
3. Re: Error message with RNAither (Martin Morgan)
4. GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou)
5. February-March 2012 ***R/S-PLUS Courses***by XLSolutions Corp
at 9 USA locations (sue at xlsolutions-corp.com)
6. Re: DESeq and transcript-wise analysis (Thomas Girke)
7. Re: DESeq and transcript-wise analysis (Nicolas Delhomme)
8. Re: Limma: questions about data pre-processing
(axel.klenk at actelion.com)
9. suggestions/comments on DESeq transcript wise analysis
(Akula, Nirmala (NIH/NIMH) [C])
10. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Tim Triche, Jr.)
11. Re: suggestions/comments on DESeq transcript wise analysis
(Abhishek Pratap)
12. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou)
13. Optimisation (Simon No?l)
14. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Tim Triche, Jr.)
15. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou)
16. how can i convert a gse object(expression sets) to affybatch
(Salwa Eid)
17. Import sequences from MacClade 4.* (Brian)
18. Re: DESeq and transcript-wise analysis (Elena Sorokin)
19. RE : maping SNPs (Simon No?l)
20. Re: RE : maping SNPs (Hervé Pagès)
21. RE : RE : maping SNPs (Simon No?l)
22. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Michael Lawrence)
23. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou)
24. Removing probes with low variance across samples (Infinium
450k) (khadeeja ismail)
25. Re: RE : RE : maping SNPs (Hervé Pagès)
26. Using write.table with output from topTags [was: report a
possible bug of edgeR] (Gordon K Smyth)
27. Re: DESeq and transcript-wise analysis (Ann Loraine)
28. Re: Removing probes with low variance across samples
(Infinium 450k) (Tim Triche, Jr.)
29. Re: how can i convert a gse object(expression sets) to
affybatch (Sean Davis)
30. ConsensusClusterPlus extracting features for clusters
(somnath bandyopadhyay)
31. limma barcodeplot with 2 groups (Dario Strbenac)
32. bioMart (Bogdan Tanasa)
33. Inconsistent coefficient values (Richard Coulson [guest])
34. a problem in reading in cel files (Manuela Di Russo)
35. Re: NormqPCR and ReadqPCR (James Perkins)
----------------------------------------------------------------------
Message: 1
Date: Thu, 09 Feb 2012 13:47:59 +0100
From: "Vladimir Krasikov" <v.v.krasikov@gmail.com>
To: axel.klenk at actelion.com
Cc: bioconductor at r-project.org, bioconductor-bounces at
r-project.org
Subject: Re: [BioC] Limma: questions about data pre-processing
Message-ID: <op.v9ewh9grb6kgy0 at="" u028550.uva.nl="">
Content-Type: text/plain; charset=iso-8859-15; format=flowed;
delsp=yes
Dear Axel
Once again thanks...
Q2: The only thing I know now is that it
was rather new Agilent edition of March 2011,
however our company stripped away all information in files ( even
removed
all control spots).
Do you think there is still a way to make visualizations?
Q5: After reading Rquantile description I now see some rationale about
this normalization,
when all Red chanels contoined common reference (which is commercial
"universal human reference").
However, question remains, what kind of plots, metrics are useful to
judge
the results of normalizations?
On Tue, 07 Feb 2012 15:32:03 +0100, <axel.klenk at="" actelion.com="">
wrote:
> Dear Vladimir,
>
> I'll only answer or comment on some of your questions and leave
> the others for the true experts...
>
> Q2: yes, for example using package arrayQualityMetrics, if you know
> the array layout. FES output usually contains columns Col and Row
for
> spot coordinates but apparently your "service provider" has removed
> them. I could send you a coordinates <--> oligo mapping by email if
you
> can tell me your array type -- is it 1x44K, 4x44K or 4x44Kv2?
> Alternatively,
> you can try to find that information on Agilent's eArray web site:
> earray.chem.agilent.com
>
> Q5: for a common reference design, dye swaps are not required and
> I would not apply a loess normalization -- depending on what you
have
> hybridized as the common reference, the assumptions may not hold.
> As for the between-array normalization, Rquantile may also be an
> option for your design and boxplots and density plots may be used
> for judging the results.
>
> Cheers,
>
> - axel
>
>
> Axel Klenk
> Research Informatician
> Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil
/
> Switzerland
>
>
>
>
> From:
> "Vladimir Krasikov" <v.v.krasikov at="" gmail.com="">
> To:
> bioconductor at r-project.org
> Date:
> 07.02.2012 14:27
> Subject:
> [BioC] Limma: questions about data pre-processing
> Sent by:
> bioconductor-bounces at r-project.org
>
>
>
> Dear limma experts
>
> During creating the pipe-line for dissecting differential gene
expression
> in frame of limma,
> several questions have arose.
>
> Experiment:
> I have 62 two-color Agilent human arrays.
> The samples are from several human more or less related to each
other
> disorders and vary in age, sex, disease duration and diagnosis.
> Company that made hybridizations performed all hybs in one direction
(no
> dye-swaps),
> where all samples were in G channel and common Ref in R channel,
> and unfortunately provided us only excepts of Feature Extraction
> which contained info on G, Gb, R, Rb, and FNO (non-uniformity
outliers)
> and separate gene annotation table.
>
> I performed generic import of the data and assigned zero-weight to
the
> FNO
> spots:
> I analyzed density and MA-plots, box-plots of M-values, G and R
channels
> and box-plots of background intensities,
> and removed from experiment 1 array with aberrant raw G-channel
density.
> (I will discuss experiment description later, when come to the
linear
> model)
>
> Q1: Is there a rationale of down-weighting FNO (around 100-200 spots
per
> array) for background correction and further normalization?
> Q2: Is there way to make image representation of Agilent microarray
(for
> each channel and backgrounds)?
> In another words is there known 'layout' for human 44K Agilent?
>
> Next I corrected the background with:
>> RG.b <- backgroundCorrect(RG.raw, method="minimum", offset=50)
> (recommended method=normexp produced shifted curves for five arrays
after
> taking a look on density plots,
> and box-plots for separate G and R channels also look less uniform
as
> compared with 'minimum' method)
>
> Q3: I guess it is also possible to remove those 5 arrays from the
> experiment. Is it fair?
> Q4: What kind of reasoning should be used for the choice between
> background subtraction methods?
>
> Then performed standard loess within array normalization:
>> MA.loess <- normalizeWithinArrays(RG.b,
method="loess",bc.method="none")
>
> Q5: Do I need to perform between array normalization?
> How to judge which of the methods (non, scale, quantile,
Aquantile)
> is
> best for my experiment?
>
> For now I decide to stuck with background=minimum, within=loess, and
> between=is under the question
>
> Next I would like to ask questions about
> linear model of my experiment, but I will make it in a next help
request
>
> Thanks a lot in advance
>
> and finally
>> sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=Dutch_Netherlands.1252
LC_CTYPE=Dutch_Netherlands.1252
> [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
> [5] LC_TIME=Dutch_Netherlands.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] limma_3.10.2
>>
>
> With kind regards
> Vladimir
> --
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> The information of this email and in any file transmitted with it is
> strictly confidential and may be legally privileged.
> It is intended solely for the addressee. If you are not the intended
> recipient, any copying, distribution or any other use of this email
is
> prohibited and may be unlawful. In such case, you should please
notify
> the sender immediately and destroy this email.
> The content of this email is not legally binding unless confirmed by
> letter.
> Any views expressed in this message are those of the individual
sender,
> except where the message states otherwise and the sender is
authorised
> to state them to be the views of the sender's company. For further
> information about Actelion please see our website at
> http://www.actelion.com
>
--
------------------------------
Message: 2
Date: Thu, 9 Feb 2012 14:38:02 +0100
From: Nicolas Delhomme <delhomme@embl.de>
To: Abhishek Pratap <apratap at="" lbl.gov="">
Cc: bioconductor at r-project.org
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID: <36CC7B64-054F-4E9C-8CEB-190A1D857D5C at embl.de>
Content-Type: text/plain; charset=us-ascii
Dear Abhi,
If you could point me to some published strand specific data or let me
get an excerpt of yours, I could easily had strand-specificity in the
easyRNASeq package.
Thanks,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------
On 9 Feb 2012, at 00:41, Abhishek Pratap wrote:
> Hi Elena
>
> Good timing with me on this. I recently was contemplating the best
way
> to move forward for a similar analysis. HTSeq a python based
toolkit
> by Simon can help you do the counting. FYI : It can also take
strand
> info into account. If you dont have stranded data you could also
look
> at easyrnaseq package.
>
> So if you have an annotation file like gff/gtf with the isoform
> information you could then do the read counting at isoform or gene
> level based on which attribute of the gff file you select to do the
> counting. Check out
> http://www-huber.embl.de/users/anders/HTSeq/doc/count.html.
>
> Also you want to keep in mind that at isoform level you would be
> double counting the reads in exons which are shared in the isoforms
> which can bias your results to some extent. But as Wolfgang pointed
> out in a recent post if you use FDR, it should not matter a lost as
> the bias will be cancelled between denominator /numerator.
>
> You also might want to check the DEXSeq which can help infer
> differential expression from RNA-Seq exons which could then be
related
> back to genes/isoforms.
>
> Hope this helps and let us know about your progress. I would be
> interested in learning from your experience too.
>
> Cheers!
> -Abhi
>
> ----------------------------------
> Abhishek Pratap
> Bioinformatics Systems Analyst - 3
> DOE- Joint Genome Institute
> Lawrence Berkeley National Lab
>
>
>
>
> On Wed, Feb 8, 2012 at 3:26 PM, Elena Sorokin <sorokin at="" wisc.edu="">
wrote:
>> Greetings all,
>>
>> After re-reading related posts in the listserv archive, I still
didn't know
>> the exact answer to my question, so here goes. I'd like to use
DESeq to
>> measure differential isoform expression. Has Simon or anybody else
written a
>> script that will convert aligned reads (.bam/.sam file) into a
table of
>> isoform counts, suitable for input to DESEq - similar to what Simon
has done
>> at the gene-wise level, but instead for making a table of counts by
isoform?
>>
>> I would try to do this myself, but I'm a novice at programming.
Sorry if
>> this has been answered elsewhere... If so, please let me know the
link.
>>
>> Thanks,
>> Elena
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 3
Date: Thu, 09 Feb 2012 06:10:37 -0800
From: Martin Morgan <mtmorgan@fhcrc.org>
To: Catherine Garry <cagarry at="" tcd.ie="">
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Error message with RNAither
Message-ID: <4F33D3DD.7040303 at fhcrc.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
On 02/09/2012 02:54 AM, Catherine Garry wrote:
> Hi Dan,
>
> When i tried all of this, this is what i got:
>
>> source("http://www.bioconductor.org/biocLite.R")
> BiocInstaller version 1.2.1, ?biocLite for help
>> biocLite("RNAither")
> BioC_mirror: 'http://www.bioconductor.org'
> Using R version 2.14, BiocInstaller version 1.2.1.
> Installing package(s) 'RNAither'
> Installing package(s) into ?C:/Users/Catherine/Documents/R/win-
library/2.14?
> (as ?lib? is unspecified)
> trying URL '
> http://www.bioconductor.org/packages/2.9/bioc/bin/windows/contrib/2.
14/RNAither_2.2.0.zip
> '
> Content type 'application/zip' length 1414362 bytes (1.3 Mb)
> opened URL
> downloaded 1.3 Mb
> package ?RNAither? successfully unpacked and MD5 sums checked
> The downloaded packages are in
>
C:\Users\Catherine\AppData\Local\Temp\RtmpETRn43\downloaded_packages
> Warning message:
> 'boot' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'foreign' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'KernSmooth' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'Matrix' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'mgcv' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'nlme' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'rpart' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'survival' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
>> biocLite(character(), ask=FALSE)
> BioC_mirror: 'http://www.bioconductor.org'
> Using R version 2.14, BiocInstaller version 1.2.1.
> Warning message:
> 'boot' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'foreign' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'KernSmooth' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'Matrix' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'mgcv' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'nlme' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'rpart' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
> 'survival' cannot be updated, installed directory 'C:/Program
> Files/R/R-2.14.0/library' not writeable
>>
> Any ideas on how to recify this?
These are warnings, and can be ignored; you could update your R
installation to R-2.14.1 but that is not strictly necessary. The good
news is
> package ?RNAither? successfully unpacked and MD5 sums checked
do you still have the problems
>>> library("RNAither"). but i also get an error here:
>>>
>>> Loading required package: robustbase
>>> Error: package ?robustbase? could not be loaded
>>> In addition: Warning messages:
>>> 1: package ?AnnotationDbi? was built under R version 2.14.1
>>> 2: package ?SparseM? was built under R version 2.14.1
>>> 3: In library(pkg, character.only = TRUE, logical.return = TRUE,
lib.loc
? Martin
>
> On 8 February 2012 19:13, Dan Tenenbaum<dtenenba at="" fhcrc.org="">
wrote:
>
>> On Wed, Feb 8, 2012 at 10:58 AM, Catherine Garry<cagarry at="" tcd.ie="">
wrote:
>>> Hi,
>>>
>>> I keep getting an error message when i try to generate the dataset
file
>> in
>>> RNAither. This is the error i keep getting.
>>>
>>>> generateDatasetFile("DGS", "DGScreen in cells",
>>> + NA_character_, "RNAither_output_Rep1.txt", plateLayout1,
plateLayout2,
>>> + 8, 12, 1, emptyWells, poorWells, controlCoordsOutput,
>>> backgroundValOutput, cellnumOutput)
>>>
>>> Error: could not find function "generateDatasetFile"
>>> Can anyone help explain why this is happening?
>>> At the beginning, as usual, i add in:
>>>
>>> library("RNAither"). but i also get an error here:
>>>
>>> Loading required package: robustbase
>>> Error: package ?robustbase? could not be loaded
>>> In addition: Warning messages:
>>> 1: package ?AnnotationDbi? was built under R version 2.14.1
>>> 2: package ?SparseM? was built under R version 2.14.1
>>> 3: In library(pkg, character.only = TRUE, logical.return = TRUE,
lib.loc
>> =
>>> lib.loc) :
>>> there is no package called ?robustbase?
>>>
>>> I also used:
>>>
>>> source("http://www.bioconductor.org/biocLite.R")
>>> biocLite("RNAither")
>>>
>>> , to install all the files etc.
>>>
>>> Can anyone help me rectify whatever is happening?
>>
>> It would be helpful to see the output of sessionInfo() so we can
see
>> exactly what packages you have, and a little bit about your system
>> configuration.
>>
>> Without seeing it, I can guess that you have some out of date
>> packages, and you can fix this as follows:
>>
>> biocLite(character(), ask=FALSE)
>> That will update ALL Bioconductor and CRAN packages on your system
>> which are outdated.
>>
>> It would also be good to see the complete output of
>> biocLite("RNAither").
>>
>> Thanks,
>> Dan
>>
>>
>>>
>>> Thank you.
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
------------------------------
Message: 4
Date: Thu, 9 Feb 2012 10:55:29 -0500
From: Steve Lianoglou <mailinglist.honeypot@gmail.com>
To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
<caha9mcpberwps+eexrtqbo0mbhyaod1wljfgho9kxk6uewxecg at="" mail.gmail.com="">
Content-Type: text/plain; charset=ISO-8859-1
Hi,
I thought it would be handy to make a GenomicFeatures::TranscriptDb
from a gtf file and was somehow surprised to see that I couldn't find
such a function in GenomicFeatures.
I'm happy to whip up such a function, but wanted to check in to make
sure I'm not missing something like (1) you can already do it; or (2)
it's not as straightforward as I think; (3) maybe it's there and I'm
not looking hard enough.
Right now I just want to build it on the gff that the knew versions of
tophat build when you pass in a gtf/gff file of known transcripts (the
--G/--GTF cmd line arg), but I think it'd be handy overall.
I don't think gff/gtf's have any column for exon ordering, though, in
which case I'd just assume the exons are ordered linearly (bye bye
trans-splicing)).
Good idea? Bad idea? Already done?
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
?| Memorial Sloan-Kettering Cancer Center
?| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
------------------------------
Message: 5
Date: Thu, 9 Feb 2012 08:56:06 -0700
From: "sue@xlsolutions-corp.com" <sue@xlsolutions-corp.com>
To: <bioconductor at="" stat.math.ethz.ch="">
Subject: [BioC] February-March 2012 ***R/S-PLUS Courses***by
XLSolutions Corp at 9 USA locations
Message-ID:
<20120209085606.aa8924c5d28ca71e2a043bb294e795eb.a4e946a69b.wbe at
email04.secureserver.net>
Content-Type: text/plain; charset="utf-8"
Happy New Year !
XLSolutions February-March 2012 R/S-PLUS courses schedule is now
available
online at 9 USA cities for with 13 new courses: *** Suggest a future
course
date/city
(1) R-PLUS: A Point-and-Click Approach to R
(2) S-PLUS / R : Programming Essentials.
(3) R/S+ Fundamentals and Programming Techniques
(4) R/S-PLUS Functions by Example.
(5) S/R-PLUS Programming 3: Advanced Techniques and Efficiencies.
(6) R/S+ System: Advanced Programming.
(7) R/S-PLUS Graphics: Essentials.
(8) R/S-PLUS Graphics for SAS Users
(9) R/S-PLUS Graphical Techniques for Marketing Research.
(10) Multivariate Statistical Methods in R/S-PLUS: Practical Research
Applications
(11) Introduction to Applied Econometrics with R/S-PLUS
(12) Exploratory Analysis for Large and Complex Problems in R/S-PLUS
(13) Determining Power and Sample Size Using R/S-PLUS.
(14) R/S-PLUS: Data Preparation for Data Mining
(15) Data Cleaning Techniques in R/S-PLUS
(16) R/S-PLUS: Applied Clustering Techniques
More on website
http://www.xlsolutions-corp.com/rplus.asp
Ask for group discount and reserve your seat Now - Earlybird Rates.
Payment due after the class! Email Sue Turner: sue at
xlsolutions-corp.com
Phone: 206-686-1578
Please let us know if you and your colleagues are interested in this
class to take advantage of group discount. Register now to secure your
seat.
Cheers,
Elvis Miller, PhD
Manager Training.
XLSolutions Corporation
206 686 1578
www.xlsolutions-corp.com
elvis at xlsolutions-corp.com
------------------------------
Message: 6
Date: Thu, 9 Feb 2012 08:21:29 -0800
From: Thomas Girke <thomas.girke@ucr.edu>
To: Nicolas Delhomme <delhomme at="" embl.de="">
Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID: <20120209162129.GA4433 at Thomas-Girkes-MacBook-Pro.local>
Content-Type: text/plain; charset=us-ascii
This study contains some strand specific RNA-Seq data:
http://www.hubmed.org/display.cgi?uids=18423832
I would expect that most RNA-Seq experiments in the near future may be
performed in a strand-specific manner, since the strand information
carries a lot of biologically relevant information in this application
domain. Thus, adding analysis support for it is definitely not a waste
of time.
I have not used easyRNASeq yet, but I will certainly give it a try.
In my group we currently use for RNA-Seq analysis the following
components: Rsubread (or tophat) ->
rtracklayer/Rsamtools/GenomicRanges
-> DESeq/edgeR. This allows any type strand and non-strand specific
read
counts for exons, transcripts, genes, intergenic features, etc. A huge
advantage of this environment is its flexibility and broad application
spectrum for most applications domains in the NGS field, such as
SNP-Seq, ChiP-Seq, smallRNA-Seq, etc. For instance, our ChIP-Seq
analysis routines use most of these tools plus some peak callers.
Thomas
On Thu, Feb 09, 2012 at 01:38:02PM +0000, Nicolas Delhomme wrote:
> Dear Abhi,
>
> If you could point me to some published strand specific data or let
me get an excerpt of yours, I could easily had strand-specificity in
the easyRNASeq package.
>
> Thanks,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On 9 Feb 2012, at 00:41, Abhishek Pratap wrote:
>
> > Hi Elena
> >
> > Good timing with me on this. I recently was contemplating the best
way
> > to move forward for a similar analysis. HTSeq a python based
toolkit
> > by Simon can help you do the counting. FYI : It can also take
strand
> > info into account. If you dont have stranded data you could also
look
> > at easyrnaseq package.
> >
> > So if you have an annotation file like gff/gtf with the isoform
> > information you could then do the read counting at isoform or gene
> > level based on which attribute of the gff file you select to do
the
> > counting. Check out
> > http://www-huber.embl.de/users/anders/HTSeq/doc/count.html.
> >
> > Also you want to keep in mind that at isoform level you would be
> > double counting the reads in exons which are shared in the
isoforms
> > which can bias your results to some extent. But as Wolfgang
pointed
> > out in a recent post if you use FDR, it should not matter a lost
as
> > the bias will be cancelled between denominator /numerator.
> >
> > You also might want to check the DEXSeq which can help infer
> > differential expression from RNA-Seq exons which could then be
related
> > back to genes/isoforms.
> >
> > Hope this helps and let us know about your progress. I would be
> > interested in learning from your experience too.
> >
> > Cheers!
> > -Abhi
> >
> > ----------------------------------
> > Abhishek Pratap
> > Bioinformatics Systems Analyst - 3
> > DOE- Joint Genome Institute
> > Lawrence Berkeley National Lab
> >
> >
> >
> >
> > On Wed, Feb 8, 2012 at 3:26 PM, Elena Sorokin <sorokin at="" wisc.edu=""> wrote:
> >> Greetings all,
> >>
> >> After re-reading related posts in the listserv archive, I still
didn't know
> >> the exact answer to my question, so here goes. I'd like to use
DESeq to
> >> measure differential isoform expression. Has Simon or anybody
else written a
> >> script that will convert aligned reads (.bam/.sam file) into a
table of
> >> isoform counts, suitable for input to DESEq - similar to what
Simon has done
> >> at the gene-wise level, but instead for making a table of counts
by isoform?
> >>
> >> I would try to do this myself, but I'm a novice at programming.
Sorry if
> >> this has been answered elsewhere... If so, please let me know the
link.
> >>
> >> Thanks,
> >> Elena
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 7
Date: Thu, 9 Feb 2012 17:31:06 +0100
From: Nicolas Delhomme <delhomme@embl.de>
To: Thomas Girke <thomas.girke at="" ucr.edu="">
Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID: <1FFF4CFC-2C52-425D-80AC-56FF777BB562 at embl.de>
Content-Type: text/plain; charset=us-ascii
Hi Thomas,
On 9 Feb 2012, at 17:21, Thomas Girke wrote:
> This study contains some strand specific RNA-Seq data:
> http://www.hubmed.org/display.cgi?uids=18423832
>
Thanks for the pointer.
> I would expect that most RNA-Seq experiments in the near future may
be
> performed in a strand-specific manner, since the strand information
> carries a lot of biologically relevant information in this
application
> domain. Thus, adding analysis support for it is definitely not a
waste
> of time.
Clearly. I would have already done had I had the time.
>
> I have not used easyRNASeq yet, but I will certainly give it a try.
Let me know when you do.
>
> In my group we currently use for RNA-Seq analysis the following
> components: Rsubread (or tophat) ->
rtracklayer/Rsamtools/GenomicRanges
> -> DESeq/edgeR. This allows any type strand and non-strand specific
read
> counts for exons, transcripts, genes, intergenic features, etc. A
huge
> advantage of this environment is its flexibility and broad
application
> spectrum for most applications domains in the NGS field, such as
> SNP-Seq, ChiP-Seq, smallRNA-Seq, etc. For instance, our ChIP-Seq
> analysis routines use most of these tools plus some peak callers.
Exactly why I have been developing and using easyRNASeq as well.
Cheers,
Nico
>
> Thomas
>
> On Thu, Feb 09, 2012 at 01:38:02PM +0000, Nicolas Delhomme wrote:
>> Dear Abhi,
>>
>> If you could point me to some published strand specific data or let
me get an excerpt of yours, I could easily had strand-specificity in
the easyRNASeq package.
>>
>> Thanks,
>>
>> Nico
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Genome Biology Computational Support
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>>
>>
>>
>>
>>
>> On 9 Feb 2012, at 00:41, Abhishek Pratap wrote:
>>
>>> Hi Elena
>>>
>>> Good timing with me on this. I recently was contemplating the best
way
>>> to move forward for a similar analysis. HTSeq a python based
toolkit
>>> by Simon can help you do the counting. FYI : It can also take
strand
>>> info into account. If you dont have stranded data you could also
look
>>> at easyrnaseq package.
>>>
>>> So if you have an annotation file like gff/gtf with the isoform
>>> information you could then do the read counting at isoform or gene
>>> level based on which attribute of the gff file you select to do
the
>>> counting. Check out
>>> http://www-huber.embl.de/users/anders/HTSeq/doc/count.html.
>>>
>>> Also you want to keep in mind that at isoform level you would be
>>> double counting the reads in exons which are shared in the
isoforms
>>> which can bias your results to some extent. But as Wolfgang
pointed
>>> out in a recent post if you use FDR, it should not matter a lost
as
>>> the bias will be cancelled between denominator /numerator.
>>>
>>> You also might want to check the DEXSeq which can help infer
>>> differential expression from RNA-Seq exons which could then be
related
>>> back to genes/isoforms.
>>>
>>> Hope this helps and let us know about your progress. I would be
>>> interested in learning from your experience too.
>>>
>>> Cheers!
>>> -Abhi
>>>
>>> ----------------------------------
>>> Abhishek Pratap
>>> Bioinformatics Systems Analyst - 3
>>> DOE- Joint Genome Institute
>>> Lawrence Berkeley National Lab
>>>
>>>
>>>
>>>
>>> On Wed, Feb 8, 2012 at 3:26 PM, Elena Sorokin <sorokin at="" wisc.edu=""> wrote:
>>>> Greetings all,
>>>>
>>>> After re-reading related posts in the listserv archive, I still
didn't know
>>>> the exact answer to my question, so here goes. I'd like to use
DESeq to
>>>> measure differential isoform expression. Has Simon or anybody
else written a
>>>> script that will convert aligned reads (.bam/.sam file) into a
table of
>>>> isoform counts, suitable for input to DESEq - similar to what
Simon has done
>>>> at the gene-wise level, but instead for making a table of counts
by isoform?
>>>>
>>>> I would try to do this myself, but I'm a novice at programming.
Sorry if
>>>> this has been answered elsewhere... If so, please let me know the
link.
>>>>
>>>> Thanks,
>>>> Elena
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 8
Date: Thu, 9 Feb 2012 18:02:40 +0100
From: axel.klenk@actelion.com
To: "Vladimir Krasikov" <v.v.krasikov at="" gmail.com="">
Cc: bioconductor at r-project.org, bioconductor-bounces at
r-project.org
Subject: Re: [BioC] Limma: questions about data pre-processing
Message-ID:
<11537_1328807013_4F33FC65_11537_9013_1_OF849F3859.3DF5C743-ON
C125799F.005B5234-C125799F.005DB562 at actelion.com>
Content-Type: text/plain; charset="US-ASCII"
Dear Vladimir,
sorry for the late reply... I'll give it a try and hope some true
expert
will
correct me if it is nonsense... :-)
Q2: rather new in 2011 would mean *probably* 4x44Kv2... the type
should
be available in the file header if you still have one (?) or otherwise
one
could
guess it from the set of identifiers in column "ProbeName" if you
still
have
one... can you make one file available via web or ftp for a quick
look?
Visualization should still be feasible, with missing spots missing, of
course,
and in this case it's a pity the positive controls are missing...
IIRC, Agilent FES does produce these plots for their QC -- but I
suppose
your company did not include them?
Q5: ok, so same or similar common reference we are using... and to be
useful
it should give a reasonable signal for (almost) all probes on the
array
which is
the whole genome -- but only a proportion of these will be expressed
in
any
real biological sample which is why I think that a) MA plots will look
pretty
unusual for these arrays and b) LOESS normalization will seemingly fix
that
but actually distort your data.
As for the choice of normalization method, since all normalization
steps
bear
the risk of "normalizing" away the biological signal you're interested
in,
you
should do only as much as necessary, using the least stringent method
that
will produce proper diagnostic plots. For comparison between arrays,
density
and box plots would be appropriate. I realize this is probably too
general
to be
useful :-) maybe the literature referenced in limma's
?normalizeWithinArrays and
?normalizeBetweenArrays can be of any help?
Cheers,
- axel
Axel Klenk
Research Informatician
Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil /
Switzerland
From:
"Vladimir Krasikov" <v.v.krasikov at="" gmail.com="">
To:
axel.klenk at actelion.com
Cc:
bioconductor at r-project.org, bioconductor-bounces at r-project.org
Date:
09.02.2012 13:47
Subject:
Re: [BioC] Limma: questions about data pre-processing
Dear Axel
Once again thanks...
Q2: The only thing I know now is that it
was rather new Agilent edition of March 2011,
however our company stripped away all information in files ( even
removed
all control spots).
Do you think there is still a way to make visualizations?
Q5: After reading Rquantile description I now see some rationale about
this normalization,
when all Red chanels contoined common reference (which is commercial
"universal human reference").
However, question remains, what kind of plots, metrics are useful to
judge
the results of normalizations?
On Tue, 07 Feb 2012 15:32:03 +0100, <axel.klenk at="" actelion.com="">
wrote:
> Dear Vladimir,
>
> I'll only answer or comment on some of your questions and leave
> the others for the true experts...
>
> Q2: yes, for example using package arrayQualityMetrics, if you know
> the array layout. FES output usually contains columns Col and Row
for
> spot coordinates but apparently your "service provider" has removed
> them. I could send you a coordinates <--> oligo mapping by email if
you
> can tell me your array type -- is it 1x44K, 4x44K or 4x44Kv2?
> Alternatively,
> you can try to find that information on Agilent's eArray web site:
> earray.chem.agilent.com
>
> Q5: for a common reference design, dye swaps are not required and
> I would not apply a loess normalization -- depending on what you
have
> hybridized as the common reference, the assumptions may not hold.
> As for the between-array normalization, Rquantile may also be an
> option for your design and boxplots and density plots may be used
> for judging the results.
>
> Cheers,
>
> - axel
>
>
> Axel Klenk
> Research Informatician
> Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil
/
> Switzerland
>
>
>
>
> From:
> "Vladimir Krasikov" <v.v.krasikov at="" gmail.com="">
> To:
> bioconductor at r-project.org
> Date:
> 07.02.2012 14:27
> Subject:
> [BioC] Limma: questions about data pre-processing
> Sent by:
> bioconductor-bounces at r-project.org
>
>
>
> Dear limma experts
>
> During creating the pipe-line for dissecting differential gene
expression
> in frame of limma,
> several questions have arose.
>
> Experiment:
> I have 62 two-color Agilent human arrays.
> The samples are from several human more or less related to each
other
> disorders and vary in age, sex, disease duration and diagnosis.
> Company that made hybridizations performed all hybs in one direction
(no
> dye-swaps),
> where all samples were in G channel and common Ref in R channel,
> and unfortunately provided us only excepts of Feature Extraction
> which contained info on G, Gb, R, Rb, and FNO (non-uniformity
outliers)
> and separate gene annotation table.
>
> I performed generic import of the data and assigned zero-weight to
the
> FNO
> spots:
> I analyzed density and MA-plots, box-plots of M-values, G and R
channels
> and box-plots of background intensities,
> and removed from experiment 1 array with aberrant raw G-channel
density.
> (I will discuss experiment description later, when come to the
linear
> model)
>
> Q1: Is there a rationale of down-weighting FNO (around 100-200 spots
per
> array) for background correction and further normalization?
> Q2: Is there way to make image representation of Agilent microarray
(for
> each channel and backgrounds)?
> In another words is there known 'layout' for human 44K Agilent?
>
> Next I corrected the background with:
>> RG.b <- backgroundCorrect(RG.raw, method="minimum", offset=50)
> (recommended method=normexp produced shifted curves for five arrays
after
> taking a look on density plots,
> and box-plots for separate G and R channels also look less uniform
as
> compared with 'minimum' method)
>
> Q3: I guess it is also possible to remove those 5 arrays from the
> experiment. Is it fair?
> Q4: What kind of reasoning should be used for the choice between
> background subtraction methods?
>
> Then performed standard loess within array normalization:
>> MA.loess <- normalizeWithinArrays(RG.b,
method="loess",bc.method="none")
>
> Q5: Do I need to perform between array normalization?
> How to judge which of the methods (non, scale, quantile,
Aquantile)
> is
> best for my experiment?
>
> For now I decide to stuck with background=minimum, within=loess, and
> between=is under the question
>
> Next I would like to ask questions about
> linear model of my experiment, but I will make it in a next help
request
>
> Thanks a lot in advance
>
> and finally
>> sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=Dutch_Netherlands.1252
LC_CTYPE=Dutch_Netherlands.1252
> [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
> [5] LC_TIME=Dutch_Netherlands.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] limma_3.10.2
>>
>
> With kind regards
> Vladimir
> --
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> The information of this email and in any file transmitted with it is
> strictly confidential and may be legally privileged.
> It is intended solely for the addressee. If you are not the intended
> recipient, any copying, distribution or any other use of this email
is
> prohibited and may be unlawful. In such case, you should please
notify
> the sender immediately and destroy this email.
> The content of this email is not legally binding unless confirmed by
> letter.
> Any views expressed in this message are those of the individual
sender,
> except where the message states otherwise and the sender is
authorised
> to state them to be the views of the sender's company. For further
> information about Actelion please see our website at
> http://www.actelion.com
>
--
Using Opera's revolutionary email client: http://www.opera.com/mail/
The information of this email and in any file transmitted with it is
strictly confidential and may be legally privileged.
It is intended solely for the addressee. If you are not the intended
recipient, any copying, distribution or any other use of this email is
prohibited and may be unlawful. In such case, you should please notify
the sender immediately and destroy this email.
The content of this email is not legally binding unless confirmed by
letter.
Any views expressed in this message are those of the individual
sender, except where the message states otherwise and the sender is
authorised to state them to be the views of the sender's company. For
further information about Actelion please see our website at
http://www.actelion.com
------------------------------
Message: 9
Date: Thu, 9 Feb 2012 12:06:30 -0500
From: "Akula, Nirmala (NIH/NIMH) [C]" <akulan@mail.nih.gov>
To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: [BioC] suggestions/comments on DESeq transcript wise analysis
Message-ID:
<d56d6a723c413c4db8e608ed137ec6a20d46536cfc at="" nihmlbx12.nih.gov="">
Content-Type: text/plain
Based on the previous threads when using DESeq the reads should not be
double counted. I am following pipeline for RNA-seq analysis and would
like to know any suggestions/comments regarding the pipeline:
1. Mapping the reads using Tophat
2. Convert Tophat output.bam to Sam
3. Create bed file from Sam file
4. Use CoverageBed along with reference genome for counting the
reads
5. Sum count of reads from all exons in a transcript
6. DESeq to analyze the counts/transcript
Thank you very much
Nirmala
[[alternative HTML version deleted]]
------------------------------
Message: 10
Date: Thu, 9 Feb 2012 09:50:37 -0800
From: "Tim Triche, Jr." <tim.triche@gmail.com>
To: Steve Lianoglou <mailinglist.honeypot at="" gmail.com="">
Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
<cac+n9bxh2d_ywvcr=ofgo=kwa6ixpfeixxacbjesc77az1jzqg at="" mail.gmail.com="">
Content-Type: text/plain
Along these lines, I took Kasper Hansen's df2GR() function and tarted
it up
a bit, wrote a GR2df() function, and added some hooks to store the
entire
works in a trivial (4-table plus views) database schema. This stores
descriptions of what each track/range means, where it came from, and
when
it was added, not unlike biomaRt. (I had trouble finding "pickling"
functions for tracks and ranges so I rolled my own) It's not perfect
(because I don't yet understand how to "bolt on" the appropriate
BSgenome
so that out-of-memory sequence access is performed the way I want it
to)
but it works well enough that I have started migrating other types of
data
(segmented copynumber, gene-, splice-, exon-wise RNAseq, and BS-seq)
into
freeze-dried GRanges. If there is a better container schema in
GenomicFeatures::TranscriptDb, I'll switch to it today...
And since it's fresh in my mind, how does one automatically attach the
appropriate BSgenome to a GenomicRanges? It seems like it should be
trivial, and the documentation hints at it, but I haven't succeeded
yet in
doing this automatically. I store the assembly from which a given
GRanges
object's features were aligned, so if I can figure out the syntax,
it's
done.
thanks all,
--t
On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou <
mailinglist.honeypot at gmail.com> wrote:
> Hi,
>
> I thought it would be handy to make a GenomicFeatures::TranscriptDb
> from a gtf file and was somehow surprised to see that I couldn't
find
> such a function in GenomicFeatures.
>
> I'm happy to whip up such a function, but wanted to check in to make
> sure I'm not missing something like (1) you can already do it; or
(2)
> it's not as straightforward as I think; (3) maybe it's there and I'm
> not looking hard enough.
>
> Right now I just want to build it on the gff that the knew versions
of
> tophat build when you pass in a gtf/gff file of known transcripts
(the
> --G/--GTF cmd line arg), but I think it'd be handy overall.
>
> I don't think gff/gtf's have any column for exon ordering, though,
in
> which case I'd just assume the exons are ordered linearly (bye bye
> trans-splicing)).
>
> Good idea? Bad idea? Already done?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
*A model is a lie that helps you see the truth.*
*
*
Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
[[alternative HTML version deleted]]
------------------------------
Message: 11
Date: Thu, 9 Feb 2012 10:01:08 -0800
From: Abhishek Pratap <apratap@lbl.gov>
To: "Akula, Nirmala (NIH/NIMH) [C]" <akulan at="" mail.nih.gov="">
Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] suggestions/comments on DESeq transcript wise
analysis
Message-ID:
<ca+7hxbx4lgvrbkktcrgvvx4jr=xmsv7ovkoy1=hcuoby66gh9g at="" mail.gmail.com="">
Content-Type: text/plain; charset=ISO-8859-1
Unless you are counting reads on the unique disjoin set of exons from
the transcripts , double counting will be inevitable as most of the
transcripts for a gene will have overlapping exons.
You can also see Simon's post today on a different thread about the
same issue. He explains in more detail why one would want to stay with
gene level or exon level differential expression and relate it back to
gene/isoforms.
-Abhi
On Thu, Feb 9, 2012 at 9:06 AM, Akula, Nirmala (NIH/NIMH) [C]
<akulan at="" mail.nih.gov=""> wrote:
> Based on the previous threads when using DESeq the reads should not
be double counted. I am following pipeline for RNA-seq analysis and
would like to know any suggestions/comments regarding the pipeline:
>
>
> ?1. ?Mapping the reads using Tophat
> ?2. ?Convert Tophat output.bam to Sam
> ?3. ?Create bed file from Sam file
> ?4. ?Use CoverageBed along with reference genome for counting the
reads
> ?5. ?Sum count of reads from all exons in a transcript
> ?6. ?DESeq to analyze the counts/transcript
>
> Thank you very much
> Nirmala
>
>
>
>
> ? ? ? ?[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 12
Date: Thu, 9 Feb 2012 13:17:29 -0500
From: Steve Lianoglou <mailinglist.honeypot@gmail.com>
To: ttriche at usc.edu
Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
<caha9mcmhsph4cs6neutym2yo4dlqywg551wtk7+-nefhmb2sxg at="" mail.gmail.com="">
Content-Type: text/plain; charset=ISO-8859-1
Hi Tim,
A quick answer to one of your questions, specifically:
> And since it's fresh in my mind, how does one automatically attach
the
> appropriate BSgenome to a GenomicRanges? ?It seems like it should be
> trivial, and the documentation hints at it, but I haven't succeeded
yet in
> doing this automatically.
I have a generic method defined somewhere in one of my utility called
`getBsGenome` which loads the appropriate genome of choice. It's not
bullet proof, but is handy.
The genome is identified by its annotation/accession/build/release,
whatever you call it, where I mostly go by UCSC conveintions, ie.
'hg19', 'mm9', etc.
I use it like this
bsg <- getBsGenome('hg19')
And it will load the BSgenome.Hsapiens.UCSC.hg19 library and return
the `Hsapiens` object (that is also, now, in your workspace).
I define methods for different classes that I have that basically end
up calling down to the base function I pasted below.
Assuming you are on bioc2.9, a potential GenomicRanges impl of the
method would look like so (untested):
setMethod("getBsGenome", c(x="GenomicRanges"), function(x, ...) {
g <- unique(genome(x))
if (!is.character(g) || length(g) != 1L || nchar(g) < 1) {
stop("Expected a single, valid genome identifier in seqinfo slot")
}
getBsGenome(g, ...)
})
You would have to add more cases in the switch statement of the base
function when you want to expand your reportoire of organisms:
setMethod("getBsGenome", c(x="character"),
function(x, organism=NULL, anno.source='UCSC', ...) {
lib.name <- 'BSgenome.:organism:.:anno.source:.:genome:'
if (is.null(organism)) {
organism <- switch(substring(x, 1, 2),
hg='Hsapiens',
mm='Mmusculus',
sa='Scerevisiae',
dm='Dmelanogaster',
rn='Rnorvegicus',
ce='Celegans',
stop("Unknown genome", x, sep=" "))
}
lib.name <- gsub(':organism:', organism, lib.name)
lib.name <- gsub(':anno.source:', anno.source, lib.name)
lib.name <- gsub(':genome:', x, lib.name)
suppressPackageStartupMessages({
found <- requirelib.name, character.only=TRUE)
})
if (!found) {
stoplib.name, " package required.")
}
get(organism, pos=paste('package', lib.name, sep=":"))
})
HTH,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
?| Memorial Sloan-Kettering Cancer Center
?| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
------------------------------
Message: 13
Date: Thu, 9 Feb 2012 13:37:29 -0500
From: Simon No?l <simon.noel.2@ulaval.ca>
To: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: [BioC] Optimisation
Message-ID:
<1C970CA8E580A84394CE5CF42FEC0046E2CBDCC106 at EXCH-
MBX-E.ulaval.ca>
Content-Type: text/plain; charset="iso-8859-1"
Hi every one,
I have a big formula with some parameter that I want to minimise
with some
condition like sum of all of them must be one, in some condition,
one
of them must equal 0, etc.
I am trying to solve that with constrOptim but I got some
broblem... Like
how to have a parametter equal to 0 Is there any package who is
more
user friendly that you can suggest to me?
Simon No??l
CdeC
------------------------------
Message: 14
Date: Thu, 9 Feb 2012 10:55:11 -0800
From: "Tim Triche, Jr." <tim.triche@gmail.com>
To: Steve Lianoglou <mailinglist.honeypot at="" gmail.com="">
Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
<cac+n9bv69pjk_9b65nj9+cfpuhr_uk2bj8xuvzwewz37i8uyvq at="" mail.gmail.com="">
Content-Type: text/plain
I'm running bioc-devel (2.10) and R-devel r58079 from SVN, if that
makes
any difference...
Thanks much, will play with these and see what happens :-)
On Thu, Feb 9, 2012 at 10:17 AM, Steve Lianoglou <
mailinglist.honeypot at gmail.com> wrote:
> Hi Tim,
>
> A quick answer to one of your questions, specifically:
>
> > And since it's fresh in my mind, how does one automatically attach
the
> > appropriate BSgenome to a GenomicRanges? It seems like it should
be
> > trivial, and the documentation hints at it, but I haven't
succeeded yet
> in
> > doing this automatically.
>
> I have a generic method defined somewhere in one of my utility
called
> `getBsGenome` which loads the appropriate genome of choice. It's not
> bullet proof, but is handy.
>
> The genome is identified by its annotation/accession/build/release,
> whatever you call it, where I mostly go by UCSC conveintions, ie.
> 'hg19', 'mm9', etc.
>
> I use it like this
>
> bsg <- getBsGenome('hg19')
>
> And it will load the BSgenome.Hsapiens.UCSC.hg19 library and return
> the `Hsapiens` object (that is also, now, in your workspace).
>
> I define methods for different classes that I have that basically
end
> up calling down to the base function I pasted below.
>
> Assuming you are on bioc2.9, a potential GenomicRanges impl of the
> method would look like so (untested):
>
> setMethod("getBsGenome", c(x="GenomicRanges"), function(x, ...) {
> g <- unique(genome(x))
> if (!is.character(g) || length(g) != 1L || nchar(g) < 1) {
> stop("Expected a single, valid genome identifier in seqinfo
slot")
> }
> getBsGenome(g, ...)
> })
>
> You would have to add more cases in the switch statement of the base
> function when you want to expand your reportoire of organisms:
>
> setMethod("getBsGenome", c(x="character"),
> function(x, organism=NULL, anno.source='UCSC', ...) {
> lib.name <- 'BSgenome.:organism:.:anno.source:.:genome:'
> if (is.null(organism)) {
> organism <- switch(substring(x, 1, 2),
> hg='Hsapiens',
> mm='Mmusculus',
> sa='Scerevisiae',
> dm='Dmelanogaster',
> rn='Rnorvegicus',
> ce='Celegans',
> stop("Unknown genome", x, sep=" "))
> }
>
> lib.name <- gsub(':organism:', organism, lib.name)
> lib.name <- gsub(':anno.source:', anno.source, lib.name)
> lib.name <- gsub(':genome:', x, lib.name)
>
> suppressPackageStartupMessages({
> found <- requirelib.name, character.only=TRUE)
> })
>
> if (!found) {
> stoplib.name, " package required.")
> }
>
> get(organism, pos=paste('package', lib.name, sep=":"))
> })
>
> HTH,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
--
*A model is a lie that helps you see the truth.*
*
*
Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
[[alternative HTML version deleted]]
------------------------------
Message: 15
Date: Thu, 9 Feb 2012 13:59:30 -0500
From: Steve Lianoglou <mailinglist.honeypot@gmail.com>
To: ttriche at usc.edu
Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
<caha9mcot9d+n1rwzhmt_pslswagsioka98hhj-lntabyq5nh5w at="" mail.gmail.com="">
Content-Type: text/plain; charset=ISO-8859-1
On Thu, Feb 9, 2012 at 1:55 PM, Tim Triche, Jr. <tim.triche at="" gmail.com=""> wrote:
> I'm running bioc-devel (2.10) and R-devel r58079 from SVN, if that
makes any
> difference...
I should have said "running at least bioc2.9" .. I think it's only in
2.9 that GenomicRanges objects got the seqinfo slots and, therefore,
the `genome` function ... my release-history-trivia is a bit weak,
though :-)
FWIW, I'm typically on *-devel too ...
(^^ not worth much ...)
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
?| Memorial Sloan-Kettering Cancer Center
?| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
------------------------------
Message: 16
Date: Thu, 9 Feb 2012 19:35:57 +0000
From: Salwa Eid <salwaeid@hotmail.com>
To: <bioconductor at="" r-project.org="">
Subject: [BioC] how can i convert a gse object(expression sets) to
affybatch
Message-ID: <snt115-w5658aa61757753c7db59cacf7b0 at="" phx.gbl="">
Content-Type: text/plain
Dear all, I have read cel files from ncbi website using the getGEO
from GEOquery package. I want then to use the combineaffybatch from
the matchprobe package but the objects I have to pass is affybatch. I
need to convert the gse object or the expression sets to affybatch to
be able to use it. Any ideas? Regards,salwa
[[alternative HTML version deleted]]
------------------------------
Message: 17
Date: Thu, 09 Feb 2012 20:37:24 +0100
From: Brian <zenlines@gmail.com>
To: bioconductor at r-project.org
Subject: [BioC] Import sequences from MacClade 4.*
Message-ID: <4F342074.2040304 at gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hello List,
sorry if this is a stupid question, but I am returning some old
sequences that I have laying around, the program is a mac program
called
MacClade, but the sequence file looks like:
#NEXUS
[MacClade 4.03 registered to University]
BEGIN DATA;
DIMENSIONS NTAX=53 NCHAR=673;
FORMAT DATATYPE=DNA MISSING=? GAP=- INTERLEAVE ;
MATRIX
[ 10 20 30 40 50
60]
[ . . . . . .]
[Modal TGAACCTGCGGAAGGAAAATATTATTGAATATATTTTTTA]
AI1A TGAACCTGCGGAACGAAAATATTATTGAATATATTTTTTA [60]
...
Does this look familiar to anyone? Did I overlook some function in the
"seqinr" package? Before I write some function to get the sequences
out
for me.
Thanks for the help!
Cheers,
Brian
------------------------------
Message: 18
Date: Thu, 09 Feb 2012 13:40:49 -0600
From: Elena Sorokin <sorokin@wisc.edu>
To: bioconductor at r-project.org
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID: <4F342141.1040209 at wisc.edu>
Content-Type: text/plain; CHARSET=US-ASCII; format=flowed
Dear Simon, Abhi, and Martin,
Thanks for your replies. I just learned about DEXseq and am working
through the vignette. Importantly for me, I see that DEXseq can also
do
generalized linear models. I will compare that to my current isoform
analysis in DESeq with its GLM function, then compare both those to
previous results from Cuffdiff, where unfortunately I've only been
able
to do 2-sample comparisons.
Simon, I'm working with C. elegans, not with a vertebrate, and am
usually dealing with 1-2 isoforms per gene, though sometimes it can be
much more. I don't know if that still presents a huge problem for
DESeq,
but I figure it's worth trying anyway.
Thanks again for everyone's advice. It is much appreciated! Best,
Elena
------------------------------
Message: 19
Date: Thu, 9 Feb 2012 15:28:39 -0500
From: Simon No?l <simon.noel.2@ulaval.ca>
To: Hervé Pagès <hpages at="" fhcrc.org="">
Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: [BioC] RE : maping SNPs
Message-ID:
<1C970CA8E580A84394CE5CF42FEC0046E2CBDCC109 at EXCH-
MBX-E.ulaval.ca>
Content-Type: text/plain; charset="iso-8859-1"
Hello Mr. Pag?s,
At the begining of my master, you really helped me to map my SNPs to
their gene with the code bellow. As you remember, with your help we
changed it a little bit and we have got with R2.10 :
library("IRanges")
library("GenomicRanges")
library("GenomicFeatures")
#? changer si une version plus r?cente de la librairie est
t?l?charg?e.
library(SNPlocs.Hsapiens.dbSNP.20101109)
library("org.Hs.eg.db")
#Allocation de la m?moire sous windows
memory.limit(size = 4095)
#v?rification de la librairie SNPlocs.Hsapiens.dbSNP
getSNPcount()
ch22snps <- getSNPlocs("ch22")
ch22snps[1:5, ]
#Create a GRange objetc to use with GenomicRanges library
makeGRangesFromRefSNPids <- function(myids, verbose=FALSE)
{
ans_seqnames <- character(length(myids))
ans_seqnames[] <- "unknown"
ans_locs <- integer(length(myids))
for (seqname in names(getSNPcount())) {
if (verbose)
cat("Processing ", seqname, " SNPs ...\n", sep="")
locs <- getSNPlocs(seqname)
ids <- paste("rs", locs$RefSNP_id, sep="")
myrows <- match(myids, ids)
hit_idx <- !is.na(myrows)
ans_seqnames[hit_idx] <- seqname
ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]]
}
GRanges(seqnames=ans_seqnames,
IRanges(start=ans_locs, width=1),
RefSNP_id=myids)
}
#Test en utilisant les premi?res sondes du premier et second
chormosome
#myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542",
"rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176",
"rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230",
"rs12557436")
#ouverture du fichier pour aller chercher nos num?ros rs
rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE)
myids <- rs_SNPs[,1]
mysnps <- makeGRangesFromRefSNPids(myids)
mysnps # a GRanges object with 1 SNP per row
#create a TranscriptDb
txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
txdb
#extract the transcript locations together with their genes
tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
tx # a GRanges object with 1 transcript per row
#rename chromosome to fit USCS standard
seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
#v?rifier pour X/Y -> seqnames(mysnps) <- sub("chrX", "chrY",
seqnames(mysnps))
#mapping but not on a readable format
map <- as.matrix(findOverlaps(mysnps, tx))
#making the mapping readable
mapped_genes <- values(tx)$gene_id[map[, 2]]
mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]],
elementLengths(mapped_genes))
snp2gene <- unique(data.frame(SNPNAME=mapped_snps,
gene_id=unlist(mapped_genes)))
rownames(snp2gene) <- NULL
snp2gene
#snp2gene se travaille mal alors on le transf?re en matrice
snp2geneTmp = t(t(snp2gene))
#aller chercher les symboles correspondants ? nos gene id
symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound =
NA))
save.image(file = "map.RData")
And everything was working perfectly.
Now, I have done a lot of script to analyse my data on a lot of way
and I think it's time to update my old mapping. I have try the same
script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109)
for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some
error... Can you help me? The problems seems to start with "map <-
as.matrix(findOverlaps(mysnps, tx))" and the other error seems to
result from that problem.
sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] org.Hs.eg.db_2.6.4
[2] RSQLite_0.11.1
[3] DBI_0.2-5
[4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6
[5] GenomicFeatures_1.6.7
[6] AnnotationDbi_1.16.11
[7] Biobase_2.14.0
[8] GenomicRanges_1.6.7
[9] IRanges_1.12.6
loaded via a namespace (and not attached):
[1] biomaRt_2.10.0 Biostrings_2.22.0 BSgenome_1.22.0
RCurl_1.9-5
[5] rtracklayer_1.14.4 tools_2.14.1 XML_3.9-4
zlibbioc_1.0.0
> library("IRanges")
Attaching package: ?IRanges?
The following object(s) are masked from ?package:base?:
cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int,
pmin, pmin.int, rbind, rep.int, setdiff, table, union
> library("GenomicRanges")
> library("GenomicFeatures")
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material. To view, type
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")' and for packages 'citation("pkgname")'.
Attaching package: ?Biobase?
The following object(s) are masked from ?package:IRanges?:
updateObject
> #? changer si une version plus r?cente de la librairie est
t?l?charg?e.
> library(SNPlocs.Hsapiens.dbSNP.20110815)
> library("org.Hs.eg.db")
Loading required package: DBI
>
>
> #Allocation de la m?moire sous windows
> memory.limit(size = 4095)
[1] Inf
Warning message:
'memory.limit()' is Windows-specific
>
> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
> getSNPcount()
ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9 ch10
2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977
1403368 1544307
ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19 ch20
1542256 1521919 1104719 1031214 949642 1084538 917737 886293
732039 788556
ch21 ch22 chX chY chMT
468379 454939 920890 75108 942
> ch22snps <- getSNPlocs("ch22")
> ch22snps[1:5, ]
RefSNP_id alleles_as_ambig loc
1 56342815 K 16050353
2 149201999 Y 16050408
3 146752890 S 16050612
4 139377059 Y 16050678
5 143300205 R 16050822
>
> #########################? FAIRE CANOPUS###################
>
> #Create a GRange objetc to use with GenomicRanges library
> makeGRangesFromRefSNPids <- function(myids, verbose=FALSE)
+ {
+ ans_seqnames <- character(length(myids))
+ ans_seqnames[] <- "unknown"
+ ans_locs <- integer(length(myids))
+ for (seqname in names(getSNPcount())) {
+ if (verbose)
+ cat("Processing ", seqname, " SNPs ...\n", sep="")
+ locs <- getSNPlocs(seqname)
+ ids <- paste("rs", locs$RefSNP_id, sep="")
+ myrows <- match(myids, ids)
+ hit_idx <- !is.na(myrows)
+ ans_seqnames[hit_idx] <- seqname
+ ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]]
+ }
+ GRanges(seqnames=ans_seqnames,
+ IRanges(start=ans_locs, width=1),
+ RefSNP_id=myids)
+ }
>
>
> #Test en utilisant les premi?res sondes du premier et second
chormosome
> #myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542",
"rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176",
"rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230",
"rs12557436")
>
> #ouverture du fichier pour aller chercher nos num?ros rs
> rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE)
> myids <- rs_SNPs[,1]
>
> mysnps <- makeGRangesFromRefSNPids(myids)
> mysnps # a GRanges object with 1 SNP per row
GRanges with 348411 ranges and 1 elementMetadata value:
seqnames ranges strand | RefSNP_id
<rle> <iranges> <rle> | <factor>
[1] ch1 [2195117, 2195117] * | rs7547453
[2] ch1 [2291680, 2291680] * | rs2840542
[3] ch1 [3256108, 3256108] * | rs1999527
[4] ch1 [3577321, 3577321] * | rs4648545
[5] ch1 [4230463, 4230463] * | rs10915459
[6] ch1 [4404344, 4404344] * | rs16838750
[7] ch1 [4501911, 4501911] * | rs12128230
[8] ch1 [4535148, 4535148] * | rs7541288
[9] ch1 [4581230, 4581230] * | rs12039682
... ... ... ... ... ...
[348403] chX [154514047, 154514047] * | rs499428
[348404] chX [154514919, 154514919] * | rs507127
[348405] chX [154737376, 154737376] * | rs5940372
[348406] chX [154780283, 154780283] * | rs6642287
[348407] chX [154830377, 154830377] * | rs5983658
[348408] chX [154870197, 154870197] * | rs473772
[348409] chX [154892230, 154892230] * | rs553678
[348410] chX [154899846, 154899846] * | rs473491
[348411] chX [154929412, 154929412] * | rs557132
---
seqlengths:
ch1 ch10 ch11 ch12 ch13 ... ch8 ch9 chX
unknown
NA NA NA NA NA ... NA NA NA
NA
>
> #create a TranscriptDb
> txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
Download the refGene table ... OK
Download the refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TranscriptDb object ... OK
There were 50 or more warnings (use warnings() to see the first 50)
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| Genus and Species: Homo sapiens
| UCSC Table: refGene
| Resource URL: http://genome.ucsc.edu/
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 41677
| exon_nrow: 235596
| cds_nrow: 206518
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012)
| GenomicFeatures version at creation time: 1.6.7
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0
| package: GenomicFeatures
>
> #extract the transcript locations together with their genes
> tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx # a GRanges object with 1 transcript per row
GRanges with 41677 ranges and 3 elementMetadata values:
seqnames ranges strand | tx_id
tx_name
<rle> <iranges> <rle> | <integer>
<character>
[1] chr1 [ 11874, 14408] + | 1127
NR_046018
[2] chr1 [ 69091, 70008] + | 1128
NM_001005484
[3] chr1 [323892, 328581] + | 1130
NR_028327
[4] chr1 [323892, 328581] + | 1132
NR_028325
[5] chr1 [323892, 328581] + | 1133
NR_028322
[6] chr1 [367659, 368597] + | 1131
NM_001005221
[7] chr1 [367659, 368597] + | 1134
NM_001005224
[8] chr1 [367659, 368597] + | 1135
NM_001005277
[9] chr1 [763064, 789740] + | 198
NR_015368
... ... ... ... ... ...
...
[41669] chrY [27177050, 27198251] - | 20790
NM_004678
[41670] chrY [27177050, 27198251] - | 20793
NM_001002761
[41671] chrY [27177050, 27198251] - | 20794
NM_001002760
[41672] chrY [27209230, 27246039] - | 20791
NR_002177
[41673] chrY [27209230, 27246039] - | 20792
NR_002178
[41674] chrY [27209230, 27246039] - | 20795
NR_001525
[41675] chrY [27329790, 27330920] - | 20796
NR_002179
[41676] chrY [27329790, 27330920] - | 20797
NR_002180
[41677] chrY [27329790, 27330920] - | 20798
NR_001526
gene_id
<compressedcharacterlist>
[1] 100287102
[2] 79501
[3] 100133331
[4] 100132062
[5] 100132287
[6] 729759
[7] 26683
[8] 81399
[9] 643837
... ...
[41669] 9083
[41670] 442868
[41671] 442867
[41672] 474150
[41673] 474149
[41674] 114761
[41675] 474152
[41676] 474151
[41677] 252949
---
seqlengths:
chr1 chr2 ...
chr18_gl000207_random
249250621 243199373 ...
4262
>
> #rename chromosome to fit USCS standard
> seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
Error in `seqnames<-`(`*tmp*`, value = <s4 object="" of="" class="" "rle"="">) :
levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
> #v?rifier pour X/Y -> seqnames(mysnps) <- sub("chrX", "chrY",
seqnames(mysnps))
>
> #mapping but not on a readable format
> map <- as.matrix(findOverlaps(mysnps, tx))
Warning message:
In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:
- in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18,
ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX,
unknown
- in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8,
chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18,
chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5,
chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6,
chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random,
chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random,
chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212,
chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219,
chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217,
chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220,
chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214,
chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [...
truncated]
>
>
> #making the mapping readable
> mapped_genes <- values(tx)$gene_id[map[, 2]]
> mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]],
elementLengths(mapped_genes))
> snp2gene <- unique(data.frame(SNPNAME=mapped_snps,
gene_id=unlist(mapped_genes)))
> rownames(snp2gene) <- NULL
> snp2gene
[1] SNPNAME gene_id
<0 rows> (or 0-length row.names)
>
>
> #snp2gene se travaille mal alors on le transf?re en matrice
> snp2geneTmp = t(t(snp2gene))
>
> #aller chercher les symboles correspondants ? nos gene id
> symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound =
NA))
Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound =
NA)) :
error in evaluating the argument 'x' in selecting a method for
function 'unlist': Error in .checkKeysAreWellFormed(keys) :
keys must be supplied in a character vector with no NAs
>
>
> save.image(file = "map.RData")
>
Simon No?l
CdeC
________________________________________
De : Hervé Pagès [hpages at fhcrc.org]
Date d'envoi : 5 d?cembre 2010 23:43
? : Simon No?l
Cc : bioconductor at r-project.org
Objet : Re: [BioC] maping SNPs
Hi Simon,
On 12/03/2010 10:17 AM, Simon No?l wrote:
>
> Hi,
>
>
>
> I have a really big list of SNPs names like :
>
>
>
> SNPNAME
>
> rs7547453
>
> rs2840542
>
> rs1999527
>
> rs4648545
>
> rs10915459
>
> rs16838750
>
> rs12128230
>
> ...
>
>
>
> I woudlike to map them to their official gene symbol. What the
best way to
> procede?
Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings
from SNPs to genes and I don't think we have this kind of mappings
either in our collection of annotations (*.db packages).
But if your SNPs are Human then you can do the mapping yourself by
using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures
packages.
The latest SNPlocs.Hsapies.dbSNP.* package is
SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains
SNP locations relative to the GRCh37 genome:
> library(SNPlocs.Hsapiens.dbSNP.20101109)
> getSNPcount()
ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9
ch10
1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129
995075
1158707
ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19
ch20
1147722 1105364 815729 740129 657719 757926 641905 645646
520666
586708
ch21 ch22 chX chY chMT
338254 331060 529608 67438 624
Note that it doesn't contain *all* SNPs from dbSNP Build 132:
only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109
for the details).
> ch22snps <- getSNPlocs("ch22")
> ch22snps[1:5, ]
RefSNP_id alleles_as_ambig loc
1 56342815 K 16050353
2 7288968 S 16050994
3 6518357 M 16051107
4 7292503 R 16051209
5 6518368 Y 16051241
Note that the rs prefix has been dropped.
So here is how to proceed:
First you can use the following function to make a GRanges object from
your SNP ids:
makeGRangesFromRefSNPids <- function(myids)
{
ans_seqnames <- character(length(myids))
ans_seqnames[] <- "unknown"
ans_locs <- integer(length(myids))
for (seqname in names(getSNPcount())) {
locs <- getSNPlocs(seqname)
ids <- paste("rs", locs$RefSNP_id, sep="")
myrows <- match(myids, ids)
ans_seqnames[!is.na(myrows)] <- seqname
ans_locs[!is.na(myrows)] <- locs$loc[myrows]
}
GRanges(seqnames=ans_seqnames,
IRanges(start=ans_locs, width=1),
RefSNP_id=myids)
}
This takes between 3 and 5 minutes:
> myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
"rs10915459", "rs16838750", "rs12128230", "rs999999999")
> mysnps <- makeGRangesFromRefSNPids(myids)
> mysnps # a GRanges object with 1 SNP per row
GRanges with 8 ranges and 1 elementMetadata value
seqnames ranges strand | myids
<rle> <iranges> <rle> | <character>
[1] ch1 [2195117, 2195117] * | rs7547453
[2] ch1 [2291680, 2291680] * | rs2840542
[3] ch1 [3256108, 3256108] * | rs1999527
[4] ch1 [3577321, 3577321] * | rs4648545
[5] ch1 [4230463, 4230463] * | rs10915459
[6] ch1 [4404344, 4404344] * | rs16838750
[7] ch1 [4501911, 4501911] * | rs12128230
[8] unknown [ 0, 0] * | rs999999999
seqlengths
ch1 unknown
NA NA
The next step is to create a TranscriptDb object with
makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart()
from the GenomicFeatures package. This TranscriptDb object will
contain the transcript locations and their associated
genes extracted from the annotation source you choose.
For example, if you want to use RefSeq genes:
## Takes about 3 minutes:
> txdb <- makeTranscriptDbFromUCSC(genome="hg19",
tablename="refGene")
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| UCSC Table: refGene
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 37924
| exon_nrow: 230024
| cds_nrow: 204571
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010)
| GenomicFeatures version at creation time: 1.2.2
| RSQLite version at creation time: 0.9-4
| DBSCHEMAVERSION: 1.0
Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb
object: this means that later you will be able to use the org.Hs.eg.db
package to map your gene ids to their symbol (the org.*.eg.db packages
are Entrez Gene ID centric).
To extract the transcript locations together with their genes:
> tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx # a GRanges object with 1 transcript per row
GRanges with 37924 ranges and 1 elementMetadata value
seqnames ranges strand |
gene_id
<rle> <iranges> <rle> |
<compressedcharacterlist>
[1] chr1 [ 69091, 70008] + |
79501
[2] chr1 [323892, 328581] + |
100133331
[3] chr1 [323892, 328581] + |
100132287
[4] chr1 [323892, 328581] + |
100132062
[5] chr1 [367659, 368597] + |
81399
[6] chr1 [367659, 368597] + |
729759
[7] chr1 [367659, 368597] + |
26683
[8] chr1 [763064, 789740] + |
643837
[9] chr1 [861121, 879961] + |
148398
... ... ... ... ...
...
[37916] chrY [27177050, 27198251] - |
9083
[37917] chrY [27177050, 27198251] - |
442867
[37918] chrY [27177050, 27198251] - |
442868
[37919] chrY [27209230, 27246039] - |
114761
[37920] chrY [27209230, 27246039] - |
474150
[37921] chrY [27209230, 27246039] - |
474149
[37922] chrY [27329790, 27330920] - |
252949
[37923] chrY [27329790, 27330920] - |
474152
[37924] chrY [27329790, 27330920] - |
474151
seqlengths
chr1 chr2 ...
chr18_gl000207_random
249250621 243199373 ...
4262
Now you can use findOverlaps() on 'mysnps' and 'tx' to find
the transcripts hits by your snps. But before you can do this,
you need to rename the sequences in 'mysnps' because dbSNPs and
UCSC use different naming conventions for the chromosomes:
> seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
Then:
> map <- as.matrix(findOverlaps(mysnps, tx))
'map' contains the mapping between your SNPs and their genes but not
in a readable form (this matrix contains indices) so we make the
'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for
the associated gene ids:
> mapped_genes <- values(tx)$gene_id[map[, 2]]
> mapped_snps <- rep.int(values(mysnps)$myids[map[, 1]],
elementLengths(mapped_genes))
> snp2gene <- unique(data.frame(snp_id=mapped_snps,
gene_id=unlist(mapped_genes)))
> rownames(snp2gene) <- NULL
> snp2gene[1:4, ]
snp_id gene_id
1 rs7547453 6497
2 rs2840542 79906
3 rs1999527 63976
4 rs4648545 7161
Note that there is no guarantee that the number of rows in this
data frame is the number of your original SNP ids because the
relation between SNP ids and gene ids is of course not one-to-one.
Also the method described above considers that a SNP hits a gene
if it's located between the start and the end of one of its
transcripts but it doesn't take in account the exon structure of
the transcripts. If you want to do this you need to use exonsBy()
(from GenomicFeatures) to extract the exons grouped by transcripts
(this will be stored in a GRangesList object) and use this object
instead of 'tx' in the call to findOverlaps().
Hope this helps,
H.
>
>
>
> Simon No??l
> CdeC
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
------------------------------
Message: 20
Date: Thu, 09 Feb 2012 13:34:27 -0800
From: Hervé Pagès <hpages@fhcrc.org>
To: Simon No?l <simon.noel.2 at="" ulaval.ca="">
Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] RE : maping SNPs
Message-ID: <4F343BE3.1040102 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hi Simon,
On 02/09/2012 12:28 PM, Simon No?l wrote:
> Hello Mr. Pag?s,
>
> At the begining of my master, you really helped me to map my SNPs to
their gene with the code bellow. As you remember, with your help we
changed it a little bit and we have got with R2.10 :
>
> library("IRanges")
> library("GenomicRanges")
> library("GenomicFeatures")
> #? changer si une version plus r?cente de la librairie est
t?l?charg?e.
> library(SNPlocs.Hsapiens.dbSNP.20101109)
> library("org.Hs.eg.db")
>
> #Allocation de la m?moire sous windows
> memory.limit(size = 4095)
> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
> getSNPcount()
> ch22snps<- getSNPlocs("ch22")
> ch22snps[1:5, ]
>
> #Create a GRange objetc to use with GenomicRanges library
> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE)
> {
> ans_seqnames<- character(length(myids))
> ans_seqnames[]<- "unknown"
> ans_locs<- integer(length(myids))
> for (seqname in names(getSNPcount())) {
> if (verbose)
> cat("Processing ", seqname, " SNPs ...\n", sep="")
> locs<- getSNPlocs(seqname)
> ids<- paste("rs", locs$RefSNP_id, sep="")
> myrows<- match(myids, ids)
> hit_idx<- !is.na(myrows)
> ans_seqnames[hit_idx]<- seqname
> ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]]
> }
> GRanges(seqnames=ans_seqnames,
> IRanges(start=ans_locs, width=1),
> RefSNP_id=myids)
> }
>
> #Test en utilisant les premi?res sondes du premier et second
chormosome
> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542",
"rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176",
"rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230",
"rs12557436")
> #ouverture du fichier pour aller chercher nos num?ros rs
> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE)
> myids<- rs_SNPs[,1]
> mysnps<- makeGRangesFromRefSNPids(myids)
> mysnps # a GRanges object with 1 SNP per row
> #create a TranscriptDb
> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> txdb
> #extract the transcript locations together with their genes
> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx # a GRanges object with 1 transcript per row
> #rename chromosome to fit USCS standard
> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
> #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY",
seqnames(mysnps))
> #mapping but not on a readable format
> map<- as.matrix(findOverlaps(mysnps, tx))
>
> #making the mapping readable
> mapped_genes<- values(tx)$gene_id[map[, 2]]
> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]],
elementLengths(mapped_genes))
> snp2gene<- unique(data.frame(SNPNAME=mapped_snps,
gene_id=unlist(mapped_genes)))
> rownames(snp2gene)<- NULL
> snp2gene
>
> #snp2gene se travaille mal alors on le transf?re en matrice
> snp2geneTmp = t(t(snp2gene))
> #aller chercher les symboles correspondants ? nos gene id
> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound =
NA))
>
> save.image(file = "map.RData")
>
>
>
>
>
> And everything was working perfectly.
>
> Now, I have done a lot of script to analyse my data on a lot of way
and I think it's time to update my old mapping. I have try the same
script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109)
for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some
error...
You've also upgraded from R-2.10/BioC-2.5 to R-2.14/BioC-2.9, which
means a lot of things could have changed. You should not assume that
your problems are caused only because you are using a more recent
SNPlocs package.
> Can you help me? The problems seems to start with "map<-
as.matrix(findOverlaps(mysnps, tx))"
The problem starts earlier with:
> seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
Error in `seqnames<-`(`*tmp*`, value = <s4 object="" of="" class="" "rle"="">) :
levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
The right way to do this with recent version of GenomicRanges is to
use
seqlevels() instead of seqnames().
> and the other error seems to result from that problem.
Failing to rename the seqlevels will surely cause you some troubles
later down so I would try to address this issue first see if that
solves
the other problems.
Also note that with recent versions of the SNPlocs packages (i.e.
version >= 0.99.6), you can use rsidsToGRanges() to do what your
home made makeGRangesFromRefSNPids() function does. The latter
is much faster BUT, unlike the former, it will fail if some of
the supplied rs ids are not found in the SNPlocs package (it will
issue an error showing the list of rs ids that could not be found).
I've already received some request to improve this so I'll try to
work on this soon.
Cheers,
H.
>
>
>
> sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: x86_64-pc-linux-gnu (64-bit)
> locale:
> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] org.Hs.eg.db_2.6.4
> [2] RSQLite_0.11.1
> [3] DBI_0.2-5
> [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6
> [5] GenomicFeatures_1.6.7
> [6] AnnotationDbi_1.16.11
> [7] Biobase_2.14.0
> [8] GenomicRanges_1.6.7
> [9] IRanges_1.12.6
> loaded via a namespace (and not attached):
> [1] biomaRt_2.10.0 Biostrings_2.22.0 BSgenome_1.22.0
RCurl_1.9-5
> [5] rtracklayer_1.14.4 tools_2.14.1 XML_3.9-4
zlibbioc_1.0.0
>
>> library("IRanges")
> Attaching package: ?IRanges?
> The following object(s) are masked from ?package:base?:
> cbind, eval, intersect, Map, mapply, order, paste, pmax,
pmax.int,
> pmin, pmin.int, rbind, rep.int, setdiff, table, union
>> library("GenomicRanges")
>> library("GenomicFeatures")
> Loading required package: AnnotationDbi
> Loading required package: Biobase
> Welcome to Bioconductor
> Vignettes contain introductory material. To view, type
> 'browseVignettes()'. To cite Bioconductor, see
> 'citation("Biobase")' and for packages 'citation("pkgname")'.
>
> Attaching package: ?Biobase?
> The following object(s) are masked from ?package:IRanges?:
> updateObject
>> #? changer si une version plus r?cente de la librairie est
t?l?charg?e.
>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>> library("org.Hs.eg.db")
> Loading required package: DBI
>>
>>
>> #Allocation de la m?moire sous windows
>> memory.limit(size = 4095)
> [1] Inf
> Warning message:
> 'memory.limit()' is Windows-specific
>>
>> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
>> getSNPcount()
> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9 ch10
> 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977
1403368 1544307
> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19 ch20
> 1542256 1521919 1104719 1031214 949642 1084538 917737 886293
732039 788556
> ch21 ch22 chX chY chMT
> 468379 454939 920890 75108 942
>> ch22snps<- getSNPlocs("ch22")
>> ch22snps[1:5, ]
> RefSNP_id alleles_as_ambig loc
> 1 56342815 K 16050353
> 2 149201999 Y 16050408
> 3 146752890 S 16050612
> 4 139377059 Y 16050678
> 5 143300205 R 16050822
>>
>> #########################? FAIRE CANOPUS###################
>>
>> #Create a GRange objetc to use with GenomicRanges library
>> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE)
> + {
> + ans_seqnames<- character(length(myids))
> + ans_seqnames[]<- "unknown"
> + ans_locs<- integer(length(myids))
> + for (seqname in names(getSNPcount())) {
> + if (verbose)
> + cat("Processing ", seqname, " SNPs ...\n", sep="")
> + locs<- getSNPlocs(seqname)
> + ids<- paste("rs", locs$RefSNP_id, sep="")
> + myrows<- match(myids, ids)
> + hit_idx<- !is.na(myrows)
> + ans_seqnames[hit_idx]<- seqname
> + ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]]
> + }
> + GRanges(seqnames=ans_seqnames,
> + IRanges(start=ans_locs, width=1),
> + RefSNP_id=myids)
> + }
>>
>>
>> #Test en utilisant les premi?res sondes du premier et second
chormosome
>> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542",
"rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176",
"rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230",
"rs12557436")
>>
>> #ouverture du fichier pour aller chercher nos num?ros rs
>> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE)
>> myids<- rs_SNPs[,1]
>>
>> mysnps<- makeGRangesFromRefSNPids(myids)
>> mysnps # a GRanges object with 1 SNP per row
> GRanges with 348411 ranges and 1 elementMetadata value:
> seqnames ranges strand | RefSNP_id
> <rle> <iranges> <rle> |<factor>
> [1] ch1 [2195117, 2195117] * | rs7547453
> [2] ch1 [2291680, 2291680] * | rs2840542
> [3] ch1 [3256108, 3256108] * | rs1999527
> [4] ch1 [3577321, 3577321] * | rs4648545
> [5] ch1 [4230463, 4230463] * | rs10915459
> [6] ch1 [4404344, 4404344] * | rs16838750
> [7] ch1 [4501911, 4501911] * | rs12128230
> [8] ch1 [4535148, 4535148] * | rs7541288
> [9] ch1 [4581230, 4581230] * | rs12039682
> ... ... ... ... ... ...
> [348403] chX [154514047, 154514047] * | rs499428
> [348404] chX [154514919, 154514919] * | rs507127
> [348405] chX [154737376, 154737376] * | rs5940372
> [348406] chX [154780283, 154780283] * | rs6642287
> [348407] chX [154830377, 154830377] * | rs5983658
> [348408] chX [154870197, 154870197] * | rs473772
> [348409] chX [154892230, 154892230] * | rs553678
> [348410] chX [154899846, 154899846] * | rs473491
> [348411] chX [154929412, 154929412] * | rs557132
> ---
> seqlengths:
> ch1 ch10 ch11 ch12 ch13 ... ch8 ch9
chX unknown
> NA NA NA NA NA ... NA NA
NA NA
>>
>> #create a TranscriptDb
>> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> Download the refGene table ... OK
> Download the refLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> There were 50 or more warnings (use warnings() to see the first 50)
>> txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | Genus and Species: Homo sapiens
> | UCSC Table: refGene
> | Resource URL: http://genome.ucsc.edu/
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 41677
> | exon_nrow: 235596
> | cds_nrow: 206518
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012)
> | GenomicFeatures version at creation time: 1.6.7
> | RSQLite version at creation time: 0.11.1
> | DBSCHEMAVERSION: 1.0
> | package: GenomicFeatures
>>
>> #extract the transcript locations together with their genes
>> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
>> tx # a GRanges object with 1 transcript per row
> GRanges with 41677 ranges and 3 elementMetadata values:
> seqnames ranges strand | tx_id
tx_name
> <rle> <iranges> <rle> |<integer>
<character>
> [1] chr1 [ 11874, 14408] + | 1127
NR_046018
> [2] chr1 [ 69091, 70008] + | 1128
NM_001005484
> [3] chr1 [323892, 328581] + | 1130
NR_028327
> [4] chr1 [323892, 328581] + | 1132
NR_028325
> [5] chr1 [323892, 328581] + | 1133
NR_028322
> [6] chr1 [367659, 368597] + | 1131
NM_001005221
> [7] chr1 [367659, 368597] + | 1134
NM_001005224
> [8] chr1 [367659, 368597] + | 1135
NM_001005277
> [9] chr1 [763064, 789740] + | 198
NR_015368
> ... ... ... ... ... ...
...
> [41669] chrY [27177050, 27198251] - | 20790
NM_004678
> [41670] chrY [27177050, 27198251] - | 20793
NM_001002761
> [41671] chrY [27177050, 27198251] - | 20794
NM_001002760
> [41672] chrY [27209230, 27246039] - | 20791
NR_002177
> [41673] chrY [27209230, 27246039] - | 20792
NR_002178
> [41674] chrY [27209230, 27246039] - | 20795
NR_001525
> [41675] chrY [27329790, 27330920] - | 20796
NR_002179
> [41676] chrY [27329790, 27330920] - | 20797
NR_002180
> [41677] chrY [27329790, 27330920] - | 20798
NR_001526
> gene_id
> <compressedcharacterlist>
> [1] 100287102
> [2] 79501
> [3] 100133331
> [4] 100132062
> [5] 100132287
> [6] 729759
> [7] 26683
> [8] 81399
> [9] 643837
> ... ...
> [41669] 9083
> [41670] 442868
> [41671] 442867
> [41672] 474150
> [41673] 474149
> [41674] 114761
> [41675] 474152
> [41676] 474151
> [41677] 252949
> ---
> seqlengths:
> chr1 chr2 ...
chr18_gl000207_random
> 249250621 243199373 ...
4262
>>
>> #rename chromosome to fit USCS standard
>> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
> Error in `seqnames<-`(`*tmp*`, value =<s4 object="" of="" class="" "rle"="">) :
> levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
>> #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY",
seqnames(mysnps))
>>
>> #mapping but not on a readable format
>> map<- as.matrix(findOverlaps(mysnps, tx))
> Warning message:
> In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels not in the
other:
> - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17,
ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9,
chX, unknown
> - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8,
chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18,
chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5,
chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6,
chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random,
chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random,
chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212,
chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219,
chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217,
chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220,
chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214,
chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [...
truncated]
>>
>>
>> #making the mapping readable
>> mapped_genes<- values(tx)$gene_id[map[, 2]]
>> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]],
elementLengths(mapped_genes))
>> snp2gene<- unique(data.frame(SNPNAME=mapped_snps,
gene_id=unlist(mapped_genes)))
>> rownames(snp2gene)<- NULL
>> snp2gene
> [1] SNPNAME gene_id
> <0 rows> (or 0-length row.names)
>>
>>
>> #snp2gene se travaille mal alors on le transf?re en matrice
>> snp2geneTmp = t(t(snp2gene))
>>
>> #aller chercher les symboles correspondants ? nos gene id
>> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound =
NA))
> Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound =
NA)) :
> error in evaluating the argument 'x' in selecting a method for
function 'unlist': Error in .checkKeysAreWellFormed(keys) :
> keys must be supplied in a character vector with no NAs
>>
>>
>> save.image(file = "map.RData")
>>
>
>
>
>
>
> Simon No?l
> CdeC
> ________________________________________
> De : Hervé Pagès [hpages at fhcrc.org]
> Date d'envoi : 5 d?cembre 2010 23:43
> ? : Simon No?l
> Cc : bioconductor at r-project.org
> Objet : Re: [BioC] maping SNPs
>
> Hi Simon,
>
> On 12/03/2010 10:17 AM, Simon No?l wrote:
>>
>> Hi,
>>
>>
>>
>> I have a really big list of SNPs names like :
>>
>>
>>
>> SNPNAME
>>
>> rs7547453
>>
>> rs2840542
>>
>> rs1999527
>>
>> rs4648545
>>
>> rs10915459
>>
>> rs16838750
>>
>> rs12128230
>>
>> ...
>>
>>
>>
>> I woudlike to map them to their official gene symbol. What
the best way to
>> procede?
>
> Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings
> from SNPs to genes and I don't think we have this kind of mappings
> either in our collection of annotations (*.db packages).
>
> But if your SNPs are Human then you can do the mapping yourself by
> using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures
> packages.
>
> The latest SNPlocs.Hsapies.dbSNP.* package is
> SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains
> SNP locations relative to the GRCh37 genome:
>
> > library(SNPlocs.Hsapiens.dbSNP.20101109)
> > getSNPcount()
> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9
> ch10
> 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129
995075
> 1158707
> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19
> ch20
> 1147722 1105364 815729 740129 657719 757926 641905 645646
520666
> 586708
> ch21 ch22 chX chY chMT
> 338254 331060 529608 67438 624
>
> Note that it doesn't contain *all* SNPs from dbSNP Build 132:
> only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109
> for the details).
>
> > ch22snps<- getSNPlocs("ch22")
> > ch22snps[1:5, ]
> RefSNP_id alleles_as_ambig loc
> 1 56342815 K 16050353
> 2 7288968 S 16050994
> 3 6518357 M 16051107
> 4 7292503 R 16051209
> 5 6518368 Y 16051241
>
> Note that the rs prefix has been dropped.
>
> So here is how to proceed:
>
> First you can use the following function to make a GRanges object
from
> your SNP ids:
>
> makeGRangesFromRefSNPids<- function(myids)
> {
> ans_seqnames<- character(length(myids))
> ans_seqnames[]<- "unknown"
> ans_locs<- integer(length(myids))
> for (seqname in names(getSNPcount())) {
> locs<- getSNPlocs(seqname)
> ids<- paste("rs", locs$RefSNP_id, sep="")
> myrows<- match(myids, ids)
> ans_seqnames[!is.na(myrows)]<- seqname
> ans_locs[!is.na(myrows)]<- locs$loc[myrows]
> }
> GRanges(seqnames=ans_seqnames,
> IRanges(start=ans_locs, width=1),
> RefSNP_id=myids)
> }
>
> This takes between 3 and 5 minutes:
>
> > myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
> "rs10915459", "rs16838750", "rs12128230",
"rs999999999")
> > mysnps<- makeGRangesFromRefSNPids(myids)
> > mysnps # a GRanges object with 1 SNP per row
> GRanges with 8 ranges and 1 elementMetadata value
> seqnames ranges strand | myids
> <rle> <iranges> <rle> |<character>
> [1] ch1 [2195117, 2195117] * | rs7547453
> [2] ch1 [2291680, 2291680] * | rs2840542
> [3] ch1 [3256108, 3256108] * | rs1999527
> [4] ch1 [3577321, 3577321] * | rs4648545
> [5] ch1 [4230463, 4230463] * | rs10915459
> [6] ch1 [4404344, 4404344] * | rs16838750
> [7] ch1 [4501911, 4501911] * | rs12128230
> [8] unknown [ 0, 0] * | rs999999999
>
> seqlengths
> ch1 unknown
> NA NA
>
> The next step is to create a TranscriptDb object with
> makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart()
> from the GenomicFeatures package. This TranscriptDb object will
> contain the transcript locations and their associated
> genes extracted from the annotation source you choose.
> For example, if you want to use RefSeq genes:
>
> ## Takes about 3 minutes:
> > txdb<- makeTranscriptDbFromUCSC(genome="hg19",
tablename="refGene")
> > txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | UCSC Table: refGene
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 37924
> | exon_nrow: 230024
> | cds_nrow: 204571
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010)
> | GenomicFeatures version at creation time: 1.2.2
> | RSQLite version at creation time: 0.9-4
> | DBSCHEMAVERSION: 1.0
>
> Note the type of gene IDs (Entrez Gene ID) stored in this
TranscriptDb
> object: this means that later you will be able to use the
org.Hs.eg.db
> package to map your gene ids to their symbol (the org.*.eg.db
packages
> are Entrez Gene ID centric).
>
> To extract the transcript locations together with their genes:
>
> > tx<- transcripts(txdb, columns=c("tx_id", "tx_name",
"gene_id"))
> > tx # a GRanges object with 1 transcript per row
> GRanges with 37924 ranges and 1 elementMetadata value
> seqnames ranges strand |
gene_id
> <rle> <iranges> <rle>
|<compressedcharacterlist>
> [1] chr1 [ 69091, 70008] + |
79501
> [2] chr1 [323892, 328581] + |
100133331
> [3] chr1 [323892, 328581] + |
100132287
> [4] chr1 [323892, 328581] + |
100132062
> [5] chr1 [367659, 368597] + |
81399
> [6] chr1 [367659, 368597] + |
729759
> [7] chr1 [367659, 368597] + |
26683
> [8] chr1 [763064, 789740] + |
643837
> [9] chr1 [861121, 879961] + |
148398
> ... ... ... ... ...
...
> [37916] chrY [27177050, 27198251] - |
9083
> [37917] chrY [27177050, 27198251] - |
442867
> [37918] chrY [27177050, 27198251] - |
442868
> [37919] chrY [27209230, 27246039] - |
114761
> [37920] chrY [27209230, 27246039] - |
474150
> [37921] chrY [27209230, 27246039] - |
474149
> [37922] chrY [27329790, 27330920] - |
252949
> [37923] chrY [27329790, 27330920] - |
474152
> [37924] chrY [27329790, 27330920] - |
474151
>
> seqlengths
> chr1 chr2 ...
chr18_gl000207_random
> 249250621 243199373 ...
4262
>
> Now you can use findOverlaps() on 'mysnps' and 'tx' to find
> the transcripts hits by your snps. But before you can do this,
> you need to rename the sequences in 'mysnps' because dbSNPs and
> UCSC use different naming conventions for the chromosomes:
>
> > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
>
> Then:
>
> > map<- as.matrix(findOverlaps(mysnps, tx))
>
> 'map' contains the mapping between your SNPs and their genes but not
> in a readable form (this matrix contains indices) so we make the
> 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for
> the associated gene ids:
>
> > mapped_genes<- values(tx)$gene_id[map[, 2]]
> > mapped_snps<- rep.int(values(mysnps)$myids[map[, 1]],
> elementLengths(mapped_genes))
> > snp2gene<- unique(data.frame(snp_id=mapped_snps,
> gene_id=unlist(mapped_genes)))
> > rownames(snp2gene)<- NULL
> > snp2gene[1:4, ]
> snp_id gene_id
> 1 rs7547453 6497
> 2 rs2840542 79906
> 3 rs1999527 63976
> 4 rs4648545 7161
>
> Note that there is no guarantee that the number of rows in this
> data frame is the number of your original SNP ids because the
> relation between SNP ids and gene ids is of course not one-to-one.
>
> Also the method described above considers that a SNP hits a gene
> if it's located between the start and the end of one of its
> transcripts but it doesn't take in account the exon structure of
> the transcripts. If you want to do this you need to use exonsBy()
> (from GenomicFeatures) to extract the exons grouped by transcripts
> (this will be stored in a GRangesList object) and use this object
> instead of 'tx' in the call to findOverlaps().
>
> Hope this helps,
> H.
>
>
>>
>>
>>
>> Simon No??l
>> CdeC
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
------------------------------
Message: 21
Date: Thu, 9 Feb 2012 16:43:23 -0500
From: Simon No?l <simon.noel.2@ulaval.ca>
To: Hervé Pagès <hpages at="" fhcrc.org="">
Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: [BioC] RE : RE : maping SNPs
Message-ID:
<1C970CA8E580A84394CE5CF42FEC0046E2CBDCC110 at EXCH-
MBX-E.ulaval.ca>
Content-Type: text/plain; charset="iso-8859-1"
Thank's alot. It seem to work now:) But for the home made
makeGRangesFromRefSNPids, it's not my work. You gave me that function
;)
Simon No?l
CdeC
________________________________________
De : Hervé Pagès [hpages at fhcrc.org]
Date d'envoi : 9 f?vrier 2012 16:34
? : Simon No?l
Cc : bioconductor at r-project.org
Objet : Re: RE : [BioC] maping SNPs
Hi Simon,
On 02/09/2012 12:28 PM, Simon No?l wrote:
> Hello Mr. Pag?s,
>
> At the begining of my master, you really helped me to map my SNPs to
their gene with the code bellow. As you remember, with your help we
changed it a little bit and we have got with R2.10 :
>
> library("IRanges")
> library("GenomicRanges")
> library("GenomicFeatures")
> #? changer si une version plus r?cente de la librairie est
t?l?charg?e.
> library(SNPlocs.Hsapiens.dbSNP.20101109)
> library("org.Hs.eg.db")
>
> #Allocation de la m?moire sous windows
> memory.limit(size = 4095)
> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
> getSNPcount()
> ch22snps<- getSNPlocs("ch22")
> ch22snps[1:5, ]
>
> #Create a GRange objetc to use with GenomicRanges library
> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE)
> {
> ans_seqnames<- character(length(myids))
> ans_seqnames[]<- "unknown"
> ans_locs<- integer(length(myids))
> for (seqname in names(getSNPcount())) {
> if (verbose)
> cat("Processing ", seqname, " SNPs ...\n", sep="")
> locs<- getSNPlocs(seqname)
> ids<- paste("rs", locs$RefSNP_id, sep="")
> myrows<- match(myids, ids)
> hit_idx<- !is.na(myrows)
> ans_seqnames[hit_idx]<- seqname
> ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]]
> }
> GRanges(seqnames=ans_seqnames,
> IRanges(start=ans_locs, width=1),
> RefSNP_id=myids)
> }
>
> #Test en utilisant les premi?res sondes du premier et second
chormosome
> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542",
"rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176",
"rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230",
"rs12557436")
> #ouverture du fichier pour aller chercher nos num?ros rs
> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE)
> myids<- rs_SNPs[,1]
> mysnps<- makeGRangesFromRefSNPids(myids)
> mysnps # a GRanges object with 1 SNP per row
> #create a TranscriptDb
> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> txdb
> #extract the transcript locations together with their genes
> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx # a GRanges object with 1 transcript per row
> #rename chromosome to fit USCS standard
> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
> #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY",
seqnames(mysnps))
> #mapping but not on a readable format
> map<- as.matrix(findOverlaps(mysnps, tx))
>
> #making the mapping readable
> mapped_genes<- values(tx)$gene_id[map[, 2]]
> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]],
elementLengths(mapped_genes))
> snp2gene<- unique(data.frame(SNPNAME=mapped_snps,
gene_id=unlist(mapped_genes)))
> rownames(snp2gene)<- NULL
> snp2gene
>
> #snp2gene se travaille mal alors on le transf?re en matrice
> snp2geneTmp = t(t(snp2gene))
> #aller chercher les symboles correspondants ? nos gene id
> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound =
NA))
>
> save.image(file = "map.RData")
>
>
>
>
>
> And everything was working perfectly.
>
> Now, I have done a lot of script to analyse my data on a lot of way
and I think it's time to update my old mapping. I have try the same
script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109)
for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some
error...
You've also upgraded from R-2.10/BioC-2.5 to R-2.14/BioC-2.9, which
means a lot of things could have changed. You should not assume that
your problems are caused only because you are using a more recent
SNPlocs package.
> Can you help me? The problems seems to start with "map<-
as.matrix(findOverlaps(mysnps, tx))"
The problem starts earlier with:
> seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
Error in `seqnames<-`(`*tmp*`, value = <s4 object="" of="" class="" "rle"="">) :
levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
The right way to do this with recent version of GenomicRanges is to
use
seqlevels() instead of seqnames().
> and the other error seems to result from that problem.
Failing to rename the seqlevels will surely cause you some troubles
later down so I would try to address this issue first see if that
solves
the other problems.
Also note that with recent versions of the SNPlocs packages (i.e.
version >= 0.99.6), you can use rsidsToGRanges() to do what your
home made makeGRangesFromRefSNPids() function does. The latter
is much faster BUT, unlike the former, it will fail if some of
the supplied rs ids are not found in the SNPlocs package (it will
issue an error showing the list of rs ids that could not be found).
I've already received some request to improve this so I'll try to
work on this soon.
Cheers,
H.
>
>
>
> sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: x86_64-pc-linux-gnu (64-bit)
> locale:
> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] org.Hs.eg.db_2.6.4
> [2] RSQLite_0.11.1
> [3] DBI_0.2-5
> [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6
> [5] GenomicFeatures_1.6.7
> [6] AnnotationDbi_1.16.11
> [7] Biobase_2.14.0
> [8] GenomicRanges_1.6.7
> [9] IRanges_1.12.6
> loaded via a namespace (and not attached):
> [1] biomaRt_2.10.0 Biostrings_2.22.0 BSgenome_1.22.0
RCurl_1.9-5
> [5] rtracklayer_1.14.4 tools_2.14.1 XML_3.9-4
zlibbioc_1.0.0
>
>> library("IRanges")
> Attaching package: ?IRanges?
> The following object(s) are masked from ?package:base?:
> cbind, eval, intersect, Map, mapply, order, paste, pmax,
pmax.int,
> pmin, pmin.int, rbind, rep.int, setdiff, table, union
>> library("GenomicRanges")
>> library("GenomicFeatures")
> Loading required package: AnnotationDbi
> Loading required package: Biobase
> Welcome to Bioconductor
> Vignettes contain introductory material. To view, type
> 'browseVignettes()'. To cite Bioconductor, see
> 'citation("Biobase")' and for packages 'citation("pkgname")'.
>
> Attaching package: ?Biobase?
> The following object(s) are masked from ?package:IRanges?:
> updateObject
>> #? changer si une version plus r?cente de la librairie est
t?l?charg?e.
>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>> library("org.Hs.eg.db")
> Loading required package: DBI
>>
>>
>> #Allocation de la m?moire sous windows
>> memory.limit(size = 4095)
> [1] Inf
> Warning message:
> 'memory.limit()' is Windows-specific
>>
>> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
>> getSNPcount()
> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9 ch10
> 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977
1403368 1544307
> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19 ch20
> 1542256 1521919 1104719 1031214 949642 1084538 917737 886293
732039 788556
> ch21 ch22 chX chY chMT
> 468379 454939 920890 75108 942
>> ch22snps<- getSNPlocs("ch22")
>> ch22snps[1:5, ]
> RefSNP_id alleles_as_ambig loc
> 1 56342815 K 16050353
> 2 149201999 Y 16050408
> 3 146752890 S 16050612
> 4 139377059 Y 16050678
> 5 143300205 R 16050822
>>
>> #########################? FAIRE CANOPUS###################
>>
>> #Create a GRange objetc to use with GenomicRanges library
>> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE)
> + {
> + ans_seqnames<- character(length(myids))
> + ans_seqnames[]<- "unknown"
> + ans_locs<- integer(length(myids))
> + for (seqname in names(getSNPcount())) {
> + if (verbose)
> + cat("Processing ", seqname, " SNPs ...\n", sep="")
> + locs<- getSNPlocs(seqname)
> + ids<- paste("rs", locs$RefSNP_id, sep="")
> + myrows<- match(myids, ids)
> + hit_idx<- !is.na(myrows)
> + ans_seqnames[hit_idx]<- seqname
> + ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]]
> + }
> + GRanges(seqnames=ans_seqnames,
> + IRanges(start=ans_locs, width=1),
> + RefSNP_id=myids)
> + }
>>
>>
>> #Test en utilisant les premi?res sondes du premier et second
chormosome
>> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542",
"rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176",
"rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230",
"rs12557436")
>>
>> #ouverture du fichier pour aller chercher nos num?ros rs
>> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE)
>> myids<- rs_SNPs[,1]
>>
>> mysnps<- makeGRangesFromRefSNPids(myids)
>> mysnps # a GRanges object with 1 SNP per row
> GRanges with 348411 ranges and 1 elementMetadata value:
> seqnames ranges strand | RefSNP_id
> <rle> <iranges> <rle> |<factor>
> [1] ch1 [2195117, 2195117] * | rs7547453
> [2] ch1 [2291680, 2291680] * | rs2840542
> [3] ch1 [3256108, 3256108] * | rs1999527
> [4] ch1 [3577321, 3577321] * | rs4648545
> [5] ch1 [4230463, 4230463] * | rs10915459
> [6] ch1 [4404344, 4404344] * | rs16838750
> [7] ch1 [4501911, 4501911] * | rs12128230
> [8] ch1 [4535148, 4535148] * | rs7541288
> [9] ch1 [4581230, 4581230] * | rs12039682
> ... ... ... ... ... ...
> [348403] chX [154514047, 154514047] * | rs499428
> [348404] chX [154514919, 154514919] * | rs507127
> [348405] chX [154737376, 154737376] * | rs5940372
> [348406] chX [154780283, 154780283] * | rs6642287
> [348407] chX [154830377, 154830377] * | rs5983658
> [348408] chX [154870197, 154870197] * | rs473772
> [348409] chX [154892230, 154892230] * | rs553678
> [348410] chX [154899846, 154899846] * | rs473491
> [348411] chX [154929412, 154929412] * | rs557132
> ---
> seqlengths:
> ch1 ch10 ch11 ch12 ch13 ... ch8 ch9
chX unknown
> NA NA NA NA NA ... NA NA
NA NA
>>
>> #create a TranscriptDb
>> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> Download the refGene table ... OK
> Download the refLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> There were 50 or more warnings (use warnings() to see the first 50)
>> txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | Genus and Species: Homo sapiens
> | UCSC Table: refGene
> | Resource URL: http://genome.ucsc.edu/
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 41677
> | exon_nrow: 235596
> | cds_nrow: 206518
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012)
> | GenomicFeatures version at creation time: 1.6.7
> | RSQLite version at creation time: 0.11.1
> | DBSCHEMAVERSION: 1.0
> | package: GenomicFeatures
>>
>> #extract the transcript locations together with their genes
>> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
>> tx # a GRanges object with 1 transcript per row
> GRanges with 41677 ranges and 3 elementMetadata values:
> seqnames ranges strand | tx_id
tx_name
> <rle> <iranges> <rle> |<integer>
<character>
> [1] chr1 [ 11874, 14408] + | 1127
NR_046018
> [2] chr1 [ 69091, 70008] + | 1128
NM_001005484
> [3] chr1 [323892, 328581] + | 1130
NR_028327
> [4] chr1 [323892, 328581] + | 1132
NR_028325
> [5] chr1 [323892, 328581] + | 1133
NR_028322
> [6] chr1 [367659, 368597] + | 1131
NM_001005221
> [7] chr1 [367659, 368597] + | 1134
NM_001005224
> [8] chr1 [367659, 368597] + | 1135
NM_001005277
> [9] chr1 [763064, 789740] + | 198
NR_015368
> ... ... ... ... ... ...
...
> [41669] chrY [27177050, 27198251] - | 20790
NM_004678
> [41670] chrY [27177050, 27198251] - | 20793
NM_001002761
> [41671] chrY [27177050, 27198251] - | 20794
NM_001002760
> [41672] chrY [27209230, 27246039] - | 20791
NR_002177
> [41673] chrY [27209230, 27246039] - | 20792
NR_002178
> [41674] chrY [27209230, 27246039] - | 20795
NR_001525
> [41675] chrY [27329790, 27330920] - | 20796
NR_002179
> [41676] chrY [27329790, 27330920] - | 20797
NR_002180
> [41677] chrY [27329790, 27330920] - | 20798
NR_001526
> gene_id
> <compressedcharacterlist>
> [1] 100287102
> [2] 79501
> [3] 100133331
> [4] 100132062
> [5] 100132287
> [6] 729759
> [7] 26683
> [8] 81399
> [9] 643837
> ... ...
> [41669] 9083
> [41670] 442868
> [41671] 442867
> [41672] 474150
> [41673] 474149
> [41674] 114761
> [41675] 474152
> [41676] 474151
> [41677] 252949
> ---
> seqlengths:
> chr1 chr2 ...
chr18_gl000207_random
> 249250621 243199373 ...
4262
>>
>> #rename chromosome to fit USCS standard
>> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
> Error in `seqnames<-`(`*tmp*`, value =<s4 object="" of="" class="" "rle"="">) :
> levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
>> #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY",
seqnames(mysnps))
>>
>> #mapping but not on a readable format
>> map<- as.matrix(findOverlaps(mysnps, tx))
> Warning message:
> In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels not in the
other:
> - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17,
ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9,
chX, unknown
> - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8,
chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18,
chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5,
chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6,
chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random,
chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random,
chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212,
chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219,
chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217,
chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220,
chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214,
chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [...
truncated]
>>
>>
>> #making the mapping readable
>> mapped_genes<- values(tx)$gene_id[map[, 2]]
>> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]],
elementLengths(mapped_genes))
>> snp2gene<- unique(data.frame(SNPNAME=mapped_snps,
gene_id=unlist(mapped_genes)))
>> rownames(snp2gene)<- NULL
>> snp2gene
> [1] SNPNAME gene_id
> <0 rows> (or 0-length row.names)
>>
>>
>> #snp2gene se travaille mal alors on le transf?re en matrice
>> snp2geneTmp = t(t(snp2gene))
>>
>> #aller chercher les symboles correspondants ? nos gene id
>> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound =
NA))
> Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound =
NA)) :
> error in evaluating the argument 'x' in selecting a method for
function 'unlist': Error in .checkKeysAreWellFormed(keys) :
> keys must be supplied in a character vector with no NAs
>>
>>
>> save.image(file = "map.RData")
>>
>
>
>
>
>
> Simon No?l
> CdeC
> ________________________________________
> De : Hervé Pagès [hpages at fhcrc.org]
> Date d'envoi : 5 d?cembre 2010 23:43
> ? : Simon No?l
> Cc : bioconductor at r-project.org
> Objet : Re: [BioC] maping SNPs
>
> Hi Simon,
>
> On 12/03/2010 10:17 AM, Simon No?l wrote:
>>
>> Hi,
>>
>>
>>
>> I have a really big list of SNPs names like :
>>
>>
>>
>> SNPNAME
>>
>> rs7547453
>>
>> rs2840542
>>
>> rs1999527
>>
>> rs4648545
>>
>> rs10915459
>>
>> rs16838750
>>
>> rs12128230
>>
>> ...
>>
>>
>>
>> I woudlike to map them to their official gene symbol. What
the best way to
>> procede?
>
> Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings
> from SNPs to genes and I don't think we have this kind of mappings
> either in our collection of annotations (*.db packages).
>
> But if your SNPs are Human then you can do the mapping yourself by
> using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures
> packages.
>
> The latest SNPlocs.Hsapies.dbSNP.* package is
> SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains
> SNP locations relative to the GRCh37 genome:
>
> > library(SNPlocs.Hsapiens.dbSNP.20101109)
> > getSNPcount()
> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9
> ch10
> 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129
995075
> 1158707
> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19
> ch20
> 1147722 1105364 815729 740129 657719 757926 641905 645646
520666
> 586708
> ch21 ch22 chX chY chMT
> 338254 331060 529608 67438 624
>
> Note that it doesn't contain *all* SNPs from dbSNP Build 132:
> only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109
> for the details).
>
> > ch22snps<- getSNPlocs("ch22")
> > ch22snps[1:5, ]
> RefSNP_id alleles_as_ambig loc
> 1 56342815 K 16050353
> 2 7288968 S 16050994
> 3 6518357 M 16051107
> 4 7292503 R 16051209
> 5 6518368 Y 16051241
>
> Note that the rs prefix has been dropped.
>
> So here is how to proceed:
>
> First you can use the following function to make a GRanges object
from
> your SNP ids:
>
> makeGRangesFromRefSNPids<- function(myids)
> {
> ans_seqnames<- character(length(myids))
> ans_seqnames[]<- "unknown"
> ans_locs<- integer(length(myids))
> for (seqname in names(getSNPcount())) {
> locs<- getSNPlocs(seqname)
> ids<- paste("rs", locs$RefSNP_id, sep="")
> myrows<- match(myids, ids)
> ans_seqnames[!is.na(myrows)]<- seqname
> ans_locs[!is.na(myrows)]<- locs$loc[myrows]
> }
> GRanges(seqnames=ans_seqnames,
> IRanges(start=ans_locs, width=1),
> RefSNP_id=myids)
> }
>
> This takes between 3 and 5 minutes:
>
> > myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
> "rs10915459", "rs16838750", "rs12128230",
"rs999999999")
> > mysnps<- makeGRangesFromRefSNPids(myids)
> > mysnps # a GRanges object with 1 SNP per row
> GRanges with 8 ranges and 1 elementMetadata value
> seqnames ranges strand | myids
> <rle> <iranges> <rle> |<character>
> [1] ch1 [2195117, 2195117] * | rs7547453
> [2] ch1 [2291680, 2291680] * | rs2840542
> [3] ch1 [3256108, 3256108] * | rs1999527
> [4] ch1 [3577321, 3577321] * | rs4648545
> [5] ch1 [4230463, 4230463] * | rs10915459
> [6] ch1 [4404344, 4404344] * | rs16838750
> [7] ch1 [4501911, 4501911] * | rs12128230
> [8] unknown [ 0, 0] * | rs999999999
>
> seqlengths
> ch1 unknown
> NA NA
>
> The next step is to create a TranscriptDb object with
> makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart()
> from the GenomicFeatures package. This TranscriptDb object will
> contain the transcript locations and their associated
> genes extracted from the annotation source you choose.
> For example, if you want to use RefSeq genes:
>
> ## Takes about 3 minutes:
> > txdb<- makeTranscriptDbFromUCSC(genome="hg19",
tablename="refGene")
> > txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | UCSC Table: refGene
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 37924
> | exon_nrow: 230024
> | cds_nrow: 204571
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010)
> | GenomicFeatures version at creation time: 1.2.2
> | RSQLite version at creation time: 0.9-4
> | DBSCHEMAVERSION: 1.0
>
> Note the type of gene IDs (Entrez Gene ID) stored in this
TranscriptDb
> object: this means that later you will be able to use the
org.Hs.eg.db
> package to map your gene ids to their symbol (the org.*.eg.db
packages
> are Entrez Gene ID centric).
>
> To extract the transcript locations together with their genes:
>
> > tx<- transcripts(txdb, columns=c("tx_id", "tx_name",
"gene_id"))
> > tx # a GRanges object with 1 transcript per row
> GRanges with 37924 ranges and 1 elementMetadata value
> seqnames ranges strand |
gene_id
> <rle> <iranges> <rle>
|<compressedcharacterlist>
> [1] chr1 [ 69091, 70008] + |
79501
> [2] chr1 [323892, 328581] + |
100133331
> [3] chr1 [323892, 328581] + |
100132287
> [4] chr1 [323892, 328581] + |
100132062
> [5] chr1 [367659, 368597] + |
81399
> [6] chr1 [367659, 368597] + |
729759
> [7] chr1 [367659, 368597] + |
26683
> [8] chr1 [763064, 789740] + |
643837
> [9] chr1 [861121, 879961] + |
148398
> ... ... ... ... ...
...
> [37916] chrY [27177050, 27198251] - |
9083
> [37917] chrY [27177050, 27198251] - |
442867
> [37918] chrY [27177050, 27198251] - |
442868
> [37919] chrY [27209230, 27246039] - |
114761
> [37920] chrY [27209230, 27246039] - |
474150
> [37921] chrY [27209230, 27246039] - |
474149
> [37922] chrY [27329790, 27330920] - |
252949
> [37923] chrY [27329790, 27330920] - |
474152
> [37924] chrY [27329790, 27330920] - |
474151
>
> seqlengths
> chr1 chr2 ...
chr18_gl000207_random
> 249250621 243199373 ...
4262
>
> Now you can use findOverlaps() on 'mysnps' and 'tx' to find
> the transcripts hits by your snps. But before you can do this,
> you need to rename the sequences in 'mysnps' because dbSNPs and
> UCSC use different naming conventions for the chromosomes:
>
> > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
>
> Then:
>
> > map<- as.matrix(findOverlaps(mysnps, tx))
>
> 'map' contains the mapping between your SNPs and their genes but not
> in a readable form (this matrix contains indices) so we make the
> 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for
> the associated gene ids:
>
> > mapped_genes<- values(tx)$gene_id[map[, 2]]
> > mapped_snps<- rep.int(values(mysnps)$myids[map[, 1]],
> elementLengths(mapped_genes))
> > snp2gene<- unique(data.frame(snp_id=mapped_snps,
> gene_id=unlist(mapped_genes)))
> > rownames(snp2gene)<- NULL
> > snp2gene[1:4, ]
> snp_id gene_id
> 1 rs7547453 6497
> 2 rs2840542 79906
> 3 rs1999527 63976
> 4 rs4648545 7161
>
> Note that there is no guarantee that the number of rows in this
> data frame is the number of your original SNP ids because the
> relation between SNP ids and gene ids is of course not one-to-one.
>
> Also the method described above considers that a SNP hits a gene
> if it's located between the start and the end of one of its
> transcripts but it doesn't take in account the exon structure of
> the transcripts. If you want to do this you need to use exonsBy()
> (from GenomicFeatures) to extract the exons grouped by transcripts
> (this will be stored in a GRangesList object) and use this object
> instead of 'tx' in the call to findOverlaps().
>
> Hope this helps,
> H.
>
>
>>
>>
>>
>> Simon No??l
>> CdeC
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
------------------------------
Message: 22
Date: Thu, 9 Feb 2012 13:48:19 -0800
From: Michael Lawrence <lawrence.michael@gene.com>
To: Steve Lianoglou <mailinglist.honeypot at="" gmail.com="">
Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
<caoq5nyfevw0_a-g_j65-ic9e67xcw7m9ri10ro2jnn8dfyv_pw at="" mail.gmail.com="">
Content-Type: text/plain
Hi Steve,
I had thought someone would want something like this, and it would be
nice
to able to parse a GTF file into something more useful than a GRanges,
as
rtracklayer supports now.
So I'd definitely encourage you to go ahead, while recommending that
it be
based on the import.gtf function in rtracklayer.
We really need to have a centralized track I/O package. I'm not sure
why,
but everyone seems intent on writing the same parser over and over
again.
Duplicated effort frustrates me. It doesn't have to be rtracklayer,
but
somewhere in the infrastructure we need reliable I/O that handles all
of
the corner cases. I've embarked on a time consuming effort of
implementing
a comprehensive test suite for rtracklayer, which should make it at
least a
base for such a package.
Thanks a lot for volunteering this contribution,
Michael
On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou <
mailinglist.honeypot at gmail.com> wrote:
> Hi,
>
> I thought it would be handy to make a GenomicFeatures::TranscriptDb
> from a gtf file and was somehow surprised to see that I couldn't
find
> such a function in GenomicFeatures.
>
> I'm happy to whip up such a function, but wanted to check in to make
> sure I'm not missing something like (1) you can already do it; or
(2)
> it's not as straightforward as I think; (3) maybe it's there and I'm
> not looking hard enough.
>
> Right now I just want to build it on the gff that the knew versions
of
> tophat build when you pass in a gtf/gff file of known transcripts
(the
> --G/--GTF cmd line arg), but I think it'd be handy overall.
>
> I don't think gff/gtf's have any column for exon ordering, though,
in
> which case I'd just assume the exons are ordered linearly (bye bye
> trans-splicing)).
>
> Good idea? Bad idea? Already done?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
------------------------------
Message: 23
Date: Thu, 9 Feb 2012 17:16:56 -0500
From: Steve Lianoglou <mailinglist.honeypot@gmail.com>
To: Michael Lawrence <lawrence.michael at="" gene.com="">
Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
<caha9mcn4bxggysrgoarbsqedu1zo6s3o+55yy3pjr3cij8du_g at="" mail.gmail.com="">
Content-Type: text/plain; charset=ISO-8859-1
Hey Michael,
I'll see if I can take a crack at it this w/e.
Also I'm in complete agreement w.r.t i/o stuff and your thoroughness
w/ rtracklayer is very much appreciated.
If we do have a centralized track i/o package that is outside of
rtracklayer, I nominate we name it bIO ... or something ;-)
... yes, I can hear the crickets ...
-steve
On Thu, Feb 9, 2012 at 4:48 PM, Michael Lawrence
<lawrence.michael at="" gene.com=""> wrote:
> Hi Steve,
>
> I had thought someone would want something like this, and it would
be nice
> to able to parse a GTF file into something more useful than a
GRanges, as
> rtracklayer supports now.
>
> So I'd definitely encourage you to go ahead, while recommending that
it be
> based on the import.gtf function in rtracklayer.
>
> We really need to have a centralized track I/O package. I'm not sure
why,
> but everyone seems intent on writing the same parser over and over
again.
> Duplicated effort frustrates me. It doesn't have to be rtracklayer,
but
> somewhere in the infrastructure we need reliable I/O that handles
all of the
> corner cases. I've embarked on a time consuming effort of
implementing a
> comprehensive test suite for rtracklayer, which should make it at
least a
> base for such a package.
>
> Thanks a lot for volunteering this contribution,
> Michael
>
> On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou
> <mailinglist.honeypot at="" gmail.com=""> wrote:
>>
>> Hi,
>>
>> I thought it would be handy to make a GenomicFeatures::TranscriptDb
>> from a gtf file and was somehow surprised to see that I couldn't
find
>> such a function in GenomicFeatures.
>>
>> I'm happy to whip up such a function, but wanted to check in to
make
>> sure I'm not missing something like (1) you can already do it; or
(2)
>> it's not as straightforward as I think; (3) maybe it's there and
I'm
>> not looking hard enough.
>>
>> Right now I just want to build it on the gff that the knew versions
of
>> tophat build when you pass in a gtf/gff file of known transcripts
(the
>> --G/--GTF cmd line arg), but I think it'd be handy overall.
>>
>> I don't think gff/gtf's have any column for exon ordering, though,
in
>> which case I'd just assume the exons are ordered linearly (bye bye
>> trans-splicing)).
>>
>> Good idea? Bad idea? Already done?
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> ?| Memorial Sloan-Kettering Cancer Center
>> ?| Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
?| Memorial Sloan-Kettering Cancer Center
?| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
------------------------------
Message: 24
Date: Thu, 9 Feb 2012 14:17:04 -0800 (PST)
From: khadeeja ismail <hajjja@yahoo.com>
To: "Bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: [BioC] Removing probes with low variance across samples
(Infinium 450k)
Message-ID:
<1328825824.36437.YahooMailNeo at
web114703.mail.gq1.yahoo.com>
Content-Type: text/plain
Hi,
I have a question regarding the Illumina Human Methylation 450k array
and the genefilter package.
I used the 'nsfilter' function in gene filter to remove probes that
have low variance across samples. When I checked the documentation for
nsfilter, I found out that applying the function removes 50% of the
probes by default.
I computed the variance for each probe in the remaining probes and for
the removed probes separately. When I plot the density for each set of
variances, they overlap completely showing that both sets have most of
their probes with variance close to zero and few with high variance.
This leaves me wondering how nsfilter actually filters probes, as it
doesn't appear from the plot that the probes with the lowest variances
are removed.
What would be the best way to filter out low variance probes in 450k
data? If the default value in nsfilter is set to 50% assuming that 40%
of genes in a cell are not expressed, what percentage cutoff can be
used for methylation data?
Would be great if anyone can explain it.
Thanks,
Khadeeja
[[alternative HTML version deleted]]
------------------------------
Message: 25
Date: Thu, 09 Feb 2012 14:18:04 -0800
From: Hervé Pagès <hpages@fhcrc.org>
To: Simon No?l <simon.noel.2 at="" ulaval.ca="">
Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] RE : RE : maping SNPs
Message-ID: <4F34461C.2050502 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
On 02/09/2012 01:43 PM, Simon No?l wrote:
> Thank's alot. It seem to work now:) But for the home made
makeGRangesFromRefSNPids, it's not my work. You gave me that function
;)
Sure, I remember now. No problem, you can keep it ;-)
H.
------------------------------
Message: 26
Date: Fri, 10 Feb 2012 09:44:37 +1100 (AUS Eastern Daylight Time)
From: Gordon K Smyth <smyth@wehi.edu.au>
To: wang peter <wng.peter at="" gmail.com="">
Cc: Bioconductor mailing list <bioconductor at="" r-project.org="">
Subject: [BioC] Using write.table with output from topTags [was:
report a possible bug of edgeR]
Message-ID: <pine.wnt.4.64.1202100925590.4124 at="" pc765.wehi.edu.au="">
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Dear Peter (or Shan Gao),
Short answer:
You are using an old version of edgeR. You need to either install the
current version of edgeR from Bioconductor or, with the version you
have,
you can use
write.table(result$table, etc)
instead of write.table(result, etc).
Longer answer:
This is not an edgeR bug. The topTags results object has been
produced
correctly. However, when you try to write it to a file using
write.table(), you are trying to treat the results object as a
data.frame,
which it is not. Hence the error message.
In the version of edgeR in the most recent Bioconductor release we
added a
method for coercing TopTags objects to data.frames, to allow people to
do
exactly what you're doing.
Please have a look at the posting guide:
http://www.bioconductor.org/help/mailing-list/posting-guide/
It is a good idea to include output from sessionInfo() in any post.
Best wishes
Gordon
> Date: Wed, 8 Feb 2012 16:33:38 -0500
> From: wang peter <wng.peter at="" gmail.com="">
> To: bioconductor at r-project.org
> Subject: [BioC] report a possible bug of edgeR
>
> i have two group of samples,
> each group have 3 biological replicate
> library(edgeR)
> library(limma)
>
> raw.data <- read.table("111",row.names=1)
> d <- raw.data[, 1:dim(raw.data)[2]-1
> group <- factor(c(rep("sap", 3),rep("vas", 3)))
> d <- DGEList(counts = d, lib.size =
> c(9893630,11055814,11207084,9663487,11455088,8140053), group =
group)
> d <- d[rowSums(d$counts) >= length(group)/2,]
>
> #normalization
> d <- calcNormFactors(d)
>
> #To estimate common dispersion:
> d <- estimateCommonDisp(d)
> #To estimate tagwise dispersions:
> d <- estimateTagwiseDisp(d)
>
> et <- exactTest(d)
> result <- topTags(et, n=dim(d)[1]+1, adjust.method="BH",
sort.by="logFC")
> write.table(result,file = "DEresult",sep = "\t")
>
> Error in data.frame(table = list(logConc = c(-30.7452523676841,
> -30.9164328563752, :
> arguments imply differing number of rows: 92305, 1, 2
>
> but write.table(result[1:92304,],file = "DEresult",sep = "\t") can
work well
> and then result[92305,] can also work well:
>
> logConc logFC P.Value FDR
> UN089145 -16.96614 -4.874874e-05 1 1
>
>
> another problem is the result file
> "table.logConc" "table.logFC" "table.P.Value"
"table.adj.P.Val" "adjust.method" "comparison"
> "UN038758" -30.7452523676841 -38.5416037522595
1.78446331842105e-37 2.74524811011424e-33 "BH" "sap"
>
>
> "comparison" should include two group name like, "sap" vs "vas"
> but why only one name?
>
> --
> shan gao
> Room 231(Dr.Fei lab)
> Boyce Thompson Institute
> Cornell University
> Tower Road, Ithaca, NY 14853-1801
> Office phone: 1-607-254-1267(day)
> Official email:sg839 at cornell.edu
> Facebook:http://www.facebook.com/profile.php?id=100001986532253
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
------------------------------
Message: 27
Date: Thu, 9 Feb 2012 17:46:58 -0500
From: Ann Loraine <aloraine@gmail.com>
To: bioconductor at r-project.org
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID:
<cao=20estfpfzcokld6x0hj=iedfofw+jkvjc5r9-tkhk2wirdw at="" mail.gmail.com="">
Content-Type: text/plain; charset=ISO-8859-1
Hello,
Is there anything that can assess differences at the individual splice
site level?
Testing at the isoform level is sometimes not useful because many,
many factors go into determining the final structure of a
fully-processed mRNA transcript, such as the transcription start site,
cleavage at the three prime end, splicing of many internal introns,
and so on.
I think it is much more interesting and useful biologically to focus
in on one or two aspects of transcription and test whether a condition
or treatment has affected that one aspect.
For example, I'd like to know if a condition or treatment affects
individual splice site preference.
ex):
5'
V1 XXXXX------XXXXXXXXX 3'
V2 XXXXX--------------XXXXX
In this simple example, we have two isoforms arising from the same
gene.
To assess whether the condition has changed splicing, i.e., changed
which splice site is preferred, then I need to know how many
transcripts used acceptor site V1 versus acceptor site V2.
I can do this by counting reads that align across the intron. A
spliced read can support the V1 acceptor or the V2 acceptor, but never
both. So I can be sure of which choice the spliced read represents.
But let's say I have data from three treatments and three controls.
How can I determine whether the treatment changed the ratio of V1 to
V2 spliced reads?
Can I do something like calculate the percentage of V1 reads overall
and then compare the percentages using a t test?
[[elided Yahoo spam]]
Best wishes,
Ann
------------------------------
Message: 28
Date: Thu, 9 Feb 2012 16:07:39 -0800
From: "Tim Triche, Jr." <tim.triche@gmail.com>
Cc: "Bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Subject: Re: [BioC] Removing probes with low variance across samples
(Infinium 450k)
Message-ID:
<cac+n9buvsmxlbh_w4o8botx6p5m9yq3jzht=um7dqw0mt4nnpq at="" mail.gmail.com="">
Content-Type: text/plain
Genomic DNA -- what you're assaying on these arrays, or at least what
they're designed for -- need not be expressed.
It's just... there, chopped up after extraction, bisulfite conversion,
and
whole-genome amplification, waiting to hybridize.
Thus nsfilter's fundamental assumption -- that some large fraction of
the
probes on the array are in fact pure noise -- is violated. It may be
that
there is (almost always) local correlation between probes within +/-
1kb of
each other, but if the protocols for these arrays are followed
carefully,
you can expect better than 99% of the probes to hybridize (which is
NOT the
case with expression arrays, and you would not expect 99% of the
genome to
align in an RNAseq experiment either). So the decision of how many
probes
to retain then comes down to your judgment.
Biological annotation (e.g. from ChIP-seq peak calls for histone
marks,
transcription factors, or physical interactions) can become very
useful in
making sense of these data. If you lack normal samples (or don't know
which ones are "normal") it is possible to see low variability in
regions
which are consistently aberrant, so that may not always be the best
approach. I find the GenomicRanges, GenomicFeatures, and rtracklayer
packages useful for this type of annotation, FWIW.
Hope this helps,
--t
> Hi,
>
> I have a question regarding the Illumina Human Methylation 450k
array and
> the genefilter package.
> I used the 'nsfilter' function in gene filter to remove probes that
have
> low variance across samples. When I checked the documentation for
nsfilter,
> I found out that applying the function removes 50% of the probes by
> default.
> I computed the variance for each probe in the remaining probes and
for the
> removed probes separately. When I plot the density for each set of
> variances, they overlap completely showing that both sets have most
of
> their probes with variance close to zero and few with high variance.
> This leaves me wondering how nsfilter actually filters probes, as it
> doesn't appear from the plot that the probes with the lowest
variances are
> removed.
> What would be the best way to filter out low variance probes in 450k
data?
> If the default value in nsfilter is set to 50% assuming that 40% of
genes
> in a cell are not expressed, what percentage cutoff can be used for
> methylation data?
> Would be great if anyone can explain it.
>
> Thanks,
> Khadeeja
>
--
*A model is a lie that helps you see the truth.*
*
*
Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
[[alternative HTML version deleted]]
------------------------------
Message: 29
Date: Thu, 9 Feb 2012 19:32:53 -0500
From: Sean Davis <sdavis2@mail.nih.gov>
To: Salwa Eid <salwaeid at="" hotmail.com="">
Cc: bioconductor at r-project.org
Subject: Re: [BioC] how can i convert a gse object(expression sets) to
affybatch
Message-ID:
<caneavbnmmvkbrvqlc39uexbqxfs5unw3l8rhfccyzop=vkpf+a at="" mail.gmail.com="">
Content-Type: text/plain; charset=UTF-8
On Thu, Feb 9, 2012 at 2:35 PM, Salwa Eid <salwaeid at="" hotmail.com="">
wrote:
>
>
>
>
> Dear all, ? ?I have read cel files from ncbi website using the
getGEO from GEOquery package. ?I want then to use the combineaffybatch
from the matchprobe package but the objects I have to pass is
affybatch. ?I need to convert the gse object or the expression sets to
affybatch to be able to use it. ?Any ideas? Regards,salwa
>
Hi, Salwa.
You cannot convert a GSE object or ExpressionSet into an affybatch
since the probe-level data are not contained in either object. The
getGEO function downloads the normalized data from GEO, not the .CEL
files. You want to use the getGEOSuppFiles() function instead. This
will download raw data, typically in .tar format. You can use the
untar() function to then expand the archive. Finally, you can use the
affy package to read the .CEL.gz files to get an affybatch.
Hope that helps.
Sean
------------------------------
Message: 30
Date: Thu, 9 Feb 2012 19:46:28 -0500
From: somnath bandyopadhyay <genome1976@hotmail.com>
To: <bioconductor at="" r-project.org="">
Subject: [BioC] ConsensusClusterPlus extracting features for clusters
Message-ID: <snt108-w1374f3fdc88f4315e7a936cd780 at="" phx.gbl="">
Content-Type: text/plain
Hi,
Could anybody suggest a way of extracting features/genes associated
with related clusters from the results file obtained from
ConsensusClusterPlus?Any help would be greatly appreciated.
Thanks,Som.
[[alternative HTML version deleted]]
------------------------------
Message: 31
Date: Fri, 10 Feb 2012 12:00:12 +1100 (EST)
From: Dario Strbenac <d.strbenac@garvan.org.au>
To: bioconductor at r-project.org
Subject: [BioC] limma barcodeplot with 2 groups
Message-ID: <20120210120012.BQA38016 at gimr.garvan.unsw.edu.au>
Content-Type: text/plain; charset="us-ascii"
Hello,
The labels that I get are still on the side, as for a 1-sample plot.
Would it make more sense graphically to have them at the top and
botton of the graphic for 2-sample plots ? In the current format, I
cannot tell which label relates to which colour.
I have limma_3.10.2
--------------------------------------
Dario Strbenac
Research Assistant
Cancer Epigenetics
Garvan Institute of Medical Research
Darlinghurst NSW 2010
Australia
-------------- next part --------------
A non-text attachment was scrubbed...
Name: barcode2groups.png
Type: image/png
Size: 10527 bytes
Desc: not available
URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20120210="" 52195af4="" attachment-0001.png="">
------------------------------
Message: 32
Date: Fri, 10 Feb 2012 01:13:30 -0800
From: Bogdan Tanasa <tanasa@gmail.com>
To: <bioconductor at="" stat.math.ethz.ch="">,
<bioc-sig-sequencing at="" r-project.org="">
Subject: [BioC] bioMart
Message-ID:
<ca+jem00hsjviz9rfe0k-sdmy=8diaj7cp6t9hltdbr3nwfaxbg at="" mail.gmail.com="">
Content-Type: text/plain
Dear all,
given current database updates, just to make sure, when using BioMart,
in
the command below
>mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
how shall I access the homo sapiens ensembl database that is quivalent
to
UCSC genome hg18 ?
thanks !
Bogdan
[[alternative HTML version deleted]]
------------------------------
Message: 33
Date: Fri, 10 Feb 2012 02:22:13 -0800 (PST)
From: "Richard Coulson [guest]" <guest@bioconductor.org>
To: bioconductor at r-project.org, richard.coulson at cimr.cam.ac.uk
Subject: [BioC] Inconsistent coefficient values
Message-ID: <20120210102213.D70D31255EF at mamba.fhcrc.org>
Hi,
I have a problem with using 'limma' when I'm analysing some microarray
data. If I run the below code WITHOUT setting a seed, I get slightly
different values for the coefficients each time it's run; however this
problem does not occur if I do set one (e.g. set.seed(1223762671)) :-
raw.data <-ReadAffy( celfile.path="CEL directory" )
normalised.data <-vsnrma(raw.data)
transfect.lmFit <-lmFit( normalised.data, design.matrix )
cont.lmFit <-contrasts.fit(transfect.lmFit, cont.matrix)
i.e. the values in cont.lmFit$coefficients are altered from one R
session to another.
Please could anyone help with this?
Many thanks,
Richard.
-- output of sessionInfo():
> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] hugene11stv1cdf_1.26.0 limma_3.4.4 vsn_3.16.0
[4] affyPLM_1.24.0 preprocessCore_1.10.0 gcrma_2.20.0
[7] affy_1.26.1 Biobase_2.8.0
loaded via a namespace (and not attached):
[1] affyio_1.16.0 Biostrings_2.16.9 grid_2.12.2
IRanges_1.6.8
[5] lattice_0.19-17 splines_2.12.2 tools_2.12.2
> cont.matrix
Contrasts
Levels
(C.GFP.24+C.GFP.48+C.GFP.72)-(mock.24+mock.48+mock.72)
C.GFP.24
1
C.GFP.48
1
C.GFP.72
1
mock.24
-1
mock.48
-1
mock.72
-1
myc.24
0
myc.48
0
myc.72
0
N.GFP.24
0
N.GFP.48
0
N.GFP.72
0
untransfected.0
0
Contrasts
Levels
(N.GFP.24+N.GFP.48+N.GFP.72)-(mock.24+mock.48+mock.72)
C.GFP.24
0
C.GFP.48
0
C.GFP.72
0
mock.24
-1
mock.48
-1
mock.72
-1
myc.24
0
myc.48
0
myc.72
0
N.GFP.24
1
N.GFP.48
1
N.GFP.72
1
untransfected.0
0
Contrasts
Levels (myc.24+myc.48+myc.72)-(mock.24+mock.48+mock.72)
C.GFP.24 0
C.GFP.48 0
C.GFP.72 0
mock.24 -1
mock.48 -1
mock.72 -1
myc.24 1
myc.48 1
myc.72 1
N.GFP.24 0
N.GFP.48 0
N.GFP.72 0
untransfected.0 0
--
Sent via the guest posting facility at bioconductor.org.
------------------------------
Message: 34
Date: Fri, 10 Feb 2012 11:40:04 +0100
From: "Manuela Di Russo" <manuela.dirusso@for.unipi.it>
To: <bioconductor at="" r-project.org="">
Subject: [BioC] a problem in reading in cel files
Message-ID: <fd438ece49ff463591d66ef5526f3119 at="" maanalysis2="">
Content-Type: text/plain
Dear all,
I am learning to analyse Affymetrix microarray data but I have a
problem in reading .cel files in.
I downloaded from GEO the raw data provided as supplementary files
(GSE12345_RAW.tar), than I have extracted the cel files in a directory
which I have set as my working directory.
Here is the R code I used:
> setwd("C:/BACKUP/Dati/Progetti/Landi/meta-analisi MPM/GSE12345_RAW")
> library(affy)
Carico il pacchetto richiesto: Biobase
Welcome to Bioconductor
Vignettes contain introductory material. To view, type
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")' and for packages 'citation("pkgname")'.
> dir()
[1] "data analysis.txt" "E-GEOD-12345.sdrf.txt"
"E-GEOD-12345.sdrf.xls"
[4] "GSM309986.CEL" "GSM309987.CEL" "GSM309988.CEL"
[7] "GSM309989.CEL" "GSM309990.CEL" "GSM309991.CEL"
[10] "GSM310012.CEL" "GSM310013.CEL" "GSM310014.CEL"
[13] "GSM310015.CEL" "GSM310016.CEL" "GSM310068.CEL"
[16] "GSM310070.CEL" "target.txt" "target.xls"
> pd <- read.AnnotatedDataFrame("target.txt",header=TRUE,row.names=1,a
s.is=TRUE)
> pData(pd)
FileName Target
N1 GSM309986.CEL pleural tissue
N2 GSM309987.CEL pleural tissue
N3 GSM309988.CEL pleural tissue
N4 GSM309989.CEL pleural tissue
MM1 GSM309990.CEL mesothelioma tissue
MM2 GSM309991.CEL mesothelioma tissue
MM3 GSM310012.CEL mesothelioma tissue
MM4 GSM310013.CEL mesothelioma tissue
MM5 GSM310014.CEL mesothelioma tissue
MM6 GSM310015.CEL mesothelioma tissue
MM7 GSM310016.CEL mesothelioma tissue
MM8 GSM310068.CEL mesothelioma tissue
MM9 GSM310070.CEL mesothelioma tissue
> rawData <- read.affybatch(filenames=pData(pd)$FileName,phenoData=pd)
Error in try(.Call("ReadHeaderDetailed", filename, PACKAGE =
"affyio")) :
Is GSM310016.CEL really a CEL file? tried reading as text, gzipped
text, binary, gzipped binary, command console and gzipped command
console formats.
Errore in read.celfile.header(filenames[i], info = "full") :
Failed to get full header information for GSM310016.CEL
> rawData1<-ReadAffy()
Error in try(.Call("ReadHeaderDetailed", filename, PACKAGE =
"affyio")) :
Is C:/BACKUP/Dati/Progetti/Landi/meta-analisi
MPM/GSE12345_RAW/GSM310016.CEL really a CEL file? tried reading as
text, gzipped text, binary, gzipped binary, command console and
gzipped command console formats.
Errore in read.celfile.header(filenames[i], info = "full") :
Failed to get full header information for
C:/BACKUP/Dati/Progetti/Landi/meta-analisi
MPM/GSE12345_RAW/GSM310016.CEL
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C
[5] LC_TIME=Italian_Italy.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] affy_1.32.1 Biobase_2.14.0
loaded via a namespace (and not attached):
[1] affyio_1.22.0 BiocInstaller_1.2.1 preprocessCore_1.16.0
[4] zlibbioc_1.0.0
> traceback()
7: stop("Failed to get full header information for ", filename)
6: read.celfile.header(filenames[i], info = "full")
5: FUN(1:13[[11L]], ...)
4: lapply(X = X, FUN = FUN, ...)
3: sapply(seq_len(length(filenames)), function(i) {
sdate <- read.celfile.header(filenames[i], info =
"full")[["ScanDate"]]
if (is.null(sdate) || length(sdate) == 0)
NA_character_
else sdate
})
2: read.affybatch(filenames = l$filenames, phenoData = l$phenoData,
description = l$description, notes = notes, compress =
compress,
rm.mask = rm.mask, rm.outliers = rm.outliers, rm.extra =
rm.extra,
verbose = verbose, sd = sd, cdfname = cdfname)
1: ReadAffy()
May be there is a problem in reading the cel file header, so I opened
one of the cel files with a text-editor but it seems correct.
Can anyone help me?
Thank you very much!
Manuela
----------------------------------------------------------------------
------------------
Manuela Di Russo, Ph.D. Student
Department of Experimental Pathology, MBIE
University of Pisa
Pisa, Italy
e-mail: manuela.dirusso at for.unipi.it
tel: +39050993538
[[alternative HTML version deleted]]
------------------------------
Message: 35
Date: Fri, 10 Feb 2012 11:55:47 +0100
From: James Perkins <jperkins@biochem.ucl.ac.uk>
To: Pepijn Kooij <pkooij at="" bio.ku.dk="">
Cc: bioconductor at r-project.org
Subject: Re: [BioC] NormqPCR and ReadqPCR
Message-ID:
<caljpjm2ojypf10h-don1_xhkrrgutt1e2uoiwzt15e_uwkvkna at="" mail.gmail.com="">
Content-Type: text/plain
HI Pepijn
Thanks for your interest.
>My question now is, is there a way or package to go further from
this? As
in, I am interested to see which of the genes have a foldchange above
a
threshhold of 1, is there a way to see that?
By a "foldchange threshold of 1", do you mean a doubling or halving?
i.e. |logFC | > 1?
If so you can find everything in your ddCt output data.frame that
meets this criteria and look at only these genes.
ddCtRes[abs(log2(as.numeric(as.vector(ddCtRes$`2^-ddCt`)))) > 1,]
For the gene IDs alone this will work:
ddCtRes[abs(log2(as.numeric(as.vector(ddCtRes$`2^-ddCt`)))) > 1,"ID"]
This is assuming you have done something with the undetermined values,
if you have "+"s and "-"s for your 2^ddCt values it gets a bit more
involved.
>And, is it possible to plot the 2^??????Ct values in a bargraph with
SD's or
SE's?
I hope you can help me out with this,
It is possible, you could use something like barplot2 from gplots and
plot the min 2^-ddCt.min and max values 2^-ddCt.max as upper and lower
confidence intevals. The min and max values are calculated along the
lines of http://www.ncbi.nlm.nih.gov/pubmed/11846609, and the error
bars will be assymetric if you plot the x axis in non-log space, since
the Ct values from which SD worked out on arithmetic scale but then
gets 2^ddCT+/-SD.
arguments to barplot2 something like:
plot.ci=TRUE, ci.l=MIN2^DDCTS, ci.u=MIN2^DDCTS,
Hope this helps, let me know if you have any issues with the plotting,
and if you want to send me a more concrete example I can give some
more focused code.
Cheers,
Jim
On 8 February 2012 16:32, Pepijn Kooij <pkooij at="" bio.ku.dk=""> wrote:
> Dear all
>
> My name is Pepijn Kooij, PhD student at the Centre for Social
Evolution,
> Copenhagen University.
> I was grateful to find the ReadqPCR and NormqPCR packages to analyze
my
> rt-qPCR data. I managed to use both packages, loading in my data and
> normalizing it. In the end using the deltaDeltaCt() function I am
able to
> retrieve the table with all the data I need.
> My question now is, is there a way or package to go further from
this? As
> in, I am interested to see which of the genes have a foldchange
above a
> threshhold of 1, is there a way to see that?
> And, is it possible to plot the 2^??????Ct values in a bargraph with
SD's or
> SE's?
> I hope you can help me out with this,
>
> Best regards
> Pepijn Kooij
> _______________________________________
> Pepijn Kooij, PhD student
> Centre for Social Evolution, Department of Biology
> University of Copenhagen
> Universitetsparken 15, bygning 12
> 2100 Copenhagen ??
> Denmark
>
> Tel: (+45) 35 32 13 41
> Email: pkooij at bio.ku.dk
> CSE website: http://www.bi.ku.dk/cse
>
>
> [[alternative HTML version deleted]]
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
------------------------------
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
End of Bioconductor Digest, Vol 108, Issue 10