RE : RE : RE : maping SNPs

0

Entering edit mode

SimonNoël ▴ 450

@simonnoel-3455

Last seen 10.2 years ago

Hi, You say that one of the warning was because I was only having SNPs from chr1. I decided to add the 2 first SNPs I have from chrd but I get a new error... > myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs10915459", "rs16838750", "rs12128230", "rs4637157", "rs11900053", "rs999999999") > mysnps <- makeGRangesFromRefSNPids(myids) Errorin solveUserSEW0(start = start, end = end, width = width) : solving row 8: range cannot be determined from the supplied arguments (too many NAs) In addition: Warning messages: 1: In ans_locs[!is.na(myrows)] <- locs$loc[myrows] : number of items to replace is not a multiple of replacement length 2: In ans_locs[!is.na(myrows)] <- locs$loc[myrows] : number of items to replace is not a multiple of replacement length But if I only try with the 2 first SNPs on chr2, I have > myids <- c("rs4637157", "rs11900053", "rs999999999") > mysnps <- makeGRangesFromRefSNPids(myids) Warning message: In ans_locs[!is.na(myrows)] <- locs$loc[myrows] : number of items to replace is not a multiple of replacement length > mysnps GRangeswith 3 ranges and 1 elementMetadata value seqnames ranges strand | RefSNP_id <rle> <iranges> <rle> | <character> [1] ch2 [29443, 29443] * | rs4637157 [2] ch2 [36787, 36787] * | rs11900053 [3] unknown [ 0, 0] * | rs999999999 seqlengths ch2 unknown NA NA So is that mean that I will have to go chr by chr and split my big file? Now for the problem of changing ch1 to chr1 > seqnames(mysnps) 'factor' Rle of length 8 with 2 runs Lengths: 7 1 Values : ch1 unknown Levels(2): ch1 unknown > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) > seqnames(mysnps) 'factor' Rle of length 8 with 2 runs Lengths: 7 1 Values : chr1 unknown Levels(2): chr1 unknown > map <- as.matrix(findOverlaps(mysnps, tx)) Message d'avis : In .local(query, subject, maxgap, minoverlap, type, select, ...) : Only some seqnames from 'query' and 'subject' were not identical > mapExon <- as.matrix(findOverlaps(mysnps, txExon)) Message d'avis : In .local(query, subject, maxgap, minoverlap, type, select, ...) : Only some seqnames from 'query' and 'subject' were not identical > > mapped_genes <- values(tx)$gene_id[map[, 2]] > mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) > snp2gene <- unique(data.frame(snp_id=mapped_snps, gene_id=unlist(mapped_genes))) > rownames(snp2gene) <- NULL > snp2gene[1:4, ] snp_id gene_id 1 rs7547453 6497 2 rs2840542 79906 3 rs1999527 63976 4 rs4648545 7161 So now it's working on my computer:) but I am only able to do SNPs from one chromosome as I say. On the super computer, it still doesn't work and on my computer, it still taking a lot of time. What isn't working is > txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") Downloadthe refGene table ... OK Downloadthe refLink table ... OK Extractthe 'transcripts' data frame ... OK Extractthe 'splicings' data frame ... OK Downloadand preprocess the 'chrominfo' data frame ... OK Preparethe 'metadata' data frame ... OK Makethe TranscriptDb object ... Error in .writeMetadataTable(conn, metadata) : subscript out of bounds In addition: There were 50 or more warnings (use warnings() to see the first 50) Simon No??l CdeC

GO TranscriptDb GO TranscriptDb • 1.2k views

ADD COMMENT • link updated 13.9 years ago by Hervé Pagès 16k • written 13.9 years ago by SimonNoël ▴ 450

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 11 hours ago

Seattle, WA, United States

Hi Simon, First thanks to Valerie for providing the Windows binary of SNPlocs.Hsapiens.dbSNP.20101109. More below... On 01/11/2011 01:18 PM, Simon No?l wrote: > > Hi, > > > You say that one of the warning was because I was only having SNPs from > chr1. I decided to add the 2 first SNPs I have from chrd but I get a new > error... > > > myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", > "rs10915459", "rs16838750", "rs12128230", "rs4637157", "rs11900053", > "rs999999999") > > mysnps<- makeGRangesFromRefSNPids(myids) > Errorin solveUserSEW0(start = start, end = end, width = width) : > solving row 8: range cannot be determined from the supplied arguments > (too many NAs) > In addition: Warning messages: > 1: In ans_locs[!is.na(myrows)]<- locs$loc[myrows] : > number of items to replace is not a multiple of replacement length > 2: In ans_locs[!is.na(myrows)]<- locs$loc[myrows] : > number of items to replace is not a multiple of replacement length > > But if I only try with the 2 first SNPs on chr2, I have > > > myids<- c("rs4637157", "rs11900053", "rs999999999") > > mysnps<- makeGRangesFromRefSNPids(myids) > Warning message: > In ans_locs[!is.na(myrows)]<- locs$loc[myrows] : > number of items to replace is not a multiple of replacement length > > mysnps > GRangeswith 3 ranges and 1 elementMetadata value > seqnames ranges strand | RefSNP_id > <rle> <iranges> <rle> |<character> > [1] ch2 [29443, 29443] * | rs4637157 > [2] ch2 [36787, 36787] * | rs11900053 > [3] unknown [ 0, 0] * | rs999999999 > seqlengths > ch2 unknown > NA NA > > So is that mean that I will have to go chr by chr and split my big file? No it just means that my initial makeGRangesFromRefSNPids() was broken :-/ Please try with this one instead: makeGRangesFromRefSNPids <- function(myids, verbose=FALSE) { ans_seqnames <- character(length(myids)) ans_seqnames[] <- "unknown" ans_locs <- integer(length(myids)) for (seqname in names(getSNPcount())) { if (verbose) cat("Processing ", seqname, " SNPs ...\n", sep="") locs <- getSNPlocs(seqname) ids <- paste("rs", locs$RefSNP_id, sep="") myrows <- match(myids, ids) hit_idx <- !is.na(myrows) ans_seqnames[hit_idx] <- seqname ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]] } GRanges(seqnames=ans_seqnames, IRanges(start=ans_locs, width=1), RefSNP_id=myids) } Since this function ends up loading all the SNPs in memory (and there are 23 millions of them) you need a machine with at least 2.5 or 3 GB of RAM (although makeGRangesFromRefSNPids() could easily be improved to use less memory). > > Now for the problem of changing ch1 to chr1 > > > seqnames(mysnps) > 'factor' Rle of length 8 with 2 runs > Lengths: 7 1 > Values : ch1 unknown > Levels(2): ch1 unknown > > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > > seqnames(mysnps) > 'factor' Rle of length 8 with 2 runs > Lengths: 7 1 > Values : chr1 unknown > Levels(2): chr1 unknown > > map<- as.matrix(findOverlaps(mysnps, tx)) > Message d'avis : > In .local(query, subject, maxgap, minoverlap, type, select, ...) : > Only some seqnames from 'query' and 'subject' were not identical > > mapExon<- as.matrix(findOverlaps(mysnps, txExon)) > Message d'avis : > In .local(query, subject, maxgap, minoverlap, type, select, ...) : > Only some seqnames from 'query' and 'subject' were not identical This warning is caused by the presence of the fake "unknown" sequence in 'mysnps' so you can safely ignore it. I agree that it might not be totally clear what this warning is about or why we need such warning in the first place. In BioC devel, the warning message has been slightly modified (in the hope that it would be a little bit more explicit) to something like: Warning message: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other: - in 'x': unknown - in 'y': chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9... Make sure to always combine/compare objects based on the same reference genome (use suppressWarnings() to suppress this warning). > > > > mapped_genes<- values(tx)$gene_id[map[, 2]] > > mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], > elementLengths(mapped_genes)) > > snp2gene<- unique(data.frame(snp_id=mapped_snps, > gene_id=unlist(mapped_genes))) > > rownames(snp2gene)<- NULL > > snp2gene[1:4, ] > snp_id gene_id > 1 rs7547453 6497 > 2 rs2840542 79906 > 3 rs1999527 63976 > 4 rs4648545 7161 > So now it's working on my computer:) but I am only able to do SNPs from one > chromosome as I say. > > On the super computer, it still doesn't work and on my > computer, it still taking a lot of time. What isn't working is > > > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > Downloadthe refGene table ... OK > Downloadthe refLink table ... OK > Extractthe 'transcripts' data frame ... OK > Extractthe 'splicings' data frame ... OK > Downloadand preprocess the 'chrominfo' data frame ... OK > Preparethe 'metadata' data frame ... OK > Makethe TranscriptDb object ... Error in .writeMetadataTable(conn, > metadata) : subscript out of bounds > In addition: There were 50 or more warnings (use warnings() to see the first > 50) Probably because some of the packages need to be updated on this machine (as you are already aware). See my sessionInfo below. Cheers, H. R version 2.12.0 (2010-10-15) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8 [7] LC_PAPER=en_US.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.2.3 [2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2 [3] GenomicRanges_1.2.3 [4] IRanges_1.8.8 loaded via a namespace (and not attached): [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.2 BSgenome_1.18.2 [5] DBI_0.2-5 RCurl_1.5-0 RSQLite_0.9-4 rtracklayer_1.10.6 [9] XML_3.2-0 > > > Simon No??l > CdeC > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hi As you can see, the package GenomicRanges and IRanges don't have the same version as yours on the supercomputer. I will ask our informaticien to update them to the version you have and will see if now makeTranscriptDbFromUCSC works. I will let you know of any positive or negative result. > sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C attachedbase packages: [1] stats graphics grDevices utils datasets methods base otherattached packages: [1] GenomicFeatures_1.2.3 [2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2 [3] GenomicRanges_1.2.2 [4] IRanges_1.8.7 loadedvia a namespace (and not attached): [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.2 BSgenome_1.18.2 [5] DBI_0.2-5 RCurl_1.5-0 RSQLite_0.9-4 rtracklayer_1.10.6 [9] tools_2.12.0 XML_3.2-0 Simon No?l CdeC ________________________________________ De : Hervé Pagès [hpages at fhcrc.org] Date d'envoi : 11 janvier 2011 21:05 ? : Simon No?l Cc : Valerie Obenchain; bioconductor at r-project.org Objet : Re: [BioC] RE : RE : RE : maping SNPs > On the super computer, it still doesn't work and on my > computer, it still taking a lot of time. What isn't working is > > > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > Downloadthe refGene table ... OK > Downloadthe refLink table ... OK > Extractthe 'transcripts' data frame ... OK > Extractthe 'splicings' data frame ... OK > Downloadand preprocess the 'chrominfo' data frame ... OK > Preparethe 'metadata' data frame ... OK > Makethe TranscriptDb object ... Error in .writeMetadataTable(conn, > metadata) : subscript out of bounds > In addition: There were 50 or more warnings (use warnings() to see the first > 50) Probablybecause some of the packages need to be updated on this machine (as you are already aware). See my sessionInfo below. Cheers, H. R version 2.12.0 (2010-10-15) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8 [7] LC_PAPER=en_US.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attachedbase packages: [1] stats graphics grDevices utils datasets methods base otherattached packages: [1] GenomicFeatures_1.2.3 [2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2 [3] GenomicRanges_1.2.3 [4] IRanges_1.8.8 loadedvia a namespace (and not attached): [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.2 BSgenome_1.18.2 [5] DBI_0.2-5 RCurl_1.5-0 RSQLite_0.9-4 rtracklayer_1.10.6 [9] XML_3.2-0 > > > Simon No??l > CdeC > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O.Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.9 years ago SimonNoël ▴ 450

0

Entering edit mode

On 01/12/2011 10:44 AM, Simon No?l wrote: > Hi > > As you can see, the package GenomicRanges and IRanges don't have the same version as yours on the supercomputer. I will ask our informaticien to update them to the version you have and will see if now makeTranscriptDbFromUCSC works. Another option if you can't wait is that you make this TranscriptDb object on another computer (where the BioC 2.7 installation is up to date), save it to an SQLite file with saveFeatures(), move the file to the supercomputer, and load it from there with loadFeatures(). Cheers, H. > I will let you know of any positive or negative result. > >> sessionInfo() > R version 2.12.0 (2010-10-15) > Platform: x86_64-pc-linux-gnu (64-bit) > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > attachedbase packages: > [1] stats graphics grDevices utils datasets methods base > otherattached packages: > [1] GenomicFeatures_1.2.3 > [2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2 > [3] GenomicRanges_1.2.2 > [4] IRanges_1.8.7 > loadedvia a namespace (and not attached): > [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.2 BSgenome_1.18.2 > [5] DBI_0.2-5 RCurl_1.5-0 RSQLite_0.9-4 rtracklayer_1.10.6 > [9] tools_2.12.0 XML_3.2-0 > > > Simon No?l > CdeC > ________________________________________ > De : Hervé Pagès [hpages at fhcrc.org] > Date d'envoi : 11 janvier 2011 21:05 > ? : Simon No?l > Cc : Valerie Obenchain; bioconductor at r-project.org > Objet : Re: [BioC] RE : RE : RE : maping SNPs > >> On the super computer, it still doesn't work and on my >> computer, it still taking a lot of time. What isn't working is >> >> > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") >> Downloadthe refGene table ... OK >> Downloadthe refLink table ... OK >> Extractthe 'transcripts' data frame ... OK >> Extractthe 'splicings' data frame ... OK >> Downloadand preprocess the 'chrominfo' data frame ... OK >> Preparethe 'metadata' data frame ... OK >> Makethe TranscriptDb object ... Error in .writeMetadataTable(conn, >> metadata) : subscript out of bounds >> In addition: There were 50 or more warnings (use warnings() to see the first >> 50) > > Probablybecause some of the packages need to be updated on this machine > (as you are already aware). See my sessionInfo below. > > Cheers, > H. > > R version 2.12.0 (2010-10-15) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C > [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8 > [7] LC_PAPER=en_US.utf8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C > > attachedbase packages: > [1] stats graphics grDevices utils datasets methods base > > otherattached packages: > [1] GenomicFeatures_1.2.3 > [2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2 > [3] GenomicRanges_1.2.3 > [4] IRanges_1.8.8 > > loadedvia a namespace (and not attached): > [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.2 > BSgenome_1.18.2 > [5] DBI_0.2-5 RCurl_1.5-0 RSQLite_0.9-4 > rtracklayer_1.10.6 > [9] XML_3.2-0 > > >> >> >> Simon No??l >> CdeC >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O.Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.9 years ago Hervé Pagès 16k

Login before adding your answer.