SRAmetadb Bioconductor package; study record count low for 2013
@al-nasir-jamie-2012-6588
Last seen 10.2 years ago
Hello,

I have been looking at the SRA (Sequence Read Archive) SQLite database provided as a Bioconductor package for R.

My question concerns top-level studies, which are found in the study table and dated in the submission table. Why are there so few entries for top-level studies in 2013 compared with 2011 and 2012?

The SQL queries I have written, joining the submission table and the study table in order to obtain the submission_date, yield the following counts of top-level studies by year:

2005|64
2006|38
2007|94
2008|269
2009|893
2010|2631
2011|4077
2012|5208
2013|724

As one can see, the number of studies in the metadata falls off in 2013. I have been using the SRAdb Bioconductor SQLite database, which has a creation timestamp of 2013-12-03 08:29:26 in the metaInfo table.

I would really appreciate any useful thoughts on this.

Best regards,
Jamie

Jamie Al-Nasir MPharm (Hons)
Department of Computer Science
Centre for Systems and Synthetic Biology
Mobile: +44 (0)759 4800 229
Web: http://jamie.al-nasir.com/
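For readers who want to reproduce this kind of count, here is a minimal sketch of a per-year join, run against a tiny in-memory stand-in rather than the real SRAmetadb file. The table and column names (study, submission, submission_accession, submission_date) are assumed to match the SRAdb schema, and the accessions and dates are made up for illustration. Note that rows with a NULL submission_date land in their own bucket instead of in any year.

```python
import sqlite3

# Toy stand-in for the SRAmetadb SQLite file; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submission (
    submission_accession TEXT PRIMARY KEY,
    submission_date      TEXT
);
CREATE TABLE study (
    study_accession      TEXT PRIMARY KEY,
    submission_accession TEXT
);
INSERT INTO submission VALUES
    ('SRA000001', '2012-03-14'),
    ('SRA000002', '2012-11-02'),
    ('SRA000003', '2013-01-20'),
    ('SRA000004', NULL);          -- a submission with no recorded date
INSERT INTO study VALUES
    ('SRP000001', 'SRA000001'),
    ('SRP000002', 'SRA000002'),
    ('SRP000003', 'SRA000003'),
    ('SRP000004', 'SRA000004');
""")

# Count studies per submission year; strftime returns NULL for NULL dates,
# so undated studies are grouped together rather than dropped.
QUERY = """
SELECT strftime('%Y', s.submission_date) AS year, count(*) AS n_studies
FROM study AS st
JOIN submission AS s
  ON st.submission_accession = s.submission_accession
GROUP BY year
ORDER BY year
"""
counts_by_year = conn.execute(QUERY).fetchall()
for year, n in counts_by_year:
    print(year, n)
```

A large NULL bucket in such a result would suggest that the studies exist but are undated, rather than missing from the database.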
SRAdb SRAdb • 1.8k views
Jack Zhu ▴ 170
@jack-zhu-3338
Last seen 7.1 years ago
Hi all,

Regarding the studies missing by submission_date for 2013 and 2014 in the SRAdb SQLite database, I did some investigation and found the reason. The metadata in SRAdb is mainly parsed from the XML files of the SRA submissions, and that is true of the submission table as well. However, quite a few submission XML files don't have a submission date, e.g.

ftp://ftp-trace.ncbi.nih.gov/sra/Submissions/SRA157/SRA157949/

SRA157949.experiment.xml
SRA157949.submission.xml

So it seems all the study and submission records are there, but some submission records just don't have a submission date. I am looking into the possibility of adding dates for those records.

Jamie, thanks for the finding and I will keep you updated.

Jack

On Fri, Jun 6, 2014 at 3:49 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> Hi, Jack.
>
> I took a look at this and it does appear that the number of
> submissions is very low for 2013. Also, there are no 2014 submissions
> listed that I could find. This was using the June 1, 2014 sqlite
> file.
>
> Sean
>
> ---------- Forwarded message ----------
> From: Al-Nasir, Jamie (2012) <jamie.al-nasir.2012 at live.rhul.ac.uk>
> Date: Thu, Jun 5, 2014 at 2:20 PM
> Subject: [BioC] SRAmetadb Bioconductor package; study record count low for 2013
> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Cc: "Shanahan, Hugh" <hugh.shanahan at rhul.ac.uk>
>
> [...]
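To illustrate the cause described above, the following is a rough sketch of how one might check whether a submission XML carries a date. It assumes the date is stored as a `submission_date` attribute on a `SUBMISSION` element; the two XML snippets and the helper name `submission_date` are hypothetical stand-ins, not the contents of the real SRA157949 files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified submission XML documents.
WITH_DATE = '<SUBMISSION accession="SRA000001" submission_date="2012-03-14T10:00:00Z"/>'
WITHOUT_DATE = '<SUBMISSION accession="SRA000002"/>'


def submission_date(xml_text):
    """Return the submission_date attribute, or None if it is absent."""
    root = ET.fromstring(xml_text)
    # The SUBMISSION element may be the root or nested inside a wrapper.
    node = root if root.tag == "SUBMISSION" else root.find(".//SUBMISSION")
    if node is None:
        return None
    return node.get("submission_date")


for label, doc in (("with date", WITH_DATE), ("without date", WITHOUT_DATE)):
    print(label, "->", submission_date(doc))
```

A parser that copies this attribute straight into the submission table would leave submission_date empty for the second document, which matches the symptom reported in this thread.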
Hi Jamie and all,

By modifying my code and pulling data from a curated table (SRA_Accessions.tab) from the SRA, I think the missing 'submission_date' values in the submission table have been fixed:

  strftime('%Y', s.submission_date) count(*)
1                              2008      348
2                              2009     1260
3                              2010     2865
4                              2011     4276
5                              2012     6606
6                              2013    15309
7                              2014     7706

Please let me know if you still see any problems or have any questions.

Thanks.

Jack

----
Yuelin Jack Zhu
Genetics Branch/CCR/NCI/NIH
Tel: (301)496-4527
FAX: (301) 402-3241
E-mail: zhujack at mail.nih.gov

On Sun, Jun 8, 2014 at 11:11 AM, Jack Zhu <zhujack at mail.nih.gov> wrote:
> [...]
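The kind of backfill described above might look roughly like the sketch below. This is not Jack's actual code: the column names `Accession`, `Type` and `Received` are assumptions about the layout of SRA_Accessions.tab, the thread does not say which column supplied the dates, and all rows are invented. Only submissions whose date is still NULL are updated.

```python
import csv
import io
import sqlite3

# Invented excerpt standing in for SRA_Accessions.tab (tab-separated).
TAB = (
    "Accession\tType\tReceived\n"
    "SRA000002\tSUBMISSION\t2013-05-01T09:30:00Z\n"
    "SRR000009\tRUN\t2013-05-02T00:00:00Z\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submission ("
    "submission_accession TEXT PRIMARY KEY, submission_date TEXT)"
)
conn.executemany(
    "INSERT INTO submission VALUES (?, ?)",
    [("SRA000001", "2012-03-14"), ("SRA000002", None)],
)

# Collect (date, accession) pairs for submission rows only,
# keeping just the YYYY-MM-DD part of the timestamp.
reader = csv.DictReader(io.StringIO(TAB), delimiter="\t")
updates = [
    (row["Received"][:10], row["Accession"])
    for row in reader
    if row["Type"] == "SUBMISSION"
]

# Fill in dates only where the parsed XML left them empty.
conn.executemany(
    "UPDATE submission SET submission_date = ? "
    "WHERE submission_accession = ? AND submission_date IS NULL",
    updates,
)
conn.commit()

rows = conn.execute(
    "SELECT submission_accession, submission_date "
    "FROM submission ORDER BY submission_accession"
).fetchall()
print(rows)
```

Restricting the update to NULL dates keeps any date already parsed from the submission XML and uses the curated table purely as a fallback.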