Changes between Version 5 and Version 6 of Report

Timestamp:
2009/07/02 22:51:43 (12 years ago)
Author:
ktym
Comment:

--

  • Report

    v5 v6  
    11= !BioHackathon 2009 meeting report = 
    22 
     3[[PageOutline]] 
     4 
    35== Work groups == 
    46 
    5  * 1 Kiyoko Kinoshita - [wiki:SatelliteGlycoBiology] 
    6  * 2 Naohisa Goto - [wiki:SatelliteBioRuby] 
    7  * 3 Alberto Labarga - [wiki:SatelliteLiterature] 
    8  * 4 Takeshi Kawashima - [wiki:SatelliteUseCases] 
    9  * 5 Riu Yamashita - [wiki:SatelliteTranscriptionRegulation] 
    10  * 6 Akira Kinjo - [wiki:DDBJ-KEGG-PDBj] 
    11  * 7 Kazuharu Arakawa - [wiki:SatelliteG-language] (DONE) 
     7=== 1 Kiyoko Kinoshita - [wiki:SatelliteGlycoBiology] (DONE) === 
     8 
     9The satellite group on glycobiology focused on the development of workflows through the use of 
     10Taverna, which relied on the development of web services provided by the following glycobiology- 
     11related sites: 
     12 
     13 * GlycoEpitopeDB [http://www.glyco.is.ritsumei.ac.jp/epitope/] 
     14 * RINGS [http://rings.t.soka.ac.jp/] 
     15 * Consortium for Functional Glycomics (CFG) [http://www.functionalglycomics.org] 
     16 * GlycomeDB [http://www.glycome-db.org] 
     17 * (GlycoGeneDB) [http://riodb.ibase.aist.go.jp/rcmg/ggdb/] 
     18 * H-InvDB [http://h-invitational.jp/] 
     19 
     20There were also some use cases that were suggested, which were incorporated into the list of targets 
     21for the group.  (Target 1) The development and registration of web services on BioMoby such that glycobiologists  
     22could create workflows for glycobiology research.  In particular, since the participants of the BioHackathon 
     23this year were developers of GlycoEpitopeDB and RINGS, the following targets were set.  For GlycoEpitopeDB, 
     24web services to perform the following were to be developed: a) Query GlycoEpitope DB using a keyword and  
     25retrieve all IDs of entries containing the keyword, and b) retrieve glycan structures in IUPAC format using  
     26GlycoEpitope IDs.  In order to handle the data returned from these services, an additional web service was 
     27to be developed in RINGS to convert a glycan structure in IUPAC format into KCF format, by which glycan  
     28structure queries could be made.  (Target 2) Create a workflow for analyzing glyco-gene-related diseases:   
     29Use OMIM to search for diseases related to loci of human homologs of target glyco-genes from other species,  
     30such as fruit fly, and retrieve any SNPs that are known to be related. 
     31 
     32In summary, the programming took longer than expected, but the group was able to create BioMoby web services  
     33in GlycoEpitopeDB called getGlycoEpitopeIDfromKeyword and getIUPACfromGlycoEpitopeID, which could be 
     34connected to the RINGS web service called getKCFfromIUPAC.  The resulting workflow took as input a keyword  
     35and could search GlycoEpitopeDB for entries matching the keyword.  The retrieved entries could then be used to 
     36obtain the glycan structures in IUPAC format.  The RINGS web service then could convert the data from IUPAC to  
     37KCF format, by which the KCaM alignment algorithm could be executed to retrieve similar glycan structures to the 
     38converted ones in KCF format.  However, a parser still needs to be developed to retrieve the KEGG GLYCAN IDs  
     39along with their scores from the GlycomicsObject. 
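
The chaining of the three services described above can be sketched in Python as follows. This is only an illustration of the data flow, not the Taverna workflow itself: the three callables stand in for getGlycoEpitopeIDfromKeyword, getIUPACfromGlycoEpitopeID and getKCFfromIUPAC, and the stub implementations and their return values are made up for the example.

```python
def keyword_to_kcf(keyword, get_ids, get_iupac, get_kcf):
    """Chain three services: keyword -> entry IDs -> IUPAC -> KCF.

    The callables are injected so that real web-service clients
    (or test stubs, as below) can be plugged in.
    """
    results = {}
    for entry_id in get_ids(keyword):
        iupac = get_iupac(entry_id)
        results[entry_id] = get_kcf(iupac)
    return results


# Hypothetical stubs; a real client would call the BioMoby services here.
def stub_get_ids(keyword):
    return ["ID1"] if keyword == "example" else []


def stub_get_iupac(entry_id):
    return f"iupac-for-{entry_id}"


def stub_get_kcf(iupac):
    return f"KCF<{iupac}>"


converted = keyword_to_kcf("example", stub_get_ids, stub_get_iupac, stub_get_kcf)
```

Passing the services in as arguments keeps the chaining logic independent of how each service is actually invoked.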
     40 
     41The disease-related workflow was able to retrieve H-Inv entries containing the OMIM IDs that were retrieved from  
     42OMIM for entries containing a particular keyword.  The issue was that the output was in XML format and contained  
     43entries which did not contain any SNPs.  A simple BeanShell script was thus developed to filter out the unnecessary 
     44results. 
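
The kind of filtering described above can be illustrated with a short Python sketch (the group's actual script was written in BeanShell and is not reproduced here). The element names `entry` and `snp`, and the sample document, are assumptions made for the example; the real H-InvDB output schema is not given in this report.

```python
import xml.etree.ElementTree as ET


def keep_entries_with_snps(xml_text, entry_tag="entry", snp_tag="snp"):
    """Return the XML with every top-level entry lacking a SNP element removed.

    entry_tag and snp_tag are hypothetical element names.
    """
    root = ET.fromstring(xml_text)
    # Iterate over a copy of the list because entries are removed in the loop.
    for entry in list(root.findall(entry_tag)):
        if entry.find(f".//{snp_tag}") is None:
            root.remove(entry)
    return ET.tostring(root, encoding="unicode")


# Made-up sample data: only the first entry carries a SNP.
SAMPLE = (
    "<results>"
    '<entry id="E1"><omim>0001</omim><snp id="S1"/></entry>'
    '<entry id="E2"><omim>0002</omim></entry>'
    "</results>"
)

filtered = keep_entries_with_snps(SAMPLE)
```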
     45 
     46As a result, the glycobiology satellite group successfully met the major targets set for the BioHackathon. 
     47Based on these experiences, the group members could build on the newly developed web services to further implement 
     48more complex utilities that would enable glycobiologists to efficiently analyze their data. 
     49 
     50=== 2 Naohisa Goto - [wiki:SatelliteBioRuby] (DONE) === 
     51 
     52We discussed how to handle big data from next-generation 
     53DNA sequencers using BioRuby.  For this purpose, a FASTQ 
     54format parser was implemented. 
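
To show what such a parser has to do, here is a minimal Python sketch of a reader for four-line FASTQ records. It is not the BioRuby implementation and does not reflect its API; it assumes unwrapped records (one sequence line and one quality line each) and, in `phred33`, the Phred+33 quality encoding, which is only one of the encodings in use.

```python
import io


def parse_fastq(handle):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def phred33(quality):
    """Decode a quality string, assuming the Phred+33 offset."""
    return [ord(char) - 33 for char in quality]


example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
records = list(parse_fastq(example))
```

Reading record by record through a generator keeps memory use constant, which matters for the large files produced by these sequencers.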
     55 
     56We also discussed milestones for the upcoming 
     57releases.  We agreed that we would soon release 1.3.1  
     58as a maintenance release with bug fixes and refactoring. 
     59After that, a major version upgrade with many new functions and 
     60improvements was planned. 
     61 
     62Quality improvement is one of the important issues for BioRuby. 
     63During the hackathon, we fixed some bugs, for example, problems 
     64found in the Bio::PubMed.esearch method and the Bio::Fasta::Report 
     65class.  In addition to the bug fixes, refactored classes 
     66for BioSQL support were also incorporated into the main repository. 
     67 
     68Lack of documentation, especially use-case documents, is also 
     69a big problem.  To add documents with little effort, 
     70we decided to translate BioPerl HOWTOs from Perl to Ruby. 
     71Because BioPerl HOWTOs are distributed under the terms of the 
     72Perl Artistic License, we can freely modify and distribute 
     73modified versions of them under the same license.  In addition, 
     74we will also write new HOWTOs from scratch for BioRuby-specific 
     75functions. 
     76 
     77 
     78=== 3 Alberto Labarga - [wiki:SatelliteLiterature] (DONE) === 
     79 
     80Although a significant portion of our knowledge about life sciences is 
     81stored in papers, the relationship between this knowledge and the 
     82information stored in existing biological databases is almost 
     83negligible. 
     84 
     85During the Literature Services meeting reported here, we aimed to 
     86investigate how to annotate atomic components of research papers in 
     87life sciences by combining automatic ontology-based and manual 
     88user-generated tags, and how such annotation could also facilitate the 
     89generation of networks of papers, based on an enriched set of 
     90metadata. 
     91 
     92Different web tools allow researchers to search literature databases 
     93and integrate semantic information extracted from text with external 
     94databases and ontologies. These include Whatizit 
     95(http://www.ebi.ac.uk/webservices/whatizit), iHop 
     96(http://www.ihop-net.org), Novoseek (http://www.novoseek.com/), 
     97Reflect (http://reflect.ws/), Concept Web Linker 
     98(http://www.knewco.com/), BioCreative MetaServer Platform, Allie and 
     99OReFiL. However, only some of them provide accessible APIs so that users 
     100can build their own text mining pipelines. 
     101 
     102During the Biohackathon, we explored BioCreative MetaServer Platform, 
     103Whatizit and iHop web services, and developed several clients and 
     104workflows based on them. Also, the new information extraction XML 
     105format (ieXML) proposed by the European Bioinformatics Institute to 
     106standardize the annotation of named entities was reviewed 
     107(http://www.ebi.ac.uk/Rebholz-srv/IeXML/), and both iHop2ieXML and 
     108BCMS2ieXML mappings based on XSLT were proposed. 
     109 
     110Besides automatic annotation using controlled vocabularies, the 
     111notion of community annotation has recently started to be adopted 
     112within the biomedical community. For instance, WikiProteins delivers an 
     113environment for the annotation of proteins. Another example that 
     114illustrates the usefulness of harnessing the collective intelligence 
     115is BIOWiki; this collaborative ontology annotation and curation 
     116framework facilitates the engagement of the community with the sole 
     117purpose of improving an ontology. Similar to BIOWiki, Bioportal also 
     118makes it possible for the community to define mappings across 
     119ontologies, thus making it possible for domain experts to generate new 
     120relationships that are then evaluated by the community. 
     121 
     122During the BioHackathon, we reviewed different solutions for such 
     123collaborative annotation environments, focused on extracting 
     124supporting statements for biological facts in literature, the 
     125most relevant being WiredMarker (http://www.wired-marker.org), XConc 
     126(http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi?page=XConc+Suite) 
     127, (http://a.kazusa.or.jp/) and Bionotate (http://bionotate.). 
     128 
     129Furthermore, although papers implicitly contain Friend Of A Friend 
     130(FOAF) profiles (author name, affiliation, email, interests, etc.), 
     131digital libraries provide very little support for interaction. We 
     132explored the architecture and functionality of scifoaf, and the need 
     133for unique IDs (URIs) for authors in publications was pointed out as a key 
     134requirement for any really useful system to work. 
     135 
     136Within the context of digital libraries in life sciences and 
     137annotation, this could be further improved if authors are allowed to 
     138find who is working on similar diseases, molecules, biomaterials, and 
     139other valid bio-related terminology. In this manner, the network 
     140becomes more concept centric, so further integration of automatic 
     141annotation in the collaborative annotation tools is required. The 
     142concept of a living document, one of the selected proposals for the 
     143Elsevier Grand Challenge, was presented as a suitable framework for 
     144this kind of integration. 
     145 
     146 * Rebholz-Schuhmann, D., et al., Text processing through Web Services: Calling Whatizit. Bioinformatics, 2007. 24(2). 
     147 * Labarga et al. Web Services at the European Bioinformatics Institute. Nucleic Acids Research, 2007, Vol. 35, No. suppl_2 W6-W11 
     148 * Mons, B., et al., Calling on a million minds for community annotation in  WikiProteins. Genome Biology 2008. 
     149 * Backhaus, M. and J. Kelso. BIOWiki - a collaborative annotation and ontology curation framework. in 16th International World Wide Web Conference. 2007. Banff, Alberta, Canada. 
     150 * Rubin, D.L., et al. BioPortal: A Web Portal to Biomedical Ontologies. in AAAI Spring Symposium Series, Symbiotic Relationships between Semantic Web and Knowledge Engineering. 2008: Stanford University Press. 
     151 * Garcia-Castro, et al. Semantic Web and Social Web heading towards a Living Document in Life Sciences. Journal of Web Semantics, 2009, accepted 
     152 
     153 
     154=== 4 Takeshi Kawashima - [wiki:SatelliteUseCases] (DONE) === 
     155 
     156In order to handle the large data sets produced by new-generation sequencers, each of the major genome institutes is developing its own data-analysis tools and constructing its own pipelines. At the same time, because sequencing costs have decreased and several large sequencing centers have started automated customer services, a variety of small labs can now plan sequencing projects as well. Since computational analyses after sequencing can take many directions, such small labs often face difficulties when handling their large data sets, despite the trend toward the unification of sequencing facilities. On account of this, collaborations between these "small" labs and bioinformatics labs are expected to increase. Useful free and open source software for genome analyses is required for such collaborations. We held a satellite meeting on "UseCase" at BioHackathon 2009 to discuss how to easily build pipelines adapted to collaborators' requests. 
     157 
     158 
     159In the "Satellite meeting for UseCase", the developers and users of bioinformatics tools gathered and exchanged views on their current issues. Developers of the following five tools participated in the meeting: BioMart, Galaxy, jORCA, ANNOTATOR and TogoDB. On the users' side, the participants were mainly genome biologists with developmental, evolutionary, genetic or medical interests.  
     160Of special note for this meeting was the high interactivity between the two groups of researchers.  The meeting mainly consisted of two parts: an introduction from each "developer" and a discussion part with the "users". 
     161 
     162In the introductions, each developer explained what their program can do. In the discussion part, the users passed their own data to the developers, who then demonstrated the usability of their tools as they had explained. By showing the usage of these programs in front of the users' eyes, not only did the users learn clearly how to use the programs, but the developers also recognized which parts of the usage are difficult for users to understand. These steps were a good opportunity to bridge the gap between software developers and their users. 
     163 
     164As one concrete example, we made a pipeline to construct an annotation database for a small or medium-sized EST project.   
     165For this pipeline, 100,000 sequences from invertebrate organisms were annotated through ANNOTATOR, BioMart with blast2GO and KAAS. All of these data can then be published through TogoDB.  jORCA and Taverna can be used to connect these web services. The schematic view of this pipeline is shown in Fig. 1. 
     166 
     167[[Image(bh2009-usecase_1.pdf)]] 
     168 
     169=== 5 Riu Yamashita/Alberto Labarga - [wiki:SatelliteTranscriptionRegulation] (DONE) === 
     170 
     171Transcription regulation is one of the hot topics in biology, and many researchers are focusing on targets such as transcription factor binding sites, methylation, copy number variation, histone modification, and so on. Even though there are a number of useful tools to support such research, experimental researchers still need bioinformatics help. Therefore, we first picked up general experimental techniques and tried to find out what experimental researchers’ requests are. We then realized that there was no suitable web tool to predict transcriptional regulation caused by transcription factors or SNPs. 
     172To solve this, we have been constructing a new web server, “Churaumi”. Churaumi is based on a DAS server, and it has the FESD II DB for SNP data and DBTSS data for transcription start sites. It also has potential transcription factor binding sites (TFBSs) in promoter regions, and users can predict which TFBSs are overrepresented in their specific gene sets. 
     173 
     174 --- 
     175 
     176Biologists need to get more detailed information about what is 
     177happening at the gene level when they look at functional genomics data such 
     178as microarray data. Besides differential expression or network 
     179analysis, further regulatory information such as enrichment analysis 
     180of transcription factor binding sites and transcription factor binding 
     181site SNPs is available and could easily be integrated in the analysis. 
     182This integration could also be extended to existing knowledge about 
     183histone binding, siRNA or miRNA that control function, methylation, 
     184copy number variation, etc. 
     185 
     186The objective of the meeting was to design a unified solution tool to 
     187integrate the information that biologists may need when they look at 
     188their functional studies, and identify feasible techniques and 
     189economic solutions for them. 
     190 
     191We decided to start with the problem of analysing TFBSs enriched in 
     192differentially expressed genes from microarray data, and explore 
     193possible variations in the genomic sequence that could explain the 
     194differences in expression. 
     195 
     196The base for the integrated system was the Functional Element SNP 
     197database (FESD II) which is a web-based system for selecting sets of 
     198SNPs in putative functional elements in human genes. It provides sets 
     199of SNPs located in 10 different functional elements: promoter regions, 
     200CpG islands, 5'UTRs (untranslated regions), translation start sites, 
     201splice sites, coding exons, introns, translation stop sites, 
     202polyadenylation signals (PASes), and 3'UTRs. 
     203 
     204We decided to use the Distributed Annotation System (DAS) as the 
     205integration mechanism. DAS defines a communication protocol used to 
     206exchange annotations on genomic or protein sequences. DAS can be 
     207implemented as a client-server system in which a single client 
     208integrates information from multiple servers. It allows a single 
     209machine to gather up sequence annotation information from multiple 
     210distant web sites, collate the information, and display it to the user 
     211in a single view. Since it is heavily used in the genome 
     212bioinformatics community, open source clients and servers are 
     213available. 
     214 
     215The work started at the BioHackathon consisted of implementing a DAS 
     216layer for FESD II and the TFBS prediction systems and developing a web 
     217application that computes the enrichment of the TFBS for a list of 
     218genes or proteins, and presents the results using an Ajax DAS viewer. 
     219These developments are still work in progress. 
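
The report does not state which statistic the web application uses to compute TFBS enrichment. One common choice for this kind of over-representation analysis is the one-sided hypergeometric test, sketched below in Python as an assumption rather than a description of the actual implementation. All parameter names are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(population, population_hits, sample, sample_hits):
    """Probability of seeing at least `sample_hits` genes carrying a TFBS
    when `sample` genes are drawn without replacement from `population`
    genes, of which `population_hits` carry that TFBS.
    """
    upper = min(population_hits, sample)
    total = comb(population, sample)
    tail = sum(
        comb(population_hits, k) * comb(population - population_hits, sample - k)
        for k in range(sample_hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 5 with the site, 2 genes selected, both with the site.
p_value = hypergeom_enrichment_p(10, 5, 2, 2)
```

A small p-value suggests that the binding site occurs in the gene list more often than expected by chance; when many TFBSs are tested at once, a multiple-testing correction would normally follow.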
     220 
     221 * Jimenez, R, et al. Dasty2, an AJAX protein DAS client. Bioinformatics, 2008, 24(18):2119-2121. 
     222 
     223 
     224=== 6 Akira Kinjo - [wiki:DDBJ-KEGG-PDBj] (DONE) === 
     225 
     226A DDBJ-KEGG-PDBj workflow: from pathways to protein-protein interactions 
     227Objectives and Outline 
     228 
     229The objective of this satellite group is to examine the potential and obstacles of web services by implementing a real-life use case. The goal of the workflow is to enumerate possible physical protein-protein interactions among proteins in a biochemical pathway. More specifically, the workflow proceeds as follows. (1) The user provides a KEGG pathway ID. (2) Extract the protein sequence of each enzyme in the specified pathway. (3) For each protein sequence, run a BLAST search against the Swiss-Prot database. (4) Construct a phylogenetic profile (a species-by-enzyme matrix) by identifying the top hits for each protein and each species. (5) For each species in the phylogenetic profile, run BLAST searches for each protein sequence against PDB. (6) If two amino acid sequences (of the same species) have homologs in the same PDB entry, they are inferred to be in physical contact, and hence predicted to be an interacting pair. (7) Output image files highlighting the conserved and interacting proteins in the pathway map. 
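
Steps (4) and (6) above are the purely computational parts of the workflow and can be sketched in Python. This is an illustrative reconstruction, not the group's Java/Perl client: the input structures (tuples of BLAST hits and a mapping to sets of PDB entry IDs) and all identifiers in the toy data are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_profile(blast_hits):
    """Step (4): keep the top-scoring hit per (species, enzyme).

    blast_hits is an iterable of (enzyme, species, hit_id, score) tuples.
    Returns {species: {enzyme: hit_id}}, i.e. a species-by-enzyme matrix.
    """
    best = {}
    for enzyme, species, hit_id, score in blast_hits:
        key = (species, enzyme)
        if key not in best or score > best[key][1]:
            best[key] = (hit_id, score)
    profile = defaultdict(dict)
    for (species, enzyme), (hit_id, _score) in best.items():
        profile[species][enzyme] = hit_id
    return dict(profile)


def predict_interactions(pdb_hits):
    """Step (6): two enzymes of the same species that have homologs in
    the same PDB entry are predicted to interact.

    pdb_hits maps (species, enzyme) to a set of PDB entry IDs.
    Returns a set of (species, enzyme_a, enzyme_b) triples.
    """
    by_species = defaultdict(dict)
    for (species, enzyme), entries in pdb_hits.items():
        by_species[species][enzyme] = set(entries)
    pairs = set()
    for species, enzymes in by_species.items():
        for a, b in combinations(sorted(enzymes), 2):
            if enzymes[a] & enzymes[b]:  # shared PDB entry
                pairs.add((species, a, b))
    return pairs


# Made-up toy data.
toy_hits = [("E1", "sp1", "P1", 50.0), ("E1", "sp1", "P2", 80.0), ("E2", "sp1", "P3", 60.0)]
toy_pdb = {
    ("sp1", "E1"): {"xxxx", "yyyy"},
    ("sp1", "E2"): {"yyyy"},
    ("sp1", "E3"): {"zzzz"},
    ("sp2", "E1"): {"zzzz"},
}
toy_profile = build_profile(toy_hits)
toy_pairs = predict_interactions(toy_pdb)
```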
     230 
     231Implementation 
     232 
     233To implement the workflow outlined above, we have used the SOAP and REST APIs of DDBJ (http://www.ddbj.nig.ac.jp/), KEGG (http://www.genome.jp/) and PDBj (http://www.pdbj.org/). The workflow can be divided into three parts corresponding to the three web sites. Part I consists of steps (1) and (2) (using KEGG API), Part II of steps (3) and (4) (using DDBJ WABI), Part III of steps (5) and (6) (using PDBj sequence navigator SOAP); step (7) is handled by a customized program on the client side. The main part of the client program was written in Java, but we were forced to switch to Perl for PDBj's sequence navigator due to a version incompatibility in SOAP libraries. Image manipulation programs were written in Perl and Ruby. 
     234 
     235Outcome 
     236 
     237We were able to implement the above workflow within the three days of the BioHackathon. The following is a list of what we noticed during the development. First, we had to do a significant amount of coding in spite of the wealth of web services. Most of the code was dedicated to converting file formats between different steps. Although some of these conversions may be automated by providing new web services, we suspect that a non-trivial amount of coding for format conversions is inevitable when tackling new problems. It was also noted that the non-standard output format of the BLAST search in PDBj's sequence navigator caused some trouble, suggesting that web services should stick to standard formats where they exist. Second, it can take a significant amount of time to finish a whole analysis. This is for the most part due to the number of BLAST searches required for constructing a phylogenetic profile. In retrospect, this problem might have been solved if our client program had been made multi-threaded (assuming the web servers can handle hundreds of requests at the same time); however, this would increase the burden of client-side programming. Finally, by actually solving biologically oriented problems, we could identify some typical use cases which might be useful for further development of web services (e.g., given a set of gene names, return a phylogenetic profile; given a set of BLAST hits, group them according to their species). In summary, the implementation of the DDBJ-KEGG-PDBj workflow allowed us to realize the usefulness as well as the limitations of web services in their current state, and to identify room for further improvement. The activity of this satellite group has been recorded at http://hackathon2.dbcls.jp/wiki/DDBJ-KEGG-PDBj . 
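
The multi-threaded client mentioned above was not implemented at the BioHackathon. As a hedged sketch of what it might look like, the following Python code runs many independent searches through a thread pool. `fake_blast` is a stand-in defined only for this example; a real client would submit the request to the web service there, and the pool size would have to respect what the servers can handle.

```python
from concurrent.futures import ThreadPoolExecutor


def run_searches(sequences, search, max_workers=8):
    """Apply `search` to every sequence concurrently.

    Results are returned in the same order as the input sequences.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search, sequences))


def fake_blast(sequence):
    # Stand-in for a remote BLAST request (hypothetical).
    return (sequence, len(sequence))


search_results = run_searches(["MKT", "MKTAY", "M"], fake_blast, max_workers=2)
```

Threads suit this case because the client spends most of its time waiting for remote servers rather than computing.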
     238 
     239  
     240=== 7 Kazuharu Arakawa - [wiki:SatelliteG-language] (DONE) === 
    12241 
    13242G-language Genome Analysis Environment (G-language GAE) is a set of Perl libraries for genome sequence analysis that is compatible with BioPerl and equipped with several software interfaces (interactive Perl/UNIX shell with persistent data, AJAX Web GUI, Perl API) [1-3]. The software package contains more than 100 original analysis programs especially focusing on bacterial genome analysis, including those for the identification of binding sites with information theory, analysis of nucleotide composition bias, analysis of the distribution of characteristic oligonucleotides, analysis of codons and prediction of expression levels, and visualization of genomic information. In this hackathon, the attendees from the G-language Project implemented web-service interfaces for G-language GAE in order to provide higher interoperability. The RESTful web services provided at http://rest.g-language.org/ provide URL-based access to all functions of G-language GAE, which makes them easy to access from other online resources. For example, a graphical result of the GC skew analysis of the Escherichia coli K12 genome is given by http://rest.g-language.org/NC_000913/gcskew. Another interface through the SOAP protocol provides programming language-independent access to more than 100 analysis programs. The WSDL file (http://soap.g-language.org/g-language.wsdl) contains descriptions for all available programs in a single file, and can be readily loaded into the Taverna 2 workbench [4] to integrate with other services to construct workflows.  
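
Based only on the example URL given above, the URL pattern of the REST interface appears to be the base address followed by a genome identifier and a function name. The Python helper below builds such URLs under that assumption; it does not contact the server, and the helper name is made up for this sketch.

```python
from urllib.parse import quote

BASE_URL = "http://rest.g-language.org"


def glang_rest_url(genome_id, function_name):
    """Build a URL of the form <base>/<genome identifier>/<function>.

    The pattern is inferred from the single example in the text.
    """
    return f"{BASE_URL}/{quote(genome_id, safe='')}/{quote(function_name, safe='')}"


gcskew_url = glang_rest_url("NC_000913", "gcskew")
```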
     
    20249 
    21250 
    22  * 8 Bruno Aranda - [wiki:SatelliteVisualization] 
    23  * 9 Mitsuteru Nakao - [wiki:SatelliteGalaxy] 
    24  * 10 Jan/Jessica - [wiki:SatelliteBigData] (and/or [wiki:SatelliteSeqAnalysis]) 
    25  * 11 Paul Gordon - [wiki:SemanticWebServices] (and/or [wiki:SatelliteSemanticAnnotation]) 
    26  * 12 Chisato Yamasaki - [wiki:SatelliteManifestWebServiseGuidelines] 
     251=== 8 Bruno Aranda - [wiki:SatelliteVisualization] (DONE) === 
     252 
     253Visualization in Bioinformatics is a difficult field, as complex as 
     254the data that we try to visualize. In the Biohackathon 2009 three 
     255categories were created: interaction networks and pathways, expression 
     256and genomic region visualization. Many tools were demonstrated, many 
     257of them trying different approaches to visualization with some 
     258overlaps. 
     259 
     260There are common problems that need to be solved. The most important 
     261one is how to visualize in an easy and meaningful way vast amounts of 
     262data. The participants of the Biohackathon showed how they were 
     263approaching this issue in their tools. We could see 3D gene region 
     264visualization, diagrams that linked different base pairs of a genome 
     265or tools to visualize interaction networks with thousands of nodes that 
     266integrated in some cases gene expression data. 
     267 
     268To complicate the field even further, each category of visualization 
     269requires a different style depending on the nature and size of the 
     270studied data. The size of a genome (or multiple genomes) is 
     271much bigger than the amount of interactome data at the moment, so 
     272different techniques must be used. In the Biohackathon some innovative 
     273ways to browse through the genomes were shown, like the possibility of 
     274using huge wall projections that could be controlled using a custom 
     275controller designed for the task. 
     276 
     277It was clear for the participants that the current state of 
     278visualization tools is still insufficient in most cases and new 
     279meaningful approaches are needed. None of the tools shown seemed to be 
     280generic enough to solve all the problems effectively. Some ideas for 
     281future directions were using semantic zooming -where data is browsed 
     282semantically- or improving the interoperability between large data 
     283servers and the visualization systems. 
     284 
     285Biological visualization is not just about a tool to visualize the 
     286data, as effective ways to retrieve and store the data are needed for 
     287a fast and complete visual representation. Both data providers and 
     288tool developers need to work collaboratively to address this issue in 
     289the future, in a field that is still in a very early stage. 
     290 
     291 
     292=== 9 Mitsuteru Nakao - [wiki:SatelliteGalaxy] === 
     293 
     294=== 10 Jan Aerts/Jessica Severin - [wiki:SatelliteBigData] (and/or [wiki:SatelliteSeqAnalysis]) (in progress?) === 
     295 
     296Satellite meeting Data Handling 
     297 
     298What is "big data"? The notion of "big data" has to be thought of as relative to the storage and processing techniques used to work with it. One gigabyte of data for example is only a very small size for a genome sequencing center, but it is considered to be "big" for genome viewing software that has trouble handling this size. Data becomes big when its size starts interfering with its handling. 
     299 
     300The data-handling satellite meeting provided the opportunity to discuss current issues and possible solutions for working with larger datasets. Discussions focussed on the types of data that the attendees as researchers store and manipulate themselves, not on the problems faced by large specialized data centers such as the Short Read Archive at the European Bioinformatics Institute or the Large Hadron Collider at CERN. Views were exchanged on three stages in the data handling pipelines: storage, querying and processing. Attention was however focussed on the first two. 
     301 
     302Storage and querying issues are often tightly related. The problem space can be viewed along two dimensions: the number of objects to be stored and the size of individual objects. Protein-protein interaction datasets and assembled genomes, for example, represent the group of objects that are small in number and whose size is still easily manageable on a standard filesystem or in a simple database. Microarray results and next-generation sequencing reads, however, involve a large number of objects, which become more difficult to query. They are often still stored in relational databases, but require tweaking that departs from a normalized relational database model. Approaches to storing (and querying) this type of dataset involve, apart from the obvious such as creating good indices, database setups with a limited number of denormalized tables, avoiding table joins when querying. Unfortunately, existing relational database systems are proving more and more lacking in their ability to handle large datasets, both at the storage and the querying level. Several solutions are emerging that can help in storing large data, such as OGSA-DAI (www.ogsadai.org.uk), a grid-based solution allowing multiple databases to be queried as one, and the cloud. Amazon Simple Storage Service (S3) or Elastic Block Storage (EBS) and GoogleBase, for example, allow for storing very large amounts of data and are relatively cheap. However, their usability in scientific data storage and processing still has to be proven. At the time of writing, for example, large objects (bigger than 5 GB) cannot be stored on the Amazon S3 service. In addition, moving large datasets from local disks to the cloud becomes a significant bottleneck. 
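
The idea of a single denormalized, indexed table that can be queried without joins can be illustrated with Python's built-in `sqlite3` module. The table layout, column names and rows below are invented for the example and do not correspond to any schema discussed at the meeting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One wide table: sample and position information are stored with each read,
# so region queries need no joins.
conn.execute(
    """
    CREATE TABLE read_alignments (
        read_name  TEXT,
        sample     TEXT,
        chromosome TEXT,
        start_pos  INTEGER,
        end_pos    INTEGER
    )
    """
)
rows = [
    ("r1", "s1", "chr1", 100, 135),
    ("r2", "s1", "chr1", 500, 535),
    ("r3", "s2", "chr2", 100, 135),
]
conn.executemany("INSERT INTO read_alignments VALUES (?, ?, ?, ?, ?)", rows)

# An index on the columns used for lookups.
conn.execute("CREATE INDEX idx_region ON read_alignments (chromosome, start_pos)")


def reads_in_region(connection, chromosome, start, end):
    """Return names of reads overlapping [start, end] on a chromosome."""
    cursor = connection.execute(
        "SELECT read_name FROM read_alignments "
        "WHERE chromosome = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (chromosome, end, start),
    )
    return [row[0] for row in cursor]


overlapping = reads_in_region(conn, "chr1", 120, 200)
```

The trade-off is redundancy: values such as the sample name are repeated in every row in exchange for simpler and faster reads.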
     303 
     304Several approaches were discussed to make data accessible to others once it is stored. Regular SQL access can be granted for smaller datasets (e.g. the Ensembl MySQL server at ensembldb.ensembl.org). It was however made clear that APIs should play a big role, particularly APIs where an object can be retrieved by URL (for example http://www.example.com/genes/BRCA2;format=bed). A toolkit that is able to perform a limited number of queries is recommended for larger or more complicated datasets such as next-generation sequence reads and alignments (e.g. SAMtools at http://samtools.sourceforge.net). 
     305 
     306The main conclusion from the discussions in this workgroup is that every data storage/querying/processing problem has to be investigated independently and often requires custom solutions. It is however critical in finding those solutions to be aware of the available technologies and practices. 
     307 
     308{{{ 
     309Hey Jessica, 
     310 
     3112009/6/26 Jessica Severin <jessica.severin@gmail.com> 
     312> Dear Jan, Katayama-san, Arek 
     313> 
     314> Jan, Thank you for writing this.  I am sorry I was very busy this week. 
     315> 
     316> You commented about how loading into the cloud is slow and a bottleneck.  We did not discuss this at the meeting since no one had experience at the time.  Did you work on this since biohackathon?  If this is an easy conclusion to come to, then we can include such a comment.  The only reason I say this, is that if people use this document as material for management decisions or new directions of work, we should be careful on what we recommend. 
     317 
     318We did briefly touch on that subject (mentioning sending around harddisks with FedEx) during the discussion. Also, the sequencing group at Sanger has tested this out recently, and I believe it took about 10 days to upload one dataset. So that was _before_ they could actually start processing the data on Amazon EC2... I will double-check with those guys next week. 
     319 
     320> Is there a way to "track edits" in googledoc?  I want to do some big editing but want you to see what I am doing, rather than try to explain it here. 
     321 
     322Google Documents have revisions baked in. Anything you change is recorded and new revisions are created automatically. So just edit away. 
     323 
     324> The bigs point I want to add 
     325> - files with indexes work very well and may become more important, but need API options to enable system integration. 
     326> - more distributed (DAS) or federated (HDF5, OGSA-DAI) approaches may prove necessary 
     327> - internet bandwidth is becoming a bottleneck too (genome centers sending data to repositories). even custom gigabit lines between buildings is limiting (<1TB/day in practice) 
     328> - HDF5 should be mentioned along with  OGSA-DAI since I think it may be a valid option for someone to explore. 
     329 
     330OK. 
     331 
     332> I also want to remove lots of details of the  mysql schema ideas (few tables, joins...). It is too many details for this discussion.  We can just simply to "relational databases with data-mining tricks(biomart, eedb) are useful, but even this approach also has limits".  If we can make reference to eeDB and biomart papers for the "datamining" that would be nice.   
     333 
     334OK. Didn't add any references yet; that still has to be done. 
     335 
     336It would be great if you could get it a bit shorter as well, although it _could_ have been worse :-) 
     337 
     338Thanks for looking into this, 
     339jan. 
     340}}} 
     341 
     342=== 11 Paul Gordon/Rutger Vos/Mark Wilkinson - [wiki:SemanticWebServices] (and/or [wiki:SatelliteSemanticAnnotation]) (in progress) === 
     343 
     344=== 12 Chisato Yamasaki/Atsuko Yamaguchi - [wiki:SatelliteManifestWebServiseGuidelines] (in progress) === 
     345 
     346Manifest of Bio-Web Service (Bio-WS) guidelines 
     347 
     348A web service may be very helpful for processing data from heterogeneous biological databases. However, several problems, such as the lack of documentation for methods, the lack of compliance with the SOAP/WSDL specification in language libraries, and the variety of data formats, make it difficult for users to build a workflow using web services. 
     349 
     350TogoWS is designed to address such issues and improve usability as follows. 
     351(1) SOAP-based web services ideally follow the open standard and should be independent of programming languages; however, many services still require language-specific hacks in practice. TogoWS proxies these services to make them available in any programming language without difficulty. 
     352(2) Query mechanisms and syntax differ from service to service, which requires users to learn each usage beforehand. TogoWS provides a simple REST interface to query and retrieve entries in a unified manner. 
     353(3) Entries obtained from databases need to be parsed by the client program to be fully utilized. TogoWS embeds open source bioinformatics libraries such as BioPerl and BioRuby (which we also developed) to provide functionality for parsing and converting various biological data formats without any installation on the user side.  
     354  
     355Although TogoWS supports many biological databases and the number of available databases is increasing, it is not practical to proxy all the major biological databases through TogoWS because the number of biological databases is extremely large. Therefore, it is preferable that users be able to combine the original web services without effort by following a common direction. To this end, we would like to establish guidelines for designing web services for biological databases.    
     356 
     357We began with discussions among Japanese database providers, including The University of Tokyo as the KEGG provider, Osaka University as the PDBj provider, the Computational Biology Research Center, the Biomedicinal Information Research Center and the Database Center for Life Science, on how to make the web services of these databases work together. The first guideline for designing web services was then proposed in Japanese in October 2008. We improved it to be applicable to various databases and easy to understand even for a beginner in web services. After that, we translated it into English and refined it at the BioHackathon 2009.  
     358 
     359 
     360