=== 1 Kiyoko Kinoshita - [wiki:SatelliteGlycoBiology] ===

The satellite group on glycobiology focused on the development of workflows in
Taverna, relying on web services provided by the following glycobiology-related
sites:

* GlycoEpitopeDB [http://www.glyco.is.ritsumei.ac.jp/epitope/]
* RINGS [http://rings.t.soka.ac.jp/]
* Consortium for Functional Glycomics (CFG) [http://www.functionalglycomics.org]
* GlycomeDB [http://www.glycome-db.org]
* GlycoGeneDB [http://riodb.ibase.aist.go.jp/rcmg/ggdb/]
* H-InvDB [http://h-invitational.jp/]

Several use cases suggested by the participants were incorporated into the list of
targets for the group.

(Target 1) Develop and register web services on BioMoby so that glycobiologists can
create workflows for glycobiology research. In particular, since the participants of
this year's BioHackathon included developers of GlycoEpitopeDB and RINGS, the
following targets were set. For GlycoEpitopeDB, web services were to be developed to
a) query GlycoEpitopeDB with a keyword and retrieve the IDs of all entries containing
the keyword, and b) retrieve glycan structures in IUPAC format from GlycoEpitope IDs.
To handle the data returned by these services, an additional web service was to be
developed in RINGS to convert glycan structures from IUPAC format into KCF format,
with which glycan structure queries can be made.

(Target 2) Create a workflow for analyzing glyco-gene-related diseases: use OMIM to
search for diseases related to the loci of human homologs of target glyco-genes from
other species, such as fruit fly, and retrieve any SNPs that are known to be related.

In summary, although the programming took longer than expected, the group was able to
create BioMoby web services for GlycoEpitopeDB, called getGlycoEpitopeIDfromKeyword
and getIUPACfromGlycoEpitopeID, which could be connected to the RINGS web service
called getKCFfromIUPAC. The resulting workflow takes a keyword as input and searches
GlycoEpitopeDB for matching entries. The retrieved entries are then used to obtain the
glycan structures in IUPAC format, which the RINGS web service converts to KCF format;
the KCaM alignment algorithm can then be executed to retrieve glycan structures
similar to the converted ones. However, a parser still needs to be developed to
retrieve the KEGG GLYCAN IDs along with their scores from the returned GlycomicsObject.

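The same chain of services can also be scripted outside Taverna. Below is a minimal
Ruby sketch using the SOAP4R WSDL driver; the WSDL locations, the exact call
signatures, and the example keyword are assumptions, as only the service names above
are given in this report.

{{{
require 'soap/wsdlDriver'   # SOAP4R, bundled with Ruby 1.8

# Hypothetical WSDL locations -- the real endpoints are not listed in this report.
EPITOPE_WSDL = 'http://www.glyco.is.ritsumei.ac.jp/epitope/services.wsdl'
RINGS_WSDL   = 'http://rings.t.soka.ac.jp/services.wsdl'

epitope = SOAP::WSDLDriverFactory.new(EPITOPE_WSDL).create_rpc_driver
rings   = SOAP::WSDLDriverFactory.new(RINGS_WSDL).create_rpc_driver

# 1) keyword search returns GlycoEpitope IDs
ids = epitope.getGlycoEpitopeIDfromKeyword('sialyl')

# 2) fetch each glycan structure in IUPAC format
iupacs = ids.map { |id| epitope.getIUPACfromGlycoEpitopeID(id) }

# 3) convert IUPAC to KCF, the input format for a KCaM similarity search
kcfs = iupacs.map { |iupac| rings.getKCFfromIUPAC(iupac) }

puts kcfs
}}}
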
The disease-related workflow was able to retrieve H-Inv entries containing the OMIM
IDs that had been retrieved from OMIM for entries matching a particular keyword. One
issue was that the output was in XML format and contained entries without any SNPs, so
a simple BeanShell script was developed to filter out the unnecessary results.

As a result, the glycobiology satellite group successfully met the major targets set
for the BioHackathon. Based on these experiences, the group members can build on the
newly developed web services to implement more complex utilities that will enable
glycobiologists to analyze their data efficiently.

=== 2 Naohisa Goto - [wiki:SatelliteBioRuby] ===

We discussed how to deal with the big data coming from next-generation DNA sequencers
using BioRuby. For this purpose, a FASTQ format parser was implemented.

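As a usage illustration, the sketch below reads a FASTQ file with the new parser
through BioRuby's Bio::FlatFile interface. The method names follow later BioRuby
releases and the file name is a placeholder, so treat the details as assumptions.

{{{
require 'bio'

# Iterate over a FASTQ file (filename is a placeholder); quality values
# are interpreted with the Sanger encoding by default.
Bio::FlatFile.open(Bio::Fastq, 'reads.fastq') do |ff|
  ff.each do |entry|
    puts entry.entry_id
    puts entry.sequence_string
    p    entry.quality_scores   # decoded Phred quality values
  end
end
}}}
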
We also discussed milestones for the next releases. We agreed to release version 1.3.1
soon as a maintenance release with bug fixes and refactoring. After that, a major
version upgrade with many new functions and improvements is planned.

Quality improvement is one of the important issues for BioRuby. During the hackathon
we fixed several bugs, for example problems found in the Bio::PubMed.esearch method
and the Bio::Fasta::Report class. In addition to the bug fixes, refactoring of the
classes for BioSQL support was incorporated into the main repository.

Lack of documentation, especially use-case documents, is also a big problem. To add
documents easily and with little effort, we decided to translate the BioPerl HOWTOs
from Perl to Ruby. Because the BioPerl HOWTOs are distributed under the terms of the
Perl Artistic License, we can freely modify them and distribute the modified versions
under the same license. In addition, we will write new HOWTOs from scratch for
BioRuby-specific functions.

=== 3 Alberto Labarga - [wiki:SatelliteLiterature] ===

Although a significant portion of our knowledge about the life sciences is stored in
papers, the relationship between this knowledge and the information stored in existing
biological databases is almost negligible.

During the Literature Services meeting reported here, we aimed to investigate how to
annotate atomic components of research papers in the life sciences by combining
automatic ontology-based tags with manual user-generated tags, and how such annotation
could also facilitate the generation of networks of papers based on an enriched set of
metadata.

Different web tools allow researchers to search literature databases and integrate
semantic information extracted from text with external databases and ontologies. These
include Whatizit (http://www.ebi.ac.uk/webservices/whatizit), iHop
(http://www.ihop-net.org), Novoseek (http://www.novoseek.com/), Reflect
(http://reflect.ws/), Concept Web Linker (http://www.knewco.com/), the BioCreative
MetaServer platform, Allie, and OReFiL. However, only some of them provide accessible
APIs with which users can build their own text mining pipelines.

During the BioHackathon, we explored the BioCreative MetaServer platform, Whatizit and
the iHop web services, and developed several clients and workflows based on them. We
also reviewed the new information extraction XML format (ieXML) proposed by the
European Bioinformatics Institute to standardize the annotation of named entities
(http://www.ebi.ac.uk/Rebholz-srv/IeXML/), and proposed both iHop2ieXML and
BCMS2ieXML mappings based on XSLT.

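As an example of such a client, the following Ruby sketch calls the Whatizit SOAP
service with SOAP4R. The pipeline name and the contact operation follow the Whatizit
documentation of the time, but the exact WSDL location and call signature should be
treated as assumptions.

{{{
require 'soap/wsdlDriver'   # SOAP4R

WSDL = 'http://www.ebi.ac.uk/webservices/whatizit/ws?wsdl'
driver = SOAP::WSDLDriverFactory.new(WSDL).create_rpc_driver

# 'whatizitSwissprot' tags protein names; the result is annotated text
# that can be converted to ieXML and post-processed with XSLT.
annotated = driver.contact('whatizitSwissprot',
                           'p53 regulates apoptosis in human cells.',
                           false)
puts annotated
}}}
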
Besides automatic annotation using controlled vocabularies, the notion of community
annotation has recently started to be adopted within the biomedical community. For
instance, WikiProteins delivers an environment for the annotation of proteins. Another
example that illustrates the usefulness of harnessing collective intelligence is
BIOWiki; this collaborative ontology annotation and curation framework facilitates the
engagement of the community with the sole purpose of improving an ontology. Similar to
BIOWiki, BioPortal also makes it possible for the community to define mappings across
ontologies, allowing domain experts to generate new relationships that are then
evaluated by the community.

During the BioHackathon, we reviewed different solutions for such collaborative
annotation environments, focusing on the extraction of supporting statements for
biological facts in the literature; the most relevant are WiredMarker
(http://www.wired-marker.org), XConc
(http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi?page=XConc+Suite),
(http://a.kazusa.or.jp/) and Bionotate (http://bionotate.).

Furthermore, although papers implicitly contain Friend of a Friend (FOAF) profiles
(author name, affiliation, email, interests, etc.), digital libraries provide very
little support for interaction. We explored the architecture and functionality of
SciFOAF, and the need for unique IDs (URIs) for authors in publications was identified
as a key requirement for any really useful system.

Within the context of digital libraries and annotation in the life sciences, this
could be further improved if authors were able to find out who is working on similar
diseases, molecules, biomaterials, and other valid bio-related terminology. In this
manner the network becomes more concept-centric, so further integration of automatic
annotation into the collaborative annotation tools is required. The concept of the
living document, one of the selected proposals for the Elsevier Grand Challenge, was
presented as a suitable framework for this kind of integration.

* Rebholz-Schuhmann, D., et al. Text processing through Web services: calling Whatizit. Bioinformatics, 2008, 24(2):296-298.
* Labarga, A., et al. Web Services at the European Bioinformatics Institute. Nucleic Acids Research, 2007, 35(Web Server issue):W6-W11.
* Mons, B., et al. Calling on a million minds for community annotation in WikiProteins. Genome Biology, 2008, 9(5):R89.
* Backhaus, M. and Kelso, J. BIOWiki - a collaborative annotation and ontology curation framework. 16th International World Wide Web Conference, 2007, Banff, Alberta, Canada.
* Rubin, D.L., et al. BioPortal: a Web portal to biomedical ontologies. AAAI Spring Symposium Series, Symbiotic Relationships between Semantic Web and Knowledge Engineering, 2008, Stanford University Press.
* Garcia-Castro, A., et al. Semantic Web and Social Web heading towards a Living Document in the Life Sciences. Journal of Web Semantics, 2009 (accepted).

=== 4 Takeshi Kawashima - [wiki:SatelliteUseCases] ===

To handle the large volumes of data produced by new-generation sequencers, each of the
major genome institutes is developing its own analysis tools and constructing its own
pipelines. At the same time, falling sequencing costs and the automated customer
services now offered by several large sequencing centers mean that a variety of small
labs can also plan sequencing projects. Since the computational analyses that follow
sequencing can take many directions, such small labs often face difficulties in
handling their large datasets, despite the trend toward unified sequencing facilities.
It is therefore expected that collaborations between these "small" labs and
bioinformatics labs will increase, and useful free and open source software for genome
analysis is required for such collaborations. We held the "UseCases" satellite meeting
at BioHackathon 2009 to discuss how pipelines adapted to collaborators' requests can
be built easily.

In the UseCases satellite meeting, developers and users of bioinformatics tools came
together and exchanged views on their current issues. Developers of the following five
systems participated: BioMart, Galaxy, jORCA, ANNOTATOR, and TogoDB. On the user side,
the participants mainly consisted of genome biologists with developmental,
evolutionary, genetic, and medical interests. Of special note was the high level of
interaction between the two sides. The meeting consisted of two parts: an introduction
by each developer, and a discussion session with the users.

In the introductions, each developer explained what their software can do. In the
discussion part, the users passed their own data to the developers, who then
demonstrated the usability of their tools just as they had explained. By showing the
usage of these programs before the users' eyes, not only did the users see clearly how
to use the programs, but the developers could also recognize which aspects of usage
are difficult for users to understand. These steps were a good opportunity to bridge
the gap between software developers and their users.

As one concrete example, we made a pipeline to construct an annotation database for a
small or medium-sized EST project. For this pipeline, 100,000 sequences from
invertebrate organisms were annotated through ANNOTATOR and BioMart together with
Blast2GO and KAAS, and all of the resulting data can be published through TogoDB.
jORCA and Taverna can help to connect these web services. A schematic view of this
pipeline is shown in Fig. 1.

[[Image(bh2009-usecase_1.pdf)]]

=== 5 Riu Yamashita/Alberto Labarga - [wiki:SatelliteTranscriptionRegulation] ===

Transcription regulation is one of the hot topics in biology, and many researchers are
focusing on targets such as transcription factor binding sites, methylation, copy
number variation, histone modification, and so on. Even though there are a number of
useful tools to support such research, the experimental side still needs bioinformatics
help. Therefore, we first surveyed general experimental techniques and tried to
identify what experimental researchers request. We realized that there was no suitable
web tool to predict changes in transcriptional regulation caused by transcription
factors or SNPs.

To solve this, we have been constructing a new web server, "Churaumi". Churaumi is
based on a DAS server; it holds SNP data from FESD II and transcription start site
data from DBTSS. It also contains potential transcription factor binding sites (TFBSs)
in promoter regions, and users can predict which TFBSs are overrepresented in their
specific gene sets.

----

Biologists need more detailed information about what is happening at the gene level
when they look at functional genomics data such as microarray data. Besides
differential expression or network analysis, further regulatory information, such as
enrichment analysis of transcription factor binding sites and of SNPs within those
sites, is available and could easily be integrated into the analysis. This integration
could also be extended to existing knowledge about histone binding, siRNAs or miRNAs
that control function, methylation, copy number variation, etc.

The objective of the meeting was to design a unified tool that integrates the
information biologists may need when they look at their functional studies, and to
identify feasible techniques and economical solutions for building it.

We decided to start with the problem of analyzing TFBSs enriched in differentially
expressed genes from microarray data, and of exploring possible variations in the
genomic sequence that could explain the differences in expression.

The basis of the integrated system was the Functional Element SNP database (FESD II),
a web-based system for selecting sets of SNPs in putative functional elements in human
genes. It provides sets of SNPs located in 10 different functional elements: promoter
regions, CpG islands, 5'UTRs (untranslated regions), translation start sites, splice
sites, coding exons, introns, translation stop sites, polyadenylation signals (PASes),
and 3'UTRs.

We decided to use the Distributed Annotation System (DAS) as the integration
mechanism. DAS defines a communication protocol used to exchange annotations on
genomic or protein sequences. DAS can be implemented as a client-server system in
which a single client integrates information from multiple servers. It allows a single
machine to gather sequence annotation information from multiple distant web sites,
collate the information, and display it to the user in a single view. Since it is
heavily used in the genome bioinformatics community, open source clients and servers
are available.

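Because DAS requests are plain HTTP, a client needs very little code. The Ruby sketch
below fetches features from a DAS source following the DAS 1.x features command; the
server URL, the source name fesd2, and the segment coordinates are hypothetical
placeholders for the planned FESD II layer.

{{{
require 'open-uri'
require 'rexml/document'

# DAS 1.x features request: /das/{source}/features?segment={chr}:{start},{stop}
url = 'http://example.org/das/fesd2/features?segment=13:31787617,31871809'

doc = REXML::Document.new(URI.open(url).read)
doc.elements.each('DASGFF/GFF/SEGMENT/FEATURE') do |f|
  type  = f.elements['TYPE'].text
  start = f.elements['START'].text
  stop  = f.elements['END'].text
  puts [type, start, stop].join("\t")
end
}}}
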
The work started at the BioHackathon consisted of implementing a DAS layer for FESD II
and the TFBS prediction systems, and of developing a web application that computes the
enrichment of TFBSs for a list of genes or proteins and presents the results using an
Ajax DAS viewer. These developments are still work in progress.

* Jimenez, R.C., et al. Dasty2, an AJAX protein DAS client. Bioinformatics, 2008, 24(18):2119-2121.

=== 6 Akira Kinjo - [wiki:DDBJ-KEGG-PDBj] ===

'''A DDBJ-KEGG-PDBj workflow: from pathways to protein-protein interactions'''

==== Objectives and Outline ====

The objective of this satellite group was to examine the potential of, and the
obstacles to, web services by implementing a real-life use case. The goal of the
workflow is to enumerate possible physical protein-protein interactions among proteins
in a biochemical pathway. More specifically, the workflow proceeds as follows:

1. The user provides a KEGG pathway ID.
2. Extract the protein sequence of each enzyme in the specified pathway.
3. For each protein sequence, run a BLAST search against the Swiss-Prot database.
4. Construct a phylogenetic profile (a species-by-enzyme matrix) by identifying the top hits for each protein and each species.
5. For each species in the phylogenetic profile, run BLAST searches for each protein sequence against PDB.
6. If two amino acid sequences (of the same species) have homologs in the same PDB entry, they are inferred to be in physical contact, and hence predicted to be an interacting pair.
7. Output image files highlighting the conserved and interacting proteins in the pathway map.

==== Implementation ====

To implement the workflow outlined above, we used the SOAP and REST APIs of DDBJ
(http://www.ddbj.nig.ac.jp/), KEGG (http://www.genome.jp/) and PDBj
(http://www.pdbj.org/). The workflow can be divided into three parts corresponding to
the three web sites: Part I consists of steps (1) and (2) (using the KEGG API), Part
II of steps (3) and (4) (using DDBJ WABI), and Part III of steps (5) and (6) (using
the PDBj sequence navigator SOAP interface); step (7) is handled by a customized
program on the client side. The main part of the client program was written in Java,
but we were forced to switch to Perl for PDBj's sequence navigator due to a version
incompatibility between SOAP libraries. The image manipulation programs were written
in Perl and Ruby.

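For illustration, Parts I and II can be sketched in Ruby with BioRuby's wrapper for
the then-current KEGG SOAP API. The pathway ID is an arbitrary example, and the
commented-out WABI call is a hypothetical stand-in for the DDBJ BLAST service, whose
driver setup and exact signature are omitted here; treat those details as assumptions.

{{{
require 'bio'

keggapi = Bio::KEGG::API.new

# Part I: enzymes (gene entries) in a user-specified pathway
genes = keggapi.get_genes_by_pathway('path:hsa00010')   # glycolysis, as an example

genes.each do |gene|
  # amino acid sequence of each enzyme in FASTA format
  fasta = keggapi.bget("-f -n a #{gene}")

  # Part II: BLAST against Swiss-Prot through DDBJ WABI (hypothetical call)
  # report = wabi.searchSimple('blastp', 'SWISS', fasta)
end
}}}
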
==== Outcome ====

We were able to implement the above workflow within the three days of the
BioHackathon. The following is a list of what we noticed during the development.

First, we had to do a significant amount of coding in spite of the wealth of web
services. Most of the code was dedicated to converting file formats between different
steps. Although some of these conversions might be automated by providing new web
services, we suspect that a non-trivial amount of coding for format conversions is
inevitable whenever we tackle new problems. It was also noted that the non-standard
output format of the BLAST search in PDBj's sequence navigator caused some trouble,
suggesting that web services should stick to standard formats where they exist.

Second, it can take a significant amount of time to finish a whole analysis. This is
for the most part due to the number of BLAST searches required for constructing a
phylogenetic profile. In retrospect, this problem might have been solved by making our
client program multi-threaded (assuming the web servers can handle hundreds of
requests at the same time); however, this would increase the burden of client-side
programming.

Finally, by actually solving biologically oriented problems, we could identify some
typical use cases which might be useful for the further development of web services
(e.g., given a set of gene names, return a phylogenetic profile; given a set of BLAST
hits, group them according to their species).

In summary, the implementation of the DDBJ-KEGG-PDBj workflow allowed us to realize
the usefulness as well as the limitations of web services in their current state, and
to identify room for further improvement. The activity of this satellite group has
been recorded at http://hackathon2.dbcls.jp/wiki/DDBJ-KEGG-PDBj .

=== 7 Kazuharu Arakawa - [wiki:SatelliteG-language] ===
=== 8 Bruno Aranda - [wiki:SatelliteVisualization] ===

Visualization in bioinformatics is a difficult field, as complex as the data we try to
visualize. At the BioHackathon 2009, three categories were considered: interaction
networks and pathways, expression, and genomic region visualization. Many tools were
demonstrated, trying different approaches to visualization, with some overlaps.

There are common problems that need to be solved. The most important one is how to
visualize vast amounts of data in an easy and meaningful way. The participants of the
BioHackathon showed how they were approaching this issue in their tools. We saw 3D
gene region visualization, diagrams linking different base pairs of a genome, and
tools for visualizing interaction networks with thousands of nodes, in some cases
integrating gene expression data.

To complicate the field even further, each category of visualization requires a
different style depending on the nature and size of the data studied. A genome (or a
set of genomes) is much bigger than the amount of interactome data available at the
moment, so different techniques must be used. At the BioHackathon, some innovative
ways to browse through genomes were shown, such as huge wall projections that could be
controlled using a custom controller designed for the task.

It was clear to the participants that the current state of visualization tools is
still insufficient in most cases and that new meaningful approaches are needed. None
of the tools shown seemed generic enough to solve all the problems effectively. Ideas
for future directions included semantic zooming (where data is browsed semantically)
and improving the interoperability between large data servers and visualization
systems.

Biological visualization is not just about a tool to visualize the data: effective
ways to retrieve and store the data are needed for a fast and complete visual
representation. Both data providers and tool developers need to work collaboratively
to address this issue in the future, in a field that is still at a very early stage.

=== 9 Mitsuteru Nakao - [wiki:SatelliteGalaxy] ===

=== 10 Jan Aerts/Jessica Severin - [wiki:SatelliteBigData] (and/or [wiki:SatelliteSeqAnalysis]) ===

'''Satellite meeting: data handling'''

What is "big data"? The notion of "big data" has to be understood as relative to the
storage and processing techniques used to work with it. One gigabyte of data, for
example, is only a very small size for a genome sequencing center, but it is
considered "big" for genome viewing software that has trouble handling it. Data
becomes big when its size starts interfering with its handling.

The data-handling satellite meeting provided the opportunity to discuss current issues
and possible solutions for working with larger datasets. Discussions focused on the
types of data that the attendees, as researchers, store and manipulate themselves, not
on the problems faced by large specialized data centers such as the Short Read Archive
at the European Bioinformatics Institute or the Large Hadron Collider at CERN. Views
were exchanged on three stages of the data handling pipeline: storage, querying and
processing, with attention focused on the first two.

Storage and querying issues are often tightly related. The problem space can be viewed
along two dimensions: the number of objects to be stored and the size of the
individual objects. Protein-protein interaction datasets and assembled genomes, for
example, represent a group of objects that is small in number and whose size is still
easily manageable on a standard filesystem or in a simple database. Microarray results
and next-generation sequencing reads, however, involve a large number of objects which
become more difficult to query. They are often still stored in relational databases,
but require tweaking that digresses from a normalized relational database model.
Approaches to storing (and querying) this type of dataset involve, apart from the
obvious such as creating good indices, database setups with a limited number of
denormalized tables, avoiding table joins when querying.

Unfortunately, existing relational database systems are proving more and more lacking
in their ability to handle large datasets, at both the storage and the querying level.
Several solutions are appearing that can help in storing large data, such as OGSA-DAI
(www.ogsadai.org.uk), a grid-based solution allowing multiple databases to be queried
as one, and the cloud. Amazon Simple Storage Service (S3), Elastic Block Store (EBS)
and GoogleBase, for example, allow very large amounts of data to be stored and are
relatively cheap. However, their usability for scientific data storage and processing
still has to be proven. At the time of writing, large objects (bigger than 5 GB)
cannot, for example, be stored on the Amazon S3 service. In addition, moving large
datasets from local disks to the cloud becomes a significant bottleneck.

Several approaches were discussed for making data accessible to others once it is
stored. Regular SQL access can be granted for smaller datasets (e.g. the Ensembl MySQL
server at ensembldb.ensembl.org). It was however made clear that APIs should play a
big role, particularly APIs where an object can be retrieved by URL (for example
http://www.example.com/genes/BRCA2;format=bed). A toolkit that is able to perform a
limited number of queries is recommended for larger or more complicated datasets such
as next-generation sequence reads and alignments (e.g. SAMtools at
http://samtools.sourceforge.net).

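A URL-addressable API keeps client code trivial. The Ruby sketch below retrieves a
gene as BED over HTTP, using the hypothetical URL pattern from the example above.

{{{
require 'open-uri'

# Fetch one object by URL (the endpoint is the hypothetical example above).
bed = URI.open('http://www.example.com/genes/BRCA2;format=bed').read

bed.each_line do |line|
  chrom, chrom_start, chrom_end, name = line.chomp.split("\t")
  puts "#{name}: #{chrom}:#{chrom_start}-#{chrom_end}"
end
}}}
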
The main conclusion from the discussions in this workgroup is that every data storage,
querying or processing problem has to be investigated independently and often requires
a custom solution. In finding those solutions, however, it is critical to be aware of
the available technologies and practices.

=== 11 Paul Gordon/Rutger Vos/Mark Wilkinson - [wiki:SemanticWebServices] (and/or [wiki:SatelliteSemanticAnnotation]) ===

=== 12 Chisato Yamasaki/Atsuko Yamaguchi - [wiki:SatelliteManifestWebServiseGuidelines] ===

'''Manifest of Bio-Web Service (Bio-WS) guidelines'''

Web services can be very helpful for processing data from heterogeneous biological
databases. However, several problems, such as the lack of documentation for methods,
incomplete compliance with the SOAP/WSDL specification in language libraries, and the
variety of data formats, make it difficult for users to build workflows from web
services.

TogoWS is designed to address these issues and improve usability as follows:
(1) SOAP-based web services ideally follow the open standard and should be independent of the programming language; in practice, however, many services still require language-specific hacks. TogoWS proxies these services to make them available from any programming language without difficulty.
(2) The query mechanism and syntax differ from service to service, requiring users to learn each usage beforehand. TogoWS provides a simple REST interface to query and retrieve data in a unified manner (see the sketch below).
(3) Entries obtained from databases need to be parsed by the client program to be fully utilized. TogoWS embeds open source bioinformatics libraries such as BioPerl and BioRuby (which we also developed) to provide parsing and conversion of various biological data formats without any installation on the user's side.

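A brief sketch of the unified REST pattern in point (2): entry retrieval as
/entry/{database}/{id}[.format] and search as /search/{database}/{query}. The specific
database names, identifiers and formats below are illustrative and should be checked
against the TogoWS documentation.

{{{
require 'open-uri'

# Entry retrieval: /entry/{database}/{id}.{format}
puts URI.open('http://togows.org/entry/nucleotide/X61605.fasta').read

# Search: /search/{database}/{query}[/{offset},{limit}]
puts URI.open('http://togows.org/search/pubmed/BioHackathon/1,5').read
}}}
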
Although TogoWS supports many biological databases and the number of available
databases is increasing, it is not practical for TogoWS to proxy all the major
biological databases, because the number of biological databases is extremely large.
Therefore, it is preferable that users can combine the original web services without
effort, following a common direction. To this end, we would like to establish a
guideline for designing web services on biological databases.

We first began by discussing, among Japanese database providers, including the
University of Tokyo as the KEGG provider, Osaka University as the PDBj provider, the
Computational Biology Research Center, the Biomedicinal Information Research Center
and the Database Center for Life Science, how to make the web services on these
databases work together. The first guideline for designing web services was proposed
in Japanese in October 2008. We improved it to be applicable to various databases and
easy to understand even for a beginner with web services, and then translated it into
English to be sharpened at the BioHackathon 2009.