BioHackathon 2009 meeting report

Work groups

1 Kiyoko Kinoshita - SatelliteGlycoBiology (DONE)

The satellite group on glycobiology focused on developing workflows with Taverna, which relied on web services provided by the following glycobiology-related sites:

Several suggested use cases were also incorporated into the group's list of targets.

(Target 1) Develop and register web services on BioMoby so that glycobiologists can create workflows for glycobiology research. Because this year's BioHackathon participants included the developers of GlycoEpitopeDB and RINGS, the following targets were set. For GlycoEpitopeDB, web services were to be developed to (a) query GlycoEpitopeDB with a keyword and retrieve the IDs of all entries containing that keyword, and (b) retrieve glycan structures in IUPAC format using GlycoEpitope IDs. To handle the data returned by these services, an additional web service was to be developed in RINGS to convert a glycan structure from IUPAC format into KCF format, in which glycan structure queries can be made.

(Target 2) Create a workflow for analyzing glyco-gene-related diseases: use OMIM to search for diseases related to the loci of human homologs of target glyco-genes from other species, such as the fruit fly, and retrieve any SNPs known to be related.

In summary, the programming took longer than expected, but the group was able to create BioMoby web services in GlycoEpitopeDB called getGlycoEpitopeIDfromKeyword and getIUPACfromGlycoEpitopeID, which could be connected to the RINGS web service called getKCFfromIUPAC. The resulting workflow takes a keyword as input and searches GlycoEpitopeDB for entries matching it. The retrieved entries are then used to obtain the glycan structures in IUPAC format. The RINGS web service converts these structures from IUPAC to KCF format, so that the KCaM alignment algorithm can be run to retrieve glycan structures similar to the converted ones. However, a parser still needs to be developed to retrieve the KEGG GLYCAN IDs along with their scores from the GlycomicsObject.

The disease-related workflow was able to retrieve H-Inv entries containing the OMIM IDs that had been retrieved from OMIM for entries matching a particular keyword. However, the output was in XML format and included entries that did not contain any SNPs. A simple BeanShell script was therefore developed to filter out these unnecessary results.
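The filtering step can be illustrated with a minimal sketch. The group's script was written in BeanShell and is not reproduced here; the Python version below assumes a hypothetical XML layout (the `results`, `entry`, `omim` and `snp` element names and all identifiers are placeholders, not the real H-Inv output schema) and simply keeps the entries that list at least one SNP.

```python
import xml.etree.ElementTree as ET

# Hypothetical response layout: element names and identifiers are
# placeholders chosen for illustration, not the real H-Inv schema.
SAMPLE_XML = """
<results>
  <entry id="entry-1"><omim>omim-1</omim><snp>snp-a</snp></entry>
  <entry id="entry-2"><omim>omim-2</omim></entry>
  <entry id="entry-3"><omim>omim-3</omim><snp>snp-b</snp><snp>snp-c</snp></entry>
</results>
"""


def entries_with_snps(xml_text):
    """Return {entry id: [SNP ids]} for entries with at least one <snp> child."""
    root = ET.fromstring(xml_text.strip())
    kept = {}
    for entry in root.iter("entry"):
        snps = [snp.text for snp in entry.findall("snp")]
        if snps:  # drop entries that carry no SNP information
            kept[entry.get("id")] = snps
    return kept


filtered = entries_with_snps(SAMPLE_XML)
```

Here `filtered` holds only `entry-1` and `entry-3`; the entry without SNPs is discarded.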

As a result, the glycobiology satellite group was able to meet the major targets set for the BioHackathon. Building on this experience, the group members can use the newly developed web services to implement more complex utilities that enable glycobiologists to analyze their data efficiently.

2 Naohisa Goto - SatelliteBioRuby (DONE)

We discussed how to deal with the large volumes of data coming from next-generation DNA sequencers using BioRuby. For this purpose, a FASTQ format parser was implemented.
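As an illustration of what such a parser does, here is a minimal sketch. It is written in Python rather than Ruby and is not the BioRuby implementation; it assumes strict four-line records and Sanger-style Phred+33 quality encoding, neither of which is stated in this report (other encodings use a different offset).

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (identifier, sequence, quality scores) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


example = StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n!!5I\n")
records = list(parse_fastq(example))
```

Because the function is a generator, records are produced one at a time and the whole file never has to be held in memory, which matters for sequencer-scale input.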

We also discussed milestones for the upcoming releases. We agreed to release version 1.3.1 soon as a maintenance release with bug fixes and refactoring. After that, a major version upgrade with many new functions and improvements is planned.

Quality improvement is one of the important issues for BioRuby. During the hackathon, we fixed several bugs, for example problems found in the Bio::PubMed.esearch method and the Bio::Fasta::Report class. In addition to the bug fixes, refactored classes for BioSQL support were incorporated into the main repository.

Lack of documentation, especially use-case documents, is also a big problem. To add documents with little effort, we decided to translate the BioPerl HOWTOs from Perl to Ruby. Because the BioPerl HOWTOs are distributed under the terms of the Perl Artistic License, we can freely modify them and distribute the modified versions under the same license. In addition, we will write new HOWTOs from scratch for BioRuby-specific functions.

3 Alberto Labarga - SatelliteLiterature (DONE)

Although a significant portion of our knowledge about the life sciences is stored in papers, the links between this knowledge and the information stored in existing biological databases are almost negligible.

During the Literature Services meeting reported here, we aimed to investigate how to annotate atomic components of research papers in life sciences by combining automatic ontology-based and manual user-generated tags, and how such annotation could also facilitate the generation of networks of papers, based on an enriched set of metadata.

Different web tools allow researchers to search literature databases and integrate semantic information extracted from text with external databases and ontologies. These include Whatizit ( http://www.ebi.ac.uk/webservices/whatizit), iHop ( http://www.ihop-net.org), Novoseek ( http://www.novoseek.com/), Reflect ( http://reflect.ws/), Concept Web Linker ( http://www.knewco.com/), the BioCreative MetaServer Platform, Allie and OReFiL. However, only some of them provide accessible APIs that let users build their own text-mining pipelines.

During the BioHackathon, we explored the BioCreative MetaServer Platform, Whatizit and iHop web services, and developed several clients and workflows based on them. We also reviewed the new information extraction XML format (ieXML) proposed by the European Bioinformatics Institute to standardize the annotation of named entities ( http://www.ebi.ac.uk/Rebholz-srv/IeXML/), and proposed XSLT-based mappings for both iHop2ieXML and BCMS2ieXML.

Besides automatic annotation using controlled vocabularies, the biomedical community has recently started to adopt the notion of community annotation. For instance, WikiProteins delivers an environment for the annotation of proteins. Another example that illustrates the usefulness of harnessing collective intelligence is BIOWiki; this collaborative ontology annotation and curation framework engages the community with the sole purpose of improving an ontology. Similarly, BioPortal allows the community to define mappings across ontologies, so that domain experts can generate new relationships that are then evaluated by the community.

During the BioHackathon, we reviewed different solutions for such collaborative annotation environments, focusing on the extraction of supporting statements for biological facts from the literature. The most relevant were WiredMarker ( http://www.wired-marker.org), XConc ( http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi?page=XConc+Suite), ( http://a.kazusa.or.jp/) and Bionotate ( http://bionotate.).

Furthermore, although papers implicitly contain Friend Of A Friend (FOAF) profiles (author name, affiliation, email, interests, etc.), digital libraries provide very little support for interaction. We explored the architecture and functionality of scifoaf, and the need for unique IDs (URIs) for authors in publications was identified as a key requirement for any truly useful system.

Within the context of digital libraries in the life sciences and annotation, this could be further improved if authors were able to find out who is working on similar diseases, molecules, biomaterials and other bio-related concepts. In this way the network becomes more concept-centric, which requires further integration of automatic annotation into the collaborative annotation tools. The concept of the living document, one of the selected proposals for the Elsevier Grand Challenge, was presented as a suitable framework for this kind of integration.

  • Rebholz-Schuhmann, D., et al., Text processing through Web Services: Calling Whatizit. Bioinformatics, 2007. 24(2).
  • Labarga et al. Web Services at the European Bioinformatics Institute. Nucleic Acids Research, 2007, Vol. 35, No. suppl_2 W6-W11
  • Mons, B., et al., Calling on a million minds for community annotation in WikiProteins. Genome Biology 2008.
  • Backhaus, M. and J. Kelso. BIOWiki - a collaborative annotation and ontology curation framework. in 16th International World Wide Web Conference. 2007. Banff, Alberta, Canada.
  • Rubin, D.L., et al. BioPortal: A Web Portal to Biomedical Ontologies. in AAAI Spring Symposium Series, Symbiotic Relationships between Semantic Web and Knowledge Engineering. 2008: Stanford University Press.
  • Garcia-Castro, et al. Semantic Web and Social Web heading towards a Living Document in Life Sciences. Journal of Web Semantics, 2009, accepted

4 Takeshi Kawashima - SatelliteUseCases (DONE)

To handle the large volumes of data from new-generation sequencers, each of the major genome institutes is developing its own data analysis tools and building its own pipelines. At the same time, because sequencing costs have fallen and several large sequencing centers have started automated customer services, many small labs can now plan sequencing projects as well. Since computational analysis after sequencing can take many directions, such small labs often face difficulties in handling their large datasets, despite the trend toward consolidation of sequencing facilities. Collaborations between these "small" labs and bioinformatics labs are therefore expected to increase, and useful free and open-source software for genome analysis is required for such collaborations. We held a satellite meeting on use cases at BioHackathon 2009 to discuss how to easily build pipelines adapted to collaborators' requests.

In the satellite meeting on use cases, developers and users of bioinformatics tools came together and exchanged views on their current issues. Developers of the following five tools participated: BioMart, Galaxy, jORCA, ANNOTATOR and TogoDB. On the user side, the participants were mainly genome biologists with developmental, evolutionary, genetic or medical interests. A notable feature of this meeting was the high level of interaction between the two groups. The meeting consisted of two parts: introductions by each developer and a discussion with the users.

In the introductions, each developer explained what their program can do. In the discussion, the users passed their own data to the developers, who then demonstrated on that data that their tools work as described. By seeing the programs used in front of them, the users learned clearly how to use them, and the developers could recognize which parts of the usage were difficult for users to understand. These steps were a good opportunity to bridge the gap between software developers and their users.

As one concrete example, we built a pipeline to construct an annotation database for a small or medium-sized EST project. For this pipeline, 100,000 sequences from invertebrate organisms were annotated through ANNOTATOR, BioMart with blast2GO, and KAAS. All of these data can then be published through TogoDB. jORCA and Taverna can be used to connect these web services. A schematic view of this pipeline is shown in Fig. 1.

5 Riu Yamashita/Alberto Labarga - SatelliteTranscriptionRegulation (DONE)

Transcription regulation is one of the hot topics in biology, and many researchers are focusing on targets such as transcription factor binding sites, methylation, copy number variation and histone modification. Even though there are a number of useful tools to support such research, experimental researchers still need help from bioinformatics. Therefore, we first reviewed general experimental techniques and tried to identify what experimental researchers need. We then realized that there was no suitable web tool for predicting transcriptional regulation caused by transcription factors or SNPs. To address this, we have been constructing a new web server, "Churaumi". Churaumi is based on a DAS server, and it draws on the FESD II DB for SNP data and on DBTSS for transcription start sites. It also holds potential transcription factor binding sites (TFBSs) in promoter regions, so that users can predict which TFBSs are overrepresented in their specific gene set.

---

Biologists need more detailed information about what is happening at the gene level when they look at functional genomics data such as microarray data. Besides differential expression or network analysis, further regulatory information, such as enrichment analysis of transcription factor binding sites and transcription factor binding site SNPs, is available and could easily be integrated into the analysis. This integration could also be extended to existing knowledge about histone binding, siRNA or miRNA that control function, methylation, copy number variation, etc.

The objective of the meeting was to design a unified solution tool to integrate the information that biologists may need when they look at their functional studies, and identify feasible techniques and economic solutions for them.

We decided to start with the problem of analysing TFBSs enriched in differentially expressed genes from microarray data, and to explore possible variations in the genomic sequence that could explain the differences in expression.

The basis for the integrated system was the Functional Element SNP database (FESD II), a web-based system for selecting sets of SNPs in putative functional elements in human genes. It provides sets of SNPs located in 10 different functional elements: promoter regions, CpG islands, 5'UTRs (untranslated regions), translation start sites, splice sites, coding exons, introns, translation stop sites, polyadenylation signals (PASes), and 3'UTRs.

We decided to use the Distributed Annotation System (DAS) as the integration mechanism. DAS defines a communication protocol used to exchange annotations on genomic or protein sequences. DAS can be implemented as a client-server system in which a single client integrates information from multiple servers. It allows a single machine to gather up sequence annotation information from multiple distant web sites, collate the information, and display it to the user in a single view. Since it is heavily used in the genome bioinformatics community, open source clients and servers are available.

The work started at the BioHackathon consisted of implementing a DAS layer for FESD II and the TFBS prediction systems, and developing a web application that computes the enrichment of TFBSs for a list of genes or proteins and presents the results using an Ajax DAS viewer. These developments are still work in progress.
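The report does not state which statistic the web application uses for enrichment. One common choice, assumed here purely for illustration, is a one-sided hypergeometric test: given how many genes in the background carry a predicted site, how surprising is the count observed in the user's gene set? The function and parameter names below are hypothetical.

```python
from math import comb


def enrichment_p_value(background_size, background_with_site, set_size, set_with_site):
    """One-sided hypergeometric tail: P(X >= set_with_site).

    X is the number of genes carrying the binding site when `set_size` genes
    are drawn without replacement from a background of `background_size`
    genes, of which `background_with_site` carry the site.
    """
    upper = min(set_size, background_with_site)
    total = comb(background_size, set_size)
    tail = sum(
        comb(background_with_site, k)
        * comb(background_size - background_with_site, set_size - k)
        for k in range(set_with_site, upper + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 with the site; 3 of 4 selected genes have it.
p_value = enrichment_p_value(20, 5, 4, 3)
```

With these toy numbers the tail probability is 155/4845, roughly 0.032. In practice one such test would be run per TFBS, followed by a multiple-testing correction.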

  • Jimenez, R, et al. Dasty2, an AJAX protein DAS client. Bioinformatics, 2008, 24(18):2119-2121.

6 Akira Kinjo - DDBJ-KEGG-PDBj (DONE)

A DDBJ-KEGG-PDBj workflow: from pathways to protein-protein interactions

Objectives and Outline

The objective of this satellite group is to examine the potential and the obstacles of web services by implementing a real-life use case. The goal of the workflow is to enumerate possible physical protein-protein interactions among the proteins in a biochemical pathway. More specifically, the workflow proceeds as follows.

(1) The user provides a KEGG pathway ID.
(2) Extract the protein sequence of each enzyme in the specified pathway.
(3) For each protein sequence, run a BLAST search against the Swiss-Prot database.
(4) Construct a phylogenetic profile (a species-by-enzyme matrix) by identifying the top hits for each protein and each species.
(5) For each species in the phylogenetic profile, run BLAST searches for each protein sequence against PDB.
(6) If two amino acid sequences (of the same species) have homologs in the same PDB entry, they are inferred to be in physical contact, and hence predicted to be an interacting pair.
(7) Output image files highlighting the conserved and interacting proteins in the pathway map.
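The inference rule in step (6) can be sketched in a few lines. This is an illustrative Python version, not the group's Java/Perl client; the protein names and PDB entry identifiers are placeholders, and the sketch does not check whether the two homologs are distinct chains within the shared entry.

```python
from itertools import combinations


def predict_interactions(pdb_hits):
    """Predict interacting pairs from shared PDB entries.

    `pdb_hits` maps a protein name to the set of PDB entry IDs in which a
    homolog of that protein was found (for one species).
    Returns a list of (protein_a, protein_b, shared entry IDs).
    """
    predicted = []
    for first, second in combinations(sorted(pdb_hits), 2):
        shared = pdb_hits[first] & pdb_hits[second]
        if shared:  # homologs of both proteins occur in the same PDB entry
            predicted.append((first, second, sorted(shared)))
    return predicted


# Placeholder identifiers, for illustration only.
hits = {
    "enzymeA": {"entry_1", "entry_2"},
    "enzymeB": {"entry_2"},
    "enzymeC": {"entry_3"},
}
pairs = predict_interactions(hits)
```

Only `enzymeA` and `enzymeB` share an entry here, so they form the single predicted pair.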

Implementation

To implement the workflow outlined above, we have used the SOAP and REST APIs of DDBJ ( http://www.ddbj.nig.ac.jp/), KEGG ( http://www.genome.jp/) and PDBj ( http://www.pdbj.org/). The workflow can be divided into three parts corresponding to the three web sites. Part I consists of steps (1) and (2) (using KEGG API), Part II of steps (3) and (4) (using DDBJ WABI), Part III of steps (5) and (6) (using PDBj sequence navigator SOAP); step (7) is handled by a customized program on the client side. The main part of the client program was written in Java, but we were forced to switch to Perl for PDBj's sequence navigator due to a version incompatibility in SOAP libraries. Image manipulation programs were written in Perl and Ruby.

Outcome

We were able to implement the above workflow within the three days of the BioHackathon. The following is a list of what we noticed during development. First, we had to do a significant amount of coding in spite of the wealth of web services. Most of the code was dedicated to converting file formats between steps. Although some of these conversions could be automated by providing new web services, we suspect that a non-trivial amount of coding for format conversion is inevitable when tackling new problems. We also noted that the non-standard output format of BLAST searches in PDBj's sequence navigator caused some trouble, which suggests that web services should stick to standard formats where they exist. Second, it can take a significant amount of time to finish a whole analysis. This is mostly due to the number of BLAST searches required to construct a phylogenetic profile. In retrospect, this problem might have been solved by making our client program multi-threaded (assuming the web servers can handle hundreds of requests at the same time), although this would increase the burden of client-side programming. Finally, by actually solving biologically oriented problems, we could identify some typical use cases that might be useful for the further development of web services (e.g., given a set of gene names, return a phylogenetic profile; given a set of BLAST hits, group them according to their species). In summary, the implementation of the DDBJ-KEGG-PDBj workflow allowed us to appreciate the usefulness as well as the limitations of the web services in their current state, and to identify room for further improvement. The activity of this satellite group has been recorded at http://hackathon2.dbcls.jp/wiki/DDBJ-KEGG-PDBj .
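The "group BLAST hits by species" use case mentioned above is small enough to sketch. The version below is a hedged Python illustration with a hypothetical input layout (query, species, bit score); it keeps the best-scoring hit per enzyme and species, which is one simple way to fill the species-by-enzyme matrix of a phylogenetic profile.

```python
def build_profile(hits):
    """Group BLAST hits by species, keeping the best score per query.

    `hits` is an iterable of (query, species, bit_score) tuples; this layout
    is an assumption for illustration, not a format used by the group.
    Returns {species: {query: best bit score}}.
    """
    profile = {}
    for query, species, score in hits:
        row = profile.setdefault(species, {})
        if score > row.get(query, float("-inf")):
            row[query] = score
    return profile


# Placeholder hits, for illustration only.
blast_hits = [
    ("enzyme1", "species_a", 250.0),
    ("enzyme1", "species_a", 310.0),
    ("enzyme2", "species_a", 120.0),
    ("enzyme1", "species_b", 95.0),
]
profile = build_profile(blast_hits)
```

A missing key in a species row (here `enzyme2` for `species_b`) corresponds to an absent enzyme in the profile.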

7 Kazuharu Arakawa - SatelliteG-language (DONE)

G-language Genome Analysis Environment (G-language GAE) is a set of Perl libraries for genome sequence analysis that is compatible with BioPerl and equipped with several software interfaces (an interactive Perl/UNIX shell with persistent data, an AJAX Web GUI, and a Perl API) [1-3]. The software package contains more than 100 original analysis programs focusing especially on bacterial genome analysis, including programs for the identification of binding sites with information theory, analysis of nucleotide composition bias, analysis of the distribution of characteristic oligonucleotides, analysis of codons and prediction of expression levels, and visualization of genomic information. In this hackathon, the attendees from the G-language Project implemented web-service interfaces for G-language GAE in order to provide higher interoperability. The RESTful web services at http://rest.g-language.org/ provide URL-based access to all functions of G-language GAE, so that they can easily be accessed from other online resources. For example, a graphical result of the GC skew analysis of the Escherichia coli K12 genome is given by http://rest.g-language.org/NC_000913/gcskew. Another interface, based on the SOAP protocol, provides programming language-independent access to more than 100 analysis programs. The WSDL file ( http://soap.g-language.org/g-language.wsdl) contains descriptions of all available programs in a single file, and can be readily loaded into the Taverna 2 workbench [4] and integrated with other services to construct workflows.
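For readers unfamiliar with the analysis named in the example URL, GC skew is conventionally computed as (G - C) / (G + C) over windows along the sequence. The Python sketch below illustrates that calculation; it is not the G-language implementation. The `rest_url` helper only reproduces the accession/method pattern visible in the single example above, and the assumption that other functions follow the same pattern is ours.

```python
def gc_skew(sequence, window):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = sequence.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g_count, c_count = chunk.count("G"), chunk.count("C")
        total = g_count + c_count
        values.append((g_count - c_count) / total if total else 0.0)
    return values


def rest_url(accession, method, base="http://rest.g-language.org"):
    """Build a URL following the accession/method pattern of the example above."""
    return f"{base}/{accession}/{method}"


skew = gc_skew("GGGCATCCCG", window=5)
url = rest_url("NC_000913", "gcskew")
```

In the toy sequence the first window is G-rich (skew 0.5) and the second is C-rich (skew -0.5).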

References:

  1. Arakawa K, Mori K, Ikeda K, Matsuzaki T, Kobayashi Y, Tomita M, "G-language Genome Analysis Environment: a workbench for nucleotide sequence data mining", Bioinformatics, 2003, 19(2):305-306.
  2. Arakawa K, Tomita M, "G-language System as a platform for large-scale analysis of high-throughput omics data", Journal of Pesticide Science, 2006, 31(3):282-288.
  3. Arakawa K, Suzuki H, Tomita M, "Computational Genome Analysis Using The G-language System", Genes, Genomes and Genomics, 2008, 2(1):1-13.
  4. Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P, "Taverna: a tool for the composition and enactment of bioinformatics workflows", Bioinformatics, 2004, 20(17):3045-3054.

8 Bruno Aranda - SatelliteVisualization (DONE)

Visualization in bioinformatics is a difficult field, as complex as the data that we try to visualize. In BioHackathon 2009, three categories were created: interaction networks and pathways, expression, and genomic region visualization. Many tools were demonstrated, many of them taking different approaches to visualization, with some overlaps.

There are common problems that need to be solved. The most important one is how to visualize vast amounts of data in an easy and meaningful way. The participants of the BioHackathon showed how they were approaching this issue in their tools. We saw 3D gene region visualization, diagrams that link different base pairs of a genome, and tools for visualizing interaction networks with thousands of nodes, in some cases integrated with gene expression data.

To complicate the field even further, each category of visualization requires a different style depending on the nature and size of the data being studied. A genome (or multiple genomes) is currently much bigger than the available interactome data, so different techniques must be used. At the BioHackathon, some innovative ways to browse through genomes were shown, such as huge wall projections operated with a custom controller designed for the task.

It was clear to the participants that the current state of visualization tools is still insufficient in most cases and that new, meaningful approaches are needed. None of the tools shown seemed generic enough to solve all the problems effectively. Ideas for future directions included semantic zooming (where data is browsed semantically) and improving the interoperability between large data servers and visualization systems.

Biological visualization is not just about a tool to visualize the data, as effective ways to retrieve and store the data are needed for a fast and complete visual representation. Both data providers and tool developers need to work collaboratively to address this issue in the future, in a field that is still in a very early stage.

9 Mitsuteru Nakao - SatelliteGalaxy

The satellite group on Galaxy discussed usability issues, including workflows, tools, internationalization and authentication, and developed solutions for them. Galaxy is an interactive platform for processing biological data that uses various tools to obtain data from data sources and to analyze and share the data.

Galaxy acts as a data processing hub that imports from various data sources, such as local files, web site URLs, BioMart and the UCSC Genome Browser, and exports to various applications as well. To enhance this hub scenario, we developed data import/export tools, visualization tools and text-mining tools to complement the existing Galaxy tools. We also developed two-way communication to import data from WormBase and export data to WormBase via GBrowse.

We developed a data import tool based on TogoWS to extend the data sources to databases available through the NCBI, EBI, KEGG, DDBJ and PDBj web service APIs. We developed a bookmarklet to visualize a selected database ID string (e.g. a PDB ID) on the original web site. Such a function could be replaced by semantic web technology. We also developed text-mining tools for MEDLINE abstracts. This line of research is promising.

On the usability side, Galaxy could not deal with non-ASCII characters in data and messages, although such characters often appear in gene symbols and full-text articles. In addition, Galaxy provided messages in English only. The end users of such an interactive workflow platform are not only researchers but also technicians worldwide, many of whom have difficulty reading and understanding English. We therefore addressed multilingualization and internationalization by developing an internationalized Galaxy using the standard gettext library and the UTF-8 character set. Galaxy can also use an external authentication system such as OpenID, and we developed a front-end application based on mod_openid.
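A minimal sketch of the gettext-plus-UTF-8 approach is shown below. It uses Python's standard `gettext` module and is not the code added to Galaxy; the `galaxy` message domain, the `locale` directory and the sample message are hypothetical names, and the gene symbol is only an illustrative non-ASCII string.

```python
import gettext

# Hypothetical message domain and catalog directory. With fallback=True the
# call returns a pass-through translator when no compiled catalog is found.
translation = gettext.translation(
    "galaxy", localedir="locale", languages=["ja"], fallback=True
)
_ = translation.gettext

# User-facing strings are wrapped in _() so a catalog can translate them.
message = _("Upload file")

# Data containing non-ASCII characters is kept as text and encoded as UTF-8
# only at the input/output boundary.
gene_symbol = "IFN-γ"
encoded = gene_symbol.encode("utf-8")
decoded = encoded.decode("utf-8")
```

Without a Japanese catalog installed, `message` is simply the original English string, which is the fallback behaviour one usually wants for untranslated messages.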

In addition, we discussed user support materials. A sophisticated graphical user interface in a web application is too complex to describe in text alone, so it is difficult for end-user biologists and developers to communicate without graphical materials such as screencasts. Galaxy provides a series of screencasts for end users. Using screencasts to support users in this way is one of the best practices for web applications.

10 Jan Aerts/Jessica Severin - SatelliteBigData (and/or SatelliteSeqAnalysis) (in progress?)

Satellite meeting Data Handling

What is "big data"? The notion of "big data" has to be thought of as relative to the storage and processing techniques used to work with it. One gigabyte of data for example is only a very small size for a genome sequencing center, but it is considered to be "big" for genome viewing software that has trouble handling this size. Data becomes big when its size starts interfering with its handling.

The data-handling satellite meeting provided the opportunity to discuss current issues and possible solutions for working with larger datasets. Discussions focussed on the types of data that the attendees store and manipulate themselves as researchers, not on the problems faced by large specialized data centers such as the Short Read Archive at the European Bioinformatics Institute or the Large Hadron Collider at CERN. Views were exchanged on three stages of the data handling pipeline: storage, querying and processing. Attention was, however, focussed on the first two.

Storage and querying issues are often tightly related. The problem space can be viewed along two dimensions: the number of objects to be stored and the size of the individual objects. Protein-protein interaction datasets and assembled genomes, for example, represent the group of objects that are small in number and whose size is still easily manageable on a standard filesystem or in a simple database. Microarray results and next-generation sequencing reads, however, involve a much larger number of objects, which become more difficult to query. They are often still stored in relational databases, but require tweaking that departs from a normalized relational database model. Approaches to storing (and querying) this type of dataset involve, apart from the obvious such as creating good indices, database setups with a limited number of denormalized tables, which avoid table joins when querying. Unfortunately, existing relational database systems are proving less and less able to handle large datasets, at both the storage and the querying level. Several solutions are emerging that can help in storing large data, such as OGSA-DAI (www.ogsadai.org.uk), a grid-based solution allowing multiple databases to be queried as one, and the cloud. Amazon Simple Storage Service (S3) or Elastic Block Store (EBS) and Google Base, for example, allow very large amounts of data to be stored and are relatively cheap. However, their usability for scientific data storage and processing still has to be proven. At the time of writing, for example, large objects (bigger than 5 GB) cannot be stored on the Amazon S3 service. In addition, moving large datasets from local disks to the cloud becomes a significant bottleneck.
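The "few denormalized tables plus good indices" idea can be illustrated with a small sketch using SQLite from Python. The table and column names are hypothetical and the rows are toy data; the point is only that sample and position information live in one indexed table, so a region query needs no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One denormalized table: sample and alignment columns sit side by side
# instead of being split over normalized read/sample/alignment tables.
conn.execute(
    """CREATE TABLE read_alignment (
           read_name TEXT, sample TEXT, chromosome TEXT,
           start_pos INTEGER, end_pos INTEGER)"""
)
# Index chosen to match the expected query (region lookups).
conn.execute("CREATE INDEX idx_region ON read_alignment (chromosome, start_pos)")

rows = [  # toy data, for illustration only
    ("r1", "s1", "chr1", 100, 135),
    ("r2", "s1", "chr1", 500, 535),
    ("r3", "s2", "chr2", 120, 155),
    ("r4", "s2", "chr1", 130, 165),
]
conn.executemany("INSERT INTO read_alignment VALUES (?, ?, ?, ?, ?)", rows)


def reads_in_region(chromosome, start, end):
    """Return names of reads overlapping [start, end], without any join."""
    cursor = conn.execute(
        "SELECT read_name FROM read_alignment "
        "WHERE chromosome = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (chromosome, end, start),
    )
    return [name for (name,) in cursor]


overlapping = reads_in_region("chr1", 120, 140)
```

The trade-off is the usual one: redundancy and harder updates in exchange for simpler, faster reads.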

Several approaches were discussed for making data accessible to others once it is stored. Regular SQL access can be granted for smaller datasets (e.g. the Ensembl MySQL server at ensembldb.ensembl.org). It was made clear, however, that APIs should play a big role, particularly APIs in which an object can be retrieved by URL (for example http://www.example.com/genes/BRCA2;format=bed). For larger or more complicated datasets, such as next-generation sequence reads and alignments, a toolkit that can perform a limited number of queries is recommended (e.g. SAMtools at http://samtools.sourceforge.net).
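To make the URL-based retrieval idea concrete, the sketch below shows how a server might decompose the example URL above into a collection, an object identifier and a requested format. The function name and the returned dictionary layout are assumptions for illustration; no particular API was specified in the discussion.

```python
from urllib.parse import urlsplit


def parse_object_url(url):
    """Split an object URL of the form .../<collection>/<id>;format=<fmt>."""
    path = urlsplit(url).path
    resource, _, params = path.partition(";")
    collection, identifier = resource.strip("/").split("/")
    options = dict(
        item.split("=", 1) for item in params.split(";") if "=" in item
    )
    return {
        "collection": collection,
        "id": identifier,
        "format": options.get("format"),
    }


request = parse_object_url("http://www.example.com/genes/BRCA2;format=bed")
```

Because the object type, identifier and format are all in the URL, the same object can be bookmarked, linked from other resources or fetched by a script without any client library.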

The main conclusion from the discussions in this workgroup is that every data storage/querying/processing problem has to be investigated independently and often requires custom solutions. It is however critical in finding those solutions to be aware of the available technologies and practices.

11 Paul Gordon/Rutger Vos/Mark Wilkinson - SemanticWebServices (and/or SatelliteSemanticAnnotation) (DONE)

Interoperability between traditional web services is hampered by the fact that the supporting technologies only allow such services to be defined in terms of their syntax. To combine such services into larger workflows, developers need to agree on the meaning of the data types that these services produce and consume.

For example, if one service emits numbers constrained to a range between 0 and 100, and another service consumes such numbers, then these services are interoperable in terms of their syntax - but if the former service emits these numbers as measurements of the temperature of liquid water and the latter consumes them to record entry-level donations to a charity then these services are not truly interoperable in any meaningful way. Human intervention in traditional web service composition is necessary to assess whether such services are truly interoperable.
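A minimal sketch of this mismatch, assuming hypothetical concept URIs under example.org (they are illustrative names, not terms from a real ontology): the value passes a purely syntactic range check, and only a comparison of its declared meaning shows that the two services should not be chained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedValue:
    value: float
    concept: str  # hypothetical URI naming what the number means


def syntactically_compatible(v: TypedValue) -> bool:
    # A syntax-only contract: any number in [0, 100] is accepted.
    return 0 <= v.value <= 100


def semantically_compatible(v: TypedValue, expected_concept: str) -> bool:
    # A semantic contract also compares the declared meaning of the value.
    return syntactically_compatible(v) and v.concept == expected_concept


# Hypothetical concept identifiers, assumed for illustration only.
WATER_TEMP = "http://example.org/concepts#LiquidWaterTemperatureCelsius"
DONATION = "http://example.org/concepts#EntryLevelDonation"

reading = TypedValue(37.5, WATER_TEMP)
```

Here `syntactically_compatible(reading)` holds, while `semantically_compatible(reading, DONATION)` does not, which is the judgment a human otherwise has to make by hand.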

Semantic web services hold the promise of facilitating automatic service discovery and composition by describing the meaning of entities (operations, inputs and outputs, metadata) in a way that a machine can consume and draw conclusions from by itself. This is made possible by the recent development of a number of standards, conceptual and technological, among which are:

RDF - conceptually, RDF's contribution is the notion that anything can be described in terms of simple statements consisting of subject, predicate and object ("triples"). To transmit such triples, a number of serialization formats are used, including RDF/XML, N3 and Turtle.

OWL - the Web Ontology Language is used to describe whatever goes into triples - i.e. subjects, predicates and objects - in class hierarchies so that reasoners can draw conclusions about the implications of triples that include instances of OWL classes.

SPARQL - sets of RDF triples form graphs akin to joins across traditional relational database tables. The SPARQL language has been developed to facilitate queries over such graphs using a syntax and logic similar to SQL.

Web stack - to uniquely identify and dereference entities (subjects, predicates and objects) within an RDF graph, traditional web technologies are used. For example, because the DNS system guarantees uniqueness of entries, URLs are commonly used to identify entities, and HTTP is used to dereference them.
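The ideas above can be sketched in a few lines, assuming a hypothetical example.org namespace and using only the Python standard library rather than a real RDF toolkit. Triples are plain tuples of URLs, they are written out as N-Triples (a line-based subset of Turtle), and a small pattern matcher stands in for what a single SPARQL triple pattern does; it is not a SPARQL implementation.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

EX = "http://example.org/"  # hypothetical namespace, for illustration only

graph = [
    (EX + "seq1", EX + "hasSequenceSimilarityTo", EX + "seq2"),
    (EX + "seq2", EX + "encodedBy", EX + "geneA"),
    (EX + "seq1", EX + "encodedBy", EX + "geneB"),
]


def to_ntriples(triples: Iterable[Triple]) -> str:
    # N-Triples: one "<subject> <predicate> <object> ." statement per line.
    # Only URI objects are handled here; literals are out of scope.
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    # None acts as a wildcard, much like a variable in a SPARQL pattern.
    for triple in triples:
        if all(q is None or q == v for q, v in zip((s, p, o), triple)):
            yield triple
```

For instance, `list(match(graph, p=EX + "encodedBy"))` yields the two `encodedBy` statements, roughly what `SELECT ?s ?o WHERE { ?s ex:encodedBy ?o }` would return.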

Recently, these technologies have started to be applied to bioinformatics. Among the participants of the hackathon the project that is most advanced in its leveraging of semantic web service technologies is SADI, and so this was used as a model to illustrate semantic web services in general.

The BioHackathon was the first opportunity to describe and discuss the SADI (Semantic Automated Discovery and Integration) project. SADI is the successor to BioMoby, and is distinct from BioMoby in several key ways, including:

SADI utilizes the Semantic Web standards RDF and OWL, and their XML serializations, for passing its data; SADI does not define any novel messaging formats; There is no centralized datatype ontology, rather SADI can utilize any classes, and any predicates, from any Web-published ontology; Web Service interfaces are defined in terms of the OWL classes they consume/produce, and the Web Services must consume/produce RDF Individuals of those OWL Classes; Web Services are discovered based on the RDF predicate(s) that they attach to the incoming data; Web Services are invoked using HTTP POST; Batch-invocations and asynchronous invocations are handled at the level of the URI and pure HTTP protocol, rather than an extension of SOAP and/or WSRF headers.
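As a rough illustration of discovery by predicate, the sketch below indexes services by the predicates they attach to their input. The `PredicateRegistry` class, the service URLs and the ontology terms are all hypothetical; this is not SADI's actual registry or API, only a picture of the lookup it implies.

```python
from collections import defaultdict
from typing import Iterable, List, Optional


class PredicateRegistry:
    """Hypothetical in-memory index; not SADI's actual registry API."""

    def __init__(self) -> None:
        self._by_predicate = defaultdict(list)

    def register(self, service_url: str, input_class: str,
                 predicates: Iterable[str]) -> None:
        # Index the service under every predicate it attaches to its input.
        for predicate in predicates:
            self._by_predicate[predicate].append((service_url, input_class))

    def discover(self, predicate: str,
                 input_class: Optional[str] = None) -> List[str]:
        # Find services that can add `predicate` to data of `input_class`.
        return [url for url, cls in self._by_predicate.get(predicate, [])
                if input_class is None or cls == input_class]


# All URIs below are made up for illustration.
registry = PredicateRegistry()
registry.register("http://example.org/services/blast",
                  "http://example.org/onto#Sequence",
                  ["http://example.org/onto#hasSequenceSimilarityTo"])
registry.register("http://example.org/services/gene-lookup",
                  "http://example.org/onto#Sequence",
                  ["http://example.org/onto#encodedBy"])
```

A client that needs `hasSequenceSimilarityTo` statements about a sequence would ask the registry for that predicate and then POST its data to whichever service URL comes back.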

The 'novelty' of SADI is in its perspective of what Web Services in bioinformatics “do”. While observing the BioMoby project, we came to realize that most (possibly all) Web Services in Bioinformatics have one thing in common – they uncover some biological relationship between the data that is input to the service, and the data that is returned. In SADI we merely require that Web Services are modeled to make this implicit behavior explicit – i.e., that a Web Service should consume a piece of data, and then explicitly create the RDF triple linking that input data and the resulting output, with a predicate describing the biological relationship between that input and that output. For example, a BLAST Web Service consumes a sequence and generates “hits”. Viewed from the perspective of the Semantic Web, the Service consumes a URI representing a sequence, outputs a URI representing a related sequence, and joins those two sequences by a predicate such as “hasSequenceSimilarityTo”. By modeling Web Services this way, we demonstrate (e.g. using the SHARE client: http://biordf.net/cardioSHARE/) that Web Services can be made to appear as transparent parts of the Semantic Web, thus removing the walls between the Semantic Web and Web Service worlds.
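The BLAST example can be sketched as follows, under clearly stated assumptions: the predicate URI, the sequence URIs and the `fake_blast` lookup are all invented for illustration, and no real similarity search or SADI library is involved. The point is only the shape of the output, one triple per hit linking input to output.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical predicate URI; a real service would use a published ontology term.
HAS_SIMILARITY = "http://example.org/predicates#hasSequenceSimilarityTo"


def fake_blast(sequence_uri: str) -> List[str]:
    # Stand-in for a real similarity search: returns canned "hits".
    canned = {
        "http://example.org/seq/A": [
            "http://example.org/seq/B",
            "http://example.org/seq/C",
        ],
    }
    return canned.get(sequence_uri, [])


def similarity_service(input_uris: Iterable[str]) -> List[Triple]:
    # Consume sequence URIs and make the input-output relationship explicit:
    # each hit becomes an (input, predicate, output) triple.
    return [(uri, HAS_SIMILARITY, hit)
            for uri in input_uris
            for hit in fake_blast(uri)]


triples = similarity_service(["http://example.org/seq/A"])
```

Because the result is ordinary RDF-shaped data, it can be merged directly into the graph the input came from, which is what lets the service look like part of the Semantic Web.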

The World Wide Web Consortium has recently adopted a standard called SAWSDL for the semantic annotation of Web Services. This standard provides a way to reference both external data models and schema mapping rules, encouraging interoperability regardless of the Web Service's original input and output schemas. The 2008 BioHackathon highlighted the need for a standards-compliant retrofit of existing Web services to BioMoby; therefore, a SAWSDL-compliant service proxy was implemented [Gordon P.M.K., Sensen C.W. (2008) Creating Bioinformatics Semantic Web Services from Existing Web Services: A real-world application of SAWSDL. In: Proceedings of the IEEE International Conference on Web Services (Beijing, China), September 23-26, 2008, pp. 608-614.]. By BioHackathon 2009, it had become obvious that manual maintenance of extra SAWSDL markup in Web Services' WSDL documents can be overly burdensome, especially because the original WSDL documents are usually auto-generated. To simplify SAWSDL creation and maintenance, existing Web Services can now be made BioMoby-compliant via a simple point-and-click interface (www.daggoo.net). This functionality will be extended to SADI services in order to facilitate full Semantic Web compliance of Web Services for bioinformatics service providers and future BioHackathon events.
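To make the kind of markup involved concrete, here is a minimal sketch that adds a SAWSDL `modelReference` attribute to an XML Schema element using Python's standard library. The schema fragment and the ontology URI are assumptions made up for the example; this is not the proxy or the daggoo.net tool described above, only an illustration of what a SAWSDL annotation looks like.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
SAWSDL_NS = "http://www.w3.org/ns/sawsdl"

# Keep readable prefixes when the tree is serialized again.
ET.register_namespace("xs", XSD_NS)
ET.register_namespace("sawsdl", SAWSDL_NS)

# A made-up schema fragment such as might appear in a WSDL <types> section.
SCHEMA = f"""<xs:schema xmlns:xs="{XSD_NS}">
  <xs:element name="sequence" type="xs:string"/>
</xs:schema>"""


def annotate(schema_xml: str, element_name: str, concept_uri: str) -> str:
    # Attach sawsdl:modelReference to every xs:element with the given name.
    root = ET.fromstring(schema_xml)
    for element in root.iter(f"{{{XSD_NS}}}element"):
        if element.get("name") == element_name:
            element.set(f"{{{SAWSDL_NS}}}modelReference", concept_uri)
    return ET.tostring(root, encoding="unicode")


# The ontology URI is hypothetical.
annotated = annotate(SCHEMA, "sequence",
                     "http://example.org/ontology#DNASequence")
```

Doing this by hand for every element of an auto-generated WSDL, and redoing it whenever the WSDL is regenerated, is the maintenance burden the paragraph above refers to.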

12 Chisato Yamasaki/Atsuko Yamaguchi - SatelliteManifestWebServiseGuidelines (in progress)

Manifest of Bio-Web Service (Bio-WS) guidelines

Web services can be very helpful for processing data from heterogeneous biological databases. However, several problems make it difficult for users to build workflows from web services: missing documentation for methods, incomplete compliance with the SOAP/WSDL specification in programming language libraries, and the variety of data formats.

TogoWS is designed to address these issues and improve usability as follows. (1) SOAP-based web services should ideally follow the open standard and be independent of programming languages. In practice, however, many services still require language-specific programming. TogoWS proxies these services to make them available from any programming language without difficulty. (2) TogoWS provides a simple REST interface for querying and retrieving entries in a unified manner. (3) Entries obtained from databases need to be parsed by the client program before they can be fully utilized. TogoWS provides functionality for parsing and converting various biological data formats without any installation on the user side.
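The appeal of a unified REST interface is that every retrieval or search reduces to building a URL. The sketch below illustrates this with an assumed path layout (`/entry/<database>/<id>[/<field>][.<format>]` and `/search/<database>/<query>`) and a placeholder host; both are assumptions for illustration and should be checked against the actual TogoWS documentation before use.

```python
from typing import Optional
from urllib.parse import quote

# Placeholder host; the real service address is not given in this report.
BASE_URL = "http://togows.example.org"


def _join(*segments: str) -> str:
    # Percent-encode each path segment so queries with spaces stay valid.
    return BASE_URL + "/" + "/".join(quote(s, safe="") for s in segments)


def entry_url(database: str, entry_id: str,
              field: Optional[str] = None,
              fmt: Optional[str] = None) -> str:
    # Assumed layout: /entry/<database>/<id>[/<field>][.<format>]
    segments = ["entry", database, entry_id]
    if field:
        segments.append(field)
    url = _join(*segments)
    return f"{url}.{fmt}" if fmt else url


def search_url(database: str, query: str) -> str:
    # Assumed layout: /search/<database>/<query>
    return _join("search", database, query)
```

With such a scheme a client in any language needs only an HTTP GET, for example on `entry_url("nucleotide", "J00231", fmt="fasta")`, with no SOAP toolkit or local parser installed.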

Although TogoWS supports many biological databases and the number of available databases is increasing, it is not practical for TogoWS to proxy all the major biological databases because there are simply too many of them. It is therefore preferable that users be able to combine the original web services without effort, provided those services follow a common design. To this end, we would like to establish a guideline for designing web services for biological databases.

Prior to BioHackathon 2009, we began discussions among Japanese database providers, including The University of Tokyo as the KEGG provider, Osaka University as the PDBj provider, the Computational Biology Research Center, the Biomedicinal Information Research Center and the Database Center for Life Science, on making the web services of these databases work together. The first guideline for designing web services was then proposed in Japanese in October 2008. We improved it to be applicable to various databases and easy to understand even for beginners to web services.

To propose the Bio-WS guidelines to BioHackathon 2009 participants and to the wider public, we translated the first version of the guidelines into English so that it could be refined at BioHackathon 2009. In brief, this manifest of the Bio-Web Service guidelines proposes standard specifications for REST and SOAP methods to search, get, or convert the IDs or formats of entries. It also proposes the format of query strings and the supported data types, as well as the preparation of sample code and documentation. The full description of the manifest is available at the following URL: http://hackathon2.dbcls.jp/wiki/GuidelineForWebService.

In summary, we proposed the Bio-WS guidelines to BioHackathon 2009 participants. Further discussion and refinement of the detailed specifications will be required before the guidelines can become an authorized international standard for Bio-Web Service development, and we will continue this activity. Future versions of the guidelines may incorporate aspects of Semantic Web Services.