Changes between Version 25 and Version 26 of SatelliteBigData

Timestamp: 2009/03/19 17:55:42
Author: severin

Data storage will not be a problem (if you have enough money). But downloading and/or manipulating big sequence data must be a problem for both the provider (us = DDBJ) and the user. Is there a possible solution? And I have a big question: do you really need and/or use the "Short Read Archive"? -- yaskaz

== Discussion ==

The purpose of this workshop was to discuss the approaches people use to handle their data. We could obviously not go into the very technical approaches that are applied to very large datasets such as the Short Read Archive, because that requires a different kind of expertise.

Three different levels in the data handling pipeline were discussed.

==== Storage ====

We listed the different types of storage solutions used in different fields according to the amount and type of data. A distinction was made between storing a lot of small objects versus storing a few huge objects. Issues like latency and variety of data influence this distinction as well.

Protein-protein interaction datasets typically consist of a relatively small number of small objects. This type of data requires no advanced storage systems or tweaks to common systems; a simple RDBMS will do (see the sketch below).
Data like genome sequences and assembly data involve larger objects (one genome sequence is about 3 Gb), but these are still easily manageable on a standard filesystem.
Really big objects such as the data from simulations [IS THIS CORRECT?] require specialized storage systems such as ZFS, Lustre or PVFS.

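As a minimal illustration of the "simple RDBMS will do" case, the sketch below keeps a small interaction set in SQLite from Python; the table, columns and example identifiers are invented for illustration rather than taken from any real interaction database.

{{{
import sqlite3

# Hypothetical minimal schema for a small protein-protein interaction dataset;
# no tuning beyond an ordinary index is needed at this scale.
conn = sqlite3.connect("ppi.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS interaction (
        protein_a  TEXT NOT NULL,
        protein_b  TEXT NOT NULL,
        method     TEXT,
        pubmed_id  TEXT
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_protein_a ON interaction(protein_a)")

conn.execute("INSERT INTO interaction VALUES (?, ?, ?, ?)",
             ("P04637", "Q00987", "yeast two-hybrid", "12345678"))
conn.commit()

# A plain indexed lookup is all a dataset of this size ever needs.
for row in conn.execute("SELECT * FROM interaction WHERE protein_a = ?", ("P04637",)):
    print(row)
conn.close()
}}}
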
In contrast to the above, diffraction results, microarray results or next-gen sequencing reads involve a largish number of objects, which become more difficult to query. They are typically still stored in an RDBMS, but this might require some tweaking that deviates from a normalized relational database model. Apart from obvious things such as creating good indices, further optimization can be gained by using as few joins as possible, and therefore organizing the data so that it can be stored in 2 or 3 tables (e.g. eeDB).

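A rough sketch of the kind of flattened, index-heavy layout meant here is given below (again using SQLite from Python); the two-table design only illustrates the "few tables, few joins" idea and is not the actual eeDB schema.

{{{
import sqlite3

conn = sqlite3.connect("expression.db")
conn.executescript("""
    -- Everything hangs off two wide tables so that a typical query
    -- needs at most one join; platform, sample, etc. are folded into
    -- plain columns instead of their own normalized tables.
    CREATE TABLE IF NOT EXISTS feature (
        feature_id  INTEGER PRIMARY KEY,
        chrom       TEXT,
        start_pos   INTEGER,
        end_pos     INTEGER,
        name        TEXT
    );
    CREATE TABLE IF NOT EXISTS expression (
        feature_id  INTEGER,
        experiment  TEXT,
        value       REAL
    );
    CREATE INDEX IF NOT EXISTS idx_feature_pos ON feature(chrom, start_pos);
    CREATE INDEX IF NOT EXISTS idx_expr_feature ON expression(feature_id);
""")

# A region query then touches the two tables with a single join.
rows = conn.execute("""
    SELECT f.name, e.experiment, e.value
    FROM feature f JOIN expression e ON e.feature_id = f.feature_id
    WHERE f.chrom = ? AND f.start_pos BETWEEN ? AND ?
""", ("chr1", 1000000, 2000000)).fetchall()
print(len(rows), "rows")
conn.close()
}}}
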
Several attendees are looking into new ways to store their data because they are hitting the ceiling of their storage capacity. Several technologies were mentioned, including OGSA-DAI, a grid-based solution where you can set up several databases that can then be queried as one. Other technologies like the cloud might also provide part of a solution. Amazon S3 and GoogleBase allow for storing very large amounts of data (large numbers of large objects) and are relatively cheap. Uploads to these systems are however very slow, which must be taken into account. In addition, these services are commercial and it might be dangerous to store non-public data there. A possible solution mentioned involves creating an encrypted data image and uploading that instead of the original dataset. Another issue is that these companies might in the future decide to increase their prices or even stop their activities. Using the cloud for data storage therefore means that you still need a local backup in case this happens.

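The "encrypted data image" idea could look roughly like the sketch below: the dataset is bundled and encrypted locally with GnuPG, and only the opaque blob is pushed to the cloud. The file names, bucket and choice of GnuPG are assumptions for illustration; the actual upload step depends on whichever client the service provides.

{{{
import subprocess

# Hypothetical file names: the raw data never leaves the site unencrypted.
dataset_tar = "reads_2009_03.tar"
encrypted_image = dataset_tar + ".gpg"

# Symmetric encryption with GnuPG; the passphrase stays local.
subprocess.run(["gpg", "--symmetric", "--cipher-algo", "AES256",
                "--output", encrypted_image, dataset_tar], check=True)

# The encrypted image (not the raw dataset) is what gets uploaded, e.g.
#   aws s3 cp reads_2009_03.tar.gpg s3://my-bucket/reads_2009_03.tar.gpg
# and a local backup is kept in case the service raises prices or disappears.
}}}
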
[Jessica: can you add your vertical red line to this as well?]

==== Querying ====

Several approaches to make the data accessible to others were discussed. For smaller datasets regular SQL can be used to get at the data (for example, the Ensembl MySQL server on ensembldb.ensembl.org). If the database schema has to be tweaked to handle larger datasets (e.g. using a minimal number of tables, as in eeDB), an API becomes necessary. A RESTful API where an object can be retrieved by URL was mentioned as useful in this case. For even larger datasets you want to try and limit the types of queries people can run. This way you can build a toolkit that answers this limited set of questions, but optimize that toolkit so that latency is reduced.

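The access levels could be tried out along these lines; the Ensembl connection uses the public anonymous account that Ensembl advertises for ensembldb.ensembl.org (a MySQL client such as pymysql is assumed), while the REST URL is purely hypothetical and only illustrates the object-per-URL pattern.

{{{
import json
import urllib.request
import pymysql  # any MySQL client library would do

# 1) Small/medium data: plain SQL against a public server such as Ensembl's.
conn = pymysql.connect(host="ensembldb.ensembl.org", user="anonymous", port=3306)
with conn.cursor() as cur:
    cur.execute("SHOW DATABASES LIKE 'homo_sapiens_core%'")
    for (db_name,) in cur.fetchall():
        print(db_name)
conn.close()

# 2) Tweaked schemas: a RESTful API where every object has its own URL
#    (the endpoint below is made up; it only shows the access pattern).
url = "http://example.org/eedb/feature/12345?format=json"
with urllib.request.urlopen(url) as response:
    feature = json.load(response)
print(feature)
}}}
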
Streaming

{{{
  SQL -> API -> toolkit
   ^      |
   |      |
   +------+
}}}

==== Processing ====

Apart from storage and getting the data out again, we briefly discussed how to actually work with these data, i.e. running scripts on them. It is clear that shipping large datasets around should be avoided; it is often much simpler to bring the software to the data instead.

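A minimal sketch of "bring the software to the data", assuming the dataset sits on a remote host reachable over ssh; the host name, paths and script are all hypothetical.

{{{
import subprocess

host = "data-server.example.org"          # hypothetical host holding the data
remote_data = "/data/runs/run42/reads.fastq"
script = "count_reads.py"                 # small analysis script

# Ship the tiny script to where the huge data already sits...
subprocess.run(["scp", script, f"{host}:/tmp/{script}"], check=True)

# ...run it next to the data, and bring back only the small result.
result = subprocess.run(["ssh", host, f"python /tmp/{script} {remote_data}"],
                        check=True, capture_output=True, text=True)
print(result.stdout)
}}}
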
The main message from this discussion was that each database should be optimized for its own data and will therefore be completely different from the next. However, it is the API that can provide interoperability, if we can come up with a common set of terms for querying. We believe that current web technologies, like mashups based on very simple APIs such as Flickr's, can be an alternative to grid-based approaches for integrating data.

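As a sketch of the mashup idea, the fragment below pulls JSON from two simple, per-object REST services and merges the answers client-side; both URLs and the response fields are invented for illustration.

{{{
import json
import urllib.request

def fetch_json(url):
    """Fetch a JSON document from a simple REST-style endpoint."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# Two hypothetical providers exposing simple per-object URLs.
gene = fetch_json("http://genes.example.org/api/gene/BRCA2?format=json")
expr = fetch_json("http://expression.example.org/api/gene/BRCA2?format=json")

# The "mashup": merge the two answers on the shared gene identifier,
# without any grid middleware in between.
report = {"gene": gene.get("symbol"),
          "location": gene.get("location"),
          "expression": expr.get("samples")}
print(json.dumps(report, indent=2))
}}}
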
== Work group discussion topics ==

next gen sequencer WG (big data and seq analysis)
     
- NEED data providers to create complete meta-data descriptions.

IN THE END the goal is to find biology. Having access to the individual data elements is critical, and this cannot be just locked away inside files that cannot be internally accessed.

== TODOs ==