Changes between Version 33 and Version 34 of SatelliteBigData

Timestamp: 2009/03/20 13:46:44
Author: jan.aerts

== Topics ==

Removed (v33): As more and more projects are spewing out datasets, we might have to look at new paradigms for working with them. The conventional way of storing everything in a database and querying it with SQL has become challenging or simply impossible. Data are now more often stored in their original formats.

Added (v34): As more and more projects are spewing out datasets, we might have to look at new ways of working with them. Really big datasets (e.g. SRA) require technical expertise beyond what we can provide, but we can still try to learn from each other how to handle our own datasets.

The object of this meeting is to exchange ideas on this topic and discuss possible solutions or practices.
[...]

 * Jan Aerts
 * Jessica Severin

== Date ==

[...]
In contrast to the above, diffraction results, microarray results, or next-gen sequencing reads involve a large number of objects, which makes them more difficult to query. They are typically still stored in an RDBMS but might require some tweaking that departs from a normalized relational database model. Apart from obvious measures such as creating good indices, further optimization can be gained by using as few joins as possible, and therefore organizing the data so that it can be stored in 2 or 3 tables/indexes (e.g. eeDB).
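As a purely illustrative sketch of such a two-or-three-table layout with good indices; the table and column names below are hypothetical and are not taken from eeDB or any schema discussed at the meeting:

{{{
#!python
import sqlite3

conn = sqlite3.connect("reads.db")
cur = conn.cursor()

# One narrow table for the genomic features themselves ...
cur.execute("""
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        chrom      TEXT,
        start      INTEGER,
        stop       INTEGER,
        strand     TEXT
    )""")

# ... and one table of per-experiment observations, so a typical
# query touches at most two tables and needs a single join.
cur.execute("""
    CREATE TABLE observation (
        feature_id    INTEGER,
        experiment_id INTEGER,
        value         REAL
    )""")

# Good indices on the columns used for range queries and for the join.
cur.execute("CREATE INDEX idx_feature_loc ON feature (chrom, start, stop)")
cur.execute("CREATE INDEX idx_obs_feature ON observation (feature_id)")
conn.commit()
}}}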

Removed (v33): Several attendees are looking into new ways to store their data because they are hitting the ceiling of their storage capacity. Several technologies were mentioned, including OGSADAI, a grid-based solution where you can set up several databases that can then be queried as one. Other technologies like the cloud might also provide part of a solution. Amazon S3 and GoogleBase allow for storing very large amounts of data (large numbers of large objects) and are relatively cheap. Uploads to these systems are, however, very slow, which must be taken into account. In addition, these services are commercial and it might be risky to store non-public data there. A possible solution mentioned involves uploading an encrypted data image instead of the original dataset. Another issue is that these companies might in the future decide to increase their prices or even discontinue their services. Using the cloud for data storage therefore means that you still need a local backup in case this happens.

Added (v34): Several attendees are looking into new ways to store their data because they are hitting the ceiling of their storage capacity. Several technologies were mentioned, including [http://www.ogsadai.org.uk/ OGSADAI], a grid-based solution where you can set up several databases that can then be queried as one. Other technologies like the cloud might also provide part of a solution. [http://aws.amazon.com/s3/ Amazon S3] and [http://www.google.com/base/ GoogleBase] allow for storing very large amounts of data (large numbers of large objects) and are relatively cheap. Uploads to these systems are, however, very slow, which must be taken into account. In addition, these services are commercial and it might be risky to store non-public data there. A possible solution mentioned involves uploading an encrypted data image instead of the original dataset. Another issue is that these companies might in the future decide to increase their prices or even discontinue their services. Using the cloud for data storage therefore means that you still need a local backup in case this happens.

[Jessica: can you add your vertical red line to this as well?]
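As a purely illustrative sketch of the encrypted-image idea mentioned above; the bucket and file names are hypothetical, and it assumes the present-day boto3 client and a local GnuPG installation rather than any tool discussed at the meeting:

{{{
#!python
import subprocess
import boto3

DATASET = "experiment_data.tar"       # hypothetical local data image
ENCRYPTED = DATASET + ".gpg"
BUCKET = "my-institute-offsite-copy"  # hypothetical S3 bucket name

# Encrypt the image locally so that only ciphertext leaves the site;
# gpg will prompt for (or read) a symmetric passphrase.
subprocess.run(["gpg", "--symmetric", "--output", ENCRYPTED, DATASET],
               check=True)

# Upload the encrypted image. A local copy is still kept, in case the
# provider raises prices or discontinues the service.
boto3.client("s3").upload_file(ENCRYPTED, BUCKET, ENCRYPTED)
}}}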
[...]
==== Querying ====

Removed (v33): Several approaches to making the data accessible to others were discussed. For smaller datasets, regular SQL can be used to get to the data (for example, the Ensembl MySQL server on ensembldb.ensembl.org). If the database schema has to be tweaked to accommodate larger datasets (e.g. using a minimal number of tables, as in eeDB), an API becomes necessary. A RESTful API, where an object can be retrieved by URL, was mentioned as useful in this case. For even larger datasets you want to try to limit the types of queries people can run. That way you can build a toolkit for asking this limited set of questions and optimize it so that latency is reduced.

Added (v34): Several approaches to making the data accessible to others were discussed. For smaller datasets, regular SQL can be used to get to the data (for example, the Ensembl MySQL server on ensembldb.ensembl.org). If the database schema has to be tweaked to accommodate larger datasets (e.g. using a minimal number of tables, as in eeDB), an API becomes necessary. A RESTful API, where a text representation of an object can be retrieved by URL, was mentioned as useful in this case (e.g. http://www.example.com/genes/BRCA2;format=bed). For even larger datasets you want to try to limit the types of queries people can run. That way you can build a toolkit for asking this limited set of questions and optimize it so that latency is reduced.
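A small sketch of the client side of such a RESTful interface; the URL is the illustrative example from the paragraph above (not a real service), and the minimal BED parsing is only meant to show that a plain-text representation is easy to consume:

{{{
#!python
import urllib.request

# Illustrative URL only; no such service is implied to exist.
url = "http://www.example.com/genes/BRCA2;format=bed"

# Fetch the text representation of the object over plain HTTP.
with urllib.request.urlopen(url) as response:
    bed_text = response.read().decode("utf-8")

# BED lines are tab-separated: chrom, start, end, [name, ...]
for line in bed_text.splitlines():
    fields = line.split("\t")
    chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
    print(chrom, start, stop)
}}}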

Streaming
{{{
  [SQL, File indexes] -> API -> toolkit -> [web, clients, mashup-analyzers]
   ^                      |
   |                      |
   +----------------------+
}}}
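To make the loop in the diagram concrete, here is a hypothetical sketch (the URL, paging parameters, and function name are invented and not part of any existing toolkit) of a toolkit function that streams results from the API page by page, so that the web clients and mashup-analyzers downstream never hold the whole dataset in memory:

{{{
#!python
import json
import urllib.request

def stream_features(base_url, page_size=1000):
    """Yield features one at a time, fetching them from the API in pages."""
    offset = 0
    while True:
        # Hypothetical paging parameters; a real API would define its own.
        url = "%s?offset=%d&limit=%d" % (base_url, offset, page_size)
        with urllib.request.urlopen(url) as response:
            page = json.load(response)
        if not page:
            break              # no more data to fetch
        for feature in page:
            yield feature      # hand each result downstream as it arrives
        offset += page_size

# Example use, without ever materialising the full dataset:
#   for feature in stream_features("http://www.example.com/features"):
#       do_something_with(feature)
}}}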

==== Processing ====