Changes between Version 33 and Version 34 of SatelliteBigData
Timestamp: 2009/03/20 13:46:44
SatelliteBigData
== Topics ==

As more and more projects are spewing out datasets we might have to look at new ways of working with them. Really big datasets (e.g. SRA) require technical expertise other than what we can provide. But we can still try and learn from each other on how to handle our own datasets.

The object of this meeting is to exchange ideas on this topic and discuss possible solutions or practices.

…

 * Jan Aerts
 * Jessica Severin

== Date ==

…

In contrast to the above, diffraction results, microarray results or next-gen sequencing reads involve a largish number of objects, which become more difficult to query. They are typically still stored in an RDBMS but might require some tweaking that digresses from a normalized relational database model. Apart from obvious things to do such as creating good indices, further optimization can be found by using as few joins as possible, and therefore organizing the data so that it can be stored in 2 or 3 tables/indexes (e.g. eeDB).
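As a minimal sketch of the few-tables / good-indices idea (the schema, column names and values here are hypothetical, with SQLite standing in for whatever RDBMS is actually used):

```python
# Hypothetical denormalized layout: one wide "feature" table plus a
# composite index, so region queries need no joins at all.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature (
        chrom   TEXT,
        start   INTEGER,
        end     INTEGER,
        kind    TEXT,    -- e.g. 'read', 'probe', 'gene'
        payload TEXT     -- everything else, denormalized
    )
""")
# The composite index is what makes region queries fast without joins.
conn.execute("CREATE INDEX idx_feature_region ON feature (chrom, start)")

conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?, ?, ?)",
    [("chr13", 32889611, 32973805, "gene", "BRCA2"),
     ("chr13", 32900000, 32900036, "read", "read_0001")],
)

# Single-table region query: no joins needed.
rows = conn.execute(
    "SELECT kind, payload FROM feature "
    "WHERE chrom = ? AND start >= ? AND end <= ? ORDER BY start",
    ("chr13", 32880000, 32980000),
).fetchall()
print(rows)
```

The trade-off is the usual one: redundancy and a wide payload column in exchange for queries that touch one table and one index.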
Several attendees are looking into new ways to store their data because they are hitting the ceiling of their storage capacity. Several technologies were mentioned, including [http://www.ogsadai.org.uk/ OGSADAI], a grid-based solution where you can set up several databases that can then be queried as one. Other technologies like the cloud might also provide part of a solution. [http://aws.amazon.com/s3/ Amazon S3] and [http://www.google.com/base/ GoogleBase] allow for storing very large amounts of data (large numbers of large objects) and are relatively cheap. The upload to these systems is however very slow, which must be taken into account. In addition, these services are commercial and it might be dangerous to store non-public data there. A possible solution mentioned involves creating an encrypted data image to be uploaded instead of the original dataset. Another issue is that these companies might in the future decide to increase their prices or even stop their activities. Using the cloud for data storage therefore means that you still need a local backup in case this happens.

[Jessica: can you add your vertical red line to this as well?]

…

==== Querying ====
Several approaches to make the data accessible to others were discussed. For smaller datasets regular SQL can be used to get to the data (example: the Ensembl MySQL server on ensembldb.ensembl.org). If the database schema has to be tweaked to allow larger datasets (e.g. using a minimal number of tables, as in eeDB), an API becomes necessary. A RESTful API, where a text representation of an object can be retrieved by URL, was mentioned as useful in this case (e.g. http://www.example.com/genes/BRCA2;format=bed). For even larger datasets you want to try and limit the types of queries people can run. This way you can build a toolkit to ask this limited number of questions, but optimize that toolkit so that latency is reduced.

Streaming
{{{
[SQL, File indexes] -> API -> toolkit -> [web, clients, mashup-analyzers]
         ^                                           |
         |                                           |
         +-------------------------------------------+
}}}

==== Processing ====
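Returning to the retrieval-by-URL idea from the Querying section above, the dispatch a server has to do is small and fixed. A toy sketch (the URL scheme and helper name are hypothetical, modelled on the example URL):

```python
# Hypothetical parser for URLs of the form
#   http://host/<collection>/<id>;format=<fmt>
# urlsplit is used rather than urlparse, because urlparse would strip the
# ';format=bed' matrix parameter out of the path.
from urllib.parse import urlsplit

def parse_object_url(url):
    """Split a retrieval URL into (collection, object id, format)."""
    path = urlsplit(url).path                 # '/genes/BRCA2;format=bed'
    collection, _, rest = path.lstrip("/").partition("/")
    obj_id, _, matrix = rest.partition(";")
    fmt = "txt"                               # hypothetical default format
    if matrix.startswith("format="):
        fmt = matrix[len("format="):]
    return collection, obj_id, fmt

print(parse_object_url("http://www.example.com/genes/BRCA2;format=bed"))
# -> ('genes', 'BRCA2', 'bed')
```

Because every object has a stable URL, results are also cacheable and linkable, which helps keep latency down for the limited set of queries the toolkit supports.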