Changes between Version 67 and Version 68 of SatelliteBigData

Show
Ignore:
Timestamp:
2009/03/20 15:09:11 (15 years ago)
Author:
severin
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SatelliteBigData

    v67 v68  
    6868Apart from obvious things to do such as creating good indices, further optimization can be found by using as few joins as possible and therefore organizing the data so that it can be stored in 2 or 3 tables/indexes (e.g. eeDB). Another alternative could be the usage of specialized storage systems, like the ones used in high energy physics experiments or astronomy (for instance [http://www.hdfgroup.org/HDF5/ HDF5]). 
    6969 
    70 RIKEN OSC-LSA [http://www.osc.riken.jp/] is producing lots of data, but this data must be managed, manipulated, and mined for biology before it can be published and released to the public.  EdgeExpressDB (eeDB) was developed during FANTOM4 project and is now being used for in-house big data management and visualization of big datasets.  eeDB is effectively an object-database which is implemented as an API and webservices. The system is currently being ported to C and file indexes, and based on the prototype code, we are expecting around a 20x-100x performance boost.  The current version of the eeDB API toolkit and webservices are written in perl with a narrow/deep mysql snowflake schema. The perl/mysql implementation of the eeDB API-toolkit can manipulate billions of short-read data and 100s of microarray expression experiment datasets (each with 10000s of probes) for our internal research purposes and is proving to scale very well. eeDB works with node and network, sequence tag, mapping, and expression data at the level of billions of elements very easily.  Queries can access individual objects, edges, and work with streams or sets of objects queried by regions, node, or networks.  
     70RIKEN OSC-LSA [http://www.osc.riken.jp/] is producing lots of data, but this data must be managed, manipulated, and mined for biology before it can be published and released to the public.  EdgeExpressDB (eeDB) was developed during the FANTOM4 project and is now being used for in-house big data management and visualization of big datasets.  eeDB is effectively an object-database which is implemented as an API and webservices. The system is currently being ported to C and file indexes, and based on the prototype code, we are expecting around a 20x-100x performance boost.  The current version of the eeDB API toolkit and webservices are written in perl with a narrow/deep mysql snowflake schema. The perl/mysql implementation of the eeDB API-toolkit can manipulate billions of short-read data and 100s of microarray expression experiment datasets (each with 10000s of probes) for our internal research purposes and is proving to scale very well. eeDB works with node and network, sequence tag, mapping, and expression data at the level of billions of elements very easily.  Queries can access individual objects, edges, and work with streams or sets of objects queried by regions, node, or networks.  
    7171 
    7272Several attendees are looking into new ways to store their data because they are hitting the ceiling of their storage capacity. Several technologies were mentioned, including [http://www.ogsadai.org.uk/ OGSADAI] which is a grid-based solution where you can setup several databases that then can be queried as one. Other technologies like the cloud might also provide part of a solution. [http://aws.amazon.com/s3/ Amazon S3] and [http://www.google.com/base/ GoogleBase] allow for storing very large amounts of data (large numbers of large objects) and are relatively cheap. The upload to these systems is however very slow which must be taken into account. In addition, these services are commercial and it might be dangerous to store non-public data there. A possible solution mentioned involves creating an encrypted data image to be uploaded instead of the original dataset. Another issue is that these companies might in the future decide to increase their prices or even stop their activities. Using the cloud for data storage therefore means that you still need a local backup in case this happens.