Changes between Version 76 and Version 77 of SatelliteBigData
- Timestamp:
- 2009/03/20 15:39:14 (16 years ago)
SatelliteBigData
In addition to the storage problem, there is the issue of actually submitting our new data to public resources when publishing. Our datasets are getting very big, and FTP is often not good enough. This forces many of us to submit data on DVD via FedEx, for example, which is not optimal.

In systems biology and transcriptomics (like the work of RIKEN-OSC), there is a need to collect lots of small data, but the process of network analysis and molecular function adds additional structure onto this primary data. Since this is still an emerging area of research, we need to keep the primary data available, because many of our hypotheses and paradigms may change over time. If we compress this data and throw it away, we limit our future ability to re-evaluate our theories in the face of both new and old data. At the same time, as our analyses become more sophisticated, we may need to work with larger and larger portions of this primary data at a time. This leads to a situation where we are working with both many small things and many large things, making this domain space big on both axes.

RIKEN OSC-LSA [http://www.osc.riken.jp/] is producing lots of data, but this data must be managed, manipulated, and mined for biology before it can be published and released to the public. EdgeExpressDB (eeDB) was developed during the FANTOM4 project and is now being used in-house for the management and visualization of big datasets. eeDB is effectively an object database implemented as an API and web services. The current version of the eeDB API toolkit and web services is written in Perl with a narrow/deep MySQL snowflake schema. The system is currently being ported to C and file indexes, and based on the prototype code we are expecting around a 20x-100x performance boost (other backends, like [http://www.hdfgroup.org/HDF5/ HDF5] mentioned above, will also be evaluated). The Perl/MySQL implementation can manipulate billions of short reads and hundreds of microarray expression experiment datasets (each with tens of thousands of probes) for our internal research purposes and is proving to scale very well. eeDB works easily with node and network, sequence tag, mapping, and expression data at the level of billions of elements. Queries can access individual objects and edges, and can work with streams or sets of objects queried by region, node, or network.
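
To make the idea of streaming objects by region more concrete, here is a minimal sketch in Python. It is not eeDB code and does not reflect the eeDB API: the names `Feature`, `RegionIndex`, and `stream_region` are hypothetical, the data is held in memory rather than in MySQL or file indexes, and coordinates are assumed to be inclusive. It only illustrates one common way to answer "which objects overlap this region?" with a sorted index and a lazy stream of results.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for a mapped object such as a sequence tag."""
    name: str
    chrom: str
    start: int  # inclusive
    end: int    # inclusive
    expression: float = 0.0


class RegionIndex:
    """Illustrative in-memory index; NOT the eeDB implementation."""

    def __init__(self, features: Iterable[Feature]) -> None:
        self._by_chrom: Dict[str, List[Feature]] = defaultdict(list)
        for feature in features:
            self._by_chrom[feature.chrom].append(feature)
        self._starts: Dict[str, List[int]] = {}
        self._max_len: Dict[str, int] = {}
        for chrom, feats in self._by_chrom.items():
            # Sort once so that each query can use binary search.
            feats.sort(key=lambda f: (f.start, f.end))
            self._starts[chrom] = [f.start for f in feats]
            self._max_len[chrom] = max(f.end - f.start for f in feats)

    def stream_region(self, chrom: str, start: int, end: int) -> Iterator[Feature]:
        """Lazily yield every feature overlapping [start, end] on chrom."""
        feats = self._by_chrom.get(chrom)
        if not feats:
            return
        starts = self._starts[chrom]
        # An overlapping feature must start no earlier than
        # (start - longest feature) and no later than end.
        lo = bisect_left(starts, start - self._max_len[chrom])
        hi = bisect_right(starts, end)
        for feature in feats[lo:hi]:
            if feature.end >= start:
                yield feature


index = RegionIndex([
    Feature("tag1", "chr1", 100, 120, 3.0),
    Feature("tag2", "chr1", 150, 400, 1.0),
    Feature("tag3", "chr1", 500, 520, 2.0),
    Feature("tag4", "chr2", 100, 120, 5.0),
])

# Consume the stream one object at a time, without building a full result set.
total = sum(f.expression for f in index.stream_region("chr1", 390, 510))
```

Because `stream_region` is a generator, a caller can aggregate over a very large region without materializing every object at once, which is the property that matters when a single query may touch a large portion of the primary data.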