Changes between Version 47 and Version 48 of SatelliteBigData

Timestamp:
2009/03/20 14:15:12 (15 years ago)
Author:
plindenbaum

  • SatelliteBigData

    v47 v48  
    64 64 Really big objects such as the data from simulations [IS THIS CORRECT?] require specialized storage systems such as [http://opensolaris.org/os/community/zfs/ ZFS], [http://wiki.lustre.org/index.php?title=Main_Page Lustre], [http://oss.sgi.com/projects/xfs/ XFS], [http://www.pvfs.org/ PVFS2] or the upcoming Linux filesystem [http://btrfs.wiki.kernel.org/index.php/Main_Page Btrfs].
    65 65
    66 In contrast to the above, diffraction results, microarray results or next-gen sequencing reads involve a large number of objects, which makes them more difficult to query. They are typically still stored in an RDBMS but might require some tweaking that departs from a normalized relational database model. Apart from obvious steps such as creating good indices, further optimization is possible by using as few joins as possible, which means organizing the data so that it can be stored in 2 or 3 tables/indexes (e.g. eeDB). Another alternative is to use specialized storage systems like those used in high energy physics experiments or astronomy (for instance [http://www.hdfgroup.org/HDF5/ HDF5]).
     66 In contrast to the above, diffraction results, microarray results or next-gen sequencing reads involve a large number of objects, which makes them more difficult to query. They are typically still stored in an RDBMS but might require some tweaking that departs from a normalized relational database model, for example by using databases based on a key/value model (e.g. [http://www.oracle.com/technology/products/berkeley-db/index.html BerkeleyDB], [http://tokyocabinet.sourceforge.net/index.html Tokyo Cabinet], BigTable, [http://hadoop.apache.org/core/ Hadoop]).
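As a rough illustration of the key/value model mentioned above, the following sketch stores sequencing reads by identifier using Python's standard `dbm` module. The module stands in for stores such as BerkeleyDB or Tokyo Cabinet; it is not their API. The function names, the read identifiers and the file name are hypothetical and chosen only for this example.

```python
import dbm
import os
import tempfile


def store_reads(path, reads):
    """Write {read_id: sequence} pairs into a key/value file at `path`."""
    with dbm.open(path, "c") as db:  # "c" creates the file if it is missing
        for read_id, sequence in reads.items():
            # dbm stores keys and values as bytes
            db[read_id.encode("ascii")] = sequence.encode("ascii")


def fetch_read(path, read_id):
    """Return the sequence stored under `read_id`, or None if it is absent."""
    with dbm.open(path, "r") as db:
        value = db.get(read_id.encode("ascii"))
    return None if value is None else value.decode("ascii")


# Hypothetical usage: two short reads stored in a temporary directory.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "reads")
store_reads(db_path, {"read_0001": "ACGTACGT", "read_0002": "TTGACCA"})
print(fetch_read(db_path, "read_0001"))
```

A lookup by key needs no join and no query planner, which is the main attraction of this model. The trade-off is that any query other than "fetch by identifier" has to be built by hand.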
     67 Apart from obvious steps such as creating good indices, further optimization is possible by using as few joins as possible, which means organizing the data so that it can be stored in 2 or 3 tables/indexes (e.g. eeDB). Another alternative is to use specialized storage systems like those used in high energy physics experiments or astronomy (for instance [http://www.hdfgroup.org/HDF5/ HDF5]).
    67 68
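The "few joins, good indices" advice can be sketched with Python's built-in `sqlite3` module. Everything in this example is an assumption made for illustration: the single wide `feature` table, its column names and the sample rows are hypothetical and do not reflect the actual eeDB schema.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory database with one denormalized, indexed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE feature (
               id        INTEGER PRIMARY KEY,
               sample    TEXT,
               chrom     TEXT,
               start_pos INTEGER,
               end_pos   INTEGER,
               score     REAL
           )"""
    )
    # One composite index that matches the region query below
    conn.execute("CREATE INDEX idx_feature_region ON feature (chrom, start_pos)")
    conn.executemany(
        "INSERT INTO feature (sample, chrom, start_pos, end_pos, score) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def query_region(conn, chrom, lo, hi):
    """Return features on `chrom` overlapping [lo, hi], without any join."""
    cur = conn.execute(
        "SELECT sample, start_pos, end_pos, score FROM feature "
        "WHERE chrom = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (chrom, hi, lo),
    )
    return cur.fetchall()


# Hypothetical sample data: (sample, chrom, start, end, score)
conn = build_db([
    ("s1", "chr1", 100, 200, 0.5),
    ("s1", "chr1", 500, 600, 0.9),
    ("s2", "chr2", 100, 200, 0.1),
])
print(query_region(conn, "chr1", 150, 550))
```

Repeating the sample name in every row wastes some space compared with a normalized schema, but the region query touches one table and one index only.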
    68 69 Several attendees are looking into new ways to store their data because they are hitting the ceiling of their storage capacity. Several technologies were mentioned, including [http://www.ogsadai.org.uk/ OGSADAI], a grid-based solution in which you can set up several databases that can then be queried as one. Other technologies such as the cloud might also provide part of a solution. [http://aws.amazon.com/s3/ Amazon S3] and [http://www.google.com/base/ GoogleBase] allow for storing very large amounts of data (large numbers of large objects) and are relatively cheap. Uploading to these systems is very slow, however, which must be taken into account. In addition, these services are commercial, and it might be dangerous to store non-public data there. One possible solution mentioned is to create an encrypted data image and upload that instead of the original dataset. Another issue is that these companies might in the future decide to increase their prices or even stop their activities. Using the cloud for data storage therefore means that you still need a local backup in case this happens.