Satellite meeting for handling next-generation datasets
As more and more projects are spewing out datasets we might have to look at new ways of working with them. Really big datasets (e.g. SRA) require technical expertise other than what we can provide. But we can still try and learn from each other on how to handle our own datasets.
The object of this meeting is to exchange ideas on this topic and discuss possible solutions or practices.
- Next-gen sequencing
- interaction networks and systems biology on-top of this big data
- Cloud storage/compute?
- Jan Aerts
- Jessica Severin
- 18th March, 2009
- Meeting Room (1F)
- Jan Aerts
- Jessica Severin
- José María Fernández
- Todd Harris
- Yunsun Nam
- Pierre Lindenbaum
- Keun-Joon Park
- Raoul Jean Pierre Bonnal
- Yasukazu "yaskaz" Nakamura (yaskaz@…)
- Takeshi Kawashima
- Shuichi Kawashima
- Fumikazu Konishi
Data storage will not be a problem (if you have enough money). But downloading and/or manipulating Big sequence data must be problem for data producers (Sanger, RIKEN OSC), data distribution providers (NCBI, EBI/SRA, DDBJ) and end users after the data is made public. Is there a possible solution?
And I have a big question: Do you really need and/or use "Short Read Archive?" -- yaskaz
- yes, there is interesting biology in there and comparing our new experiments against existing knowledge is needed, but currently this requires us to download data out of SRA and into our local integration systems - jessica
- yes and no: No, for the short reads where the position on a reference seq is known you could just store the position rather than the entire read. Yes, imagine a user find a new unknown SNP at a given position: he will want to get the sequences/quality(?) of all the reads overlapping the mutation for this assembly as well as for the other individuals' genome. - Pierre
- this is only true for genome resquencing. for RNA->genome mapping data you do want the read since it can capture RNA modifications which are not the same as a SNP -- jessica
- not everything maps to the reference genome. this un-mapped data may be noise or errors, or it may be interesting biology.
The purpose of this workshop was to discuss the approaches people use to handle their data. We could obviously not go into the very technical approaches as are being applied to very large datasets such as the Short Read Archive because that requires a different expertise.
Three different levels in the data handling pipeline were discussed.
We listed the different types of storage solution used in different fields according to the amount and type of data. A distinction was made between storing a lot of small objects versus storing a few huge objects. Issues like latency and variety of data influence this distinction as well.
Protein-protein interaction datasets typically consist of a relatively small number of small objects. This type of data requires no advanced storage systems or tweaks to common systems; a simple RDBMS will do. Data like genome sequences and assembly data involve larger objects (1 genome sequence -> 3Gb), but are still easily manageable on a standard filesystem. Really big objects such as the data from simulations [IS THIS CORRECT?] require specialized storage systems such as XFS, ZFS, Lustre, PVFS2 or future Linux filesystem Brtfs.
In contrast to the above, diffraction results, microarray results or next-gen sequencing reads involve a largish number of objects which become more difficult to query. They are typically still stored in RDBMS but might require some tweaking that digresses from a normalized relational database model, for example databases based on a key/value model (e.g. BerkeleyDB, Tokyo Cabinet, BigTable, Hadoop ). Apart from obvious things to do such as creating good indices, further optimization can be found by using as few joins as possible and therefore organizing the data so that it can be stored in 2 or 3 tables/indexes (e.g. eeDB). Another alternative could be the usage of specialized storage systems, like the ones used in high energy physics experiments or astronomy (for instance HDF5).
Several attendees are looking into new ways to store their data because they are hitting the ceiling of their storage capacity. Several technologies were mentioned, including OGSADAI which is a grid-based solution where you can setup several databases that then can be queried as one. Other technologies like the cloud might also provide part of a solution. Amazon S3 and GoogleBase allow for storing very large amounts of data (large numbers of large objects) and are relatively cheap. The upload to these systems is however very slow which must be taken into account. In addition, these services are commercial and it might be dangerous to store non-public data there. A possible solution mentioned involves creating an encrypted data image to be uploaded instead of the original dataset. Another issue is that these companies might in the future decide to increase their prices or even stop their activities. Using the cloud for data storage therefore means that you still need a local backup in case this happens.
In addition to the storage problem there is the issue of actually submitting our new data to public resources when publishing. Our datasets are getting very big and FTP often is not good enough. This forces many of us to have to submit data on DVD via FedEx? for example. This is not optimal.
In systems biology and transcriptomics (like the work of RIKEN-OSC), there is a need to collect lots of small data, but the process of network analysis and molecular function adds aditional structure onto this primary data. Since this is still an emerging area of research, we need to keep the primary data available since many of our hypotheses and paradigms many change over time. If we would compress this data and throw it away, then we are limiting our future ability to re-evaluate our theories in the face of both new and old data. But at the same time as our analyses become more sophisticated, we may need to work with larger and larger portions of this primary data at a time. This leads to a situation where we are working with both many small things, and many large things making this domain space big on both axes.
RIKEN OSC-LSA http://www.osc.riken.jp/ is producing lots of data, but this data must be managed, manipulated, and mined for biology before it can be published and released to the public. EdgeExpressDB (eeDB) was developed during the FANTOM4 project and is now being used for in-house big data management and visualization of big datasets. eeDB is effectively an object-database which is implemented as an API and webservices. The system is currently being ported to C and file indexes, and based on the prototype code, we are expecting around a 20x-100x performance boost (but other backends like HDF5 mentioned above will be evaluated). The current version of the eeDB API toolkit and webservices are written in perl with a narrow/deep mysql snowflake schema. The perl/mysql implementation of the eeDB API-toolkit can manipulate billions of short-read data and 100s-1000s of microarray expression experiment datasets (each with 10000s of probes) for our internal research purposes and is proving to scale very well. eeDB works with node and network, sequence tag, mapping, and expression data at the level of billions of elements very easily. Queries can access individual objects, edges, and work with streams or sets of objects queried by regions, node, or networks.
Several approaches to make the data accessible to others were discussed. For smaller datasets regular SQL can be used to get to the data (example: the Ensembl MySQL server on ensembldb.ensembl.org). For complex data however a fully normalized database can become difficult to query because you might need a lot of joins. In this case an API becomes necessary or the database schema can be rewritten in a [@Arek: what's that called?] version. A webservice API where a text-representation of an object can be retrieved by URL was mentioned as useful (e.g. http://www.example.com/genes/BRCA2;format=bed). For even larger datasets you want to try and limit the types of queries people can run. This way you can build a toolkit to ask this limited number of questions, but optimize that toolkit so that latency is reduced.
[SQL, File indexes] -> API -> toolkit -> [web, clients, mashup-analyzers] ^ | | | +----------------------+
Currently the available public resources like SRA, GEO, ArrayExpress?, Cibex are only providing query facilities on the metadata of the experiments surrounding the data. The data is available as files to download (often in the original format) but they do not provide facilities to externally explore the data and ask biological questions on the data. This then forces anyone who wants to explore the dataset to download this data into local integration systems before they can ask their biological questions.
But not all data is public. Research centers who generate this data need to manage it and explore it in order to produce publications and do science. This means they need the same or greater sophistication of tools than are available on the public services. Many of these research projects are often collaborative and international, which means this private datasets need to be accessible on the web, but protected and secured. These "collaboration webservices" are often then made public when the research is published (for example the FANTOM4 project).
With more and more international efforts, not all data may end up in one archive. Even today we see this between GEO, ArrayExpress? and CIBEX where some datasets are only available on one or two of the three services. This means that next-generation queries may always have to query multiple databases simultaneously in order to find the data they need.
Working with existing big data
- what are the queries we want to do?
- what are the reads in this region, definitely
- but also SRA maybe does not want do everything, but they will not turn data down and want everyone to send them the published data
- SRA is still evaluating technology to do region based access of short-reads
What are the common ways we would want to query?
- should we try to define? yes
- who should or can provide the data?
- maybe the the data production centers can also provide access to data
- eventually the DataCenters? will also provide fine-grained access
- the new high data production rates are going to make it expensive for central data providers
- region query again we need! means mapping needed by the services
- but also want the read Sequence since we can extract SNP, need quality
- NEED data providers to create complete meta-data descriptions.
Maybe we can try and look into togoWS to make our data easily available?
Apart from storage and getting the data out again we briefly discussed how to actually work with these data, i.e. running scripts on them. It is clear that shipment of large datasets should be avoided; it is often much simpler to bring the software to the data instead.
The main message from this discussion was that each database should be optimized for its own data and will therefore be completely different from the next. However, it is the API that can provide interoperability if we can come up with a common set of terms for querying. Also if new web-services can do some levels of processing of the big data before export, this can help with external integration systems. If our new webservices become available using RDF and symantic web technology, it opens the door for multi-layered processing. We believe that current and emerging web-technologies like mashups based on very simple APIs like Flickr's can be an alternative to other grid-based approached for integrating data.
Work group discussion topics
next gen sequencer WG (big data and seq analysis)
applications and use cases:
- all processing must be run on a server, can not be run on a laptop, how to control and monitor such processes from something like a laptop
- local vs webservice. web service convenient but maybe not so secure which means people need to install our systems locally. but also collaboration needs access to data and tools
- how to define the quality of sequence? contamination, auto screen tool?
- moving to a data-driven research, not so much hypothesis driven. how to FIND biology in big datasets?
- searching / comparing now mostly happening on the metadata (time series, tissue, environments)
- metagenomics or novel genomes, no reference, can our tools do this? How?
- maybe we get a strange sequence out of our instrument, how do we know? is it error? is it interesting biology? CG content off, maybe technical errors (concatonated primers)
- how to bring the computation to the data because bring the data to the computation (1TB streaming) maybe not best approach
- as sequencer companies build more and more sequence processing into the instruments, more and more processing becomes routine. it is a moving target. how do we aid in new biological discovery? moving beyond simple processing and simple work flows.
what is big data?
- read level, contigs, assembly, scaffold each different level
- amount vs complexity!! maybe amount not so hard, but complexity ....
- variation, snp
- data is big when the applications can not handle it. If 1GB of data crashes a viewer program than that is BIG. If 100k elements takes a day to upload, this upload must happen every time we want to work with the data, and it takes another day to filter for a service, then that is probably BIG data.
- is it really needed to store everything ? -- Raoul
- Compressing data in some way?
- collapse redundant informations ( possible for resequencing and de-novo sequencing )
- collapse is often domain specific. for example redundant genome tags can be collapsed, but duplicate RNA tags encodes expression levels and this expression-count should not be thrown away. - jessica
- reducing dataset complexity by processing can make BIG data small, but this may throw away biology which may be lost if we do not store the primary data.
- Compressing data in some way?
How to collaborate and communicate?
- emailing files or providing files on secure FTP sites works for now
- secure webservices are maybe a much better solution where more advanced tools can be built which can work with sections of the dataset, but this requires the ability to provide sub-access queries into these datasets
IN THE END the goal is to find biology. Having access to the individual data elements is critical and this can not be just locked away inside files that can not be internally accessed.