Satellite meeting for big datasets
Topics
As more and more projects are spewing out large datasets, we may have to look at new paradigms for working with them. The conventional approach of storing everything in a database and querying it with SQL has become challenging or simply impossible, so data are now more often kept in their original formats.
The objective of this meeting is to exchange ideas on this topic and to discuss possible solutions and practices.
Topics:
- Next-gen sequencing
- Protein-interaction networks?
- Cloud storage/compute?
Chairperson
- Jan Aerts
Date
- 18th March, 2009
Room
- Meeting Room (1F)
Attendees
- Jan Aerts
- Jessica Severin
- José María Fernández
- Todd Harris
- Yunsun Nam
- Pierre Lindenbaum
- Keun-Joon Park
- Raoul Jean Pierre Bonnal
- Yasukazu "yaskaz" Nakamura (yaskaz@…)
Notes
Data storage will not be a problem (if you have enough money), but downloading and/or manipulating big sequence data will be a problem for both the provider (us = DDBJ) and the user. Is there a possible solution? And I have a big question: do you really need and/or use the "Short Read Archive"? -- yaskaz
Working group discussions
Next-gen sequencer WG (big data and sequence analysis)
applications and use cases:
- all processing must run on a server and cannot be run on a laptop; how do we control and monitor such processes from something like a laptop?
- local vs. web service: a web service is convenient but maybe not so secure, which means people need to install our systems locally; but collaborations also need access to the data and tools
- how do we define sequence quality? contamination, an automatic screening tool?
- we are moving to data-driven research, not so much hypothesis-driven; how do we FIND biology in big datasets?
- searching / comparing now mostly happens on the metadata (time series, tissue, environments)
- metagenomics or novel genomes have no reference; can our tools handle this? How?
- maybe we get a strange sequence out of our instrument; how do we know? is it an error? is it interesting biology? the GC content may be off, or there may be technical errors (concatenated primers)
- how to bring the computation to the data, because bringing the data to the computation (streaming 1 TB) may not be the best approach (see the sketch after this list)
- as sequencer companies build more and more sequence processing into the instruments, more and more processing becomes routine; it is a moving target. how do we aid in new biological discovery? moving beyond simple processing and simple workflows.
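As a rough illustration of "bringing the computation to the data", the sketch below (a minimal example with assumed details, not anything agreed at the meeting) streams a FASTQ file on the machine where it is stored and sends back only a small summary, read count plus GC-content outliers, instead of the reads themselves. The file path and GC thresholds are hypothetical placeholders.
{{{#!python
# Minimal sketch: compute a summary next to the data instead of shipping ~1 TB of reads.
# The FASTQ path and the "expected" GC range below are illustrative placeholders.
import gzip

FASTQ_PATH = "run_0001.fastq.gz"   # hypothetical file sitting on the storage server
GC_LOW, GC_HIGH = 0.30, 0.70       # hypothetical expected GC-content range

def read_sequences(path):
    """Yield the sequence line of every FASTQ record (records are 4 lines each)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:          # second line of each record is the sequence
                yield line.strip()

total = outliers = 0
for seq in read_sequences(FASTQ_PATH):
    total += 1
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    if not (GC_LOW <= gc <= GC_HIGH):
        outliers += 1               # candidate contamination or technical artefact

# Only this tiny summary leaves the server, not the reads themselves.
print({"reads": total, "gc_outliers": outliers})
}}}
Only the small summary crosses the network; the same idea applies to any per-read screen (primer-concatenation checks, quality filters) that can run where the data lives.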
WG2 -- afternoon session
what is big data?
- reads, contigs, scaffolds, assemblies: each is a different level
- amount vs. complexity!! maybe the amount is not so hard, but the complexity is...
- variation, SNPs
- data is big when the applications cannot handle it. If 1 GB of data crashes a viewer program, then that is BIG. If uploading 100k elements takes a day, the upload must happen every time we want to work with the data, and filtering it for a service takes another day, then that is probably BIG data.
How to collaborate and communicate?
Data production centers
- centers like the RIKEN OSC-LSA are producing lots of data, but this data must be managed, manipulated, and mined for biology before it can be released. EdgeExpressDB (eeDB) was developed during the FANTOM4 project and is now being used for in-house management and visualization of big datasets. The system can manipulate short-read data for our internal research purposes and is proving to scale very well. eeDB works with node-and-network, sequence-tag, mapping, and expression data at the level of billions of elements very easily.
- SRA is still evaluating technology for region-based access to short reads
Working with existing big data
- SRA, GEO, ArrayExpress
- currently most of us pull the whole thing down and then work with it
- sometimes the data even has to be shipped, sometimes on DVD, to move it around
- what are the queries we want to do?
- "what are the reads in this region?" -- definitely
- but also: SRA maybe does not want to do everything, yet they will not turn data down and want everyone to send them their published data
- maybe not all data will end up in ONE archive (because it is so big); maybe we need to query multiple centers to find all the data (DDBJ, SRA, GEO, ArrayExpress, Korea? China?)
What are the common ways we would want to query?
- should we try to define them?
- do we wait for the data centers??
- region queries again: we need them! this means mapping is needed on the service side
- but we also want the read sequence, since we can extract SNPs from it, and we need the quality scores
- we NEED data providers to create complete metadata descriptions.
IN THE END the goal is to find biology; getting access to the individual data elements is critical, and they cannot simply be locked away in files whose contents cannot be accessed (see the region-query sketch below).
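As a hedged sketch of the region query asked for above, the example below assumes the reads have been mapped and stored as a coordinate-sorted, indexed BAM file and uses pysam (one possible tool, not something chosen at the meeting) to fetch only the reads overlapping an interval, together with their sequences and base qualities. The file name, contig, and coordinates are placeholders.
{{{#!python
# Minimal sketch of a region-based read query, assuming pysam and an indexed BAM file.
import pysam

BAM_PATH = "mapped_reads.bam"   # hypothetical coordinate-sorted, indexed BAM

with pysam.AlignmentFile(BAM_PATH, "rb") as bam:
    # Fetch only the reads overlapping chr1:100,000-101,000 (0-based, half-open).
    for read in bam.fetch("chr1", 100000, 101000):
        seq = read.query_sequence        # read sequence, usable for SNP extraction
        quals = read.query_qualities     # per-base quality scores (or None)
        print(read.query_name, read.reference_start, read.mapping_quality,
              seq[:20] if seq else None,
              list(quals[:5]) if quals is not None else None)
}}}
If the archives exposed this kind of lookup as a web service, the whole-dataset download step could be skipped for many analyses; whether and how they will do so was exactly the open question.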
Results
TODOs
- Design tests for BioSQL on different scenarios related to big data (BigD)
- Can BioSQL be used for short reads?
- Test BioSQL performance on BigD (a loading sketch follows this list)
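As a possible starting point for these TODOs, here is a minimal sketch, assuming Biopython's BioSQL bindings, the MySQLdb driver, and an already-installed BioSQL schema, that loads a FASTQ file of short reads into a BioSQL sub-database and times the load. The connection parameters, file name, and sub-database name are placeholders.
{{{#!python
# Minimal sketch: time how long BioSQL takes to ingest short reads from a FASTQ file.
# Assumes Biopython with its BioSQL bindings, the MySQLdb driver, and an existing
# BioSQL schema; connection parameters and file names are placeholders.
import time
from Bio import SeqIO
from BioSQL import BioSeqDatabase

server = BioSeqDatabase.open_database(
    driver="MySQLdb", user="biosql", passwd="biosql",
    host="localhost", db="biosql_test",
)
db = server.new_database("short_reads", description="BigD load test")

start = time.time()
count = db.load(SeqIO.parse("reads.fastq", "fastq"))   # stream FASTQ records into BioSQL
server.commit()
elapsed = time.time() - start

print("loaded %d reads in %.1f s (%.0f reads/s)" % (count, elapsed, count / elapsed))
server.close()
}}}
Repeating the load at increasing read counts, and timing simple retrievals afterwards, would give a first rough answer to whether BioSQL copes with BigD-scale short-read data.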
Attachments
- handling_data.pdf (23.0 KB) - added by jan.aerts 16 years ago.