
Satellite meeting for handling next-generation datasets

Topics

As more and more projects spew out datasets, we may have to look at new paradigms for working with them. The conventional way of storing everything in a database and querying it with SQL has become challenging or simply impossible. Data are now more often stored in their original data formats.

The aim of this meeting is to exchange ideas on this topic and discuss possible solutions and practices.

Topics:

  • Next-gen sequencing
  • Protein-interaction networks?
  • Cloud storage/compute?

Chairperson

  • Jan Aerts

Date

  • 18th March, 2009

Room

  • Meeting Room (1F)

Attendees

  • Jan Aerts
  • Jessica Severin
  • José María Fernández
  • Todd Harris
  • Yunsun Nam
  • Pierre Lindenbaum
  • Keun-Joon Park
  • Raoul Jean Pierre Bonnal
  • Yasukazu "yaskaz" Nakamura (yaskaz@…)

Notes

Data storage will not be a problem (if you have enough money). But downloading and/or manipulating big sequence data must be a problem for both the provider (us = DDBJ) and the user. Is there a possible solution? And I have a big question: do you really need and/or use the "Short Read Archive"? -- yaskaz

Working group discussions

Next-gen sequencer WG (big data and seq analysis)

Applications and use cases:

  • all processing must run on a server and cannot be run on a laptop; how do we control and monitor such processes from something like a laptop?
  • local vs. web service: a web service is convenient but maybe not so secure, which means people need to install our systems locally; but collaborations also need access to data and tools
  • how do we define the quality of a sequence? contamination, an automatic screening tool? (see the sketch after this list)
  • we are moving to data-driven research, not so much hypothesis-driven; how do we FIND biology in big datasets?
  • searching/comparing now mostly happens on the metadata (time series, tissue, environments)
  • metagenomics or novel genomes with no reference: can our tools do this? How?
  • maybe we get a strange sequence out of our instrument; how do we know? Is it an error, or is it interesting biology? The GC content may be off, or there may be technical errors (concatenated primers)
  • how do we bring the computation to the data? Bringing the data to the computation (streaming 1 TB) is maybe not the best approach
  • as sequencer companies build more and more sequence processing into the instruments, more and more processing becomes routine. It is a moving target: how do we aid new biological discovery, moving beyond simple processing and simple workflows?
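
A minimal sketch of what such an automatic screening tool could start from: flagging reads whose GC content is far from an expected genome-wide value. The expected value, tolerance, and example reads below are illustrative assumptions, not an agreed way to call contamination.

    # gc_screen.py -- sketch of an automatic read screen; thresholds are
    # illustrative assumptions, and a real contamination screen would also
    # check for adapter/primer artifacts.

    def gc_fraction(seq):
        """Return the fraction of G/C bases in a read."""
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

    def flag_odd_reads(reads, expected_gc=0.41, tolerance=0.15):
        """Yield (read_id, gc) for reads whose GC content deviates from the
        expected genome-wide value (0.41 is roughly the human average; both
        numbers are placeholders) by more than the tolerance."""
        for read_id, seq in reads:
            gc = gc_fraction(seq)
            if abs(gc - expected_gc) > tolerance:
                yield read_id, gc

    if __name__ == "__main__":
        reads = [("read1", "ACGTACGTAC"), ("read2", "GGGCGCGCCC")]
        for read_id, gc in flag_odd_reads(reads):
            print("%s looks odd: GC=%.2f" % (read_id, gc))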

What is big data?

  • reads, contigs, scaffolds, assemblies: each is a different level
  • amount vs. complexity!! maybe the amount is not so hard, but the complexity...
  • variation, SNPs
  • data is big when the applications cannot handle it. If 1 GB of data crashes a viewer program, then that is BIG. If 100k elements take a day to upload, if that upload must happen every time we want to work with the data, and if it takes another day to filter for a service, then that is probably BIG data.

How to collaborate and communicate?

Data production centers

  • Centers like the RIKEN OSC-LSA are producing lots of data, but these data must be managed, manipulated, and mined for biology before they can be released. EdgeExpressDB (eeDB) was developed during the FANTOM4 project and is now being used for in-house management and visualization of big datasets. eeDB is effectively an object database implemented as an API and web services. The system will be ported to C and file indexes this summer, which should give at least a 100x performance boost. Currently the API toolkit and web services are written in Perl with a narrow/deep MySQL snowflake schema. This generation-1 API can already manipulate short-read data for our internal research purposes and is proving to scale very well. eeDB works with node-and-network, sequence-tag, mapping, and expression data at the level of billions of elements very easily. Queries can access individual objects and edges, and can work with streams or sets of objects queried by region, node, or network (see the sketch after this list).
  • SRA is still evaluating technology for region-based access to short reads
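
As a rough illustration of the region-query-on-streams idea, here is a minimal Python sketch; the base URL, endpoint path, and parameter names are hypothetical placeholders, not the actual eeDB web-service interface.

    # region_stream.py -- sketch of streaming features from a region query
    # against an eeDB-like web service; URL and parameters are hypothetical.
    import urllib.parse
    import urllib.request

    def stream_region(base_url, assembly, chrom, start, end):
        """Request all features overlapping a region and yield them one line
        at a time, so the client never holds the full result set in memory."""
        params = urllib.parse.urlencode({
            "assembly": assembly, "chrom": chrom,
            "start": start, "end": end,
        })
        with urllib.request.urlopen("%s/region?%s" % (base_url, params)) as resp:
            for line in resp:  # iterate the response, do not slurp it
                yield line.decode("utf-8").rstrip("\n")

    # usage sketch:
    # for feature in stream_region("http://example.org/eedb", "hg18",
    #                              "chr1", 1000000, 2000000):
    #     process(feature)

The design point is that the client consumes one element at a time instead of materializing a billion-element result set.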

Working with existing big data

  • SRA, GEO, ArrayExpress
  • now most of us pull the whole thing down and then work with it
  • sometimes it is even hard to send data for submission; sometimes a DVD is needed to move it around. Not optimal.
  • what are the queries we want to do?
    • what are the reads in this region, definitely
  • SRA maybe does not want to do everything, but they will not turn data down, and they want everyone to send them the published data
  • maybe not all data will end up in ONE archive (because it is so big); maybe we need to query multiple centers to find all the data (DDBJ, SRA, GEO, ArrayExpress, Korea? China?)

What are the common ways we would want to query?

  • should we try to define them?
  • or do we wait for the data centers??
  • again we need the region query! this means the services need the mappings
  • but we also want the read sequence, since we can extract SNPs from it; we need the quality values too
  • we NEED data providers to create complete metadata descriptions (see the sketch after this list)
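
To make this concrete, here is a minimal Python sketch of the fields such a common query would need to carry; every field name is an assumption for illustration, not an agreed standard.

    # read_query.py -- sketch of the common read query discussed above;
    # all field names are illustrative assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class RegionReadQuery:
        """Parameters for 'give me the reads in this region'."""
        assembly: str               # reference assembly used for the mapping
        chrom: str                  # chromosome or contig name
        start: int                  # region start
        end: int                    # region end, inclusive
        return_sequence: bool = True   # read sequence, needed to extract SNPs
        return_quality: bool = True    # per-base qualities, needed to trust SNPs
        metadata_filters: dict = field(default_factory=dict)
        # e.g. {"tissue": "liver", "platform": "Illumina"}; only works if
        # providers publish complete metadata descriptions

    q = RegionReadQuery(assembly="hg18", chrom="chr7",
                        start=127471196, end=127495720)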

IN THE END the goal is to find biology, and access to the individual data elements is critical; they cannot simply be locked away in files that cannot be accessed internally.

Results

TODOs

  • Design tests for BioSQL in different scenarios related to big data (BigD)
    • Can BioSQL be used for short reads? (see the sketch after this list)
  • Test BioSQL performance on BigD
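
A minimal sketch of such a test using Biopython's BioSQL bindings; the connection parameters, database name, and reads.fastq file are placeholders, and it assumes a BioSQL schema has already been loaded into MySQL.

    # biosql_load_test.py -- sketch: can BioSQL hold short reads, and how
    # fast does it load them? Connection details and input are placeholders.
    import time
    from Bio import SeqIO
    from BioSQL import BioSeqDatabase

    server = BioSeqDatabase.open_database(driver="MySQLdb", user="biosql",
                                          passwd="biosql", host="localhost",
                                          db="biosqldb")
    db = server.new_database("short_reads", description="BigD load test")

    t0 = time.time()
    count = db.load(SeqIO.parse("reads.fastq", "fastq"))  # FASTQ short reads
    server.commit()
    print("loaded %d reads in %.1f s" % (count, time.time() - t0))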
