Changes between Version 14 and Version 15 of SatelliteBigData

Timestamp:
2009/03/19 17:18:34 (16 years ago)
Author:
severin
Comment:

--

  • SatelliteBigData

    v14 v15  
    2828 
    2929 * Jan Aerts 
     30 * Jessica Severin 
    3031 * José María Fernández 
    3132 * Todd Harris 
     
    4243== Results == 
    4344 
     45next gen sequencer WG (big data and seq analysis) 
     46 
     47applications and use cases: 
     48 - all processing must run on a server and cannot run on a laptop; how do we control and monitor such processes from something like a laptop? 
     49 - local vs web service: a web service is convenient but maybe not so secure, which means people need to install our systems locally; but collaboration also needs access to data and tools 
     50 - how to define the quality of sequence?  contamination, auto screen tool? 
     51 - moving to data-driven research, not so much hypothesis-driven. how to FIND biology? 
     52 - searching / comparing now mostly happening on the metadata (time series, tissue, environments) 
     53 - metagenomics or novel genomes, no reference, can our tools do this?  How? 
     54 - maybe we get a strange sequence out of our instrument, how do we know?  is it error? is it interesting biology?  GC content off, maybe technical errors (concatenated primers) 
     55 - how to bring the computation to the data, because bringing the data to the computation (1TB streaming) is maybe not the best approach 
     56 - as sequencer companies build more and more sequence processing into the instruments, more and more processing becomes routine. it is a moving target. how do we aid in new biological discovery?  moving beyond simple processing and simple work flows. 
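
The "strange sequence" question above could start with a simple automatic screen. The following is only a minimal sketch: the function names and the 0.30 / 0.70 thresholds are assumptions chosen for illustration, not something decided in the session, and a real screen would also need to check for contamination and primer concatemers.

```python
def gc_fraction(seq):
    """Return the fraction of G/C among unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Reads made only of ambiguous bases (e.g. all N) get 0.0 by convention.
    return gc / total if total else 0.0


def flag_unusual_gc(reads, low=0.30, high=0.70):
    """Return (read, gc) pairs whose GC fraction lies outside [low, high].

    The thresholds are hypothetical; sensible values depend on the organism.
    """
    flagged = []
    for read in reads:
        gc = gc_fraction(read)
        if gc < low or gc > high:
            flagged.append((read, gc))
    return flagged
```

A flagged read is not necessarily an error. It is only a candidate that needs a closer look to decide between a technical artifact and interesting biology.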
     57 
     58WG2---- afternoon session 
     59 
     60what is big data? 
     61 - read level, contigs, assembly, scaffold: each is a different level 
     62 - amount vs complexity!!  maybe amount not so hard, but complexity .... 
     63 - variation, snp 
     64 - data is big when the applications can not handle it.  If 1GB of data crashes a viewer program, then that is BIG.  If 100k elements take a day to upload, this upload must happen every time we want to work with the data, and it takes another day to filter for a service, then that is probably BIG data.  
     65 
     66How to collaborate and communicate? 
     67 
     68Data production centers 
     69 - centers like RIKEN OSC-LSA are producing lots of data, but this data must be managed, manipulated, and mined for biology before it can be released.  EdgeExpressDB (eeDB) was developed during the FANTOM4 project and is now being used for in-house big data management and visualization of big datasets.  This system can manipulate short-read data for our internal research purposes and is proving to scale very well. eeDB works with node and network, sequence tag, mapping, and expression data at the level of billions of elements very easily. 
     70 - SRA is still evaluating technology to do region based access of short-reads 
     71 
     72Working with existing big data 
     73 - SRA, GEO, ArrayExpress 
     74 - now most of us pull the whole thing down and then work with it 
     75 - sometimes it is hard even to send; sometimes a DVD is used to move it around 
     76 - what are the queries we want to do? 
     77   - what are the reads in this region, definitely  
     78 - but also SRA maybe does not want to do everything, but they will not turn data down and want everyone to send them the published data 
     79 - maybe not all data will end up in ONE archive (because it is so big). maybe need to query multiple centers to find all data (DDBJ, SRA, GEO, ArrayExpress, Korea? China?) 
     80 
     81What are the common ways we would want to query? 
     82 - should we try to define? 
     83 - do we wait for DataCenters? 
     84 - again, we need region queries! this means the services need to provide mapping 
     85 - but we also want the read sequence, since we can extract SNPs from it; quality is needed too 
     86 - NEED data providers to create complete meta-data descriptions. 
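
To make the "what are the reads in this region" query concrete, here is a minimal in-memory sketch. It assumes reads are already mapped and given as (chrom, start, end, read_id) tuples with 0-based half-open coordinates; the `ReadIndex` class and its layout are hypothetical and are not how SRA or eeDB implement region access.

```python
import bisect


class ReadIndex:
    """Toy region index over mapped reads (illustrative only)."""

    def __init__(self, reads):
        # reads: iterable of (chrom, start, end, read_id), 0-based half-open.
        self._by_chrom = {}
        for chrom, start, end, read_id in reads:
            self._by_chrom.setdefault(chrom, []).append((start, end, read_id))
        self._starts = {}
        self._max_len = {}
        for chrom, items in self._by_chrom.items():
            items.sort()  # order by start coordinate
            self._starts[chrom] = [s for s, _, _ in items]
            self._max_len[chrom] = max(e - s for s, e, _ in items)

    def query(self, chrom, start, end):
        """Return ids of reads overlapping [start, end) on chrom."""
        items = self._by_chrom.get(chrom)
        if not items:
            return []
        starts = self._starts[chrom]
        # An overlapping read must start before `end`, and cannot start
        # more than (max read length - 1) bases before `start`.
        lo = bisect.bisect_left(starts, start - self._max_len[chrom] + 1)
        hi = bisect.bisect_left(starts, end)
        return [rid for s, e, rid in items[lo:hi] if e > start]
```

For example, `ReadIndex(reads).query("chr1", 130, 140)` returns the ids of all reads mapped to chr1 that overlap those positions. At the scale discussed here the same idea would need an on-disk or server-side index, so that a client does not have to pull the whole dataset down first.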
     87 
     88IN THE END the goal is to find biology. Getting access to the individual data elements is critical; they can not be just locked away in files that can not be internally accessed. 
     89 
    4490 
    4591== TODOs ==