next gen sequencer WG (big data and seq analysis)

applications and use cases:
- all processing must be run on a server, not on a laptop; how to control and monitor such processes from something like a laptop
- local vs web service. a web service is convenient but maybe not so secure, which means people need to install our systems locally. but collaboration also needs access to data and tools
- how to define the quality of a sequence? contamination, an automatic screening tool?
- moving to data-driven research, not so much hypothesis-driven. how to FIND biology?
- searching / comparing now mostly happens on the metadata (time series, tissue, environments)
- metagenomics or novel genomes, no reference; can our tools do this? How?
- maybe we get a strange sequence out of our instrument, how do we know? is it error? is it interesting biology? GC content off, maybe technical errors (concatenated primers)
- how to bring the computation to the data, because bringing the data to the computation (1TB streaming) is maybe not the best approach
- as sequencer companies build more and more sequence processing into the instruments, more and more processing becomes routine. it is a moving target. how do we aid new biological discovery? moving beyond simple processing and simple workflows.
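The "strange sequence" check above (GC content off, concatenated primers) could be sketched as a simple per-read screen; this is a minimal illustration, not a real QC pipeline, and the primer sequence and thresholds are hypothetical examples:

```python
# Minimal sketch of a per-read sanity screen: flag reads whose GC content
# is far from an expected range, or that contain a known primer/adapter
# sequence (a doubled copy suggests a concatenated-primer artifact).
# Primer string and thresholds are hypothetical placeholders.

def gc_content(seq: str) -> float:
    """Fraction of G/C bases in the read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def screen_read(seq, primer="AGATCGGAAGAGC", gc_lo=0.3, gc_hi=0.7):
    """Return a list of warning flags for one read."""
    flags = []
    s = seq.upper()
    if not gc_lo <= gc_content(s) <= gc_hi:
        flags.append("gc_out_of_range")
    if primer in s:
        flags.append("primer_hit")
    if primer + primer in s:  # two copies back to back
        flags.append("concatenated_primer")
    return flags
```

A read passing the screen returns an empty flag list; anything flagged could be routed to manual inspection to decide "error vs interesting biology".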

WG2 ---- afternoon session

what is big data?
- read level, contigs, assembly, scaffold: each a different level
- amount vs complexity!! maybe amount is not so hard, but complexity ....
- variation, SNPs
- data is big when the applications cannot handle it. If 1GB of data crashes a viewer program, then that is BIG. If 100k elements take a day to upload, the upload must happen every time we want to work with the data, and filtering for a service takes another day, then that is probably BIG data.

How to collaborate and communicate?

Data production centers
- centers like RIKEN OSC-LSA are producing lots of data, but this data must be managed, manipulated, and mined for biology before it can be released. EdgeExpressDB (eeDB) was developed during the FANTOM4 project and is now being used for in-house management and visualization of big datasets. This system can manipulate short-read data for our internal research purposes and is proving to scale very well. eeDB works with node-and-network, sequence tag, mapping, and expression data at the level of billions of elements very easily.
- SRA is still evaluating technology for region-based access to short reads

Working with existing big data
- SRA, GEO, ArrayExpress
- now most of us pull the whole thing down and then work with it
- sometimes data is even shipped around on DVD to move it
- what are the queries we want to do?
- what are the reads in this region? definitely
- but SRA maybe does not want to do everything; still, they will not turn data down and want everyone to send them their published data
- maybe not all data will end up in ONE archive (because it is so big). maybe we need to query multiple centers to find all the data (DDBJ, SRA, GEO, ArrayExpress, Korea? China?)

What are the common ways we would want to query?
- should we try to define them?
- do we wait for the DataCenters?
- region queries, again, we need! this means mapping is needed by the services
- but we also want the read sequence, since we can extract SNPs; we need quality scores too
- NEED data providers to create complete metadata descriptions.

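The "reads in this region" query asked for above boils down to an interval-overlap lookup over mapped reads. A minimal in-memory sketch (a real service would use an on-disk index in the BAM/tabix style rather than a linear scan; all names here are hypothetical):

```python
# Minimal sketch of a region query over mapped reads held in memory.
# Each read is (read_id, chrom, start, end, sequence) with 0-based,
# half-open coordinates. A production service would answer this from
# an index instead of scanning every read.

def reads_in_region(reads, chrom, start, end):
    """Return reads overlapping [start, end) on chrom."""
    return [r for r in reads
            if r[1] == chrom and r[2] < end and r[3] > start]

reads = [
    ("r1", "chr1", 100, 136, "ACGT..."),
    ("r2", "chr1", 150, 186, "TTGA..."),
    ("r3", "chr2", 100, 136, "GGCC..."),
]
hits = reads_in_region(reads, "chr1", 120, 160)
```

Because the result carries the read sequence (and could carry qualities), the same query supports downstream SNP extraction, which is why region access plus full read records matter together.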
IN THE END the goal is to find biology, and getting access to the individual data elements is critical; this cannot just be locked away in files whose contents cannot be accessed.