Changes between Version 57 and Version 58 of SatelliteBigData

Timestamp:
2009/03/20 14:44:37 (15 years ago)
Author:
severin
Comment:

--

  • SatelliteBigData

    v57 v58  
     87  87  Currently, public resources such as SRA, GEO, ArrayExpress and CIBEX[http://cibex.nig.ac.jp/index.jsp] provide query facilities only on the metadata of the experiments surrounding the data. The data is available as files to download (often in the original format), but these resources do not provide facilities to explore the data externally and ask biological questions of it. This forces anyone who wants to explore a dataset to download it into a local integration system before they can ask their biological questions.
     88  88
         89  But not all data is public. Research centers that generate this data need to manage it and data-mine it in order to produce publications and do science. This means they need tools of the same or greater sophistication than those available on the public services. Many of these research projects are collaborative and international, which means these private datasets need to be accessible on the web, but protected and secured. These "collaboration webservices" are often then made public when the research is published (for example the FANTOM4 project). Also, as international efforts grow, not all data may end up in one archive. Even today we see this between GEO, ArrayExpress and CIBEX (some datasets are only available on one or two of the three). This means that next-generation queries may always have to query multiple databases simultaneously in order to find the data they need.
         90
     89  91  Working with existing big data
     90       - SRA, GEO, ArrayExpress: today they just provide the metadata of the dataset, not an ability to explore the actual data
     91       - now most of us pull the whole thing down and then work with it
     92       - sometimes it is even hard to send data for submission; sometimes DVDs are used to move it around. Not optimal.
     93  92   - what are the queries we want to do?
     94  93     - what are the reads in this region, definitely
     95  94   - but also SRA maybe does not want to do everything, but they will not turn data down and want everyone to send them the published data
     96       - maybe not all data will end up in ONE archive (because it is so big). Maybe we need to query multiple centers to find all data (DDBJ, SRA, GEO, ArrayExpress, Korea? China?)
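
As a rough illustration of two of the notes above (a region query such as "what are the reads in this region", and having to ask several archives to find all the data), here is a minimal Python sketch. Everything in it is hypothetical: the `Read` record, the `reads_in_region` and `federated_query` helpers, the archive names and the coordinates are made up for illustration and do not correspond to any real SRA, GEO, ArrayExpress or DDBJ interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Read:
    """A hypothetical aligned read; not any archive's real record format."""
    archive: str  # name of the archive the read came from
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive


def reads_in_region(reads: List[Read], chrom: str, start: int, end: int) -> List[Read]:
    """Return the reads overlapping the half-open interval [start, end) on chrom."""
    return [r for r in reads if r.chrom == chrom and r.start < end and r.end > start]


def federated_query(archives: Dict[str, List[Read]], chrom: str, start: int, end: int) -> List[Read]:
    """Ask every archive the same region question and merge the answers."""
    hits: List[Read] = []
    for name in sorted(archives):
        hits.extend(reads_in_region(archives[name], chrom, start, end))
    # Sort so the merged result does not depend on which archive answered first.
    return sorted(hits, key=lambda r: (r.start, r.end, r.archive))


# Toy, made-up data standing in for two separate archives.
ARCHIVES = {
    "archive_a": [Read("archive_a", "chr1", 100, 136), Read("archive_a", "chr2", 100, 136)],
    "archive_b": [Read("archive_b", "chr1", 120, 156), Read("archive_b", "chr1", 500, 536)],
}

hits = federated_query(ARCHIVES, "chr1", 110, 130)
```

With the toy data, `hits` holds the two chr1 reads that overlap positions 110 to 130, one from each archive. In this sketch all reads are held locally only to keep the example self-contained; the notes above argue for the opposite in practice, with each archive answering the region question itself so that nobody has to pull the whole dataset down first.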
     97  95
     98  96  What are the common ways we would want to query?