Fast Exploration of the QSAR Model Space with e-Science Central and Windows Azure
Simon Woodman
Jacek Cala
Hugo Hiden
Paul Watson
VENUS-C
  Developing technology to ease scientific
      adoption of Cloud Computing

• EU Funded Project
  – 11 Partners
  – 5 Technology/Infrastructure
  – 6 Scenario Partners
  – Open Call – 20 Pilot Studies
  – May 2010 to May 2012
[Diagram: layered overview of the VENUS-C scenarios and platform. Column alignment was lost in extraction; the recoverable labels per row are listed below.]
• Users
  – Scenario / Algorithm: Architrave, AEGEAN, UPV.Bio, UNEW, COLB, CoSBI, CNR
  – Programming Language: C++, C++, Java, .NET, C++, .NET, Java
  – Type of workload: Batch, HTC, Workflow, Data flow, Parameter sweep, Map / Reduce, CEP
• VENUS-C
  – Execution Environments: EMIC Generic Worker, BSC COMPS
  – Operating System: Windows, Windows, Linux
• Infrastructure
  – Cloud Technology: On Premises (not in the cloud), Windows Azure, OpenNebula, EMOTIVE, …, BSC Super Computer (not in the cloud)
  – Cloud Paradigm: PaaS, IaaS
  – Cloud Provider: Customer, MSFT, ENG, KTH, BSC
The Problem
What are the properties of this molecule?
• Toxicity
• Biological Activity
• Solubility

Perform experiments
• Time consuming
• Expensive
• Ethical constraints
The alternative to Experiments
                Predict likely properties based on similar molecules


ChEMBL Database:       data on 622,824 compounds,
                       collected from 33,956 publications

WOMBAT Database:       data on 251,560 structures,
                       for over 1,966 targets

WOMBAT-PK Database:    data on 1,230 compounds,
                       for over 13,000 clinical measurements



  All these databases contain structure information and numerical activity data
QSAR
Quantitative Structure Activity Relationship

Activity ≈ f(molecular structure)   [the argument is an image in the original slide]

More accurately, activity is related to quantifiable structural attributes:

Activity ≈ f(logP, number of atoms, shape, ...)

Currently > 3,000 recognised attributes
http://www.qsarworld.com/
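As a minimal illustration of Activity ≈ f(descriptor), the sketch below fits a straight line to a single descriptor with ordinary least squares. It is not the authors' code: `fit_line` is a hypothetical helper, and the logP and activity values are synthetic numbers made up for the example. Real QSAR models in this talk use many descriptors and several model types.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y ≈ slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic values, for illustration only (not real assay data).
logp = [1.0, 2.0, 3.0, 4.0]
activity = [2.1, 3.9, 6.1, 7.9]

slope, intercept = fit_line(logp, activity)


def predict(x):
    """Predicted activity for a molecule with descriptor value x."""
    return slope * x + intercept
```

Once fitted, `predict(2.5)` estimates the activity of an untested molecule from its descriptor value alone, which is the point of QSAR: replace an experiment with a prediction from similar molecules.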
Method
Branching Workflows
1. Partition training & test data
   – Random split
   – 80:20 split
2. Calculate descriptors
   – Java CDK descriptors
   – C++ CDL descriptors
3. Select descriptors
   – Correlation analysis
   – Genetic algorithms
   – Random selection
4. Build model
   – Linear regression
   – Neural Network
   – Partial Least Squares
   – Classification Trees
5. Add to database
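The branching can be sketched as a Cartesian product over the options at each step. This assumes every option listed on the slide is an independent alternative, which the slide does not state explicitly; the names `STEPS` and `enumerate_branches` are hypothetical and not part of e-Science Central.

```python
from itertools import product

# Options per step, copied from the slide. Treating each one as an
# independent alternative is an assumption.
STEPS = {
    "partition": ["random split", "80:20 split"],
    "descriptors": ["Java CDK", "C++ CDL"],
    "selection": ["correlation analysis", "genetic algorithms", "random selection"],
    "model": ["linear regression", "neural network",
              "partial least squares", "classification trees"],
}


def enumerate_branches(steps):
    """Yield one dict per path through the branching workflow."""
    names = list(steps)
    for combo in product(*(steps[name] for name in names)):
        yield dict(zip(names, combo))


branches = list(enumerate_branches(STEPS))
```

Under that assumption each data set fans out into `len(branches)` candidate models, which is why the model space grows quickly and why the work parallelises well.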
e-Science Central
Platform for cloud based data analysis
• Deployment targets: Azure, EC2, On Premise
• Languages: Java, R, Octave, Javascript
Architecture
[Diagram: e-Science Central deployed on Azure. Layout was lost in extraction; the recoverable components and link labels are listed below.]
• Clients: web browser, rich client app
• e-Science Central main server (Web UI, REST API), <<Azure VM>>
• e-SC blob store and e-SC db backend, <<Azure VM>>
• JMS queue
• Three Generic Worker <<web role>> instances, each with a Workflow engine
• QSAR Explorer <<web role>>, with its own web browser client
• Azure Blob store
• Link labels: e-SC control data, workflow invocations, workflow data
Microsoft HPC User Group
Workflow Architecture
• Single Message Queue
  – Worker Failure Semantics
  – Elasticity
• Runtime Environments
  – R
  – Octave
  – Java
• Deployed only once

Worker Role (flowchart)
1. Install JRE
2. Install wf engine
3. Execute the engine
4. Get Job from Queue
5. Deploy Runtime?
6. Get Data
7. Execute Job
8. Put Data
9. Put Next Jobs on Queue
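The worker-role loop on this slide can be sketched as follows. This is an in-process stand-in only: Python's `queue.Queue` replaces the JMS queue, a plain dict replaces the blob store, and `run_worker` and the job dictionary layout are hypothetical names chosen for the example, not the e-Science Central API.

```python
import queue


def run_worker(jobs, store, deployed_runtimes):
    """Drain the shared queue, following the worker-role steps on the slide."""
    while True:
        try:
            job = jobs.get_nowait()              # Get job from queue
        except queue.Empty:
            return                               # nothing left to do
        runtime = job["runtime"]
        if runtime not in deployed_runtimes:     # Deploy runtime? (only once)
            deployed_runtimes.add(runtime)
        data = store[job["input"]]               # Get data
        result = job["func"](data)               # Execute job
        store[job["output"]] = result            # Put data
        for next_job in job.get("next", []):     # Put next jobs on queue
            jobs.put(next_job)


# Two chained jobs: the second is only queued once the first has finished.
store = {"raw": [1, 2, 3]}
jobs = queue.Queue()
jobs.put({
    "runtime": "java", "input": "raw", "output": "doubled",
    "func": lambda xs: [2 * x for x in xs],
    "next": [{"runtime": "r", "input": "doubled", "output": "total", "func": sum}],
})
deployed = set()
run_worker(jobs, store, deployed)
```

Because every worker pulls from the same queue, adding workers needs no scheduler changes, and a job that a failed worker never completes can be picked up by another worker if the queue redelivers it.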
Results
• 250k models
  – Linear Regression
  – PLS
  – RPartitioning
  – Neural Net
• 460K workflow executions
• 4.4M service calls
• QSAR Explorer
  – Browse
  – Search
  – Get Predictions
Scalability: Large Scale QSAR
480 datasets sequential time: 11 days

                 100 Nodes     200 Nodes
Response Time    3hr 19mins    1hr 50mins
Speedup          94x           156x
Efficiency       94%           78%
Cost             $55.68        $51.84

[Chart: relative processing speed-up vs. number of processors (0 to 250); series: Azure, ideal, GW]
[Chart: execution time [hh:mm] vs. number of processors (0 to 250); series: GW, Azure]
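The efficiency figures follow from the speedup figures: parallel efficiency is speedup divided by the number of nodes. A short check using only the numbers reported on the slide (the function name is hypothetical):

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / nodes


# Speedups reported on the slide, keyed by node count.
reported_speedup = {100: 94, 200: 156}

efficiency = {
    nodes: parallel_efficiency(speedup, nodes)
    for nodes, speedup in reported_speedup.items()
}
```

Doubling the nodes from 100 to 200 raises the speedup by a factor of about 1.66 rather than 2, which is what the drop in efficiency from 94% to 78% expresses.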
Cloud Applicability
• Bursty
  – ChEMBLdb updates (delta 10%)
  – New Modelling Methods (???)
• Performance depends on how chatty the
  problem is
  – Deploy (incl download) dependencies once
  – Avoid storage bottlenecks
Performance is great but …

Drug Development requires us to capture
       the data and the process
Provenance/Audit Requirements
• How was a model generated?
  – What algorithm?
  – What descriptors?
• Are these results reproducible?
• How have bugs manifested?
  – Which models are affected?
  – How do we regenerate affected models?
• Performance Characteristics
• How do we deal with new data?
Storing Provenance
• Neo4j
  – Open Source Graph Database
  – Nodes/Relationships + properties
  – Querying/traversing
• Access
  – Java lib for OPM
  – e-SC library built on top of OPM lib
  – REST interface
• Options for HA and Sharding for
  performance
Provenance Model
• Based on OPM
  – Processes, Artifacts, Agents
• Directed Graph
• Multiple views of provenance
  – Dependent on security privileges
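A minimal in-memory sketch of an OPM-style (Open Provenance Model) directed graph, showing how the question "which models are affected by a buggy block?" becomes a graph traversal. The relation names `used` and `wasGeneratedBy` come from OPM; the `ProvenanceGraph` class, its methods, and the node names are hypothetical and unrelated to the Neo4j-backed e-SC library.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Tiny OPM-style graph: processes, artifacts, agents and typed edges."""

    def __init__(self):
        self.kind = {}                    # node id -> "process" | "artifact" | "agent"
        self.edges = defaultdict(set)     # (relation, source) -> set of targets

    def add(self, node, kind):
        self.kind[node] = kind

    def relate(self, source, relation, target):
        self.edges[(relation, source)].add(target)

    def affected_artifacts(self, process):
        """Artifacts generated by `process`, plus everything downstream of them."""
        affected, seen, frontier = set(), {process}, [process]
        while frontier:
            node = frontier.pop()
            for (relation, source), targets in self.edges.items():
                if node not in targets or source in seen:
                    continue
                if relation in ("wasGeneratedBy", "used"):
                    seen.add(source)
                    frontier.append(source)
                    if self.kind[source] == "artifact":
                        affected.add(source)
        return affected


g = ProvenanceGraph()
for name in ("descriptors", "model-1", "model-2", "report"):
    g.add(name, "artifact")
for name in ("build-rpart", "build-lr", "test-rpart"):
    g.add(name, "process")
g.relate("build-rpart", "used", "descriptors")
g.relate("model-1", "wasGeneratedBy", "build-rpart")
g.relate("build-lr", "used", "descriptors")
g.relate("model-2", "wasGeneratedBy", "build-lr")
g.relate("test-rpart", "used", "model-1")
g.relate("report", "wasGeneratedBy", "test-rpart")

tainted = g.affected_artifacts("build-rpart")
```

Here a bug in `build-rpart` taints `model-1` and the `report` derived from it, while `model-2` is untouched. A graph database answers the same question with a traversal query instead of a hand-written loop.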
Adding new model builders
1. Add new block
2. Mine the provenance
3. Dynamically create virtual workflows
4. One invocation per data set

[Diagram: "Enumerate descriptors" links (1 to n) to "Build and cross-validate RPart-m" and to "Build and cross-validate a new kind of model"; these link (1 to 1) to "Test RPart-m" and "Test ?-m". A further label reads "cross-validation".]

• Work in progress…
Future Work
• Scalability and reliability
  – SQL Azure
  – Application server replication
• Provenance visualization
• Meta-QSAR
  – Provenance Mining
• Cloud4Science
  – Applying lessons learned to new scenarios
MOVEeCloud Project
•   Investigating the links between
    physical activity and common
    diseases – type 2
    diabetes, cardiovascular
    disease,…

•   Wrist accelerometers worn over 1
    week period
•   Measures movement at 100Hz in
    three axes
•   Processing ideal for Azure
    –   Bursty data processing as new
        data gathered
    –   Embarrassingly parallel
    –   Large datasets
MOVEeCloud Process
[Diagram: "Analysis and Classification" (R, Java, Octave) labels segments as Sleep, Walking, Sedentary, Activity, Sedentary; outputs are Clinician's Report, Patient Interventions, and Methodology Section for Papers.]
Data Sizes
100 samples / second       100 rows
3600 seconds / hour        360,000 rows
24 hours / day             8,640,000 rows
7 days / study             60,480,000 rows / patient / visit

Cohort size of 800 patients and multiple visits
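The row counts above multiply out as follows. The slide does not state how many visits each patient makes, so `visits_per_patient` is left as a parameter; `cohort_rows` is a hypothetical helper name.

```python
SAMPLES_PER_SECOND = 100          # 100 Hz accelerometer, one row per sample

rows_per_hour = SAMPLES_PER_SECOND * 3600
rows_per_day = rows_per_hour * 24
rows_per_study = rows_per_day * 7  # one week of wear, per patient, per visit


def cohort_rows(patients, visits_per_patient):
    """Total rows for a cohort; the visit count is an assumption of the caller."""
    return rows_per_study * patients * visits_per_patient
```

Even with a single visit each, `cohort_rows(800, 1)` is already in the tens of billions of rows, which is why moving the data becomes the bottleneck rather than the computation.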
Working with larger data sets
• As we add more workflow engines, server load increases
  – One server can cope with 200 engines if files are small
• This is not the case with movement data
  – One server can only support 4 engines
• Increase the bandwidth to the engines
  – Clustering appserver / database?
HDFS
• Implemented prior to native HDFS on Azure

• Easy to integrate with e-SC
   – Java system; just requires libraries included in e-SC

• Distributed store where bandwidth increases with number of
  machines
   – Bits of data spread around lots of machines
• Concept of data location
   – Potential to route workflows to execute as close as possible to
     storage

• Other applications are also built on top of HDFS
   – OpenTSDB to store time series for movement data
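The idea of routing a workflow close to its data can be sketched as choosing the engine that already holds the most blocks of the input file. This is an illustration only: the block-location lists are hand-written here rather than read from HDFS, and `pick_engine` and the engine names are hypothetical.

```python
from collections import Counter


def pick_engine(block_locations, engines):
    """Return the engine holding the most input blocks (ties broken by name).

    block_locations: one list of host names per block of the input file.
    engines: set of host names that run a workflow engine.
    """
    local_blocks = Counter(
        host
        for hosts in block_locations
        for host in hosts
        if host in engines
    )
    return max(sorted(engines), key=lambda engine: local_blocks[engine])


# Three blocks, each replicated on two of the four engines.
locations = [
    ["engine-1", "engine-2"],
    ["engine-2", "engine-3"],
    ["engine-2", "engine-4"],
]
engines = {"engine-1", "engine-2", "engine-3", "engine-4"}
chosen = pick_engine(locations, engines)
```

In this example `engine-2` holds a replica of every block, so a workflow routed there would read its whole input from local disk instead of over the network.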
Drawbacks to HDFS
• Needs a NameNode to co-ordinate everything
   – Single point of failure (in current HDFS)
• Metadata stored in RAM
   – Doesn't scale beyond a million or so files
   – Bad for drug discovery
• Not particularly efficient for small files
   – There is an overhead to connecting to the filesystem
• If instances terminate, data can be lost if not backed up
   – Redundancy helps
   – Back up in the cloud and stage to HDFS for experiments
   – Might use it as a cache of "hot" files and use Blobstore/S3
     to back it all up
e-SC Workflows
[Figure: Chunking Workflow]
[Figure: Chunk Processing Workflow]
System Setup
• One machine for the e-SC server
     • 4 CPUs, 7GB RAM, 1TB local storage
• One machine for NameNode
     • 4 CPUs, 7GB RAM, 1TB local storage
• Four workflow engines
     • 2 CPUs, 3.5GB RAM, 500GB local storage
• 2TB HDFS storage using workflow engines
     • Mounts up quickly
• Increase priority of HDFS on engines
     • Competing for resources with workflow
Initial Results
  For a single data set, processing time went from 60 to 16 minutes
            using 4 workflow engines running HDFS

• 4 engines is the limit for one e-SC server
   • Main server hit 100% CPU delivering data
   • No further improvements with more engines
• Using HDFS, CPU was consistently below 5%
   • More like our earlier scalability results

• Once data had been chunked, processing was the same for
  each chunk

• The improvement lay entirely in staging and uploading results
Questions?
• Thank you to our generous funders
  – Microsoft Cloud 4 Science
  – EU FP7 - VENUS-C (RI-261565)
  – RCUK – SiDE (EP/G066019/1)

• The Team
  – Jacek Cala
  – Paul Watson
  – Hugo Hiden

Editor's notes

  1. [JC] Shouldn’t the title be “e-SC in Azure”?
  2. Work Stealing vs Work scheduling
  3. Important for drug discovery due to traceability requirements
  4. Natural fit for storing provenance as it’s a graph to start with
  5. C4S – Mutation Detection, NGS, e-SC available on Azure, data sets for bioinformatics
  6. Large datasets need smart moving of data around – HDFS? otherwise hit the limit on storage account b/w
  7. Single point of failure. Not great for small files or 1M+ files. Instances can terminate, resulting in loss of data.