Solving Business Problems with High Performance Computing




HPCC Architecture




• HPCC is based on a distributed, shared-nothing architecture and uses commodity PC servers.
• HPCC uses a consolidated central network switch for best performance, but can use a decentralized network of satellite switches. Computing efficiencies in HPCC substantially reduce the number of required nodes and, hence, network ports compared to other distributed platforms such as Hadoop.
• The HPCC Distributed File System (DFS) is record-oriented, supporting fixed-length, variable-length field-delimited, and XML records (a minimal record-layout sketch follows below).
• The HPCC DFS is tightly coupled with the processing layer, exploiting data locality to a high degree to minimize network transfers and maximize throughput.
• HPCC applications are built using ECL, a declarative, data-flow-oriented language supporting automated algorithm parallelization and work distribution.
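The record orientation of the DFS is visible directly in ECL: every logical file is declared with an explicit record layout, and the same layout can describe different physical formats. A minimal sketch, using hypothetical file and field names (not taken from this deck):

```ecl
// Hypothetical record layout for a person file
PersonRec := RECORD
    STRING20  firstname;
    STRING20  lastname;
    UNSIGNED4 zip;
END;

// The same logical layout can back different physical formats on the DFS
personsFixed := DATASET('~examples::persons_flat', PersonRec, FLAT); // fixed-length records
personsCsv   := DATASET('~examples::persons_csv',  PersonRec, CSV);  // field-delimited records

OUTPUT(COUNT(personsCsv)); // trivial sanity-check job to run on Thor
```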



Three Main Components

HPCC Data Refinery (Thor)
• A massively parallel Extract, Transform, and Load (ETL) engine
    – Built from the ground up as a parallel data environment; leverages inexpensive locally attached storage and does not require a SAN infrastructure.
• Enables data integration on a scale not previously available:
    – The current LexisNexis person-data build process generates 350 billion intermediate results at peak.
• Suitable for:
    – Massive joins/merges
    – Massive sorts & transformations
    – Any O(n²) problem



HPCC Data Delivery Engine (Roxie)
• A massively parallel, high-throughput, structured query response engine
• Ultra-fast due to its read-only nature
• Allows indices to be built on the data for efficient multi-user retrieval (a sketch of an index build and query follows after this list)
• Suitable for:
    – High volumes of structured queries
    – Full-text ranked Boolean search




Enterprise Control Language (ECL)
• An easy-to-use, data-centric programming language optimized for large-scale data management and query processing
• Highly efficient; automatically distributes workload across all nodes
• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
• Large library of efficient modules to handle common data manipulation tasks
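As an illustration of the Thor/Roxie split, the sketch below builds a payload index on Thor and then queries it the way a Roxie service would. File, key, and field names are hypothetical; this shows only the general shape of the approach, not any actual LexisNexis build.

```ecl
PersonRec := RECORD
    UNSIGNED8 id;
    STRING20  lastname;
    STRING20  firstname;
END;

persons := DATASET('~examples::persons', PersonRec, THOR);

// Thor side: build a payload index keyed on lastname
byLastName := INDEX(persons, {lastname}, {firstname, id},
                    '~examples::key::persons_by_lastname');
BUILD(byLastName);

// Roxie side: a structured query is simply a filter against the index
OUTPUT(byLastName(lastname = 'SMITH'));
```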

The HPCC Platform Delivers Benefits
in Speed, Capacity, and Cost Savings


Speed
• Scales to extreme workloads quickly and easily
• Increased development speed leads to faster production and delivery
• Improved developer productivity
Capacity
• Enables massive joins, merges, sorts, transformations, and tough O(n²) problems
• Increases business responsiveness
• Accelerates creation of new services via rapid prototyping capabilities
• Offers a platform for collaboration and innovation, leading to better results
Cost Savings
• Commodity hardware and fewer people can do much more in less time
• Uses IT resources efficiently via sharing and higher system utilization

Scalability: Little Degradation in Performance

Scalability
• Scales to support 1,000+ TB (petabyte-scale) data
• Purpose-built to do massive I/O
• Rapidly performs complex queries on structured and unstructured data, linking a variety of data sources
• Suitable for massive joins/merges beyond the limits of relational databases
• Scale increases with the addition of low-cost commodity servers
HPCC in production
• Current production systems range from 20 to 2,000 nodes
• Currently supports over 150,000 customers and millions of end users
• Currently handles over 20 million transactions per day for our online and batch products
The complex-query example below demonstrates the scaling behavior:
• Transaction latencies increase logarithmically while data sizes grow linearly
[Chart: query time vs. TB of storage; query time grows from roughly 10 seconds to roughly 13 seconds as storage grows from 50 TB to 300 TB]



Enterprise Control Language (ECL)


• Enterprise Control Language (ECL) is a programming language developed over 10 years by a group of ex-Borland compiler writers.
• Designed specifically for algorithmic development on top of large data sets.
• A functional language optimized for data processing at scale.
• Abstracts the underlying platform configuration from algorithm development, allowing efficient, expressive data manipulation.
• Automatic parallelization and optimization of sequential algorithms for parallel and distributed processing (i.e., the programmer does not need to manage the parallel processing environment).
• Large library of efficient modules to handle common data manipulation tasks.
• Inbuilt data primitives (SORT, DISTRIBUTE, JOIN) highly optimized for distributed data processing tasks (see the sketch below).
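A minimal sketch of those primitives in use: DISTRIBUTE, SORT, and JOIN are single declarations, and the compiler handles the node-level parallelism. Dataset and field names here are hypothetical, used only for illustration:

```ecl
ClaimRec := RECORD
    UNSIGNED8 personid;
    STRING10  claimid;
    REAL8     amount;
END;

PersonRec := RECORD
    UNSIGNED8 personid;
    STRING40  fullname;
END;

claims  := DATASET('~examples::claims',  ClaimRec,  THOR);
persons := DATASET('~examples::persons', PersonRec, THOR);

// Co-locate records sharing a key, then sort locally on each node
dClaims := DISTRIBUTE(claims, HASH32(personid));
sClaims := SORT(dClaims, personid, -amount, LOCAL);

ClaimWithName := RECORD
    ClaimRec;
    STRING40 fullname;
END;

// Declarative join; the platform decides how to move the data
withNames := JOIN(sClaims, persons,
                  LEFT.personid = RIGHT.personid,
                  TRANSFORM(ClaimWithName,
                            SELF.fullname := RIGHT.fullname;
                            SELF := LEFT));

OUTPUT(TOPN(withNames, 100, -amount)); // 100 largest claims with names attached
```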




Many platforms, fragmented, heterogeneous: Complex & Expensive




One Platform End-to-End: Simple




Consistent and elegant hardware and software architecture across the complete platform

Performance comparison: PigMix style

PigMix Benchmark Using Subset (Based on 1M Row Base) (times in seconds – less is better)
Hadoop 10-node (4GB Memory, 2-250GB per node), HPCC Thor 10-way, community_3.0.4-10

| PigMix Script | Pig Runtime (secs) | Java Runtime (secs) | ECL Thor Runtime (secs) | ECL vs. Pig (times faster) | ECL vs. Java (times faster) |
|---------------|--------------------|---------------------|-------------------------|----------------------------|-----------------------------|
| L2            | 73                 | 68                  | 0.96                    | 76.36                      | 71.13                       |
| L3            | 156                | 229                 | 1.99                    | 78.51                      | 115.25                      |
| L4            | 95                 | 104                 | 14.02                   | 6.78                       | 7.42                        |
| L5            | 85                 | 224                 | 7.02                    | 12.11                      | 31.90                       |
| L6            | 96                 | 100                 | 1.80                    | 53.30                      | 55.52                       |
| L7            | 90                 | 110                 | 1.40                    | 64.29                      | 78.57                       |
| L8            | 66                 | 79                  | 0.89                    | 74.07                      | 88.66                       |
| L9            | 162                | 128                 | 22.39                   | 7.24                       | 5.72                        |
| L10           | 237                | 146                 | 18.08                   | 13.11                      | 8.08                        |
| L11           | 414                | 503                 | 342.65                  | 1.21                       | 1.47                        |
| L12           | 108                | 218                 | 2.74                    | 39.47                      | 79.68                       |
| L13           | 87                 | 73                  | 2.09                    | 41.69                      | 34.98                       |
| L14           | 142                | 7                   | 1.29                    | 109.74                     | 5.41                        |
| L15           | 98                 | 6                   | 108.58                  | 0.90                       | 0.06                        |
| L16           | 97                 | 7                   | 1.40                    | 69.19                      | 4.99                        |
| L17           | 153                | 6                   | 5.41                    | 28.29                      | 1.11                        |
| Total         | 2159               | 2008                | 532.70                  | 4.05                       | 3.77                        |
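The individual PigMix scripts are not reproduced in this deck, but a representative PigMix-style operation, grouping a large file by a key and aggregating, is a single TABLE statement in ECL. The sketch below uses hypothetical file and field names purely for illustration:

```ecl
PageViewRec := RECORD
    STRING50  userid;
    STRING100 page_url;
    UNSIGNED4 timespent;
END;

views := DATASET('~examples::pigmix::page_views', PageViewRec, CSV);

// Group-and-aggregate: total and average time spent per user
perUser := TABLE(views,
                 { userid,
                   UNSIGNED8 total    := SUM(GROUP, timespent),
                   REAL8     avgspent := AVE(GROUP, timespent) },
                 userid);

OUTPUT(SORT(perUser, -total)); // heaviest users first
```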
Platform is now Open Source




HPCC Systems: http://hpccsystems.com

Comparison with Hadoop: http://hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop

Downloads: http://hpccsystems.com/download




HPCC Systems Case Studies




Case Example 1: ChoicePoint Migration

Scenario
Reed Elsevier acquired ChoicePoint for $4.1 billion and integrated it into LexisNexis Risk Solutions.
• Spun off from Equifax, Inc. in 1997, ChoicePoint grew quickly through acquisition.
• ChoicePoint kept its business decentralized, and individual units operated on different, duplicative technology platforms.
• Its largest division, Insurance, ran on an expensive and outdated IBM mainframe-based architecture, while its public-records data center used an old, inflexible relational database built on an Oracle/webMethods, Sun Microsystems-based architecture.
• ChoicePoint struggled under high information technology (IT) management costs, and its growth was constrained by its inability to adapt its technology and develop new solutions to meet evolving business needs.

Task
Migrate ChoicePoint, including the Insurance business, to the LexisNexis HPCC platform without disrupting the business.

Result
• LexisNexis anticipates a 70% reduction in new-product time-to-market as a result of moving ChoicePoint to the HPCC platform.
• Two years into the integration, LexisNexis is on track to generate substantial savings within the first five years.
• LexisNexis has seen an immediate 8-10% improvement in hit rates from better matching and linking technology, a derivative of the HPCC technology.
• The response time of interactive transactions for certain products has improved by 200% on average, simply as a result of moving to the HPCC.
Case Example 2: Insurance Scoring, From 100 Days to 30 Minutes


Scenario
One of the top three insurance providers was using traditional analytics products on multiple statistical-model platforms with disparate technologies.
      –   The insurer issued a request to re-run all past reports for a customer: 11.5M reports since 2004.
      –   Using their existing technology infrastructure, it would take 100 days to run these 11.5M reports.

Task
Migrate 75 different models, 74,000 lines of code, and approximately 700 definitions to the HPCC platform.
      –   The models were migrated in one person-month.
      –   Using a small development system (and only one developer), we ran 11.5 million reports in 66 minutes.
      –   Performance on a production-size system: 30 minutes.
      –   Testing demonstrated our ability to work in batch or online.

Result
      –   Reduced manual work, increased reliability, and created the capability to produce new scores.
      –   Decreased development time from one year to several weeks; decreased run time from 100 days to 30 minutes.
      –   One HPCC (one infrastructure) translates into fewer people maintaining systems.


Case Example 3: Detecting Insurance Collusion in Louisiana




Scenario
This view of carrier data shows seven known fraud claims and an additional linked claim.

The insurance company's own data finds a connection between only two of the seven claims, and identifies only one other claim as weakly connected.




Case Example 3: Continued



Task
After adding the Link ID to the carrier data, LexisNexis HPCC technology added two additional degrees of relatives.

Result
The results showed two family groups interconnected across all seven claims.

[Figure: the two linked family groups, labeled Family 1 and Family 2]

The links were much stronger than the carrier data had previously supported.


Case Example 4: Network Traffic Analysis in Seconds


Scenario
Conventional network sensor and monitoring solutions are constrained by their inability to quickly ingest massive data volumes for analysis:
- 15 minutes of network traffic can generate 4 terabytes of data, which can take 6 hours to process
- 90 days of network traffic can add up to 300+ terabytes

Task
Drill into all the data to see whether any US government systems have communicated with any suspect systems of foreign organizations in the last 6 months.
- In this scenario, we look specifically for traffic occurring at unusual hours of the day.

Result
In seconds, the HPCC sorted through months of network traffic to identify patterns and suspicious behavior.



Case Example 5: Suspicious Equity Stripping Cluster




Scenario
Is mortgage fraud just an activity of isolated individuals, or an industry?

Task
Rank the nature, connectedness, and proximity of suspicious property transactions for every identity in the U.S.

Result
Florida proof of concept:
• Highest-ranked influencers
    – Identified known ringleaders in flipping and equity-stripping schemes
    – Typically not connected directly to suspicious transactions
• Clusters with high levels of potential collusion
• Clusters offloading property and generating defaults



Case Example 6: Mining Wikipedia Interest

Scenario
21 billion rows of Wikipedia hourly pageview statistics for a year.

Task
Generate meaningful statistics to understand aggregated global interest in all Wikipedia pages across the 24 hours of the day, built from all English Wikipedia pageview logs for 12 months.

Result
Produces page statistics that can be queried in seconds to visualize the key times of day at which each Wikipedia page is most actively viewed.
Helps gain insight into both regional and time-of-day interest in certain topics.
This result can be combined with machine learning to cluster pages with similar 24-hour fingerprints (a sketch of the underlying aggregation follows below).
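A hedged sketch of the kind of aggregation this case implies: rolling hourly pageview counts up into a per-page 24-hour profile and extracting each page's peak hour. The file layout and names are hypothetical, not the actual LexisNexis job:

```ecl
ViewRec := RECORD
    STRING120 page;
    UNSIGNED1 hour;   // 0..23, hour of day from the pageview log
    UNSIGNED8 views;
END;

hourly := DATASET('~examples::wikipedia::hourly_views', ViewRec, CSV);

// One row per (page, hour): the page's 24-hour "fingerprint"
fingerprint := TABLE(hourly,
                     { page, hour, UNSIGNED8 total := SUM(GROUP, views) },
                     page, hour);

// Peak viewing hour for each page, suitable for fast delivery-side queries
peaks := DEDUP(SORT(fingerprint, page, -total), page);

OUTPUT(peaks);
```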




Conclusion



• The HPCC platform is built from inexpensive commodity hardware.
• ECL, a declarative data-flow language, provides automatic parallelization and work distribution.
• Data linking, clustering, and mining applications leveraging billions of records are very feasible.
• The HPCC platform is now open source.
• The HPCC platform is enterprise-ready, with many years of use as the Big Data platform for LexisNexis Risk Solutions.

Questions?



