Traversing our way through
Apache Spark GraphFrames
and
GraphX
Mo Patel
Data Day Texas 2017
A bit about me
• Currently Deep Learning Practice Director at Teradata
– Road Object Detection & Scene Labeling
– Visual Product Search
– Chatbots
• Previously
– Analytics @ Social Sharing Startup
– Analytics @ Intelligence Community
– Distributed Systems @ Satellite Operations Company
– Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and outdoor sports in general, and working out to keep doing them
mopatel
What is this talk about?
• What are Graphs and what are some interesting
things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial
companies fight Synthetic Identity Fraud?
What is a Graph?
Natural (image: Wikipedia)
Artificial (image: Wikipedia)
Power of Graphs
Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14
Power of Graphs
• Good: Facebook, Twitter, WhatsApp… the most
popular social networks
• Bad: MySpace, Friendster, Orkut… “Nobody
goes there anymore. It's too crowded” – Yogi
Berra
• Data Growth: Recall Metcalfe’s Law (n^2) and Reed’s
Law (2^n)
• Memory Intensive
• Processing Intensive
Graph Databases cost money,
Graph Analytics make money!
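To make the growth rates on this slide concrete, here is a small plain-Scala sketch that tabulates n^2 (Metcalfe) against 2^n (Reed). The object name `NetworkLaws` and the sample sizes are illustrative assumptions, not part of the talk.

```scala
object NetworkLaws {
  // Metcalfe's law: network value grows with the square of the node count (n^2).
  def metcalfe(n: Int): BigInt = BigInt(n) * BigInt(n)

  // Reed's law: the number of possible subgroups grows as 2^n.
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit =
    // Illustrative sizes only; BigInt avoids overflow for the exponential term.
    for (n <- Seq(10, 20, 40))
      println(s"n=$n  n^2=${metcalfe(n)}  2^n=${reed(n)}")
}
```

Even at n = 40 the exponential term is many orders of magnitude larger than the quadratic one, which is one way to read the "Memory Intensive" and "Processing Intensive" bullets.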
Graph Databases cost money,
Graph Analytics make money!
• PageRank, EigenCentrality
• Modularity, Clustering Coefficient,
Betweenness, Closeness
• Loopy Belief Propagation, SALSA
Node Score in a Graph
• Use case: Find out how important an entity is
in a graph
– Entity Fraud Detection
– Influencers
– Crime Bosses
• Methods: PageRank, EigenCentrality
PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
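As a rough illustration of how a node score such as PageRank can be computed, the following plain-Scala sketch runs a fixed number of power iterations over an edge list. It is a minimal sketch, not Spark's implementation: the object name `PageRankSketch`, the un-normalized convention (every rank starts at 1.0), the default of 50 iterations, and the sample graph are all assumptions made for the example.

```scala
object PageRankSketch {
  /** Power-iteration PageRank over a directed edge list.
    * Assumed convention: ranks start at 1.0 and
    * rank(v) = reset + (1 - reset) * sum(rank(u) / outDegree(u)) over edges u -> v.
    * Every edge endpoint must appear in `vertices`.
    */
  def pageRank(
      vertices: Seq[String],
      edges: Seq[(String, String)],
      reset: Double = 0.15,
      iterations: Int = 50): Map[String, Double] = {
    val outDegree: Map[String, Int] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    var ranks: Map[String, Double] = vertices.map(_ -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Each source splits its current rank evenly across its outgoing edges.
      val contributions: Map[String, Double] = edges
        .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
        .groupBy(_._1)
        .map { case (dst, cs) => dst -> cs.map(_._2).sum }
      ranks = vertices.map { v =>
        v -> (reset + (1 - reset) * contributions.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: three accounts point at "hub", and "hub" points back at "a".
    val vertices = Seq("hub", "a", "b", "c")
    val edges = Seq("a" -> "hub", "b" -> "hub", "c" -> "hub", "hub" -> "a")
    pageRank(vertices, edges).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

In this sample graph "hub" ends up with the highest score, while "b" and "c", which nobody points to, stay at the reset value.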
Communities in a Graph
• Use case: Detect similar nodes
– Behavioral Segmentation
– Crime Rings
– Product Strength & Weakness
• Methods: Modularity, Clustering Coefficient,
Betweenness, Closeness
Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf
(Implemented: Spark, iGraph)
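One of the community measures listed above, the local clustering coefficient, is simple enough to sketch in plain Scala. The code below treats edges as undirected and, for each vertex, divides the number of links among its neighbours by the number of possible links. The object name `ClusteringSketch` and the sample graph in the usage note are assumptions for illustration.

```scala
object ClusteringSketch {
  /** Local clustering coefficient for every vertex of an undirected graph:
    * (edges among a vertex's neighbours) / (possible edges among them).
    */
  def localClustering(edges: Seq[(String, String)]): Map[String, Double] = {
    // Build a symmetric adjacency map, ignoring self-loops.
    val neighbours: Map[String, Set[String]] = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }

    neighbours.map { case (v, ns) =>
      val k = ns.size
      // Count the neighbour pairs that are themselves connected.
      val links = ns.toSeq.combinations(2).count { pair =>
        neighbours(pair(0)).contains(pair(1))
      }
      val possible = k * (k - 1) / 2
      v -> (if (possible == 0) 0.0 else links.toDouble / possible)
    }
  }
}
```

For a triangle a, b, c with an extra vertex d hanging off c, `localClustering(Seq("a" -> "b", "b" -> "c", "a" -> "c", "c" -> "d"))` gives 1.0 for a and b, one third for c, and 0.0 for d; tightly knit groups such as the rings mentioned above show up as clusters of high values.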
Growth in Graph
• Use case: Predict where the graph will grow, or
suggest new edges
– Event Prediction
– Product Recommendation
• Methods: Loopy Belief Propagation, Belief
Networks, SALSA
Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)
GraphX
• Apache Spark Library for conducting Graph
Analytics
• Graph Operations: num[Edges,Vertices],
degrees, collectNeighbors
• Graph Analytics:
– PageRank
– Connected Components
– Triangle Count
http://spark.apache.org/graphx/
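GraphX's connected-components algorithm labels each component with the id of its lowest-numbered vertex. The plain-Scala sketch below shows that idea on a single machine by repeatedly pushing the smaller label across every edge until nothing changes. It is not the GraphX API; the object name `ComponentsSketch` and the sample ids are assumptions for illustration.

```scala
object ComponentsSketch {
  /** Labels every vertex with the smallest vertex id in its component,
    * treating edges as undirected. Every edge endpoint must be in `vertices`.
    */
  def connectedComponents(
      vertices: Seq[Long],
      edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Start with each vertex labelled by its own id.
    var labels: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        // Both endpoints adopt the smaller of their two current labels.
        val low = math.min(labels(a), labels(b))
        if (labels(a) != low || labels(b) != low) {
          labels = labels + (a -> low) + (b -> low)
          changed = true
        }
      }
    }
    labels
  }
}
```

With vertices 1 to 5 and edges (2,3), (1,2), (4,5), vertices 1, 2 and 3 end up labelled 1 and vertices 4 and 5 end up labelled 4. Labels only ever decrease, so the loop terminates.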
Property Graph
GraphFrame
• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL,
Gremlin…
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with
Graphs
• GraphFrame: GraphX & DataFrame Operations
https://graphframes.github.io/index.html
GraphFrame
Vertices DataFrame
val vertices =
  sqlContext.createDataFrame(
    List(
      ("a1", "Wine", "Beverage"),
      ("b2", "Beer", "Beverage"),
      ("c3", "Pretzel", "Snack"),
      ("d4", "Cheese", "Snack")
    )).toDF("id", "name", "type")
Edges DataFrame
val edges =
  sqlContext.createDataFrame(
    List(
      ("a1", "d4", 15455),
      ("b2", "c3", 4849),
      ("a1", "c3", 40),
      ("b2", "d4", 134)
    )).toDF("src", "dst", "count")  // GraphFrame expects edge columns named "src" and "dst"
GraphFrame
import org.graphframes.GraphFrame
val productsGraphFrame =
  GraphFrame(vertices, edges)
// Keep only the snack vertices
productsGraphFrame.
  vertices.filter("type = 'Snack'")
// Number of edges in the graph
productsGraphFrame.edges.count()
What is Synthetic Identity Fraud?
http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud
Why has Synthetic Identity Fraud
emerged as a big problem?
Verafin
How are Synthetic IDs created?
Verafin
Verafin
How are Financial Companies exploited?
Verafin
What is the impact of Synthetic Identity Fraud?
Verafin
Verafin
How can Graph Analytics help
solve the Synthetic Identity problem?
Customer Address DataFrame (vertices)
val customerAddresses =
  sqlContext.createDataFrame(
    List(
      ("a1", "123 Main Street", "123abc456efg"),
      ("b2", "345 High Street", "123abc456efg"),
      ("c3", "789 Park Ave", "123abc456efg")
    )).toDF("id", "address", "customerid")
Add Fake Address
val fakeAddress = sqlContext.createDataFrame(
  List(
    ("d4", "999 Ocean Ave", "123abc456efg")
  )).toDF("id", "address", "customerid")
val tempCustomerAddresses =
  customerAddresses.union(fakeAddress)
DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx
How can Graph Analytics help
solve the Synthetic Identity problem?
Master Address Connection Edges DataFrame
val masterAddressConnections = sqlContext.createDataFrame(
  List(
    ("b2", "a1"),
    ("e5", "c3"),
    ("c3", "b2"),
    ("a1", "c3"),
    ("e5", "d4")
    // …
  )).toDF("src", "dst")
// Assumption: the edge endpoints are vertex ids, so matches are found by joining on "id".
// Edges whose destination is a known customer address
val toEdgeMatches = masterAddressConnections.join(customerAddresses,
  masterAddressConnections("dst") ===
    customerAddresses("id")).select("src", "dst")
// Edges whose source is a known customer address
val fromEdgeMatches =
  masterAddressConnections.join(customerAddresses,
    masterAddressConnections("src") ===
      customerAddresses("id")).select("src", "dst")
val checkEdges = fromEdgeMatches.union(toEdgeMatches)
Detection GraphFrame
PageRank
val detectionGraphFrame =
  GraphFrame(tempCustomerAddresses,
    checkEdges)
// PageRank
val resultRanks =
  detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()
// Personalized PageRank
val d4Ranks =
  detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()
resultRanks.vertices.select("id", "pagerank").show()
How do we decide if this address is
fraud or not?
PageRank
id pagerank
a1 0.9463535901944437
b2 0.9463535901944437
c3 0.9463535901944437
d4 0.15
Personalized PageRank
a1
id pagerank
a1 0.33343371928623045
c3 0.28341866139329586
b2 0.21580437563085933
d4 0.0
b2
id pagerank
b2 0.33343371928623045
a1 0.28341866139329586
c3 0.21580437563085933
d4 0.0
c3
id pagerank
c3 0.33343371928623045
b2 0.28341866139329586
a1 0.21580437563085933
d4 0.0
d4
id pagerank
d4 0.15
a1 0.0
b2 0.0
c3 0.0
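One way to read the tables above: in the global PageRank run, d4 scores exactly the reset probability (0.15), meaning no other address passes any rank to it. The plain-Scala sketch below turns that observation into a simple flagging rule. The rule, the object name `IsolationCheck` and the `epsilon` tolerance are illustrative assumptions rather than the method used in the talk, and a real fraud decision would need more evidence than a single score.

```scala
object IsolationCheck {
  /** Returns the ids whose PageRank did not rise above the reset probability,
    * i.e. vertices that received no rank from any neighbour.
    */
  def unsupported(
      ranks: Map[String, Double],
      resetProbability: Double = 0.15,
      epsilon: Double = 1e-9): Set[String] =
    ranks.collect { case (id, r) if r <= resetProbability + epsilon => id }.toSet

  def main(args: Array[String]): Unit = {
    // Global PageRank values copied from the table above.
    val ranks = Map(
      "a1" -> 0.9463535901944437,
      "b2" -> 0.9463535901944437,
      "c3" -> 0.9463535901944437,
      "d4" -> 0.15)
    println(unsupported(ranks))
  }
}
```

Applied to the values in the table, only d4 is returned, which matches the address that was added as a fake.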
Future Directions and Thoughts
• Focus on delivering value over tools and
technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Large-scale Graph Analytics is still not scalable
Apache Spark GraphX: http://spark.apache.org/graphx/
Follow me on Twitter (@mopatel) for interesting Deep Learning and
Analytics tweets

Traversing our way through Apache Spark GraphFrames and GraphX

  • 1. Traversing our way through Apache Spark GraphFrames and GraphX Mo Patel Data Day Texas 2017
  • 2. A bit about me • Currently Deep Learning Practice Director atTeradata – Road Object Detection & Scene Labeling – Visual Product Search – Chatbots • Previously – Analytics @ Social Sharing Startup – Analytics @ Intelligence Community – Distributed Systems @ Satellite Operations Company – Software Engineering @ Defense Communications Program • Research Interests: Distributed Systems for Analytics • Love snowboarding and in general outdoor sports and working out to keep doing those things mopatel
  • 3. What is this talk about? • What are Graphs and what are some interesting things about Graphs? • What are some Graph Analytics Examples? • What are GraphFrames? • What is GraphX? • How can Graph Analytics help financial companies fight Synthetic Identity Fraud?
  • 4. What is a Graph? Natural and artificial examples (images: Wikipedia)
  • 5. Power of Graphs Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14
  • 6. Power of Graphs • Good: Facebook, Twitter, WhatsApp… most popular social networks • Bad: MySpace, Friendster, Orkut… “Nobody goes there anymore. It's too crowded” – Yogi Berra
  • 7. • Data Growth: Recall Metcalfe’s Law (n^2) and Reed’s Law (2^n) • Memory Intensive • Processing Intensive Graph Databases cost money, Graph Analytics make money!
  • 8. Graph Databases cost money, Graph Analytics make money! • PageRank, EigenCentrality • Modularity, Clustering Coefficient, Betweenness, Closeness • Loopy Belief Propagation, SALSA
  • 9. Node Score in a Graph • Use case: Find out how important an entity is in a graph – Entity Fraud Detection – Influencers – Crime Bosses • Methods: PageRank, EigenCentrality PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph) EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
  • 10. Communities in a Graph • Use case: Detect similar nodes – Behavioral Segmentation – Crime Rings – Product Strength & Weakness • Methods: Modularity, Clustering Coefficient, Betweenness, Closeness Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi) Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
  • 11. Growth in Graph • Use case: Predict where the graph will grow, or suggest new edges – Event Prediction – Product Recommendation • Methods: Loopy Belief Propagation, Belief Networks, SALSA Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian) SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)
  • 12. GraphX • Apache Spark library for conducting Graph Analytics • Graph Operations: numEdges, numVertices, degrees, collectNeighbors • Graph Analytics: – PageRank – Connected Components – Triangle Count http://spark.apache.org/graphx/
  • 14. GraphFrame • SQL-like context is very popular • Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin… • Spark introduced DataFrame in February 2015 • Goal: Make it easy for DataFrame users to work with Graphs • GraphFrame: GraphX & DataFrame Operations https://graphframes.github.io/index.html
  • 15. GraphFrame
      Vertices DataFrame:
        import org.graphframes.GraphFrame
        // GraphFrame expects an "id" column on the vertices
        val vertices = sqlContext.createDataFrame(List(
          ("a1", "Wine", "Beverage"),
          ("b2", "Beer", "Beverage"),
          ("c3", "Pretzel", "Snack"),
          ("d4", "Cheese", "Snack")
        )).toDF("id", "name", "type")
      Edges DataFrame:
        // GraphFrame expects "src" and "dst" columns on the edges
        val edges = sqlContext.createDataFrame(List(
          ("a1", "d4", 15455),
          ("b2", "c3", 4849),
          ("a1", "c3", 40),
          ("b2", "d4", 134)
        )).toDF("src", "dst", "count")
      GraphFrame:
        val productsGraphFrame = GraphFrame(vertices, edges)
        // Quote the literal so it is not parsed as a column name
        productsGraphFrame.vertices.filter("type == 'Snack'")
        // Number of edges, counted on the edges DataFrame
        productsGraphFrame.edges.count()
  • 16. What is Synthetic Identity Fraud? http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud
  • 17. Why has Synthetic Identity Fraud emerged as a big problem? (Source: Verafin)
  • 18. How are Synthetic IDs created? (Source: Verafin)
  • 19. How are Financial Companies exploited? (Source: Verafin)
  • 20. What is the impact of Synthetic Identity Fraud? (Source: Verafin)
  • 21. How can Graph Analytics help solve the Synthetic Identity problem?
      Customer Address DataFrame (vertices):
        val customerAddresses = sqlContext.createDataFrame(List(
          ("a1", "123 Main Street", "123abc456efg"),
          ("b2", "345 High Street", "123abc456efg"),
          ("c3", "789 Park Ave", "123abc456efg")
        )).toDF("id", "address", "customerid")
      Add Fake Address:
        val fakeAddress = sqlContext.createDataFrame(List(
          ("d4", "999 Ocean Ave", "123abc456efg")
        )).toDF("id", "address", "customerid")
        val tempCustomerAddresses = customerAddresses.union(fakeAddress)
      Databricks Cloud Notebook: http://tiny.cc/ddtx17graphx
  • 22. How can Graph Analytics help solve the Synthetic Identity problem?
      Master Address Connection Edges DataFrame:
        import org.graphframes.GraphFrame
        val masterAddressConnections = sqlContext.createDataFrame(List(
          ("b2", "a1"), ("e5", "c3"), ("c3", "b2"), ("a1", "c3"),
          ("e5", "d4") // … (list truncated on the slide)
        )).toDF("src", "dst")
        // Edge endpoints are vertex ids, so match them on the "id" column
        val toEdgeMatches = masterAddressConnections.join(customerAddresses,
          masterAddressConnections("dst") === customerAddresses("id")).select("src", "dst")
        val fromEdgeMatches = masterAddressConnections.join(customerAddresses,
          masterAddressConnections("src") === customerAddresses("id")).select("src", "dst")
        val checkEdges = fromEdgeMatches.union(toEdgeMatches)
      Detection GraphFrame:
        val detectionGraphFrame = GraphFrame(tempCustomerAddresses, checkEdges)
      PageRank:
        // PageRank, run until the scores change by less than the tolerance
        val resultRanks = detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()
        // Personalized PageRank, seeded at the suspect address d4
        val d4Ranks = detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()
        resultRanks.vertices.select("id", "pagerank").show()
      Databricks Cloud Notebook: http://tiny.cc/ddtx17graphx
  • 23. How do we decide if this address is fraud or not?
      PageRank:
        id  pagerank
        a1  0.9463535901944437
        b2  0.9463535901944437
        c3  0.9463535901944437
        d4  0.15
      Personalized PageRank, source a1:
        id  pagerank
        a1  0.33343371928623045
        c3  0.28341866139329586
        b2  0.21580437563085933
        d4  0.0
      Personalized PageRank, source b2:
        id  pagerank
        b2  0.33343371928623045
        a1  0.28341866139329586
        c3  0.21580437563085933
        d4  0.0
      Personalized PageRank, source c3:
        id  pagerank
        c3  0.33343371928623045
        b2  0.28341866139329586
        a1  0.21580437563085933
        d4  0.0
      Personalized PageRank, source d4:
        id  pagerank
        d4  0.15
        a1  0.0
        b2  0.0
        c3  0.0
      Databricks Cloud Notebook: http://tiny.cc/ddtx17graphx
  • 24. Future Directions and Thoughts • Focus on delivering value over tools and technologies • Will we settle on a language for Graph Analytics? • More algorithms in GraphX? • Large-scale Graph Analytics is still not scalable
  • 25. Apache Spark GraphX: http://spark.apache.org/graphx/ Follow me on Twitter (@mopatel) for interesting Deep Learning and Analytics tweets
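
The scores on slide 23 can be made more intuitive with a small sketch in plain Scala (no Spark needed). It assumes an unnormalized PageRank update, rank(v) = resetProb + (1 - resetProb) * (sum of rank(u) / outDegree(u) over in-edges u -> v), which is consistent with d4 scoring exactly 0.15 on the slide but is not taken from the talk or the notebook. The names `PageRankSketch` and `pageRank` are hypothetical, the starting rank of 1.0 and the fixed iteration count are assumptions, and edges touching the unknown vertex e5 are left out.

```scala
// Illustrative only: a tiny in-memory PageRank, not the GraphFrames or GraphX implementation.
object PageRankSketch {
  def pageRank(
      vertices: Seq[String],
      edges: Seq[(String, String)], // (src, dst) pairs; both ends must be in `vertices`
      resetProb: Double = 0.15,
      iterations: Int = 20
  ): Map[String, Double] = {
    // Out-degree of every vertex that has at least one outgoing edge.
    val outDegree: Map[String, Int] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.size }

    // Assumption: every vertex starts with rank 1.0.
    var ranks: Map[String, Double] = vertices.map(v => v -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Each edge carries rank(src) / outDegree(src) to its destination.
      val incoming: Map[String, Double] =
        edges.groupBy(_._2).map { case (dst, es) =>
          dst -> es.map { case (src, _) => ranks(src) / outDegree(src) }.sum
        }

      // A vertex with no incoming edges keeps only the reset probability.
      ranks = vertices.map { v =>
        v -> (resetProb + (1.0 - resetProb) * incoming.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    val vertices = Seq("a1", "b2", "c3", "d4")
    // The three real addresses point at each other; the fake address d4 is isolated.
    val edges = Seq(("b2", "a1"), ("c3", "b2"), ("a1", "c3"))
    pageRank(vertices, edges).toSeq.sortBy(_._1).foreach { case (id, rank) =>
      println(f"$id%s $rank%.4f")
    }
  }
}
```

In this sketch the three connected addresses end up with equal scores, while d4 stays at the reset probability because nothing links to it. That gap is the signal the slide uses to flag the address. The exact 0.946 values on the slide depend on the library's convergence handling and are not reproduced here.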