This document summarizes a presentation about analyzing graphs using Apache Spark's GraphFrames and GraphX libraries. It begins with an introduction of the speaker and their interests. It then discusses what graphs are and provides examples of graph analytics like node scoring and community detection. It introduces GraphX and GraphFrames, how they allow working with property graphs and integrating graph operations with DataFrames. It also provides an example of how financial institutions can use graph analytics to detect synthetic identity fraud by analyzing relationships between customer addresses.
Traversing our way through Apache Spark GraphFrames and GraphX
1. Traversing our way through Apache Spark GraphFrames and GraphX
Mo Patel
Data Day Texas 2017
2. A bit about me
• Currently Deep Learning Practice Director at Teradata
– Road Object Detection & Scene Labeling
– Visual Product Search
– Chatbots
• Previously
– Analytics @ Social Sharing Startup
– Analytics @ Intelligence Community
– Distributed Systems @ Satellite Operations Company
– Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and in general outdoor sports and working out to keep doing those things
mopatel
3. What is this talk about?
• What are Graphs and what are some interesting things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial companies fight Synthetic Identity Fraud?
4. What is a Graph?
[Figures: a natural graph and an artificial graph. Source: Wikipedia]
5. Power of Graphs
Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14
6. Power of Graphs
• Good: Facebook, Twitter, WhatsApp… the most popular social networks
• Bad: MySpace, Friendster, Orkut… "Nobody goes there anymore. It's too crowded." (Yogi Berra)
7. • Data Growth: Recall Metcalfe's Law (n^2) and Reed's Law (2^n)
• Memory Intensive
• Processing Intensive
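To make the gap between the two growth laws concrete, here is a minimal sketch that evaluates n^2 and 2^n for a few network sizes. The object and method names are made up for illustration, and the laws are taken in their simplest form (constant factors ignored).

```scala
// Illustrative only: compares how quickly the two growth laws diverge.
object NetworkGrowth {
  // Metcalfe's Law: grows on the order of n^2 (pairwise connections).
  def metcalfe(n: Int): BigInt = BigInt(n).pow(2)

  // Reed's Law: grows on the order of 2^n (possible subgroups).
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit = {
    for (n <- Seq(10, 20, 40)) {
      println(s"n=$n  n^2=${metcalfe(n)}  2^n=${reed(n)}")
    }
  }
}
```

At n = 20 the two already differ by more than three orders of magnitude (400 versus 1,048,576), which is why memory and processing become the bottleneck so quickly.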
Graph Databases cost money,
Graph Analytics make money!
9. Node Score in a Graph
• Use case: Find out how important an entity is in a graph
– Entity Fraud Detection
– Influencers
– Crime Bosses
• Methods: PageRank, EigenCentrality
PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
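As a rough illustration of node scoring, the sketch below runs a plain-Scala PageRank iteration over a tiny edge list. It is not Spark's implementation: the object name, the toy graph, the fixed iteration count, and the unnormalized update rule (reset probability plus damped incoming rank) are all assumptions chosen to keep the example short.

```scala
// A minimal PageRank sketch in plain Scala (assumed names, toy data).
object PageRankSketch {
  // edges are (src, dst) pairs; every src and dst is assumed to appear in vertices.
  def pageRank(vertices: Seq[String],
               edges: Seq[(String, String)],
               resetProb: Double = 0.15,
               iterations: Int = 20): Map[String, Double] = {
    val outDegree: Map[String, Int] =
      edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    var ranks: Map[String, Double] = vertices.map(v => v -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Each vertex splits its current rank evenly across its outgoing edges.
      val contributions = edges.map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
      val received: Map[String, Double] =
        contributions.groupBy(_._1).map { case (v, cs) => v -> cs.map(_._2).sum }
      // New rank = reset probability + damped sum of received contributions.
      ranks = vertices.map { v =>
        v -> (resetProb + (1 - resetProb) * received.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    val vertices = Seq("a", "b", "c", "d")
    val edges = Seq("a" -> "b", "b" -> "c", "c" -> "a") // "d" has no edges
    pageRank(vertices, edges).toSeq.sortBy(-_._2).foreach { case (v, r) => println(s"$v $r") }
  }
}
```

In this toy graph the three vertices on the cycle keep passing rank to each other, while the isolated vertex "d" is left with only the reset probability.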
14. GraphFrame
• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin…
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with Graphs
• GraphFrame: GraphX & DataFrame Operations
https://graphframes.github.io/index.html
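The following sketch shows what mixing graph and DataFrame operations can look like. It assumes Spark 2.x with the graphframes package on the classpath and a local SparkSession; the object name, the toy vertices and edges, and the column names other than "id", "src" and "dst" are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

// Assumed setup: local Spark plus the graphframes package; toy data only.
object GraphFrameSketch {
  def build(spark: SparkSession): GraphFrame = {
    import spark.implicits._
    // Vertices need an "id" column; edges need "src" and "dst" columns.
    val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
    val edges = Seq(("a", "b", "follows"), ("b", "c", "follows"), ("c", "b", "follows"))
      .toDF("src", "dst", "relationship")
    GraphFrame(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("graphframe-sketch").getOrCreate()
    import spark.implicits._
    val g = build(spark)

    // Graph operation: in-degree per vertex, returned as an ordinary DataFrame.
    val inDegrees = g.inDegrees

    // DataFrame operations on the graph result: join to vertex attributes, then filter.
    inDegrees.join(g.vertices, "id")
      .filter($"inDegree" >= 2)
      .select("name", "inDegree")
      .show()

    // Motif finding: pairs of vertices that point at each other.
    g.find("(x)-[e1]->(y); (y)-[e2]->(x)").select("x.id", "y.id").show()

    spark.stop()
  }
}
```

Because every graph result comes back as a DataFrame, the usual select, filter and join calls apply without converting between representations.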
20. What is the impact of Synthetic Identity Fraud?
[Figures. Source: Verafin]
21. How can Graph Analytics help solve the Synthetic Identity problem?
Customer Address DataFrame (used as the graph's vertices)
// Each row is one address on file; all of them share the same customer ID.
val customerAddresses =
  sqlContext.createDataFrame(
    List(
      ("a1", "123 Main Street", "123abc456efg"),
      ("b2", "345 High Street", "123abc456efg"),
      ("c3", "789 Park Ave", "123abc456efg")
    )).toDF("id", "address", "customerid")
Add Fake Address
// The fake address is filed under the same customer ID.
val fakeAddress = sqlContext.createDataFrame(
  List(
    ("d4", "999 Ocean Ave", "123abc456efg")
  )).toDF("id", "address", "customerid")
// Vertex set for the detection graph: known addresses plus the fake one.
val tempCustomerAddresses =
  customerAddresses.union(fakeAddress)
DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx
22. How can Graph Analytics help solve the Synthetic Identity problem?
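Master Address Connection Edges DataFrame
// Known connections between address vertices, as (src, dst) pairs of vertex ids.
val masterAddressConnections = sqlContext.createDataFrame(
  List(
    ("b2", "a1"),
    ("e5", "c3"),
    ("c3", "b2"),
    ("a1", "c3"),
    ("e5", "d4")
    // … (further edges elided on the slide)
  )).toDF("src", "dst")
// Assumption: the slide's "from"/"to" columns correspond to "src"/"dst", and the
// edge endpoints are vertex ids, so the joins below match on "id".
// Edges whose destination is one of the customer's known addresses.
val toEdgeMatches = masterAddressConnections.join(customerAddresses,
  masterAddressConnections("dst") ===
    customerAddresses("id")).select("src", "dst")
// Edges whose source is one of the customer's known addresses.
val fromEdgeMatches =
  masterAddressConnections.join(customerAddresses,
    masterAddressConnections("src") ===
      customerAddresses("id")).select("src", "dst")
val checkEdges = fromEdgeMatches.union(toEdgeMatches)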
Detection GraphFrame
PageRank
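import org.graphframes.GraphFrame

// Vertices: all addresses including the fake one; edges: the matched connections.
val detectionGraphFrame =
  GraphFrame(tempCustomerAddresses, checkEdges)
// PageRank over the whole graph, run until ranks change by less than the tolerance
val resultRanks =
  detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()
// Personalized PageRank, with d4 as the source vertex
val d4Ranks =
  detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()
resultRanks.vertices.select("id", "pagerank").show()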
23. How do we decide if this address is
fraud or not?
PageRank
id pagerank
a1 0.9463535901944437
b2 0.9463535901944437
c3 0.9463535901944437
d4 0.15
Personalized PageRank
Source vertex: a1
id pagerank
a1 0.33343371928623045
c3 0.28341866139329586
b2 0.21580437563085933
d4 0.0
Source vertex: b2
id pagerank
b2 0.33343371928623045
a1 0.28341866139329586
c3 0.21580437563085933
d4 0.0
Source vertex: c3
id pagerank
c3 0.33343371928623045
b2 0.28341866139329586
a1 0.21580437563085933
d4 0.0
Source vertex: d4
id pagerank
d4 0.15
a1 0.0
b2 0.0
c3 0.0
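One way to read these tables: d4 ends up with exactly the reset probability (0.15), meaning it receives no rank from any other address, while the three known addresses reinforce each other. The sketch below encodes that reading as a simple check over the PageRank values listed above. The rule itself, the object name, and the tolerance are assumptions for illustration; the slides do not state a specific decision rule.

```scala
// Hypothetical decision rule: flag vertices whose PageRank equals the reset probability.
object IsolationCheck {
  val resetProbability = 0.15

  // PageRank values as listed in the table above.
  val ranks: Map[String, Double] = Map(
    "a1" -> 0.9463535901944437,
    "b2" -> 0.9463535901944437,
    "c3" -> 0.9463535901944437,
    "d4" -> 0.15
  )

  // A rank equal to the reset probability means no neighbor contributed any rank.
  def unconnected(ranks: Map[String, Double], eps: Double = 1e-9): Set[String] =
    ranks.collect { case (id, r) if math.abs(r - resetProbability) < eps => id }.toSet

  def main(args: Array[String]): Unit = {
    println(unconnected(ranks)) // prints Set(d4)
  }
}
```

A flagged address is a candidate for review rather than proof of fraud; a legitimate new address could also lack connections.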
24. Future Directions and Thoughts
• Focus on delivering value over tools and
technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Graph Analytics still does not scale well to very large graphs
25. Apache Spark GraphX: http://spark.apache.org/graphx/
Follow me on Twitter (@mopatel) for interesting Deep Learning and
Analytics tweets