How LinkedIn Uses Scalding for Data Driven Product Development
1. Using Scalding for Data-Driven Product Development
Sasha Ovsankin
LinkedIn
2. http://linkedin.com/in/sashao
• Studied Mathematical Physics at Moscow University
• Software Engineering background
• Work at LinkedIn on Email Experience
• Publish open source at https://github.com/SashaOv
• Publish music at SoundCloud
3. /home
Scalding is a must-have tool in your arsenal of Hadoop development.
– Hadoop ecosystem at LinkedIn
– Hadoop development tools
– Scalding: why and how
– What we do with Scalding, code examples.
5. /linkedin/hadoop/practices
• All online data end up in HDFS
– Mostly encoded in Avro
• Production Process
– CI/Automatic Build
• More info forthcoming
– Production Review
– Operations and Monitoring
• More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem
8. /hadoop/dev-tools/Java
• Java MR
– Maximum flexibility with Hadoop API
– Verbose
• Cascading
– Retain (some) Java flexibility
– Less verbose
9. /hadoop/dev-tools/Scalding
http://github.com/twitter/scalding
• Scala-based DSL
• Built on Cascading, a stable and mature framework
• Uses API similar to Scala collections:
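    import com.twitter.scalding._

    class WordCountJob(args : Args) extends Job(args) {
      TextLine( args("input") )                                             // one 'line field per input line
        .flatMap('line -> 'word) { line : String => line.split("""\s+""") } // split each line on whitespace
        .groupBy('word) { _.size }                                          // count occurrences of each word
        .write( Tsv( args("output") ) )                                     // write (word, count) pairs as TSV
    }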
• Succinct and powerful
• High level of abstraction
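To illustrate the "API similar to Scala collections" point, here is a minimal sketch of the same word count written against plain in-memory Scala collections, with no Scalding or Hadoop involved. The object name WordCountSketch, the helper wordCount, and the sample input are hypothetical and chosen only for illustration; the extra filter step is an addition that drops the empty token produced by leading whitespace.

```scala
object WordCountSketch {
  // In-memory analogue of the Scalding job: split lines into words, group, count.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("""\s+""").toSeq) // plays the role of .flatMap('line -> 'word)
      .filter(_.nonEmpty)                            // drop empty tokens from leading whitespace
      .groupBy(identity)                             // plays the role of .groupBy('word)
      .map { case (word, occurrences) => word -> occurrences.size } // plays the role of { _.size }

  def main(args: Array[String]): Unit =
    // Print tab-separated (word, count) pairs, sorted by word for stable output.
    wordCount(Seq("to be or not", "to be")).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(s"$word\t$count")
    }
}
```

The shape of the pipeline is the same in both versions; the Scalding job differs in that it names fields with symbols and runs the steps as Hadoop jobs rather than in memory.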
10. …/tools/comparison
                                   PIG                Java/Scala
Debugging: stack traces            No*                Yes
Code reuse                         Macros, jobs       Classes, packages, modules, frameworks…
Custom data structures/algorithms  UDF                Native
Packaging                          Fat jars           Thin jars
Avro support                       Partial            Native
Unit testing                       PigUnit (in Java)  Standard unit testing frameworks: JUnit/TestNG/MRUnit, Scalding tests

                                   PIG                Java MR            Scalding
LOC count                          Small*             Large              Small
11. …/tools/buyers-guide
If you need…                                                                  Then use…
Quick-and-dirty simple scripts, existing UDFs                                 PIG, Hive
Complex flows, full access to Avro, debugging, unit testing, productization   Scalding
Full flexibility of Hadoop API but not too complex processing                 Java MR
12. /linkedin/email-experience
• Goal
– Improve messaging users’ experience
• Plan
– Track
– Experiment
– Optimize
– Personalize
• Implementation
– Generate messages offline
– Apply sophisticated relevance algorithms
– Shorten the release cycle to facilitate fast iteration
14. …/email-experience/why-scalding
• Scala + Map Reduce = match made in heaven
scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
res20: Int = 333833500
• Stack traces (yeah!)
• Native Avro support
• Integrates well with CI/build system
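The map/reduce one-liner above can be reproduced as a small standalone program. This is a sketch under one assumption: the slide's pow is taken to be an Int-returning helper (scala.math.pow returns a Double, which would make the REPL result a Double rather than an Int), so the version below squares with plain Int multiplication. The names SumOfSquares and sumOfSquares are hypothetical.

```scala
object SumOfSquares {
  // "Map" squares each element, "reduce" folds the squares into a single sum.
  // Requires n >= 1, because reduce throws on an empty collection.
  def sumOfSquares(n: Int): Int =
    (1 to n).map(x => x * x).reduce(_ + _)

  def main(args: Array[String]): Unit =
    println(sumOfSquares(1000)) // prints 333833500
}
```

The result agrees with the closed form n(n+1)(2n+1)/6 for n = 1000 and stays well inside the Int range.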
17. /linkedin/…/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
– Pretty happy with readability and maintainability
• ~10 flows are currently in production, and counting
• Currently ~12 people are coding in Scalding
• Created Scalding user group
• Growing interest
• Learning:
– Scala[Scalding] < Scala[ _ ]
19. /linkedin/…/scalding/what-to-improve
• Better Scala language IDE tools
• One-click development (-> demo)
• Monitoring and troubleshooting
– Counters – implemented in 0.9
– Better troubleshooting of the ser/de process
• Better tools for tuning of jobs
– Setting the number of mappers and reducers
• Best practices
20. /home
Scalding is a must-have tool in your arsenal of Hadoop development.
– Hadoop ecosystem at LinkedIn
– Hadoop development tools
– Scalding: why and how
– What we do with Scalding, code examples.
21. /linkedin/join-us
• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Data Science and Software Engineering
Questions?