2. About Perficient
Perficient is a leading information technology consulting firm serving clients throughout North America. We help clients implement business-driven technology solutions that integrate business processes, improve worker productivity, increase customer loyalty, and create a more agile enterprise that can better respond to new business opportunities.
3. Perficient Profile
• Founded in 1997
• Public, NASDAQ: PRFT
• 2012 revenue of $327 million
• Major market locations throughout North America: Atlanta, Austin, Boston, Charlotte, Chicago, Cincinnati, Cleveland, Columbus, Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis, Minneapolis, New Orleans, New York, Northern California, Philadelphia, Southern California, St. Louis, Toronto, and Washington, D.C.
• Global delivery centers in China, Europe, and India
• ~2,000 colleagues
• Dedicated solution practices
• ~85% repeat business rate
• Alliance partnerships with major technology vendors
• Multiple vendor/industry technology and growth awards
4. Our Solutions Expertise
Business Solutions
• Business Intelligence
• Business Process Management
• Customer Experience and CRM
• Enterprise Performance Management
• Enterprise Resource Planning
• Experience Design (XD)
• Management Consulting
Technology Solutions
• Business Integration/SOA
• Cloud Services
• Commerce
• Content Management
• Custom Application Development
• Education
• Information Management
• Mobile Platforms
• Platform Integration
• Portal & Social
5. Speakers
Randall Gayle
• Data Management Director for Perficient
• 30+ years of data management experience
• Helps companies develop solutions around master data management, data quality, data governance, and data integration
• Provides data management expertise to industries including oil and gas, financial services, banking, healthcare, government, retail, and manufacturing
John Haddad
• Senior Director of Big Data Product Marketing for Informatica
• 25+ years of experience developing and marketing enterprise applications
• Advises organizations on Big Data best practices from a management and technology perspective
6. Interesting Facts about BIG Data
1. It took from the dawn of civilization to the year 2003 for the world to generate 1.8 zettabytes (about 10^12 gigabytes) of data. By 2011, the same amount was generated every two days on average.
2. If you stacked a pile of CD-ROMs on top of one another until you reached the current global storage capacity for digital information (about 295 exabytes), it would stretch 80,000 km beyond the moon.
3. Every hour, enough information is consumed by internet traffic to fill 7 million DVDs. Side by side, they'd scale Mount Everest 95 times.
4. 247 billion e-mail messages are sent each day; up to 80% of them are spam.
5. 48 hours of video are uploaded to YouTube every minute, resulting in 8 years' worth of digital content each day.
6. The world's data doubles every two years.
7. There are nearly as many bits of information in the digital universe as there are stars in our actual universe.
8. There are 30 billion pieces of content shared on Facebook every day and 750 million photos uploaded every two days.
7. Agenda
• Innovation vs. Cost
• PowerCenter Big Data Edition
• What else does Informatica offer for Big Data?
• What Are Customers Doing with Informatica and Big Data?
• Next Steps
• Q&A
9–10. How do you balance innovation and cost?
• Innovation: Business (CEO and VP/Director of Sales & Marketing, Customer Service, Product Development)
• Cost: IT (CIO and VP/Director of Information Management, BI / Data Warehousing, Enterprise Architecture)
11–13. Business is connecting innovation to Big Data
• Financial Services: Risk & Portfolio Analysis, Investment Recommendations, Fraud Detection
• Retail & Telco: Proactive Customer Engagement, Location-Based Services
• Media & Entertainment: Online & In-Game Behavior, Customer Cross/Up-Sell
• Manufacturing: Connected Vehicle, Predictive Maintenance
• Public Sector: Health Insurance Exchanges, Public Safety, Tax Optimization
• Healthcare & Pharma: Predicting Patient Outcomes, Total Cost of Care, Drug Discovery
14–16. IT is struggling with the cost of Big Data
• Growing data volume is quickly consuming capacity
• Need to onboard, store, & process new types of data
• High expense and lack of big data skills
20–21. [Chart] Without PowerCenter Big Data Edition, most time goes to data preparation (parse, profile, cleanse, transform, match), leaving little time for data analysis; with PowerCenter Big Data Edition, preparation time shrinks and the time available for analysis grows.
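To make the five preparation steps named on the chart concrete (parse, profile, cleanse, transform, match), here is a minimal Python sketch. Every function, field, and value is a hypothetical illustration, not an Informatica API:

```python
# Toy pipeline for the five preparation steps: parse, profile,
# cleanse, transform, match. All names are invented for illustration.

raw_records = ["alice,NY, 100", "bob,,250", "ALICE,ny,100"]

def parse(line):
    name, city, amount = (f.strip() for f in line.split(","))
    return {"name": name, "city": city, "amount": amount}

def profile(records):
    # Count missing values per field to surface quality issues early.
    fields = records[0].keys()
    return {f: sum(1 for r in records if not r[f]) for f in fields}

def cleanse(record):
    # Standardize casing and default empty fields.
    return {
        "name": record["name"].lower(),
        "city": record["city"].upper() or "UNKNOWN",
        "amount": record["amount"],
    }

def transform(record):
    record["amount"] = int(record["amount"])
    return record

def match(records):
    # Group records that appear to describe the same entity.
    groups = {}
    for r in records:
        groups.setdefault((r["name"], r["city"]), []).append(r)
    return groups

parsed = [parse(l) for l in raw_records]
nulls = profile(parsed)      # one missing city in this sample
clean = [transform(cleanse(r)) for r in parsed]
matched = match(clean)       # the two "alice" rows group together
```

Even in this toy, the preparation code dwarfs any analysis step, which is the point the chart makes.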
22. Informatica + Hadoop
PowerCenter Developers are Now Hadoop Developers
Data sources: transactions, OLTP, OLAP; social media, web logs; machine/device, scientific; documents and emails
Processing on Hadoop: Archive, Profile, Parse, ETL, Cleanse, Match
Delivery: analytics & operational dashboards, mobile apps, real-time alerts
23. The Vibe Virtual Data Machine
• Transformation Library: defines logic
• Optimizer: deploys most efficiently based on data, logic, and execution environment
• Executor: run-time physical execution
• Connectors: connectivity to data sources
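The "define logic once, execute anywhere" idea behind these components can be sketched conceptually: mappings are declared against a transformation library, compiled, and handed to an interchangeable executor. This is a toy model under stated assumptions, not the actual Vibe implementation:

```python
# Conceptual toy of a virtual data machine: logic is declared once
# and run by a pluggable executor. Names are invented for illustration.

TRANSFORM_LIBRARY = {        # prebuilt transforms the engine can call
    "strip": str.strip,
    "upper": str.upper,
}

def compile_mapping(step_names):
    # "Optimizer" role: resolve declared steps against the library.
    return [TRANSFORM_LIBRARY[name] for name in step_names]

class LocalExecutor:
    # "Executor" role: run-time physical execution, here an
    # in-process loop; a cluster executor could run the same mapping.
    def run(self, steps, records):
        out = []
        for rec in records:
            for step in steps:
                rec = step(rec)
            out.append(rec)
        return out

mapping = compile_mapping(["strip", "upper"])       # defined once...
result = LocalExecutor().run(mapping, ["  ca", "ny "])  # ...run here
```

Swapping `LocalExecutor` for a different engine without touching `mapping` is the "map once, deploy anywhere" claim in miniature.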
24. Vibe Virtual Data Machine: Map Once. Deploy Anywhere.
Information solutions and data services built on the VDM: Data Integration, Data Quality, Master Data Management, Information Lifecycle, Information Exchange, and 3rd-party solutions, plus infrastructure services and role-based tools.
Deploy anywhere: Desktop, Server, Hadoop, Cloud, Data Virtualization, Data Integration Hub, embedded DQ in apps.
25. PowerCenter Big Data Edition: The Safe On-Ramp to Big Data
Big Transaction Data
• Online Transaction Processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server, …
• Online Analytical Processing (OLAP) & DW appliances: Teradata, Red Brick, Essbase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel, …
• Cloud: Salesforce.com, Concur, Google App Engine, Amazon, …
Big Interaction Data
• Social media & web data: Facebook, Twitter, LinkedIn, YouTube, web applications, blogs, discussion forums, communities, partner portals, …
• Other interaction data: clickstream, image/text, scientific, genomic/pharma, medical, medical device, sensors/meters, RFID tags, CDR/mobile, …
Big Data Processing
26. PowerCenter Big Data Edition: The Safe On-Ramp to Big Data
The same landscape of big transaction data and big interaction data, now handled by the Vibe™ virtual data machine. Key capabilities:
• Universal Data Access
• High-Speed Data Ingestion and Extraction
• ETL on Hadoop
• Profiling on Hadoop
• Complex Data Parsing on Hadoop
• Entity Extraction and Data Classification on Hadoop
• No-Code Productivity
• Business-IT Collaboration
• Unified Administration
27. PowerCenter Big Data Edition: Lower Costs
Data sources (transactions, OLTP, OLAP; social media, web logs; machine/device, scientific; documents and emails) feed the EDW and data marts over a traditional grid.
• Optimize processing on low-cost hardware
• Increase productivity up to 5X
28. PowerCenter Big Data Edition: Minimize Risk
• Run on a traditional grid
• Deploy on-premise or in the cloud
• Quickly staff projects with trained experts
• Map Once. Deploy Anywhere™
29. PowerCenter Big Data Edition: Innovate Faster
Sources (transactions, OLTP, OLAP; social media, web logs; machine/device, scientific; documents and emails) feed analytics & operational dashboards, mobile apps, and real-time alerts.
• Onboard and analyze any type of data to gain big data insights
• Discover insights faster through rapid development and collaboration
• Operationalize big data insights to generate new revenue streams
30. Poll Question #1
What are your plans for Hadoop? (select one)
• Currently using Hadoop
• Plan to implement Hadoop in 3-6 months
• Plan to implement Hadoop in 6-12 months
• No plans for Hadoop
32–33. Lower Data Management Costs
[Chart: database size vs. time for the enterprise data warehouse, split into active and inactive data; archiving inactive data to low-cost storage preserves performance as transactions, OLTP, and OLAP data keep flowing in]
• Identify dormant data
• Archive inactive data to low-cost storage
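The archiving idea on this slide, flagging rows whose last access falls outside a retention window and moving them to low-cost storage, can be sketched as follows. The one-year threshold, dates, and table layout are illustrative assumptions:

```python
# Sketch of dormant-data archiving: rows untouched for more than a
# retention window move from the warehouse to low-cost storage.
from datetime import date, timedelta

DORMANT_AFTER = timedelta(days=365)   # illustrative retention window
TODAY = date(2013, 6, 1)

warehouse = [
    {"id": 1, "last_access": date(2013, 5, 20)},
    {"id": 2, "last_access": date(2011, 1, 15)},
    {"id": 3, "last_access": date(2012, 2, 1)},
]
low_cost_storage = []

def archive_inactive(rows, today):
    active = []
    for row in rows:
        if today - row["last_access"] > DORMANT_AFTER:
            low_cost_storage.append(row)   # archive dormant row
        else:
            active.append(row)             # keep hot data in the EDW
    return active

warehouse = archive_inactive(warehouse, TODAY)
```

In a real deployment the "last access" signal would come from usage monitoring rather than a column, but the partition into hot and archived data is the same.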
35. Minimize Risk
Dynamic data masking and data virtualization let the EDW, ODS, and MDM hub serve BI reports and dashboards without extra copies of the data.
• Avoid copies of data and augment the data warehouse using data virtualization
• Role-based, fine-grained secure access
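A minimal sketch of role-based dynamic masking as described here: the same record is returned differently depending on the caller's role, with no masked copy stored. The roles, fields, and masking rule are invented for illustration:

```python
# Toy dynamic data masking: apply per-role masking rules at read time.
# Roles and rules are illustrative assumptions.

def mask_ssn(value):
    # Keep only the last four digits visible.
    return "***-**-" + value[-4:]

MASKING_RULES = {
    "analyst": {"ssn": mask_ssn},   # analysts see masked SSNs
    "auditor": {},                  # auditors see everything
}

def fetch(record, role):
    # Fine-grained access: mask each field per the caller's role.
    rules = MASKING_RULES[role]
    return {k: rules.get(k, lambda v: v)(v) for k, v in record.items()}

row = {"name": "Ada", "ssn": "123-45-6789"}
```

Because masking happens on read, the underlying row stays unchanged and no second, masked dataset has to be maintained.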
37. Innovate Faster With Big Data
Data governance lifecycle: Discover, Define, Apply, Measure and Monitor.
Discover
• Data discovery
• Data profiling
• Data inventories
• Process inventories
• CRUD analysis
• Capabilities assessment
Define
• Business glossary creation
• Data classifications
• Data relationships
• Reference data
• Business rules
• Data governance policies
• Other dependent policies
Apply
• Automated rules
• Manual rules
• End-to-end workflows
• Business/IT collaboration
Measure and Monitor
• Proactive monitoring
• Operational dashboards
• Reactive operational DQ audits
• Dashboard monitoring/audits
• Data lineage analysis
• Program performance
• Business value/ROI
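The "Apply" stage above (automated rules feeding business/IT collaboration) might look like this in miniature; rule names, fields, and routing are illustrative assumptions:

```python
# Toy data-governance "Apply" stage: run automated quality rules over
# records and route failures for review. Rules are illustrative.

RULES = {
    "amount_positive": lambda r: r["amount"] > 0,
    "country_known": lambda r: r["country"] in {"US", "CA", "MX"},
}

def apply_rules(records):
    passed, failed = [], []
    for rec in records:
        # Collect the names of every rule this record violates.
        violations = [name for name, rule in RULES.items() if not rule(rec)]
        (failed if violations else passed).append((rec, violations))
    return passed, failed

data = [
    {"amount": 50, "country": "US"},
    {"amount": -5, "country": "ZZ"},
]
passed, failed = apply_rules(data)
```

The failed queue, with its list of violated rules, is what a steward would review in the collaboration step; the passed queue flows on unattended.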
38. Innovate Faster With Big Data
• Enrich master data to proactively engage customers & improve products and services
39. Innovate Faster With Big Data
• Analyze data in real-time using event-based processing and proactive monitoring
Event flow: customer business rules, social data, geo-location data, and transaction data are combined to trigger alerts and merchant offers.
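Event-based processing of the kind this slide depicts can be sketched as a per-event rule evaluation over a transaction stream; the thresholds, fields, and rules are invented for illustration:

```python
# Toy event-based processing: evaluate each incoming transaction
# against customer rules to emit an alert or a merchant offer.
# Thresholds and event fields are illustrative assumptions.

def on_event(event, home_city, alert_threshold=1000):
    # Rule 1: large transaction far from home triggers a fraud alert.
    if event["amount"] >= alert_threshold and event["city"] != home_city:
        return ("alert", "verify transaction")
    # Rule 2: transaction near a partner merchant triggers an offer.
    if event.get("near_merchant"):
        return ("offer", event["near_merchant"])
    return ("ok", None)

stream = [
    {"amount": 2500, "city": "Lagos", "near_merchant": None},
    {"amount": 12, "city": "Austin", "near_merchant": "coffee-co"},
]
decisions = [on_event(e, home_city="Austin") for e in stream]
```

A production system would run such rules continuously against the event stream rather than over a finished list, but the per-event decision logic is the same shape.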
40. Poll Question #2
What other data management technologies are you considering within the next 12 months? (check all that apply)
• Data archiving
• Data masking
• Data virtualization
• Data quality
• Data discovery
• MDM
• Real-time event-based processing
42. Large Government Agency: Flexible architecture to support rapid changes
The Challenge: Data volumes growing at 3-5 times over the next 2-3 years.
The Solution:
• Manage data integration and load of 10+ billion records from multiple disparate data sources
• Flexible data integration architecture to support changing business requirements in a heterogeneous data management environment
Environment: mainframe, RDBMS, and unstructured data feed the EDW and data warehouses over a traditional grid, with data virtualization serving business reports.
43. Large Global Financial Institution: Lower costs of Big Data projects
The Challenge: Data warehouse exploding with over 200 TB of data; user activity generating up to 5 million queries a day, impacting query performance.
The Result:
• Saved $20M, plus $2-3M ongoing, through archiving & optimization
• Reduced project timeline from 6 months to 2 weeks
• Improved performance by 25%
• Return on investment in less than 6 months
Environment: ERP, CRM, custom sources, and interaction data feed the EDW, with inactive data moved to archive storage, serving business reports.
44. Large Global Financial Institution: Lower costs and minimize risk
The Challenge: Increasing demand for faster data-driven decision making and analytics as data volumes and processing loads rapidly increase.
The Result:
• Cost-effectively scale performance
• Lower hardware costs
• Increased agility by standardizing on one data integration platform
• Leverage new data sources for faster innovation
Environment: web logs and RDBMS sources processed in near real-time on a traditional grid, feeding data marts and the data warehouse.
45. Large Global Automotive Manufacturer: Create innovative products and services
The Challenge: Collect data in real-time from all cars by end of the year for the "Connected Car" program.
The Result: helps enable the goals of the connected vehicle program:
• Embedding mobile technologies to enhance customer experience
• Predictive maintenance and improved fuel efficiency
• On-call roadside assistance and automatic service scheduling
Environment: the Connected Vehicle program feeds complex event processing and the EDW, serving business reports.
47. What should you be doing?
• Tomorrow
– Identify a business goal where data can have a significant impact
– Identify the skills you need to build a big data analytics team
• 3 months
– Identify and prioritize the data you need to achieve your business
goals
– Put a business plan and reference architecture together to optimize
your enterprise information management infrastructure
– Execute a quick win big data project with measurable ROI
• 1 year
– Extend data governance to include more data and more types of data that impact the business
– Consider a shared-services model to promote best practices and
further lower infrastructure and labor costs
Speaker notes: Cost saving / control of the growing data environment; data management cost optimization; business-specific big data analytics; big data integration to support analytics and new data products and services.
Challenges & problems customers are facing with Big Data:
• Growing data volumes, expensive data warehouse upgrades
• Variety of data, onboarding new types of data
• Lack of Big Data skills
• Building the business case for a big data project
• Don't know where to begin
• Regulatory compliance and security (e.g., data privacy, data sharing)
Speaker notes: There are several challenges related to Big Data analytics. As data volumes continue to grow, how can you continue to meet your SLAs for existing projects while controlling costs? It is estimated that Big Transaction Data alone is growing at 50-60% per year. Application databases are growing to the point where not only are hardware and software costs rising, but application performance is adversely affected. Data warehouses are also growing too fast, using up the capacity of current infrastructure investments. And with Big Interaction Data exploding, who can afford to store all this information in their enterprise data warehouse? One financial institution estimated that it costs $180K to manage 1 TB of data in their data warehouse over a 3-year period. As more and more users demand information, organizations also experience a proliferation of data marts that further increases hardware and database costs; a large healthcare insurance provider had over 30,000 data marts and spreadmarts across the company. With data volumes growing exponentially, it is becoming difficult to process all the data required for the data warehouse during the nightly batch windows. If you continue to just throw expensive hardware and database licenses at the Big Data problem, your costs will spiral out of control. More and more organizations would like to leverage the massive amounts of interaction data, such as social media and machine device data, to attract and retain customers, improve business operations, and sharpen their competitive advantage. But because so much of this data is multi-structured and generated at a rate akin to drinking from a fire hose, they find that accessing, storing, and processing interaction data can be extremely difficult. Another challenge is that with so much new data being generated and stored, it is difficult for organizations to find, understand, and trust the data.
Speaker notes: You don't want expensive data scientists ($300K FTE) doing this work. At JPMC, hand coding took 3 weeks where Informatica took 3 days. In a recent InformationWeek article, "Meet The Elusive Data Scientist," Catalin Ciobanu, a physicist who spent ten years at Fermi National Accelerator Laboratory (Fermilab) and is now senior manager of BI at Carlson Wagonlit Travel, said: "70% of my value is an ability to pull the data, 20% of my value is using data-science methods and asking the right questions, and 10% of my value is knowing the tools." DJ Patil, Data Scientist in Residence at Greylock Partners (formerly Chief Data Scientist at LinkedIn), states in his book "Data Jujitsu" that "80% of the work in any data project is in cleaning the data." In a recent study that surveyed 35 data scientists across 25 companies (Kandel et al., "Enterprise Data Analysis and Visualization: An Interview Study," IEEE Visual Analytics Science and Technology (VAST), 2012), data scientists expressed their frustration in preparing data for analysis. One said: "I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis. Most of the time I'm lucky if I get to do any 'analysis' at all." Another noted that "most of the time once you transform the data … the insights can be scarily obvious." 44% of big data projects are cancelled, versus 25% for IT projects in general, and many more fail to achieve project objectives (Infochimps/SSWUG Enterprise Big Data Survey 2012; Synamic Markets Enterprise IT Survey 2008; http://www.slideshare.net/infochimps/top-strategies-for-successful-big-data-projects). Why do projects fail? Business reasons: inaccurate scope (not enough time, deadlines busted), non-cooperation between departments, and lack of the right talent and expertise. Technical reasons: technical or roll-out roadblocks in gathering data from different sources, and finding and understanding tools, platforms, and technologies.
Speaker notes (use the two-slide version of this):
• Lower costs: lower HW/SW costs; optimized end-to-end performance; rich pre-built connectors and a library of transforms for ETL, data quality, parsing, and profiling.
• Increased productivity: up to 5x productivity gains with a no-code visual development environment; no need for Hadoop expertise for data integration.
• Proven path to innovation: 5,000+ customers, 500+ partners, 100,000+ trained Informatica developers; enterprise scalability, security, and support.
Speaker notes: The Vibe VDM works by receiving a set of instructions that describe the data source(s) from which it will extract data; the rules and flow by which that data will be transformed, analyzed, masked, archived, matched, or cleansed; and ultimately where that data will be loaded when the processing is finished. Vibe consists of a number of fundamental components (see Figure 2):
• Transformation Library: a collection of useful, prebuilt transformations that the engine calls to combine, transform, cleanse, match, and mask data. For those familiar with PowerCenter or Informatica Data Quality, this library is represented by the icons that the developer can drag and drop onto the canvas to perform actions on data.
• Optimizer: compiles data processing logic into an internal representation to ensure effective resource usage and efficient run time, based on data characteristics and execution environment configurations.
• Executor: a run-time execution engine that orchestrates the data logic using the appropriate transformations. The engine reads/writes data from an adapter or directly streams the data from an application.
• Connectors: Informatica's connectivity extensions provide data access from various data sources. This is what allows Informatica Platform users to connect to almost any data source or application for use by a variety of data movement technologies and modes, including batch, request/response, and publish/subscribe.
Speaker notes: The Vibe virtual data machine, although critical, is not sufficient by itself to solve the wide spectrum of data integration challenges. Vibe lets you master complexity and change, and it makes all data accessible. But in many places where data lives, especially some of the emerging data sources, the data is unfiltered, unstandardized, uncleansed, and unrelated; some of it is even unnecessary. It takes a considerable amount of work and expertise to understand how to transform raw data into information that provides insight and value. So in addition to the enabling capabilities that Vibe delivers, you also need to layer on the data services and information solutions of a fully integrated information platform that ensures that data is:
• Complete: Insight comes from a complete picture, not from fragments. You have to integrate the data fragments so you are looking at a whole (a whole person, account, product, business process, organization, or nation) rather than pieces or parts.
• Timely: Different consumers and different use cases require data at different times and frequencies. You want one platform that accelerates the delivery of data when, where, and how it is needed, whether via messaging, bulk delivery, or a virtual view.
• Trusted: If data is incomplete, inaccurate, or unrelated, it is not of much use. You need data quality services that let you diagnose problems and then cleanse the data in a sustainable, efficient way.
• Authoritative: You also need master data management services to master the data and relationships that constitute the "whole" for your key business entities, even as the data fragments feeding into the "whole" continually change.
• Actionable: Ultimately, data needs to serve a user, whether human or machine. The platform needs to help the user understand when to pay attention to an event, investigate an issue, or act.
• Secure: With the exponential rise in combinations of people accessing data across different systems, the potential for a security breach also rises exponentially. You must be able to secure data consistently and universally, no matter where it resides or how it is used.
But it is not sufficient for an information platform to merely have a long checklist of information services. Only an information platform powered by a VDM provides the interoperability required to easily combine services on the fly to meet your specific business requirements. Only an information platform powered by a VDM can provide the right tools and capabilities for the simplest entry-level uses to the most complex cross-enterprise initiatives, allowing you to share work across that entire span without recoding. And only an information platform powered by a VDM has the flexibility to be deployed stand-alone in the data center, as a cloud service, or embedded into applications, middleware infrastructure, and devices.
Informatica announced the launch of the PowerCenter Big Data Edition at Hadoop World with general availability in December.The PowerCenter Big Data Edition provides a proven path to innovation that lowers data management costs with benefits that include:Bringing innovative products and services to market faster and improve business operationsReducing big data management costs while handling growing data volumes and complexity Realizing performance and costs benefits by expanding adoption of Hadoop across projects Minimizing risk by investing in proven data integration software that hides the complexity of emerging technologiesPowerCenter Big Data Edition Key Features include:Universal Data AccessYour IT team can access all types of big transaction data, including RDBMS, OLTP, OLAP, ERP, CRM, mainframe, cloud, and others. You can also access all types of big interaction data, including social media data, log files, machine sensor data, Web sites, blogs, documents, emails, and other unstructured or multi-structured data. High-Speed Data Ingestion and ExtractionYou can access, load, replicate, transform, and extract big data between source and target systems or directly into Hadoop or your data warehouse. High performance connectivity through native APIs to source and target systems with parallel processing ensures high-speed data ingestion and extraction.No-Code ProductivityRemoving hand-coding within Hadoop through the visual Informatica development environment. Develop and scale data flows with no specialized hand-coding in order to maximize reuse. 
Users can build once and deploy anywhere.

Unlimited Scalability. Your IT organization can process all types of data at any scale—from terabytes to petabytes—with no specialized coding on distributed computing platforms such as Hadoop.

Optimized Performance for Lowest Cost. Based on data volumes, data types, latency requirements, and available hardware, PowerCenter Big Data Edition deploys big data processing on the highest-performance and most cost-effective data processing platforms. You get the most out of your current investments and capacity whether you deploy data processing on SMP machines, traditional grid clusters, distributed computing platforms like Hadoop, or data warehouse appliances.

ETL on Hadoop. This edition provides an extensive library of prebuilt transformation capabilities on Hadoop, including data type conversions and string manipulations, high-performance cache-enabled lookups, joiners, sorters, routers, aggregations, and many more. Your IT team can rapidly develop data flows on Hadoop using a codeless graphical development environment that increases productivity and promotes reuse.

Profiling on Hadoop. Data on Hadoop can be profiled through the Informatica developer tool and a browser-based analyst tool. This makes it easy for developers, analysts, and data scientists to understand the data, identify data quality issues earlier, collaborate on data flow specifications, and validate mapping transformations and rules logic.

Design Once and Deploy Anywhere. ETL developers can focus on the data and transformation logic without having to worry about where the ETL process is deployed—on Hadoop or traditional data processing platforms. Developers can design once, without any specialized knowledge of Hadoop concepts and languages, and easily deploy data flows on Hadoop or traditional systems.
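To make the transformation library concrete, here is a minimal sketch in plain Python of the cache-enabled lookup and aggregation pattern described above. The function names and sample records are hypothetical; in a real PowerCenter data flow these steps are built graphically and pushed down to Hadoop rather than hand-coded.

```python
# Sketch of a lookup-then-aggregate data flow (hypothetical names and data).
from collections import defaultdict

def lookup_join(transactions, customers):
    """Cache-enabled lookup: enrich each transaction with the customer's region."""
    cache = {c["id"]: c["region"] for c in customers}  # build the lookup cache once
    return [{**t, "region": cache.get(t["cust_id"], "UNKNOWN")} for t in transactions]

def aggregate_by(rows, key, value):
    """Aggregator: sum the `value` field grouped by the `key` field."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)

customers = [{"id": 1, "region": "NA"}, {"id": 2, "region": "EU"}]
transactions = [{"cust_id": 1, "amount": 10.0}, {"cust_id": 2, "amount": 5.0},
                {"cust_id": 1, "amount": 2.5}]
enriched = lookup_join(transactions, customers)
print(aggregate_by(enriched, "region", "amount"))  # {'NA': 12.5, 'EU': 5.0}
```

The point of the sketch is the shape of the flow, not the code: building the lookup cache once and streaming rows through it is what a graphical data flow generates for you.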
Complex Data Parsing on Hadoop. This edition makes it easy to access and parse complex, multi-structured, unstructured, and industry-standard data such as Web logs, JSON, XML, and machine device data. Prebuilt parsers for market data and industry standards like FIX, SWIFT, ACORD, HL7, HIPAA, and EDI are also available, licensed separately.

Entity Extraction and Data Classification on Hadoop. Using a list of keywords or phrases, entities related to your customers and products can be easily extracted and classified from unstructured data such as emails, social media data, and documents. You can enrich master data with insights into customer behavior or product information such as competitive pricing.

Mixed Workflows. Your IT team can easily coordinate, schedule, monitor, and manage all interrelated processes and workflows across your traditional and Hadoop environments to simplify operations and meet your SLAs. You can also drill down into individual Hadoop jobs.

High Availability. This edition provides 24x7 high availability with seamless failover, flexible recovery, and connection resilience. When it comes time to develop new products and services using big data insights, you can rest assured that they will scale and be available 24x7 for mission-critical operations.
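The keyword-driven entity extraction and classification described above can be illustrated with a toy Python sketch. The category names and keyword lists are invented for the example; a production implementation would run this logic at scale on Hadoop.

```python
# Toy keyword-based classifier for unstructured text (hypothetical categories).
KEYWORDS = {
    "churn_risk": ["cancel", "switch provider", "unsubscribe"],
    "pricing": ["discount", "price match", "cheaper"],
}

def classify(text):
    """Return the sorted category labels whose keywords appear in the text."""
    text_l = text.lower()
    return sorted({label
                   for label, phrases in KEYWORDS.items()
                   for p in phrases if p in text_l})

print(classify("I may cancel my plan; a competitor offered a discount."))
# ['churn_risk', 'pricing']
```

Simple substring matching like this is only a starting point; real entity extraction also handles phrase boundaries, misspellings, and context.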
PowerCenter Big Data Edition runs ETL, parsing, data quality, profiling, and NLP natively; with tools such as Talend and Pentaho you must hand-code MapReduce jobs. PowerCenter Big Data Edition reduces big data costs: your IT team can manage twice the data volume with your existing analytics environment, and you can offload data from your warehouse and source systems and offload processing to low-cost commodity hardware.

High-Speed Data Ingestion and Extraction. Load, process, and extract big data across heterogeneous environments to optimize the end-to-end flow of data between Hadoop and traditional data management infrastructure.

Near-Universal Data Access and Comprehensive ETL on Hadoop. Reliably access a variety of data types and sources using a rich library of prebuilt ETL transforms, for both transaction and interaction data, that run on Hadoop or traditional grid infrastructure.
By moving away from hand-coding to proven data integration productivity tools, you triple your productivity—you no longer need an army of developers. This edition provides unified administration for all data integration projects. You can build once and deploy anywhere, which keeps costs down by optimizing data processing utilization across both existing data platforms and emerging technologies like Hadoop.

No-Code Development Environment. Removes hand-coding within Hadoop through a visual development environment. Develop and scale data flows with no specialized hand-coding in order to maximize reuse.

Virtual Data Machine. Build transformation logic once, and deploy at any scale on Hadoop or traditional ETL grid infrastructure.

At a TDWI Big Data Summit last summer, eHarmony presented their Informatica Hadoop implementation. An audience member asked, "How many new resources did you need to hire to implement this on Hadoop?" The Director of IT at eHarmony answered, "None."
Informatica® PowerCenter® Big Data Edition is the safe on-ramp to big data that works with both emerging technologies and traditional data management infrastructure. With this edition, your IT organization can rapidly create innovative products and services by integrating and analyzing new types and sources of data. It provides a proven path of innovation while lowering big data management costs and minimizing risk.

With big data you don't always know what you are looking for. Instead of being given requirements from the business for a report, you are tasked with a business goal, such as increasing customer acquisition and retention or improving fraud detection. With this goal in mind and a wealth of big transaction data, big interaction data, and big data processing technologies, how can you achieve the goal cost-effectively?

Consider an online retailer with several big data projects at various stages of implementation to increase customer acquisition and retention, increase profitability, and improve fraud detection. Since we don't necessarily have a well-defined set of requirements, we need to create a sandbox environment where data science teams can play and experiment with big data. A team of data scientists, analysts, developers, architects, and LOB stakeholders collaborates within the sandbox to discover insights that will achieve the goals of each project. This requires us to access and ingest, in this case, customer transaction information from the ERP and CRM systems, web logs from our online store, social data from Twitter, Facebook, and blogs, and geo-location data from mobile devices.

The data science team goes through an iterative process of accessing, preparing, profiling, and analyzing data sets to discover patterns and insights that could help achieve the business goals of the project. However, what many people fail to acknowledge is that 70-80% of this work is accessing and preparing the data sets for analysis.
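The profiling step in the iterative loop above can be illustrated with a toy column profile. A real profiling tool computes much more (value patterns, domains, rule conformance), and the field names here are invented for the example.

```python
# Toy data profile for one column: row count, nulls, distinct values, top value.
from collections import Counter

def profile_column(rows, column):
    """Summarize a single column of a list-of-dicts dataset."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),      # empty strings count as nulls here
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }

rows = [{"state": "TX"}, {"state": "TX"}, {"state": ""}, {"state": "MO"}]
print(profile_column(rows, "state"))
# {'rows': 4, 'nulls': 1, 'distinct': 2, 'most_common': ('TX', 2)}
```

Even a summary this small surfaces the kinds of data quality issues (unexpected nulls, stray values) that the text says consume 70-80% of a data science project.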
This includes parsing, transforming, and integrating a variety of disparate data sets coming from different platforms, in different formats, and at different latencies. DJ Patil, Data Scientist in Residence at Greylock (a VC firm), stated in his book: "Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation isn't something that gets in the way of solving the problem: it is the problem."

The data science team may discover a few insights, which they then need to test, validate, and measure for business impact. They might apply techniques such as A/B testing to determine which algorithms and data flows produce the best results for a stated goal, such as increasing customer share of wallet with next-best-offer recommendations, increasing profitability through pricing optimization, or identifying trends and reducing false positives in fraud detection.

Once organizations overcome the hurdles of accessing, preparing, and integrating data sets to discover these insights, they then face the challenge of operationalizing them.
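The A/B testing mentioned above ultimately comes down to comparing two conversion rates. A minimal two-proportion z-test (standard statistics, not a feature of any particular product; the sample numbers are invented) looks like this:

```python
# Two-proportion z-test: is variant B's conversion rate significantly different
# from variant A's? (Invented sample sizes and conversion counts.)
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return the z-score for the difference in conversion rates B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # standard error of the difference
    return (p_b - p_a) / se

z = two_proportion_z(200, 10_000, 260, 10_000)  # A: 2.0% converts, B: 2.6%
print(abs(z) > 1.96)  # True -> significant at roughly the 95% level
```

In practice teams run many such comparisons across algorithms and data flows, which is why the text stresses being able to rebuild and redeploy flows quickly.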
This is where organizations seem to struggle in turning insights into real business value. To turn an insight into business value, it must be delivered reliably to the point of use, whether that is a report, an enterprise or web app, or part of an automated workflow. For example:
• The fraud department needs to be notified in real time if fraud is suspected, or if there is a spike in a particular region that is seeing an upward trend in fraud.
• Customers shopping on an eCommerce website need to see next best offers in real time as they click through the site.
• The customer service rep needs to know immediately whether a customer is likely to churn when that customer calls or files an online complaint.
• Pricing optimization needs to be delivered directly to the sales rep via a CRM mobile app, based on customer location, purchase history, demographics, and so on.

Too many organizations end up rebuilding and hand-coding the data flows created during design and analysis when it comes time to deploy to production. Informatica is a metadata-driven development environment that provides near-universal data access and hundreds of prebuilt parsers, transformations, and data quality rules; the data flows created during design and analysis can be easily extended and deployed for production use. Another benefit is that datasets and data flows can easily be shared across projects. This helps an organization be agile and rapidly innovate with big data. Note, however, that the datasets used in design and analysis may not have been optimized for a production data flow.
PowerCenter Big Data Edition provides highly scalable, high-throughput data integration to handle all volumes of data processing with high performance. By separating design from deployment, as we have seen, it enables organizations to maintain a consistent and repeatable process; reuse data flows to maximize productivity; scale performance on high-performance distributed computing platforms; ensure 24x7 operations with high availability; stay flexible as data volumes continue to grow and new data sources are added; deliver and analyze data at various latencies; and support and maintain solutions more easily than hand-coding, all while controlling costs and optimizing processing cost-effectively.
Subset production data and mask sensitive data in non-production systems
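The masking idea in the line above can be sketched in a few lines of Python. This is a toy illustration, not how a test data management product implements masking; the salt value and field formats are placeholders, and real tools offer policy-driven, format-preserving masking.

```python
# Toy data-masking helpers for non-production environments (placeholder rules).
import hashlib

def mask_ssn(ssn):
    """Keep only the last four digits of an SSN-formatted string."""
    return "***-**-" + ssn[-4:]

def pseudonymize(value, salt="demo-salt"):
    """Deterministic pseudonym: equal inputs map to equal tokens,
    so joins across masked tables still line up."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:10]

print(mask_ssn("123-45-6789"))  # ***-**-6789
```

Determinism is the key design choice here: masking must hide the value while preserving referential integrity across the subsetted tables.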
But once you discover these treasures of data, can you trust their origins and the transformations that have been applied? On this island we have found a treasure of valuable data mixed with the hazardous waste of bad data. How can we extract the gold and get rid of the waste? Do you know where the treasure of data came from? Is it authentic? In order to trust the data, you need to know where it came from and what was done to it. It is also unfortunate that data management teams end up recreating, over and over, datasets that have already been normalized, cleansed, and curated, using up a lot of storage and resources.

I'd like to recommend that you commit to data governance to improve your business processes, decisions, and interactions. We talk about managing data as an asset, but what does that really mean? You need a process to effectively govern your data so you can deliver trusted and reliable data. First you need to determine the cost of bad data; for example, the cost of having bad customer addresses or duplicate parts could run to millions. The process starts iteratively with the discovery and definition of data, so that you know what you have in terms of data definitions, domains, relationships, business rules, and so on. For example, one company had people showing up to meetings with different numbers related to claims payments. The problem wasn't in the data; the problem was that the people were from three different departments with three different definitions of the data. The business and IT therefore require a process and tools to efficiently collaborate and to automate steps that continuously improve the quality of data over time. Managing data governance effectively and supporting continuous improvement requires KPI dashboards, proactive monitoring, and a clear ROI.
Enrich master data with customer behavior insights, relationships, and key influencers so you can proactively engage with customers and increase upsell/cross-sell opportunities
• Identify unused and unnecessary data to drive data retention policies
• Assess data usage and performance metrics to focus optimization
• Archived 13 TB of data in the first 2 months and continue to retire data monthly
• Phase 2: Offload data and processing to Hadoop
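The retention-policy step above amounts to flagging data that has not been touched within the retention window. A toy sketch (table names, access dates, and the one-year threshold are all invented for illustration):

```python
# Flag tables whose last access falls outside the retention window.
from datetime import date, timedelta

def retention_candidates(tables, today, max_idle_days=365):
    """Return names of tables not accessed within the retention window."""
    cutoff = today - timedelta(days=max_idle_days)
    return [t["name"] for t in tables if t["last_accessed"] < cutoff]

tables = [{"name": "orders_2009", "last_accessed": date(2011, 1, 5)},
          {"name": "orders_2013", "last_accessed": date(2013, 3, 1)}]
print(retention_candidates(tables, date(2013, 4, 1)))  # ['orders_2009']
```

In practice the usage metrics would come from query logs or the database catalog rather than a hand-built list, but the policy check itself is this simple.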
Why Informatica:
• Ease of use for developers and administrators
• Easy to scale performance, at comparatively lower cost, using PowerCenter grid with commodity hardware
• Could standardize on one data integration platform for all data movement use cases and requirements
• Comprehensive data integration platform for big data processing, batch and real-time data movement, metadata management, profiling, test data management, and protecting sensitive data

The Challenge: The company is growing fast, with data volumes and processing loads increasing and ever-growing demand for data-driven decision making, analytics, and reporting. It was unable to scale legacy systems due to cost as well as time factors (not being plug and play), and needed to standardize on a single platform and vendor to meet its various data-related needs: ETL, metadata, masking, subsetting, and real time.

The Result: Ability to scale easily by adding incremental nodes in comparatively short time periods, and reduced hardware cost due to a commodity hardware stack.

Phase 1: PowerCenter Grid and HA implementation
• Several site-facing OLTP Oracle DBs, several Oracle data marts, and a petabyte-scale Teradata EDW
• Transactional data, behavioral data, and web logs
• Process a few terabytes of incremental data every day through PowerCenter Grid
• Implemented a single-domain, dual-data-center PowerCenter Grid (primary vs. DR); currently active/passive, it will eventually become active/active and expand with further node additions
• Commodity Linux machines with 64 GB of memory and a shared NFS file system mounted across all nodes within a data center
• Multiple Integration Services assigned to the grid, with the repository DB running on a dedicated DB

Grid requirements:
• Highly available data movement/data integration environment
• Ability to scale horizontally without having to extensively re-architect application design
• Ability to automatically load balance
• Ability to recover automatically in case of system errors

Phase 2: Grow the PowerCenter grid to increase processing capacity to meet growing data volumes and reduced processing times.

Current benefits: ability to scale easily by adding incremental nodes in comparatively short time periods, and reduced hardware cost due to the commodity hardware stack.

Future benefits:
• Expect to reduce the time to perform impact/lineage analysis once the metadata solution is implemented
• Expect to reuse profiling information once the profiling solution is implemented
• Expect to perform more comprehensive testing much faster once masking/subsetting is implemented
• Expect to reduce batch loads from 30 minutes to a few seconds for fraud detection once Ultra Messaging is implemented
• Participating in the PowerCenter on Hadoop beta testing program

Today the company uses Perl scripts to process web logs and move results into Teradata. It is currently looking at utilizing Hadoop for various text and log data mining and analysis capabilities, such as risk monitoring, behavior tracking, and various marketing-related activities. The company believes that using Hadoop for low-cost big data analysis and processing alongside the Informatica grid's capability to deliver mission-critical data to its data marts would be complementary, while allowing it to maintain metadata and other operational capabilities within a single integrated platform.
• Continuously collect all data from all cars
• By the end of the year, all cars will transmit data to a central Teradata data warehouse
• Real-time data integration using PowerCenter, CDC, and CEP
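Change data capture (CDC), mentioned above, streams only the rows that changed instead of reloading everything. A naive snapshot-diff illustrates the event types a CDC feed emits; real CDC products read the database transaction log rather than diffing snapshots, and the car-telemetry rows here are invented.

```python
# Naive snapshot-diff CDC: emit insert/update/delete events keyed by row id.
def capture_changes(previous, current):
    """Compare two snapshots (dicts of key -> row) and emit change events."""
    events = []
    for key, row in current.items():
        if key not in previous:
            events.append(("insert", key, row))
        elif previous[key] != row:
            events.append(("update", key, row))
    for key, row in previous.items():
        if key not in current:
            events.append(("delete", key, row))
    return events

before = {1: {"speed": 60}, 2: {"speed": 80}}
after = {1: {"speed": 65}, 3: {"speed": 50}}
print(capture_changes(before, after))
# [('update', 1, {'speed': 65}), ('insert', 3, {'speed': 50}), ('delete', 2, {'speed': 80})]
```

Log-based CDC avoids the cost of building snapshots at all, which is what makes the real-time loading described above feasible at fleet scale.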