Data Profiling:
The First Step to Big Data Quality
Harald Smith, Dir. Product Marketing
Housekeeping
Webcast Audio
• Today’s webcast audio is streamed through your computer speakers.
• If you need technical assistance with the web interface or audio,
please reach out to us using the chat window.
Questions Welcome
• Submit your questions at any time during the presentation
using the chat window.
• Our team will reach out to you to answer them following the
presentation.
Recording and slides
• This webcast is being recorded. You will receive an
email following the webcast with a link to download
both the recording and the slides.
Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus on
data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blog author: “Data Democratized”
Only 35% of senior executives have a
high level of trust in the
accuracy of their Big Data
Analytics
KPMG 2016 Global CEO Outlook
92% of
executives are concerned
about the negative impact of
data and analytics on
corporate reputation
KPMG 2017 Global CEO Outlook
80% of AI/ML projects are stalling
due to poor data quality
Dimensional Research, 2019
Big Data Needs
Data Quality
“Societal trust in business is
arguably at an all-time low
and, in a world increasingly
driven by data and
technology,
reputations and brands are
ever harder to protect.”
EY “Trust in Data and Why it Matters”, 2017.
The importance of data
quality in the enterprise:
• Decision making
• Customer centricity
• Compliance
• Machine learning & AI
“The magic of machine learning is that you build a
statistical model based on the most valid dataset for
the domain of interest.
If the data is junk, then you’ll be building a junk
model that will not be able to do its job.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Data Quality Challenges with Machine Learning
Incorrect, Incomplete, Mis-Formatted, and Sparse “Dirty Data” –
Mistakes and errors are almost never the patterns you’re looking for in
a data set. Sparse data generates other issues. Correcting and
standardizing will tend to boost the signal, but must account for bias.
Missing context – Many data sources lack context around location or
population segments. Unless enriched with other data sets, (e.g.
geospatial, demographics, or firmographics data), some ML algorithms
will not be usable.
Multiple copies – If your data comes from many sources, as it often
does, it may contain multiple records of information about the same
person, company, product or other entity. Removing duplicates and
enhancing the overall depth and accuracy of knowledge about a single
entity can make a huge difference.
Spurious correlations – Just as missing context may hinder some ML
algorithms, inclusion of already correlated data (e.g. city and postal
code) may result in overfitting of ML algorithms.
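One way to spot already correlated fields before modeling is to measure how strongly one field determines another. The sketch below is a minimal illustration, assuming records are plain dictionaries; the function name `dependency_strength` and the sample rows are hypothetical.

```python
from collections import defaultdict

def dependency_strength(rows, determinant, dependent):
    """Share of rows whose dependent value is the most common one
    seen for their determinant value (1.0 means fully determined)."""
    groups = defaultdict(lambda: defaultdict(int))
    for row in rows:
        groups[row[determinant]][row[dependent]] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    agreeing = sum(max(counts.values()) for counts in groups.values())
    return agreeing / total if total else 0.0

# Hypothetical sample: postal code almost always implies the city
rows = [
    {"postal_code": "02134", "city": "Boston"},
    {"postal_code": "02134", "city": "Boston"},
    {"postal_code": "02134", "city": "Allston"},
    {"postal_code": "90210", "city": "Beverly Hills"},
]
strength = dependency_strength(rows, "postal_code", "city")
```

A value close to 1.0 suggests the two fields carry largely the same information, so including both as model features may add little.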
Correcting data problems vastly increases a data set’s usefulness for machine learning.
But data analysts may not be aware of
specific data quality issues that must be
addressed to support machine learning.
Traditional data quality processes are
an effective method to identify defects.
Understanding Big Data Quality
Data Profiling
The set of analytical techniques that
evaluate actual data content (vs.
metadata) to provide a complete view
of each data element in a data source.
Provides summarized inferences, and
details of value and pattern frequencies
to quickly gain data insights.
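As a rough illustration of value and pattern frequencies, the sketch below maps each character to a class (A for letters, 9 for digits) and counts the results. The function names and sample values are hypothetical and not taken from any specific profiling tool.

```python
from collections import Counter

def to_pattern(value):
    """Map letters to 'A' and digits to '9'; keep other characters as-is."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

def profile_column(values):
    """Return counts, value frequencies, and pattern frequencies for one data element."""
    present = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "value_freq": Counter(present),
        "pattern_freq": Counter(to_pattern(v) for v in present),
    }

# Hypothetical sample of a postal code field (note the letter O in "9021O")
sample = ["02134", "90210", "9021O", None, "02134", "K1A 0B1"]
profile = profile_column(sample)
```

Rare patterns in `profile["pattern_freq"]` are usually the first place to look for mis-keyed or mis-formatted values.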
Business Rules
The data quality or validation rules that
help ensure that data is “fit for use” in
its intended operational and decision-
making contexts.
Covers the accuracy, completeness,
consistency, relevance, timeliness and
validity of data.
Five Key Steps to effective Data Profiling
These are not new, but good to reiterate in the
context of Big Data:
1. How do you want to analyze the data?
2. What should you review? (there's a lot of stuff)
3. What should you look for? (based on data “type”)
4. When should you build rules? (laser-focus; CDEs)
5. What needs to be communicated?
1. How do you want to analyze the data?
Universal DQ best practices:
Understand the End Goal
• How does the business intend to
use the data (i.e. what’s the use
case)?
• Empower users (“Who”) to gain
new clarity into the core problem
(“Why”)
• What will the data be used for?
• What defines the Fitness for your
Purpose?
Establish Scope
• Ask the “right questions” about the
use case and the data (not just
“what” and “how”)
• What data is relevant to the effort?
• Big Data or other, you need to set
boundaries for the work
Understand Context
• How does the business define the
data?
• What are the important
characteristics and context of the
data?
• What are the Critical Data
Elements?
• What qualities will you need to
address, or leave alone?
• The definition of “high-quality data” will
vary by business problem
“If you don’t know what you want to
get out of the data, how can you
know what data you need – and
what insight you’re looking for?”
Wolf Ruzicka, Chairman of the Board at EastBanc Technologies,
Blog post: June 1, 2017, “Grow A Data Tree Out Of The “Big Data”
Swamp”
“Never lead with a data set;
lead with a question.”
Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet
Forbes Insights, May 31, 2017, “The Data Differentiator”
To Sample or not to Sample?
Sampling helps with:
• Data Integration
• Source-to-target mapping
• Data Modeling
• Discovering Correlations
When the focus is on the structure of the data
❖ REMEMBER: your target is a statistically
valid sample!
❖ ~16k records give you 99% confidence
with a margin of error of 1% for 100B
records
❖ ~66k records give you 99% confidence
with a margin of error of 0.5% for the same volume
Full Volume needed with:
• Data Quality
• Data Governance
• Regulatory Compliance
• Finding Outliers and Issues
with Content
• “Needles in the haystack”
When the focus is on the quality of or risks
within the data
❖ Focus on critical data elements and
leverage tools that scale to data volume
Big Data at scale distributes data across many
nodes – not necessarily with other relevant data!
• Processing routines must apply same approach and logic each
time
• Implications for profiling, joining, sorting, and matching data,
whether for enrichment, verification against trusted sources, or a
consolidated single view
Data Quality functions must be performed in a consistent manner,
no matter where actual processing takes place, how the data is
segmented, and what the data volume is.
• Data quality cleansing and preparation routines have to be
reproduced at scale, both to get the data ready to train machine
learning models, and to comply with business regulations.
• Critical to establishing, building, and maintaining trust
Scaling Data Quality best practices:
Consistent processing at scale
Source: HP Analyst Briefing
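One way to keep profiling consistent regardless of how data is segmented is to compute partial profiles that can be merged in any order. The sketch below illustrates the idea with in-memory lists standing in for node-local partitions; the function names and data are hypothetical, and a real deployment would run the same logic inside a distributed framework.

```python
from collections import Counter
from functools import reduce

def profile_partition(values):
    """Profile one partition; the result can be merged with other partitions."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "freq": Counter(present),
    }

def merge_profiles(a, b):
    """Combine two partial profiles; the merge order does not change the result."""
    return {
        "rows": a["rows"] + b["rows"],
        "nulls": a["nulls"] + b["nulls"],
        "freq": a["freq"] + b["freq"],
    }

# Hypothetical slices of one column as they might sit on three nodes
partitions = [["A", "B", None], ["A", "A"], [None, "C"]]
merged = reduce(merge_profiles, map(profile_partition, partitions))

# Profiling the full column in one pass gives the same answer
whole = profile_partition([v for part in partitions for v in part])
```

Counts and frequencies merge cleanly this way. Measures such as exact medians do not, and need a different approach at scale.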
2. What do you want to review?
Common Data Quality Measurements
What measures can we take advantage of?
1. Completeness – Are the relevant fields populated?
2. Integrity – Does the data maintain an internal structural
integrity or a relational integrity across sources?
3. Uniqueness – Are keys or records unique?
4. Validity – Does the data have the correct values?
• Code and reference values
• Valid ranges
• Valid value combinations
5. Consistency – Is the data at consistent levels of
aggregation or does it have consistent valid values
over time?
6. Timeliness – Did the data arrive in a time period
that makes it useful or usable?
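Several of these measures reduce to simple ratios. The sketch below computes a few of them over a handful of hypothetical order records; the field names, the reference values in `VALID_STATUS`, and the seven-day arrival window are all assumptions made for illustration.

```python
from datetime import date

records = [  # hypothetical order records
    {"order_id": "1001", "status": "OPEN", "qty": 2, "received": date(2019, 5, 1)},
    {"order_id": "1002", "status": "SHIPPED", "qty": 0, "received": date(2019, 5, 2)},
    {"order_id": "1002", "status": "open", "qty": 5, "received": date(2019, 5, 9)},
    {"order_id": None, "status": "CLOSED", "qty": -1, "received": date(2019, 5, 2)},
]

VALID_STATUS = {"OPEN", "SHIPPED", "CLOSED"}  # assumed reference values
EXPECTED_BY = date(2019, 5, 1)                # assumed start of the arrival window

def ratio(flags):
    """Share of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0

ids = [r["order_id"] for r in records if r["order_id"] is not None]
metrics = {
    "completeness": ratio(r["order_id"] is not None for r in records),
    "uniqueness": len(set(ids)) / len(ids),
    "validity_code": ratio(r["status"] in VALID_STATUS for r in records),
    "validity_range": ratio(r["qty"] >= 0 for r in records),
    "timeliness": ratio((r["received"] - EXPECTED_BY).days <= 7 for r in records),
}
```

Each ratio only becomes a quality score once the business has agreed on a threshold for the use case.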
New data, new data quality challenges
• 3rd Party and external data with unknown provenance or relevance
• Bias in the data – whether in collection, extraction, or other processing
• Data without standardized structure or formatting
• Continuously streaming data
• Disjointed data (e.g. gaps in receipt)
• Consistency and verification of data sources
• Changes and transformation applied to data (i.e. does it really
represent the original input)
New Data Quality Problems
“34 percent of bankers in our survey report that their organization
has been the target of adversarial AI at least once, and 78 percent
believe automated systems create new risks, such as fake data,
external data manipulation, and inherent bias.”
Accenture Banking Technology Vision 2018
• Contextual visualizations
• Value and pattern distributions
• Attribute summaries and metadata
• Sort and filter to quickly find data
of interest
• Detail drilldowns to any content
Let Data Profiling guide you
3. What should you look for?
Common Data Types
What variances do you need awareness of?
1. Identifiers – data that uniquely identifies something
2. Indicators – data that flags a specific condition
3. Dates – data that identifies a point in time
4. Quantities – data that identifies an amount or value of something
5. Codes – data that segments other data
6. Text – data that describes or names something
Identifiers
Use cases:
• Business Operations
• 360 View of Entity
• BI Reporting (incl. EDW)
• Analytics
• AI/ML
Examples:
• Customer ID
• National ID / Passport #
• Social Security # / Tax ID
• Product ID
What to look for:
• 100% Complete
• All Unique values
• Anomalous patterns
• Numeric vs. String
Notes:
• Needs full volume assessment
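The identifier checks listed above can be sketched as a single pass over the column. This is an illustrative example only; `check_identifier` and the sample customer IDs are hypothetical.

```python
from collections import Counter

def check_identifier(values):
    """Check an identifier column for completeness, uniqueness,
    non-numeric values, and mixed numeric/string storage."""
    present = [v for v in values if v not in (None, "")]
    counts = Counter(str(v) for v in present)
    return {
        "complete": len(present) == len(values),
        "duplicates": sorted(k for k, n in counts.items() if n > 1),
        "non_numeric": sorted(k for k in counts if not k.isdigit()),
        "mixed_types": len({type(v).__name__ for v in present}) > 1,
    }

# Hypothetical customer IDs: one duplicate, one integer, one typo, one missing
customer_ids = ["10001", "10002", 10003, "10002", "1000X", None]
report = check_identifier(customer_ids)
```

Because a single duplicate or missing key matters, this kind of check is run on the full volume rather than on a sample.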
Indicators (aka Flags)
Use cases:
• Business Operations
• 360 View of Entity
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• True / False (or T/F)
• Yes / No (or Y/N)
• 1 / 0
What to look for:
• Binary Values only
• Consistent pattern
• No mixing of “Y” vs “YES”
• If NULL occurs, it must be
one of the binary values
• Skews in frequency
distributions
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify discrepancies
• Often are triggers for other conditions –
look for use in business rules, but likely
occur downstream
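A frequency distribution is usually enough to surface indicator problems. The following sketch assumes the agreed binary values are "Y" and "N"; the function name and sample flags are hypothetical.

```python
from collections import Counter

def check_indicator(values, allowed=("Y", "N")):
    """Report values outside the allowed pair and the share of the most frequent value."""
    freq = Counter(values)
    unexpected = {v: n for v, n in freq.items() if v not in allowed}
    total = sum(freq.values())
    top_share = max(freq.values()) / total if total else 0.0
    return {"freq": freq, "unexpected": unexpected, "top_share": top_share}

# Hypothetical flag column mixing "Y", "YES", "y", and a NULL
flags = ["Y", "N", "Y", "YES", "Y", None, "y", "Y"]
result = check_indicator(flags)
```

Here `unexpected` lists the values to standardize, and a `top_share` close to 1.0 would point to a skewed or defaulted flag.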
Codes
Use cases:
• Business Operations
• 360 View of Entity
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Account Status
• Credit Rating
• Diagnosis/Procedure Codes
• Order Status
• Postal Code
What to look for:
• Expected values
• Consistent patterns
• No mixing of “A” vs “active”
• NULL values
• Skews in frequency
distributions
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify discrepancies
• Often are triggers for or from other
conditions – look for use in business rules
• May correlate to other fields
Dates
Use cases:
• Business Operations
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Birth Date
• Departure Date
• Order Date
• Shipping Date
• Timestamp
What to look for:
• Skews in frequency
distributions
• E.g. 01/01/2001
• Anomalous patterns
• Numeric vs. String
• Unusual values
• Missing values and gaps
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify
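Default dates, unparseable values, and gaps can all be found from a parsed frequency distribution. This is a minimal sketch assuming dates arrive as MM/DD/YYYY strings; the function name, the 20% spike threshold, and the sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def profile_dates(values, fmt="%m/%d/%Y", spike_share=0.2):
    """Find unparseable values, suspiciously frequent dates, and gaps between dates."""
    parsed, invalid = [], []
    for v in values:
        try:
            parsed.append(datetime.strptime(v, fmt).date())
        except (TypeError, ValueError):  # None or wrong format
            invalid.append(v)
    freq = Counter(parsed)
    spikes = [d for d, n in freq.items() if n > 1 and n / len(parsed) >= spike_share]
    days = sorted(freq)
    gaps = [(a, b) for a, b in zip(days, days[1:]) if b - a > timedelta(days=1)]
    return {"invalid": invalid, "spikes": spikes, "gaps": gaps}

# Hypothetical order dates with a default value, a different format, and a NULL
order_dates = ["01/01/2001", "03/04/2019", "01/01/2001", "03/05/2019",
               "2019-03-06", None, "01/01/2001", "03/08/2019"]
result = profile_dates(order_dates)
```

A spike on a date such as 01/01/2001 usually indicates a system default rather than a real event, and deserves a business rule of its own.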
Quantities
Use cases:
• Business Operations
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Amount (e.g. item count, amount due)
• Price
• Sales
• Total (e.g. order total)
What to look for:
• Skews in frequency
distributions
• Anomalous patterns
• Excessively high (or low)
values
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify
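Excessively high or low quantities can be flagged with a robust outlier test. The sketch below uses the modified z-score based on the median absolute deviation (MAD), a common choice that is not named in the slides; the 3.5 threshold is a conventional default and the sample amounts are hypothetical.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score (based on the MAD) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # more than half the values are identical; the score is undefined
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical item counts with one extreme value and one negative value
amounts = [10, 12, 11, 9, 10, 11, 950, -5]
outliers = mad_outliers(amounts)
```

The median-based score is used here because a single extreme value would distort a mean and standard deviation computed over the same data.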
Text
Use cases:
• Business Operations
• Building blocks for other
identifiers!
• 360 View of Entity
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Name
• Address
• Product Description
• Claim Description
What to look for:
• Missing Values
• Frequency of patterns /
Anomalous patterns
• Existence of numerics
• Values <= 5 characters
• Compound values
• Unusual, recurring values
• “Do not use”
Notes:
• Look for correlations with Code values
that indicate specific conditions (e.g.
values used for testing purposes)
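Several of the text checks above can be expressed as simple tests per value. This sketch is illustrative only: the placeholder list, the compound-value pattern, and the sample names are assumptions.

```python
import re

PLACEHOLDERS = {"do not use", "test", "n/a", "unknown"}  # assumed placeholder list

def text_issues(value):
    """Return a list of issue labels for one free-text value."""
    if value is None or not value.strip():
        return ["missing"]
    text = value.strip()
    issues = []
    if len(text) <= 5:
        issues.append("short")
    if re.search(r"\d", text):
        issues.append("has_digits")
    if text.casefold() in PLACEHOLDERS:
        issues.append("placeholder")
    # Assumed markers of two entities in one field, e.g. "John and Mary Smith"
    if re.search(r"\s(&|and|c/o)\s", text, flags=re.IGNORECASE):
        issues.append("compound")
    return issues

# Hypothetical name field
names = ["John Smith", "", "DO NOT USE", "Al", "John and Mary Smith", "Jane Doe 2"]
findings = {n: text_issues(n) for n in names}
```

Labels like these are a starting point for review; whether a short or numeric value is actually wrong depends on the field and the use case.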
4. When do you build rules?
Focus on:
• Critical Data Elements (data quality dimensions)
• Policy-based conditions (e.g. regulatory
compliance)
• Correlated data conditions (e.g. If x, then y)
• Filtering and segmenting data (refining
evaluations; investigating root cause)
Build Rules for Defined Conditions
• Validate critical requirements within or
across data sources
• Build common rules that can be readily
tested and shared
• Evaluate and remediate issues
• Take action on incorrect data and defaults
• Create flags for subsequent use in marking
or remediating data
• Filter result sets and export for additional
use
Benefits of Business Rules
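A correlated condition ("if x, then y") and a flag for later remediation can be sketched as small, named rule functions. The rule names, record fields, and sample orders below are hypothetical.

```python
def rule_shipped_has_ship_date(rec):
    """If status is SHIPPED, then ship_date must be populated."""
    return rec["status"] != "SHIPPED" or bool(rec.get("ship_date"))

def rule_amount_non_negative(rec):
    """Order amount must not be negative."""
    return rec["amount"] >= 0

# Named rules can be tested once and shared across data sources
RULES = {
    "shipped_has_ship_date": rule_shipped_has_ship_date,
    "amount_non_negative": rule_amount_non_negative,
}

def apply_rules(records, rules):
    """Return a copy of each record with a 'dq_flags' list naming the rules it failed."""
    return [
        dict(rec, dq_flags=[name for name, rule in rules.items() if not rule(rec)])
        for rec in records
    ]

orders = [  # hypothetical records
    {"id": 1, "status": "SHIPPED", "ship_date": "2019-05-02", "amount": 40.0},
    {"id": 2, "status": "SHIPPED", "ship_date": None, "amount": 15.5},
    {"id": 3, "status": "OPEN", "ship_date": None, "amount": -3.0},
]
flagged = apply_rules(orders, RULES)
failed_ids = [rec["id"] for rec in flagged if rec["dq_flags"]]  # filtered set for export
```

Keeping the flags on the record, rather than dropping failing rows, lets downstream users decide how to mark or remediate them.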
5. What should you communicate?
Culture of Data Literacy
• “Democratization of Data” requires cultural support
• Empowered to ask questions about the data
• Trained to understand and use data
• Trained to understand how to approach and evaluate data quality
• Traditional data, new data, machine learning requirements, …
• Understand the business context of the data
Program of Data Governance
• Provide the processes and practices necessary for success
• Measure, monitor, and improve
• Continuous iteration and development
Center of Excellence/Knowledge Base
• Where do you go to find answers?
• Who can help show you how?
Communicate!
• Annotate what you’ve found
• Identify the subject and add a description that is meaningful
• Utilize flags, tags, and other indicators to help others distinguish
types and severity of issues
• Integrate into data governance and BI tools for maximum visibility
Annotate Results with Findings
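An annotation can be as simple as a structured record that other tools can read. The sketch below is one possible shape, not a format used by any particular governance product; the class name, field names, and example finding are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class Finding:
    """One annotated data quality finding."""
    subject: str                 # the data element the finding is about
    description: str             # a meaningful, human-readable explanation
    severity: str = "info"       # e.g. "info", "warning", "high"
    tags: list = field(default_factory=list)

finding = Finding(
    subject="orders.ship_date",
    description="Shipped orders found with no ship date populated.",
    severity="high",
    tags=["completeness", "CDE"],
)

# Serialize so the finding can be passed to governance or BI tools
payload = json.dumps(asdict(finding), sort_keys=True)
```

A consistent structure for subject, severity, and tags is what lets others filter findings by type and urgency.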
Summary
Evaluating Big Data
It is challenging to keep the end
goal in mind
• Data comes from multiple
disparate systems & sources
• The number of touchpoints for
policies and rules has grown
• There is a higher demand and
expectation for seeing data
quality in context.
• You need to assess and measure
the data content if you
5 Key Steps
• Remember the end goal – ask
questions, use best practices,
and establish scope & context
• Consider what criteria and
dimensions are needed
• Focus your attention based on
the type of data and the use case
• Build rules when necessary to
get laser-focused
• Determine what needs to be
communicated and delivered
Gaining insight and measurement of data quality is more critical than ever!
Data Profiling: The First Step to Big Data Quality

Weitere ähnliche Inhalte

Was ist angesagt?

Real-World Data Governance: Data Governance Expectations
Real-World Data Governance: Data Governance ExpectationsReal-World Data Governance: Data Governance Expectations
Real-World Data Governance: Data Governance ExpectationsDATAVERSITY
 
Improving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureImproving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureDATAVERSITY
 
Introduction to Data Governance
Introduction to Data GovernanceIntroduction to Data Governance
Introduction to Data GovernanceJohn Bao Vuu
 
Data Modeling is Data Governance
Data Modeling is Data GovernanceData Modeling is Data Governance
Data Modeling is Data GovernanceDATAVERSITY
 
Data Modeling & Metadata Management
Data Modeling & Metadata ManagementData Modeling & Metadata Management
Data Modeling & Metadata ManagementDATAVERSITY
 
Data Quality & Data Governance
Data Quality & Data GovernanceData Quality & Data Governance
Data Quality & Data GovernanceTuba Yaman Him
 
Data Quality Best Practices
Data Quality Best PracticesData Quality Best Practices
Data Quality Best PracticesDATAVERSITY
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture DesignKujambu Murugesan
 
Data Governance
Data GovernanceData Governance
Data GovernanceBoris Otto
 
Data Governance Best Practices
Data Governance Best PracticesData Governance Best Practices
Data Governance Best PracticesBoris Otto
 
Data Management, Metadata Management, and Data Governance – Working Together
Data Management, Metadata Management, and Data Governance – Working TogetherData Management, Metadata Management, and Data Governance – Working Together
Data Management, Metadata Management, and Data Governance – Working TogetherDATAVERSITY
 
Data Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityData Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityDATAVERSITY
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceDATAVERSITY
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?Precisely
 
Glossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data GovernanceGlossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data GovernanceDATAVERSITY
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDATAVERSITY
 
Data Quality Strategies
Data Quality StrategiesData Quality Strategies
Data Quality StrategiesDATAVERSITY
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogDATAVERSITY
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...HostedbyConfluent
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 

Was ist angesagt? (20)

Real-World Data Governance: Data Governance Expectations
Real-World Data Governance: Data Governance ExpectationsReal-World Data Governance: Data Governance Expectations
Real-World Data Governance: Data Governance Expectations
 
Improving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureImproving Data Literacy Around Data Architecture
Improving Data Literacy Around Data Architecture
 
Introduction to Data Governance
Introduction to Data GovernanceIntroduction to Data Governance
Introduction to Data Governance
 
Data Modeling is Data Governance
Data Modeling is Data GovernanceData Modeling is Data Governance
Data Modeling is Data Governance
 
Data Modeling & Metadata Management
Data Modeling & Metadata ManagementData Modeling & Metadata Management
Data Modeling & Metadata Management
 
Data Quality & Data Governance
Data Quality & Data GovernanceData Quality & Data Governance
Data Quality & Data Governance
 
Data Quality Best Practices
Data Quality Best PracticesData Quality Best Practices
Data Quality Best Practices
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Data Governance
Data GovernanceData Governance
Data Governance
 
Data Governance Best Practices
Data Governance Best PracticesData Governance Best Practices
Data Governance Best Practices
 
Data Management, Metadata Management, and Data Governance – Working Together
Data Management, Metadata Management, and Data Governance – Working TogetherData Management, Metadata Management, and Data Governance – Working Together
Data Management, Metadata Management, and Data Governance – Working Together
 
Data Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityData Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data Quality
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and Governance
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
Glossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data GovernanceGlossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data Governance
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
 
Data Quality Strategies
Data Quality StrategiesData Quality Strategies
Data Quality Strategies
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data Catalog
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
 
Data Quality
Data QualityData Quality
Data Quality
 

Ähnlich wie Data Profiling: The First Step to Big Data Quality

Foundational Strategies for Trust in Big Data Part 2: Understanding Your Data
Foundational Strategies for Trust in Big Data Part 2: Understanding Your DataFoundational Strategies for Trust in Big Data Part 2: Understanding Your Data
Foundational Strategies for Trust in Big Data Part 2: Understanding Your DataPrecisely
 
Transform Your Downstream Cloud Analytics with Data Quality 
Transform Your Downstream Cloud Analytics with Data Quality Transform Your Downstream Cloud Analytics with Data Quality 
Transform Your Downstream Cloud Analytics with Data Quality Precisely
 
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on TrackYour AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on TrackPrecisely
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AIGary Allemann
 
Data-Ed: Trends in Data Modeling
Data-Ed: Trends in Data ModelingData-Ed: Trends in Data Modeling
Data-Ed: Trends in Data ModelingData Blueprint
 
Data-Ed Online: Trends in Data Modeling
Data-Ed Online: Trends in Data ModelingData-Ed Online: Trends in Data Modeling
Data-Ed Online: Trends in Data ModelingDATAVERSITY
 
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav MisraFrom Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav MisraMolly Alexander
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?Precisely
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?Precisely
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overviewjkvr
 
Data Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesData Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesDATAVERSITY
 
Emerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big DataEmerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big DataDATAVERSITY
 
Data Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesData Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesCarl Anderson
 
Emerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big DataEmerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big DataPrecisely
 
From Compliance to Customer 360: Winning with Data Quality & Data Governance
From Compliance to Customer 360: Winning with Data Quality & Data GovernanceFrom Compliance to Customer 360: Winning with Data Quality & Data Governance
From Compliance to Customer 360: Winning with Data Quality & Data GovernancePrecisely
 
Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...
Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...
Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...DATAVERSITY
 
DC Salesforce1 Tour Data Governance Lunch Best Practices deck
DC Salesforce1 Tour Data Governance Lunch Best Practices deckDC Salesforce1 Tour Data Governance Lunch Best Practices deck
DC Salesforce1 Tour Data Governance Lunch Best Practices deckBeth Fitzpatrick
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolutionitnewsafrica
 
Noise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in DataNoise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in DataDATAVERSITY
 

Ähnlich wie Data Profiling: The First Step to Big Data Quality (20)

Foundational Strategies for Trust in Big Data Part 2: Understanding Your Data
Foundational Strategies for Trust in Big Data Part 2: Understanding Your DataFoundational Strategies for Trust in Big Data Part 2: Understanding Your Data
Foundational Strategies for Trust in Big Data Part 2: Understanding Your Data
 
Transform Your Downstream Cloud Analytics with Data Quality 
Transform Your Downstream Cloud Analytics with Data Quality Transform Your Downstream Cloud Analytics with Data Quality 
Transform Your Downstream Cloud Analytics with Data Quality 
 
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on TrackYour AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AI
 
Data-Ed: Trends in Data Modeling
Data-Ed: Trends in Data ModelingData-Ed: Trends in Data Modeling
Data-Ed: Trends in Data Modeling
 
Data-Ed Online: Trends in Data Modeling
Data-Ed Online: Trends in Data ModelingData-Ed Online: Trends in Data Modeling
Data-Ed Online: Trends in Data Modeling
 
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav MisraFrom Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overview
 
Data Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesData Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & Approaches
 
Emerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big DataEmerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big Data
 
Data Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesData Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practices
 
Emerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big DataEmerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big Data
 
From Compliance to Customer 360: Winning with Data Quality & Data Governance
From Compliance to Customer 360: Winning with Data Quality & Data GovernanceFrom Compliance to Customer 360: Winning with Data Quality & Data Governance
From Compliance to Customer 360: Winning with Data Quality & Data Governance
 
Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...
Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...
Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...
 
DC Salesforce1 Tour Data Governance Lunch Best Practices deck
DC Salesforce1 Tour Data Governance Lunch Best Practices deckDC Salesforce1 Tour Data Governance Lunch Best Practices deck
DC Salesforce1 Tour Data Governance Lunch Best Practices deck
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
uae views on big data
  uae views on  big data  uae views on  big data
uae views on big data
 
Noise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in DataNoise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in Data
 

Data Profiling: The First Step to Big Data Quality

  • 1. Data Profiling: The First Step to Big Data Quality Harald Smith, Dir. Product Marketing
  • 2. Housekeeping Webcast Audio • Today’s webcast audio is streamed through your computer speakers. • If you need technical assistance with the web interface or audio, please reach out to us using the chat window. Questions Welcome • Submit your questions at any time during the presentation using the chat window. • Our team will reach out to you to answer them following the presentation. Recording and slides • This webcast is being recorded. You will receive an email following the webcast with a link to download both the recording and the slides.
  • 3. Speaker Harald Smith • Director of Product Marketing, Syncsort • 20+ years in Information Management with a focus on data quality, integration, and governance • Co-author of Patterns of Information Management • Author of two Redbooks on Information Governance and Data Integration • Blog author: “Data Democratized”
  • 4. Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics KPMG 2016 Global CEO Outlook 92% of executives are concerned about the negative impact of data and analytics on corporate reputation KPMG 2017 Global CEO Outlook 80% of AI/ML projects are stalling due to poor data quality Dimensional Research, 2019 Big Data Needs Data Quality “Societal trust in business is arguably at an all-time low and, in a world increasingly driven by data and technology, reputations and brands are ever harder to protect.” EY “Trust in Data and Why it Matters”, 2017. The importance of data quality in the enterprise: • Decision making • Customer centricity • Compliance • Machine learning & AI
  • 5. “The magic of machine learning is that you build a statistical model based on the most valid dataset for the domain of interest. If the data is junk, then you’ll be building a junk model that will not be able to do its job.” James Kobielus, SiliconANGLE Wikibon Lead Analyst for Data Science, Deep Learning, App Development, 2018
  • 6. Data Quality Challenges with Machine Learning Incorrect, Incomplete, Mis-Formatted, and Sparse “Dirty Data” – Mistakes and errors are almost never the patterns you’re looking for in a data set. Sparse data generates other issues. Correcting and standardizing will tend to boost the signal, but must account for bias. Missing context – Many data sources lack context around location or population segments. Unless enriched with other data sets (e.g. geospatial, demographics, or firmographics data), some ML algorithms will not be usable. Multiple copies – If your data comes from many sources, as it often does, it may contain multiple records of information about the same person, company, product or other entity. Removing duplicates and enhancing the overall depth and accuracy of knowledge about a single entity can make a huge difference. Spurious correlations – Just as missing context may hinder some ML algorithms, inclusion of already correlated data (e.g. city and postal code) may result in overfitting of ML algorithms. Correcting data problems vastly increases a data set’s usefulness for machine learning. But data analysts may not be aware of specific data quality issues that must be addressed to support machine learning. Traditional data quality processes are an effective method to identify defects.
  • 7. Understanding Big Data Quality Data Profiling The set of analytical techniques that evaluate actual data content (vs. metadata) to provide a complete view of each data element in a data source. Provides summarized inferences, and details of value and pattern frequencies to quickly gain data insights. Business Rules The data quality or validation rules that help ensure that data is “fit for use” in its intended operational and decision- making contexts. Covers the accuracy, completeness, consistency, relevance, timeliness and validity of data.
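As a rough illustration of the "value and pattern frequencies" that slide 7 mentions, here is a minimal Python sketch of a single-column profile. The function names (`pattern_of`, `profile_column`), the pattern alphabet (`A` for letters, `9` for digits) and the sample postal codes are assumptions made for this example; they are not taken from the webcast or from any particular profiling tool.

```python
from collections import Counter


def pattern_of(value: str) -> str:
    """Reduce a value to its shape: letters become 'A', digits become '9',
    every other character is kept as-is."""
    shape = []
    for ch in value:
        if ch.isalpha():
            shape.append("A")
        elif ch.isdigit():
            shape.append("9")
        else:
            shape.append(ch)
    return "".join(shape)


def profile_column(values):
    """Summarize one column: counts, distinct values, and the most
    frequent values and patterns."""
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    value_freq = Counter(non_null)
    pattern_freq = Counter(pattern_of(str(v)) for v in non_null)
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct": len(value_freq),
        "top_values": value_freq.most_common(3),
        "top_patterns": pattern_freq.most_common(3),
    }


# Hypothetical sample data for illustration only.
postal_codes = ["10001", "10002", "10001", None, "1000A", "10001-1234", ""]
profile = profile_column(postal_codes)
for key, val in profile.items():
    print(key, val)
```

Reading the pattern frequencies next to the value frequencies is what makes an entry such as `1000A` stand out: its shape `9999A` differs from the dominant `99999`.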
  • 8. Five Key Steps to effective Data Profiling These are not new, but good to reiterate in the context of Big Data: 1. How do you want to analyze the data? 2. What should you review? (there's a lot of stuff) 3. What should you look for? (based on data “type”) 4. When should you build rules? (laser-focus; CDEs) 5. What needs to be communicated?
  • 9. 1. How do you want to analyze the data?
  • 10. Universal DQ best practices: Understand the End Goal • How does the business intend to use the data (i.e. what’s the use case)? • Empower users (“Who”) to gain new clarity into the core problem (“Why”) • What will the data be used for? • What defines the Fitness for your Purpose? Establish Scope • Ask the “right questions” about the use case and the data (not just “what” and “how”) • What data is relevant to the effort? • Big Data or other, you need to set boundaries for the work Understand Context • How does the business define the data? • What are the important characteristics and context of the data? • What are the Critical Data Elements? • What qualities will you need to address, or leave alone? • “High-quality data” definition will vary by business problem “If you don’t know what you want to get out of the data, how can you know what data you need – and what insight you’re looking for?” Wolf Ruzicka, Chairman of the Board at EastBanc Technologies, Blog post: June 1, 2017, “Grow A Data Tree Out Of The ‘Big Data’ Swamp”
  • 11. “Never lead with a data set; lead with a question.” Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet, Forbes Insights, May 31, 2017, “The Data Differentiator”
  • 12. To Sample or not to Sample? Sampling helps with: • Data Integration • Source-to-target mapping • Data Modeling • Discovering Correlations When the focus is on the structure of the data ❖ REMEMBER: your target is a statistically valid sample! ❖ ~16k records give you 99% confidence with a margin of error of 1% for 100B records ❖ ~66k records give you 99% confidence with a margin of error of 0.5% for the same 100B records Full Volume needed with: • Data Quality • Data Governance • Regulatory Compliance • Finding Outliers and Issues with Content • “Needles in the haystack” When the focus is on the quality of or risks within the data ❖ Focus on critical data elements and leverage tools that scale to data volume
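The sample sizes quoted on slide 12 are consistent with the standard formula for estimating a proportion, n0 = z² · p(1 − p) / e², followed by a finite population correction. The sketch below assumes that formula and the most conservative proportion p = 0.5; the slide does not say how its figures were derived, so treat this as one plausible reconstruction. The function name `sample_size` is made up for the example.

```python
import math
from statistics import NormalDist
from typing import Optional


def sample_size(confidence: float, margin: float,
                population: Optional[int] = None, p: float = 0.5) -> int:
    """Records needed to estimate a proportion within +/- margin.

    Assumes simple random sampling; p = 0.5 is the worst case.
    """
    # Two-sided z-score, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction; negligible for very large populations.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


n_one_percent = sample_size(0.99, 0.01, population=100_000_000_000)
n_half_percent = sample_size(0.99, 0.005, population=100_000_000_000)
print(n_one_percent, n_half_percent)
```

Both results land in the ranges the slide quotes (roughly 16k and 66k). Note that the required sample barely depends on the population size once the population is large, which is why sampling works well for structural questions while content-level "needles in the haystack" still need the full volume.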
  • 13. Big Data at scale distributes data across many nodes – not necessarily with other relevant data! • Processing routines must apply same approach and logic each time • Implications for profiling, joining, sorting, and matching data, whether for enrichment, verification against trusted sources, or a consolidated single view Data Quality functions must be performed in a consistent manner, no matter where actual processing takes place, how the data is segmented, and what the data volume is. • Data quality cleansing and preparation routines have to be reproduced at scale, both to get the data ready to train machine learning models, and to comply with business regulations. • Critical to establishing, building, and maintaining trust Scaling Data Quality best practices: Consistent processing at scale Source: HP Analyst Briefing
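One way to read the "consistent processing" requirement on slide 13 is that a profile computed per partition and then merged should equal the profile computed over the full data set. The following sketch illustrates that property with a mergeable summary. The `PartialProfile` class, the `normalize` rule (trim and upper-case) and the sample values are all hypothetical; a real distributed engine would handle the partitioning itself.

```python
from collections import Counter
from dataclasses import dataclass, field


def normalize(value: str) -> str:
    """The single standardization rule applied everywhere (assumed for this sketch)."""
    return value.strip().upper()


@dataclass
class PartialProfile:
    rows: int = 0
    nulls: int = 0
    values: Counter = field(default_factory=Counter)

    def add(self, value):
        """Fold one value into the summary using the shared rule."""
        self.rows += 1
        if value is None or not value.strip():
            self.nulls += 1
        else:
            self.values[normalize(value)] += 1

    def merge(self, other: "PartialProfile") -> "PartialProfile":
        """Combine two partition summaries; the order of merging does not matter."""
        return PartialProfile(
            self.rows + other.rows,
            self.nulls + other.nulls,
            self.values + other.values,
        )


def profile(values) -> PartialProfile:
    summary = PartialProfile()
    for value in values:
        summary.add(value)
    return summary


# Hypothetical status column, split across two "nodes".
data = ["active", "Active ", None, "closed", "", "ACTIVE", "closed"]
whole = profile(data)
merged = profile(data[:3]).merge(profile(data[3:]))
print(whole == merged, dict(whole.values))
```

Because every partition applies the same `normalize` function and the summaries combine by simple addition, the result does not depend on where the data was split.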
  • 14. 2. What do you want to review?
  • 15. Common Data Quality Measurements What measures can we take advantage of? 1. Completeness – Are the relevant fields populated? 2. Integrity – Does the data maintain an internal structural integrity or a relational integrity across sources? 3. Uniqueness – Are keys or records unique? 4. Validity – Does the data have the correct values? • Code and reference values • Valid ranges • Valid value combinations 5. Consistency – Is the data at consistent levels of aggregation or does it have consistent valid values over time? 6. Timeliness – Did the data arrive in a time period that makes it useful or usable?
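Three of the measurements on slide 15 (completeness, uniqueness, validity) reduce to simple ratios. The sketch below shows one possible way to compute them over a list of records. The field names, the allowed status codes and the choice of denominators are assumptions for illustration; organizations define these metrics differently.

```python
def populated(records, field_name):
    """Values of one field that are neither None nor the empty string."""
    return [r[field_name] for r in records if r.get(field_name) not in (None, "")]


def completeness(records, field_name):
    """Share of records in which the field is populated (assumes records is non-empty)."""
    return len(populated(records, field_name)) / len(records)


def uniqueness(records, field_name):
    """Share of populated values that are distinct (assumes at least one populated value)."""
    values = populated(records, field_name)
    return len(set(values)) / len(values)


def validity(records, field_name, allowed):
    """Share of populated values found in the reference set."""
    values = populated(records, field_name)
    return sum(v in allowed for v in values) / len(values)


# Hypothetical customer records.
records = [
    {"id": "C1", "status": "A"},
    {"id": "C2", "status": "X"},
    {"id": "C2", "status": None},
    {"id": None, "status": "I"},
]
print(completeness(records, "id"))
print(uniqueness(records, "id"))
print(validity(records, "status", allowed={"A", "I"}))
```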
  • 16. New data, new data quality challenges • 3rd Party and external data with unknown provenance or relevance • Bias in the data – whether in collection, extraction, or other processing • Data without standardized structure or formatting • Continuously streaming data • Disjointed data (e.g. gaps in receipt) • Consistency and verification of data sources • Changes and transformation applied to data (i.e. does it really represent the original input) New Data Quality Problems “34 percent of bankers in our survey report that their organization has been the target of adversarial AI at least once, and 78 percent believe automated systems create new risks, such as fake data, external data manipulation, and inherent bias.” Accenture Banking Technology Vision 2018
  • 17. • Contextual visualizations • Value and pattern distributions • Attribute summaries and metadata • Sort and filter to quickly find data of interest • Detail drilldowns to any content Let Data Profiling guide you
  • 18. 3. What should you look for?
  • 19. Common Data Types What variances do you need awareness of? 1. Identifiers – data that uniquely identifies something 2. Indicators – data that flags a specific condition 3. Dates – data that identifies a point in time 4. Quantities – data that identifies an amount or value of something 5. Codes – data that segments other data 6. Text – data that describes or names something
  • 20. Identifiers Use cases: • Business Operations • 360 View of Entity • BI Reporting (incl. EDW) • Analytics • AI/ML Examples: • Customer ID • National ID / Passport # • Social Security # / Tax ID • Product ID What to look for: • 100% Complete • All Unique values • Anomalous patterns • Numeric vs. String Notes: • Needs full volume assessment
  • 21. Indicators (aka Flags) Use cases: • Business Operations • 360 View of Entity • BI Reporting (incl. EDW) • Governance and Compliance • Analytics • AI/ML Examples: • True / False (or T/F) • Yes / No (or Y/N) • 1 / 0 What to look for: • Binary Values only • Consistent pattern • No mixing of “Y” vs “YES” • If NULL occurs, it must be one of the binary values • Skews in frequency distributions Notes: • May need segmentation, filtering, or grouping via business rules to resolve or clarify discrepancies • Often are triggers for other conditions – look for use in business rules, but likely occur downstream
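The indicator checks on slide 21 (binary values only, no mixing of "Y" and "YES", skewed distributions) can be sketched as a frequency count against an expected value set. The function name `check_indicator`, the default allowed set `{"Y", "N"}` and the sample flags are assumptions for this example.

```python
from collections import Counter


def check_indicator(values, allowed=frozenset({"Y", "N"})):
    """Return values outside the expected binary set, plus the share of
    each allowed value among the conforming entries."""
    freq = Counter(values)
    unexpected = {v: n for v, n in freq.items() if v not in allowed}
    conforming = sum(n for v, n in freq.items() if v in allowed)
    shares = {v: freq[v] / conforming for v in allowed if conforming}
    return unexpected, shares


# Hypothetical opt-in flag column with mixed encodings and a NULL.
flags = ["Y", "N", "Y", "YES", "Y", None, "y", "Y"]
unexpected, shares = check_indicator(flags)
print(unexpected)
print(shares)
```

Here `YES`, `y` and the NULL all surface as unexpected values, and the shares expose the skew toward `Y` that a reviewer may want to question.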
  • 22. Codes Use cases: • Business Operations • 360 View of Entity • BI Reporting (incl. EDW) • Governance and Compliance • Analytics • AI/ML Examples: • Account Status • Credit Rating • Diagnosis/Procedure Codes • Order Status • Postal Code What to look for: • Expected values • Consistent patterns • No mixing of “A” vs “active” • NULL values • Skews in frequency distributions Notes: • May need segmentation, filtering, or grouping via business rules to resolve or clarify discrepancies • Often are triggers for or from other conditions – look for use in business rules • May correlate to other fields
  • 23. Dates Use cases: • Business Operations • BI Reporting (incl. EDW) • Governance and Compliance • Analytics • AI/ML Examples: • Birth Date • Departure Date • Order Date • Shipping Date • Timestamp What to look for: • Skews in frequency distributions • E.g. 01/01/2001 • Anomalous patterns • Numeric vs. String • Unusual values • Missing values and gaps Notes: • May need segmentation, filtering, or grouping via business rules to resolve or clarify
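Two of the date checks on slide 23, skews such as an over-represented 01/01/2001 and gaps in the sequence, are sketched below. The 20% threshold in `date_spikes`, the daily granularity in `missing_days` and the sample order dates are assumptions chosen for the example, not recommendations from the deck.

```python
from collections import Counter
from datetime import date, timedelta


def date_spikes(dates, max_share=0.2):
    """Dates whose share of populated values exceeds max_share
    (often a sign of a default or placeholder date)."""
    freq = Counter(d for d in dates if d is not None)
    total = sum(freq.values())
    return {d: n for d, n in freq.items() if n / total > max_share}


def missing_days(dates):
    """Calendar days between the earliest and latest date with no records."""
    present = {d for d in dates if d is not None}
    if not present:
        return []
    day, last = min(present), max(present)
    gaps = []
    while day < last:
        day += timedelta(days=1)
        if day not in present:
            gaps.append(day)
    return gaps


# Hypothetical order dates: four defaulted values, a few real ones, one NULL.
order_dates = [date(2001, 1, 1)] * 4 + [
    date(2024, 3, 1), date(2024, 3, 2), date(2024, 3, 5), date(2024, 3, 6), None,
]
spikes = date_spikes(order_dates)
# Exclude suspected default dates before looking for gaps.
gaps = missing_days([d for d in order_dates if d not in spikes])
print(spikes)
print(gaps)
```

Filtering out the suspected default date first matters: otherwise every day between 2001 and 2024 would be reported as a gap.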
  • 24. Quantities Use cases: • Business Operations • BI Reporting (incl. EDW) • Governance and Compliance • Analytics • AI/ML Examples: • Amount (e.g. item count, amount due) • Price • Sales • Total (e.g. order total) What to look for: • Skews in frequency distributions • Anomalous patterns • Excessively high (or low) values Notes: • May need segmentation, filtering, or grouping via business rules to resolve or clarify
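For the "excessively high (or low) values" check on slide 24, one common technique is the interquartile-range fence. The slide does not prescribe a method, so this is simply one option; the multiplier `k=1.5` is the conventional default and the order totals are made up.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical order totals with one extreme high and one extreme low value.
order_totals = [20, 22, 19, 25, 21, 23, 18, 24, 9999, -500]
outliers = iqr_outliers(order_totals)
print(outliers)
```

The fence is robust to the outliers themselves, unlike a mean-and-standard-deviation rule, which a single value such as 9999 would distort. Whether a flagged value is an error or a legitimate large order is a question for business rules and subject-matter review.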
  • 25. Text Use cases: • Business Operations • Building blocks for other identifiers! • 360 View of Entity • Governance and Compliance • Analytics • AI/ML Examples: • Name • Address • Product Description • Claim Description What to look for: • Missing Values • Frequency of patterns / Anomalous patterns • Existence of numerics • Values <= 5 characters • Compound values • Unusual, recurring values • “Do not use” Notes: • Look for correlations with Code values that indicate specific conditions (e.g. values used for testing purposes)
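Several of the text checks on slide 25 (missing values, values of five characters or fewer, embedded numerics, recurring placeholders such as "Do not use") can be tallied in a single pass. The `PLACEHOLDERS` set below is a hypothetical list for illustration; in practice it would come from reviewing the value frequencies of the actual data.

```python
import re
from collections import Counter

# Hypothetical placeholder values; build the real list from profiling results.
PLACEHOLDERS = {"do not use", "test", "n/a", "unknown"}


def text_findings(values, short_len=5):
    """Count how many values trigger each text-quality check."""
    findings = Counter()
    for value in values:
        text = (value or "").strip()
        if not text:
            findings["missing"] += 1
            continue
        if len(text) <= short_len:
            findings["short"] += 1
        if re.search(r"\d", text):
            findings["has_digits"] += 1
        if text.lower() in PLACEHOLDERS:
            findings["placeholder"] += 1
    return findings


# Hypothetical name column.
names = ["Maria Lopez", "DO NOT USE", "Bob", "", "J0hn Smith", None, "test"]
findings = text_findings(names)
print(dict(findings))
```

A value can trigger more than one check ("test" is both short and a placeholder), so the counts are per finding rather than per record.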
  • 26. 4. When do you build rules?
  • 27. Focus on: • Critical Data Elements (data quality dimensions) • Policy-based conditions (e.g. regulatory compliance) • Correlated data conditions (e.g. If x, then y) • Filtering and segmenting data (refining evaluations; investigating root cause) Build Rules for Defined Conditions
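A correlated data condition of the form "if x, then y" from slide 27 can be expressed as a small predicate that is easy to test and share. The rule below (closed accounts must carry a close date) and its field names are invented for the example; the pass/fail split is what supports the filtering and segmenting use the slide describes.

```python
def rule_closed_has_close_date(record):
    """If status is 'CLOSED', then close_date must be populated.
    Records with any other status pass because the rule does not apply."""
    if record.get("status") != "CLOSED":
        return True
    return bool(record.get("close_date"))


def apply_rule(records, rule):
    """Split records into those that satisfy the rule and those that violate it."""
    passed, failed = [], []
    for record in records:
        (passed if rule(record) else failed).append(record)
    return passed, failed


# Hypothetical account records.
accounts = [
    {"id": 1, "status": "OPEN", "close_date": None},
    {"id": 2, "status": "CLOSED", "close_date": "2024-02-10"},
    {"id": 3, "status": "CLOSED", "close_date": None},
]
passed, failed = apply_rule(accounts, rule_closed_has_close_date)
print([r["id"] for r in failed])
```

Keeping each rule as a named function with a one-sentence description makes it straightforward to reuse the same rule across data sources and to export the failing records for remediation.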
  • 28. • Validate critical requirements within or across data sources • Build common rules that can be readily tested and shared • Evaluate and remediate issues • Take action on incorrect data and defaults • Create flags for subsequent use in marking or remediating data • Filter result sets and export for additional use Benefits of Business Rules
  • 29. 5. What should you communicate?
  • 30. Culture of Data Literacy • “Democratization of Data” requires cultural support • Empowered to ask questions about the data • Trained to understand and use data • Trained to understand how to approach and evaluate data quality • Traditional data, new data, machine learning requirements, … • Understand the business context of the data Program of Data Governance • Provide the processes and practices necessary for success • Measure, monitor, and improve • Continuous iteration and development Center of Excellence/Knowledge Base • Where do you go to find answers? • Who can help show you how? Communicate!
  • 31. • Annotate what you’ve found • Identify the subject and add a description that is meaningful • Utilize flags, tags, and other indicators to help others distinguish types and severity of issues • Integrate into data governance and BI tools for maximum visibility Annotate Results with Findings
  • 32. Summary Evaluating Big Data It is challenging to keep the end goal in mind • Data comes from multiple disparate systems & sources • The number of touchpoints for policies and rules has grown • There is a higher demand and expectation for seeing data quality in context. • You need to assess and measure the data content if you 5 Key Steps • Remember the end goal – ask questions, use best practices, and establish scope & context • Consider what criteria and dimensions are needed • Focus your attention based on the type of data and the use case • Build rules when necessary to get laser-focused • Determine what needs to be communicated and delivered Gaining insight and measurement of data quality is more critical than ever!