Data Profiling, Data Catalogs and Metadata Harmonisation
Alan McSweeney
http://ie.linkedin.com/in/alanmcsweeney
https://www.amazon.com/dp/1797567616
Data Profiling, Data Catalogs and Metadata Harmonisation
• Data Profiling – Understand Your Data
• Data Catalog – Database of Data Assets
• Metadata Harmonisation – Standardisation of Data Descriptions
Data Profiling
• Preparing data so that it is in a usable and analysable format can consume up to 80% of the resources of a data project
Data Profiling
• Process for discovering and examining the data available in existing data sources
• Essential initial activity
− Understand the structure and contents of data
− Evaluate data quality and data conformance with standards
− Identify terms and metadata used to describe data
− Identify data relationships and dependencies
− Enable the creation of a master data view across all data sources
− Understand and define data integration requirements
• Enables data issues, problems and challenges to be understood at the start of a data project involving:
− Data cleansing
− Data analytics
− Master data management
− Data catalog
− Data migration
Data Profiling – Wider Context
[Diagram: data profiling sits between the source systems and a common data model, data storage and access platform, which in turn feeds visualisation, reporting and analysis through a data access layer. Profiling helps to understand source system data structures and values, assists with building the long-term data model, assists with data extraction and integration definition, and assists with building a data dictionary/catalog to enable data subject access and data discovery. The data catalog is in turn an enabler of a data virtualisation layer.]
Importance Of Data Profiling
• Data profiling is a central activity that is key to downstream and long-term data usability and has an impact on Data Quality, Data Lineage/Data Provenance and Master Data Management:
− Data Quality – profiling contributes to and ensures data quality
− Data Lineage/Data Provenance – profiling allows tracking of data lineage
− Master Data Management – profiling enables the implementation of master data management
• Data lineage and data provenance involve tracking data origins, what happens to the data and how it flows between systems over time
• Data lineage provides data visibility
• It simplifies tracing data errors that may occur in data reporting, visualisation and analytics
Data Profiling Toolset Options – Partial List
• There is a large number of data profiling tool options
• You can investigate these tools to understand which are the most suitable and which functions are important prior to any formal tool selection process
− Download and use trial versions of commercial tools
• This work will require resources
• Free/Open Source Tools
− Aggregate Profiler
− Quadient DataCleaner
− Talend Open Studio
• Commercial Tools
− Atlan
− Collibra Data Stewardship Manager
− IBM InfoSphere Information Analyser
− Informatica Data Explorer
− Melissa Data Profiler
− Oracle Enterprise Data Quality
− SAP Business Objects Data Services (BODS)
− SAS DataFlux
− TIBCO Clarity
− WinPure
Data Profiling Stages
• Data Access and Retrieval – defining the data sources to be profiled
• Performing the Profiling – working through the programme of data profiling activities
• Understanding and Interpreting the Results – collating, documenting and using the results
Layers Of Data Profiling Activities
• Profiling starts with individual data fields/columns and then extends outwards to tables/files, then to data stores/databases, to the upstream data sources and downstream data targets, and finally to the entire set of organisation data entities:
− Individual Data Fields
− Individual Data Structures (Tables)
− Data Store
− Data Sources and Targets
− Organisation Data Landscape
Data Profiling Across The Organisation Data Landscape
[Diagram: multiple separate data profiling activities, one per data entity, spread across the organisation data landscape.]
Data Profiling Across The Organisation Data Landscape
• Data profiling activities are normally performed on single data entities
• The organisation data landscape consists of multiple, generally heterogeneous, loosely interconnected data entities between which data moves
• Data breathes life into the organisation’s solution landscape
• One profiled data entity can take its data from a number of upstream data sources and in turn be the source for a number of downstream data targets
• Profiling may involve tracing data lineage across a number of data entities to create an end-to-end data provenance
Individual Data Profiling Exercise Can Leak Into Other Data Domains
[Diagram: a core data profiling activity connects to upstream data profiling activities (covering its data sources) and to downstream data profiling activities (covering its data targets).]
Data Profiling Activities
• Individual Field Analysis
− Data Type, Length, Input Validation, Constraints
− Number and Count of Values, Null/Missing Values, Maxima, Minima, Ranges, Distributions
− Data Categories, Values, Data Dictionaries, Reference Sources
− Data Value Patterns
• Data Structures
− Data Aggregations
− Keys
− Data Indexes
− Triggers
• Inter Field Linkages, Relationships, Correlations and Dependencies
− Unique Combinations of Field Values
− Functional Field Dependencies
− Inclusion Dependencies
− Cross Field Inconsistent Values
• Data Completeness, Consistency and Accuracy
− Missing and Incomplete Series Values and Gaps
− Inconsistent Data Values
− Inaccurate Data Values
− Duplicate Values
− Distribution and Occurrence Checking
• Data Context
− Data Sources
− Data Processing and Transformation, Business Rules
− Data Description and Documentation
− Metadata Definition and Creation
− Data Targets and Usage
− Data Criticality
− Data Security
• Data Statistics
− Data Capacity Statistics
− Data Usage Statistics
− Data Update Statistics
− Data Growth Statistics
− Data Processing Statistics
− Data Overheads
− Data Audit Logging
• Data Infrastructure
− Data Storage Infrastructure
− Data Locations
− Data Processing Infrastructure
• Data Operations
− Backup and Recovery
− Replication
− Availability and Continuity
− Data Maintenance and Housekeeping Activities
− Service Levels
− Data Incident History
• Data Technologies
− Data Integration
− Data Storage
− Data Access
• Problem Identification and Remediation
− Identify Data Problems
− Identify Remediation Activities
Data Profiling Activities
• This represents a set of data profiling tasks to create a complete view of the data contents of a data entity
• This allows a realistic programme of work for completing the data profiling activity to be defined
− Resource requirements can be quantified
− Duration can be estimated
− Informed decisions can be made on what activities to include or exclude
Data Profiling – Individual Field Analysis
• Analyse individual data fields or columns (a minimal profiling sketch follows this list)
− Data Type, Length, Input Validation, Constraints
• Classify the field formats
− Number and Count of Values, Null/Missing Values, Maxima, Minima, Ranges, Distributions
• Analyse and document the field values and determine any errors and inconsistencies
− Data Categories, Values, Data Dictionaries, Reference Sources
• Identify lists of values used in fields and their sources
− Data Value Patterns
• Seek to identify patterns in field values
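A minimal field-profiling sketch in Python using pandas, assuming a tabular source has already been loaded into a DataFrame (the source file and column names are hypothetical):

import pandas as pd

def profile_field(series: pd.Series) -> dict:
    """Collect basic per-field profiling measures: counts, nulls,
    cardinality, ranges and the most common value patterns."""
    profile = {
        "dtype": str(series.dtype),
        "count": int(series.size),
        "nulls": int(series.isna().sum()),
        "distinct": int(series.nunique(dropna=True)),
    }
    if pd.api.types.is_numeric_dtype(series):
        profile.update(min=series.min(), max=series.max(),
                       mean=series.mean(), std=series.std())
    else:
        # Generalise values into patterns: letters -> A, digits -> 9
        patterns = (series.dropna().astype(str)
                    .str.replace(r"[A-Za-z]", "A", regex=True)
                    .str.replace(r"[0-9]", "9", regex=True))
        profile["top_patterns"] = patterns.value_counts().head(5).to_dict()
    return profile

df = pd.read_csv("customers.csv")  # hypothetical source file
for col in df.columns:
    print(col, profile_field(df[col]))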
Data Profiling – Data Structures
• Analyse data structures – tables or files
− Data Aggregations
• Analyse data structures – number of fields/columns, frequencies of values across lines/rows
− Keys
• Identify data structure keys, their values, frequencies, relevance and usefulness for data access
− Data Indexes
• Analyse data structure indexes, their values and their usefulness for data retrieval
− Triggers
• Determine if triggers have been defined for fields and analyse their purpose, frequency, efficiency and utility
Data Profiling – Inter Field Linkages, Relationships, Correlations and Dependencies
• Identify relationships between fields/columns of data structures/tables
• Relationship and dependency identification can be complex because of data volumes and the large number of data values and combinations (a brute-force sketch for small tables follows this list)
− Unique Combinations of Field Values
• Identify combinations of fields/columns that uniquely identify lines/rows
− Functional Field Dependencies
• Identify circumstances where one field/column value affects others
− Inclusion Dependencies
• Identify where some field/column values are contained in others (such as foreign keys)
− Cross Field Inconsistent Values
• Identify field/column values across separate data structures that are inconsistent
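A brute-force sketch of discovering minimal unique column combinations (candidate keys) with pandas; this is workable only for small tables and low column counts, which is why the specialised algorithms listed on the next slide exist:

from itertools import combinations
import pandas as pd

def unique_column_combinations(df: pd.DataFrame, max_size: int = 3):
    """Return minimal sets of columns whose combined values are unique
    across all rows (candidate keys), up to max_size columns."""
    found = []
    for size in range(1, max_size + 1):
        for cols in combinations(df.columns, size):
            # Skip supersets of combinations already known to be unique
            if any(set(f) <= set(cols) for f in found):
                continue
            if not df.duplicated(subset=list(cols)).any():
                found.append(cols)
    return found

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "a"],
                   "dept": ["x", "x", "y"]})
print(unique_column_combinations(df))  # [('id',), ('name', 'dept')]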
Data Profiling – Inter Field Linkages, Relationships, Correlations and Dependencies
• There are many algorithms that can be used to simplify the activity of identifying cross-field dependencies
• These are frequently included in data profiling tools
− Unique Combinations of Field Values: DUCC, GORDIAN, HCA, HyUCC, SWAN
− Functional Field Dependencies: DEP-MINER, DFD, FDEP, FDMINE, FASTFDs, FUN, HyFD
− Inclusion Dependencies: B&B, BINDER, CLIM, DeMARCH, MIND, MIND2, S-INDD, SPIDER, ZIGZAG
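A naive inclusion-dependency check (is every value of one column contained in another column, as with a foreign key) can be sketched with set containment; the listed algorithms such as SPIDER exist to make this scale to many columns and large data volumes:

import pandas as pd

def inclusion_dependencies(tables: dict):
    """Yield (dependent, referenced) column pairs where every non-null
    value of the dependent column appears in the referenced column."""
    cols = {(t, c): set(df[c].dropna().unique())
            for t, df in tables.items() for c in df.columns}
    for dep, dep_vals in cols.items():
        for ref, ref_vals in cols.items():
            if dep != ref and dep_vals and dep_vals <= ref_vals:
                yield dep, ref

orders = pd.DataFrame({"order_id": [10, 11], "cust_id": [1, 2]})
customers = pd.DataFrame({"cust_id": [1, 2, 3]})
for dep, ref in inclusion_dependencies({"orders": orders,
                                        "customers": customers}):
    print(dep, "is contained in", ref)
    # ('orders', 'cust_id') is contained in ('customers', 'cust_id')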
Data Profiling – Data Completeness, Consistency and Accuracy
• Analyse data within data structures to identify any gaps and inaccuracies (a sketch of some of these checks follows this list)
− Missing and Incomplete Series Values and Gaps
• Determine any missing values in data series
− Inconsistent Data Values
• Examine data values for inconsistencies
− Inaccurate Data Values
• Examine data values for inaccuracy
− Duplicate Values
• Identify potential duplicate values
− Distribution and Occurrence Checking
• Create and analyse data value distributions
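A small sketch of gap, null and duplicate checks with pandas (the daily-series assumption and the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "reading_date": pd.to_datetime(
        ["2021-05-01", "2021-05-02", "2021-05-04", "2021-05-04"]),
    "value": [10.0, 11.5, None, 13.0],
})

# Gaps in a series expected to be daily
expected = pd.date_range(df["reading_date"].min(),
                         df["reading_date"].max(), freq="D")
gaps = expected.difference(df["reading_date"])
print("Missing dates:", list(gaps.date))  # [2021-05-03]

# Null values per column
print("Nulls:", df.isna().sum().to_dict())

# Potential duplicate rows on the series key
print(df[df.duplicated("reading_date", keep=False)])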
Data Profiling – Data Context
• Analyse the wider context of the data being profiled
− Data Sources
• Identify the sources of the data (and their sources)
− Data Processing and Transformation, Business Rules
• Determine how the data is created from its sources
− Data Description and Documentation
• Describe and document the data
− Metadata Definition and Creation
• Identify any existing metadata and create/update it
− Data Targets and Usage
• Identify how the data is used by downstream targets and activities
− Data Criticality
• Identify the importance and criticality of the data to business operations
− Data Security
• Identify the current and required data security and access control requirements
Data Profiling – Data Statistics
• Collect and analyse statistics on data
− Data Capacity Statistics
• Collect and analyse the volumes of data being stored in the structures within the data entity
− Data Usage Statistics
• Collect and analyse the rate of usage of data
− Data Update Statistics
• Collect and analyse the rate, frequency and extent of data changes
− Data Growth Statistics
• Collect and analyse the current and projected rates of growth of data volumes and data usage
− Data Processing Statistics
• Collect and analyse data processing statistics such as time to update
− Data Overheads
• Collect and analyse data resource overheads associated with activities such as indexes and log shipping
− Data Audit Logging
• Collect and analyse details on logging configuration and on data activity and usage
Data Profiling – Data Infrastructure
• Analyse the underlying data infrastructure including data service providers
− Data Storage Infrastructure
• Document the current data storage infrastructure and platforms
− Data Locations
• Document the data storage locations
− Data Processing Infrastructure
• Document the infrastructure and platforms used to process data including any performance and throughput bottlenecks
Data Profiling – Data Operations
• Analyse current data operations activities and processes and the technologies being used
− Backup and Recovery
• Document data entity backup and recovery including any testing and validation of processes
− Replication
• Document data entity replication to other locations including any testing and validation of processes
− Availability and Continuity
• Document actual and desired data availability and continuity of access
− Data Maintenance and Housekeeping Activities
• Document processes and activities relating to the maintenance and housekeeping of the data entity
− Service Levels
• Document actual and desired data service levels across access and usage
− Data Incident History
• Analyse service and incident history relating to the data entity including frequency, severity, impact and time to resolve and the impact on overall data reliability
Data Profiling – Data Technologies
• Analyse the technologies in use for the data being profiled
− Data Integration
• Document and analyse data integration technologies
− Data Storage
• Document and analyse data storage technologies
− Data Access
• Document and analyse data access technologies
Data Profiling – Problem Identification and Remediation
• Collate information on any problems and issues identified during the data profiling activities
− Identify Data Problems
• Document and analyse the problems and issues
− Identify Remediation Activities
• Identify remediation activities and define a programme of work
Data Profiling Complexity
• Do not underestimate the complexity, effort and resources required for data profiling
• A product can make the task easier but it is not a panacea
• Data profiling can be a continuous activity as data changes and the target data catalog needs to be maintained and updated
Data Catalog
• Set of information (metadata) containing details on organisation information resources – datasets
• A data catalog can be a static or semi-static data structure created and maintained manually
• Metadata is structured, consistent and indexed for fast and easy access and use
• Contains descriptions of data resources
• Enables user self-service data discovery and usage
• Provides data discovery tools and facilities
• A data catalog assists with implementing FAIR (Findable, Accessible, Interoperable, Reusable) data
− Findable – details on data available on specific topics and subjects can be found easily and quickly
− Accessible – the underlying data can be accessed
− Interoperable – metadata ensures data can be aggregated and integrated across data types
− Reusable – detailed metadata ensures data can be reused in the future
FAIR (Findable, Accessible, Interoperable, Reusable)
• https://fairsharing.org/ – sample data collections
• https://www.go-fair.org/ – implementation of FAIR data principles – https://www.go-fair.org/fair-principles/
• https://www.schema.org/ – contains sample metadata schemas
• Strong academic focus but the principles can be applied elsewhere
Data Catalog Functionality Complexity
• Data catalogs range from simple to complex:
− Registry – simple registry of data sources with links to their location and access mechanisms
− Metadata Content – contains descriptions of the contents of the data sources
− Structured and Processable Metadata – metadata is held in a structured and queryable format
− Data Relationships – holds details on metadata and data concepts/themes with relationships between data sources
− Content and Meaning Relationships – semantic mappings (visual representations of linkages) and relationships among domains of different datasets
• Greater complexity requires more effort and the use of tools
• Greater complexity ensures greater data usability and usefulness
• The catalog can be constructed (semi-)automatically using data profiling tools
• The data catalog must be constantly updated as data changes
Data Catalogs, Master Data Management, Data Profiling And Data Quality Relationships
• Data Catalog – structured information about data sources, contents and access methods
• Master Data Management – layer above operational systems dynamically linking data together
• Data Profiling – discovery and documentation of data sources, types, dictionaries, values, relationships and usage
• Data Quality – defining, monitoring and improving data quality: accuracy, cleansing, consistency and fitness for use
• Relationships: data profiling is necessary to build a data catalog; data quality underpins the data catalog; MDM operationalises the data catalog; MDM ensures data quality; MDM tools can automate data profiling
Data Catalog Vocabulary (DCAT)
• See https://www.w3.org/TR/vocab-dcat-2/
• Based on the Resource Description Framework (RDF) metadata data model
• DCAT is a standard for describing datasets in a data catalog (a small sketch follows)
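A small sketch of describing a dataset with DCAT terms using the rdflib Python library (assuming rdflib 6 or later; the dataset URI, titles and file locations are hypothetical):

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("http://example.org/dataset/customer-master")

# A dcat:Dataset with basic descriptive metadata
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Customer Master Data")))
g.add((ds, DCTERMS.description,
       Literal("Golden customer records reconciled from source systems")))
g.add((ds, DCAT.keyword, Literal("customer")))

# A dcat:Distribution describing how the data can be accessed
dist = URIRef("http://example.org/dataset/customer-master/csv")
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.downloadURL,
       URIRef("http://example.org/files/customer-master.csv")))
g.add((ds, DCAT.distribution, dist))

print(g.serialize(format="turtle"))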
Related Concepts
• Business Glossary – defines terms and concepts across a business domain, providing an authoritative source for business operations
• Data Dictionary – collection of names, definitions and attributes about data items that are being used or stored in a database
Data Catalog Tools
• Many commercial data catalog tools – many overlap with master data management
• Open source options
− CKAN – https://ckan.org/
− Dataverse – https://dataverse.org/
− Invenio – https://inveniosoftware.org/
− QuiltData – https://quiltdata.com/
− Zenodo – https://zenodo.org/
− Kylo – https://kylo.io/
• These can be used to test the concept before investing in commercial tools
• A trial version of Azure Data Catalog can also be used – https://docs.microsoft.com/en-us/azure/data-catalog/overview
Metadata
• Data that provides information about other data resources, enabling relevant data to be discovered, understood and managed reliably and consistently
• There are various classifications of metadata types
Possible Metadata Structure And Organisation
• Descriptive – information about the data resource contained in a set of metadata fields; language; how data can be discovered
• Business – what the data is, its sources, meaning and relationships with other data; location; ownership; authorship
• Structural – how the data is organised and how versions are maintained; formats, contents, dictionaries
• Administrative/Process – how the data should be managed and administered through its lifecycle stages; who can perform what operations on the metadata; security and access restrictions and rights; data preservation and retention; legal constraints and compliance requirements
• Statistical – information on actual data creation and usage and other volumetrics
• Reference – sets of values for structured metadata fields
• Content – automatically generated (unstructured) metadata from content
• Technical – infrastructural requirements; exchange and interface requirements, interoperability; API requirements and usage
Metadata Harmonisation
• Metadata harmonisation can mean:
1. The ability of interacting data systems to exchange their individual sets of metadata (which may comply with different metadata standards/approaches/schemas) and to consistently and coherently interpret and understand the exchanged metadata
2. The conversion of existing metadata held in different systems to a common standard
• Harmonised metadata makes finding and comparing information easier
Key Metadata Harmonisation Principles
• Evaluation – source and target metadata structures/schemas and the underlying data should be profiled before any target metadata schema design work starts
• Matching – match existing metadata structures, involving extraction and analysis of data from source systems
• Transformation – map the source schemas and geometry to the common target schema (a mapping sketch follows this list)
• Validation – assess the conformance of metadata
• Publication – make the transformed metadata schema available
• Management – ongoing management, administration and maintenance
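A minimal sketch of the transformation step: mapping records from two hypothetical source metadata schemas onto a common target schema (all field names and defaults are assumptions for illustration):

# Map per-source field names onto a common target metadata schema
MAPPINGS = {
    "system_a": {"ds_name": "title", "descr": "description",
                 "owner_email": "contact"},
    "system_b": {"dataset_title": "title", "abstract": "description",
                 "steward": "contact"},
}

def harmonise(source: str, record: dict) -> dict:
    """Translate a source metadata record into the target schema,
    keeping None for target fields the source does not supply."""
    mapping = MAPPINGS[source]
    target = {"title": None, "description": None, "contact": None,
              "origin_system": source}
    for src_field, tgt_field in mapping.items():
        if src_field in record:
            target[tgt_field] = record[src_field]
    return target

print(harmonise("system_a", {"ds_name": "Sales 2020", "descr": "Orders"}))
print(harmonise("system_b", {"dataset_title": "Sales 2020"}))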
Metadata Concerns
• No consistent schema and nomenclature being used
• Each system will maintain different sets of metadata
• No consistent set of values (vocabulary/dictionary/code lists) for metadata fields
• Difficult to perform reliable comparisons across metadata
Scope Of Wider Data Management
• Data Management encompasses:
− Data Governance
− Data Architecture Management
− Data Development
− Data Operations Management
− Data Security Management
− Data Quality Management
− Data Integration Management
− Reference and Master Data Management
− Data Warehousing and Business Intelligence Management
− Document and Content Management
− Metadata Management
Reference And Master Data Management
• Reference and Master Data Management is the ongoing reconciliation and maintenance of reference data and master data
− Reference Data Management is control over defined domain values (also known as vocabularies), including control over standardised terms, code values and other unique identifiers, business definitions for each value, business relationships within and across domain value lists, and the consistent, shared use of accurate, timely and relevant reference data values to classify and categorise data
− Master Data Management is control over master data values to enable consistent, shared, contextual use across systems of the most accurate, timely, and relevant version of the truth about essential business entities
• Reference data and master data provide the context for transaction data
Reference and Master Data Management – Definition and Goals
• Definition
− Planning, implementation, and control activities to ensure consistency with a golden version of contextual data values
• Goals
− Provide an authoritative source of reconciled, high-quality master and reference data
− Lower cost and complexity through reuse and leverage of standards
− Support business intelligence and information integration efforts
Reference and Master Data Management – Context
• Inputs: business drivers; data requirements; policy and regulations; standards; code sets; master data; transactional data
• Suppliers: steering committees; business data stewards; subject matter experts; data consumers; standards organisations; data providers
• Tools: reference data management applications; master data management applications; data modeling tools; process modeling tools; metadata repositories; data profiling tools; data cleansing tools; data integration tools; business process and rule engines; change management tools
• Participants: data stewards; subject matter experts; data architects; data analysts; application architects; data governance council; data providers; other IT professionals
• Primary Deliverables: master and reference data requirements; data models and documentation; reliable reference and master data; golden record data lineage; data quality metrics and reports; data cleansing services
• Metrics: reference and master data quality; change activity; issues, costs, volume; use and re-use; availability; data steward coverage
• Consumers: application users; BI and reporting users; application developers and architects; data integration developers and architects; BI developers and architects; vendors, customers, and partners
Reference And Master Data Management – Principles
• Shared reference and master data belongs to the organisation, not to a particular application or department
• Reference and master data management is an ongoing data quality improvement program; its goals cannot be achieved by one project alone
• Business data stewards are the authorities accountable for controlling reference data values. Business data stewards work with data professionals to improve the quality of reference and master data
• Golden data values represent the organisation’s best efforts at determining the most accurate, current, and relevant data values for contextual use. New data may prove earlier assumptions to be false. Therefore, apply matching rules with caution, and ensure that any changes that are made are reversible
• Replicate master data values only from the database of record
• Request, communicate, and, in some cases, approve changes to reference data values before implementation
Reference Data
• Reference data is data used to classify or categorise other data
• Business rules usually dictate that reference data values conform to one of several allowed values
• In all organisations, reference data exists in virtually every database
• Reference tables link via foreign keys into other relational database tables, and the referential integrity functions within the database management system ensure only valid values from the reference tables are used in other tables (a simple standalone check is sketched below)
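Where data sits outside a DBMS that enforces referential integrity (flat files, for example), a conformance check against a reference list can be sketched as follows (the code values and column names are hypothetical):

import pandas as pd

# Reference data: the allowed country codes
country_ref = pd.DataFrame({"code": ["IE", "GB", "US"]})

# Operational data that should only use valid reference values
customers = pd.DataFrame({"name": ["Ann", "Bob", "Cei"],
                          "country": ["IE", "XX", "US"]})

# Flag rows whose country code is not in the reference table
invalid = customers[~customers["country"].isin(country_ref["code"])]
print(invalid)  # Bob has the non-conformant code "XX"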
Master Data
• Master data is data about the business entities that provide context for business transactions
• Master data is the authoritative, most accurate data available about key business entities, used to establish the context for transactional data
• Master data values are considered golden
• Master Data Management is the process of defining and maintaining how master data will be created, integrated, maintained, and used throughout the enterprise
Master Data Challenges
• What are the important roles, organisations, places, and things referenced repeatedly?
• What data is describing the same person, organisation, place, or thing?
• Where is this data stored? What is the source for the data?
• Which data is more accurate? Which data source is more reliable and credible? Which data is most current?
• What data is relevant for specific needs? How do these needs overlap or conflict?
• What data from multiple sources can be integrated to create a more complete view and provide a more comprehensive understanding of the person, organisation, place or thing?
• What business rules can be established to automate master data quality improvement by accurately matching and merging data about the same person, organisation, place, or thing?
• How do we identify and restore data that was inappropriately matched and merged?
• How do we provide our golden data values to other systems across the enterprise?
• How do we identify where and when data other than the golden values is used?
Understand Reference And Master Data Integration Needs
• Reference and master data requirements are relatively easy to discover and understand for a single application
• It is potentially much more difficult to develop an understanding of these needs across applications, especially across the entire organisation
• Analysing the root causes of a data quality problem usually uncovers requirements for reference and master data integration
• Organisations that have successfully managed reference and master data typically have focused on one subject area at a time
− Analyse all occurrences of a few business entities, across all physical databases and for differing usage patterns
Define and Maintain the Data Integration Architecture
• Effective data integration architecture controls the shared access, replication, and flow of data to ensure data quality and consistency, particularly for reference and master data
• Without data integration architecture, local reference and master data management occurs in application silos, inevitably resulting in redundant and inconsistent data
• The selected data integration architecture should also provide common data integration services
− Change request processing, including review and approval
− Data quality checks on externally acquired reference and master data
− Consistent application of data quality rules and matching rules
− Consistent patterns of processing
− Consistent metadata about mappings, transformations, programs and jobs
− Consistent audit, error resolution and performance monitoring data
− Consistent approach to replicating data
• Establishing master data standards can be a time-consuming task as it may involve multiple stakeholders
• Apply the same data standards, regardless of integration technology, to enable effective standardisation, sharing, and distribution of reference and master data
Data Integration Services Architecture
[Diagram: common data integration services spanning Data Quality Management and Metadata Management: data acquisition, file management and audit; data standardisation, cleansing and matching; replication management; job flow and statistics; integration metadata and business metadata; with supporting stores for source data archives, rules, errors, staging, reconciled master data and subscriptions.]
Implement Reference And Master Data Management Solutions
• Reference and master data management solutions are complex
• Given the variety, complexity, and instability of requirements, no single solution or implementation project is likely to meet all reference and master data management needs
• Organisations should expect to implement reference and master data management solutions iteratively and incrementally through several related projects and phases
Define And Maintain Match Rules
• Matching, merging, and linking of data from multiple systems about the same person, group, place, or thing is a major master data management challenge
• Matching attempts to remove redundancy, to improve data quality, and to provide information that is more comprehensive
• Data matching is performed by applying inference rules (a duplicate-identification sketch follows this list)
− Duplicate identification match rules focus on a specific set of fields that uniquely identify an entity and identify merge opportunities without taking automatic action
− Match-merge rules match records and merge the data from these records into a single, unified, reconciled, and comprehensive record
− Match-link rules identify and cross-reference records that appear to relate to a master record without updating the content of the cross-referenced record
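A minimal duplicate-identification sketch: records are grouped on a normalised match key and candidate merges are reported without taking automatic action (the normalisation rules and fields are illustrative assumptions, not a prescribed rule set):

import re
from collections import defaultdict

def match_key(record: dict) -> tuple:
    """Build a normalised key from fields that should identify a person:
    lower-cased, punctuation-stripped name plus date of birth."""
    name = re.sub(r"[^a-z ]", "", record["name"].lower()).strip()
    return (name, record["dob"])

records = [
    {"id": 1, "name": "Mary O'Brien", "dob": "1980-04-12"},
    {"id": 2, "name": "mary obrien",  "dob": "1980-04-12"},
    {"id": 3, "name": "John Smith",   "dob": "1975-01-30"},
]

candidates = defaultdict(list)
for rec in records:
    candidates[match_key(rec)].append(rec["id"])

# Report merge opportunities only; no automatic merge is performed
for key, ids in candidates.items():
    if len(ids) > 1:
        print(f"Possible duplicates {ids} on key {key}")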
Vocabulary Management And Reference Data
• A vocabulary is a collection of terms/concepts and their relationships
• Vocabulary management is defining, sourcing, importing, and maintaining a vocabulary and its associated reference data
− See ANSI/NISO Z39.19 – Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies – http://www.niso.org/kst/reports/standards?step=2&gid=&project_key=7cc9b583cb5a62e8c15d3099e0bb46bbae9cf38a
• Vocabulary management requires the identification of the standard list of preferred terms and their synonyms
• Vocabulary management requires data governance, enabling data stewards to assess stakeholder needs
Vocabulary Management And Reference Data
• Key questions to ask to enable vocabulary management:
− What information concepts (data attributes) will this vocabulary support?
− Who is the audience for this vocabulary? What processes do they support, and what roles do they play?
− Why is the vocabulary needed? Will it support applications, content management, analytics, and so on?
− Who identifies and approves the preferred vocabulary and vocabulary terms?
− What are the current vocabularies different groups use to classify this information? Where are they located? How were they created? Who are their subject matter experts? Are there any security or privacy concerns for any of them?
− Are there existing standards that can be leveraged to fulfil this need? Are there concerns about using an external standard vs. internal? How frequently is the standard updated and what is the degree of change of each update? Are standards accessible in an easy to import/maintain format in a cost-efficient manner?
Defining Golden Master Data Values
• Golden data values are the data values thought to be the most accurate, current, and relevant for shared, consistent use across applications
• Determine golden values by analysing data quality, applying data quality rules and matching rules, and incorporating data quality controls into the applications that acquire, create, and update data
• Establish data quality measurements to set expectations, measure improvements, and help identify root causes of data quality problems
• Assess data quality through a combination of data profiling activities and verification of adherence to business rules
• Once the data is standardised and cleansed, the next step is to attempt reconciliation of redundant data through application of matching rules (a simple survivorship sketch follows this list)
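One common way to derive a golden record from matched duplicates is a survivorship rule such as "most recently updated non-null value wins"; the sketch below assumes that rule and uses hypothetical fields:

def golden_record(matched: list) -> dict:
    """Merge matched records field by field, keeping the most recently
    updated non-null value for each field (survivorship rule)."""
    ordered = sorted(matched, key=lambda r: r["updated"])  # oldest first
    golden = {}
    for rec in ordered:  # later (newer) records overwrite earlier ones
        for field, value in rec.items():
            if field != "updated" and value is not None:
                golden[field] = value
    return golden

matched = [
    {"name": "Mary O'Brien", "phone": None, "city": "Cork",
     "updated": "2020-01-01"},
    {"name": "Mary OBrien", "phone": "+353 1 234", "city": None,
     "updated": "2021-03-15"},
]
print(golden_record(matched))
# {'name': 'Mary OBrien', 'phone': '+353 1 234', 'city': 'Cork'}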
Define And Maintain Hierarchies And Affiliations
• Vocabularies and their associated reference data sets are often more than lists of preferred terms and their synonyms
• Affiliation management is the establishment and maintenance of relationships between master data records
Plan And Implement Integration Of New Data Sources
• Integrating new reference data sources involves:
− Receiving and responding to new data acquisition requests from different groups
− Performing data quality assessment services using data cleansing and data profiling tools
− Assessing data integration complexity and cost
− Piloting the acquisition of data and its impact on match rules
− Determining who will be responsible for data quality
− Finalising data quality metrics
Replicate And Distribute Reference And Master Data
• Reference and master data may be read directly from a database of record, or may be replicated from the database of record to other application databases for transaction processing, and to data warehouses for business intelligence
• Reference data most commonly appears as pick-list values in applications
• Replication aids maintenance of referential integrity
Manage Changes To Reference And Master Data
• Specific individuals have the role of a business data steward with the authority to create, update, and retire reference data
• Formally control changes to controlled vocabularies and their reference data sets
• Carefully assess the impact of reference data changes
Data Governance And MDM Success Factors
• Master Data Management will support business by providing a strategy, governance policies and technologies for customer, product, and entitlement information by following the Master Data Management Guiding Principles:
− Master data management will use (and where needed create) a “single version of the truth” for customer, product, and asset entitlement master data consolidated into a single master data system
− Master data management will establish standard data definitions and consistent usage to simplify business processes across enterprise systems
− Master data management systems and processes will be flexible and adaptable to handle domestic and global expansion to support growth in both established and emerging markets
− Master data management will adhere to a standards governance process to ensure key data elements are created, maintained, cleansed and converted to be syndicated across enterprise systems
− Master data management will identify responsibilities and monitor accountability for customer, product, and entitlement information
− Master data management will facilitate cross-functional collaboration and manage continuous improvement of master data for customer, product, and entitlement domains
Data Governance is Not A Choice – It Is A Necessity
• “We’ve got to stop having the ‘who owns the data?’ conversation.”
• “We can’t do MDM if we don’t formalise decision-making processes around our enterprise information.”
• “Fixing the data in a single system is pointless; we don’t know what the rules are across our systems.”
• “Everyone agrees data quality is poor, but no one can agree on how to fix it.”
• “Are you kidding? We have multiple versions of the single-version-of-the-truth.”
MDM Program Critical Success Factors
• Strategy
− Drive and promote alignment with corporate strategic initiatives and pillar-specific goals
− Definition of criteria and core attributes that define domains and related objects
• Solution
− Alignment with corporate strategic initiatives and pillar-specific goals
− Identification of “Quick Wins” that have measurable impact
− Clear definition of metrics for measuring data improvement
− Leading industry practices have been incorporated into solution design
• Governance
− Executive ownership and a governance organisation have been rationalised and established to address federated data management needs
− Data quality is addressed at all points of processes, as well as customer and product lifecycles
• End-to-end Roadmap
− Prioritised program roadmap for “Quick Wins”
− Prioritised program roadmap for CDM strategic initiatives
− Fully vetted CBA for each roadmap item
− “No Regrets” actions are rationalised and aligned to the strategic roadmap
More Information
Alan McSweeney
http://ie.linkedin.com/in/alanmcsweeney
https://www.amazon.com/dp/1797567616
Critical Review of Open Group IT4IT Reference ArchitectureCritical Review of Open Group IT4IT Reference Architecture
Critical Review of Open Group IT4IT Reference Architecture
 
Analysis of Possible Excess COVID-19 Deaths in Ireland From Jan 2020 to Jun 2020
Analysis of Possible Excess COVID-19 Deaths in Ireland From Jan 2020 to Jun 2020Analysis of Possible Excess COVID-19 Deaths in Ireland From Jan 2020 to Jun 2020
Analysis of Possible Excess COVID-19 Deaths in Ireland From Jan 2020 to Jun 2020
 
Agile Solution Architecture and Design
Agile Solution Architecture and DesignAgile Solution Architecture and Design
Agile Solution Architecture and Design
 
Solution Architecture and Solution Acquisition
Solution Architecture and Solution AcquisitionSolution Architecture and Solution Acquisition
Solution Architecture and Solution Acquisition
 

Kürzlich hochgeladen

Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...boychatmate1
 
knowledge representation in artificial intelligence
knowledge representation in artificial intelligenceknowledge representation in artificial intelligence
knowledge representation in artificial intelligencePriyadharshiniG41
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 

Kürzlich hochgeladen (20)

Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
 
knowledge representation in artificial intelligence
knowledge representation in artificial intelligenceknowledge representation in artificial intelligence
knowledge representation in artificial intelligence
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 

Data Profiling, Data Catalogs and Metadata Harmonisation

  • 1. Data Profiling, Data Catalogs and Metadata Harmonisation Alan McSweeney http://ie.linkedin.com/in/alanmcsweeney https://www.amazon.com/dp/1797567616
  • 2. Data Profiling, Data Catalogs and Metadata Harmonisation May 3, 2021 2 Data Profiling Understand Your Data Data Catalog Database of Data Assets Metadata Harmonisation Standardisation of Data Descriptions
  • 3. Data Profiling • The preparation of data prior to it being in a usable and analysable format can consume up to 80% of the resources of a data project May 3, 2021 3
  • 4. Data Profiling • Process for discovering and examining the data available in existing data sources • Essential initial activity − Understand the structure and contents of data − Evaluate data quality and data conformance with standards − Identify terms and metadata used to describe data − Identify data relationships and dependencies − Enable the creation of a master data view across all data sources − Understand and define data integration requirements • Be able to understand data issues, problems, challenges at the start of a data project: − Data cleansing − Data analytics − Master data management − Data catalog − Data migration May 3, 2021 4
  • 5. Data Profiling – Wider Context May 3, 2021 5 Source System Data Profiling Common Data Model, Data Storage and Access Platform Visualisation and Reporting Analysis Data Access Data Integration Common Data Integration Understand System Data Structures, Values Profiling Assists with Building Long-Term Data Model Profiling Assists with Building Data Dictionary/Catalog to Enable Data Subject Access and Data Discovery 1 2 3 5 Assists with Data Extraction and Integration Definition 4 Data Catalog Data Virtualisation Layer 6 Data Catalog is an Enabler of Data Virtualisation
  • 6. Importance Of Data Profiling May 3, 2021 6 • Data profiling is a central activity that is key to downstream and long-term data usability and has impact on topics of Data Quality, Data Lineage/Data Provenance and Master Data Management Data Profiling Activity Data Quality Data Lineage/Data Provenance Contributes to and Ensures Data Quality Allows Tracking of Data Lineage • Data lineage and data provenance involve tracking data origins, what happens to the data and how it flows between systems over time • Data lineage provides data visibility • It simplifies tracing data errors that may occur in data reporting, visualisation and analytics Master Data Management Enables the Implementation of Master Data Management
  • 7. Data Profiling Toolset Options – Partial List May 3, 2021 7 • Large number of data profiling tool options • You can investigate these tools to understand which are the most suitable and which functions are important prior to any formal tool selection process − Download and use trial versions of commercial tools • This work will require resources Free/Open Source Tools Aggregate Profiler Quadient DataCleaner Talend Open Studio Commercial Tools Atlan Collibra Data Stewardship Manager IBM InfoSphere Information Analyser Informatica Data Explorer Melissa Data Profiler Oracle Enterprise Data Quality SAP Business Objects Data Services (BODS) SAS DataFlux TIBCO Clarity WinPure
  • 8. Data Profiling Stages Data Access and Retrieval •Defining the data sources to be profiled Performing the Profiling •Working through the programme of data profiling activities Understanding and Interpreting the Results •Collating, documenting and using the results May 3, 2021 8
  • 9. Layers Of Data Profiling Activities • Profiling starts with individual data fields/columns and then extends outwards to tables/files, then to data stores/databases, to the upstream data sources and downstream data targets, and finally to the entire set of organisation data entities May 3, 2021 9 Organisation Data Landscape Data Sources and Targets Data Store Individual Data Structures (Tables) Individual Data Fields
  • 10. Data Profiling Across The Organisation Data Landscape May 3, 2021 10 [Diagram: separate data profiling activities repeated across each data entity in the organisation data landscape]
  • 11. Data Profiling Across The Organisation Data Landscape • Data profiling activities are normally performed on single data entities • The organisation data landscape consists of multiple, generally heterogeneous, loosely interconnected data entities between which data moves • Data breathes life into the organisation’s solution landscape • One profiled data entity can take its data from a number of upstream data sources and in turn be the source for a number of downstream data targets • Profiling may involve tracing data lineage across a number of data entities to create an end-to-end data provenance view May 3, 2021 11
  • 12. Individual Data Profiling Exercise Can Leak Into Other Data Domains May 3, 2021 12 [Diagram: a core data profiling activity connected to upstream data profiling activities and downstream data profiling activities]
  • 13. Data Profiling Activities May 3, 2021 13
  − Individual Field Analysis: Data Type, Length, Input Validation, Constraints; Number and Count of Values, Null/Missing Values, Maxima, Minima, Ranges, Distributions; Data Categories, Values, Data Dictionaries, Reference Sources; Data Value Patterns
  − Data Structures: Data Aggregations; Keys; Data Indexes; Triggers
  − Inter Field Linkages, Relationships, Correlations and Dependencies: Unique Combinations of Field Values; Functional Field Dependencies; Inclusion Dependencies; Cross Field Inconsistent Values
  − Data Completeness, Consistency and Accuracy: Missing and Incomplete Series Values and Gaps; Inconsistent Data Values; Inaccurate Data Values; Duplicate Values; Distribution and Occurrence Checking
  − Data Context: Data Sources; Data Processing and Transformation, Business Rules; Data Description and Documentation; Metadata Definition and Creation; Data Targets and Usage; Data Criticality; Data Security
  − Data Statistics: Data Capacity Statistics; Data Usage Statistics; Data Update Statistics; Data Growth Statistics; Data Processing Statistics; Data Overheads; Data Audit Logging
  − Data Infrastructure: Data Storage Infrastructure; Data Locations; Data Processing Infrastructure
  − Data Operations: Backup and Recovery; Replication; Availability and Continuity; Data Maintenance and Housekeeping Activities; Service Levels; Data Incident History
  − Data Technologies: Data Integration; Data Storage; Data Access
  − Problem Identification and Remediation: Identify Data Problems; Identify Remediation Activities
  • 14. Data Profiling Activities • This represents a set of data profiling tasks to create a complete view of the data contents of a data entity • This allows a realistic programme of work to be defined for completing the data profiling activity − Resource requirements can be quantified − Duration can be estimated − Informed decisions can be made on what activities to include or exclude May 3, 2021 14
  • 15. Data Profiling – Individual Field Analysis • Analyse individual data fields or columns − Data Type, Length, Input Validation, Constraints • Classify the field formats − Number and Count of Values, Null/Missing Values, Maxima, Minima, Ranges, Distributions • Analyse and document the field values and determine any errors and inconsistencies − Data Categories, Values, Data Dictionaries, Reference Sources • Identify lists of values used in fields and their sources and determine any non-conforming values − Data Value Patterns • Seek to identify patterns in field values May 3, 2021 15
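The kind of field-level analysis described above can be sketched with pandas. A minimal illustration follows; the file name and the pattern heuristic are assumptions, not a prescribed method:

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Collect basic field-level profiling measures for one column."""
    profile = {
        "dtype": str(series.dtype),
        "count": int(series.count()),        # non-null values
        "nulls": int(series.isna().sum()),   # null/missing values
        "distinct": int(series.nunique()),   # candidate code-list size
    }
    if pd.api.types.is_numeric_dtype(series):
        # Maxima, minima and a simple distribution measure
        profile.update(min=series.min(), max=series.max(), mean=series.mean())
    else:
        # Most frequent values and a crude value-pattern summary:
        # digits become 9, letters become A, punctuation is kept
        profile["top_values"] = series.value_counts().head(5).to_dict()
        patterns = (series.dropna().astype(str)
                          .str.replace(r"[0-9]", "9", regex=True)
                          .str.replace(r"[A-Za-z]", "A", regex=True))
        profile["patterns"] = patterns.value_counts().head(5).to_dict()
    return profile

df = pd.read_csv("customers.csv")  # illustrative source file
for column in df.columns:
    print(column, profile_column(df[column]))
```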
  • 16. Data Profiling – Data Structures • Analyse data structures – tables or files − Data Aggregations • Analyse data structures – number of fields/columns, frequencies of values across lines/rows − Keys • Identify data structure keys, their values, frequencies, relevance and usefulness for data access − Data Indexes • Analyse data structure indexes, their values and their usefulness for data retrieval − Triggers • Determine if triggers have been defined for fields and analyse their purpose, frequency, efficiency and utility May 3, 2021 16
  • 17. Data Profiling – Inter Field Linkages, Relationships, Correlations and Dependencies • Identify relationships between fields/columns of data structures/tables • Relationship and dependency identification can be complex because of data volumes and large number of data values and combinations − Unique Combinations of Field Values • Identify combinations of fields/columns that uniquely identify lines/rows − Functional Field Dependencies • Identify circumstances where one field/column value affects others − Inclusion Dependencies • Identify where some field/column values are contained in others (such as foreign keys) − Cross Field Inconsistent Values • Identify field/column values across separate data structures that are inconsistent May 3, 2021 17
  • 18. Data Profiling – Inter Field Linkages, Relationships, Correlations and Dependencies − Unique Combinations of Field Values • DUCC • GORDIAN • HCA • HyUCC • SWAN − Functional Field Dependencies • DEP-MINER • DFD • FDEP • FDMINE • FASTFDs • FUN • HyFD − Inclusion Dependencies • B&B • BINDER • CLIM • DeMARCH • MIND • MIND2 • S-INDD • SPIDER • ZIGZAG May 3, 2021 18 • There are many algorithms that can be used to simplify the activity of identifying cross-field dependencies • These are frequently included in data profiling tools
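The algorithms listed above implement these checks at scale. For small tables the underlying ideas can be expressed directly; a naive sketch using pandas (illustrative only, since real profilers prune the search space far more aggressively and report minimal results):

```python
from itertools import combinations
import pandas as pd

def unique_column_combinations(df: pd.DataFrame, max_size: int = 2):
    """Yield column combinations whose values uniquely identify each row."""
    for size in range(1, max_size + 1):
        for cols in combinations(df.columns, size):
            if not df.duplicated(subset=list(cols)).any():
                yield cols

def functional_dependencies(df: pd.DataFrame):
    """Yield single-column dependencies X -> Y: each X value maps to one Y value."""
    for x in df.columns:
        for y in df.columns:
            if x != y and (df.groupby(x)[y].nunique(dropna=False) <= 1).all():
                yield (x, y)

df = pd.DataFrame({"order_id": [1, 2, 3],
                   "country": ["IE", "IE", "FR"],
                   "currency": ["EUR", "EUR", "EUR"]})
print(list(unique_column_combinations(df)))  # includes ('order_id',)
print(list(functional_dependencies(df)))     # includes ('country', 'currency')
```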
  • 19. Data Profiling – Data Completeness, Consistency and Accuracy • Analyse data within data structures to identify any gaps and inaccuracies − Missing and Incomplete Series Values and Gaps • Determine any missing values in data series − Inconsistent Data Values • Examine data values for inconsistencies − Inaccurate Data Values • Examine data values for inaccuracy − Duplicate Values • Identify potential duplicate values − Distribution and Occurrence Checking • Create and analyse data value distributions May 3, 2021 19
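A short sketch of two of these checks, gaps in a date series and potential duplicates, again using pandas with invented data and column names:

```python
import pandas as pd

df = pd.DataFrame({
    "reading_date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"]),
    "meter": ["M1", "M1", "M1"],
    "value": [10.0, 10.5, 11.2],
})

# Missing values in a daily series: compare against the full expected range
expected = pd.date_range(df["reading_date"].min(),
                         df["reading_date"].max(), freq="D")
gaps = expected.difference(df["reading_date"])
print("Missing dates:", list(gaps))  # 2021-01-03 and 2021-01-04

# Potential duplicates: the same natural key occurring more than once
dupes = df[df.duplicated(subset=["meter", "reading_date"], keep=False)]
print("Duplicate candidates:", len(dupes))
```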
  • 20. Data Profiling – Data Context • Analyse the wider context of the data being profiled − Data Sources • Identify the sources of the data (and their sources) − Data Processing and Transformation, Business Rules • Determine how the data is created from its sources − Data Description and Documentation • Describe and document the data − Metadata Definition and Creation • Identify any existing metadata and create/update − Data Targets and Usage • Identify how the data is used by downstream targets and activities − Data Criticality • Identify the importance and criticality of the data to business operations − Data Security • Identify the current and required data security and access control requirements May 3, 2021 20
  • 21. Data Profiling – Data Statistics • Collect and analyse statistics on data − Data Capacity Statistics • Collect and analyse the volumes of data being stored in the structures within the data entity − Data Usage Statistics • Collect and analyse the rate of usage of data − Data Update Statistics • Collect and analyse the rate, frequency and extent of data changes − Data Growth Statistics • Collect and analyse the current and projected rates of growth of data volumes and data usage − Data Processing Statistics • Collect and analyse data processing statistics – time to update − Data Overheads • Collect and analyse data resource overheads associated with activities such as indexes and log shipping − Data Audit Logging • Collect and analyse details on logging configuration and on data activity and usage data May 3, 2021 21
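Growth statistics of this kind are usually derived from periodic snapshots. A minimal sketch of projecting growth from row-count snapshots using numpy; the snapshot figures are invented for illustration:

```python
import numpy as np

# (day number, row count) snapshots collected from a monitored table
days = np.array([0, 30, 60, 90])
rows = np.array([1_000_000, 1_080_000, 1_170_000, 1_255_000])

# Fit a simple linear trend; polyfit returns [slope, intercept]
slope, intercept = np.polyfit(days, rows, 1)
print(f"~{slope:,.0f} rows/day; "
      f"projected at day 365: {slope * 365 + intercept:,.0f}")
```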
  • 22. Data Profiling – Data Infrastructure • Analyse the underlying data infrastructure including data service providers − Data Storage Infrastructure • Document the current data storage infrastructure and platforms − Data Locations • Document the data storage locations − Data Processing Infrastructure • Document the infrastructure and platforms used to process data including any performance and throughput bottlenecks May 3, 2021 22
  • 23. Data Profiling – Data Operations • Analyse current data operations activities and processes and technologies being used − Backup and Recovery • Document data entity backup and recovery including any testing and validation of processes − Replication • Document data entity replication to other locations including any testing and validation of processes − Availability and Continuity • Document actual and desired data availability and continuity of access − Data Maintenance and Housekeeping Activities • Document processes and activities relating to the maintenance and housekeeping of the data entity − Service Levels • Document actual and desired data service levels across access and usage − Data Incident History • Analyse service and incident history relating to the data entity including frequency, severity, impact and time to resolve and the impact on overall data reliability May 3, 2021 23
  • 24. Data Profiling – Data Technologies • Analyse the technologies in use for the data being profiled − Data Integration • Document and analyse data integration technologies − Data Storage • Document and analyse data storage technologies − Data Access • Document and analyse data access technologies May 3, 2021 24
  • 25. Data Profiling – Problem Identification and Remediation • Collate information on any problems and issues identified during the data profiling activities − Identify Data Problems • Document and analyse the problems and issues − Identify Remediation Activities • Identify remediation activities and define a programme of work May 3, 2021 25
  • 26. Data Profiling Complexity • Do not underestimate the complexity, effort and resources required for data profiling • Products can make the task easier but they are not a panacea • Data profiling can be a continuous activity as data changes and the target data catalog needs to be maintained and updated May 3, 2021 26
  • 27. Data Catalog • Set of information (metadata) containing details on organisation information resources - datasets • Data catalog can be static or semi-static data structure created and maintained manually • Metadata is structured, consistent and indexed for fast and easy access and use • Contains descriptions of data resources • Enables user self-service data discovery and usage • Provides data discovery tools and facilities • Data catalog assists with implementing FAIR (Findable, Accessible, Interoperable, Reusable) data − Findable – details on data available on specific topics and subjects can be found easily and quickly − Accessible – underlying data can be accessed − Interoperable – metadata ensures data can be aggregated and integrated across data types − Reusable – detailed metadata ensures data can be reused in the future May 3, 2021 27
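As an illustration of what a single catalog entry might capture to support the FAIR principles, a minimal sketch; the field names are assumptions for illustration, not a published standard:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # Findable: searchable descriptive metadata and a stable identifier
    identifier: str
    title: str
    description: str
    keywords: list = field(default_factory=list)
    # Accessible: where and how the underlying data can be reached
    access_url: str = ""
    access_method: str = ""   # e.g. "ODBC", "REST API", "file share"
    # Interoperable: format and schema information for integration
    media_type: str = ""      # e.g. "text/csv"
    schema_ref: str = ""      # link to a schema or data dictionary
    # Reusable: licence, ownership and provenance for future use
    licence: str = ""
    owner: str = ""
    source_system: str = ""

entry = CatalogEntry(
    identifier="dataset-0001",
    title="Customer Master",
    description="Consolidated customer records",
    keywords=["customer", "master data"],
    access_url="https://example.internal/data/customers",
    access_method="REST API",
    media_type="text/csv",
    licence="internal-use-only",
    owner="Data Governance")
print(entry)
```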
  • 28. FAIR (Findable, Accessible, Interoperable, Reusable) • https://fairsharing.org/ - sample data collections • https://www.go-fair.org/ - implementation of FAIR data principles - https://www.go-fair.org/fair-principles/ • https://www.schema.org/ - contains sample metadata schemas • Strong academic focus but the principles can be applied elsewhere May 3, 2021 28
  • 29. Data Catalog Functionality Complexity May 3, 2021 29 Registry •Simple registry of data sources with links to their location and access mechanisms Metadata Content •Contains descriptions of the contents of the data sources Structured and Processable Metadata •Metadata is held in a structured and queryable format Data Relationships •Holds details on metadata and data concepts/themes with relationships between data sources Content and Meaning Relationships •Semantic mappings (visual representation of linkages) and relationships among domains of different datasets • Data catalogs can be simple or complex • Greater complexity requires more effort and the use of tools • Greater complexity ensures greater data usability and usefulness • Catalog can be constructed (semi) automatically using data profiling tools • The data catalog must be constantly updated as data changes
  • 30. Data Catalogs, Master Data Management, Data Profiling And Data Quality Relationships May 3, 2021 30 Data Catalog Structured information about data sources, contents and access methods Master Data Management Layer above operational systems dynamically linking data together Data Profiling Discovery and documentation of data sources, types, dictionaries, values, relationships, usage Data Quality Defining, monitoring and improving data quality, accuracy, cleansing, consistency and fitness to use MDM Operationalises the Data Catalog Quality Underpins Data Catalog MDM Ensures Data Quality Data Profiling Necessary to Build a Data Catalog MDM Tools Can Automate Data Profiling
  • 31. Data Catalog Vocabulary (DCAT) • See https://www.w3.org/TR/vocab-dcat-2/ • Resource Description Framework (RDF) metadata data model • DCAT is a standard for describing datasets in a data catalog May 3, 2021 31
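A sketch of describing one dataset in DCAT terms using the rdflib library, which ships DCAT and DCTERMS namespaces; the dataset URIs and values are illustrative:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

# The dataset itself, with descriptive metadata
ds = URIRef("https://example.org/catalog/datasets/customer-master")
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Customer Master")))
g.add((ds, DCTERMS.description, Literal("Consolidated customer records")))
g.add((ds, DCAT.keyword, Literal("customer")))
g.add((ds, DCAT.keyword, Literal("master data")))

# A distribution describes one concrete way of accessing the dataset
dist = URIRef("https://example.org/catalog/distributions/customer-master-csv")
g.add((dist, RDF.type, DCAT.Distribution))
g.add((ds, DCAT.distribution, dist))
g.add((dist, DCAT.downloadURL, URIRef("https://example.org/data/customers.csv")))
g.add((dist, DCAT.mediaType, Literal("text/csv")))

print(g.serialize(format="turtle"))
```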
  • 32. Related Concepts • Business Glossary – defines terms and concepts across a business domain providing an authoritative source for business operations • Data Dictionary – collection of names, definitions and attributes about data items that are being used or stored in a database May 3, 2021 32
  • 33. Data Catalog Tools • Many commercial data catalog tools – many overlap with master data management • Open source options − CKAN - https://ckan.org/ − Dataverse - https://dataverse.org/ − Invenio - https://inveniosoftware.org/ − QuiltData - https://quiltdata.com/ − Zenodo - https://zenodo.org/ − Kylo - https://kylo.io/ • Can use to test concept before investing in commercial tools • Can also use trial version of Azure Data Catalog - https://docs.microsoft.com/en-us/azure/data-catalog/overview May 3, 2021 33
  • 34. Metadata • Data that provides information about other data resources and enables relevant data to be discovered, understood and managed reliably and consistently • There are various classifications of metadata types May 3, 2021 34
  • 35. Possible Metadata Structure And Organisation May 3, 2021 35 Types of Metadata:
  − Descriptive: information about the data resource contained in a set of metadata fields; language; how data can be discovered
  − Business: what the data is, its sources, meaning and relationships with other data; location; ownership and authorship
  − Structural: how the data is organised and how versions are maintained; formats, contents, dictionaries
  − Administrative/Process: how the data should be managed and administered through its lifecycle stages; who can perform what operations on the metadata; security and access restrictions and rights; data preservation and retention; legal constraints and compliance requirements
  − Statistical: information on actual data creation and usage and other volumetrics
  − Reference: sets of values for structured metadata fields
  − Content: automatically generated (unstructured) metadata from content
  − Technical: infrastructural requirements; exchange and interface requirements, interoperability; API requirements and usage
  • 36. Metadata Harmonisation • Metadata Harmonisation can mean: 1. The ability of interacting data systems to exchange their individual sets of metadata (which may comply with different metadata standards/approaches/schemas) and to consistently and coherently interpret and understand the exchanged metadata 2. The conversion of existing metadata held in different systems to a common standard • Harmonised metadata makes finding and comparing information easier May 3, 2021 36
  • 37. Key Metadata Harmonisation Principles • Evaluation – Source, target metadata structures/schemas and the underlying data should be profiled before any target metadata schema design work starts • Matching – Match existing metadata structures involving extraction and analysis of data from source systems • Transformation – Map the source schemas and geometry to the common target schema • Validation – Assess the conformance of metadata • Publication – Make transformed metadata schema available • Management – Ongoing management, administration and maintenance May 3, 2021 37
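A common implementation pattern for the matching and transformation steps is a crosswalk: a mapping from each source system's field names and code values onto the target schema. A minimal sketch, with invented source systems, field names and code lists:

```python
# Crosswalk from two source metadata schemas onto one target schema
CROSSWALKS = {
    "system_a": {"doc_title": "title", "descr": "description",
                 "lang_code": "language"},
    "system_b": {"name": "title", "summary": "description",
                 "language": "language"},
}
# Harmonise code lists as well as field names
LANGUAGE_CODES = {"en": "en", "english": "en", "ga": "ga", "irish": "ga"}

def harmonise(record: dict, source: str) -> dict:
    """Map one source metadata record onto the common target schema."""
    mapping = CROSSWALKS[source]
    target = {mapping[k]: v for k, v in record.items() if k in mapping}
    if "language" in target:
        # "und" is the ISO 639 code for an undetermined language
        target["language"] = LANGUAGE_CODES.get(
            str(target["language"]).lower(), "und")
    return target

print(harmonise({"doc_title": "Q1 Report", "lang_code": "EN"}, "system_a"))
print(harmonise({"name": "Q1 Report", "language": "Irish"}, "system_b"))
```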
  • 38. Metadata Concerns • No consistent schema and nomenclature being used • Each system will maintain different sets of metadata • No consistent set of values (vocabulary/dictionary/code lists) for metadata fields • Difficult to perform reliable comparisons across metadata May 3, 2021 38
  • 39. Metadata Data Catalog • Set of information (metadata) containing details on organisation information resources - datasets • Data catalog can be static or semi-static data structure created and maintained manually • Metadata is structured, consistent and indexed for fast and easy access and use • Contains descriptions of data resources • Enables user self-service data discovery and usage • Provides data discovery tools and facilities • Data catalog assists with implementing FAIR (Findable, Accessible, Interoperable, Reusable) data − Findable – details on data available on specific topics and subjects can be found easily and quickly − Accessible – underlying data can be accessed − Interoperable – metadata ensures data can be aggregated and integrated across data types − Reusable – detailed metadata ensures data can be reused in the future May 3, 2021 39
  • 40. Scope Of Wider Data Management May 3, 2021 40 Data Management Data Governance Data Architecture Management Data Development Data Operations Management Data Security Management Data Quality Management Data Integration Management Reference and Master Data Management Data Warehousing and Business Intelligence Management Document and Content Management Metadata Management
  • 41. Reference And Master Data Management • Reference and Master Data Management is the ongoing reconciliation and maintenance of reference data and master data − Reference Data Management is control over defined domain values (also known as vocabularies), including control over standardised terms, code values and other unique identifiers, business definitions for each value, business relationships within and across domain value lists, and the consistent, shared use of accurate, timely and relevant reference data values to classify and categorise data − Master Data Management is control over master data values to enable consistent, shared, contextual use across systems, of the most accurate, timely, and relevant version of truth about essential business entities • Reference data and master data provide the context for transaction data May 3, 2021 41
  • 42. Reference and Master Data Management – Definition and Goals • Definition − Planning, implementation, and control activities to ensure consistency with a golden version of contextual data values • Goals − Provide authoritative source of reconciled, high-quality master and reference data − Lower cost and complexity through reuse and leverage of standards − Support business intelligence and information integration efforts May 3, 2021 42
  • 43. Reference and Master Data Management May 3, 2021 43
  − Inputs: Business Drivers; Data Requirements; Policy and Regulations; Standards; Code Sets; Master Data; Transactional Data
  − Suppliers: Steering Committees; Business Data Stewards; Subject Matter Experts; Data Consumers; Standards Organisations; Data Providers
  − Tools: Reference Data Management Applications; Master Data Management Applications; Data Modeling Tools; Process Modeling Tools; Metadata Repositories; Data Profiling Tools; Data Cleansing Tools; Data Integration Tools; Business Process and Rule Engines; Change Management Tools
  − Participants: Data Stewards; Subject Matter Experts; Data Architects; Data Analysts; Application Architects; Data Governance Council; Data Providers; Other IT Professionals
  − Primary Deliverables: Master and Reference Data Requirements; Data Models and Documentation; Reliable Reference and Master Data; Golden Record Data Lineage; Data Quality Metrics and Reports; Data Cleansing Services
  − Metrics: Reference and Master Data Quality; Change Activity; Issues, Costs, Volume; Use and Re-Use; Availability; Data Steward Coverage
  − Consumers: Application Users; BI and Reporting Users; Application Developers and Architects; Data Integration Developers and Architects; BI Developers and Architects; Vendors, Customers, and Partners
  • 44. Reference And Master Data Management – Principles • Shared reference and master data belongs to the organisation, not to a particular application or department • Reference and master data management is an on-going data quality improvement program; its goals cannot be achieved by one project alone • Business data stewards are the authorities accountable for controlling reference data values. Business data stewards work with data professionals to improve the quality of reference and master data • Golden data values represent the organisation’s best efforts at determining the most accurate, current, and relevant data values for contextual use. New data may prove earlier assumptions to be false. Therefore, apply matching rules with caution, and ensure that any changes that are made are reversible • Replicate master data values only from the database of record • Request, communicate, and, in some cases, approve of changes to reference data values before implementation May 3, 2021 44
  • 45. Reference Data • Reference data is data used to classify or categorise other data • Business rules usually dictate that reference data values conform to one of several allowed values • In all organisations, reference data exists in virtually every database • Reference tables link via foreign keys into other relational database tables, and the referential integrity functions within the database management system ensure only valid values from the reference tables are used in other tables May 3, 2021 45
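Outside the database, the same referential check is easy to express during profiling. A sketch assuming pandas, with an invented country-code reference list:

```python
import pandas as pd

# Managed reference data values (the allowed domain)
reference_countries = {"IE", "FR", "DE"}

df = pd.DataFrame({"customer": ["A", "B", "C"],
                   "country": ["IE", "XX", "FR"]})

# Rows whose code is not in the controlled reference set
invalid = df[~df["country"].isin(reference_countries)]
print(invalid)  # customer B with country "XX"
```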
  • 46. Master Data • Master data is data about the business entities that provide context for business transactions • Master data is the authoritative, most accurate data available about key business entities, used to establish the context for transactional data • Master data values are considered golden • Master Data Management is the process of defining and maintaining how master data will be created, integrated, maintained, and used throughout the enterprise May 3, 2021 46
  • 47. Master Data Challenges • What are the important roles, organisations, places, and things referenced repeatedly? • What data is describing the same person, organisation, place, or thing? • Where is this data stored? What is the source for the data? • Which data is more accurate? Which data source is more reliable and credible? Which data is most current? • What data is relevant for specific needs? How do these needs overlap or conflict? • What data from multiple sources can be integrated to create a more complete view and provide a more comprehensive understanding of the person, organisation, place or thing? • What business rules can be established to automate master data quality improvement by accurately matching and merging data about the same person, organisation, place, or thing? • How do we identify and restore data that was inappropriately matched and merged? • How do we provide our golden data values to other systems across the enterprise? • How do we identify where and when data other than the golden values is used? May 3, 2021 47
  • 48. Understand Reference And Master Data Integration Needs • Reference and master data requirements are relatively easy to discover and understand for a single application • Potentially much more difficult to develop an understanding of these needs across applications, especially across the entire organisation • Analysing the root causes of a data quality problem usually uncovers requirements for reference and master data integration • Organisations that have successfully managed reference and master data typically have focused on one subject area at a time − Analyse all occurrences of a few business entities, across all physical databases and for differing usage patterns May 3, 2021 48
  • 49. Define and Maintain the Data Integration Architecture • Effective data integration architecture controls the shared access, replication, and flow of data to ensure data quality and consistency, particularly for reference and master data • Without data integration architecture, local reference and master data management occurs in application silos, inevitably resulting in redundant and inconsistent data • The selected data integration architecture should also provide common data integration services − Change request processing, including review and approval − Data quality checks on externally acquired reference and master data − Consistent application of data quality rules and matching rules − Consistent patterns of processing − Consistent metadata about mappings, transformations, programs and jobs − Consistent audit, error resolution and performance monitoring data − Consistent approach to replicating data • Establishing master data standards can be a time-consuming task as it may involve multiple stakeholders • Apply the same data standards, regardless of integration technology, to enable effective standardisation, sharing, and distribution of reference and master data May 3, 2021 49
  • 50. Data Integration Services Architecture May 3, 2021 50 Data Quality Management Metadata Management Integration Metadata Job Flow and Statistics Data Acquisition, File Management and Audit Replication Management Data Standardisation Cleansing and Matching Business Metadata Source Data Archives Rules Errors Staging Reconciled Master Data Subscriptions
  • 51. Implement Reference And Master Data Management Solutions • Reference and master data management solutions are complex • Given the variety, complexity, and instability of requirements, no single solution or implementation project is likely to meet all reference and master data management needs • Organisations should expect to implement reference and master data management solutions iteratively and incrementally through several related projects and phases May 3, 2021 51
  • 52. Define And Maintain Match Rules • Matching, merging, and linking of data from multiple systems about the same person, group, place, or thing is a major master data management challenge • Matching attempts to remove redundancy, to improve data quality, and provide information that is more comprehensive • Data matching is performed by applying inference rules − Duplicate identification match rules focus on a specific set of fields that uniquely identify an entity and identify merge opportunities without taking automatic action − Match-merge rules match records and merge the data from these records into a single, unified, reconciled, and comprehensive record − Match-link rules identify and cross-reference records that appear to relate to a master record without updating the content of the cross-referenced record May 3, 2021 52
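A sketch of a duplicate-identification rule built on simple string similarity from the Python standard library. Real MDM match rules are tuned, multi-attribute and source-specific; the records and the 0.7 threshold here are arbitrary illustrations:

```python
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Alan McSweeney", "city": "Dublin"},
    {"id": 2, "name": "A. McSweeney",   "city": "Dublin"},
    {"id": 3, "name": "Joan Murphy",    "city": "Cork"},
]

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Duplicate identification: flag candidate pairs, take no automatic action
for r1, r2 in combinations(records, 2):
    score = similarity(r1["name"], r2["name"])
    if r1["city"] == r2["city"] and score > 0.7:
        print(f"Possible match: {r1['id']} <-> {r2['id']} (score {score:.2f})")
```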
  • 53. Vocabulary Management And Reference Data • A vocabulary is a collection of terms / concepts and their relationships • Vocabulary management is defining, sourcing, importing, and maintaining a vocabulary and its associated reference data − See ANSI/NISO Z39.19 - Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies - http://www.niso.org/kst/reports/standards?step=2&gid=&project_key=7cc9b583cb5a62e8c15d3099e0bb46bbae9cf38a • Vocabulary management requires the identification of the standard list of preferred terms and their synonyms • Vocabulary management requires data governance, enabling data stewards to assess stakeholder needs May 3, 2021 53
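A minimal sketch of a controlled vocabulary with preferred terms and synonyms, illustrating the normalisation it enables; the terms are invented:

```python
# Preferred term -> accepted synonyms
VOCABULARY = {
    "motor vehicle": {"car", "automobile", "auto"},
    "bicycle": {"bike", "push bike"},
}
# Invert for lookup: any accepted term -> its preferred term
LOOKUP = {syn: preferred
          for preferred, syns in VOCABULARY.items()
          for syn in syns | {preferred}}

def normalise(term: str) -> str:
    """Map a free-text term to its preferred vocabulary term, if known."""
    return LOOKUP.get(term.strip().lower(), term)

print(normalise("Automobile"))  # -> motor vehicle
print(normalise("tricycle"))    # unknown term passes through unchanged
```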
  • 54. Vocabulary Management And Reference Data • Key questions to ask to enable vocabulary management − What information concepts (data attributes) will this vocabulary support? − Who is the audience for this vocabulary? What processes do they support, and what roles do they play? − Why is the vocabulary needed? Will it support applications, content management, analytics, and so on? − Who identifies and approves the preferred vocabulary and vocabulary terms? − What are the current vocabularies different groups use to classify this information? Where are they located? How were they created? Who are their subject matter experts? Are there any security or privacy concerns for any of them? − Are there existing standards that can be leveraged to fulfil this need? Are there concerns about using an external standard vs. internal? How frequently is the standard updated and what is the degree of change of each update? Are standards accessible in an easy-to-import/maintain format in a cost-efficient manner? May 3, 2021 54
  • 55. Defining Golden Master Data Values • Golden data values are the data values thought to be the most accurate, current, and relevant for shared, consistent use across applications • Determine golden values by analysing data quality, applying data quality rules and matching rules, and incorporating data quality controls into the applications that acquire, create, and update data • Establish data quality measurements to set expectations, measure improvements, and help identify root causes of data quality problems • Assess data quality through a combination of data profiling activities and verification against adherence to business rules • Once the data is standardised and cleansed, the next step is to attempt reconciliation of redundant data through application of matching rules
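A sketch of one simple survivorship rule, prefer the most recent non-empty value per attribute, applied to records already matched as the same entity. The rule and the fields are illustrative; real golden-record rules are source- and attribute-specific and, as noted above, should be reversible:

```python
from datetime import date

matched = [  # records already matched as the same customer
    {"source": "crm",     "updated": date(2021, 4, 1),
     "email": "a@example.com", "phone": ""},
    {"source": "billing", "updated": date(2021, 2, 1),
     "email": "",              "phone": "+353 1 555 0100"},
]

def golden_record(records: list) -> dict:
    """The most recent non-empty value wins for each attribute."""
    golden = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if field not in ("source", "updated") and value:
                golden[field] = value  # later records overwrite earlier ones
    return golden

print(golden_record(matched))
# {'phone': '+353 1 555 0100', 'email': 'a@example.com'}
```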
  • 56. Define And Maintain Hierarchies And Affiliations • Vocabularies and their associated reference data sets are often more than lists of preferred terms and their synonyms • Affiliation management is the establishment and maintenance of relationships between master data records
  • 57. Plan And Implement Integration Of New Data Sources • Integrating new reference data sources involves − Receiving and responding to new data acquisition requests from different groups − Performing data quality assessment services using data cleansing and data profiling tools − Assessing data integration complexity and cost − Piloting the acquisition of data and its impact on match rules − Determining who will be responsible for data quality − Finalising data quality metrics
  • 58. Replicate And Distribute Reference And Master Data • Reference and master data may be read directly from a database of record, or may be replicated from the database of record to other application databases for transaction processing, and data warehouses for business intelligence • Reference data most commonly appears as pick list values in applications • Replication aids maintenance of referential integrity
  • 59. Manage Changes To Reference And Master Data • Specific individuals have the role of a business data steward with the authority to create, update, and retire reference data • Formally control changes to controlled vocabularies and their reference data sets • Carefully assess the impact of reference data changes
  • 60. Data Governance And MDM Success Factors • Master Data Management will support business by providing a strategy, governance policies and technologies for customer, product, and entitlement information by following the Master Data Management Guiding Principles − Master data management will use (and where needed create) a “single version of the truth” for customer, product, and asset entitlement master data consolidated into a single master data system − Master data management will establish standard data definitions and consistent usage to simplify business processes across enterprise systems − Master data management systems and processes will be flexible and adaptable to handle domestic and global expansion to support growth in both established and emerging markets − Master data management will adhere to a standards governance process to ensure key data elements are created, maintained, cleansed and converted to be syndicated across enterprise systems − Master data management will identify responsibilities and monitor accountability for customer, product, and entitlement information − Master data management will facilitate cross-functional collaboration and manage continuous improvement of master data for customer, product, and entitlement domains
  • 61. Data Governance is Not A Choice – It Is A Necessity • “We’ve got to stop having the ‘who owns the data?’ conversation.” • “We can’t do MDM if we don’t formalise decision-making processes around our enterprise information.” • “Fixing the data in a single system is pointless; we don’t know what the rules are across our systems.” • “Everyone agrees data quality is poor, but no one can agree on how to fix it.” • “Are you kidding? We have multiple versions of the single version of the truth.”
  • 62. MDM Program Critical Success Factors • Strategy − Drive and promote alignment with corporate strategic initiatives and pillar-specific goals − Definition of criteria and core attributes that define domains and related objects • Solution − Alignment with corporate strategic initiatives and pillar-specific goals − Identification of “Quick Wins” that have measurable impact − Clear definition of metrics for measuring data improvement − Leading industry practices have been incorporated into the solution design • Governance − Executive ownership and a Governance organisation have been rationalised and established to address federated data management needs − Data quality is addressed at all points of processes, as well as customer and product lifecycles • End-to-end Roadmap − Prioritised program roadmap for “Quick Wins” − Prioritised program roadmap for CDM strategic initiatives − Fully vetted CBA for each roadmap item − “No Regrets” actions are rationalised and aligned with the strategic roadmap