Choosing the Right
Data Architecture
for Your Big Data Projects
Presentation 1
“There isn’t a cluster big enough to hold your ego!”
Choosing the Right Data Architecture
for Your Big Data Projects
AGENDA
Choosing the Right Data Architecture
for Your Big Data Projects
Acknowledgements
Planning Your Enterprise Data Strategy
John Ladley
President
IMCue Solutions
Metrics for Information Management
Business Analysis Techniques for Data Professionals
Alec Sharp
Senior Consultant
Clariteq Systems Consulting
Steps to a Successful Enterprise Information Management Program
Michael F. Jennings
Executive Director - Data Governance
Walgreens
Meta Data Requirements for the Enterprise
David Loshin
President
Knowledge Integrity
Advanced MDM: Moving to the Next Level of MDM Success
Choosing the Right Data Architecture
for Your Big Data Projects
Acknowledgements
Choosing a Big Data Platform
Big Data
Platform
Relational(SQL)
Big Data
Platform
Choosing a Big Data Platform
The NoSQL landscape diagram groups Big Data Platform categories such as Data Grid, Graph, NewSQL, and Analytics/MPP.
http://arnon.me/2012/11/nosql-landscape-diagrams/
Key Ideas
One Big Data database cannot accommodate all the Big Data types.
One size DOES NOT fit all.
You need to know the data type and data architecture to select the most appropriate Big Data database.
Choosing a Big Data Architecture
Big Data
Platform
Big Data
Architecture
What is Big Data?
Big Data is about textual analytics (deriving data from unstructured content), not dimension or fact tables:
Web data: click-stream data, social network data
Semi-structured data: email
Unstructured content: comments, tweets, text messages
Sensor data
Vertical industries’ structured transaction data
Choosing a Big Data Architecture
What do we need to consider when classifying Big Data?
Analysis Type: Real Time; Batch
Processing Methodology: Predictive Analytics; Analytical; Querying & Reporting; Misc.
Data Type: Meta Data; Master Data; Historical; Transactional; Structured; Semi-Structured; Un-Structured
Data Frequency: On Demand Feeds; Continuous Feeds; Real Time Feeds; Time Series
Data Sources: Web and Social Media; Machine Generated; Human Generated; Internal Data Sources; Transaction Data; Biometric Data (via Data Providers or via Data Originators)
Data Consumers: Human; Business Process; Other Enterprise Applications; Other Data Repositories
Hardware: Commodity Hardware; State of the Art Hardware
Choosing a Big Data Architecture
Classify Big Data Type According to the Business Needs
Big data business problems by type

Business problem: Utilities: Predict power consumption
Big Data type: Machine-generated data
Description: Utility companies have rolled out smart meters to measure the consumption of water, gas, and electricity at regular intervals of one hour or less. These smart meters generate huge volumes of interval data that needs to be analyzed. Utilities also run big, expensive, and complicated systems to generate power. Each grid includes sophisticated sensors that monitor voltage, current, frequency, and other important operating characteristics. To gain operating efficiency, the company must monitor the data delivered by the sensors. A big data solution can analyze power generation (supply) and power consumption (demand) data using smart meters.

Business problem: Telecommunications: Customer churn analytics
Big Data type: Web and social data; Transaction data
Description: Telecommunications operators need to build detailed customer churn models that include social media and transaction data, such as CDRs, to keep up with the competition. The value of the churn models depends on the quality of customer attributes (customer master data such as date of birth, gender, location, and income) and the social behavior of customers. Providers who implement a predictive analytics strategy can manage and predict churn by analyzing the calling patterns of subscribers.

Business problem: Marketing: Sentiment analysis
Big Data type: Web and social data
Description: Marketing departments use Twitter feeds to conduct sentiment analysis to determine what users are saying about the company and its products or services, especially after a new product or release is launched. Customer sentiment must be integrated with customer profile data to derive meaningful results; customer feedback may vary according to customer demographics.
Choosing a Big Data Architecture
Big data business problems by type (continued)

Business problem: Customer service: Call monitoring
Big Data type: Human-generated data
Description: IT departments are turning to big data solutions to analyze application logs to gain insight that can improve system performance. Log files from various application vendors are in different formats; they must be standardized before IT departments can use them.

Business problem: Retail: Personalized messaging based on facial recognition and social media
Big Data type: Web and social data; Biometrics
Description: Retailers can use facial recognition technology in combination with a photo from social media to make personalized offers to customers based on buying behavior and location. This capability could have a tremendous impact on retailers’ loyalty programs, but it has serious privacy ramifications. Retailers would need to make the appropriate privacy disclosures before implementing these applications.

Business problem: Retail and marketing: Mobile data and location-based targeting
Big Data type: Machine-generated data; Transaction data
Description: Retailers can target customers with specific promotions and coupons based on location data. Solutions are typically designed to detect a user's location upon entry to a store or through GPS. Location data combined with customer preference data from social networks enables retailers to target online and in-store marketing campaigns based on buying history. Notifications are delivered through mobile applications, SMS, and email.

Business problem: FSS, Healthcare: Fraud detection
Big Data type: Machine-generated data; Transaction data; Human-generated data
Description: Fraud management predicts the likelihood that a given transaction or customer account is experiencing fraud. Solutions analyze transactions in real time and generate recommendations for immediate action, which is critical to stopping third-party fraud, first-party fraud, and deliberate misuse of account privileges. Solutions are typically designed to detect and prevent myriad fraud and risk types across multiple industries, including: credit and debit payment card fraud, deposit account fraud, technical fraud, bad debt, healthcare fraud, Medicaid and Medicare fraud, property and casualty insurance fraud, worker compensation fraud, insurance fraud, and telecommunications fraud.
Classify Big Data Type According to the Business Needs
Key Idea
There are guidelines to help suggest the Big Data Types that are
commonly used by each industry.
Choosing a Big Data Architecture
Classify Big Data Type According to the Business Needs
Validate that the data being collected has business value.
Critical Success Factor
55% of Big Data projects don’t get completed,
…and many others fall short of their objectives.
http://www.infochimps.com/resources/report-cios-big-data-what-your-it-team-wants-you-to-know-6/
Report: CIOs & Big Data: What Your IT Team Wants You to Know
Choosing a Big Data Architecture
Big Data
Platform
Big Data
Architecture
Big Data
Business
Needs
by type
Ten Big Data Schemas
Big Data
Architecture
Ten Big Data Schemas: Relational - Graph
A graph database stores data in a graph, the most generic of data structures, capable of
elegantly representing any kind of data in a highly accessible way.
Graph databases can make a difference in harvesting more value in your data
by looking at its relationships.
Provides index-free adjacency where every element contains a direct pointer to its
adjacent elements and no index lookups are necessary.
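Index-free adjacency can be sketched in a few lines: each node holds direct references to its neighbors, so a traversal follows pointers instead of doing index lookups. The Node class and names here are a hypothetical illustration, not any particular graph database's API.

```python
class Node:
    """Graph node holding direct references to adjacent nodes (index-free adjacency)."""
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct pointers to adjacent elements; no index needed

    def connect(self, other):
        self.neighbors.append(other)

# Build a tiny social graph: alice -> bob -> carol
alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.connect(bob)
bob.connect(carol)

def friends_of_friends(node):
    """Traverse relationships by following pointers directly."""
    return [fof.name for friend in node.neighbors for fof in friend.neighbors]

print(friends_of_friends(alice))  # ['carol']
```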
Ten Big Data Schemas: Relational - Analytics / MPP Columnar
Column-oriented storage organization, which increases performance of
sequential record access at the expense of common transactional operations
such as single record retrieval, updates, and deletes
Shared nothing architecture, which reduces system contention for shared
resources and allows gradual degradation of performance in the face of hardware
failure
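A small illustrative sketch (tied to no particular product, with made-up data) of why column-oriented storage favors sequential analytic scans over single-record operations:

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 175},
]

# Column-oriented layout: each attribute is its own contiguous array.
columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [100, 250, 175],
}

# Sequential/analytic access: scan exactly one column.
total_sales = sum(columns["sales"])

# Single-record retrieval: must stitch one value out of every column,
# which is the access pattern columnar stores trade away.
record_1 = {name: values[1] for name, values in columns.items()}
```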
Ten Big Data Schemas: Relational - Analytics / MPP
Delivers extreme performance and scalability for all your database applications, including Online Transaction Processing (OLTP), data warehousing (DW), and mixed workloads.
Ten Big Data Schemas: Relational - NewSQL
Scales out relational databases by virtualizing a distributed database environment.
Provides organizations relational data integrity combined with the scalability and flexibility of a modern distributed, multi-site database, to support an unlimited number of users, larger data volumes, and extremely high TPS.
Ten Big Data Schemas: PolyStructured - Document Indexing
Provides full-text search, hit highlighting, faceted search, dynamic clustering, database
integration, and rich document (e.g., Word, PDF) handling.
Provides distributed search and index replication
Highly scalable
Ten Big Data Schemas: PolyStructured - Document
Document databases completely embrace the web.
Store data with JSON documents.
Access documents and query indexes with web browsers, via HTTP.
Index, combine, and transform documents with JavaScript.
Works well with modern web and mobile apps.
Serve web apps directly.
On-the-fly document transformation and real-time change notifications
Document databases lack a schema or rigid pre-defined data structures such as tables.
Data stored in document databases is commonly held as JSON documents.
JavaScript is used for MapReduce indexes.
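The document-store pattern above can be sketched as schema-less JSON documents plus a map function that builds a view index, in the spirit of CouchDB-style JavaScript MapReduce views (shown here in Python for consistency; the store and function names are invented):

```python
import json

# In-memory "document store": schema-less JSON documents keyed by id.
store = {
    "d1": json.loads('{"type": "post", "author": "ann", "likes": 3}'),
    "d2": json.loads('{"type": "post", "author": "bob", "likes": 7}'),
    "d3": json.loads('{"type": "comment", "author": "ann"}'),
}

def map_by_author(doc_id, doc, emit):
    """Map function in the MapReduce-view style: emit (key, value) pairs."""
    emit(doc.get("author"), doc_id)

def build_view(store, map_fn):
    """Run the map function over every document to build an index."""
    index = {}
    for doc_id, doc in store.items():
        map_fn(doc_id, doc, lambda k, v: index.setdefault(k, []).append(v))
    return index

view = build_view(store, map_by_author)
print(sorted(view["ann"]))  # ['d1', 'd3']
```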
Ten Big Data Schemas: PolyStructured - Key Value Store - In-Memory - Data Grid
In-Memory Accelerator for Apache Hadoop, high-performance computing, streaming, and databases (HDFS and MongoDB).
Eliminates MapReduce overhead.
Dynamically caches, partitions, replicates, and manages application data and business logic across multiple servers.
Fully elastic memory-based storage grid: virtualizes the free memory of a potentially large number of Java virtual machines and makes them behave like a single key-addressable storage pool for application state.
IBM WebSphere eXtreme Scale
Ten Big Data Schemas: PolyStructured - Key Value Store - In-Memory - Caching
Run atomic operations like appending to a string; incrementing the value in a hash; pushing
to a list; computing set intersection, union and difference; or getting the member with
highest ranking in a sorted set.
With an in-memory dataset, depending on your use case, you can persist it either by
dumping the dataset to disk every once in a while, or by appending each command to
a log.
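The atomic operations listed above can be sketched with a toy store: this is not Redis itself, and the method names only mirror Redis-style commands, but a lock makes each operation behave atomically.

```python
import threading

class InMemoryStore:
    """Toy in-memory key-value store with Redis-style atomic operations."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # each operation runs atomically

    def append(self, key, value):
        """Atomically append to a string; returns the new length."""
        with self._lock:
            self._data[key] = self._data.get(key, "") + value
            return len(self._data[key])

    def hincrby(self, key, field, amount=1):
        """Atomically increment a value in a hash."""
        with self._lock:
            h = self._data.setdefault(key, {})
            h[field] = h.get(field, 0) + amount
            return h[field]

    def sadd(self, key, *members):
        """Add members to a set."""
        with self._lock:
            self._data.setdefault(key, set()).update(members)

    def sinter(self, key_a, key_b):
        """Compute set intersection."""
        with self._lock:
            return self._data.get(key_a, set()) & self._data.get(key_b, set())

store = InMemoryStore()
store.append("log", "start;")
store.append("log", "stop;")
store.hincrby("page:views", "/home")
store.sadd("a", 1, 2, 3)
store.sadd("b", 2, 3, 4)
print(store.sinter("a", "b"))  # {2, 3}
```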
Ten Big Data Schemas: PolyStructured - Key Value Store - Columnar
Random, real time read/write access to your Big Data
Hosting of very large tables -- billions of rows X millions of columns --
atop clusters of commodity hardware
Ten Big Data Schemas: PolyStructured - Distributed File System
Storage and large-scale processing of data-sets on clusters of commodity hardware.
Distributed, scalable, and portable file-system
Key Ideas
Hadoop is the #1 distributed file system used for Big Data Projects
Hadoop is used as the shared data source platform to merge and
standardize big data with legacy data
Data as a Service
Single System Management
APIs
Applications (APIs) should be based on a single data source platform.
The single shared data source platform ingests all of the data source types (web and social media, machine-generated, human-generated, internal, transaction, and biometric data), via data providers or data originators.
Key Ideas
Hadoop is the #1 distributed file system used for Big Data Projects
Hadoop is used as the shared data source platform to merge and
standardize big data with legacy data
Hadoop is an excellent choice to start building your shared data source
platform
Hadoop can become your System of Record (SOR) for Big Data and part of
your Master Data Management system (MDM)
The date time format must be standardized across the
data platform
Critical Success Factors
International Standard ISO 8601 specifies numeric representations of date and time.
The format YYYY-MM-DDThh:mm:ss.sTZD (e.g., 1997-07-16T19:20:30.45+01:00) is suggested and preferred.
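Python's standard library can produce and parse this ISO 8601 shape directly; a small sketch using the slide's example timestamp:

```python
from datetime import datetime, timezone, timedelta

# Format a timestamp in the recommended ISO 8601 form:
# YYYY-MM-DDThh:mm:ss.sTZD, e.g. 1997-07-16T19:20:30.45+01:00
tz = timezone(timedelta(hours=1))
ts = datetime(1997, 7, 16, 19, 20, 30, 450000, tzinfo=tz)
print(ts.isoformat())  # 1997-07-16T19:20:30.450000+01:00

# Parsing the standardized form back is symmetric, which is what makes
# a single date-time format workable across the data platform.
parsed = datetime.fromisoformat("1997-07-16T19:20:30.450000+01:00")
assert parsed == ts
```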
Unique identifiers (domain keys) must be clearly
described using friendly terminology
For example:
‘ID’ should never be a column name
‘Sales ID’ is too generic
‘Sales Representative Reporting ID’ is friendly
and clearly named
Key Idea
Hadoop is used as the shared analytical platform to merge and
standardize analytics
Single System Management
Analytics should be based on a single data source platform.
Analytics as a Service
IBM WebSphere eXtreme Scale
Key Ideas
Hadoop is used as the shared analytical platform to merge and standardize
analytics
There are guidelines to help suggest the analytics, KPIs, and profit drivers for Big Data that are commonly used by each industry.
Examples of tasks and algorithms to use (2)

Predicting a discrete attribute
•Flag the customers in a prospective buyers list as good or poor prospects.
•Calculate the probability that a server will fail within the next 6 months.
•Categorize patient outcomes and explore related factors.
Algorithms: Decision Trees, Naive Bayes, Clustering, Neural Network

Predicting a continuous attribute
•Forecast next year's sales.
•Predict site visitors given past historical and seasonal trends.
•Generate a risk score given demographics.
Algorithms: Decision Trees, Time Series, Linear Regression

Predicting a sequence
•Perform clickstream analysis of a company's Web site.
•Analyze the factors leading to server failure.
•Capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities.
Algorithms: Sequence Clustering

Finding groups of common items in transactions
•Use market basket analysis to determine product placement.
•Suggest additional products to a customer for purchase.
•Analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities.
Algorithms: Association, Decision Trees

Finding groups of similar items
•Create patient risk profile groups based on attributes such as demographics and behaviors.
•Analyze users by browsing and buying patterns.
•Identify servers that have similar usage characteristics.
Algorithms: Clustering, Sequence Clustering
Key Ideas
Hadoop is used as the shared analytical platform to merge and standardize
analytics
There are guidelines to help suggest the analytics, KPIs, and profit drivers for Big Data that are commonly used by each industry.
You do not need to know how an algorithm works or is designed; you only need to know the parameters needed to run it.
Task: Market Basket Analysis
Description: Discover items sold together to create recommendations on the fly and to determine how product placement can directly contribute to your bottom line.
Algorithms: Association, Decision Trees

Task: Churn Analysis
Description: Anticipate customers who may be considering canceling their service and identify the benefits that will keep them from leaving.
Algorithms: Decision Trees, Linear Regression, Logistic Regression

Task: Market Analysis
Description: Define market segments by automatically grouping similar customers together. Use these segments to seek profitable customers.
Algorithms: Clustering, Sequence Clustering

Task: Forecasting
Description: Predict sales and inventory amounts and learn how they are interrelated to foresee bottlenecks and improve performance.
Algorithms: Decision Trees, Time Series

Task: Data Exploration
Description: Analyze profitability across customers, or compare customers that prefer different brands of the same product to discover new opportunities.
Algorithms: Neural Network

Task: Unsupervised Learning
Description: Identify previously unknown relationships between various elements of your business to inform your decisions.
Algorithms: Neural Network

Task: Web Site Analysis
Description: Understand how people use your Web site and group similar usage patterns to offer a better experience.
Algorithms: Sequence Clustering

Task: Campaign Analysis
Description: Spend marketing funds more effectively by targeting the customers most likely to respond to a promotion.
Algorithms: Decision Trees, Naïve Bayes, Clustering

Task: Information Quality
Description: Identify and handle anomalies during data entry or data loading to improve the quality of information.
Algorithms: Linear Regression, Logistic Regression

Task: Text Analysis
Description: Analyze feedback to find common themes and trends that concern your customers or employees, informing decisions with unstructured input.
Algorithms: Text Mining
Data Mining Tasks (4)
Data Mining Algorithms
(Analysis Services - Data Mining)
Choosing an Algorithm by Task
To help you select an algorithm for use with a specific task, the following table provides suggestions for the types of tasks for which each algorithm is traditionally used.
Examples of tasks and Microsoft algorithms to use

Predicting a discrete attribute
•Flag the customers in a prospective buyers list as good or poor prospects.
•Calculate the probability that a server will fail within the next 6 months.
•Categorize patient outcomes and explore related factors.
Algorithms: Microsoft Decision Trees, Microsoft Naive Bayes, Microsoft Clustering, Microsoft Neural Network

Predicting a continuous attribute
•Forecast next year's sales.
•Predict site visitors given past historical and seasonal trends.
•Generate a risk score given demographics.
Algorithms: Microsoft Decision Trees, Microsoft Time Series, Microsoft Linear Regression

Predicting a sequence
•Perform clickstream analysis of a company's Web site.
•Analyze the factors leading to server failure.
•Capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities.
Algorithms: Microsoft Sequence Clustering

Finding groups of common items in transactions
•Use market basket analysis to determine product placement.
•Suggest additional products to a customer for purchase.
•Analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities.
Algorithms: Microsoft Association, Microsoft Decision Trees

Finding groups of similar items
•Create patient risk profile groups based on attributes such as demographics and behaviors.
•Analyze users by browsing and buying patterns.
•Identify servers that have similar usage characteristics.
Algorithms: Microsoft Clustering, Microsoft Sequence Clustering
Analytic Algorithm Categories
Regression: a powerful and commonly used algorithm that evaluates the relationship of one variable, the dependent variable, with one or more other variables, called independent variables. By measuring exactly how large and significant each independent variable has historically been in its relation to the dependent variable, the future value of the dependent variable can be estimated. Regression models are widely used in applications such as seasonal forecasting, quality assurance, and credit risk analysis.
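The idea can be made concrete with a minimal ordinary-least-squares fit of one independent variable; the numbers below are invented for illustration:

```python
# Fit y = slope * x + intercept by ordinary least squares, then use the
# fitted relationship to estimate the dependent variable at a new x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]  # roughly y = 2x

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

# Estimate the future value of the dependent variable at x = 5.
forecast = slope * 5.0 + intercept
```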
Clustering / Segmentation: the process of grouping items together to form categories. You might look at a large collection of shopping baskets and discover that they are clustered corresponding to health food buyers, convenience food buyers, luxury food buyers, and so on. Once these characteristics have been grouped together, they can be used to find other customers with similar characteristics. This algorithm is used to create groups for applications such as customers for marketing campaigns, rate groups for insurance products, and crime statistics groups for law enforcement.
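A toy nearest-centroid segmentation of basket totals sketches the idea (one refinement step in the k-means style; the spend values and initial centroids are invented):

```python
# Segment customers by basket total: assign each total to the closest
# centroid, then refine each centroid to the mean of its members.
totals = [12, 15, 14, 80, 85, 300, 310]
centroids = [10.0, 100.0, 320.0]  # initial guesses for three spend levels

def assign(totals, centroids):
    """Group each total with its nearest centroid."""
    clusters = {i: [] for i in range(len(centroids))}
    for t in totals:
        nearest = min(range(len(centroids)), key=lambda i: abs(t - centroids[i]))
        clusters[nearest].append(t)
    return clusters

clusters = assign(totals, centroids)
# One refinement step: move each centroid to the mean of its cluster.
centroids = [sum(m) / len(m) for m in clusters.values() if m]
```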
Nearest Neighbor: quite similar to clustering, but it only looks at other records in the dataset that are “nearest” to a chosen unclassified record based on a “similarity” measure. Records that are “near” to each other tend to have similar predictive values as well. Thus, if you know the prediction value of one of the records, you can predict its nearest neighbor. This algorithm works similarly to the way that people think: by detecting closely matching examples. Nearest Neighbor applications are often used in retail and life sciences.
Association Rules: detects related items in a dataset. Association analysis identifies and groups together similar records that would otherwise go unnoticed by a casual observer. This type of analysis is often used for market basket analysis to find popular bundles of products that are related by transaction, such as low-end digital cameras being associated with smaller-capacity memory sticks to store the digital images.
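The market-basket example can be sketched by counting co-occurring item pairs; the baskets here are invented, and real association-rule mining (e.g., Apriori) adds support and confidence thresholds on top of counts like these:

```python
from itertools import combinations
from collections import Counter

# Count item pairs that co-occur in baskets; frequent pairs suggest
# associations such as "camera -> memory stick".
baskets = [
    {"camera", "memory stick", "case"},
    {"camera", "memory stick"},
    {"case", "tripod"},
    {"camera", "memory stick", "tripod"},
]

pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)
top_pair, support = pair_counts.most_common(1)[0]
print(top_pair, support)  # ('camera', 'memory stick') 3
```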
Decision Tree: a tree-shaped graphical predictive algorithm that represents alternative sequential decisions and the possible outcomes for each decision. This algorithm provides the alternative actions that are available to the decision maker, the probabilistic events that follow from and affect these actions, and the outcomes that are associated with each possible scenario of actions and consequences. Applications range from credit card scoring to time series predictions of exchange rates.
Sequence Association: detects causality and association between time-ordered events, although the associated events may be spread far apart in time and may seem unrelated. Tracking specific time-ordered records and linking these records to a specific outcome allows companies to predict a possible outcome based on a few occurring events. A sequence model can be used to reduce the number of clicks customers have to make when navigating a company’s website.
Neural Network: a sophisticated pattern detection algorithm that uses machine learning techniques to generate predictions. This technique models itself after the process of cognitive learning and the neurological functions of the brain, and is capable of predicting new observations from other known observations. Neural networks are very powerful, complex, and accurate predictive models used in detecting fraudulent behavior, predicting the movement of stocks and currencies, and improving the response rates of direct marketing campaigns.
Choosing a Big Data Architecture
Big Data
Platform
Big Data
Analytical
Platform
Big Data
Analytics
Big Data
Business
Needs
by type
Big Data
Architecture
Analytics Data Sources
Analytics should be based on a single data source platform.
Analytics as a Service
IBM WebSphere eXtreme Scale
When you write data to a traditional database, either through loading external data,
writing the output of a query, doing UPDATE statements, etc., the database has total
control over the storage. The database is the "gatekeeper." An important implication
of this control is that the database can enforce the schema as data is written. This is
called schema on write.
Hive has no such control over the underlying storage. There are many ways to create,
modify, and even damage the data that Hive will query. Therefore, Hive can only
enforce queries on read. This is called schema on read.
So what if the schema doesn’t match the file contents? Hive does the best that it can
to read the data. You will get lots of null values if there aren’t enough fields in each
record to match the schema. If some fields are numbers and Hive encounters
nonnumeric strings, it will return nulls for those fields. Above all else, Hive tries to
recover from all errors as best it can.
http://www.sqlbiinfo.com/2014/02/schema-on-read-vs-schema-on-write.html
Schema on Read vs Schema on Write...
Analytics as a Service
Benefits of schema on write:
• Better type safety and data cleansing done for the data at rest
• Typically more efficient (storage size and computationally) since the data is already parsed
Downsides of schema on write:
• You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)
• Typically you throw away the original data, which could be bad if you have a bug in your ingest process
• It's harder to have different views of the same data
Benefits of schema on read:
• Flexibility in defining how your data is interpreted at load time
• This gives you the ability to evolve your "schema" as time goes on
• This allows you to have different versions of your "schema"
• This allows the original source data format to change without having to consolidate to one data format
• You get to keep your original data
• You can load your data before you know what to do with it (so you don't drop it on the ground)
• Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data
Downsides of schema on read:
• Generally it is less efficient because you have to reparse and reinterpret the data every time (this can be expensive with
formats like XML)
• The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)
• More error prone and your analytics have to account for dirty data
 
http://nosql.mypopescu.com/post/48638541973/schema-on-writes-vs-schema-on-reads-apache-hadoop-and
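A minimal Python sketch of the schema-on-read behavior the Hive passage describes: the raw lines are stored untouched, the schema is applied only at query time, and mismatches yield nulls rather than load-time failures. The file contents and schema here are hypothetical.

```python
# Raw storage keeps whatever arrived; nothing enforces the schema on write.
raw_lines = [
    "1,widget,19.99",
    "2,gadget,oops",   # nonnumeric price -> null on read
    "3,gizmo",         # missing field   -> null on read
]
schema = [("id", int), ("name", str), ("price", float)]

def read_with_schema(line, schema):
    """Apply the schema at read time, recovering from mismatches with nulls."""
    fields = line.split(",")
    record = {}
    for i, (name, cast) in enumerate(schema):
        try:
            record[name] = cast(fields[i])
        except (ValueError, IndexError):
            record[name] = None  # recover instead of failing the whole read
    return record

records = [read_with_schema(line, schema) for line in raw_lines]
```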
Reporting users make their own schemas and naming standards.
Reporting users run their own analytics, as many times as they want.
Key Ideas - Summary
One Big Data database cannot accommodate all the Big Data types
You need to know the data type and data architecture to select the most
appropriate Big Data database.
There are guidelines to help suggest the Big Data Types that are commonly
used by each business type.
Hadoop is used as the shared data source platform to merge and
standardize big data with legacy data
Hadoop is used as the shared analytical platform to merge and standardize
analytics
Hadoop is an excellent choice to start building your shared data source
platform
Hadoop can become your System of Record (SOR) for Big Data and part of
your Master Data Management system (MDM)
Hadoop is used to standardize and centralize the Key Performance
Indicators (KPI) and Profit Drivers for an Enterprise Analytical Platform
There are guidelines to help suggest the analytics, KPIs, and profit drivers for Big Data that are commonly used by each industry.
Schema on read
Critical Success Factors - Summary
Validate that the data being collected has business value.
The date time format must be standardized across the data platform.
Unique identifiers (domain keys) must be clearly described using friendly
terminology
1) Pervasive Insights Produce Better Business Decisions: opening access to business intelligence by embedding analytics capabilities into everyday software tools pays substantial dividends. By Lauren Gibbons Paul
2) Data Mining Algorithms (Analysis Services - Data Mining)
http://msdn.microsoft.com/en-us/library/ms175595.aspx
3) Data Mining Query Task
http://msdn.microsoft.com/en-us/library/ms141728.aspx
4) Predictive Analysis with SQL Server 2008 - White Paper - Microsoft - Published: November 2007
5) Predictive Analytics for the Retail Industry - White Paper - Microsoft - Writer: Matt Adams
Technical Reviewer: Roni Karassik, Published: May 2008
6) Breakthrough Insights using Microsoft SQL Server 2012 - Analysis Services
https://www.microsoftvirtualacademy.com/tracks/breakthrough-insights-using-microsoft-sql-server-2012-a
7) Useful DAX Starter Functions and Expressions
http://thomasivarssonmalmo.wordpress.com/category/powerpivot-and-dax/
8) Stairway to PowerPivot and DAX - Level 1: Getting Started with PowerPivot and DAX
By Bill_Pearson, 2011/12/21
9) Data Mining Tool
http://technet.microsoft.com/en-us/library/ms174467.aspx
10) DAX Cheat Sheet
http://powerpivot-info.com/post/439-dax-cheat-sheet
11) Big Data Landscape - http://arnon.me/2012/11/nosql-landscape-diagrams/
References
On the Internet, the World Wide Web Consortium (W3C) uses ISO 8601 in defining a profile of the standard that restricts the supported date and time formats to reduce the chance of error and the complexity of software.
RFC 3339 defines a profile of ISO 8601 for use in Internet protocols and standards. It explicitly excludes durations and dates before the common era. The more complex formats such as week numbers and ordinal days are not permitted.
RFC 3339 deviates from ISO 8601 in allowing a zero timezone offset to be specified as "-00:00", which ISO 8601 forbids. RFC 3339 intends "-00:00" to carry the
connotation that it is not stating a preferred timezone, whereas the conforming "+00:00" or any non-zero offset connotes that the offset being used is
preferred. This convention regarding "-00:00" is derived from earlier RFCs, such as RFC 2822 which uses it for timestamps in email headers. RFC 2822 made no
claim that any part of its timestamp format conforms to ISO 8601, and so was free to use this convention without conflict. RFC 3339 errs in adopting this
convention while also claiming conformance to ISO 8601.
http://www.w3.org/TR/NOTE-datetime
http://stackoverflow.com/questions/16307563/utc-time-explanation
International Standard ISO 8601 specifies numeric representations of date and time.
YYYY-MM-DDThh:mm:ss.sTZD (eg 1997-07-16T19:20:30.45+01:00) where:
YYYY = four-digit year
MM = two-digit month (01=January, etc.)
DD = two-digit day of month (01 through 31)
hh = two digits of hour (00 through 23) (am/pm NOT allowed)
mm = two digits of minute (00 through 59)
ss = two digits of second (00 through 59)
s = one or more digits representing a decimal fraction of a second
TZD = time zone designator (Z or +hh:mm or -hh:mm)
Times are expressed in UTC (Coordinated Universal Time), with a special UTC designator ("Z"). Times are expressed in local time, together with a time zone
offset in hours and minutes. A time zone offset of "+hh:mm" indicates that the date/time uses a local time zone which is "hh" hours and "mm" minutes ahead of
UTC. A time zone offset of "-hh:mm" indicates that the date/time uses a local time zone which is "hh" hours and "mm" minutes behind UTC.
Right Data Architecture Big Data Projects

More from Chicago Hadoop Users Group



Right Data Architecture Big Data Projects

  • 1. Choosing the Right Data Architecture for Your Big Data Projects Presentation 1
  • 2. “There isn’t a cluster big enough to hold your ego!”
  • 3. Presentation 1 Choosing the Right Data Architecture for Your Big Data Projects AGENDA
  • 4. Choosing the Right Data Architecture for Your Big Data Projects Acknowledgements Planning Your Enterprise Data Strategy John Ladley President IMCue Solutions Metrics for Information Management Business Analysis Techniques for Data Professionals Alec Sharp Senior Consultant Clariteq Systems Consulting Steps to a Successful Enterprise Information Management Program Michael F. Jennings Executive Director - Data Governance Walgreens Meta Data Requirements for the Enterprise David Loshin President Knowledge Integrity Advanced MDM: Moving to the Next Level of MDM Success
  • 5. Choosing the Right Data Architecture for Your Big Data Projects Acknowledgements
  • 6. Choosing a Big Data Platform Big Data Platform
  • 10. Key Ideas One Big Data database cannot accommodate all the Big Data types One size DOES NOT fit all. You need to know the data type and data architecture to select the most appropriate Big Data database.
  • 11. Choosing a Big Data Architecture Big Data Platform Big Data Architecture
  • 12. What is Big Data? Big Data is about textual analytics (deriving data from unstructured content), not dimension or fact tables. Examples: web data (click stream data, social network data); semi-structured data (email); unstructured content (comments, tweets, text messages); sensor data; vertical industries' structured transaction data. Choosing a Big Data Architecture
  • 13. Choosing a Big Data Architecture: What do we need to consider when classifying Big Data?
    Analysis Type: Real Time; Batch
    Processing Methodology: Predictive Analytics; Analytical; Querying & Reporting; Misc.
    Data Type: Meta Data; Master Data; Historical; Transactional
    Data Frequency: On Demand Feeds; Continuous Feeds; Real Time Feeds; Time Series
    Data Format: Structured; Un-Structured; Semi-Structured
    Data Sources: Web and Social Media; Machine Generated; Human Generated; Internal Data Sources; Transaction Data; Biometric Data; Via Data Providers; Via Data Originators
    Data Consumers: Human; Business Process; Other Enterprise Applications; Other Data Repositories
    Hardware: Commodity Hardware; State of the Art Hardware
  • 14. Choosing a Big Data Architecture
  • 15. Choosing a Big Data Architecture
  • 16.
  • 17. Choosing a Big Data Architecture: Classify the Big Data Type According to the Business Needs. Big data business problems by type:
    Utilities: Predict power consumption (machine-generated data). Utility companies have rolled out smart meters to measure the consumption of water, gas, and electricity at regular intervals of one hour or less. These smart meters generate huge volumes of interval data that need to be analyzed. Utilities also run big, expensive, and complicated systems to generate power. Each grid includes sophisticated sensors that monitor voltage, current, frequency, and other important operating characteristics. To gain operating efficiency, the company must monitor the data delivered by the sensors. A big data solution can analyze power generation (supply) and power consumption (demand) data using smart meters.
    Telecommunications: Customer churn analytics (web and social data; transaction data). Telecommunications operators need to build detailed customer churn models that include social media and transaction data, such as CDRs, to keep up with the competition. The value of the churn models depends on the quality of customer attributes (customer master data such as date of birth, gender, location, and income) and the social behavior of customers. Providers who implement a predictive analytics strategy can manage and predict churn by analyzing the calling patterns of subscribers.
    Marketing: Sentiment analysis (web and social data). Marketing departments use Twitter feeds to conduct sentiment analysis to determine what users are saying about the company and its products or services, especially after a new product or release is launched. Customer sentiment must be integrated with customer profile data to derive meaningful results. Customer feedback may vary according to customer demographics.
  • 18. Choosing a Big Data Architecture: Classify the Big Data Type According to the Business Needs. Big data business problems by type:
    Customer service: Call monitoring (human-generated data). IT departments are turning to big data solutions to analyze application logs to gain insight that can improve system performance. Log files from various application vendors are in different formats; they must be standardized before IT departments can use them.
    Retail: Personalized messaging based on facial recognition and social media (web and social data; biometrics). Retailers can use facial recognition technology in combination with a photo from social media to make personalized offers to customers based on buying behavior and location. This capability could have a tremendous impact on retailers' loyalty programs, but it has serious privacy ramifications. Retailers would need to make the appropriate privacy disclosures before implementing these applications.
    Retail and marketing: Mobile data and location-based targeting (machine-generated data; transaction data). Retailers can target customers with specific promotions and coupons based on location data. Solutions are typically designed to detect a user's location upon entry to a store or through GPS. Location data combined with customer preference data from social networks enables retailers to target online and in-store marketing campaigns based on buying history. Notifications are delivered through mobile applications, SMS, and email.
    FSS, Healthcare: Fraud detection (machine-generated data; transaction data; human-generated data). Fraud management predicts the likelihood that a given transaction or customer account is experiencing fraud. Solutions analyze transactions in real time and generate recommendations for immediate action, which is critical to stopping third-party fraud, first-party fraud, and deliberate misuse of account privileges. Solutions are typically designed to detect and prevent myriad fraud and risk types across multiple industries, including: credit and debit payment card fraud, deposit account fraud, technical fraud, bad debt, healthcare fraud, Medicaid and Medicare fraud, property and casualty insurance fraud, worker compensation fraud, insurance fraud, and telecommunications fraud.
  • 19. Key Idea There are guidelines to help suggest the Big Data Types that are commonly used by each industry.
  • 20. Choosing a Big Data Architecture Classify Big Data Type According to the Business Needs
  • 21. Validate the data being collected has business value. Critical Success Factor 55% of Big Data projects don't get completed, and many others fall short of their objectives. http://www.infochimps.com/resources/report-cios-big-data-what-your-it-team-wants-you-to-know-6/ Report: CIOs & Big Data: What Your IT Team Wants You to Know
  • 22. Choosing a Big Data Architecture Big Data Platform Big Data Architecture Big Data Business Needs by type
  • 23. Ten Big Data Schemas Big Data Architecture
  • 24. Ten Big Data Schemas Relational - Graph A graph database stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way. Graph databases can make a difference in harvesting more value from your data by looking at its relationships. Provides index-free adjacency, where every element contains a direct pointer to its adjacent elements and no index lookups are necessary.
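A minimal sketch of index-free adjacency, in Python rather than any particular graph database's API: each node record holds direct references to its neighbors, so a traversal follows pointers instead of performing index lookups. The node and relationship names are invented for illustration.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct pointers to adjacent Node objects

    def link(self, other):
        # Undirected edge: each side points straight at the other.
        self.neighbors.append(other)
        other.neighbors.append(self)

# Build a tiny social graph: who knows whom.
alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
alice.link(bob)
bob.link(carol)

# Friends-of-friends for Alice: hop neighbor pointers twice, no index needed.
fof = {n2.name for n1 in alice.neighbors for n2 in n1.neighbors} - {"Alice"}
print(fof)  # {'Carol'}
```

The relationship-centric value the slide describes comes from exactly this: multi-hop questions are answered by pointer chasing, so their cost depends on the size of the neighborhood, not the size of the whole dataset.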
  • 25. Ten Big Data Schemas Relational - Graph
  • 26. Ten Big Data Schemas Relational - Analytics / MPP Columnar Column-oriented storage organization, which increases performance of sequential record access at the expense of common transactional operations such as single-record retrieval, updates, and deletes. Shared-nothing architecture, which reduces system contention for shared resources and allows graceful degradation of performance in the face of hardware failure.
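A toy illustration of why column-oriented layout favors sequential analytics over single-record operations. The table and values are invented; the point is only the difference in what each layout keeps contiguous.

```python
rows = [  # row-oriented: one tuple per record
    ("2013-01-01", "east", 100),
    ("2013-01-02", "west", 250),
    ("2013-01-03", "east", 175),
]

# Column-oriented: one list per attribute.
columns = {
    "date":   [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

# Aggregating 'amount' scans one contiguous list and never touches
# the other columns -- the fast path for analytical queries.
total = sum(columns["amount"])
print(total)  # 525

# Single-record retrieval, by contrast, must stitch a row back together
# from every column -- the operation columnar storage makes slower.
record = tuple(columns[c][1] for c in ("date", "region", "amount"))
print(record)  # ('2013-01-02', 'west', 250)
```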
  • 27. Ten Big Data Schemas Relational - Analytics / MPP Columnar
  • 28. Ten Big Data Schemas Relational - Analytics / MPP Delivers extreme performance and scalability for all your database applications, including Online Transaction Processing (OLTP), data warehousing (DW), and mixed workloads
  • 29. Ten Big Data Schemas Relational - Analytics / MPP
  • 30. Ten Big Data Schemas Relational - NewSQL Scale out relational databases by virtualizing a distributed database environment. Provides organizations relational data integrity combined with the scalability and flexibility of a modern distributed, multi-site database to support an unlimited number of users, larger data volumes, and extremely high TPS
  • 31. Ten Big Data Schemas Relational - NewSQL
  • 32. Ten Big Data Schemas PolyStructured – Document Indexing Provides full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Provides distributed search and index replication. Highly scalable.
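The core structure behind full-text search engines of this kind is an inverted index: a map from each term to the set of documents containing it. This is a toy Python sketch with invented documents; real systems add analyzers, relevance scoring, replication, and faceting on top.

```python
from collections import defaultdict

docs = {
    1: "big data platform for analytics",
    2: "choosing a big data architecture",
    3: "relational analytics and reporting",
}

# Build the inverted index: term -> set of document ids (a "posting list").
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Term query: which documents contain "analytics"?
print(sorted(index["analytics"]))  # [1, 3]

# AND query: intersect the posting lists of both terms.
print(sorted(index["big"] & index["analytics"]))  # [1]
```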
  • 33. Ten Big Data Schemas PolyStructured – Document Indexing
  • 34. Ten Big Data Schemas PolyStructured - Document Document databases completely embrace the web. Store data with JSON documents. Access documents and query indexes with web browsers, via HTTP. Index, combine, and transform documents with JavaScript. Works well with modern web and mobile apps. Serve web apps directly. On-the-fly document transformation and real-time change notifications
  • 35. Ten Big Data Schemas PolyStructured - Document Document databases lack a schema, or rigid pre-defined data structures such as tables. Data stored in document databases is commonly kept in JSON documents, with JavaScript used for MapReduce indexes.
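A hedged sketch of the schemaless, map-style indexing idea, written in Python rather than the JavaScript real document databases use. Each JSON document may carry different fields, and a map function emits key/value pairs to build a queryable view; the documents and the `map_by_author` function are invented for illustration.

```python
import json

# Schemaless: each document carries whatever fields it needs.
docs = [
    json.loads('{"_id": 1, "type": "comment", "author": "ann", "text": "+1"}'),
    json.loads('{"_id": 2, "type": "email", "from": "bob@example.com"}'),
    json.loads('{"_id": 3, "type": "comment", "author": "ann", "text": "nice"}'),
]

def map_by_author(doc):
    """Emit (key, value) pairs, like a MapReduce map function over documents."""
    if doc.get("type") == "comment":
        yield (doc["author"], doc["_id"])

# Build the view: every emitted pair, sorted by key for lookup.
view = sorted(pair for doc in docs for pair in map_by_author(doc))
print(view)  # [('ann', 1), ('ann', 3)]
```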
  • 36. Ten Big Data Schemas PolyStructured – Key Value Store – In-Memory - Data Grid In-Memory Accelerator for Apache Hadoop, high performance computing, streaming and database, HDFS and MongoDB Eliminate MapReduce Overhead Dynamically caches, partitions, replicates, and manages application data and business logic across multiple servers. Fully elastic memory-based storage grid. Virtualizes the free memory of a potentially large number of Java virtual machines and makes them behave like a single key-addressable storage pool for application state. IBM WebSphere eXtreme Scale
  • 37. Ten Big Data Schemas PolyStructured – Key Value Store – In-Memory - Data Grid
  • 38. Ten Big Data Schemas PolyStructured – Key Value Store – In-Memory - Caching Run atomic operations like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union, and difference; or getting the member with the highest ranking in a sorted set. With an in-memory dataset, depending on your use case, you can persist it either by dumping the dataset to disk every once in a while, or by appending each command to a log.
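A toy in-memory store showing the semantics of the operations listed above: string append, hash increment, list push, set algebra, and highest-ranked member of a sorted set. This is plain Python, not any cache product's API; real in-memory caches run these commands server-side, atomically. All the key names are invented.

```python
store = {
    "greeting": "hello",
    "counters": {"visits": 41},
    "queue": [],
    "tags:a": {"x", "y"},
    "tags:b": {"y", "z"},
    "scores": {"ann": 10, "bob": 25},
}

store["greeting"] += " world"             # append to a string
store["counters"]["visits"] += 1          # increment the value in a hash
store["queue"].append("job-1")            # push to a list
both = store["tags:a"] & store["tags:b"]  # set intersection
top = max(store["scores"], key=store["scores"].get)  # highest-ranked member

print(store["greeting"], store["counters"]["visits"], both, top)
# hello world 42 {'y'} bob
```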
  • 39. Ten Big Data Schemas PolyStructured – Key Value Store – In-Memory - Caching
  • 40. Ten Big Data Schemas PolyStructured – Key Value Store – Columnar Random, real-time read/write access to your Big Data Hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware
  • 41. Ten Big Data Schemas PolyStructured – Key Value Store – Columnar
  • 42. Ten Big Data Schemas PolyStructured – Distributed File System Storage and large-scale processing of data sets on clusters of commodity hardware. Distributed, scalable, and portable file system
  • 43. Ten Big Data Schemas PolyStructured – Distributed File System
  • 44. Key Ideas Hadoop is the #1 distributed file system used for Big Data Projects Hadoop is used as the shared data source platform to merge and standardize big data with legacy data
  • 45. Data As A Service Single System Management APIs Data as a Service Applications (APIs) should be based on a single data source platform. Web and Social Media Machine Generated Human Generated Internal Data Sources Transaction Data Biometric Data Via Data Providers Via Data Originators
  • 46. Key Ideas Hadoop is the #1 distributed file system used for Big Data Projects Hadoop is used as the shared data source platform to merge and standardize big data with legacy data Hadoop is an excellent choice to start building your shared data source platform Hadoop can become your System of Record (SOR) for Big Data and part of your Master Data Management system (MDM)
  • 47. The date/time format must be standardized across the data platform Critical Success Factors The time format of International Standard ISO 8601 specifies numeric representations of date and time. YYYY-MM-DDThh:mm:ss.sTZD (e.g., 1997-07-16T19:20:30.45+01:00) is suggested and preferred. Unique identifiers (domain keys) must be clearly described using friendly terminology For example: 'ID' should never be a column name 'Sales ID' is too generic 'Sales Representative Reporting ID' is friendly and clearly named
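Producing the recommended ISO 8601 shape (YYYY-MM-DDThh:mm:ss.sTZD) is straightforward with the Python standard library, and the value round-trips cleanly, which is what platform-wide standardization buys you. The example reuses the timestamp from the slide.

```python
from datetime import datetime, timezone, timedelta

# A timestamp in a +01:00 local zone, as in the slide's example value.
tz = timezone(timedelta(hours=1))
t = datetime(1997, 7, 16, 19, 20, 30, 450000, tzinfo=tz)

iso = t.isoformat()
print(iso)  # 1997-07-16T19:20:30.450000+01:00

# Round-trips exactly, so every system on the platform parses it the same way.
assert datetime.fromisoformat(iso) == t
```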
  • 48. Key Idea Hadoop is used as the shared analytical platform to merge and standardize analytics
  • 49. Single System Management Analytics should be based on a single data source platform. Analytics As A Service IBM WebSphere eXtreme Scale Analytics Analytics as a Service
  • 50. Key Ideas Hadoop is used as the shared analytical platform to merge and standardize analytics There are guidelines to help suggest the analytics, KPIs, and profit drivers for Big Data that are commonly used by each industry.
  • 51. Examples of tasks and algorithms to use (2)
Predicting a discrete attribute (Decision Trees, Naive Bayes, Clustering, and Neural Network algorithms): flag the customers in a prospective buyers list as good or poor prospects; calculate the probability that a server will fail within the next 6 months; categorize patient outcomes and explore related factors.
Predicting a continuous attribute (Decision Trees, Time Series, and Linear Regression algorithms): forecast next year's sales; predict site visitors given past historical and seasonal trends; generate a risk score given demographics.
Predicting a sequence (Sequence Clustering algorithm): perform clickstream analysis of a company's Web site; analyze the factors leading to server failure; capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities.
Finding groups of common items in transactions (Association and Decision Trees algorithms): use market basket analysis to determine product placement; suggest additional products to a customer for purchase; analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities.
Finding groups of similar items (Clustering and Sequence Clustering algorithms): create patient risk profile groups based on attributes such as demographics and behaviors; analyze users by browsing and buying patterns; identify servers that have similar usage characteristics.
  • 52. Key Ideas Hadoop is used as the shared analytical platform to merge and standardize analytics There are guidelines to help suggest the analytics, KPIs, and profit drivers for Big Data that are commonly used by each industry. You do not need to know how the algorithms work or are designed. You only need to know the parameters needed to run them.
  • 53. Data Mining Tasks (4)
Market Basket Analysis: Discover items sold together to create recommendations on the fly and to determine how product placement can directly contribute to your bottom line. Algorithms: Association, Decision Trees.
Churn Analysis: Anticipate customers who may be considering canceling their service and identify the benefits that will keep them from leaving. Algorithms: Decision Trees, Linear Regression, Logistic Regression.
Market Analysis: Define market segments by automatically grouping similar customers together. Use these segments to seek profitable customers. Algorithms: Clustering, Sequence Clustering.
Forecasting: Predict sales and inventory amounts and learn how they are interrelated to foresee bottlenecks and improve performance. Algorithms: Decision Trees, Time Series.
Data Exploration: Analyze profitability across customers, or compare customers that prefer different brands of the same product to discover new opportunities. Algorithms: Neural Network.
Unsupervised Learning: Identify previously unknown relationships between various elements of your business to inform your decisions. Algorithms: Neural Network.
Web Site Analysis: Understand how people use your Web site and group similar usage patterns to offer a better experience. Algorithms: Sequence Clustering.
Campaign Analysis: Spend marketing funds more effectively by targeting the customers most likely to respond to a promotion. Algorithms: Decision Trees, Naïve Bayes, Clustering.
Information Quality: Identify and handle anomalies during data entry or data loading to improve the quality of information. Algorithms: Linear Regression, Logistic Regression.
Text Analysis: Analyze feedback to find common themes and trends that concern your customers or employees, informing decisions with unstructured input. Algorithms: Text Mining.
  • 54. Data Mining Algorithms (Analysis Services - Data Mining) Choosing an Algorithm by Task To help you select an algorithm for use with a specific task, the following list provides suggestions for the types of tasks for which each algorithm is traditionally used.
Predicting a discrete attribute (Microsoft Decision Trees, Naive Bayes, Clustering, and Neural Network algorithms): flag the customers in a prospective buyers list as good or poor prospects; calculate the probability that a server will fail within the next 6 months; categorize patient outcomes and explore related factors.
Predicting a continuous attribute (Microsoft Decision Trees, Time Series, and Linear Regression algorithms): forecast next year's sales; predict site visitors given past historical and seasonal trends; generate a risk score given demographics.
Predicting a sequence (Microsoft Sequence Clustering Algorithm): perform clickstream analysis of a company's Web site; analyze the factors leading to server failure; capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities.
Finding groups of common items in transactions (Microsoft Association and Decision Trees algorithms): use market basket analysis to determine product placement; suggest additional products to a customer for purchase; analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities.
Finding groups of similar items (Microsoft Clustering and Sequence Clustering algorithms): create patient risk profile groups based on attributes such as demographics and behaviors; analyze users by browsing and buying patterns; identify servers that have similar usage characteristics.
  • 55. Analytic Algorithm Categories Regression: a powerful and commonly used algorithm that evaluates the relationship of one variable, the dependent variable, with one or more other variables, called independent variables. By measuring exactly how large and significant each independent variable has historically been in its relation to the dependent variable, the future value of the dependent variable can be estimated. Regression models are widely used in applications such as seasonal forecasting, quality assurance, and credit risk analysis.
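To make the regression idea concrete, here is a minimal ordinary least squares fit in pure Python: estimate how the dependent variable moves with one independent variable, then use the fitted line to forecast. The quarterly sales numbers are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx  # (slope, intercept)

# Quarterly sales (dependent) against quarter number (independent).
quarters = [1, 2, 3, 4]
sales = [100, 120, 140, 160]

slope, intercept = fit_line(quarters, sales)
forecast_q5 = slope * 5 + intercept  # seasonal-forecast style use
print(slope, intercept, forecast_q5)  # 20.0 80.0 180.0
```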
  • 56. Analytic Algorithm Categories Clustering / Segmentation: the process of grouping items together to form categories. You might look at a large collection of shopping baskets and discover that they are clustered corresponding to health food buyers, convenience food buyers, luxury food buyers, and so on. Once these characteristics have been grouped together, they can be used to find other customers with similar characteristics. This algorithm is used to create groups for applications, such as customers for marketing campaigns, rate groups for insurance products, and crime statistics groups for law enforcement.
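A bare-bones k-means sketch of the clustering idea: group shopping baskets by total spend into low-spend and high-spend segments. Real clustering runs over many attributes; one dimension and invented spend figures keep the two steps (assign, then update centers) readable.

```python
def kmeans_1d(values, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each value joins its nearest center.
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

# Basket totals: a cheap cluster and an expensive cluster.
spend = [8, 10, 12, 90, 95, 100]
centers = kmeans_1d(spend, centers=[0.0, 50.0])
print(centers)  # [10.0, 95.0]
```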
  • 57. Analytic Algorithm Categories Nearest Neighbor: quite similar to clustering, but it only looks at other records in the dataset that are "nearest" to a chosen unclassified record, based on a "similarity" measure. Records that are "near" to each other tend to have similar predictive values as well. Thus, if you know the prediction value of one record, you can predict the same for its nearest neighbors. This algorithm works similarly to the way people think: by detecting closely matching examples. Nearest Neighbor applications are often used in retail and life sciences applications.
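A tiny k-nearest-neighbor classifier following the description above: predict a label for a new record from the labels of its k closest records, with Euclidean distance as the "similarity" measure. The customer attributes and segment labels are invented.

```python
import math
from collections import Counter

def knn_predict(records, query, k=3):
    """records: list of ((features...), label). Returns majority label of the k nearest."""
    by_distance = sorted(records, key=lambda r: math.dist(r[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# (age, yearly purchases) -> customer segment
customers = [
    ((25, 40), "frequent"), ((30, 35), "frequent"), ((28, 45), "frequent"),
    ((60, 5), "occasional"), ((55, 8), "occasional"), ((65, 3), "occasional"),
]

print(knn_predict(customers, (27, 38)))  # frequent
print(knn_predict(customers, (58, 6)))   # occasional
```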
  • 58. Analytic Algorithm Categories Association Rules: detects related items in a dataset. Association analysis identifies and groups together similar records that would otherwise go unnoticed by a casual observer. This type of analysis is often used for market basket analysis to find popular bundles of products that are related by transaction, such as low-end digital cameras being associated with smaller capacity memory sticks to store the digital images.
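A first-cut market basket analysis to ground the idea: count how often item pairs appear in the same transaction, which is the raw co-occurrence ("support") that association rule mining builds on. The baskets are invented, echoing the camera-and-memory-stick example.

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"camera", "memory stick", "case"},
    {"camera", "memory stick"},
    {"camera", "tripod"},
    {"memory stick", "battery"},
]

# Count every unordered item pair per transaction.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)  # ('camera', 'memory stick') 2
```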
  • 59. Analytic Algorithm Categories Decision Tree: a tree-shaped graphical predictive algorithm that represents alternative sequential decisions and the possible outcomes for each decision. This algorithm provides the alternative actions that are available to the decision maker, the probabilistic events that follow from and affect these actions, and the outcomes that are associated with each possible scenario of actions and consequences. Their applications range from credit card scoring to time series predictions of exchange rates.
  • 60. Analytic Algorithm Categories Sequence Association: detects causality and association between time-ordered events, although the associated events may be spread far apart in time and may seem unrelated. Tracking specific time-ordered records and linking these records to a specific outcome allows companies to predict a possible outcome based on a few occurring events. A sequence model can be used to reduce the number of clicks customers have to make when navigating a company’s website.
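A sketch of the clickstream use case above: count which page tends to follow which across sessions, so the site can shortcut the most common next click. The page names and sessions are invented; real sequence models capture much longer-range order than these adjacent-pair counts.

```python
from collections import Counter, defaultdict

sessions = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["home", "search", "product"],
]

# Count time-ordered transitions: page -> Counter of next pages.
next_page = defaultdict(Counter)
for path in sessions:
    for here, there in zip(path, path[1:]):
        next_page[here][there] += 1

# Most likely page after "search" -- a candidate shortcut to offer.
print(next_page["search"].most_common(1)[0])  # ('product', 2)
```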
  • 61. Analytic Algorithm Categories Neural Network: a sophisticated pattern detection algorithm that uses machine learning techniques to generate predictions. This technique models itself after the process of cognitive learning and the neurological functions of the brain, and is capable of predicting new observations from other known observations. Neural networks are very powerful, complex, and accurate predictive models that are used in detecting fraudulent behavior, in predicting the movement of stocks and currencies, and in improving the response rates of direct marketing campaigns.
  • 62. Choosing a Big Data Architecture Big Data Platform Big Data Analytical Platform Big Data Analytics Big Data Business Needs by type Big Data Architecture
  • 63. Analytics Data Sources Analytics should be based on a single data source platform. Analytics As A Service Analytics as a Service IBM WebSphere eXtreme Scale
  • 64. Analytics As A Service When you write data to a traditional database, whether by loading external data, writing the output of a query, or running UPDATE statements, the database has total control over the storage. The database is the "gatekeeper." An important implication of this control is that the database can enforce the schema as data is written. This is called schema on write. Hive has no such control over the underlying storage. There are many ways to create, modify, and even damage the data that Hive will query. Therefore, Hive can only enforce the schema when the data is read. This is called schema on read. So what if the schema doesn’t match the file contents? Hive does the best that it can to read the data. You will get lots of null values if there aren’t enough fields in each record to match the schema. If some fields are numbers and Hive encounters nonnumeric strings, it will return nulls for those fields. Above all else, Hive tries to recover from all errors as best it can.
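A hedged, Hive-like illustration of schema on read, in plain Python rather than HiveQL: the raw file is taken as-is, and fields that are missing or fail to parse come back as nulls (None) instead of failing the load. The column names and sample lines are invented.

```python
# The schema is applied at read time, not enforced at write time.
schema = [("name", str), ("visits", int), ("score", float)]

raw_lines = [
    "ann,12,3.5",      # clean record
    "bob,twelve,4.0",  # non-numeric where the schema expects an int
    "carol,7",         # too few fields for the schema
]

def read_with_schema(line):
    fields = line.split(",")
    row = {}
    # Pad with None so a short record still yields every schema column.
    for (col, typ), value in zip(schema, fields + [None] * len(schema)):
        try:
            row[col] = typ(value)
        except (TypeError, ValueError):
            row[col] = None  # best effort: recover with a null, don't fail
    return row

for line in raw_lines:
    print(read_with_schema(line))
# {'name': 'ann', 'visits': 12, 'score': 3.5}
# {'name': 'bob', 'visits': None, 'score': 4.0}
# {'name': 'carol', 'visits': 7, 'score': None}
```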
  • 66. Analytics As A Service
Benefits of schema on write:
• Better type safety and data cleansing done for the data at rest
• Typically more efficient (storage size and computationally) since the data is already parsed
Downsides of schema on write:
• You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)
• Typically you throw away the original data, which could be bad if you have a bug in your ingest process
• It's harder to have different views of the same data
Benefits of schema on read:
• Flexibility in defining how your data is interpreted at load time
• This gives you the ability to evolve your "schema" as time goes on
• This allows you to have different versions of your "schema"
• This allows the original source data format to change without having to consolidate to one data format
• You get to keep your original data
• You can load your data before you know what to do with it (so you don't drop it on the ground)
• Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data
Downsides of schema on read:
• Generally it is less efficient because you have to reparse and reinterpret the data every time (this can be expensive with formats like XML)
• The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)
• More error prone, and your analytics have to account for dirty data
http://nosql.mypopescu.com/post/48638541973/schema-on-writes-vs-schema-on-reads-apache-hadoop-and
  • 67. Reporting users make their own schemas and naming standards Reporting users run their own analytics --- as many times as they want
  • 68. Key Ideas - Summary
One Big Data database cannot accommodate all the Big Data types.
You need to know the data type and data architecture to select the most appropriate Big Data database.
There are guidelines to help suggest the Big Data Types that are commonly used by each industry.
Hadoop is used as the shared data source platform to merge and standardize big data with legacy data.
Hadoop is used as the shared analytical platform to merge and standardize analytics.
Hadoop is an excellent choice to start building your shared data source platform.
Hadoop can become your System of Record (SOR) for Big Data and part of your Master Data Management system (MDM).
Hadoop is used to standardize and centralize the Key Performance Indicators (KPIs) and profit drivers for an Enterprise Analytical Platform.
There are guidelines to help suggest the analytics, KPIs, and profit drivers for Big Data that are commonly used by each industry.
Schema on read
  • 69. Critical Success Factors - Summary
Validate the data being collected has business value.
The date/time format must be standardized across the data platform.
Unique identifiers (domain keys) must be clearly described using friendly terminology.
  • 70. References
1) Pervasive insights produce better business decisions: opening access to business intelligence by embedding analytics capabilities into everyday software tools pays substantial dividends. By Lauren Gibbons Paul
2) Data Mining Algorithms (Analysis Services - Data Mining) http://msdn.microsoft.com/en-us/library/ms175595.aspx
3) Data Mining Query Task http://msdn.microsoft.com/en-us/library/ms141728.aspx
4) Predictive Analysis with SQL Server 2008 - White Paper - Microsoft - Published: November 2007
5) Predictive Analytics for the Retail Industry - White Paper - Microsoft - Writer: Matt Adams, Technical Reviewer: Roni Karassik, Published: May 2008
6) Breakthrough Insights using Microsoft SQL Server 2012 - Analysis Services https://www.microsoftvirtualacademy.com/tracks/breakthrough-insights-using-microsoft-sql-server-2012-a
7) Useful DAX Starter Functions and Expressions http://thomasivarssonmalmo.wordpress.com/category/powerpivot-and-dax/
8) Stairway to PowerPivot and DAX - Level 1: Getting Started with PowerPivot and DAX, by Bill_Pearson, 2011/12/21
9) Data Mining Tool http://technet.microsoft.com/en-us/library/ms174467.aspx
10) DAX Cheat Sheet http://powerpivot-info.com/post/439-dax-cheat-sheet
11) Big Data Landscape http://arnon.me/2012/11/nosql-landscape-diagrams/
  • 71. International Standard ISO 8601 specifies numeric representations of date and time: YYYY-MM-DDThh:mm:ss.sTZD (e.g., 1997-07-16T19:20:30.45+01:00), where:
YYYY = four-digit year
MM = two-digit month (01=January, etc.)
DD = two-digit day of month (01 through 31)
hh = two digits of hour (00 through 23) (am/pm NOT allowed)
mm = two digits of minute (00 through 59)
ss = two digits of second (00 through 59)
s = one or more digits representing a decimal fraction of a second
TZD = time zone designator (Z or +hh:mm or -hh:mm)
Times are expressed either in UTC (Coordinated Universal Time), with a special UTC designator ("Z"), or in local time, together with a time zone offset in hours and minutes. A time zone offset of "+hh:mm" indicates that the date/time uses a local time zone which is "hh" hours and "mm" minutes ahead of UTC. A time zone offset of "-hh:mm" indicates that the date/time uses a local time zone which is "hh" hours and "mm" minutes behind UTC.
On the Internet, the World Wide Web Consortium (W3C) uses ISO 8601 in defining a profile of the standard that restricts the supported date and time formats to reduce the chance of error and the complexity of software.[19] RFC 3339 defines a profile of ISO 8601 for use in Internet protocols and standards. It explicitly excludes durations and dates before the common era. The more complex formats such as week numbers and ordinal days are not permitted.[20] RFC 3339 deviates from ISO 8601 in allowing a zero timezone offset to be specified as "-00:00", which ISO 8601 forbids. RFC 3339 intends "-00:00" to carry the connotation that it is not stating a preferred timezone, whereas the conforming "+00:00" or any non-zero offset connotes that the offset being used is preferred. This convention regarding "-00:00" is derived from earlier RFCs, such as RFC 2822, which uses it for timestamps in email headers. RFC 2822 made no claim that any part of its timestamp format conforms to ISO 8601, and so was free to use this convention without conflict. RFC 3339 errs in adopting this convention while also claiming conformance to ISO 8601.
http://www.w3.org/TR/NOTE-datetime
http://stackoverflow.com/questions/16307563/utc-time-explanation