This document provides an introduction to big data, including definitions of big data and its key characteristics of volume, variety, velocity, variability, and veracity. It discusses big data analysis and how it differs from traditional analytics by examining large, diverse datasets. Hadoop is presented as a popular open-source framework for managing and analyzing big data, and its use by companies like Facebook, LinkedIn, Walmart, and Twitter is described. The document also briefly outlines Hadoop's history and architecture, common Hadoop variants, skills needed to work with Hadoop, and examples of big data case studies.
2. Table of Contents
Topics Covered:
• What is BIG DATA?
• Characteristics of Big Data
• What is BIG DATA Analysis?
• Traditional vs. Current Analytics Trends
• BIG Data using Hadoop!
• Hadoop History
• Hadoop – High Level Architecture
• Hadoop Variants
• Hadoop Skills
• NOSQL Introduction
• Big Data – Case Studies
2 | Oh! Session - Introduction to Big Data
3. What is BIG DATA?
Big Data, simply put, is data which is very BIG!
Big data is a new, “ginormous”
& scary – a very, very scary – term.
No, wait. It is not.
Big data is a term for data sets that are so large or complex that
traditional data processing applications are inadequate.
Examples of Big Data:
SOCIAL MEDIA ACTIVITY – like Facebook, Twitter, LinkedIn, etc.
FINANCIAL TRANSACTIONS – Internet Banking logs, Share Market, etc.
LOCATION TRACKING – Global Positioning System data, etc.
WEB BEHAVIOUR – Internet browsing, Google searches, etc.
4. Characteristics of BIG DATA
Big data can be described by the following characteristics:
Volume
The quantity of generated & stored data. Size determines whether data qualifies as "big".
Variety
The type and nature of the data.
Velocity
The speed at which data is generated.
Variability
The inconsistency of the data set.
Veracity
The quality of captured data, which can vary greatly and affect the accuracy of analysis.
5. What is BIG DATA ANALYSIS?
Big data analytics is the process of examining large data sets containing a variety of data
types – i.e. Big Data – to uncover hidden patterns, unknown correlations, market trends, customer
preferences and other useful business information.
Benefits of Big Data Analytics
Analytical findings drawn from Big Data can lead to:
•more effective marketing
•new revenue opportunities
•better customer service
•improved operational efficiency
•competitive advantages over rival organizations
•& other business benefits.
6. Traditional vs. Current Analytics Trends
Data processing and Analytics: The old way
Traditionally, analytics followed the
creation of modest amounts of structured data by
enterprise applications (CRM, ERP, etc.).
The modeled & cleansed data was loaded into an
enterprise data warehouse.
The complexity of the data analyzed was limited
to relational data only, so TERADATA, EXADATA &
NETEZZA were running the show.
Data processing and Analytics: The new way
Currently, data is growing exponentially and its
variety has grown from text & relational (i.e.
structured) data to a mix of structured, semi-structured &
unstructured data.
The analytical tool-set had to change to handle
the unstructured part of the data, which is why
technologies like Hadoop, SPARK and NOSQL have
become popular; they have reduced costs by
providing open-source systems & resilience through
parallel processing.
7. BIG Data using Hadoop!
Why Hadoop?
The best-known technology for managing structured and unstructured
data is Hadoop, an open-source, Java-based framework.
It is flexible, scalable, robust, cost-effective and adaptive to upcoming technologies.
Hadoop in Action:
Hadoop is a great framework for advertising companies as well. It tracks the millions of
clicks on ads and how users respond to the ads posted by the big ad agencies!
•Facebook – over 1.3 billion active users – storing, managing & keeping track of all profiles along with the
related posts, comments, images, videos, and so on.
•LinkedIn – managing over 1 billion personalized recommendations/week using MapReduce & HDFS
features!
•Walmart – Helping handle more than 1 million customer transactions/hour
•Twitter – Managing and handling 85 million tweets from users/day
•Google – Managing more than 1 terabyte of data/hour
•eBay – handling and managing 80 terabytes of data/day and suggesting additional suitable products to
their customers
•Spadac.com – helps run spatial intelligence & predictive analytics on huge volumes of data for providing
actionable intelligence to its customers
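The MapReduce & HDFS features mentioned above can be sketched with the canonical word-count example. This is a toy, single-process illustration of the map / shuffle / reduce phases in plain Python, not the actual Hadoop API:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "data is everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In real Hadoop the map and reduce functions run as tasks spread across the cluster, with the framework handling the shuffle over the network; the per-phase logic, however, is this simple.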
9. Hadoop – High Level Architecture
10. Hadoop Variants
Major variants for Hadoop and their distribution
1. Cloudera Hadoop (CDH)
2. HortonWorks
3. MapR
12. Big Data – Case Studies
1. 2012 US Presidential Election
• Big Data analytics helped Barack Obama win
the US election
2. Data Storage
• NetApp
3. Human Sciences
• NextBio
13. What is MONGODB?
Data in this model is stored inside documents.
Documents are not typically forced to have a schema and are
therefore flexible and easy to change.
No joins required.
15. MONGODB – Similarities with HADOOP
Both MONGODB and HADOOP:
• Replication possible
• Horizontally scalable
• Master–slave concept
• Commodity hardware can be used
16. MONGODB – Differences with HADOOP
MONGODB:
• Data is stored in a database
• Data is processed serially
• Data can be written at any time
HADOOP:
• Data is stored in a file system
• Data is processed in parallel (data parallelism)
• Data is written once (write-once, read-many)
17. Thank You
Feel Free to drop your queries to:
Benoy Daniel Benoy.daniel@axa-tech.com
Bibhusisa Pattanaik Bibhusisa.Pattanaik@axa-tech.com
Editor's Notes
1. Flexible:
It is a known fact that only about 20% of the data in organizations is structured, and the rest is all unstructured, so it is crucial to manage the unstructured data that would otherwise go unattended. Hadoop manages different types of Big Data, whether structured or unstructured, encoded or formatted, or any other type of data, and makes it useful for the decision-making process. Moreover, Hadoop is simple, relevant and schema-less! Though Hadoop natively supports Java programming, any programming language can be used with Hadoop through the MapReduce model (e.g. via Hadoop Streaming). Though Hadoop works best on Linux, it can also work on other operating systems such as Windows, BSD and OS X.
2. Scalable
Hadoop is a scalable platform, in the sense that new nodes can be easily added in the system as and when required without altering the data formats, how data is loaded, how programs are written, or even without modifying the existing applications. Hadoop is an open source platform and runs on industry-standard hardware. Moreover, Hadoop is also fault tolerant – this means, even if a node gets lost or goes out of service, the system automatically reallocates work to another location of the data and continues processing as if nothing had happened!
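The fault tolerance described above rests on replication: each block of data lives on several nodes, so losing one node loses no data. A toy simulation of the idea (HDFS's default replication factor is 3; the node names are invented):

```python
import random

REPLICATION_FACTOR = 3  # HDFS default: each block stored on 3 nodes

def place_blocks(blocks, nodes):
    # Store each block on REPLICATION_FACTOR distinct nodes,
    # the way HDFS spreads block replicas across the cluster.
    return {block: random.sample(nodes, REPLICATION_FACTOR) for block in blocks}

def recoverable(placement, failed_node):
    # A block survives a node failure if at least one of its
    # replicas lives on a healthy node.
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(["block-a", "block-b", "block-c"], nodes)
print(recoverable(placement, "node3"))  # True: replicas survive a single failure
```

Because every block has three replicas on distinct nodes, no single failure can take out all copies; the real system then re-replicates the affected blocks to restore the replication factor, which is the "continues processing as if nothing had happened" behaviour in the note.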
3. Robust Ecosystem:
Hadoop has a very robust and rich ecosystem that is well suited to meet the analytical needs of developers, web start-ups and other organizations. The Hadoop ecosystem consists of various related projects such as MapReduce, Hive, HBase, ZooKeeper, HCatalog and Apache Pig, which make Hadoop very competent to deliver a broad spectrum of services.
4. Hadoop is getting more “Real-Time”!
Did you ever wonder how to stream information into a cluster and analyze it in real time? Hadoop has the answer for it. Yes, Hadoop’s competencies are getting more and more real-time. Hadoop also provides a standard approach to a wide set of APIs for big data analytics comprising MapReduce, query languages and database access, and so on.
6. Cost Effective:
Loaded with such great features, the icing on the cake is that Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which in turn makes it reasonable to model all your data. The basic idea behind Hadoop is to perform cost-effective analysis of the data present across the world wide web!
7. Upcoming Technologies using Hadoop:
While reinforcing its capabilities, Hadoop is leading to phenomenal technical advancements. For instance, HBase will soon become a vital platform for Blob Stores (Binary Large Objects) and for lightweight OLTP (Online Transaction Processing). Hadoop has also begun serving as a strong foundation for new-school graph and NoSQL databases, and better versions of relational databases.