3. Oxford English Dictionary:
◦ “An all-encompassing term for any collection of data
sets so large and complex that it becomes difficult to
process using on-hand data management tools or
traditional data processing applications”
Defined by volume, variety, velocity
2008 computer scientist predictions:
◦ Big Data will “transform the activities of companies,
scientific researchers, medical practitioners, and our
nation’s defense and intelligence operations”
According to the New York Times:
◦ Big data science “typically means applying the tools of
artificial intelligence, like machine learning, to vast new
troves of data beyond that captured in standard databases”
4. Wider
Longer
Wider and Longer
Complex subgroupings within wider or longer sets
Many correlations
Noisy
Missing data
5. Computational challenges of storage and
statistical program memory
◦ R holds data in memory, so R space on a laptop is limited
by installed RAM (for example, 2 GB) unless more RAM is added
◦ Algorithm computing time grows according to scaling
rules, many of which are superlinear and some exponential.
For example, if 2 GB takes 4 minutes, 4 GB can take 16
minutes…
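The 4-minute and 16-minute figures fit more than one scaling rule. The sketch below is illustrative only: it assumes two hypothetical runtime models, a quadratic one (minutes = size²) and an exponential one (minutes = 2^size), with size in GB. Both reproduce the two figures above, but they diverge sharply at the next doubling.

```python
def quadratic_minutes(size_gb):
    """Hypothetical quadratic model: minutes = size_gb ** 2."""
    return size_gb ** 2


def exponential_minutes(size_gb):
    """Hypothetical exponential model: minutes = 2 ** size_gb."""
    return 2 ** size_gb


# Both models agree at 2 GB and 4 GB, then separate at 8 GB.
for size_gb in (2, 4, 8):
    print(size_gb, quadratic_minutes(size_gb), exponential_minutes(size_gb))
```

Two data points cannot tell the models apart, so timing an algorithm at three or more sizes is a reasonable check before planning for larger data.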
Statistical challenges from data structure
◦ Wide data violates many statistical assumptions.
◦ Correlations among predictors also violate statistical
assumptions and create problems with the underlying
linear algebra calculation methods.
◦ Potential for lots of informative missing data that can’t
be imputed using existing statistical methods.
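One concrete way the wide-data and correlated-predictor problems show up is in ordinary least squares, which needs the inverse of XᵀX. The sketch below uses small made-up matrices (all values are assumptions for illustration) and exact fractions to show that XᵀX is singular both when there are more predictors than rows and when one predictor is a multiple of another.

```python
from fractions import Fraction


def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(Fraction(a) * b for a, b in zip(ci, cj)) for cj in cols]
            for ci in cols]


def det(M):
    """Determinant by Gaussian elimination using exact fractions."""
    M = [row[:] for row in M]
    n = len(M)
    result = Fraction(1)
    for i in range(n):
        # Find a row at or below i with a nonzero entry in column i.
        pivot = next((r for r in range(i, n) if M[r][i] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot: the matrix is singular
        if pivot != i:
            M[i], M[pivot] = M[pivot], M[i]
            result = -result
        result *= M[i][i]
        for r in range(i + 1, n):
            factor = M[r][i] / M[i][i]
            M[r] = [a - factor * b for a, b in zip(M[r], M[i])]
    return result


wide = [[1, 2, 3], [4, 5, 7]]                 # 2 rows, 3 predictors
collinear = [[1, 2], [2, 4], [3, 6], [4, 8]]  # second column = 2 * first
healthy = [[1, 0], [0, 1], [1, 1], [2, 1]]    # more rows than predictors

for name, X in [("wide", wide), ("collinear", collinear), ("healthy", healthy)]:
    print(name, det(gram(X)))
```

A zero determinant means XᵀX has no inverse, so the usual regression solution is not unique. Real data is more often nearly singular than exactly singular, which produces unstable estimates instead of an outright failure.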
6. More computing resources
◦ Expensive
◦ Cloud computing
◦ Does not solve statistical issues posed by big data
New statistical methods
◦ Rely on a new set of tools from computer science
◦ Work around limitations of existing multivariate
data analysis methods
◦ Don’t always scale as big data grows
Still have computational issues
Need for larger and larger training sets for good
performance
7. Hadoop
◦ Open-source software for storage and processing of big data across
computer cores/clusters
◦ Compatible with existing statistical software
MapReduce
◦ Distributed computing strategy for big data processing and analyses
◦ Compute problem in parallel and combine final answers for shorter
compute times
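The MapReduce idea can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce phases on a word count, not the Hadoop API; the function names and sample text are assumptions made for the example. In a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Map: emit a (word, 1) pair for each word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into one final answer."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


# Each chunk stands in for a block of data stored on a separate node.
chunks = ["big data is big", "data science uses big data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each chunk is mapped independently, the map step can run in parallel. Only the grouped intermediate pairs need to be combined at the end.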
SQL/NoSQL
◦ Database query languages (SQL for relational databases, NoSQL tools for non-relational stores) for:
Database construction/modifications
Pulling pieces of data for further analyses/reporting
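Both uses can be illustrated with SQL run through Python's built-in sqlite3 module. The `visits` table, its columns, and its rows are hypothetical and exist only for this sketch.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Database construction.
conn.execute("CREATE TABLE visits (patient_id INTEGER, clinic TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "north", 120.0), (2, "north", 80.0), (3, "south", 200.0)],
)

# Database modification: add a column to the existing table.
conn.execute("ALTER TABLE visits ADD COLUMN insured INTEGER DEFAULT 0")

# Pulling a summarized piece of the data for further analysis or reporting.
rows = conn.execute(
    "SELECT clinic, COUNT(*), AVG(cost) FROM visits "
    "GROUP BY clinic ORDER BY clinic"
).fetchall()
print(rows)

conn.close()
```

The same SELECT pattern applies to larger relational databases. NoSQL stores use their own query interfaces, which this sketch does not cover.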
R
◦ Free open-source software with existing machine learning algorithms and
coding environment to create and test new machine learning algorithms
Simulations
◦ Use data structure and relationship rules to create a dataset with a
prespecified structure
◦ Allows for testing and validation of new algorithms against datasets with
known answers
◦ Useful for comparing existing algorithms with new algorithms
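A minimal simulation sketch, under assumed settings: data are generated from a known line (intercept 2, slope 3, Gaussian noise) and a least-squares fit is checked against those known answers. The parameter values, seed, and function names are illustrative choices, not taken from the slides.

```python
import random


def simulate(n, intercept, slope, noise_sd, seed=0):
    """Create a dataset whose true structure is known in advance."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


true_intercept, true_slope = 2.0, 3.0
xs, ys = simulate(500, true_intercept, true_slope, noise_sd=1.0)
est_intercept, est_slope = fit_line(xs, ys)

# The gap between estimate and truth measures how well the algorithm does.
print(abs(est_intercept - true_intercept), abs(est_slope - true_slope))
```

Running a new algorithm and an existing one on the same simulated data gives a like-for-like comparison, because the correct answer is fixed by construction.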
8. Statistics
◦ Hypothesis testing (parametric and nonparametric) and
experimental design
◦ Generalized linear models
◦ Longitudinal, time series, and survival models
◦ Bayesian methods
Mathematics
◦ Multivariable calculus
◦ Linear algebra
◦ Probability theory
◦ Optimization
◦ Graph theory/discrete math
◦ Real analysis/topology
Machine learning
◦ Technically, considered a branch of statistics
◦ Supervised, unsupervised, and semi-supervised models
◦ Serve to extend statistical models and relax assumptions on data
◦ Includes algorithms from topological data analysis and network
analysis
10. Data scientist: a professional who blends several
areas of expertise to draw insights from
disparate data sources (particularly big data)
so that inferences can be made about
specific problems/decisions within the field
of application
Data science blends statistics, machine
learning, computer science, mathematics,
and domain knowledge to leverage data for
decision-making in a given domain (business,
medical, social media…).
11. Discuss problem with leadership to understand the
problem and how results might be used.
◦ Providing a predictive algorithm that performs well but doesn’t
provide insight into the problem might not be useful.
◦ There may be related items that leadership hasn’t considered,
items that can enrich the project.
Define data that needs to be pulled.
◦ May exist in database.
◦ May need to find elsewhere.
Pull and clean data.
◦ Examine for errors or bias.
◦ Deal with missing data.
Perform analyses and interpret output.
◦ Can be supervised (fit to outcome) or unsupervised (exploratory).
◦ Typically involves visualization of important results.
Compile summary of actionable insights for leadership.
◦ Simplification
◦ Business value (no point in doing analysis if it can’t be
implemented!)
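The "pull and clean data" step can be sketched as follows. The records, the age cutoff, and the choice of mean imputation are all assumptions for illustration. Mean imputation is only one simple option, and it is reasonable only when the missingness is not itself informative.

```python
from statistics import mean

# Hypothetical pulled records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},
    {"id": 3, "age": 290, "spend": 95.0},  # implausible age: likely an entry error
    {"id": 4, "age": 51, "spend": None},
]


def clean(rows, max_age=120):
    """Examine for errors: set implausible ages to missing."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the pulled data stays untouched
        if row["age"] is not None and not (0 <= row["age"] <= max_age):
            row["age"] = None
        cleaned.append(row)
    return cleaned


def missing_rate(rows, field):
    """Share of rows where the field is missing."""
    return sum(row[field] is None for row in rows) / len(rows)


def impute_mean(rows, field):
    """Deal with missing data: fill gaps with the mean of observed values."""
    fill = mean(row[field] for row in rows if row[field] is not None)
    return [dict(row, **{field: fill if row[field] is None else row[field]})
            for row in rows]


cleaned = clean(records)
print(missing_rate(cleaned, "age"))
completed = impute_mean(impute_mean(cleaned, "age"), "spend")
print(completed)
```

Reporting the missing rate before imputing keeps the amount of filled-in data visible when results are summarized for leadership.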
12. Mathematical/Statistical Background
◦ Graduate degree, typically in mathematics/statistics,
computer science, or engineering
◦ Training in machine learning and algorithm design
◦ Experience with R and SAS statistical languages/programs
Computer Science Background
◦ Python/MATLAB/other high-level computing languages
◦ Hadoop/MapReduce concepts
◦ SQL or NoSQL coding for database extraction/management
◦ Experience with structured or unstructured data
◦ Data mining/algorithm design
Field of Application Expertise
◦ Intellectual curiosity
◦ Understanding of the industry of application (marketing,
medical, finance…)
◦ Communication skills to relate findings to non-technical
leaders
13. From a quick Indeed.com search:
◦ Allstate Insurance
◦ Sprint
◦ Twitter
◦ APS Healthcare
◦ XOR Security
◦ LinkedIn
◦ IBM
◦ Intel
◦ Roche Pharmaceuticals
◦ Amazon
◦ Capital One
14. According to NewVantage and others:
◦ 2016 revenue gained from data science is estimated at
$130.1 billion.
◦ This is expected to grow to $203 billion by 2020.
Individual company results vary according to:
◦ Team talent and expertise
◦ Data collected (and quality of data)
◦ Competitor strengths in data science.
Current and projected shortages of those with
analytics talent will impact the market.
◦ Hubs of data science are emerging outside California—
Boston, New York, Austin, Chicago, Jacksonville, Tampa,
Charlotte, Atlanta…
◦ Across industries—healthcare, tech, finance, energy…
Editor's notes
http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/
Bryant, R., Katz, R. H., & Lazowska, E. D. (2008). Big-data computing: creating revolutionary breakthroughs in commerce, science and society.
Lohr, S. (2012). How big data became so big. New York Times, 11.
Cuzzocrea, A., Song, I. Y., & Davis, K. C. (2011, October). Analytics over large-scale multidimensional data: the big data revolution! In Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP (pp. 101-104). ACM.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’? McKinsey Quarterly, 4, 24-35.
Heidema, A. G., Boer, J. M., Nagelkerke, N., Mariman, E. C., & Feskens, E. J. (2006). The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC genetics, 7(1), 23.
Draper, N. R., Smith, H., & Pownell, E. (1966). Applied regression analysis (Vol. 3). New York: Wiley.
Gopalkrishnan, V., Steier, D., Lewis, H., & Guszcza, J. (2012, August). Big data, big business: bridging the gap. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (pp. 7-11). ACM.
Bekkerman, R., Bilenko, M., & Langford, J. (Eds.). (2011). Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press.
Riesbeck, C. K. (1986). From conceptual analyzer to Direct Memory Access Parsing: an overview (chapter 8). Ellis Horwood Limited.
Berry, M. W. (1992). Large-scale sparse singular value computations. The International Journal of Supercomputer Applications, 6(1), 13-49.
Caporaso, J. G., Baumgartner Jr, W. A., Kim, H., Lu, Z., Johnson, H. L., Medvedeva, O., ... & Hunter, L. (2006). Concept Recognition, Information Retrieval, and Machine Learning in Genomics Question-Answering. In TREC.
Madden, S. (2012). From databases to big data. IEEE Internet Computing, 16(3), 4-6.
Agrawal, D., Das, S., & El Abbadi, A. (2011, March). Big data and cloud computing: current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology (pp. 530-533). ACM.