Nowadays, in-stream data analysis plays an important role in modern research and development. Cerrera is a cloud-based data stream processing platform that lets analysts concentrate on solving their research problems rather than on administrative and development tasks. Cerrera provides functionality to build a stream processing workflow from pre-defined and user-supplied modules via a web interface, to run computations over the stream, and to export data or obtain real-time visualization. Cerrera is designed to integrate with Microsoft Azure services (HDInsight, SQL, ML) and relies on modern cloud technologies.
1. Cerrera: in-stream data analytics
cloud platform
Dmitry Kalashnikov,
Artem Bartashev,
Anastasia Mitropolskaya,
Edgar Klimov,
Natalia Gusarova
ITMO University, St. Petersburg, Russia
CERRERA
2. Big Data Analytics
• Big Data Analytics examines large data sets
containing a variety of data types
• Big Data Analytics has two approaches:
– Batch processing (Hadoop) – store and process later
– Stream processing (Storm) – process on-the-fly
• Stream processing cases: IoT, Sensors, Social
Networks, Logs, Stocks, Personal devices.
3. Big Data Processing Systems
We consider three modern approaches for batch
and stream processing:
– Cloud Solutions for Machine Learning
– Stream Processing Engines
– Cloud Solutions for Stream Data Processing
4. Cloud Solutions for
Machine Learning
Allows creating and running sophisticated machine learning models
in the cloud
Strengths:
• Software as a Service
• Lots of built-in components
• Data visualization
• Visual programming
Weakness:
• Stream processing is not
supported
Examples: Azure Machine Learning
5. Stream Processing Engines
Stream Processing Engines are distributed real-time computation
systems for processing fast, large streams of data.
Examples:
Strengths:
• Stream processing support
• Fine control
Weaknesses:
• Infrastructure Management issues
• Writing lots of code
• Lack of real-time visualization
6. Cloud Solutions for
Stream Data Processing
Provide APIs and libraries to develop enterprise software for
stream processing and execute it in the cloud.
Strengths:
• Stream processing support
• Software as a Service
• Real-time scaling
Weaknesses:
• Writing code
• Lack of embedded real-time
visualization
Examples:
7. Cerrera: Overview
Cerrera makes it possible to describe a data stream processing
workflow, run the processing, and get visualized results by
interacting only with a web browser.
Features:
• Stream Processing Support
• Software as a Service
• Visualization
• Visual Programming
• Built-in components
8. Comparison between
Cerrera and others
Project / Feature                           | Stream support | Visualization | SaaS | Built-in components | Visual Programming
Cerrera                                     | Yes            | Yes           | Yes  | Yes                 | Yes
Cloud Solutions for Machine Learning        | No             | Yes           | Yes  | Yes                 | Yes
Stream Processing Engines                   | Yes            | No            | No   | Depends             | No
Cloud Solutions for Stream Data Processing  | Yes            | No            | Yes  | Depends             | No
9. Cerrera: User Interface
• The data processing workflow is a directed acyclic graph.
• Nodes are processing units.
• Edges describe data flow between nodes.
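The DAG structure above can be sketched as a small data model. Everything here (class names, string-typed units) is an illustrative assumption, not Cerrera's actual code; a topological sort doubles as an acyclicity check, since an order covering all nodes exists only when the graph has no cycle.

```java
import java.util.*;

// Sketch of a workflow as a directed acyclic graph:
// nodes are processing units, edges describe data flow.
class Workflow {
    final Map<String, String> nodes = new LinkedHashMap<>();     // node id -> unit type
    final Map<String, List<String>> edges = new HashMap<>();     // node id -> downstream ids

    void addNode(String id, String unitType) {
        nodes.put(id, unitType);
        edges.put(id, new ArrayList<>());
    }

    void connect(String from, String to) {
        edges.get(from).add(to);
    }

    // Kahn's algorithm: returns an execution order, or throws if the graph has a cycle.
    List<String> topologicalOrder() {
        Map<String, Integer> inDeg = new HashMap<>();
        nodes.keySet().forEach(n -> inDeg.put(n, 0));
        edges.values().forEach(outs -> outs.forEach(t -> inDeg.merge(t, 1, Integer::sum)));

        Deque<String> ready = new ArrayDeque<>();
        inDeg.forEach((n, d) -> { if (d == 0) ready.add(n); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.poll();
            order.add(n);
            for (String t : edges.get(n))
                if (inDeg.merge(t, -1, Integer::sum) == 0) ready.add(t);
        }
        if (order.size() != nodes.size())
            throw new IllegalStateException("workflow has a cycle");
        return order;
    }
}
```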
10. Use Case: Emotion and Finances
Compares sentiment about a company on Twitter with its stock
price in real time.
Workflow components: Twitter Streamer, Yahoo Streamer, Entity Splitter (one per stream), Text Splitter, Text Filter, Sentiment Analysis, MongoDB.
11. Cerrera: Architecture
• Coordination System manages the entire
infrastructure and the work of all
subsystems.
• Code Management System translates the
visual representation of the processing
workflow into Java code and builds it to
create an executable artifact.
• Processing System runs the workflow over
the data stream.
• NoSQL DB stores the processing workflow
and result data.
• SQL DB stores sensitive user information.
12. Future plans
• Early private access
• Teams
• Project sharing
• Marketplace
To stay up to date, subscribe at:
cerrera.org
or follow
@cerrera_project
Hello,
I’m Dmitry Kalashnikov from ITMO University. Today I’d like to tell you about Cerrera, a cloud platform for in-stream data analytics.
In the modern world, big data analysis plays a key role in scientific work. The ability to process huge amounts of available information opens the way to previously unimaginable research.
There are two main approaches to Big Data analytics: batch processing and stream processing. The Big Data community recognized the usefulness and power of batch-oriented computation quite a long time ago, and many systems such as Hadoop, YARN, or Pig were created to simplify Big Data analytics. In the last couple of years, however, in-stream processing has become an increasingly important player in the Big Data arena. Stream analytics is tremendously different from standard batch methods. We consider a stream to be an unbounded, continuous flow of heterogeneous data that must be processed on time. It is not possible to stop the stream or to collect the data and process it later, since the computation often has to finish on time. In what cases might such requirements arise?
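As a minimal illustration of this on-the-fly requirement (a sketch, not Cerrera code), each arriving element can be folded into a bounded sliding-window aggregate the moment it arrives, so nothing is buffered indefinitely:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Incremental sliding-window average over an unbounded stream:
// O(1) work and O(window) memory per element, no replay of history.
class SlidingAverage {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int size;
    private double sum = 0;

    SlidingAverage(int size) { this.size = size; }

    // Called once per incoming stream element; returns the current average.
    double accept(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) sum -= window.removeFirst();
        return sum / window.size();
    }
}
```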
First of all, acquiring data from sensors is one of the most notable use cases: the spread of ubiquitous computing and the Internet of Things has been driving research in stream processing for the last couple of years. Next, social networks are among the main producers of data on the Internet; data analysts have found that social network trends are extremely volatile, so without real-time computation it is easy to fall out of date quickly. Software logs are another important case, since it is crucial to detect failures or problems as soon as possible. Stocks obviously require an immediate response to changes.
Nowadays there are many different systems and platforms for Big Data analytics. We consider three specific modern approaches that are in some sense similar to the Cerrera platform: cloud solutions for machine learning, stream processing engines, and cloud solutions for stream data processing.
The first group is cloud solutions for machine learning. Such systems greatly simplify data analysis for a couple of reasons. First, all infrastructure issues are taken care of thanks to their Software-as-a-Service nature: a researcher does not have to think about how to run and scale the computation, where to place the processing systems, or where to store the results. These solutions also provide a great number of pre-defined machine learning and data analysis components to accelerate model development. Other strong points are data visualization and visual programming. The latter is quite important because it frees the researcher from specific, distracting questions about the underlying computation systems and allows them to concentrate on model development rather than on programming.
However, the major disadvantage of these systems is weak or absent support for stream computation, so they cannot easily be used for on-time processing.
Stream processing engines take a key position in Big Data stream analytics, since they are the basis for any modern development or research connected with in-stream computation. These technologies are fault-tolerant and scalable, and they allow writing code in several programming languages such as Python, Java, or Scala.
Using these engines, a researcher has fine control over the computation, system placement, and configuration. On the other hand, this means that all the administration issues will distract the scientist from data analysis and model building. Moreover, it is necessary to understand tricky, engine-specific programming questions about how to run the computation properly and without bottlenecks. Last but not least is the lack of visualization; it is an obvious point, but searching for additional visualization systems usually only increases the number of issues.
The third group is cloud solutions for stream data processing. Such platforms bring the power of stream processing engines into the cloud and thereby eliminate some weaknesses of the previous group.
There are no more tough questions about infrastructure; moreover, most solutions provide real-time autoscaling and computation redistribution to achieve the required performance. This removes the administration questions, but there are still issues around writing code and visualization. Researchers must now investigate the particular programming details of these platforms and, in some cases, even learn new platform-specific languages. The lack of integrated visualization is also a common issue; to deal with it, vendors advise using their other products or third-party solutions.
To address the previously mentioned weaknesses, we designed Cerrera. Cerrera is a cloud-based data stream processing platform that lets researchers concentrate on solving their scientific problems rather than on administrative and development tasks. Cerrera makes it possible to describe a data stream processing workflow, run the processing, and get visualized results by interacting only with a web browser.
To support these features, Cerrera relies on several crucial points. First, it incorporates a stream processing engine with support for real-time data processing. Second, the infrastructure is cloud-based, distributed, and fault-tolerant. Next, Cerrera has a web interface for controlling workflow execution. It also supports real-time visualization and allows the workflow to be described using a visual programming technique. Another point is a built-in set of machine learning algorithms and statistical methods, which users can also expand with their own components.
On the slide you can see a table where all the previously mentioned systems are compared by specific features: (1) stream processing support, (2) visualization, (3) SaaS, (4) built-in components, and (5) visual programming. By “Yes” we mean full support of the feature; “Depends” means partial support. For example, there are machine learning libraries for some stream processing engines, such as Spark MLlib, and some cloud solutions for stream data processing provide a few components such as an aggregation window. However, we would not call this full support of the feature.
On the slide you can see a prototype of Cerrera’s web-based user interface. Its functionality is primarily focused on building the processing workflow. Besides that, users can manage the workflow lifecycle, visualize the results of the processing, and export data.
(1) The main space is taken by the workflow construction area. Users can simply drag and drop processing units into this area and connect them. (2) Components are placed on the bottom control panel. (3) Connections between two nodes are immediately checked for consistency; for example, the types of the nodes’ inputs and outputs are verified to match. This is done to catch mistakes as early as possible. (4) Processing units can represent different elements: statistics aggregators, XML parsers, NLP modules, and so on. (5) There are also special modules, Streamers, that take data from external sources and bring it in for further processing; for example, the Twitter Streamer calls the Twitter API to get new tweets. Some processing units have specific parameters that tune their execution, for example the size of a sliding window, a regular expression, or a mathematical expression. After the user sets the required parameters, the computation can be run. We will see a particular example of a workflow a bit later.
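The consistency check on connections can be sketched roughly as follows. This is a hypothetical model, not Cerrera's actual API: the unit and type names are invented for illustration.

```java
import java.util.List;

// A processing unit declares the types it consumes and the type it produces.
record Unit(String name, List<String> inputTypes, String outputType) {}

class ConnectionChecker {
    // An edge is valid only when the consumer accepts the producer's output type.
    // A source unit (a Streamer) has no inputs and therefore cannot be a target.
    static boolean canConnect(Unit from, Unit to) {
        return !to.inputTypes().isEmpty()
            && to.inputTypes().contains(from.outputType());
    }
}
```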
(6) The lifecycle is controlled using the top buttons. When the computation is running, the user may open a visualization window with plots using the buttons at the top.
Now we will look at a particular use case that can be implemented with Cerrera. Many scientists and analysts have investigated the connection between stock price changes and people’s opinions about the companies on the market. On the slide you can see a workflow which compares people’s sentiment about a company with changes in the company’s stock price.
(1) Tweets are retrieved from the Twitter API using our predefined Twitter Streamer, and (2) stock information is obtained using the Yahoo Streamer, which takes bids and asks in real time from Yahoo Finance. (3) After that, the relevant fields are extracted from the XML or JSON using the Entity Splitter. (4, 5) The next couple of components prepare the text for sentiment analysis, (6) which is the job of the top-right component. (7) All the resulting data is saved into the NoSQL database for further visualization or export.
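The text branch of this workflow can be sketched as a chain of per-element transforms. The real units (the streamers, the trained sentiment model) are replaced here by toy stand-ins, so only the wiring idea is meant to be faithful:

```java
import java.util.List;
import java.util.function.Function;

// Toy version of the text branch: filter the text, then score its sentiment.
class Pipeline {
    static Function<String, Integer> sentimentWorkflow() {
        // Stand-in for the Text Filter unit: normalize and strip punctuation.
        Function<String, String> textFilter =
            s -> s.toLowerCase().replaceAll("[^a-z ]", "");

        // Stand-in for the Sentiment Analysis unit: a trivial word-list score.
        Function<String, Integer> sentiment = s -> {
            int score = 0;
            for (String w : s.split("\\s+")) {
                if (List.of("good", "great", "up").contains(w)) score++;
                if (List.of("bad", "poor", "down").contains(w)) score--;
            }
            return score;
        };

        // Edges in the workflow graph become function composition.
        return textFilter.andThen(sentiment);
    }
}
```

In the real platform each stage runs as a distributed processing unit rather than an in-process function, but the dataflow composition is the same idea.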
Now let’s move to the more technical side of the project. The internal Cerrera architecture is depicted on the slide.
(1) We’ve already discussed the website, so we will start with the heart of Cerrera: the Coordination System. The Coordination System manages the entire Cerrera infrastructure and the work of all subsystems. Besides that, it encompasses our own load balancer, which distributes load among homogeneous components.
(2) Together, the Code Management System, the Code Repository, and the Maven Artifact Repository orchestrate the code building process. The Code Management System translates the JSON representation of the processing workflow into Java code and builds it in order to create an executable Maven artifact. The Code Repository keeps the code of workflows and processing units; we use GitLab for this purpose. The Maven Artifact Repository, in its turn, stores the built packages of processing units and workflows for further access and reuse; in our case, Artifactory is used as the repository.
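The translation step can be illustrated with a toy generator that turns a (simplified, linear) workflow description into Java source text. The real Code Management System and the API of its generated code are not public, so every name below is hypothetical:

```java
import java.util.List;

// Toy JSON-workflow-to-Java translation: emit source that instantiates
// each unit and connects consecutive units. Real workflows are DAGs and
// come from JSON; here a plain list of unit class names stands in.
class CodeGenerator {
    static String generate(List<String> unitClasses) {
        StringBuilder src = new StringBuilder("public class GeneratedWorkflow {\n");
        src.append("    public static void build(Topology t) {\n");
        String prev = null;
        for (int i = 0; i < unitClasses.size(); i++) {
            String var = "u" + i;
            src.append("        Unit ").append(var)
               .append(" = new ").append(unitClasses.get(i)).append("();\n");
            if (prev != null)
                src.append("        t.connect(").append(prev)
                   .append(", ").append(var).append(");\n");
            prev = var;
        }
        src.append("    }\n}\n");
        return src.toString();
    }
}
```

The generated text would then be compiled and packaged (in Cerrera's case, as a Maven artifact) before being handed to the processing system.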
(3) If the Coordination System is the heart, then the Processing System is the brain of Cerrera. The Processing System runs the processing workflow over the data stream. In general, it retrieves data from external sources (such as the Twitter API or sensors), performs all the specified computations, and saves the resulting data into the NoSQL database.
(4) The NoSQL database stores the results of the processing and the workflow descriptions. We use MongoDB for this purpose, since workflows and results are represented in JSON, which is natural for MongoDB. It also gives us a great deal of performance in our case, because none of our documents references any other.
(5) The SQL database keeps user information, transactions, and other strongly structured data.
We are going to continue developing Cerrera. Our current main goal is opening an early access program for researchers who are interested in using Cerrera; we will provide a test account for anyone who wants to use Cerrera and is ready to share their user experience. Our second goal is implementing teams and project sharing. We would also like to create a marketplace for processing units where users can exchange, rate, and discuss different units. If you would like to participate in the early access program, you can subscribe to our newsletter and we will include you.