In this webinar, Charles Rich, VP of Product Management at jKool, shares their journey with DataStax: how jKool knew from the start that traditional relational databases could not meet the scalability and availability demands of time-series data, and why they turned to DataStax Enterprise for its performance and powerful enterprise search and analytics capabilities.
How jKool Analyzes Streaming Data in Real Time with DataStax
1. How jKool Analyzes Streaming Data in Real Time with DataStax
Charles Rich
VP of Product Management
jKool – jKoolcloud.com
Thank you for joining. We will begin shortly.
2. Webinar Housekeeping
All attendees placed on mute
Input questions at any time using the online interface
22. Thank you!
Input questions at any time using the online interface
More information on jKool at: jKoolCloud.com
Editor’s Notes
Choices we had to make and the architectural decisions to build a system for both real-time and historical…
For Java applications initially, with a RESTful API for any app
Open source collectors
Log4J, SLF4J, Logback, JMX, HTTP
Spark
RESTful API…
More coming…
Real-time, in-memory analytics
Operational Intelligence for machine data
Analyze & Visualize: Logs & Metrics & Transactions
Gain insight, root cause, understand application behavior
Reduce MTTR (mean-time-to-problem-resolution)
Leverage NoSQL and Open source
Deliver Operational Intelligence for machine data
Analyze your logs & metrics in real-time (& historical)
Spot patterns, trends, behavior
SaaS or On-Premise
Built ground up on Big data analytics platforms
NoSQL, STORM, Spark, Kafka
Lightweight, simple, open-source instrumentation
Improved cost/benefit
Keep developers developing and enable app support to analyze app behavior, determine causality, and resolve issues
Reduce time associated with manually analyzing logs
Improve productivity of your DevOps, Application teams
Keep developers coding…enable app support
Benefits:
Fix faster
Release sooner
Be proactive
For the Business:
Focus your time on what matters to your business
Quickly identify risks and opportunities
Learn what’s important – what you didn’t know…
Exploit hidden & perishable insights
Turn machine data into insight
Detect preventable losses…
if you knew, you could act now…
Know your application and how it is used
Just deployed a new feature? Are people using it? Was it worth the cost?
Real-time means analyzing before data is persisted…
We created FatPipes to manage this, built around STORM/Spark with Kafka/JMS as the messaging infrastructure
Process data without waiting for a write to complete – no disk I/O; split the stream and analyze it in flight
Two parallel architectures: one to handle historical data and one for real-time
(eventually… both real-time and historical must reconcile)
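The split-before-persist idea can be sketched in plain Java. This is only an illustration – the actual pipeline uses STORM/Spark with Kafka/JMS as the messaging layer, and the class and method names here are invented for the sketch:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: each incoming event fans out to a real-time analysis path and
// a historical write path in parallel, so analysis never waits on disk I/O.
// The queues are stand-ins for the Kafka/JMS infrastructure.
public class SplitPipeline {
    final BlockingQueue<String> realtimePath = new LinkedBlockingQueue<>();
    final BlockingQueue<String> historicalPath = new LinkedBlockingQueue<>();

    // Fan each incoming event out to both paths; neither blocks the other.
    public void ingest(String event) {
        realtimePath.add(event);
        historicalPath.add(event);
    }

    // Real-time side: analyze in memory, no disk I/O (stand-in for CEP).
    public int analyzeAll() {
        int alerts = 0;
        String e;
        while ((e = realtimePath.poll()) != null) {
            if (e.contains("ERROR")) alerts++;
        }
        return alerts;
    }

    // Historical side: in the real system this would be a batched write
    // to the store; here we just count the events drained.
    public int persistAll() {
        int written = 0;
        while (historicalPath.poll() != null) written++;
        return written;
    }

    public static void main(String[] args) {
        SplitPipeline p = new SplitPipeline();
        p.ingest("INFO start");
        p.ingest("ERROR timeout");
        p.ingest("INFO done");
        System.out.println(p.analyzeAll() + " alert(s), "
                + p.persistAll() + " event(s) persisted");
    }
}
```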
User interacts with Real-time via JKQL (jKool Query Language)
An English-like query language for analyzing data in motion and at rest.
“Subscribe” verb for real-time updates
Clustered computing was selected to scale with the demands of the workload
STORM – distribution of CEP (also helpful for distributing data to specific tasks, conditionally)
JMS/Kafka for distributing data amongst nodes in our real-time grids
CEP for processing streams and publishing results to clients via JMS/Kafka
Spark jobs will crunch the data and then write back to Cassandra
Created our own micro-services architecture (FatPipes) which runs on top of:
STORM/JMS/Kafka
FatPipes can be embedded or distributed
Real-time Grid
Feeds tracking data and real-time queries to CEP and back
In our experience, customers didn’t know what they needed to store until they actually needed it…but by then it is too late…hence, store everything…
The architectural requirements for historical data are very different from real-time: deliver fast response times and provide user-defined KPIs
Scale must be there for interaction as data comes in – not how many TBs you can analyze, but how fast you can go and keep up with the data streams
Can’t build everything, so to accelerate time to market, how much open-source could we leverage?
For elasticity, we can add nodes horizontally.
Can’t test all possibilities and then select…not agile…not enough time.
Long-term analytics needs differ from real-time; weed out whatever would slow real-time down
Providing this as a service, estimating capacity is also a challenge.
We are on DSE 4.6.5 and going shortly to 4.7 (today is: 10.13.15 …)
We tried using CQL
CQL (Cassandra Query Language)
Ad-hoc queries would be very hard, as CQL query capabilities are very limited. We would need to define tables and indexes for every possible query permutation, and the user would need to know the event_id – too much to be usable
Too slow. We only use CQL for admin tasks
Lucene addresses the above problem, but adds its own issues.
We started with Lucene and did inline inserts and the time to index was too long
For each Cassandra insert, we had to write a Lucene doc…since Lucene has no in-place update, we had to read, delete, and then write – a series of batch operations, too slow for our real-time goals
Solr helped with this – we write to Cassandra and Solr handles the indexing (automagically) for us. Solr is a web app on top of Lucene
We do use Solr indexes
jKQL does invoke Solr queries. But we needed to enhance this: as a multi-tenant solution, we pass our repository_id to ensure we get only the data appropriate to that tenant.
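A hypothetical sketch of that tenant scoping: Solr’s standard fq (filter query) parameter restricts results to one repository_id. jKool’s actual query rewrite is internal; the method here is invented purely for illustration:

```java
// Sketch: AND a tenant filter onto every Solr query so a query only
// ever sees that tenant's data. repository_id is the tenant key;
// fq is Solr's standard filter-query parameter.
public class TenantQuery {
    public static String scope(String solrQuery, String repositoryId) {
        return solrQuery + "&fq=repository_id:" + repositoryId;
    }

    public static void main(String[] args) {
        System.out.println(scope("q=severity:ERROR", "tenant42"));
        // q=severity:ERROR&fq=repository_id:tenant42
    }
}
```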
We use 3 nodes in a Cassandra Cluster and data ingested is replicated to Solr clusters with 3 nodes (they have both Cassandra and Solr)
Data-at-rest – we can ingest as fast as Cassandra can handle it, using eventual consistency. The data is distributed across Cassandra and Solr.
We use the DataStax Java driver. Data is written to a coordinator node, which handles distribution to the other nodes.
Quorum means 1/2 + 1. You would say that "we use consistency level "quorum" for queries", which means half +1 of the replicas must respond.
Like if you were taking a vote and in order for the vote to be valid, you need a quorum of members to be present.
Has the same meaning here. If your replication factor is 3, you need 2 of the 3 nodes to respond.
Quorum is replicas/2 + 1 using integer division: half of 3 is 1 (1.5 truncated), plus 1 = 2
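The integer-division arithmetic spelled out as one small method (not jKool code, just the rule):

```java
// Cassandra's quorum rule: integer division of the replication
// factor by 2, plus 1.
public class Quorum {
    // For replicationFactor = 3: 3 / 2 = 1 (1.5 truncated), + 1 = 2
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(quorum(3)); // prints 2
    }
}
```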
We use consistency level ONE for writes
For CQL reads (admin tasks) we use QUORUM
All other reads use Solr - the jKQL queries you see on dashboard are all coming from Solr.
STORM for ingesting
Spark for processing data (compute framework)
Simple to setup & scale for clustered deployments
Scalable, resilient, fault-tolerant (easy replication)
Ability to have data automatically expire (TTL – necessary for our pricing model)
Configurable replication strategy
Great for heavy write workloads
Write performance was better than with Hadoop.
Insert rate was of paramount importance for us – getting data in as fast as possible was our goal
Java driver balances the load amongst the nodes in a cluster for us (master-slave would never have worked for us)
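The peer-to-peer balancing can be pictured as simple round-robin over the node list. The real driver’s load-balancing policies are more sophisticated (token-aware, latency-aware); this class is invented only to show why no single master becomes a bottleneck:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of peer-to-peer load balancing: requests rotate
// across all nodes instead of funneling through a single master.
public class RoundRobin {
    private final List<String> nodes;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobin(List<String> nodes) {
        this.nodes = nodes;
    }

    // Wrap around the node list; every node gets an equal share.
    public String pick() {
        return nodes.get(Math.floorMod(next.getAndIncrement(), nodes.size()));
    }

    public static void main(String[] args) {
        RoundRobin rr = new RoundRobin(List.of("node1", "node2", "node3"));
        for (int i = 0; i < 4; i++) {
            System.out.println(rr.pick());
        }
        // node1, node2, node3, node1
    }
}
```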
Solr provides a way to index all incoming data - essential
DSE provides a nice integration between Cassandra and Solr