This document discusses tuning Solr for log search and analysis. The author describes optimizing a baseline Solr configuration through techniques like time-based collections, DocValues fields, commit settings and hardware choices. Significant capacity gains were found, such as a 31x increase over baseline by using time-based collections with a 10 minute window. The author also recommends using specialized event processing tools like Apache Flume to parallelize and distribute indexing load for further throughput improvements.
4. Tuning. Is it worth it?
                 baseline   last run
# of logs        10M        310M
EC2 bill/month   700        450
5. What to optimize for?
capacity: how many logs the same hardware can keep while still providing decent performance
http://www.seasonslogs.co.uk/images/products/SL_001.png
https://openclipart.org/image/300px/svg_to_png/169833/Server_1U.png
6. What's decent performance? “It depends”
Assumptions
indexing: enough to keep up with generated logs*
search concurrency
search latency: 2s for debug queries, 5s for charts
*account for spikes!
8. Baseline test
15GB heap
debug query
status:404 in the last hour
charts query
all time status counters
all time top IPs
user agent word cloud
http://blog.sematext.com/2013/12/19/getting-started-with-logstash/
12. Baseline result
[Chart: EPS, debug and charts series vs. number of indexed logs (100K to 10M); y-axis 0 to 12000]
capacity
bottleneck: facets eat CPU
on average, CPU is OK
13. Baseline result
[Chart: EPS, debug and charts series vs. number of indexed logs (100K to 10M); y-axis 0 to 12000]
indexing limited because the Python script eats feeder CPU
capacity
bottleneck: facets eat CPU
on average, CPU is OK
14. Indexing throughput: is it enough?
“it depends”
how long do you keep your logs?
1M logs/day * 10 days vs. 0.3M logs/day * 30 days: both need about 10M capacity
1M logs/day * 30 days? Needs 3 servers, each getting 0.3M logs/day
Baseline run: 10M index fills up in <1/2h at 7K EPS
15. Indexing throughput: is it enough?
how big are your spikes? (assumption: 10x regular load)
7K EPS is enough for 10M capacity if you keep logs >5h
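The retention arithmetic on these two slides can be sketched in a few lines of Python. The function names are made up for illustration, and the 10x spike factor is the assumption stated on the slide; treat this as a back-of-the-envelope check rather than a sizing tool.

```python
import math


def fill_time_hours(capacity_logs, eps):
    """Hours needed to fill an index of capacity_logs at a sustained eps."""
    return capacity_logs / eps / 3600


def servers_needed(logs_per_day, retention_days, capacity_per_server):
    """Servers required to hold retention_days worth of logs, rounded up."""
    return math.ceil(logs_per_day * retention_days / capacity_per_server)


def min_retention_hours(capacity_logs, peak_eps, spike_factor):
    """Shortest retention for which spikes still fit within peak_eps.

    Assumes the retained logs fill exactly capacity_logs, so the regular
    load is capacity_logs / retention, and a spike is spike_factor times
    that regular load.
    """
    regular_eps = peak_eps / spike_factor
    return capacity_logs / regular_eps / 3600


if __name__ == "__main__":
    # Numbers from the slides: 10M capacity, 7K EPS, 10x spikes.
    print(fill_time_hours(10_000_000, 7_000))
    print(servers_needed(1_000_000, 30, 10_000_000))
    print(min_retention_hours(10_000_000, 7_000, 10))
```

With the slide's numbers this gives roughly 0.4 hours to fill the index, 3 servers for 1M logs/day kept for 30 days, and a minimum retention of about 4 hours, which is in line with the "<1/2h" and ">5h" figures above.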
16. Rare commits
[Chart: EPS, debug and charts series vs. number of indexed logs (1.5M to 11M); y-axis 0 to 8000]
10% above baseline
auto soft commits every 5 seconds
auto hard commits every 30 minutes
ramBufferSizeMB=200; maxBufferedDocs=10M
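As a rough illustration of the commit intervals above, the sketch below builds a request body for Solr's Config API (a POST to /solr/<collection>/config). This assumes a Solr version that ships the Config API; on older versions the same values go into solrconfig.xml, which is probably what the test runs used. The function name is hypothetical. The RAM buffer and maxBufferedDocs settings live in the indexConfig section of solrconfig.xml and are not covered here.

```python
import json


def commit_settings_payload(soft_commit_seconds, hard_commit_minutes):
    """Build a Config API set-property body; Solr expects milliseconds."""
    return {
        "set-property": {
            "updateHandler.autoSoftCommit.maxTime": soft_commit_seconds * 1000,
            "updateHandler.autoCommit.maxTime": hard_commit_minutes * 60 * 1000,
        }
    }


if __name__ == "__main__":
    # Values from the slide: soft commit every 5 seconds, hard commit every 30 minutes.
    print(json.dumps(commit_settings_payload(5, 30), indent=2))
```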
17. Same results with
even rarer commits (auto-soft every 30s, 500MB buffer)
omitNorms + omitTermFreqAndPositions
larger caches
cache autowarming
THP disabled
mergeFactor 5
mergeFactor 20
but indexing was cheaper
manually ran queries, too
18. DocValues on IP and status code
[Chart: EPS, debug and charts series vs. number of indexed logs (1.5M to 12M); y-axis 0 to 8000]
20% above baseline
19. Detour: what if user agent was string?
[Chart: EPS, debug and charts series vs. number of indexed logs (3M to 36M); y-axis 0 to 8000]
3.6x baseline
20. … and if user agent used DocValues?
[Chart: EPS, debug and charts series vs. number of indexed logs (8M to 70.5M); y-axis 0 to 8000]
6.7x baseline
reducing indexing adds 5% capacity
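The gains quoted on the last few slides are all relative to the 10M-log baseline capacity. A minimal sketch of that arithmetic follows; the dictionary keys are just labels chosen here, and the multipliers are the ones stated on the slides.

```python
BASELINE_CAPACITY = 10_000_000  # logs, from the baseline run

# Multipliers over baseline as quoted on the slides.
GAINS = {
    "rare commits": 1.10,
    "DocValues on IP and status code": 1.20,
    "user agent as string": 3.6,
    "user agent with DocValues": 6.7,
}


def capacity(multiplier, baseline=BASELINE_CAPACITY):
    """Capacity in logs for a given multiplier over the baseline."""
    return round(baseline * multiplier)


if __name__ == "__main__":
    for name, multiplier in GAINS.items():
        print(f"{name}: {capacity(multiplier) / 1e6:.1f}M logs")
```

These values match the right-hand ends of the chart axes: 11M, 12M, 36M, and 67M (the last axis continues to 70.5M, consistent with the extra 5% from reduced indexing).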
27. Monthly EC2 cost per 1M logs*
m3.2xlarge: $1.3
r3.2xlarge: $1.33
c3.2xlarge: $1.78
TODO (a.k.a. truth always messes with simplicity):
more/expensive facets => more CPU => c3 looks better
less/cheap facets => not enough instance storage
=> EBS (magnetic/SSD/provisioned IOPS)?
=> storage-optimized i2?
=> old-gen instances with magnetic instance storage?
use different instance types for “hot” and “cold” collections?
*on-demand pricing at 2014-11-07
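The "cost per 1M logs" metric is simply the monthly bill divided by capacity in millions of logs. The sketch below applies it to the baseline and last-run figures from slide 4, since the per-instance prices and capacities behind the numbers above are not in this transcript. The function name is hypothetical, and the currency is assumed to be the same on both slides.

```python
def cost_per_million_logs(monthly_bill, capacity_logs):
    """Monthly cost of keeping one million logs searchable."""
    return monthly_bill / (capacity_logs / 1_000_000)


if __name__ == "__main__":
    # Slide 4 figures: baseline 700/month for 10M logs, last run 450/month for 310M logs.
    print(cost_per_million_logs(700, 10_000_000))
    print(cost_per_million_logs(450, 310_000_000))
```

On those inputs the baseline costs 70 per 1M logs and the last run about 1.45, the same order of magnitude as the per-instance figures on this slide.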
28. How NOT to build an indexing pipeline
custom script:
reads apache logs from files
parses them using regex
takes 100% CPU and 100% RAM from a c3.2xlarge instance
maxes out at 7K EPS
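For a sense of what such a script does for every line, here is a minimal sketch of regex-based parsing of the Apache combined log format. It is not the author's script; the pattern, field names and sample line are illustrative assumptions.

```python
import re

# One regex per line: every field is extracted by backtracking matching,
# which is where the CPU goes when this runs thousands of times per second.
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = ('10.0.0.1 - - [07/Nov/2014:10:00:00 +0000] '
              '"GET /missing HTTP/1.1" 404 512 "-" "Mozilla/5.0"')
    print(parse_line(sample))
```

A single-process script like this has nowhere to go once one core is saturated, which is the motivation for the tools on the next slides.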
29. Enter Apache Flume*
agent.sources = spoolSrc
agent.channels = solrChannel
agent.sinks = solrSink

# source
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /var/log
agent.sources.spoolSrc.channels = solrChannel

# channel
agent.channels.solrChannel.type = file

# sink (put Solr and Morphline jars in lib/)
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.channel = solrChannel
agent.sinks.solrSink.morphlineFile = conf/morphline.conf
agent.sinks.solrSink.morphlineId = 1

*Or Logstash. Or rsyslog. Or syslog-ng. Or any other specialized event processing tool
30. morphline.conf (think Unix pipes)
morphlines : [
  {
    # same ID as in the flume.conf sink definition
    id : 1
    commands : [
      # process one line at a time (there's also readMultiLine)
      { readLine { charset : UTF-8 } }
      {
        # parses each property (eg: IP, status code) in its own field
        grok {
          # https://github.com/cloudera/search/blob/master/samples/solr-nrt/grok-dictionaries/grok-patterns
          dictionaryFiles : [conf/grok-patterns]
          expressions : {
            message : """%{COMBINEDAPACHELOG}"""
          }
        }
      }
      # Solr can do it, too*
      { generateUUID { field : id } }
      {
        loadSolr {
          solrLocator : {
            collection : collection1
            # use zkHost for SolrCloud
            solrUrl : "http://10.233.54.118:8983/solr/"
          }
        }
      }
    ]
  }
]
*http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/
32. 2.4K EPS is typically enough for this
[Diagram: three application servers, each running its own Flume agent]
scales nicely with # of servers
but all buffering and processing is done here (on the application servers)
33. but not for this
[Diagram: three application servers, each running its own Flume agent, plus two standalone Flume agents]
centralized buffering and processing
34. or this
[Diagram: three application servers, each running its own Flume agent, plus three standalone Flume agents]
buffer, then process (separately)
37. More throughput? Parallelize
Depends* on the bottleneck
source
more threads (if applicable): Source1 -> C1
more sources: Source1, Source2 -> C1
channel
multiplexing channel selector: Source1 -> C1, C2
sink
more threads (if applicable): C1 -> Sink1
load balancing sink processor: C1 -> Sink1, Sink2
*last time I use this word, I promise
39. TODO: log in JSON where you can
Then, in morphline.conf, replace the grok command with the much lighter:
readJson {}
Easy with apache logs, maybe not for other apps:
LogFormat "{ \
  \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \
  \"message\": \"%h %l %u %t \\\"%r\\\" %>s %b\", \
...
  \"method\": \"%m\", \
  \"referer\": \"%{Referer}i\", \
  \"useragent\": \"%{User-agent}i\" \
}" ls_apache_json
CustomLog /var/log/apache2/logstash_test.ls_json ls_apache_json
More details at:
http://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/
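To illustrate why JSON logs are cheaper to process, here is a minimal Python sketch of the consumer side: each line is decoded with a standard JSON parser and no pattern matching is needed. The sample line and its values are made up; the field names follow the LogFormat above.

```python
import json


def parse_json_line(line):
    """Decode one JSON log line; return None if it is not valid JSON."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    sample = ('{"@timestamp": "2014-11-07T10:00:00+0000", '
              '"message": "10.0.0.1 - - GET /missing 404 512", '
              '"method": "GET", "referer": "-", "useragent": "Mozilla/5.0"}')
    print(parse_json_line(sample))
```

Fields such as method and useragent arrive already separated, so the indexing pipeline only has to map them to Solr fields.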
40. Conclusions
Use time-based collections and DocValues
Rare soft&hard commits are good
Pushing them too far is probably not worth it
Hardware: test and see what works for you
A balanced, SSD-backed machine (like m3) is a good start
Use specialized event processing tools
Apache Flume is a fine example
Processing and buffering on the application server side scales better
Buffer before [heavy] processing
Mind your batch sizes, buffer types and parallelization
Log in JSON where you can
41. Thank you!
Feel free to poke me @radu0gheorghe
Check us out at the booth, sematext.com and @sematext
We're hiring, too!