On GitHub: sammcj/Search---A-Journey-of-Delivery-on-a-Budget

http://ixa.io

SAM: This is about our journey implementing a modern search product - on a tight budget

Technology for social justice.
Search: A Journey of Delivery on a Budget
Sam McLeod [ Operations ]
Ricky Cook [ Web Development ]
  • RICKY: Non-profit that delivers technology for social justice
  • RICKY: Work to increase technical competence and solve social and community issues with tech
  • We work on the technical side to DELIVER & HOST a number of web applications to aid other organisations

Who is @infoxchange?

Infoxchange is a not-for-profit community organisation that delivers technology for social justice.
We work to strengthen communities and organisations, using information technology as the primary tool to create positive social change.
  • Web Applications
  • IT Efficiency Consulting
  • Community Empowerment
  • SAM: Diverse environment mix of languages and platforms
  • We run at approximately 4000 - 8000 requests per minute across our apps

Our Products

  • 20 Web Applications (Django, Perl & PHP)
  • 300+ hosted CMS websites
  • Application RPM = 4000 - 8000
Budget: Tight
  • RICKY: Brief rundown of our offerings
  • eRef: specialist referrals, eg go to GP, referral to hospital radiology
  • eWait: aged care queues
  • Service coord: care plans/how to coordinate care between specialists
  • QIPPS: project pre-planning, what do you want to achieve? target?, slip slop slap
  • Patient management: Make sure that patient information is readily, easily accessible
  • Drupal: Give small community orgs web presence

Web Applications

Electronic Referral
Electronic Waitlist
Service Coordination
Health Project Management
Patient Management
Content Management

Specialising In

Homelessness
Mental Health
Eldercare
Welfare
  • SAM: We've been in digital search since 1991
  • Think about what finding services was like in 1990...

Rich Search History

Infoxchange established the first searchable electronic directory of community services in Australia in 1990
Since the 90's Service Seeker has included:
  • Name search
  • Keyword search
  • Regional maps
  • Vacancy search
  • RICKY: 1990: Archie, the world's first search engine aka 'query form', was launched
  • Index FTP filenames
  • McGill University, Montreal

1990 - Archie

RICKY: 1991: Service Seeker was launched by Infoxchange

1991 - ServiceSeeker

RICKY: 1995: For perspective, this was when Google first appeared

1995 - Google

  • RICKY: 1996: Service Seeker 2 was released (and it's still running now)
  • Only minor cosmetic changes (mostly looks the same)
  • Powers lifeline call centre

1996 - ServiceSeeker 2

RICKY: 2014: This year we have announced Service Seeker 3, in the form of an API

2014 - ServiceSeeker 3 (API)

  • I'm going to talk about the Ops journey
  • Our Environment
  • Performance
  • Monitoring
  • Lessons Learnt
  • and then I'll hand you over to Ricky - talk through the dev side of things

The Ops Journey

Environmental Constraints

  • Small Ops team, 4 people working across all our products
  • We're a non-profit so we have to be careful with our money
  • Relatively low-end hardware
  • No search / Java experience
  • When we started out on the journey we had LOTS of technical debt that needed to be paid down
  • and an elderly network

Environmental Constraints

  • Small Operations Team
  • Slow SAN Performance
  • Limited Hardware Resources
  • No Search experience & limited Java experience
  • LOTS of Technical Debt
  • Elderly Network
  • But we've learnt to leverage our strengths...
  • We run modern platforms
  • Our Dev and Ops teams work together
  • We don't really have much internal bureaucracy getting in the way
  • We're an open source shop
  • We just rolled out a sexy new network
  • and most of all, our people

Environmental Strengths

  • Modern Tech: BTRFS, Docker, ZRAM & Kernel 3.14
  • Strong DevOps and Agile Practices
  • No Bureaucracy == Agility
  • Open Source Architecture
  • Newly Deployed Network
  • Hardworking, Passionate People

Highly Diverse Environment

  • I mentioned we have a highly diverse environment
  • I love this graphic...
  • It's exactly what we need to be careful to avoid
  • I didn't want ElasticSearch to appear here
  • Its deployment must be agile and manageable
  • 3 node clusters
  • Debian shop
  • 24 Core, 30GB clusters
  • XenServer for virtualisation works very well for us

ES Hosts

  • 3 Node Clusters
  • OS = Debian 7 (Wheezy)
  • CPU = 8x E5-2660 0 @ 2.20GHz (24 Core Cluster)
  • RAM = 10GB (30GB Cluster)
  • Virt = XenServer 6.2
  • Kernel = linux-3.14-amd64
  • Some general Elasticsearch deployment tweaks
  • Make sure your heap size is set to half your RAM
  • As we're not running fancy physical hardware
  • We don't have SSDs etc - we try to get blood from a stone
  • We did some tuning to allow for FASTER SHARD RECOVERY after upgrading / failing a node
  • Don't let your hosts SWAP
  • RICKY: ulimit: better to die than be killed
  • Don't let people delete all your nice things
  • Keep-alives were a pain

ES Performance & Tweaks

Heap size = half the available RAM:
ES_HEAP_SIZE = 5g
Faster shard recovery:
routing.allocation.node_initial_primaries_recoveries = 4
routing.allocation.node_concurrent_recoveries        = 15
recovery.max_bytes_per_sec                           = 100mb
recovery.concurrent_streams                          = 5
Attempt to lock the process address space so it won't be swapped:
mlockall                    = true
Stop people from deleting all your nice things:
disable_delete_all_indices  = true
Ensure kernel keep-alive timeouts < router timeouts!
  • We had BIG problems with Java's GC - it would lag Elasticsearch and queries would time out ...
  • "Stop-the-world garbage collection and blocking for disk IO can cause runtime latencies on the order of seconds to minutes."
  • I'm certainly no expert with Java, but after some research our solution was to garbage collect more often, which has a much smaller impact on blocking performance

Taking out the garbage

"Stop-the-world garbage collection and blocking for disk IO can cause runtime latencies on the order of seconds to minutes."
Garbage collecting more often
=
less 'hit' to performance when it occurs
s/CMSInitiatingOccupancyFraction=[0-9]* /CMSInitiatingOccupancyFraction=35/g
/usr/share/elasticsearch/bin/elasticsearch.in.sh

Monitoring

  • At Infoxchange we LOVE monitoring, I think it's something we do quite well
  • I have a few slides on this; I'll go through them quickly, but I think it's really important
  • ElasticHQ has been helpful for spotting obvious performance bottlenecks
Royrusso/ElasticHQ
Marvel is GREAT... BUT it is heavy and it's no longer free

Monitoring

Elasticsearch/Marvel
  • Powerful Cluster Insights
  • Easily Visualise Cluster Performance
  • Puts Metrics In Context
  • Creates Very Large Indexes
  • Doesn't Clean Up After Itself
  • No Longer Free
  • If you're already monitoring that everything is alive & well...
  • And you're returning that Information to Nagios...
  • Why not use & graph the performance metrics returned by those checks for trending?

Monitoring

Nagios - PNP4Nagios
Aggregates Nagios Performance Information
We monitor the state of our elasticsearch clusters with Nagios and dashboard any state changes

Monitoring

Nagios + check_mk multisite
Cross-Site Elasticsearch Cluster Information
We even send XMPP (Jabber) alerts to our chat rooms for certain events

Monitoring

Nagios + XMPP (Jabber)
Realtime Cluster Notifications
  • Let's talk briefly about logs
  • We pass all our app logs to Logstash
  • Logstash does some magic then stores them in elasticsearch
  • Our devs can see those logs using Kibana (web interface to ES)
  • It's not as good as Splunk, but it's pretty decent

ES + Logstash = Splunk4Free

(Almost)
  • Just like Elasticsearch itself, we deploy & control Logstash using Puppet
  • Both Elasticsearch & Logstash have excellent (official) puppet modules
  • Logstash deployment looks a bit like this ...

Logstash

Puppet Managed Logstash Config
  ixalogstash::input { 'syslog-docker':
    input_type     => 'syslog',
    type_tag       => 'docker',
    input_port     => 5550,
  }
  ixalogstash::input { 'syslog-nginx':
    input_type     => 'syslog',
    type_tag       => 'nginx',
    input_port     => 5551,
  }
  ixalogstash::filter { 'syslog-nginx':
    type_tag       => 'nginx',
    filter_match   => '%{COMBINEDAPACHELOG} %{QS:vhost}',
  }
  ixalogstash::output { "${elastic_host}":
    output_type    => 'elastichttp',
  }
https://github.com/elasticsearch/puppet-logstash
The short of it is that we've learnt some lessons...

Lessons Learnt

  • Small Budget = Creative Solutions
  • Puppet is great for managing Elasticsearch & Logstash
  • Bulk indexing can be slow (more from ricky soon...)
  • Test for split-brain (before it happens)
  • Modern OS & kernel helps
  • Java GC is painful

Lessons Learnt

  • Puppet is great for managing Elasticsearch
  • Small budget = creative solutions
  • Bulk indexing can be slow
  • Test for split-brain (before it happens)
  • Modern OS & kernel helps
  • Java GC is painful
  • Kernel keep-alive timeouts < routers timeouts
  • RAM is cheap, buy lots of it

The Dev Journey

It "worked" in dev :/

ISS2

  • Custom indexing (not just full text search!)
  • Stemming (sort of)
  • Keywords/Synonyms
  • Area search (not quite GIS)
  • Gunicorn/nginx, no Apache
  • TastyPie is a RESTful framework
  • ElasticUtils is like an ORM for ES

ISS3

(ElasticUtils)
... and Tasty Pie (which has no logo)
  • GIS
  • Private
  • Query strings
  • Keywords
  • Cascaded data

The Hard Parts

  • GIS searches
  • Private data
  • Query strings
  • Keywords (I KNOW RIGHT?!)
  • Suggestions
  • Cascaded data (slow indexing)
Mashup of ElasticUtils and TastyPie

ElasticPie

... for consuming, not eating
  • Cover this quickly
  • Django routes to TastyPie
  • TastyPie handles RESTful
  • obj_get_list builds query
  • Query string parser returns query chunks
  • Queries ES and we get JSON doc source
  • Dehydrate adds dynamic fields, removes _id etc
  • TastyPie adds meta, returns 200/etc
  • Must match in area perfectly
  • Must fall back to matching by distance
  • Country vs city is tricky, because you have to normalise acceptable distance against the search area
  • In the country, 100km is acceptable
  • In the city, 10km is acceptable
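Pulling the ElasticPie request flow together, here is a minimal runnable sketch. Every name is an illustrative stub, not Infoxchange's code - the real logic lives in TastyPie's `obj_get_list` and a custom query-string parser.

```python
# Illustrative stubs for the ElasticPie flow: parse the query string,
# query Elasticsearch, dehydrate each hit, then add TastyPie-style meta.

def parse_query_string(q):
    """Turn the raw query into an Elasticsearch query body (stub)."""
    return {"match": {"_all": q}}

def search(es_query):
    """Stand-in for the Elasticsearch call; returns canned hits."""
    return [{"_id": "1", "_source": {"name": "Richmond GP Clinic"}}]

def dehydrate(hit):
    """Add dynamic fields and drop internals like _id."""
    doc = dict(hit["_source"])
    doc["resource_uri"] = "/api/v3/service/1/"  # example dynamic field
    return doc

def obj_get_list(q):
    """TastyPie-style entry point: parse, query, dehydrate, add meta."""
    objects = [dehydrate(h) for h in search(parse_query_string(q))]
    return {"meta": {"total_count": len(objects)}, "objects": objects}
```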

Geospatial Search

  • inverse document frequency
  • boosts terms that appear less often
  • GIS is not boosted
  • function_score query combines queries with a given function
  • Decay functions like gauss, etc
  • Set score_mode to whatever suits you (sum or max; we use sum)
  • Set boost_mode to multiply (default)
  • Can match on things like price, etc

GIS and IDF

Like oil and water

idf(t) = 1 + log( numDocs / (docFreq + 1) )
Use function_score! It ignores IDF and query normalization completely
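The city-vs-country problem above is what the gauss decay handles: the score drops off smoothly with distance, and the scale parameter sets how fast. A minimal re-implementation of Elasticsearch's documented decay formula for illustration (not Infoxchange's code):

```python
import math

def gauss_decay(distance_km, offset_km=0.0, scale_km=10.0, decay=0.5):
    """Elasticsearch-style gauss decay: the score equals `decay` once a
    document sits `scale` beyond `offset` from the search origin."""
    sigma_sq = -scale_km ** 2 / (2.0 * math.log(decay))
    adjusted = max(0.0, distance_km - offset_km)
    return math.exp(-(adjusted ** 2) / (2.0 * sigma_sq))

# City search: 10 km is acceptable  -> scale_km=10
# Country search: 100 km is acceptable -> scale_km=100
```

Combined with the text query in a function_score query (boost_mode multiply), a far-away match still surfaces when nothing is nearby, but close matches win.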
  • key words to pick out
  • or not....
  • multiple suburbs, same name
  • multi word states
  • landmarks
  • multiple different keywords to parse
  • sometimes no keywords

Intelligent Query Strings

  • doctor in richmond
  • doctor near richmond
  • doctor richmond
  • doctor near richmond nsw
  • doctor near richmond new south wales
  • doctor near richmond town hall
  • doctor around richmond town hall that speaks chinese
  • female doctor around richmond qld that speaks chinese
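A toy version of the "what vs where" split such a parser has to make (a hypothetical sketch, not the ISS3 parser), which also shows where the ambiguity comes from:

```python
import re

# Connectives that *might* separate the "what" from the "where".
CONNECTIVES = re.compile(r"\b(in|near|around|at)\b", re.IGNORECASE)

def split_query(text):
    """Naively split a query into (what, where); `where` is None when
    no connective is found, e.g. "doctor richmond"."""
    parts = CONNECTIVES.split(text, maxsplit=1)
    if len(parts) == 3:
        what, _, where = parts
        return what.strip(), where.strip()
    return text.strip(), None
```

split_query("doctor near richmond") gives ("doctor", "richmond"), but split_query("blood in banks") gives ("blood", "banks") even when the user meant blood banks - which is exactly the problem with naive splitting.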
  • Any family in Banks before I insult it?
  • blood banks
  • blood in banks
  • top search is needles exchange, because many local councils statically link to this search
  • needles exchange -> needles, TAS
  • the rocks/the rock

Banks, ACT

  • blood banks
  • blood banks in banks
  • needles exchange (thank you Needles, TAS)
  • The Rocks, NSW... also The Rock, NSW

Keywords

How is this even a problem?
  • "new south wales" is not 3 tokens
  • some things should match better when in order: allied health services, primary school, alternative therapy

Keywords

  • New South Wales
  • Allied Health Services
  • Primary school
  • Alternative therapy
(Ab)use Synonym filters
  "type": "synonym",
  "synonyms": [
    "new south wales => new___south___wales,nsw",
    "primari school => primari school,primary___school",
    "altern therapi => altern therapi,alternative___therapy"
  ]
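In context, that filter sits at the end of an analysis chain, after lowercasing and stemming - which is why its inputs are stemmed forms like "primari" and "altern". A sketch of plausible index settings (the analyzer wiring and names are an illustrative guess; only the synonyms come from the slide):

```python
import json

# Hypothetical index settings; the synonym entries are from the slide,
# the surrounding analyzer chain is an assumed arrangement.
settings = {
    "analysis": {
        "filter": {
            "phrase_synonyms": {
                "type": "synonym",
                "synonyms": [
                    "new south wales => new___south___wales,nsw",
                    "primari school => primari school,primary___school",
                    "altern therapi => altern therapi,alternative___therapy",
                ],
            }
        },
        "analyzer": {
            "keyword_phrases": {
                "type": "custom",
                "tokenizer": "standard",
                # Synonyms run last, so they see stemmed tokens.
                "filter": ["lowercase", "porter_stem", "phrase_synonyms"],
            }
        },
    }
}
```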
  • comp: ONLY a prefix suggester
  • comp: More like a completer for a list of tags
  • term: Only does "fuzzy" spell correction
  • term: Type "doct", more likely to suggest a 4 letter word with 3 letters changed than "doctor"
  • comp/term: Only for single words; "new s" won't come up as "new south wales"
  • phra: Multiple words/whole phrase (in place replacement)
  • phra: Fuzzy corrections
  • phra: Completion
  • phra: Replaces anywhere in the string; more like a "did you mean"

Suggestions

  • Completion suggester
  • ✘ NOPE
  • Term suggester
  • ✘ NOPE
  • Phrase suggester
  • ✔ YES
  • Just kidding, ✘ NOPE
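For reference, the phrase suggester that came closest is driven by a request body along these lines (stock Elasticsearch suggest syntax; the field names are hypothetical, not the ISS3 mapping):

```python
# "Did you mean"-style phrase suggestion: a direct generator proposes
# per-term corrections, and a shingled field scores whole phrases.
suggest_body = {
    "suggest": {
        "text": "doctr near richmond",
        "did_you_mean": {
            "phrase": {
                "field": "name.shingles",   # hypothetical shingled field
                "size": 1,
                "direct_generator": [
                    {"field": "name.shingles", "suggest_mode": "always"}
                ],
            }
        },
    }
}
```

As the notes above say, even this replaces words anywhere in the string, so it behaves more like a "did you mean" than a completer.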

Fun With "Markov Chains"

You can read it later...
  • Celery is like a message queue, but for tasks
  • to index, we get all the doc ids that need refresh
  • we split into chunks of 500
  • we send a job for each chunk of 500
  • each chunk may spawn other jobs (org -> site -> service)
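The chunking described above can be sketched like this; `send_task` stands in for a Celery task's `.delay(...)` (the task name is assumed), so the dispatch logic is runnable on its own:

```python
CHUNK_SIZE = 500  # chunk size used in the talk

def chunked(ids, size=CHUNK_SIZE):
    """Yield successive fixed-size chunks of document ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def dispatch_reindex(doc_ids, send_task):
    """Queue one indexing job per chunk of ids; returns the job count.
    In production, send_task would be a Celery task's .delay()."""
    jobs = 0
    for chunk in chunked(list(doc_ids)):
        send_task(chunk)
        jobs += 1
    return jobs
```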

Celery: Distributed Tasks

aka how we "fixed" our indexer
(and addressed fault-tolerance)

Thank you

Sam McLeod - @s_mcleod
Ricky Cook - @ThatPandaDev
Infoxchange tech blog: ixa.io
Infoxchange website: infoxchange.net.au

Links & References

Dev Links