On GitHub: sammcj/Search---A-Journey-of-Delivery-on-a-Budget

http://ixa.io

SAM: This is about our journey implementing a modern search product - on a tight budget

Technology for social justice.
Search: A Journey of Delivery on a Budget
Sam McLeod [ Operations ]
Ricky Cook [ Web Development ]
  • RICKY: Non-profit that delivers technology for social justice
  • RICKY: Work to increase technical competence and solve social and community issues with tech
  • We work on the technical side to DELIVER & HOST a number of web applications to aid other organisations

Who is @infoxchange?

Infoxchange is a not-for-profit community organisation that delivers technology for social justice.
We work to strengthen communities and organisations, using information technology as the primary tool to create positive social change.
  • Web Applications
  • IT Efficiency Consulting
  • Community Empowerment
  • SAM: Diverse environment mix of languages and platforms
  • We run at approximately 4000 - 8000 requests per minute across our apps

Our Products

  • 20 Web Applications (Django, Perl & PHP)
  • 300+ hosted CMS websites
  • Application RPM = 4000 - 8000
Budget: Tight
  • RICKY: Brief rundown of our offerings
  • eRef: specialist referrals, eg go to GP, referral to hospital radiology
  • eWait: aged care queues
  • Service coord: care plans/how to coordinate care between specialists
  • QIPPS: project pre-planning, what do you want to achieve? target?, slip slop slap
  • Patient management: Make sure that patient information is readily, easily accessible
  • Drupal: Give small community orgs web presence

Web Applications

Electronic Referral
Electronic Waitlist
Service Coordination
Health Project Management
Patient Management
Content Management

Specialising In

Homelessness
Mental Health
Eldercare
Welfare
  • SAM: We've been in digital search since 1991
  • Think about what finding services was like in 1990...

Rich Search History

Infoxchange established the first searchable electronic directory of community services in Australia in 1990
Since the 90's Service Seeker has included:
  • Name search
  • Keyword search
  • Regional maps
  • Vacancy search
  • RICKY: 1990: Archie, the world's first search engine aka 'query form', was launched
  • Index FTP filenames
  • McGill University, Montreal

1990 - Archie

RICKY: 1991: Service Seeker was launched by Infoxchange

1991 - ServiceSeeker

RICKY: 1995: For perspective, this was when Google first appeared

1995 - Google

  • RICKY: 1996: Service Seeker 2 was released (and it's still running now)
  • Only minor cosmetic changes (mostly looks the same)
  • Powers lifeline call centre

1996 - ServiceSeeker 2

RICKY: 2014: This year we have announced Service Seeker 3, in the form of an API

2014 - ServiceSeeker 3 (API)

  • I'm going to talk about the Ops journey
  • Our Environment
  • Performance
  • Monitoring
  • Lessons Learnt
  • and then I'll hand you over to Ricky - talk through the dev side of things

The Ops Journey

Environmental Constraints

  • Small Ops team, 4 people working across all our products
  • We're a non-profit so we have to be careful with our money
  • Relatively low-end hardware
  • No search / Java experience
  • When we started out on the journey we had LOTS of technical debt that needed to be paid down
  • and an elderly network

Environmental Constraints

  • Small Operations Team
  • Slow SAN Performance
  • Limited Hardware Resources
  • No Search experience & limited Java experience
  • LOTS of Technical Debt
  • Elderly Network
  • But we've learnt to leverage our strengths...
  • We run modern platforms
  • Our Dev and Ops teams work together
  • We don't really have much internal bureaucracy getting in the way
  • We're an open source shop
  • We just rolled out a sexy new network
  • and most of all, our people

Environmental Strengths

  • Modern Tech: BTRFS, Docker, ZRAM & Kernel 3.14
  • Strong DevOps and Agile Practices
  • No Bureaucracy == Agility
  • Open Source Architecture
  • Newly Deployed Network
  • Hardworking, Passionate People

Highly Diverse Environment

  • I mentioned we have a highly diverse environment
  • I love this graphic...
  • It's exactly what we need to be careful to avoid
  • I didn't want ElasticSearch to appear here
  • Its deployment must be agile and manageable
  • 3 node clusters
  • Debian shop
  • 24 Core, 30GB clusters
  • XenServer for virtualisation works very well for us

ES Hosts

  • 3 Node Clusters
  • OS = Debian 7 (Wheezy)
  • CPU = 8x E5-2660 0 @ 2.20GHz (24 Core Cluster)
  • RAM = 10GB (30GB Cluster)
  • Virt = XenServer 6.2
  • Kernel = linux-3.14-amd64
  • Some general Elasticsearch deployment tweaks
  • Make sure your heap size is set to half your RAM
  • As we're not running fancy physical hardware
  • We don't have SSDs etc - we try to get blood from a stone
  • We did some tuning to allow for FASTER SHARD RECOVERY after upgrading / failing a node
  • Don't let your hosts SWAP
  • RICKY: ulimit: better to die than be killed
  • Don't let people delete all your nice things
  • Keep-alives were a pain

ES Performance & Tweaks

Heap size = half the available RAM:
ES_HEAP_SIZE = 5g
Faster shard recovery:
routing.allocation.node_initial_primaries_recoveries = 4
routing.allocation.node_concurrent_recoveries        = 15
recovery.max_bytes_per_sec                           = 100mb
recovery.concurrent_streams                          = 5
Attempt to lock the process address space so it won't be swapped:
mlockall                    = true
Stop people from deleting all your nice things:
disable_delete_all_indices  = true
Ensure kernel keep-alive timeouts < router timeouts!
  • We had BIG problems with Java's GC - it would lag Elasticsearch and queries would time out ...
  • "Stop-the-world garbage collection and blocking for disk IO can cause runtime latencies on the order of seconds to minutes."
  • I'm certainly no expert with Java, but after some research our solution was to garbage collect more often, which has a much smaller impact on blocking performance

Taking out the garbage

"Stop-the-world garbage collection and blocking for disk IO can cause runtime latencies on the order of seconds to minutes."
Garbage collecting more often
=
less 'hit' to performance when it occurs
s/CMSInitiatingOccupancyFraction=[0-9]* /CMSInitiatingOccupancyFraction=35/g
/usr/share/elasticsearch/bin/elasticsearch.in.sh

Monitoring

  • At Infoxchange we LOVE monitoring, I think it's something we do quite well
  • I have a few slides on this; I'll go through them quickly, but I think it's really important
  • ElasticHQ has been helpful for spotting obvious performance bottlenecks
Royrusso/ElasticHQ
Marvel is GREAT... BUT it is heavy and it's no longer free

Monitoring

Elasticsearch/Marvel
  • Powerful Cluster Insights
  • Easily Visualise Cluster Performance
  • Puts Metrics In Context
  • Creates Very Large Indexes
  • Doesn't Clean Up After Itself
  • No Longer Free
  • If you're already monitoring that everything is alive & well...
  • And you're returning that Information to Nagios...
  • Why not use & graph the performance metrics returned by those checks for trending?

Monitoring

Nagios - PNP4Nagios
Aggregates Nagios Performance Information
We monitor the state of our elasticsearch clusters with Nagios and dashboard any state changes

Monitoring

Nagios + check_mk multisite
Cross-Site Elasticsearch Cluster Information
We even send XMPP (Jabber) alerts to our chat rooms for certain events

Monitoring

Nagios + XMPP (Jabber)
Realtime Cluster Notifications
  • Let's talk briefly about logs
  • We pass all our app logs to Logstash
  • Logstash does some magic then stores them in elasticsearch
  • Our devs can see those logs using Kibana (web interface to ES)
  • It's not as good as Splunk, but it's pretty decent

ES + Logstash = Splunk4Free

(Almost)
  • Just like Elasticsearch itself, we deploy & control Logstash using Puppet
  • Both Elasticsearch & Logstash have excellent (official) puppet modules
  • Logstash deployment looks a bit like this ...

Logstash

Puppet Managed Logstash Config
  ixalogstash::input { 'syslog-docker':
    input_type     => 'syslog',
    type_tag       => 'docker',
    input_port     => 5550,
  }
  ixalogstash::input { 'syslog-nginx':
    input_type     => 'syslog',
    type_tag       => 'nginx',
    input_port     => 5551,
  }
  ixalogstash::filter { 'syslog-nginx':
    type_tag       => 'nginx',
    filter_match   => '%{COMBINEDAPACHELOG} %{QS:vhost}',
  }
  ixalogstash::output { "${elastic_host}":
    output_type    => 'elastichttp',
  }
https://github.com/elasticsearch/puppet-logstash
The short of it is that we've learnt some lessons...

Lessons Learnt

  • Small Budget = Creative Solutions
  • Puppet is great for managing Elasticsearch & Logstash
  • Bulk indexing can be slow (more from ricky soon...)
  • Test for split-brain (before it happens)
  • Modern OS & kernel helps
  • Java GC is painful

Lessons Learnt

  • Puppet is great for managing Elasticsearch
  • Small budget = creative solutions
  • Bulk indexing can be slow
  • Test for split-brain (before it happens)
  • Modern OS & kernel helps
  • Java GC is painful
  • Kernel keep-alive timeouts < routers timeouts
  • RAM is cheap, buy lots of it

The Dev Journey

It "worked" in dev :/

ISS2

  • Custom indexing (not just full text search!)
  • Stemming (sort of)
  • Keywords/Synonyms
  • Area search (not quite GIS)
  • Gunicorn/nginx, no Apache
  • TastyPie is a RESTful framework
  • ElasticUtils is like an ORM for ES

ISS3

(ElasticUtils)
... and Tasty Pie (which has no logo)
  • GIS
  • Private
  • Query strings
  • Keywords
  • Cascaded data

The Hard Parts

  • GIS searches
  • Private data
  • Query strings
  • Keywords (I KNOW RIGHT?!)
  • Suggestions
  • Cascaded data (slow indexing)
Mashup of ElasticUtils and TastyPie

ElasticPie

... for consuming, not eating
  • Cover this quickly
  • Django routes to TastyPie
  • TastyPie handles RESTful
  • obj_get_list builds query
  • Query string parser returns query chunks
  • Queries ES and we get JSON doc source
  • Dehydrate adds dynamic fields, removes _id etc
  • TastyPie adds meta, returns 200/etc
  • Must match in area perfectly
  • Must fall back to matching by distance
  • Country vs city is tricky, because you have to normalise acceptable distance against the search area
  • In the country, 100km is acceptable
  • In the city, 10km is acceptable
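Pulling the ElasticPie request flow together, here is a minimal runnable sketch. Every name is an illustrative stub, not Infoxchange's code - the real logic lives in TastyPie's `obj_get_list` and a custom query-string parser.

```python
# Illustrative stubs for the ElasticPie flow: parse the query string,
# query Elasticsearch, dehydrate each hit, then add TastyPie-style meta.

def parse_query_string(q):
    """Turn the raw query into an Elasticsearch query body (stub)."""
    return {"match": {"_all": q}}

def search(es_query):
    """Stand-in for the Elasticsearch call; returns canned hits."""
    return [{"_id": "1", "_source": {"name": "Richmond GP Clinic"}}]

def dehydrate(hit):
    """Add dynamic fields and drop internals like _id."""
    doc = dict(hit["_source"])
    doc["resource_uri"] = "/api/v3/service/1/"  # example dynamic field
    return doc

def obj_get_list(q):
    """TastyPie-style entry point: parse, query, dehydrate, add meta."""
    objects = [dehydrate(h) for h in search(parse_query_string(q))]
    return {"meta": {"total_count": len(objects)}, "objects": objects}
```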

Geospatial Search

  • inverse document frequency
  • boosts terms that appear less often
  • GIS is not boosted
  • function_score query combines queries with a given function
  • Decay functions like gauss, etc
  • Set score_mode to whatever suits you (sum or max; we use sum)
  • Set boost_mode to multiply (default)
  • Can match on things like price, etc

GIS and IDF

Like oil and water

idf(t) = 1 + log( numDocs / (docFreq + 1) )
Use function_score! It ignores IDF and query normalization completely
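The city-vs-country problem above is what the gauss decay handles: the score drops off smoothly with distance, and the scale parameter sets how fast. A minimal re-implementation of Elasticsearch's documented decay formula for illustration (not Infoxchange's code):

```python
import math

def gauss_decay(distance_km, offset_km=0.0, scale_km=10.0, decay=0.5):
    """Elasticsearch-style gauss decay: the score equals `decay` once a
    document sits `scale` beyond `offset` from the search origin."""
    sigma_sq = -scale_km ** 2 / (2.0 * math.log(decay))
    adjusted = max(0.0, distance_km - offset_km)
    return math.exp(-(adjusted ** 2) / (2.0 * sigma_sq))

# City search: 10 km is acceptable  -> scale_km=10
# Country search: 100 km is acceptable -> scale_km=100
```

Combined with the text query in a function_score query (boost_mode multiply), a far-away match still surfaces when nothing is nearby, but close matches win.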
  • key words to pick out
  • or not....
  • multiple suburbs, same name
  • multi word states
  • landmarks
  • multiple different keywords to parse
  • sometimes no keywords

Intelligent Query Strings

  • doctor in richmond
  • doctor near richmond
  • doctor richmond
  • doctor near richmond nsw
  • doctor near richmond new south wales
  • doctor near richmond town hall
  • doctor around richmond town hall that speaks chinese
  • female doctor around richmond qld that speaks chinese
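A toy version of the "what vs where" split such a parser has to make (a hypothetical sketch, not the ISS3 parser), which also shows where the ambiguity comes from:

```python
import re

# Connectives that *might* separate the "what" from the "where".
CONNECTIVES = re.compile(r"\b(in|near|around|at)\b", re.IGNORECASE)

def split_query(text):
    """Naively split a query into (what, where); `where` is None when
    no connective is found, e.g. "doctor richmond"."""
    parts = CONNECTIVES.split(text, maxsplit=1)
    if len(parts) == 3:
        what, _, where = parts
        return what.strip(), where.strip()
    return text.strip(), None
```

split_query("doctor near richmond") gives ("doctor", "richmond"), but split_query("blood in banks") gives ("blood", "banks") even when the user meant blood banks - which is exactly the problem with naive splitting.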
  • Any family in Banks before I insult it?
  • blood banks
  • blood in banks
  • top search is needles exchange, because many local councils statically link to this search
  • needles exchange -> needles, TAS
  • the rocks/the rock

Banks, ACT

  • blood banks
  • blood banks in banks
  • needles exchange (thank you Needles, TAS)
  • The Rocks, NSW... also The Rock, NSW

Keywords

How is this even a problem?
  • "new south wales" is not 3 tokens
  • some things should match better when in order: allied health services, primary school, alternative therapy

Keywords

  • New South Wales
  • Allied Health Services
  • Primary school
  • Alternative therapy
(Ab)use Synonym filters
  "type": "synonym",
  "synonyms": [
    "new south wales => new___south___wales,nsw",
    "primari school => primari school,primary___school",
    "altern therapi => altern therapi,alternative___therapy"
  ]
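In context, that filter sits at the end of an analysis chain, after lowercasing and stemming - which is why its inputs are stemmed forms like "primari" and "altern". A sketch of plausible index settings (the analyzer wiring and names are an illustrative guess; only the synonyms come from the slide):

```python
import json

# Hypothetical index settings; the synonym entries are from the slide,
# the surrounding analyzer chain is an assumed arrangement.
settings = {
    "analysis": {
        "filter": {
            "phrase_synonyms": {
                "type": "synonym",
                "synonyms": [
                    "new south wales => new___south___wales,nsw",
                    "primari school => primari school,primary___school",
                    "altern therapi => altern therapi,alternative___therapy",
                ],
            }
        },
        "analyzer": {
            "keyword_phrases": {
                "type": "custom",
                "tokenizer": "standard",
                # Synonyms run last, so they see stemmed tokens.
                "filter": ["lowercase", "porter_stem", "phrase_synonyms"],
            }
        },
    }
}
```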
  • comp: ONLY a prefix suggester
  • comp: More like a completer for a list of tags
  • term: Only does "fuzzy" spell correction
  • term: Type "doct", more likely to suggest a 4 letter word with 3 letters changed than "doctor"
  • comp/term: Only for single words; "new s" won't come up as "new south wales"
  • phra: Multiple words/whole phrase (in place replacement)
  • phra: Fuzzy corrections
  • phra: Completion
  • phra: Replaces anywhere in the string; more like a "did you mean"

Suggestions

  • Completion suggester
  • ✘ NOPE
  • Term suggester
  • ✘ NOPE
  • Phrase suggester
  • ✔ YES
  • Just kidding, ✘ NOPE
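For reference, the phrase suggester that came closest is driven by a request body along these lines (stock Elasticsearch suggest syntax; the field names are hypothetical, not the ISS3 mapping):

```python
# "Did you mean"-style phrase suggestion: a direct generator proposes
# per-term corrections, and a shingled field scores whole phrases.
suggest_body = {
    "suggest": {
        "text": "doctr near richmond",
        "did_you_mean": {
            "phrase": {
                "field": "name.shingles",   # hypothetical shingled field
                "size": 1,
                "direct_generator": [
                    {"field": "name.shingles", "suggest_mode": "always"}
                ],
            }
        },
    }
}
```

As the notes above say, even this replaces words anywhere in the string, so it behaves more like a "did you mean" than a completer.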

Fun With "Markov Chains"

You can read it later...
  • Celery is like a message queue, but for tasks
  • to index, we get all the doc ids that need refresh
  • we split into chunks of 500
  • we send a job for each chunk of 500
  • each chunk may spawn other jobs (org -> site -> service)
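The chunking described above can be sketched like this; `send_task` stands in for a Celery task's `.delay(...)` (the task name is assumed), so the dispatch logic is runnable on its own:

```python
CHUNK_SIZE = 500  # chunk size used in the talk

def chunked(ids, size=CHUNK_SIZE):
    """Yield successive fixed-size chunks of document ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def dispatch_reindex(doc_ids, send_task):
    """Queue one indexing job per chunk of ids; returns the job count.
    In production, send_task would be a Celery task's .delay()."""
    jobs = 0
    for chunk in chunked(list(doc_ids)):
        send_task(chunk)
        jobs += 1
    return jobs
```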

Celery: Distributed Tasks

aka how we "fixed" our indexer
(and addressed fault-tolerance)

Thank you

Sam McLeod - @s_mcleod
Ricky Cook - @ThatPandaDev
Infoxchange tech blog: ixa.io
Infoxchange website: infoxchange.net.au

Links & References

Dev Links