disqus-tech-talk



disqus-tech-talk

0 0


disqus-tech-talk

Intern Tech Talk Summer 2013

On Github perryh / disqus-tech-talk

Intern Tech Talk

Perry Huang

@perry_huang

Summer 2013

Summer Overview

  • First week: Got onboard

  • Weeks 2-3: Project strace-explain

  • Week 4: Learned about internals of Disqus, Django, Memcached, Cassandra, Redis, and more

  • Week 5-6: Analyzed Memcached Usage at Disqus

  • Weeks 7-12: Worked on Disqus and other things

First week: Got onboard

Onboarding was difficult, but I survived. As the only intern, I knew I had huge responsibilities.

I hardly knew anything about Disqus, but got a lot of help from the team.

"What's Promoted Discovery, Storm, Sentry, Jones, Gargoyle, and etc.?"

I wanted to fix Disqus. I was learning throughout the summer and made more of an impact each week.

Weeks 2-3: strace-explain

Worked on strace-explain, a Ruby gem that can be used to trace and analyze system calls for any user-level process. It reveals time being spent waiting for networked I/O from different resources. This can be used in debugging the Django dev server.

Source:

http://github.com/disqus/strace-explain/

Install:

$ gem install strace-explain
Disquss-MacBook-Pro:~ $ strace-explain -h
Usage: strace-explain -p [PID]
   or: strace-explain [command]

Options:
    -p, --pid N                      Attach to process PID N.
    -t, --time N                     Run analysis for N seconds.
    -h, --help                       Show this message
    -v, --version                    Show version

Weeks 2-3: strace-explain

http://strace-explain.herokuapp.com/analysis/53558744580160

Week 4 and beyond: Learning

Learned about the internals of various tools that we use at Disqus, as well as Disqus itself:

  • Memcached and alternatives (e.g. mattproxy, mattdb, twitter-memcached, fatcache)
  • Redis, Cassandra, PgSQL
  • Indoor golf
  • Disqus, Django, and of course, Python

Disqus Ops Team

Meet the Disqus Ops Team

  • Harder - Matt
  • Better - John
  • Faster - seems to be me
  • Stronger - Mike

Week 5-6: Analyzed Memcached Usage at Disqus

Created a tool to analyze Memcached traffic from real-time or previously recorded TCP/UDP dumps.

Source:

github link to memcached analysis tools

Results:

https://gist.github.com/perryh/07ef6828a9351f2604eb

At Disqus, most of our cache keys start with a group name, so we can adjust our code to optimize cache usage. Here are results from one group:

Type: :8:NewPostTreePaginator
    145810 gets = 9767 hits + 136043 misses
    153451 sets
    (1.2299291936654724% of total gets)
    (93.30155682051986% cache miss rate)
    (5.8092943561057195% of total sets)
    Min. payload length: 29
        25th percentile: 29
        50th percentile: 2095
        75th percentile: 8518
    Max. payload length: 273910

Lessons Learned Regarding Caching

"Don't be retarded." ~ Ben

Lesson 1

Use the cache properly. Typically, we would first check our cache for our values before we query the database. If we do end up query the database, we would need to set the values into cache for them to be retrieved later. I found a few uses of the cache this summer where we did not actually set the keys we were querying.

Lessons Learned Regarding Caching

Lesson 2

Avoid caching keys that have high cache miss rates or are never called within their lifetime. Large keys with this property are especially bad.

Case study from our Paginator:

Type: :8:NewPostTreePaginator
    145810 gets = 9767 hits + 136043 misses
    153451 sets
    (1.2299291936654724% of total gets)
    (93.30155682051986% cache miss rate)
    (5.8092943561057195% of total sets)
    Min. payload length: 29
        25th percentile: 29
        50th percentile: 2095
        75th percentile: 8518
    Max. payload length: 273910

These keys can cause Memcached to evict other keys that are called more often within their lifetime. They are a waste of memory.

Lessons Learned Regarding Caching

Lesson 3

Use cache.get_many and cache.set_many instead of looping cache.get and cache.set. This will dramatically decrease cache retrieval and setting from anywhere between 200 to 600 ms to a mere 60 ms on average.

Avoid this:

for o in objects:
    result[o.pk] = cache.get(make_key(o.pk))
# Build list of cache miss keys, query them from database
...
for missing in missing_objects:
    cache.set(make_key(missing.pk))

Do this:

cache_keys = [make_key(o.pk) for o in objects]
results = cache.get_many(cache_keys)
# Build list of cache miss keys, query them from database,
# and build a dict of cache miss keys and values
...
cache.set_many(missed_key_values)

Weeks 7-End: Speeding up Disqus

Caching our API responses

Using Varnish and the HTTP cache-control header helped decrease response time. While browsing New Relic, I noticed that some of our API endpoints were responding very slowly. New Relic reported that our 'Community' tab on large sites, such as CNN, would take upwards of 9 seconds to respond. This was confirmed by my own experience after experimenting on CNN, IGN, and The Atlantic.

With help from Matt, I got these API responses to respond in ~100 ms through the use of Varnish. This leads to a great improvement in user experience.

Weeks 7-End: Speeding up Disqus

Removing keys that fill up cache

We used to have a key group that was cached in our cluster of Memcached servers. This proved to be really bad to cache:

Type: :8:html
    1673394 gets = 624777 hits + 1048617 misses
    1292953 sets
    (14.11532907965599% of total gets)
    (62.66408269660343% cache miss rate)
    (48.94816303321555% of total sets)
    Min. payload length: 32
        25th percentile: 70
        50th percentile: 122
        75th percentile: 255
    Max. payload length: 29261

While a 62% miss rate is not the worst in the world, this key group did account for 48% of our cache sets. Removing this from our main Memcached cluster, along with some of my other fixes, allowed us to increase are cache hit ratio from 80% to over 90%. We now have this key group in its own cache server being handled by the LRU eviction policy, where it cannot affect other key groups.

Weeks 7-End: Things that did not work

Rewrite our Paginator in C++

I spent a lot of time learning to use SciPy Weave and Python Boost, but found the idea of rewriting the Paginator in C++ impractical. We have other plans in the pipeline to improve the Paginator.

tl;dr

  • learned a lot this summer
  • made performance upgrades to Disqus
  • made open source contributions

Thanks

  • everyone in operations team for helping me throughout the summer
  • everyone for being awesome
  • Mike for being my mentor and organizing an memorable sailing trip
  • Jason and Daniel for forever branding me with my Disqus tattoo