perf_goto_chicago



perf_goto_chicago

0 0


perf_goto_chicago

Node.js Performance Optimization A Case Study -- for GOTO Chicago 2016

On Github brycebaril / perf_goto_chicago

Node.js Performance Optimization

❔🔍🔬📐📍👓📐🔁

Bryce B. Baril -- @brycebaril

Just tweeted the slide deck location

🚀 Performance 🚀

Two primary performance concepts:

  • Task Completion (User-experience)
  • Throughput (Scale)

🏗 Task Completion 🏗

"Work is done in a timely manner."

Often includes (and is dominated by) Asynchronous IO in Node.js

task_end - task_start

🏭 Throughput 🏭

"Work isn't blocking other work."

Blocking work is usually Javascript execution or Garbage Collection in Node.js

requests-per-second

🏭 Throughput 🏭

JavaScript execution is single-threaded and on-CPU tasks will be serialized.

IMPORTANT: This can cascade into significant task completion delays.

A,B,C scheduled at the same time, each will take 10ms A takes 10ms B takes 10ms plus A's 10ms so 20ms C takes 10ms plus A and B, so 30ms

🖼 Glitching .gif Images 🖼

Case Study

Collaborative art project that did user-driven image glitching and manipulation. Core manipulation logic is in glitcher.

📝 Workflow 📝

❔ Is it fast enough? 🔍 Identify the nature of the problem. (🏗 vs 🏭) 🔬 Select tools based on the problem. 📐 Measure. 📍 Identify the location of the problem. 👓 Make the slower parts faster. 📐 Measure again. 🔁 Go back to step 0.

🚨 WARNING 🚨

Do NOT skip straight to step 5. Randomly applied V8 compiler tricks are unlikely to have any significant impact.

Saving 20ms on a task that includes a 5 second database query won't provide a significant improvement.

❔ 0. Is it fast enough? ❔

The logs show some images take well over a second with the manipulation replaceBackground.

🔍 1. Identify the nature of the problem 🔍

Based on logs and the code, the bulk of the time is the image manipulation. It's not doing any network calls or other asynchronous work.

Feels like a cop-out... how to do this with tooling?

🙋 What if it was async? 🙋

  • 🛠 Tools
    • Logging
    • APM (NewRelic, Dynatrace, etc.)
  • 💪 Solutions
    • Caching
    • Query optimization
    • etc.

(out of scope for this presentation)

These are usually infrequent and slow enough you can always log. This is essentially what APM vendors do for you automatically. Caching: recurring theme--don't do things if you don't have to

💻 Our test harness 💻

var THRESHOLD = 40
var image = /* read file, call readimage */
function fillBlack(frame) {
  return frame.fill(0)
}
replaceBackground(image.frames, fillBlack, THRESHOLD)
/* call writegif on image, write to file */
replaceBackground calculates the median frame by comparing every pixel in every frame, then for each frame calculates the difference from that background frame

⏳ Before ⏳

⌛ After ⌛

🔬 2. Select tools 🔬

Most tools fall into one of three categories:

  • How slow is it?
  • Where is the slowness at?
  • Why is it slow?

🛠 How slow is it? 🛠

  • /usr/bin/time
  • Benchmark tools (ab, siege, etc.)

🛠 Where is the slowness at? 🛠

  • Kernel tools (perf, dtrace, etc.)
  • V8 Instrumentation (v8-profiler, NSolid)

🛠 Why is it slow? 🛠

If the problem is slow JavaScript execution:

  • Compiler tracing (IRHydra, --prof, --trace-deopt)
  • Code review

📐 3. Measure 📐

With our selected poor-performing image, we'll use /usr/bin/time to measure a baseline:

$ \time node harness.js ~/Downloads/bling2.gif
8.67user 0.06system 0:08.71elapsed 100%CPU (0avgtext+0avgdata 181988maxresident)k
0inputs+752outputs (0major+41833minor)pagefaults 0swaps

🛠 Using perf 🛠

  • The perf tool on Linux is a kernel-level CPU profiling tool
  • Captures full stack, C++ & JS execution
  • Followed these instructions (not up to date!)
  • Or check out the new tool 0x
0x was not an option when this slide deck was created, haven't tried it yet.

🔥 How to read a flamegraph 🔥

  • X axis is % of total time
  • Y axis is stack depth
  • Look for:
    • plateaus
    • fat pyramids

🔥 Perf Flamegraph 🔥

]

🛠 v8-profiler 🛠

  • See v8-profiler README
  • Uses V8-provided instrumentation hooks
  • Profiles JS only
  • View by loading into Chrome Dev Tools

📈 v8-profiler results 📈

(open assets/median.cpuprofile in Chrome DevTools)

We can see garbage collection in V8-Profiler output, can eliminate that concern

🙋 What if it was garbage collection? 🙋

  • 🛠 Tools
    • heap snapshots
    • post-mortem analysis (mdb, lldb, etc.)
  • 💪 Solutions
    • fix errant closures
    • hold less data/parts of objects vs whole objects
    • code/static analysis
    • etc.

(out of scope for this presentation)

📍 4. Identify the location 📍

💢 Focus on avg() 💢

The avg() function averages every frame to make a simulated background frame.

Both perf and v8-profiler indicate we're spending the bulk of the time there.

👓 5. Make the slower parts faster 👓

💩 Reasons for Poor Performance 💩

  • Wrong tool for the job
  • Doing unnecessary things
  • Poor algorithm choice
  • Not cooperating with the runtime/compiler
Node isn't great at everything: ssl termination, use nginx don't create functions in loops, avoid Promises decrease instruction count optimizing compiler makes assumptions, work with it, avoid invalidating assumptions

🆕 COOL NEW THINGS! 🆕

  • 🆕 Transpile to ES7!
  • 🆕 ES6!
  • 🆕 ES5! (wait, what?)

These things can absolutely make your code easier

for you

to work with.

😭 However... 😭

Generally these:

  • Add additional code that will add to execution time
  • Are not yet optimizable by V8

⏪ Welcome to ES3! ⏪

Even nice ES5 features such as Array.forEach Array.map, etc. are slower

😅 Oh right, focus on avg() 😅

Focus your effort on high return-on-investment. Don't sacrifice dev convenience by refactoring everything to speed up fast enough code.

Some things could cause systemic slowdowns, hard to see if every function got slower.

⚙ Optimizing Compilers (Simplified) ⚙

  • JavaScript is extremely flexible.
  • Most code doesn't use that flexibility.
  • Observations -> Assumptions -> Optimize to pure assembly
  • Assumption invalid -> Deoptimize -> Discard assembly
Parse and analyze your code as it is executed making optimized assembly based on what it encounters with guards against cases it skipped. "trust but verify" * If you deoptimize a function too many times it will give up on optimization.

⚙ Optimizing Compilers ⚙

The optimizing compiler doesn't care if your code is GOOD

  • Code that does things the wrong/unoptimal way can be "optimized"
  • Code that does the wrong thing can be "optimized"
Even if it's highly optimized assembly code, the optimizing compiler can't save you from doing stupid things. Unncecessary work always takes time

🚂 V8 Compiler Output 🚂

  • Lots of tips out there on how to optimize your code for V8.
  • Nothing beats V8 telling you what it didn't like.
  • My favorite tool: IRHydra2
    • Follow the instructions on that page
    • Load results into IRHydra2 (it's a web app)

🌡 IRHydra Results 🌡

(Load the two files in assets/irhydra into IRHydra to explore)

🔎 Analysis 🔎

  • the avg function had an eager deoptimization
  • ... but it was the inlined Buffer constructor in node core ...
  • (So I fixed it in Node core...)
  • otherwise nothing too interesting ¯\_(ツ)_/¯

📝 Code Analysis 📝

avg() calls medianPixel()

The algorithm to calculate median requires a sort

We don't see medianPixel in the flamegraph or v8-profiler output because it was inlined.

⚖ Some Math ⚖

Our image is 800 x 450 pixels with 51 frames

800 * 450 = 360000 pixels per frame

So 360_000 sortPixels calls, each sorting 51 pixels. 😨

📊 Median Frame 📊

🔔 Mean 🔔

Calculating mean doesn't require a sort. Maybe it will work?

🔔 Mean Frame 🔔

⌛ After 2 ⌛

📐 6. Measure again 📐

$ \time node harness.js ~/Downloads/bling2.gif
3.40user 0.04system 0:03.43elapsed 100%CPU (0avgtext+0avgdata 182048maxresident)k
0inputs+704outputs (0major+44011minor)pagefaults 0swaps

8.67 seconds to 3.40 seconds!

🔥 Perf Flamegraph (Mean) 🔥

]

📉 v8-profiler results 📉

🔁 7. Go back to step 0 🔁

  • Fix copy: remove call to it?
  • Reduce instruction count
  • Even better algorithms?
  • Optimize inxsearch function?

📠 Let's remove the call to Buffer::Copy 📠

$ \time node harness.js ~/Downloads/bling2.gif
2.21user 0.06system 0:02.28elapsed 99%CPU (0avgtext+0avgdata 183944maxresident)k
10584inputs+704outputs (0major+43860minor)pagefaults 0swaps

Another second saved!

💖 Check out NSolid! 💖

Capture flamegraphs of a production process with the click of a button!

Bryce B. Baril - http://brycebaril.com

Twitter: @brycebaril

NSolid: Try NSolid!