perf_node_interactive



perf_node_interactive

0 9


perf_node_interactive

Node.js Performance Optimization A Case Study -- for Node Interactive 2015

On Github brycebaril / perf_node_interactive

🚀 Performance 🚀

Two primary performance targets:

  • Task Completion (User-experience)
  • Throughput (Scale)

🏗 Task Completion 🏗

"Work is done in a timely manner."

Often includes (and dominated by) to Asynchronous IO in Node.js

task_end - task_start

🏭 Throughput 🏭

"Work isn't blocking other work."

Blocking is usually Javascript execution or Garbage Collection in Node.js

requests-per-second

🏭 Throughput 🏭

IMPORTANT: Blocking operations can cause cascading blockage resulting in significant task completion delays.

A,B,C scheduled at the same time, each will take 10ms A takes 10ms B takes 10ms plus A's 10ms so 20ms C takes 10ms plus A and B, so 30ms

🖼 Glitching .gif Images 🖼

Case Study

  • readimage reads gif/png/jpg and converts to a common format
  • glitcher various image glitches and manpiulations
  • writegif write that common format to an animated gif

📝 Workflow 📝

❔ Is it fast enough? 🔍 Identify the nature of the problem. (🏗 vs 🏭) 🔬 Select tools based on the problem. 📐 Measure. 📍 Identify the location of the problem. 👓 Make the slower parts faster. 📐 Measure again. 🔁 Go back to step 0.

🚨 WARNING 🚨

Do NOT skip straight to step 5. Randomly applied V8 compiler tricks are unlikely to have a significant impact.

Saving 20ms on a task that includes a 5 second database query won't provide a significant improvement.

❔ 0. Is it fast enough? ❔

The logs show some images take well over a second to manipulate with the manipulation replaceBackground, and this is a blocking operation.

🔍 1. Identify the nature of the problem 🔍

Based on logs and the code, I have identified the bulk of the time is the image manipulation. It's not doing any network calls or other asynchronous work.

🙋 What if it was async? 🙋

  • 🛠 Tools
    • Logging
    • APM (NewRelic, Dynatrace, etc.)
  • 💪 Solutions
    • Caching
    • Query optimization
    • etc.

(out of scope for this presentation)

These are usually infrequent and slow enough you can always log. This is essentially what APM vendors do for you automatically.

💻 Our test harness 💻

var THRESHOLD = 40
var image = /* read file, call readimage */
function fillBlack(frame) {
  return frame.fill(0)
}
replaceBackground(image.frames, fillBlack, THRESHOLD)
/* call writegif on image, write to file */
replaceBackground calculates the median frame by comparing every pixel in every frame, then for each frame calculates the difference from that background frame

⏳ Before ⏳

⌛ After ⌛

🔬 2. Select tools 🔬

Most tools fall into one of three categories:

  • How slow is it?
  • Where is the slowness at?
  • Why is it slow?

🛠 How slow is it? 🛠

  • /usr/bin/time
  • Benchmark tools (ab, siege, etc.)

🛠 Where is the slowness at? 🛠

  • Kernel tools (perf, dtrace, etc.)
  • V8 Instrumentation (v8-profiler, NSolid)

📐 3. Measure 📐

With our selected poor-performing image, we'll use /usr/bin/time to measure a baseline:

$ \time node harness.js ~/Downloads/bling2.gif
8.67user 0.06system 0:08.71elapsed 100%CPU (0avgtext+0avgdata 181988maxresident)k
0inputs+752outputs (0major+41833minor)pagefaults 0swaps

📍 4. Identify the location 📍

🛠 Using perf 🛠

  • The perf tool on Linux is a kernel-level CPU profiling tool
  • Captures full stack, C++ & JS execution
  • Follow these instructions

(They are not up-to-date but the steps are all there)

🔥 How to read a flamegraph 🔥

  • X axis is % of total time
  • Y axis is stack depth
  • Look for:
    • plateaus
    • fat pyramids

🔥 Perf Flamegraph 🔥

🛠 v8-profiler 🛠

  • See v8-profiler README
  • Uses V8-provided instrumentation hooks
  • Profiles JS only
  • View by loading into Chrome Dev Tools

📈 v8-profiler results 📈

💢 Focus on avg() 💢

The avg() function averages every frame to make a simulated background frame.

Both perf and v8-profiler indicate we're spending the bulk of the time there.

🛠 Why is it slow? 🛠

If the problem is slow JavaScript execution:

  • Compiler tracing (IRHydra, --prof, --trace-deopt)
  • Code review
  • Static analysis?

🙋 What if it was garbage collection? 🙋

  • 🛠 Tools
    • heap snapshots
    • post-mortem analysis (mdb, lldb, etc.)
  • 💪 Solutions
    • fix errant closures
    • hold less data/parts of objects vs whole objects
    • code/static analysis
    • etc.

(out of scope for this presentation)

👓 5. Make the slower parts faster 👓

💩 Reasons for Poor Performance 💩

  • Wrong tool for the job
  • Doing unnecessary things
  • Poor algorithm choice
  • Not cooperating with the runtime/compiler
Node isn't great at everything: ssl termination, use nginx don't create functions in loops, avoid Promises decrease instruction count optimizing compiler makes assumptions, work with it, avoid invalidating assumptions

⚙ Optimizing Compilers (Simplified) ⚙

  • JavaScript is extremely flexible. Most code doesn't use that flexibility.
  • Optimizing compilers make assumptions on how you use JS and cut corners with optimized assembly code.
  • Deoptimization is when you invalidate the assumptions it made, forcing it to discard the assembly version.
Parse and analyze your code as it is executed making optimized assembly based on what it encounters with guards against cases it skipped. If you deoptimize a function too many times it will give up on optimization.

🚂 V8 Compiler Output 🚂

  • Lots of tips out there on how to optimize your code for V8.
  • Nothing beats V8 telling you what it didn't like.
  • My favorite tool: IRHydra2
    • Follow the instructions on that page
    • Load them into IRHydra2 (it's a web app)

🌡 IRHydra Results 🌡

(Load the two files in assets/irhydra into IRHydra to explore)

🔎 Analysis 🔎

  • the avg function had an eager deoptimization
  • ... but it was the inlined Buffer constructor in node core ...
  • (So I filed a PR against node, released today in 5.2.0)
  • otherwise nothing too interesting ¯\_(ツ)_/¯

📝 Code Analysis 📝

avg() calls medianPixel()

The algorithm to calculate median requires a sort

We don't see medianPixel in the flamegraph or v8-profiler output because it was inlined.

⚖ Some Math ⚖

Our image is 800 x 450 pixels with 51 frames

800 * 450 = 360000 pixels per frame

So 360000 51 pixel sortPixels calls. 😨

📊 Median Frame 📊

🔔 Mean 🔔

Calculating mean doesn't require a sort. Maybe it will work?

🔔 Mean Frame 🔔

⌛ After 2 ⌛

📐 6. Measure again 📐

$ \time node harness.js ~/Downloads/bling2.gif
3.40user 0.04system 0:03.43elapsed 100%CPU (0avgtext+0avgdata 182048maxresident)k
0inputs+704outputs (0major+44011minor)pagefaults 0swaps

8.67 seconds to 3.40 seconds!

🔥 Perf Flamegraph (Mean) 🔥

📉 v8-profiler results 📉

🔁 7. Go back to step 0 🔁

  • Fix copy: remove call to it?
  • Reduce instruction count
  • Even better algorithms?
  • Optimize inxsearch function?

📠 Let's remove the call to Buffer::Copy 📠

$ \time node harness.js ~/Downloads/bling2.gif
2.21user 0.06system 0:02.28elapsed 99%CPU (0avgtext+0avgdata 183944maxresident)k
10584inputs+704outputs (0major+43860minor)pagefaults 0swaps

Another second saved!

💖 Check out NSolid! 💖

Capture flamegraphs of a production process with the click of a button!

Bryce B. Baril - http://brycebaril.com

Twitter: @brycebaril

NSolid: Try NSolid!