Max & Min – Average – StandardDeviation



Max & Min – Average – StandardDeviation

0 1


jsconfeu-2013

Presentation at JSConf Europe 2013. Check gh-page branch.

On Github julien51 / jsconfeu-2013

Streaming

Algorithms

JSconf Eu 2013

Julien Genestoux

@julien51

http://Superfeedr.com

Your turn. You have 1 minute. (I only get 30 total!)

At Superfeedr, we deal with streams that are too big for us.
  • Memory constraints
  • Inifinite data sets
  • Realtime data

This changes everything...

... starting with algorithms

A note on windowing and sampling

Max & Min

Finding the winer!

This is really easy: keep track of the max.

function Streamath() {
  this.max = 0;
}

Streamath.prototype.feed = function feed(value) {
  if(this.max < value) {
    this.max = value;
  }
}

// Let's create random numbers
function random(min, max) {
  return function() {
    var r = Math.random() * (max - min) + min;
    return r*r*r*r*r*r*r; // <- Make distribution lower!
  }
}

var c = 0;
// Input yields a result of random(0,1) every 10ms to the function
var i = new Input(10, random(0,1), function(number) {
  streamath.feed(number);
  c += 1;
  console.log(c, number, streamath.max); // Outpout for GNUPlot
});

This algorithm knows too much!

Let's inoculate Alzheimer.

function Streamath() {
  this.max = 0;
  this.amortized = 0;
  this.idx = 0;
}

Streamath.prototype.feed = function feed(value) {
  this.idx += 1;
  if(this.amortized < value + this.idx) {
    this.max = value;
    this.amortized = value + this.idx;
  }
}
Really effective way to find local maximums. Works the same for minimums!

Average

Sum values

Count values

Keep running counts!

function Streamath() {
  /* Keep running counts */
  this.count = 0;
  this.sum = 0;
}

Streamath.prototype.feed = function feed(value) {
  this.count += 1;
  this.sum += value
}

Streamath.prototype.average = function average() {
  if(this.count)
    return this.sum/this.count
  return null;
}
function random(min, max) {
  return function() {
    return Math.random() * (max - min) + min;
  }
}

// Input yields a result of random(0,100) every 1ms to the function
new Input(1, random(0,100), function(number) {
  streamath.feed(number);
});

// Guess what? The average quickly converges to 50!

Let's try another distribution

Old values are here forever... and average become flat over time.

Recursive average to the rescue

AVG(t-1) * weight + value(t)

weight + 1

The weight of old values quickly fades in, based on the weight. We're computing running averages, without windowing

StandardDeviation

Indentifying outliers!

Variance is the average of the distance to the average

1/N * ∑(k−avg)

In other words: we need to recompute all the distances evertime we have a new value for the average *(

Maths to the rescue!

1/N * ∑(k−avg)

is the same as (König-Huygens)

1/N * ∑k^2 - avg^2

function Streamath() {
  this.count = 0;
  this.sum = 0;
  this.sqr = 0; // <- Keep a new rolling number: the sum of squares
}

Streamath.prototype.feed = function feed(value) {
  this.count += 1;
  this.sum += value;
  this.sqr += value * value; // <- Update the sum of square
}

Streamath.prototype.average = function average() {
  if(this.count)
    return this.sum/this.count
  return null;
}

Streamath.prototype.stddev = function stddev() {
  var avg = this.average();
  if(this.count)
    return Math.sqrt(this.sqr/this.count - avg * avg);
  return null;
}

Medians and percentiles

Not possible without windowing or sampling :(

Tale of 2 heaps

  • Keep 2 heaps: greater numbers and smaller numbers
  • If the new number is greater than the median, insert in the great number heap, at the right place
  • If the new number is lower than the median, insert in the smaller number heap, at the right place
  • If either heap's length is larger than the other one by more than 1, rebalance
  • Median is smallest of greater numbers
  • It is possible to 'clean up the heaps' simultaneously by their ends.

Membership

Have we seen this already? Is this trending?

Bloom Filters

  • space-efficient probabilistic data structure
  • Multiple hashes, for multiple dimensions
  • False positive possible, but no false negative
  • 10 bits per element are required for a 1% false positive probability

Shamelessly stolen from http://www.jasondavies.com/bloomfilter/

 Add
Definitely not there.

They fill up!

  • Time-limit them
  • Extend them
  • Counting filters

This was barely scratching the surface. Be curious.

Add a SubToMe button to your blog too!