Statistics at CDS




On GitHub: switowski/or2016

Gathering and visualizing statistical data at the CERN Document Server

Sebastian Witowski

This presentation will be about unicorns!

Actually, no. It will be about statistics

But, man, I wish it was about unicorns!

Meet:
CERN Document Server
And its younger brother:
CERN Document Server Junior
(the name is still pending)

CDS in statistics:

  • 1 500 000 records
  • ~5 000 unique visits per day
  • > 100 different submissions (some with multiple categories - each with a separate workflow)
  • Stores physics papers, but also videos, photos, posters, audio files, books and more

How do we know what our users do? It's simple, everything is in the logs, right?

Ok, but at least we know about errors and exceptions, right?

But we can use the command line to find what we need there, right?

- Hey, can you tell me how many people from the US visited this record on CDS in February?
- Sure, let me quickly search for it.

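# Count unique US client IPs that requested record 123456 in February
# (assumes Apache's default timestamp format, e.g. [10/Feb/2016:13:55:36 +0100])
pv -Webrapt -l apache.log | \
grep 'record/123456' | \
grep '/Feb/' | \
sed -r 's/^(([0-9]+\.){3}[0-9]+) .*$/\1/' | \
sort -u | \
xargs -n 1 geoiplookup | \
grep -c 'US'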
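
For comparison, here is a rough Python sketch of the same question. It is only an illustration and assumes the Apache common/combined log format. The names `count_visitors` and `country_of` are made up for this example, and `country_of` stands in for whatever GeoIP lookup is available.

```python
import re

# Client IP first, then the timestamp in [dd/Mon/yyyy:...], then the request line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[\d{2}/(?P<month>\w{3})/(?P<year>\d{4}):[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)'
)


def count_visitors(lines, path_part, month, year, country, country_of):
    """Count unique client IPs from `country` that requested a matching path.

    `country_of` is a hypothetical callback mapping an IP string to a
    country code (for example a wrapper around a GeoIP database).
    """
    ips = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not an access-log line we understand
        if match.group("month") != month or match.group("year") != str(year):
            continue
        # Substring match, like the grep above (so it shares grep's looseness).
        if path_part not in match.group("path"):
            continue
        ips.add(match.group("ip"))
    return sum(1 for ip in ips if country_of(ip) == country)
```

It is more readable than the shell pipeline, but it is still a one-off script that has to be rewritten for every new question.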
                            

And that's just for access logs and errors. To see how users are using our system, we had to add two more systems on top of that.

Piwik - to see the statistics of our visitors

And the Invenio webstat module - for custom statistics (like the number of loans in the library)

Oh, come on! There has to be a better way!

Why do we need a clear overview of all the logs?
  • To know about errors before many users notice them :)
  • To see how users are using CDS (are the new features popular?)
  • To provide statistics for our users (How many times was my paper downloaded?) and other services at CERN (How many times was this video played?)
  • To see how many resources are used (CPU, RAM, etc.) and decide if it's time to scale up or not
  • To be able to predict if we need to scale before incoming events (based on the historical data)

We decided to switch to one system for all our logs. Since Elasticsearch was getting more and more popular, we decided to use it.

Elasticsearch is part of the ELK stack:

  • Elasticsearch - a search server
  • Logstash - transportation layer
  • Kibana - presentation layer
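
To make the split a bit more concrete: Elasticsearch stores JSON documents, and its bulk API accepts newline-delimited JSON in which each document is preceded by an action line. The sketch below shows that serialization step only. The index name `cds-logs` and the event fields are assumptions for illustration, and in the stack described here this job belongs to Logstash rather than to hand-written code.

```python
import json


def to_bulk_body(events, index):
    """Serialize events as a bulk API request body (newline-delimited JSON).

    Each event becomes two lines: an "index" action, then the document itself.
    Older Elasticsearch versions also expect a document type; it is omitted here.
    """
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event, sort_keys=True))
    # The bulk API requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical access-log event; the field names are illustrative only.
example_body = to_bulk_body(
    [{"path": "/record/123456", "status": 200, "country": "US"}],
    index="cds-logs",
)
```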

Write how many machines we have used, how much data we store, what the load is and how long it will last (before we have to scale up).

Write about custom improvements:

  • NGINX configuration instead of paid Shield
  • Flagging bots
  • Mapping IP to countries
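
The slides do not show how bots are flagged or how IPs are mapped to countries, so the following is only one possible approach, not the CDS implementation. It enriches an event before indexing: a User-Agent pattern marks likely bots, and a small network table stands in for a real GeoIP database. The pattern, the table and the field names are all assumptions.

```python
import ipaddress
import re

# Assumed heuristic: many crawlers identify themselves in the User-Agent header.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

# Hypothetical sample table built from documentation-only networks (RFC 5737);
# a real setup would query a GeoIP database instead.
NETWORK_COUNTRIES = [
    (ipaddress.ip_network("192.0.2.0/24"), "CH"),
    (ipaddress.ip_network("198.51.100.0/24"), "US"),
]


def enrich(event):
    """Return a copy of `event` with `is_bot` and `country` fields added."""
    enriched = dict(event)
    enriched["is_bot"] = bool(BOT_PATTERN.search(event.get("user_agent", "")))
    enriched["country"] = None
    try:
        address = ipaddress.ip_address(event.get("ip", ""))
    except ValueError:
        return enriched  # missing or malformed IP: leave the country unknown
    for network, country in NETWORK_COUNTRIES:
        if address in network:
            enriched["country"] = country
            break
    return enriched
```

Flagging bots instead of dropping them keeps the raw data complete, so a dashboard can include or exclude them with a simple filter.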

Describe Lumberjack - a custom plugin that allows us to send data to Elasticsearch from anywhere in Invenio
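
The general idea behind such a plugin can be sketched with Python's standard `logging` module. This is not Lumberjack's actual API: `JsonEventHandler`, its `send` callback and the `event` extra field are hypothetical names chosen for the example. The handler converts log records into JSON events, buffers them and hands each batch to a sender that would forward it to Elasticsearch.

```python
import json
import logging


class JsonEventHandler(logging.Handler):
    """Hypothetical handler that batches log records as JSON events."""

    def __init__(self, send, batch_size=100):
        super().__init__()
        self.send = send  # callable receiving a list of JSON strings
        self.batch_size = batch_size
        self.buffer = []

    def emit(self, record):
        event = {
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields passed via `extra={"event": {...}}` are merged in.
        event.update(getattr(record, "event", {}))
        self.buffer.append(json.dumps(event, sort_keys=True))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Hand over whatever is buffered and start a new batch.
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.send(batch)
```

With a handler like this attached to a logger, any module can emit a statistics event with one call, for example `logging.getLogger("cds.stats").info("download", extra={"event": {"record_id": 123456}})`.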

Kibana

Kibana works very well for administrators, but we might need something more in the future (probably a module integrated directly into Invenio, with some predefined parameters and Role-Based Access Control).
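
As a thought experiment, such a module could map a small catalogue of predefined statistics to Elasticsearch queries and check the user's role before building one. Everything in this sketch is an assumption: the statistic names, the roles and the field names (`event_type`, `record_id`, `country`, `library`) are invented for illustration.

```python
# Hypothetical catalogue: each predefined statistic names the event type it
# counts, the field it groups by, and the roles allowed to see it.
PREDEFINED_STATS = {
    "downloads_by_country": {
        "event_type": "download",
        "group_by": "country",
        "roles": {"admin", "author"},
    },
    "loans_by_library": {
        "event_type": "loan",
        "group_by": "library",
        "roles": {"admin", "librarian"},
    },
}


def build_query(stat_name, record_id, user_roles):
    """Return an Elasticsearch query body for a predefined statistic.

    Raises PermissionError when none of the user's roles may view it.
    """
    stat = PREDEFINED_STATS[stat_name]
    if not stat["roles"] & set(user_roles):
        raise PermissionError("role not allowed to view " + stat_name)
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event_type": stat["event_type"]}},
                    {"term": {"record_id": record_id}},
                ]
            }
        },
        "aggs": {"groups": {"terms": {"field": stat["group_by"]}}},
    }
```

Keeping the queries predefined means end users never write raw queries, and access control reduces to a role check per statistic.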

Current state

We are still in the phase of transition between the old way (MySQL and Piwik) and the new one (Elasticsearch), as we are still supporting both versions of CDS, but we can clearly see the benefits of this change.

The good parts:

  • We were able to replace two different tools (both of them doing similar tasks - gathering statistics) with a single one
  • Thanks to the easy, minimal setup needed to run Kibana, we could start visualizing the data right away
  • Elasticsearch is a very adaptable infrastructure. You can easily add new machines to the cluster
  • If you are not happy with any part of ELK, you can easily replace it. There are many open source alternatives for each of its parts.

There are no bad parts, but be careful not to treat Elasticsearch as an error-proof black box that you install once and that works no matter what. Some resources will be depleted quicker than others (memory, for example), so spend some time configuring it properly to make the most of your Elasticsearch installation and avoid trouble in the future.

Thank you!

You were an awesome audience! I wish I could be there, but you know, I'm taking the last chance to see Black Sabbath live, so I think you will understand. I just hope Esteban won't screw up this awesome presentation.
