On Github willdurand-edu / web-tracking-101-slides
Web analytics is the measurement, collection, analysis, and reporting of web data for the purposes of understanding and optimizing web usage.
There are two categories of web analytics:
Off-site web analytics refers to web measurement and analysis regardless of whether you own/maintain a website.
On-site web analytics measures a visitor's behavior once on your website. There are two main technical ways of collecting the data: server log file analysis and page tagging.
Google Analytics is the most widely used on-site web analytics service. Piwik is its open-source alternative.
Web servers record some of their transactions in a log file.
In the early 1990s, web site statistics consisted of counting the number of client requests (hits) made to the web server.
This was a reasonable method initially, since each web site often consisted of a single HTML file. However, with the introduction of images in HTML, and of web sites that spanned multiple HTML files, this count became less useful.
AWStats and Webalizer are the best-known web server log analysis tools.
Two units of measure were introduced in the mid-1990s to gauge more accurately the amount of human activity on web servers:
Because of caching, this method is not accurate enough.
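As a rough illustration of why raw hit counts overstate human activity, the sketch below counts hits, HTML page requests, and distinct clients in a few made-up log lines. The sample lines (in Common Log Format), the regular expression, and the rule that a "page" is any path ending in `.html` or `/` are assumptions made for this example; this is not how AWStats or Webalizer work internally.

```python
import re
from collections import Counter

# Hypothetical access-log lines in Common Log Format (made up for this sketch).
SAMPLE_LOG = """\
192.0.2.1 - - [10/Oct/2014:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.1 - - [10/Oct/2014:13:55:37 +0000] "GET /logo.png HTTP/1.1" 200 5120
192.0.2.1 - - [10/Oct/2014:13:55:37 +0000] "GET /style.css HTTP/1.1" 200 812
192.0.2.7 - - [10/Oct/2014:14:02:10 +0000] "GET /about.html HTTP/1.1" 200 1404
"""

# Captures the client address and the requested path of one log entry.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" \d{3} \S+$')


def summarize(log_text):
    """Return (hits, page_requests, distinct_clients) for the given log text."""
    hits = 0
    pages = 0
    clients = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        client, path = match.groups()
        hits += 1
        clients[client] += 1
        # Assumption: only HTML documents and directory indexes count as pages.
        if path.endswith(".html") or path.endswith("/"):
            pages += 1
    return hits, pages, len(clients)


hits, pages, clients = summarize(SAMPLE_LOG)
print(hits, pages, clients)
```

In this sample, one visitor loading a single page with an image and a stylesheet already produces three hits, which is why hits alone say little about how many people viewed how many pages.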
In the mid-1990s, Web counters were commonly seen:
U Can't Touch This!
In the late 1990s this concept evolved to include a small invisible image (called a web bug) rather than a visible one, and, by using JavaScript, to pass along with the image request certain information about the page and the visitor.
This information can then be processed remotely by a web analytics company, and extensive statistics generated.
Historically, vendors of page-tagging analytics solutions have used third-party cookies sent from the vendor's domain instead of the domain of the website being browsed.
Third-party cookies can handle visitors who cross multiple unrelated domains within the company's site, since the cookie is always handled by the vendor's servers. However, there are two problems: privacy concerns and cookie deletion.
Most vendors of page tagging solutions have now moved to provide at least the option of using first-party cookies (i.e. cookies assigned from the client subdomain).
Google Analytics works by the inclusion of a block of JavaScript code on pages in your website. When visitors to your website view a page, this JavaScript code references a JavaScript file which then executes the tracking operation.
The tracking operation retrieves data about the page request through various means and sends this information to the Google Analytics server via a list of parameters attached to a single-pixel image request.
The data used to provide all the information in Google Analytics reports comes from three different sources:
When all this information is collected, it is sent to the Google Analytics servers in the form of a long list of parameters attached to a single-pixel GIF image request:
http://www.google-analytics.com/__utm.gif?
    utmac=UA-30138-1&                       // Account
    utmcc=__utma%3D97315849.1774621898...&  // Cookie values
    utmdt=analytics%20page%20test&          // Page title
    utmfl=9.0%20r48&                        // Flash version
    utmhn=example.com&                      // Hostname
    utmn=769876874&                         // Unique ID to prevent caching
    utmcs=ISO-8859-1&                       // Language encoding
    utmsr=1280x1024&                        // Screen resolution
    utmsc=32-bit&                           // Screen color depth
    utmul=en-us&                            // Browser language
    ...
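To make the parameter list above easier to read, the following sketch decodes such a request with Python's standard `urllib.parse` module. The URL is a shortened copy of the request above (the `utmcc` value is omitted because it is elided in the source), and the function name `decode_tracking_request` is made up for this example.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the request shown above; utmcc is left out because its
# value is elided in the source.
TRACKING_URL = (
    "http://www.google-analytics.com/__utm.gif?"
    "utmac=UA-30138-1&utmdt=analytics%20page%20test&utmfl=9.0%20r48&"
    "utmhn=example.com&utmn=769876874&utmcs=ISO-8859-1&"
    "utmsr=1280x1024&utmsc=32-bit&utmul=en-us"
)


def decode_tracking_request(url):
    """Return the query parameters of a tracking-pixel URL as a flat dict."""
    query = urlsplit(url).query
    # parse_qs percent-decodes values and returns a list per parameter name;
    # each name appears once here, so keep the first value only.
    return {name: values[0] for name, values in parse_qs(query).items()}


params = decode_tracking_request(TRACKING_URL)
for name, value in sorted(params.items()):
    print(f"{name} = {value}")
```

Whoever receives the image request (or reads the server's logs) can recover the page title, screen resolution, and browser language in exactly this way.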
Web tracking or Website visitor tracking is the analysis of visitor behaviour on a website. Analysis of an individual visitor's behaviour may be used to provide that visitor with options or content that relates to their implied preferences.
(Well, most of the time)
According to Wikipedia, a cookie is a small piece of data sent from a website and stored in a user's web browser while the user is browsing that website.
Every time the user loads the website, the browser sends the cookies back to the server to notify the website of the user's previous activity.
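The round trip can be sketched with Python's standard `http.cookies` module. The cookie name `visitor_id` and its value are hypothetical, and the browser is only simulated by a hard-coded `Cookie` header.

```python
from http.cookies import SimpleCookie

# Server side: build the Set-Cookie header sent on a first visit.
# "visitor_id" and "abc123" are hypothetical values for this sketch.
response_cookie = SimpleCookie()
response_cookie["visitor_id"] = "abc123"
response_cookie["visitor_id"]["path"] = "/"
set_cookie_header = response_cookie.output(header="Set-Cookie:")
print(set_cookie_header)

# Browser side (simulated): on the next request to the same site, the
# browser sends the stored name=value pair back in a Cookie header.
cookie_header = "visitor_id=abc123"

# Server side: parse the Cookie header to recognise the returning visitor.
request_cookie = SimpleCookie()
request_cookie.load(cookie_header)
visitor_id = request_cookie["visitor_id"].value
print(visitor_id)
```

The server never needs to store anything in the cookie beyond an identifier; the history attached to that identifier can live entirely on the server.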
Cookies have some important implications for the privacy and anonymity of web users.
While cookies are sent only to the server setting them or a server in the same Internet domain, a web page may contain images or other components stored on servers in other domains.
Cookies that are set during retrieval of these components are called third-party cookies. Advertising companies use third-party cookies to track a user across multiple sites.
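A minimal sketch of the first-party versus third-party distinction, assuming a deliberately naive definition of "site" (the last two labels of the hostname). Browsers consult the Public Suffix List for this, so the helper names and the rule below are illustrative assumptions only.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Naive 'site' of a URL: the last two labels of its hostname.

    Assumption: this ignores multi-label public suffixes such as co.uk,
    which a real implementation would look up in the Public Suffix List.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def is_third_party(page_url, resource_url):
    """True when the embedded resource belongs to a different site than the page."""
    return site_of(page_url) != site_of(resource_url)


page = "http://www.example.com/article.html"
print(is_third_party(page, "http://static.example.com/logo.png"))
print(is_third_party(page, "http://ads.example.net/pixel.gif"))
```

A cookie set while fetching the second resource would be a third-party cookie from the point of view of the page being read.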
The EU cookie directive requires websites to obtain users' permission before setting cookies, with two exceptions:
Let's say you have four websites:
You are a user, and you visit each of these websites. Each of these websites also works with a third-party company whose job is to serve ads, named badboys.com.
On every page of these websites, there is a web bug that points to badboys.com's adserver and requests an ad.
Assuming gizmodo.com knows your age and gender, when visiting it, the following request is made:
http://badboys.com/adrequest?site=gizmodo.com&age=25&gender=m
1. Browser requests content
2. Server requests cookies (if any)
3. Browser sends cookies
4. Server does some black magic, and picks an ad
5. Server returns new cookie data, and the ad
6. Browser displays this ad
Even though you are visiting the site gizmodo.com, the cookie is under the domain badboys.com. When the server returns new cookie data, it records when you last visited gizmodo.com and how many times you have been there today.
Now let's say you spend the morning watching tech-news on gizmodo.com, slashdot.org, and wired.com, all the while receiving ads from badboys.com. Each page you view gives them a little bit more information about you.
Now, you go to facebook.com. Serving an ad for such a website is complicated, as it depends on how you use Facebook. Maybe you didn't fill in enough information for the site to know your age, your gender, or the things you like.
But let's walk through the steps of a Facebook ad call to badboys.com in more detail!
http://badboys.com/adrequest?site=facebook.com
That's the black magic!
Instead of showing you a random ad, badboys.com is able to show you a highly relevant and targeted advertisement that you are far more likely to click on.
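The whole scenario can be simulated in a few lines. This is a toy stand-in for badboys.com, not a description of any real ad server: the `AdServer` class, the `user-N` cookie identifiers, and the "three tech page views means a gadget ad" rule are all assumptions made for this sketch.

```python
import itertools


class AdServer:
    """Toy stand-in for badboys.com; every rule here is an assumption."""

    TECH_SITES = {"gizmodo.com", "slashdot.org", "wired.com"}

    def __init__(self):
        self._ids = itertools.count(1)
        self.profiles = {}  # cookie id -> what the server knows about it

    def ad_request(self, site, cookie_id=None, **known):
        """Handle one ad call; return (cookie id to store, ad to display)."""
        # No usable cookie: assign a fresh identifier and an empty profile.
        if cookie_id is None or cookie_id not in self.profiles:
            cookie_id = f"user-{next(self._ids)}"
            self.profiles[cookie_id] = {"visits": {}, "attributes": {}}
        profile = self.profiles[cookie_id]
        # Each page view adds a little more information to the profile.
        profile["visits"][site] = profile["visits"].get(site, 0) + 1
        profile["attributes"].update(known)
        return cookie_id, self.pick_ad(profile)

    def pick_ad(self, profile):
        """The 'black magic': choose an ad from the accumulated history."""
        tech_views = sum(
            count
            for site, count in profile["visits"].items()
            if site in self.TECH_SITES
        )
        return "gadget ad" if tech_views >= 3 else "generic ad"


server = AdServer()
cookie, ad = server.ad_request("gizmodo.com", age=25, gender="m")
cookie, ad = server.ad_request("slashdot.org", cookie_id=cookie)
cookie, ad = server.ad_request("wired.com", cookie_id=cookie)
# The Facebook ad call carries no age or gender, only the cookie.
cookie, facebook_ad = server.ad_request("facebook.com", cookie_id=cookie)
print(facebook_ad)
```

The Facebook request passes nothing but `site=facebook.com`, yet the cookie links it to a profile built on three other sites, so the server can still target the ad.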
Given how much tracking companies know about your browsing history, it is worth asking whether these companies also know who you are. The answer, unfortunately, appears to be yes, at least for those of you who use social networking sites.
The most obvious way that a third-party tracker (remember badboys.com?) might learn which account on a social networking site is yours is via the HTTP Referer header.
A typical URL on a social networking site includes a username or user ID number, and any third-party will be able to see that.
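A sketch of how little work this takes on the tracker's side. The URL shape `/profile/<user id>` and the host `social.example` are hypothetical; real sites use different URL schemes, and the headers are given as a plain dict rather than taken from a real request.

```python
import re
from urllib.parse import urlsplit

# Hypothetical URL shape: /profile/<user id>. Real sites differ.
PROFILE_PATH_RE = re.compile(r"^/profile/([A-Za-z0-9_.-]+)")


def user_from_referer(headers):
    """Return the user ID found in the Referer header, or None."""
    referer = headers.get("Referer", "")
    match = PROFILE_PATH_RE.match(urlsplit(referer).path)
    return match.group(1) if match else None


# Headers as a third-party server might receive them when a web bug is
# loaded from a profile page.
request_headers = {
    "Host": "badboys.com",
    "Referer": "http://social.example/profile/jane.doe?tab=photos",
}
print(user_from_referer(request_headers))
```

Combined with the tracker's own cookie on the same request, this ties an otherwise anonymous browsing profile to a named account.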
Also, check out panopticlick.eff.org, and Lightbeam!
Browser fingerprinting is a method of tracking web browsers by the configuration and settings information they make visible to websites.
If your browser is unique, then it's possible that an online tracker can identify you even without setting tracking cookies.
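One way to picture this: hash the attributes a browser reveals into a single identifier. The attribute names and values below are made up, and hashing a JSON dump with SHA-256 is just one simple choice for this sketch, not the method used by any particular tracker.

```python
import hashlib
import json


def fingerprint(attributes):
    """Derive a stable identifier from a dict of browser attributes."""
    # Sort the keys so the same attributes always serialise identically.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute sets for two browsers.
browser_a = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/33.0",
    "screen": "1280x1024x32",
    "timezone_offset": -60,
    "language": "en-us",
    "plugins": ["Flash 9.0 r48", "Java"],
}
browser_b = dict(browser_a, screen="1920x1080x24")

print(fingerprint(browser_a))
print(fingerprint(browser_b))
```

No cookie is stored: as long as the same combination of attributes is sent on every visit, the same identifier can be recomputed on the server each time.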
Canvas fingerprinting is a more sophisticated browser fingerprinting technique. Its entropy comes from differences in the operating system, browser, GPU, and graphics driver. In short:
Paper: The Web Never Forgets: Persistent Tracking Mechanisms in the Wild
Source: https://securehomes.esat.kuleuven.be/~gacar/persistent/index.html
History stealing with CSS :visited
Plugging the CSS History Leak / Extra Lecture: Privacy on the Web
Google assigns a unique PREF cookie anytime someone's browser makes a connection to any of the company's Web properties or services.
This can occur when consumers directly use Google services (Search, Maps, etc.), or when they visit websites that contain embedded widgets (Google Plus).
That cookie contains a code that allows Google to uniquely track users to personalize ads.
The NSA and its British counterpart, GCHQ, are using cookies that advertising networks place on computers to identify people browsing the Internet.
The intelligence agencies have found particular use for a part of a Google-specific tracking mechanism known as the PREF cookie.
These cookies typically don't contain personal information, but they do contain numeric codes that enable websites to uniquely identify a person's browser.