While resolving the (usually shortened) URLs from tweets, I noticed some errors when using Python Requests. Strictly speaking, it's not a Requests "error", but a 403 ("Forbidden") response from some webservers that don't like user-agents such as Apache's or the default Requests user-agent.
Consider the following example:
import requests as rq

url = "http://t.co/example"  # placeholder: this URL has some redirects, and doesn't like some User-Agents.
# Let's try a simple HEAD request (which is recommended for resolving the URL due to a
# drastically lower network load, especially when resolving thousands of URLs at once)
exmpl = rq.head(url)
This example will result in a 403 response code, mainly because of the default headers that Requests sends:
{'User-Agent': 'python-requests/1.2.3 CPython/2.7.3 Linux/3.2.0-23-generic'}
Other response codes may be returned as well, e.g. 405 ("Method Not Allowed") when HEAD requests are forbidden. This can be handled via exceptions, or by using curl or PycURL directly (though PycURL isn't under active development any more).
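One possible way to handle the 405 case without leaving Requests is to fall back to a GET request when HEAD is rejected. This is a sketch of my own, not a method from the text; the function names are hypothetical:

```python
import requests as rq


def needs_get_fallback(status_code):
    """Fall back to GET only when HEAD itself is rejected (405)."""
    return status_code == 405


def resolve_head_with_fallback(url, headers=None):
    """Try a HEAD request first; retry with GET if the server forbids HEAD."""
    resp = rq.head(url, headers=headers, allow_redirects=True)
    if needs_get_fallback(resp.status_code):
        # stream=True avoids downloading the whole body just to learn
        # the final URL; close() releases the connection afterwards.
        resp = rq.get(url, headers=headers, allow_redirects=True, stream=True)
        resp.close()
    return resp
```

A 403 is deliberately not retried here, since it signals a rejected user-agent rather than a rejected method.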
As a workaround, which "fakes" the user-agent of the request and should work for most webservers or endpoints, one can simply pass another user-agent, e.g. user_agent = {'User-agent': 'Mozilla/5.0'}. This method allows resolving most of the shortened URLs posted on Twitter.
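Putting the workaround together, a minimal sketch (the function name is mine) passes the browser-like user-agent via the `headers` parameter and follows the redirect chain, which `head()` does not do by default:

```python
import requests as rq

# 'Mozilla/5.0' is the minimal browser-like value from the text;
# any common browser User-Agent string works as well.
user_agent = {'User-agent': 'Mozilla/5.0'}


def resolve(url, headers=user_agent):
    """Follow all redirects via a HEAD request and return the final URL."""
    resp = rq.head(url, headers=headers, allow_redirects=True)
    resp.raise_for_status()
    return resp.url
```

Called as `resolve('http://t.co/...')` with a real shortened link, this returns the fully expanded target URL; `allow_redirects=True` is needed because `rq.head()` does not follow redirects on its own.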
To reduce the number of redirects when consuming tweets from the Streaming API, one should use the "expanded_url" value of the delivered JSON. Beware that this expansion is not complete: redirects, re-shortened short URLs and the like are not expanded by that method!
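Extracting that value can be sketched as follows, assuming the tweet JSON shape where URLs live under `entities` → `urls` → `expanded_url`; the sample tweet fragment is fabricated for illustration and shows the caveat above, since the expanded URL is itself just another shortener:

```python
def expanded_urls(tweet):
    """Collect all expanded_url values from a tweet's URL entities."""
    return [u['expanded_url']
            for u in tweet.get('entities', {}).get('urls', [])
            if 'expanded_url' in u]


# Illustrative tweet fragment: expanded_url may still point at a shortener.
sample = {'entities': {'urls': [
    {'url': 'http://t.co/abc', 'expanded_url': 'http://bit.ly/xyz'}]}}

print(expanded_urls(sample))  # -> ['http://bit.ly/xyz']
```

Such leftover short links would still need the HEAD-based resolution described above.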