MSc Methods & Statistics
Working at Jibes Data Analytics
Open source projects:
20 lines mime vs ..>
rather than a single page, consider the whole domain
don't abuse
gather from reddit, github, pypi, BDFL, twitter, stackoverflow
latest greatest xtoy
yagmail      send emails in 2 lines (html/attach)   246
sky          next-gen intelligent web scraping      57
gittyleaks   find users/keys/pass in git repos      18
pytrending   discover trending python               10
xtoy         automatic prep/model/predict           2

Interesting for python because:
"big data"
cloud
scraping
sending email

{"domain": "http://www.gtbit.org", "url": "http://www.gtbit.org/news/viewitem.php?id=40", "injectable": true, "on line": true, "error": false, "at line": false, "time": "Wed Oct 28 00:59:39 2015", "warning": true, "failed_request": false, "emails": ["gtbit@rediffmail.com", "inderjeet@gmail.com"], "sql": true}
control micro fleet
script being run on aws
gather from S3 using GreenPool, do the computations
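The gather step can be pictured as fetching many result files concurrently and flattening them into one list. The sketch below is only an illustration: it swaps eventlet's GreenPool for the standard library's `ThreadPoolExecutor` so it runs without extra dependencies, and `FAKE_BUCKET`, `fetch`, and `gather` are hypothetical names standing in for the real S3 access.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for S3: result key -> newline-separated URLs.
FAKE_BUCKET = {
    "part-0": "http://www.gtbit.org/news/viewitem.php?id=40",
    "part-1": "http://example.com/a.php?id=1\nhttp://example.com/b.php?id=2",
}

def fetch(key):
    """Pretend to download one result file; a real version would read from S3."""
    return FAKE_BUCKET[key].split("\n")

def gather(keys, workers=8):
    """Fetch all result files concurrently and flatten them into one list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the input keys.
        chunks = pool.map(fetch, keys)
        return [url for chunk in chunks for url in chunk]

urls = gather(sorted(FAKE_BUCKET))
```

Because the work is I/O bound, a pool of lightweight workers (green threads in the original setup, OS threads here) keeps many downloads in flight at once.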
import re

# One or more characters that are not separators, quotes, slashes, '?' or '@'
part = r'[^?@ ><\'":\\\/]+'
email_re = re.compile(part + '@' + part + r'\.' + part)
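A quick way to see what this pattern picks up is to run it over a short string. The sample text below is made up for illustration; the two addresses are the ones from the example record.

```python
import re

part = r'[^?@ ><\'":\\\/]+'
email_re = re.compile(part + '@' + part + r'\.' + part)

# Hypothetical page text containing the two addresses from the sample record.
text = "mail gtbit@rediffmail.com or inderjeet@gmail.com now"
found = email_re.findall(text)

# The pattern is deliberately permissive: punctuation that is not in the
# excluded set (such as a trailing comma) becomes part of the match.
loose = email_re.findall("x a@b.com, y")
```

Since commas and similar characters are not excluded, matches may need a small cleanup pass (for example stripping trailing punctuation) before being used.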
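import time

import warc
from boto.s3.key import Key
from gzipstream import GzipStreamFile

# pds (the S3 bucket), wetpaths, dones, slugger and save_file_s3
# are defined elsewhere in the script.
for wet_path in wetpaths:
    swp = slugger(wet_path)
    if swp in dones:
        continue  # this file was already processed
    t1 = time.time()
    results = []
    # Start a connection to one of the WARC files
    k = Key(pds, wet_path)
    f = warc.WARCFile(fileobj=GzipStreamFile(k))
    for i, record in enumerate(f):
        # Keep only URLs that carry a numeric-id style PHP query
        if record.url is not None and 'php?id=' in record.url:
            results.append(record.url)
    print(time.time() - t1)
    save_file_s3('\n'.join(results), swp)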
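To treat a whole domain rather than a single page, the collected URLs can be grouped by host once the loop is done. This is a minimal sketch, not the original script: `group_by_domain` is a hypothetical helper, and all sample URLs except the first are invented.

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_domain(urls):
    """Map 'scheme://host' to the list of collected URLs on that host."""
    grouped = defaultdict(list)
    for url in urls:
        parts = urlparse(url)
        grouped[f"{parts.scheme}://{parts.netloc}"].append(url)
    return dict(grouped)

results = [
    "http://www.gtbit.org/news/viewitem.php?id=40",  # URL from the sample record
    "http://www.gtbit.org/news/viewitem.php?id=41",  # hypothetical
    "http://example.com/item.php?id=7",              # hypothetical
]
by_domain = group_by_domain(results)
```

The keys use the same `scheme://host` form as the `"domain"` field in the sample record, so per-domain results can be joined back to it directly.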