MSc Methods & Statistics
Working at Jibes Data Analytics
Open source projects:
20 lines mime vs ..>
rather than a single page, consider the whole domain
don't abuse
gather from reddit, github, pypi, BDFL, twitter, stackoverflow
latest greatest xtoy
yagmail      send emails in 2 lines (html/attach)   246
sky          next-gen intelligent web scraping       57
gittyleaks   find users/keys/pass in git repos       18
pytrending   discover trending python                10
xtoy         automatic prep/model/predict             2
Interesting for python because:
"big data"
cloud
scraping
sending email
{"domain": "http://www.gtbit.org",
 "url": "http://www.gtbit.org/news/viewitem.php?id=40",
 "injectable": true,
 "on line": true,
 "error": false,
 "at line": false,
 "time": "Wed Oct 28 00:59:39 2015",
 "warning": true,
 "failed_request": false,
 "emails": ["gtbit@rediffmail.com", "inderjeet@gmail.com"],
 "sql": true}
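A minimal sketch of the "whole domain rather than single page" idea, assuming page-level records shaped like the one above. The function name `summarize_by_domain` and the sample records (all on example.org) are hypothetical; only the field names `domain`, `injectable` and `emails` come from the record shown.

```python
import json
from collections import defaultdict

# Hypothetical page-level records using the same field names as the record above
records_json = '''
[
 {"domain": "http://a.example.org", "url": "http://a.example.org/x.php?id=1",
  "injectable": true, "emails": ["admin@example.org"]},
 {"domain": "http://a.example.org", "url": "http://a.example.org/y.php?id=2",
  "injectable": false, "emails": ["admin@example.org", "info@example.org"]},
 {"domain": "http://b.example.org", "url": "http://b.example.org/z.php?id=3",
  "injectable": false, "emails": []}
]
'''

def summarize_by_domain(records):
    """Roll page-level records up into one summary per domain."""
    summary = defaultdict(lambda: {"pages": 0, "injectable": 0, "emails": set()})
    for rec in records:
        entry = summary[rec["domain"]]
        entry["pages"] += 1                          # pages seen for this domain
        entry["injectable"] += int(rec["injectable"])  # count of flagged pages
        entry["emails"].update(rec["emails"])        # de-duplicated contacts
    return dict(summary)

summary = summarize_by_domain(json.loads(records_json))
```

Keeping the e-mail addresses in a set means a contact that shows up on several pages of the same domain is only listed once.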
                control micro fleet
script being run on aws
gather from S3 using GreenPool, do the computations
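The fan-out described above might look roughly like this. GreenPool comes from eventlet; to keep the sketch free of extra dependencies, the standard library's ThreadPoolExecutor is swapped in here. `process_path` and the two keys are hypothetical stand-ins for fetching a file from S3 and doing the computation on it.

```python
from concurrent.futures import ThreadPoolExecutor

def process_path(path):
    # Hypothetical stand-in: real code would stream the file from S3 and scan it.
    # Here it only counts the slashes in the key so the sketch runs offline.
    return path, path.count('/')

# Made-up keys, for illustration only
paths = ['crawl/segment-1/a.wet.gz', 'crawl/segment-2/b.wet.gz']

# Run the per-file work concurrently; map() returns results in input order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_path, paths))
```

With eventlet installed, the same shape should carry over to `eventlet.GreenPool(size).imap(process_path, paths)`, which suits work that mostly waits on the network.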
import re
# part: one run of characters that are not ? @ space > < ' " : \ or /
part = r'[^?@ ><\'":\\\/]+'
email_re = re.compile(part + '@' + part + r'\.' + part)
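A short usage sketch for the pattern above. The sample string and its addresses are made up; the pattern is deliberately loose, so it is better thought of as a candidate finder than a validator.

```python
import re

part = r'[^?@ ><\'":\\\/]+'
email_re = re.compile(part + '@' + part + r'\.' + part)

# Made-up sample text: addresses wrapped in angle brackets and quotes
sample = 'Contact: <admin@example.org> or "sales@example.com"'

# The excluded characters act as delimiters, so the wrappers are not captured
found = email_re.findall(sample)
print(found)  # ['admin@example.org', 'sales@example.com']
```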
                    
                    
                    
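# Imports assumed from the names used below (boto, warc, gzipstream)
import time

import warc
from boto.s3.key import Key
from gzipstream import GzipStreamFile

# wetpaths, dones, pds, slugger and save_file_s3 are defined elsewhere in the script
for wet_path in wetpaths:
    swp = slugger(wet_path)
    # Skip files that were already processed
    if swp in dones:
        continue
    t1 = time.time()
    results = []
    # Start a connection to one of the WARC files
    k = Key(pds, wet_path)
    f = warc.WARCFile(fileobj=GzipStreamFile(k))
    for record in f:
        # Keep only URLs that carry a PHP id query parameter
        if record.url is not None and 'php?id=' in record.url:
            results.append(record.url)
    # Report how long this file took, then store the matching URLs
    print(time.time() - t1)
    save_file_s3('\n'.join(results), swp)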