How I Computer Science

Travis Hoppe, PhD

Postdoctoral Fellow, National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD

C.S. (mis)conceptions

What I thought it was... What it actually is...

Why I do it...

Outline

Science

My Background: PhD, MS Physics, BS Mathematics

Protein "Physics"

Folding, interfaces, aggregation, electrostatics, statistical mechanics, ...

Not talking about Ising model topology, hacking graph theory, or other interesting ventures...

Protein Structure

Primary structure (sequence)

GSIGAASMEF CFDVFKELKV HHANENIFYC PIAIMSALAM VYLGAKDSTR TQINKVVRFD KLPGFGDEIE AQCGTSVNVH 
SSLRDILNQI TKPNDVYSFS LASRLYAEER YPILPEYLQC VKELYRGGLE PINFQTAADQ ARELINSWVE SQTNGIIRNV 
LQPSSVDSQT AMVLVNAIVF KGLWEKAFKD EDTQAMPFRV TEQESKPVQM MYQIGLFRVA SMASEKMKIL ELPFASGTMS 
MLVLLPDEVS GLEQLESIIN FEKLTEWTSS NVMEERKIKV YLPRMKMEEK YNLTSVLMAM GITDVFSSSA NLSGISSAES 
LKISQAVHAA HAEINEAGRE VVGGAEAGVD AASVSEEFRA DHPFLFCIKH IATNAVLFFG RCVSP

Secondary structure: helices [red], sheets [blue]

Tertiary structure: 3D structure

Higher-order structure: complexes, aggregation

Ovalbumin, egg white protein. PDB:1OVA, crystal structure, Carrell et al., J. Mol. Biol. (1991)

SEM aggregate structure, Zabik et al., J. Poul. Sci. (1980)

Protein interactions

Folding

Binding, dimerization

Aggregation, fibril formation

Protein interactions in a crowded environment

How do we do it?

Statistical Potentials: Residue-residue interactions

Potentials constructed from Top 8000 Protein Database, Richardson Group

Residue-residue interaction matrix, MJ

Other statistical potentials: Tanaka and Scheraga (1976), Sippl (1990), Miyazawa and Jernigan (1996), Betancourt and Thirumalai (1999), Skolnick, Kolinski and Ortiz (2000)
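
As a rough illustration of how a statistical potential can be built, here is a minimal sketch of the inverse-Boltzmann idea: residue pairs seen in contact more often than chance get a negative (favorable) energy. The function name `contact_potential`, the toy contact list, and the simple composition-based reference state are assumptions for illustration; this is not the exact Miyazawa-Jernigan procedure.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement


def contact_potential(contacts):
    """Toy inverse-Boltzmann potential, in units of kT.

    contacts: list of (residue_a, residue_b) pairs observed in contact.
    Returns {(a, b): energy} with a <= b; pairs never observed are skipped.
    """
    # Count unordered pairs, so ("P", "H") and ("H", "P") are the same contact.
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)

    # How often each residue type takes part in any contact.
    single_counts = Counter()
    for a, b in contacts:
        single_counts[a] += 1
        single_counts[b] += 1

    total_pairs = sum(pair_counts.values())
    total_single = sum(single_counts.values())

    potential = {}
    for a, b in combinations_with_replacement(sorted(single_counts), 2):
        observed = pair_counts.get((a, b), 0) / total_pairs
        if observed == 0:
            continue
        # Reference state (an assumption): contacts form at random,
        # in proportion to how common each residue type is.
        freq_a = single_counts[a] / total_single
        freq_b = single_counts[b] / total_single
        expected = freq_a * freq_b * (1 if a == b else 2)
        potential[(a, b)] = -math.log(observed / expected)
    return potential


# Made-up contacts between hydrophobic (H) and polar (P) residues.
toy_contacts = [("H", "H")] * 6 + [("H", "P")] * 2 + [("P", "P")] * 2
for pair, energy in contact_potential(toy_contacts).items():
    print(pair, round(energy, 3))
```

In this toy set the H-H and P-P pairs come out favorable and the mixed H-P pair unfavorable, which is the kind of block structure an interaction matrix can reveal.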

MJ matrix reveals biophysical structure

H (hydrophobic), P (polar), C (charged)

Higher order structure

Phase separations lead to sudden changes in liquid structure.

Leibler, Nature 2004

Tanaka, Phys. Rev. E 2005

How do we model many protein-protein interactions?

Can we predict aggregates from experimental structure?

Human serum albumin, PDB:1AO6

Ovalbumin, PDB:1OVA

Lysozyme, PDB:1W6Z

Bovine serum albumin, PDB:3V03

Where does CS come into play?

Be able to say what is possible, and what isn't!

Algorithmic design, ex. linear algebra, molecular dynamics...

Hardware design, specialized hardware, ex. Anton, GRAPE.

Predicting run-time (non-trivial at model stage!).

Scaling up!
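
To make "predicting run-time" concrete, here is a minimal sketch that estimates how cost grows with system size. It assumes a naive all-pairs interaction loop (the talk does not say which algorithm was used), and the helper names `pair_evaluations` and `scaling_exponent` are made up for this example.

```python
import math


def pair_evaluations(n_particles):
    """Number of distinct pairs a naive all-pairs energy loop visits."""
    return n_particles * (n_particles - 1) // 2


def scaling_exponent(n_small, cost_small, n_large, cost_large):
    """Estimate k in cost ~ N**k from two (size, cost) measurements."""
    return math.log(cost_large / cost_small) / math.log(n_large / n_small)


# Doubling the system size roughly quadruples the work for an all-pairs loop.
k = scaling_exponent(1000, pair_evaluations(1000), 2000, pair_evaluations(2000))
print(f"estimated exponent: {k:.2f}")
```

The same `scaling_exponent` helper works on measured wall-clock times from two small test runs, which gives a rough extrapolation before committing to a large job.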

Machine Learning

Meet The Man Who Gamed Reddit With A Bot

Not talking about transorthogonal linguistics, colorless green ideas, code linguistics, or Godwin's Law...

The goal

Train a machine to find new & interesting things

Requires a corpus of interesting things...

Supervised learning

r/TIL, a subreddit short for Today I Learned

Keep only Wikipedia data

Filter for consistent writing style...

Data collection

Download Wikipedia
Download all posts with score > 1000 for 2013 and 2014 (~5000)
Cross-reference each post to the correct Wikipedia paragraph
Build true positives (known TILs)
Build decoys (other paragraphs in TILs)
Build unknown samples (rest of Wikipedia*)

from python import science

sqlite3, requests, bs4, pandas, numpy, scikit-learn, gensim, praw, wikipedia, nltk, stemming.porter2

*Assume that most of Wikipedia isn't interesting...
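
The cross-referencing and labeling steps could look something like the sketch below. The talk does not say how posts were matched to paragraphs, so the word-overlap (Jaccard) matching here is purely an assumption, and `words`, `jaccard`, and `label_paragraphs` are hypothetical helper names.

```python
import re


def words(text):
    """Lowercased set of alphanumeric words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Overlap between two word sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_paragraphs(post_title, paragraphs):
    """Split one linked article into (true_positive, decoys).

    The paragraph sharing the most words with the post title is taken as the
    known TIL; every other paragraph of the same article becomes a decoy.
    """
    title_words = words(post_title)
    scores = [jaccard(title_words, words(p)) for p in paragraphs]
    best = max(range(len(paragraphs)), key=scores.__getitem__)
    decoys = [p for i, p in enumerate(paragraphs) if i != best]
    return paragraphs[best], decoys


# Placeholder text only; real input would be a post title and article paragraphs.
article = ["alpha beta gamma", "delta epsilon zeta eta", "theta iota"]
positive, decoys = label_paragraphs("TIL delta epsilon zeta", article)
print(positive, decoys)
```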

Data Wrangling

Tokenize

>> "Good muffins cost $3.88\n in New York"
['Good', 'muffins', 'cost', 'TOKEN_MONEY', 'in', 'New', 'York', 'TOKEN_EOS']
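
A minimal sketch of a tokenizer that reproduces the example above, using only the standard library. The regular expressions and the function name `tokenize` are assumptions; the tokenizer actually used in the project (likely built on nltk) is not shown in the slides.

```python
import re

# Dollar amounts such as $3.88 or $5 collapse into one placeholder token.
MONEY = re.compile(r"\$\d+(?:\.\d+)?")


def tokenize(text):
    """Split text into word tokens, masking money and marking the end."""
    text = MONEY.sub(" TOKEN_MONEY ", text)
    tokens = re.findall(r"\w+", text)
    return tokens + ["TOKEN_EOS"]


print(tokenize("Good muffins cost $3.88\n in New York"))
```

Replacing every price with a single token keeps the vocabulary small, since the exact amount rarely matters for deciding whether a paragraph is interesting.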

Remove "stop words"

>> "I sat on the rock"
['I', 'sat', 'on', 'rock']
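
Stop-word removal is a simple filter. The sketch below uses a deliberately tiny, hypothetical stop list chosen so that it reproduces the output above; standard lists are much longer and would typically drop words like "I" and "on" as well.

```python
# Hypothetical, minimal stop list for illustration only.
STOP_WORDS = {"the", "a", "an"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words("I sat on the rock".split()))
```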

Stem words

>> stem("factionally")
'faction'

"Entropy" vectors

Counts the uniqueness of each word relative to the rest of the entry: a local TF-IDF (term frequency-inverse document frequency).
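
One way to read "local TF-IDF" is to treat each paragraph of a single article as a document, so a word scores highly when it is frequent in its paragraph but rare in the rest of the entry. The sketch below follows that reading; the exact weighting used in the project is not given, so the formula and the name `local_tfidf` are assumptions.

```python
import math
from collections import Counter


def local_tfidf(paragraphs):
    """TF-IDF computed within ONE article.

    paragraphs: list of token lists, one per paragraph.
    Returns one {word: weight} dict per paragraph.
    """
    n_paragraphs = len(paragraphs)

    # Number of paragraphs in which each word appears at least once.
    doc_freq = Counter()
    for tokens in paragraphs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in paragraphs:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_paragraphs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


toy_article = [["egg", "white", "protein"], ["egg", "yolk"], ["egg", "shell"]]
for paragraph_weights in local_tfidf(toy_article):
    print(paragraph_weights)
```

A word that appears in every paragraph ("egg" here) gets weight zero, while a word unique to one paragraph stands out.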

Feature generation

Used Word2Vec (developed by Google), weighted by local article TF-IDF

>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
>>> model['computer']  # raw numpy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

Uses far fewer features to store relationships between words!
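
The slides say only that the word vectors were "weighted by local article TF-IDF". A plausible reading, sketched below with plain Python lists, is a weighted average of the word vectors in a paragraph. The function name `paragraph_vector`, the choice to skip out-of-vocabulary words, and the toy two-dimensional vectors are all assumptions.

```python
def paragraph_vector(tokens, word_vectors, weights):
    """Weighted average of word vectors for one paragraph.

    tokens: list of words in the paragraph.
    word_vectors: {word: list of floats}, e.g. taken from a Word2Vec model.
    weights: {word: float}, e.g. local TF-IDF scores.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in word_vectors:
            continue  # skip words the embedding model has never seen
        weight = weights.get(token, 0.0)
        for i, value in enumerate(word_vectors[token]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing carried any weight
    return [value / weight_sum for value in total]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
toy_weights = {"a": 3.0, "b": 1.0}
print(paragraph_vector(["a", "b", "c"], toy_vectors, toy_weights))
```

Each paragraph ends up as one fixed-length vector, whatever its word count, which is what a tree-based classifier needs as input.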

Model training

Used Extremely Randomized Trees, a variant of the Random Forest classifier.

Training classifier
Test Accuracy: 0.878;    Test Accuracy on TP: 0.116;   Test Accuracy on TN: 0.998
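
The gap between the three numbers above is easier to see with per-class accuracy spelled out. The sketch below computes it from labels and predictions, then uses the fact that overall accuracy is a weighted average of the two per-class accuracies to back out the fraction of positives. The function names and toy labels are made up, and the last step assumes all three reported numbers come from the same test set.

```python
def per_class_accuracy(y_true, y_pred):
    """Return (overall, accuracy on positives, accuracy on negatives)."""
    pairs = list(zip(y_true, y_pred))
    positives = [pred for true, pred in pairs if true == 1]
    negatives = [pred for true, pred in pairs if true == 0]
    overall = sum(true == pred for true, pred in pairs) / len(pairs)
    acc_pos = sum(pred == 1 for pred in positives) / len(positives)
    acc_neg = sum(pred == 0 for pred in negatives) / len(negatives)
    return overall, acc_pos, acc_neg


def implied_positive_fraction(overall, acc_pos, acc_neg):
    """Solve overall = p * acc_pos + (1 - p) * acc_neg for p."""
    return (acc_neg - overall) / (acc_neg - acc_pos)


# Toy labels: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(per_class_accuracy(y_true, y_pred))

# Applied to the numbers reported on the slide.
print(round(implied_positive_fraction(0.878, 0.116, 0.998), 3))
```

Under that assumption, only a small share of the test set is positive, so a classifier can score well overall while rarely flagging a true positive; being that conservative is acceptable when the goal is to submit only the most confident picks.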

Receiver Operating Characteristic

Does it work?

yes! look at all that sweet front-page karma...

TIL The Founder Of Japans Mcdonalds Stated | 4726
TIL Mike Kurtz An American Burglar Found Out That | 4123
TIL A Woman That Reported 100 Incidents Of | 2899
TIL During The Sentencing Of His War Crimes Trial | 1551
TIL That Art Spiegelman The Creator Of Maus A | 1144
TIL That Once Officially Labeled As Retarded | 640
TIL Before World War Ii It Was Very Rare For | 498
TIL That A Study Showed Those With A Distressed | 142
TIL Frankie Fraser A Notorious English Gangster | 135
TIL Rafael Quintero A Mexican Drug Trafficker | 68
...

AI vs. Human (Turing test pt. 1)

I can do (almost) anything you can do better...

Turing test pt. 2

After three months and 60 submissions, I revealed to Reddit the true nature of /u/possible_urban_king.

The account was promptly banned from r/todayIlearned ... thanks, anonymous moderator, for helping prove the test!

The Turing test is a necessary but not sufficient test for artificial intelligence.

Artificial Intelligence Machine Learning

Where does CS come into play?

Natural language parsing, NLP.

Supervised and unsupervised learning.

Knowing the right algorithm and its limitations...

Validation and statistics.

Public Relations

Build a portfolio

Network with others

Before you start ... and once you get out there...

Advertise yourself!

Learn from others!

... computer science is more than just code ...

Learn from others / help others!

Stack Overflow

Challenge yourself!

PE: Math challenges that require coding.

Kaggle: Machine learning for profit!

TC: Mini-hackathons and prizes!

HR: Used in interviews.

Share your code!

github

Meetups and Hackathons

Meetup

Shameless plug and Extra Credit!

DC Hack && Tell

Next event October 13th!

Thanks, you!

For class participation credit, fill out this questionnaire:

Presentation Review

http://bit.ly/1KVprYC

Travis Hoppe, PhD (@metasemantic)
