Introduction to Big Data with practical use-cases (Meetup Talk)

On GitHub: dhilipsiva/intro-to-big-data

Introduction to Big Data

with practical use-cases

@dhilipsiva

@dhilipsiva

  • Full-Stack & DevOps Engineer @Appknox
  • Big Data, Machine Learning & IoT Enthusiast
  • Open-Source Fanatic & GitHub Addict
  • Father

Data & History

Let's discuss the history of data

Surprise, Surprise!!!

  • Each card is about 100 Bytes
  • 62500 Cards
  • 5.96 MB in total (approx. 3 floppy disks)
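
A quick check of the card arithmetic, as a minimal sketch. It assumes the 5.96 figure uses binary megabytes (1 MB = 1024 × 1024 bytes); the variable names are illustrative only.

```python
# Figures from the slide: about 100 bytes per card, 62,500 cards.
BYTES_PER_CARD = 100
CARD_COUNT = 62_500

# Assumption: "MB" on the slide means 1024 * 1024 bytes.
BYTES_PER_MB = 1024 * 1024

total_bytes = BYTES_PER_CARD * CARD_COUNT
total_mb = total_bytes / BYTES_PER_MB

print(f"{total_bytes:,} bytes")   # 6,250,000 bytes
print(f"{total_mb:.2f} MB")       # 5.96 MB
```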

Data Today

  • HDD
  • SSD
  • Flash Storage

Data Tomorrow

  • Data on Bacteria
  • Data on DNA
  • Qubits
  • And so much more

Computers - Back then

  • What was a computer?

  • Or more specifically, who were computers?

10 years down the line

  • Mechanical *
  • The Electronics

Computers - Today

  • Moore's law
  • It has already hit a roadblock

Computers - tomorrow

  • Quantum

Inputs - Back then

  • Punch cards - 10 professional programmers would take about 10 days to generate 5 MB of data

  • Keyboards - 1 typist can produce about 1.7 MB of data in 24 hours
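
The two input figures above can be put on the same scale with a little arithmetic. This is a rough sketch that takes the slide's approximate numbers at face value and assumes 1 MB = 1024 × 1024 bytes; the variable names are illustrative only.

```python
# Assumption: "MB" on the slides means 1024 * 1024 bytes.
BYTES_PER_MB = 1024 * 1024

# Punch cards: 10 programmers, 10 days, 5 MB in total (slide figures).
punch_bytes_per_programmer_day = 5 * BYTES_PER_MB / (10 * 10)

# Keyboard: 1 typist, 1.7 MB in 24 hours (slide figures).
keyboard_bytes_per_day = 1.7 * BYTES_PER_MB
keyboard_bytes_per_second = keyboard_bytes_per_day / (24 * 60 * 60)

# How many times more data one typist produces per day than one programmer.
speedup = keyboard_bytes_per_day / punch_bytes_per_programmer_day

print(f"Punch cards: {punch_bytes_per_programmer_day:,.0f} bytes per programmer per day")
print(f"Keyboard:    {keyboard_bytes_per_second:.1f} bytes per second")
print(f"Keyboard vs punch cards: about {speedup:.0f}x per person per day")
```

On these numbers, one typist produces roughly 34 times as much data per day as one punch-card programmer.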

Inputs - today

What happens in a minute

[doubled in the pic]

Keyboards, mobile phones, and cameras alone produce text, audio, pictures, and video.

Big Data

And that is Big Data :)

But Wait

What about the Future?

  • IoT

  • Cutting-edge research: DNA, cancer treatment, genome research, etc. [Will discuss at the end]

Census

  • First, there was the census
  • Then came computers

Moral of the story

  • Why just one more powerful computer?

  • How to apply the same technique with today's computers?
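
One way to picture the moral: instead of one bigger machine, split the records into parts, tally each part separately, and merge the partial tallies. The sketch below is an illustration only. Threads stand in for separate machines, and the function names and sample records are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tally(records):
    """Count the records in one partition (the work of one 'machine')."""
    return Counter(records)


def split(records, parts):
    """Deal the records round-robin into `parts` partitions."""
    return [records[i::parts] for i in range(parts)]


def parallel_tally(records, workers=4):
    """Tally each partition on its own worker, then merge the results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(tally, split(records, workers)):
            total.update(partial)
    return total


# Hypothetical census-style records: one city name per person.
records = ["Chennai", "Mumbai", "Chennai", "Delhi", "Mumbai", "Chennai"]
result = parallel_tally(records, workers=3)
print(dict(result))
```

The merged result is the same as tallying everything on a single worker; only the way the work is divided changes.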

Hadoop - to the rescue

But first - History

Google & MapReduce
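
The MapReduce idea in miniature: a word count with explicit map, shuffle, and reduce phases. This is a single-process sketch in plain Python, not the Hadoop or Google API. The function names and sample documents are hypothetical, and a real cluster would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


documents = ["big data is big", "data is everywhere"]
counts = word_count(documents)
print(counts)
```

Each map call looks at only one document and each reduce call at only one key, which is what lets a framework spread the work across machines.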

Hadoop & Co.

  • Hadoop
  • MapReduce
  • Ambari: A web-based tool for provisioning, managing and monitoring Apache Hadoop clusters
  • Avro: A data serialization system
  • HBase: A column-oriented data store
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • Spark: A compute engine for ETL (Extract, Transform, Load), machine learning, stream processing, and graph computation.
  • ZooKeeper: A high-performance coordination service for distributed applications.

demo

[show the demo]

Real World Applications

Applications: Google

Webpage indexing

Face detection

Personalized Ads

Plenty of others

Applications: Facebook

Malware detection

Spam detection

Finding Faces

And much more

Applications: Twitter

Trending Posts

Analytics

Applications: Others

Appknox

Nest

Cancer Research

Rice DNA

Applications: Future

IoT - Health, Home Appliances, Weather

Thanks! & Questions?
