Kinds of database – CAP Theorem – Key Value Stores




No Right Way presentation 1st given at Bath Digital Festival - Web Day

On Github ukmadlz / no-right-way

No Right Way

A discussion on Databases and a guide to CouchDB and Cloudant

ANY VIEWS OR OPINIONS EXPRESSED IN THIS PRESENTATION ARE THOSE OF THE AUTHOR, AND DO NOT NECESSARILY REPRESENT OFFICIAL POSITIONS, STRATEGIES OR OPINIONS OF INTERNATIONAL BUSINESS MACHINES (IBM) CORPORATION.

THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY.

WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.

IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE.

IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.

NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF: CREATING ANY WARRANTY OR REPRESENTATION FROM IBM (OR ITS AFFILIATES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS); OR ALTERING THE TERMS AND CONDITIONS OF THE APPLICABLE LICENSE AGREEMENT GOVERNING THE USE OF IBM SOFTWARE.

Who are you…?

Mike Elsmore

Developer Advocate for Cloudant

mike.elsmore@uk.ibm.com

@ukmadlz

Kinds of database

Currently, databases are classified as either Relational or NoSQL

NoSQL itself is just a catch-all term for non-relational data structures.

As with RDBs, NoSQL comes in multiple flavours and brands

Relational Databases

  • MySQL
  • MariaDB
  • MS SQL
  • Oracle
  • IBM DB2
  • PostgreSQL

NoSQL

  • MongoDB
  • CouchDB
  • Couchbase
  • Riak
  • Cassandra

What is NoSQL

The main differences between RDBs and NoSQL, apart from the obvious lack of SQL:

Variation: Graph, XML, JSON & triples

New query languages: you can have an alternative to SQL, so possibly simpler or structure-specific

'Schema less': no rigid schema enforced by the DBMS

Programmer friendly (we hope): easy to navigate the structure programmatically

May not guarantee full ACID behavior

 Atomicity – Consistency – Isolation – Durability

May have a distributed, fault-tolerant, elastic architecture

What's the appeal

Data Model Flexibility

Elastic (automatic) scale in/out

Lower-cost operational data management platform for
thousands & millions of users

Data Model Flexibility

  • Data models that are native to the application space (e.g. JSON)
  • No “schema-first” requirement: rapid and agile development process

Elastic (automatic) scale in/out

  • Easy elasticity and scalability to multiple racks (10s to 100s of servers)
  • Supports dynamic workloads
  • Optimized for web scale and extreme performance
  • Ease of replication

Lower-cost operational data management platform for
thousands & millions of users

  • Increase in volumes of data, retention requirements (3-15 years)
  • Commodity hardware and pay-for services
  • Fault-tolerance and high availability

CAP Theorem

Taken from

CAP Theorem, also known as Brewer's Theorem, after Eric Brewer

  • Consistency: all nodes belonging to the system see the same data at the same time
  • Availability: a guarantee that every request receives a response of success or failure
  • Partition tolerance: the system continues to operate despite message loss or failure of parts of the system

Given a data partition, will you prioritize consistency or availability?
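
To make the question concrete, here is a minimal toy sketch of the trade-off. It is not how any real database is implemented: the two-replica "cluster", the `makeCluster` and `write` helpers and the `partitioned` flag are all hypothetical names invented for illustration. A CP-style cluster refuses a write it cannot replicate, while an AP-style cluster accepts it and lets the replicas disagree.

```javascript
// Toy model: two replicas of a single value plus a "network partition" flag.
// mode is 'CP' (prefer consistency) or 'AP' (prefer availability).
function makeCluster(mode) {
  return { mode, partitioned: false, nodes: [{ value: 0 }, { value: 0 }] };
}

function write(cluster, nodeIndex, value) {
  const local = cluster.nodes[nodeIndex];
  const remote = cluster.nodes[1 - nodeIndex];

  if (!cluster.partitioned) {
    // Healthy network: both replicas receive the write.
    local.value = value;
    remote.value = value;
    return { ok: true };
  }
  if (cluster.mode === 'CP') {
    // Cannot reach the other replica, so refuse rather than diverge.
    return { ok: false, error: 'write refused during partition' };
  }
  // AP: accept the write locally; the replicas now hold different values.
  local.value = value;
  return { ok: true };
}

const cp = makeCluster('CP');
cp.partitioned = true;
const cpResult = write(cp, 0, 42);

const ap = makeCluster('AP');
ap.partitioned = true;
const apResult = write(ap, 0, 42);

console.log('CP:', cpResult, cp.nodes);
console.log('AP:', apResult, ap.nodes);
```

In the CP run both replicas keep their old value and the write is rejected; in the AP run the write succeeds but only one replica has it.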

And this applies to Relational how?

When to use Relational Database Management Systems (RDBMS)

  • Data normalization is critical for elimination of redundancy and ensuring master data consistency
  • Many “justifications” for using relational databases are cultural, not technical, e.g. solution already built with a relational DB, resistance to change
  • Prioritization of data availability and consistency lends RDBMS well to handling transactional, reporting, log, and warehouse data
  • Analytics and BI tooling are valid reasons for maintaining a relational database, but this too is rapidly changing

According to the CAP Theorem, a relational database that guarantees both consistency and availability cannot also be partition tolerant

  • So long as an RDBMS prioritizes availability and consistency, it is unable to scale out (horizontally)!
  • Vertical scaling is the alternative, but this practice becomes prohibitively expensive and is not sustainable

And NoSQL fits in here…?

Most NoSQL technologies have been built for scale, which means that they fit either a CP or AP model

Popular datastores like Couchbase & Mongo try to be CP, but fall back to AP when things get tough

CouchDB follows an AP approach from the start, using eventual consistency

Key Value Stores

Columnar Stores

Graph Stores

Document Stores

Key Value Stores

  • Cassandra
  • Riak
  • MemcacheDB
  • HBase
  • pickleDB

Columnar Stores

  • Cassandra
  • HBase

Document Stores - 50% of NoSQL DBs are document based

  • CouchDB (and, because of it, Cloudant)
  • MongoDB
  • Redis
  • Couchbase
  • Engine Yard

Graph Stores

  • *dex
  • Neo4j
  • InfiniteGraph
  • Sesame
                
        {
            "firstName" : "John",
            "lastName" : "Smith",
            "age" : 25,
            "address" :
            {
                "streetAddress" : "21 2nd Street",
                "city" : "New York",
                "state" : "NY",
                "postalCode" : "10021"
            },
            "phoneNumber" :
            [
                {
                    "type" : "home",
                    "number" : "212 555-1234"
                },
                {
                    "type" : "fax",
                    "number" : "646 555-4567"
                }
            ]
        }
                
              

And they all use JSON, or some derivative that's basically JSON under a different name
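
Part of the appeal is that a JSON document maps straight onto the objects and arrays the application already works with. The short sketch below uses the same example document in plain JavaScript; the variable names (`person`, `city`, `fax`, `roundTripped`) are just illustrative.

```javascript
// The example document as a native JavaScript object.
const person = {
  firstName: 'John',
  lastName: 'Smith',
  age: 25,
  address: {
    streetAddress: '21 2nd Street',
    city: 'New York',
    state: 'NY',
    postalCode: '10021',
  },
  phoneNumber: [
    { type: 'home', number: '212 555-1234' },
    { type: 'fax', number: '646 555-4567' },
  ],
};

// Nested fields are reached with ordinary property access.
const city = person.address.city;

// Arrays inside the document are ordinary arrays.
const fax = person.phoneNumber.find((p) => p.type === 'fax').number;

// Serialising to JSON text and parsing it back gives an equivalent object.
const roundTripped = JSON.parse(JSON.stringify(person));

console.log(city, fax, roundTripped.address.postalCode);
```

No mapping layer or join is needed to get from the stored document to the address or the phone numbers.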

How does this apply to CAP theorem

As I said, all these datastores use, or fall back to, an AP approach. Which means…

Instead of prioritizing consistency and availability, shift focus towards ensuring availability and partition tolerance

  • Unlikely to find a scenario where loss in availability would be tolerable
  • Selecting for partition tolerance opens up possibility for horizontal scaling!
  • Distribution over a cluster also improves availability: in aggregate, the cluster is more reliable than the individual nodes that comprise it

This comes at the cost of a weakened consistency model

  • Can no longer guarantee that all nodes (and clients connected to these nodes) share identical versions of the same data at a given moment
  • The result is an “eventual consistency” model: the premise that all nodes in a distributed system will eventually share the same version of all data, given sufficient time
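
A rough way to picture eventual consistency is two replicas that accept writes independently and later exchange what they have. The sketch below is a deliberately simplified "highest version wins" merge, not the revision-tree scheme CouchDB actually uses; `makeReplica`, `put`, `sync` and the numeric `version` field are hypothetical helpers made up for illustration.

```javascript
// Each replica is just a map of document id -> { value, version }.
function makeReplica() {
  return { docs: new Map() };
}

function put(replica, id, value, version) {
  replica.docs.set(id, { value, version });
}

// Two-way sync: for every document, the copy with the higher version wins.
function sync(a, b) {
  for (const [from, to] of [[a, b], [b, a]]) {
    for (const [id, doc] of from.docs) {
      const existing = to.docs.get(id);
      if (!existing || doc.version > existing.version) {
        to.docs.set(id, doc);
      }
    }
  }
}

const r1 = makeReplica();
const r2 = makeReplica();

// The replicas receive different writes while out of contact.
put(r1, 'unicorn', 'Aurora', 1);
put(r2, 'unicorn', 'Aurora v2', 2);
const divergedBeforeSync =
  r1.docs.get('unicorn').value !== r2.docs.get('unicorn').value;

// Given time (here, one sync), both replicas end up with the same data.
sync(r1, r2);

console.log(divergedBeforeSync, r1.docs.get('unicorn'), r2.docs.get('unicorn'));
```

Between the writes and the sync, a client could read either value depending on which replica it reached; after the sync both replicas agree.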

Why use MongoDB

It's quick

It's easy to use

            
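              // Insert a document; the collection is created on first insert
              db.unicorns.insert({name: 'Aurora', gender: 'f', weight: 450});
              // List every document in the collection
              db.unicorns.find();
              // List the collection's indexes (only the default _id index so far)
              db.unicorns.getIndexes();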
            
          

It's quick to get started. You:

  • download the binaries (server and client)
  • create and set the config file
  • launch the DB binary

Written around JS, which is why it works AMAZINGLY within the MEAN stack, means simple chained

Why use Cloudant / CouchDB

It works as an HTTP API

It's eventually consistent

It's managed - if you use Cloudant
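
Because everything is plain HTTP, talking to CouchDB needs nothing more than an HTTP client. The sketch below only builds the requests, so it runs without a server. The base URL (CouchDB's default port on localhost), the `unicorns` database name and the helper names are assumptions for illustration; a Cloudant account would use its own URL and credentials.

```javascript
// Assumed values for illustration only.
const BASE_URL = 'http://localhost:5984';
const DB = 'unicorns';

// PUT /{db} creates a database.
function createDatabase() {
  return { method: 'PUT', url: `${BASE_URL}/${DB}` };
}

// PUT /{db}/{docid} stores a document under the given id.
// Updating an existing document also requires its current _rev in the body.
function saveDocument(id, doc) {
  return {
    method: 'PUT',
    url: `${BASE_URL}/${DB}/${encodeURIComponent(id)}`,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc),
  };
}

// GET /{db}/{docid} reads a document back.
function readDocument(id) {
  return { method: 'GET', url: `${BASE_URL}/${DB}/${encodeURIComponent(id)}` };
}

// Sends a request with the built-in fetch (not called here; needs a running server).
async function send(req) {
  const res = await fetch(req.url, {
    method: req.method,
    headers: req.headers,
    body: req.body,
  });
  return res.json();
}

const saveReq = saveDocument('aurora', { name: 'Aurora', weight: 450 });
console.log(createDatabase(), saveReq, readDocument('aurora'));
```

With a server running, `send(createDatabase())` followed by `send(saveReq)` would perform the calls for real.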

Why carry on with Relational

http://sqoop.apache.org/

All the NoSQL DBs have merits, and they work well when implemented well; they just aren't cut out for everything

Because SQL databases take consistency and availability as their primary factors, they are the only real choice for sensitive transactional data

For example, with financial information the availability must be there to read data, but a change such as taking money out must be consistent (you don't want the same money taken out in 2 locations)
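
The "money out in 2 locations" problem can be shown with a toy sketch. Everything here is hypothetical (the `withdraw` helper, the balances, the two replicas); it only illustrates why a single consistent view of the balance matters for this kind of data.

```javascript
// Withdraw only if the balance we can see covers the amount.
function withdraw(account, amount) {
  if (account.balance < amount) return false;
  account.balance -= amount;
  return true;
}

// One consistent ledger: the second withdrawal sees the first and is refused.
const ledger = { balance: 100 };
const first = withdraw(ledger, 100);
const second = withdraw(ledger, 100);

// Two replicas that have not synchronised yet: each still sees 100.
const replicaA = { balance: 100 };
const replicaB = { balance: 100 };
const paidByA = withdraw(replicaA, 100);
const paidByB = withdraw(replicaB, 100);
const totalPaidOut = (paidByA ? 100 : 0) + (paidByB ? 100 : 0);

console.log({ first, second, totalPaidOut });
```

With the single ledger only 100 leaves the account; with the unsynchronised replicas 200 is paid out against a balance of 100.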

Also for all the Big Data stuff you REALLY can just use Sqoop and process it in Hadoop separately

The End


©2014 IBM Corporation