About Me
Felipe Fernández
Some craftsmanship-related topics
- BDD
- Outside-in development
- Focus on delivery
- Minimise waste
About the talk
- 1. Cassandra rationale
- 2. Cassandra internals
- 3. Data modeling
1. Cassandra rationale
- Modern problems: scalability, high availability, locality, low latency
- Scalability because of internet scale
- HA because of internet timezones
- Locality because of the speed of light
- Low latency because of millennials
Cassandra rationale
RDBMS Approach to modern world challenges
- Third normal form
- Favour space over time
- Stack Overflow to the rescue
- NF1: A table cell must not contain more than one value.
- NF2: NF1, plus all non-primary-key columns must depend on all primary-key columns.
- NF3: NF2, plus non-primary-key columns may not depend on each other.
Cassandra rationale
ACID Properties
- Atomicity
- Consistency
- Isolation
- Durability
- ACID properties apply to a single operation or to a transaction
- Those properties are a safety net, but they don't come for free
- They are really hard to implement in a distributed system.
Cassandra rationale
Relational Data Modeling
- Normalisation
- Referential integrity
- Strong uniqueness
- Indexes
- Joins
- Aggregations
- Our tables have a strong connection with our entity/relationship model
- The application can be more relaxed about checking referential integrity, as the DB will fail explicitly
- The primary key serves the purpose of uniqueness; again, the DB will fail on violations
- Indexes give us flexibility when we design our tables, as we can add them whenever we want
- If the data is not contained in a single table we need to join several tables, creating locks and likely contention over those resources
- There are several aggregation functions: min, max, count, sum...
Cassandra rationale
RDBMS Problems
- Scale up
- Sharding
- Availability
- Locality
- The usual RDBMS approach is scaling up, buying more resources for a single machine; that's expensive and hits a limit quite soon
- A database shard is a horizontal partition of the data. Sharding makes the system tolerant to partitioning
- If you stick with a single point of failure, it's not possible to achieve HA. Distributed ACID is really hard to achieve.
- Again, if you have a single instance you can't co-locate data close to users in different regions
Cassandra rationale
Cassandra to the rescue
- Distributed peer to peer architecture
- Scale out
- Low latency
- CAP theorem
- No master/slave architecture, so no single point of failure.
- Commodity servers, perfect for cloud computing
- Denormalisation and tables based on queries. A time-over-space trade-off
- It trades consistency for partition tolerance and availability. It provides tunable consistency, though.
Cassandra rationale
Right tool for the right job
- Big data
- Higher storage costs
- Change mindset
- If your data fits on a single machine, it's not big data. Avoid the overkill and complications of distributed systems if possible
- Storage is getting cheaper and cheaper, but it's important to note that we're trading space for time; there's no such thing as a free lunch
- You can't simply plug Cassandra into your old RDBMS-based system. A massive refactor needs to be done.
2. Cassandra internals
- Based on Google BigTable and Amazon Dynamo.
- Used by companies like Netflix and Spotify.
- Open source project backed by DataStax.
Cassandra internals
Distribution internals
- Data is partitioned around the ring
- Coordinator node
- Replication factor
- Consistency level
- Every partition has a key; using a hash algorithm, we can get the location of the node where that data will live.
- Clients talk to a coordinator node, which is assigned per request. That coordinator applies the hash function.
- RF is set per keyspace. We need several replicas to ensure high availability even when some node is down.
- CL is set per query/command. The coordinator will wait until the required number of nodes ack the request. It could be ONE, QUORUM, ALL...
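A minimal sketch of how these two knobs surface in CQL and cqlsh (the keyspace, table and column names are hypothetical):

```sql
-- RF is chosen when the keyspace is created
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- CL is chosen per request; in cqlsh it applies to the statements that follow
CONSISTENCY QUORUM;
SELECT * FROM demo.users WHERE user_id = 42;
```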
Cassandra internals
Distribution internals
- Column family
- Keyspace
- Node
- Rack
- Data center
- Cluster
- Or table. Cassandra has some confusing names around tables, keys and so on. In the end, it's a collection of partitions.
- A collection of CFs. RF is set at this level. Similar to a schema or namespace.
- Could be a host. Every node owns a range of the partition ring, so lookups can be extremely fast.
- Analogous to availability zones in AWS. Your replication strategy should be aware of this and avoid placing replicas on the same rack, otherwise replication is pointless.
- There could be one in the USA and another one in Ireland. RF works like a mirror across data centers, and there are CLs that address this scenario.
- The whole Cassandra system.
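For example, a keyspace's replication can be made data-center aware, and there are data-center aware consistency levels such as LOCAL_QUORUM. A sketch, assuming two data centers named 'us_east' and 'eu_west':

```sql
-- Replicas per data center (DC names are illustrative and must match the snitch configuration)
CREATE KEYSPACE demo_multi_dc
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'us_east': 3, 'eu_west': 3};

-- LOCAL_QUORUM only waits for a quorum of replicas in the coordinator's own DC
CONSISTENCY LOCAL_QUORUM;
```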
Cassandra internals
Storage internals: Write path
- Commit log
- Memtable
- SSTable
- Compaction
- Tombstones
- An append-only structure. It provides durability in case of a crash; kind of a backup of the memtable
- The second write. One per CF. It acts like a first cache in front of the SSTables
- SSTables are written when memtables are full and get flushed to disk
- Merging SSTables to make reads faster, to remove old data and to remove deleted data.
- Markers for deletion; the actual removal happens on compaction
- No writes or updates in place. Append-only immutable structures allow sequential disk writes or in-memory ones. Extremely fast
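A small sketch of how this surfaces in CQL (table name and option values are illustrative, not a recommendation):

```sql
-- A delete only appends a tombstone; the data is physically removed on compaction
DELETE FROM demo.users WHERE user_id = 42;

-- Compaction strategy and tombstone grace period are table-level options
ALTER TABLE demo.users
  WITH compaction = {'class': 'SizeTieredCompactionStrategy'}
  AND gc_grace_seconds = 864000;
```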
Cassandra internals
Storage internals: Read path
- Timestamps
- Read from Memtable and SSTables
- Caches
- It's important to have synchronised clocks across hosts. Timestamps are used for read resolution.
- It chooses the latest cells/columns/values depending on the timestamp.
- There are several caches that avoid disk access, making reads extremely fast.
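A minimal sketch of last-write-wins resolution (table and column names are hypothetical): every cell carries a write timestamp, which can be inspected with WRITETIME, and the newest value wins on read.

```sql
-- Two writes to the same primary key; the one with the later timestamp wins
INSERT INTO demo.users (user_id, name) VALUES (42, 'Alice');
INSERT INTO demo.users (user_id, name) VALUES (42, 'Alicia');

-- WRITETIME exposes the timestamp Cassandra uses for resolution
SELECT name, WRITETIME(name) FROM demo.users WHERE user_id = 42;
```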
3. Data modeling
- Without proper data modeling, Cassandra won't be able to deliver the expected performance
- Some of the ideas from the relational world are still valuable
- With this section we'll understand why Cassandra aligns with the previously presented craftsmanship concepts.
Data modeling
Partition
- Partition key
- Clustering key
- Primary key
- Remember that a CF is a collection of partitions. The partition key defines on which node and in which partition of a CF our data is going to live
- Cassandra has the concept of nested data. Rows are clustered around this key. This key also defines the internal order of the rows inside a partition
- The primary key is composed of the previous keys. It ensures uniqueness of data, but through upserts: writing the same key again silently overwrites the previous row.
- A table is a two-dimensional view of a multi-dimensional column family.
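A minimal sketch of how those keys appear in a table definition (table and column names are hypothetical):

```sql
-- Partition key: user_id (placement on the ring). Clustering keys: added_date, video_id (order inside the partition).
CREATE TABLE demo.videos_by_user (
    user_id    uuid,
    added_date timestamp,
    video_id   uuid,
    title      text,
    PRIMARY KEY ((user_id), added_date, video_id)
) WITH CLUSTERING ORDER BY (added_date DESC, video_id ASC);
```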
Data modeling
Anti patterns
- Secondary indexes
- Join tables
- Allow filtering
- We can run equality queries on the partition key and inequality queries on clustering keys. Using secondary indexes on regular columns carries a big performance penalty, as the query needs to visit different nodes to find the data.
- Joins are not even possible in Cassandra. You can do application-side joins, but those are discouraged too.
- Kind of a table scan, so only useful for development purposes
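A sketch of what these anti-patterns look like in CQL, reusing the hypothetical videos_by_user table from above:

```sql
-- Anti-pattern: a secondary index on a regular column fans the query out across nodes
CREATE INDEX ON demo.videos_by_user (title);

-- Anti-pattern: filtering without the partition key forces a scan
SELECT * FROM demo.videos_by_user WHERE title = 'Intro to CQL' ALLOW FILTERING;

-- Preferred: equality on the partition key, optionally a range on a clustering key
SELECT * FROM demo.videos_by_user
 WHERE user_id = 123e4567-e89b-42d3-a456-426614174000
   AND added_date > '2016-01-01';
```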
Data modeling
Conceptual Modeling
- Entity-relationship diagrams
- Chen notation
- Technology agnostic
- Identify entities, relationships, the cardinality between them, and attributes.
- This analysis can be done without Cassandra in mind.
Data modeling
Logical Modeling
- Query based approach
- Chebotko notation
- Mapping rules and patterns
- We need to think in terms of our queries or workloads. Based on that, we can define different tables for different needs
- Different cardinalities usually map to the same logical designs. Uniqueness, ordering and key queries are easy to extract into patterns too.
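For instance (queries and table names are hypothetical), two access patterns over the same comments lead to two tables keyed differently:

```sql
-- Q1: find comments for a given video, newest first
CREATE TABLE demo.comments_by_video (
    video_id     uuid,
    comment_time timestamp,
    user_id      uuid,
    comment      text,
    PRIMARY KEY ((video_id), comment_time, user_id)
) WITH CLUSTERING ORDER BY (comment_time DESC, user_id ASC);

-- Q2: find comments made by a given user, newest first
CREATE TABLE demo.comments_by_user (
    user_id      uuid,
    comment_time timestamp,
    video_id     uuid,
    comment      text,
    PRIMARY KEY ((user_id), comment_time, video_id)
) WITH CLUSTERING ORDER BY (comment_time DESC, video_id ASC);
```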
Data modeling
Physical modeling
- Fleshing out details
- Optimizations
- Providing types for the columns
- Watch out for partitions that get too big
- Watch out for duplication that grows linearly with the data
- Watch out for tables that end up too big or too small
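One common fix when a partition would grow too big is to add an artificial bucket to the partition key; a sketch with hypothetical names:

```sql
-- Splitting a potentially huge per-sensor partition into monthly buckets
CREATE TABLE demo.readings_by_sensor (
    sensor_id  uuid,
    month      text,        -- e.g. '2016-04'; part of the partition key
    reading_ts timestamp,
    value      double,
    PRIMARY KEY ((sensor_id, month), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);
```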
Data modeling
Mindset change
- Data consistency
- Transactions
- Concurrent access
- As we have duplication, we need to keep the data consistent. You need to do it at the application level, as Cassandra doesn't give you that.
- You could use logged batches, but they carry a performance penalty.
- You could use counters or lightweight transactions. But probably the best option is to duplicate the data in different partitions.
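A sketch of the tools mentioned above, reusing the hypothetical tables from earlier sketches (the video_stats counter table is made up as well):

```sql
-- Logged batch: keeps the two denormalised comment tables in sync, at a performance cost
BEGIN BATCH
  INSERT INTO demo.comments_by_video (video_id, comment_time, user_id, comment)
  VALUES (11111111-1111-1111-1111-111111111111, '2016-04-01 10:00:00',
          22222222-2222-2222-2222-222222222222, 'Nice talk');
  INSERT INTO demo.comments_by_user (user_id, comment_time, video_id, comment)
  VALUES (22222222-2222-2222-2222-222222222222, '2016-04-01 10:00:00',
          11111111-1111-1111-1111-111111111111, 'Nice talk');
APPLY BATCH;

-- Lightweight transaction: compare-and-set semantics, paid for with a Paxos round
INSERT INTO demo.users (user_id, name) VALUES (43, 'Bob') IF NOT EXISTS;

-- Counter column: concurrent increments without read-before-write
-- (assumes video_stats has a single counter column named views)
UPDATE demo.video_stats SET views = views + 1
 WHERE video_id = 11111111-1111-1111-1111-111111111111;
```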
Cassandra for craftsmen developers
Created by Felipe Fernández / @felipefzdz