Cassandra for craftsmen developers

Created by Felipe Fernández / @felipefzdz

About Me

Felipe Fernández

Some craftsmanship-related topics

  • BDD
  • Outside-in development
  • Focus on delivery
  • Minimise waste

About the talk

  • 1. Cassandra rationale
  • 2. Cassandra internals
  • 3. Data modeling

1. Cassandra rationale

  • Modern problems: scalability, high availability, locality, low latency
  • Scalability because of internet scale
  • HA because of internet timezones
  • Locality because of the speed of light
  • Low latency because of millennials
Cassandra rationale

RDBMS Approach to modern world challenges

  • Third normal form
  • Favour space over time
  • Stack Overflow to the rescue
  • NF1: A table cell must not contain more than one value.
  • NF2: NF1, plus all non-primary-key columns must depend on all primary-key columns.
  • NF3: NF2, plus non-primary-key columns may not depend on each other.
Cassandra rationale

ACID Properties

  • Atomicity
  • Consistency
  • Isolation
  • Durability
  • ACID properties apply to a single operation or to a transaction
  • Those properties are a safety net, but they are not free
  • They are really hard to implement in a distributed system.
Cassandra rationale

Relational Data Modeling

  • Normalisation
  • Referential integrity
  • Strong uniqueness
  • Indexes
  • Joins
  • Aggregations
  • Our tables have a strong connection with our entity/relationship model
  • The application can be more relaxed about checking referential integrity, as the DB will fail explicitly
  • The primary key serves the purpose of uniqueness; again, the DB will fail explicitly
  • Indexes give us flexibility when we design our tables, as we can add them whenever we want
  • If the data is not contained in a single table, we need to join tables, creating locks and likely contention over those resources
  • There are several aggregation functions, such as min, max, count, sum...
Cassandra rationale

RDBMS Problems

  • Scale up
  • Sharding
  • Availability
  • Locality
  • The usual RDBMS approach is scaling up: buying more resources for a single machine. That's expensive and hits a limit quite soon
  • A database shard is a horizontal partition of data. Sharding makes the system tolerant to partitioning
  • If you stick with a single point of failure, it is not possible to achieve HA. Distributed ACID is really hard to achieve.
  • Again, if you have a single instance, you can't colocate data in different regions
Cassandra rationale

Cassandra to the rescue

  • Distributed peer to peer architecture
  • Scale out
  • Low latency
  • CAP theorem
  • No master/slave, so no single point of failure.
  • Commodity servers, perfect for cloud computing
  • Denormalisation and tables based on queries. Time-over-space trade-off
  • It gives up consistency in favour of partition tolerance and availability. It provides tunable consistency, though (see the sketch below).
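
A minimal cqlsh sketch of tunable consistency (the killrvideo keyspace and videos_by_user table are hypothetical examples, not from the talk):

    -- cqlsh: pick the consistency level per session or per query
    CONSISTENCY QUORUM;  -- the coordinator waits for a majority of replicas

    SELECT title
    FROM killrvideo.videos_by_user
    WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

Dropping to CONSISTENCY ONE trades safety for latency; raising it to ALL does the opposite.
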
Cassandra rationale

Right tool for the right job

  • Big data
  • Higher storage costs
  • Change mindset
  • If your data fits in a single machine, it is not big data. Avoid the overkill and complications of distributed systems if possible
  • Storage is getting cheaper and cheaper, but it's important to note that we're trading space for time; there is no such thing as a free lunch
  • You can't simply plug Cassandra into your old RDBMS-based system. A massive refactoring is needed.

2. Cassandra internals

  • Based on Google BigTable and Amazon Dynamo.
  • Used by Netflix and Spotify.
  • An open-source project backed by DataStax.
Cassandra internals

Distribution internals

  • Data is partitioned around the ring
  • Coordinator node
  • Replication factor
  • Consistency level
  • Every partition has a key; using a hash algorithm, we can get the location of the node where that data will live.
  • Clients talk to a coordinator node, which is randomly assigned every time. The coordinator applies the hash function.
  • RF is set per keyspace (see the sketch below). We need several replicas to ensure high availability even when some node is down.
  • CL is set per query/command. The coordinator will wait until the required number of nodes ack the request. It can be ONE, QUORUM, ALL...
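
A minimal CQL sketch of setting RF at the keyspace level (the killrvideo name is a hypothetical example; SimpleStrategy is only suitable for single-data-center setups):

    -- Every partition of this keyspace is stored on 3 replicas
    CREATE KEYSPACE killrvideo
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
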
Cassandra internals

Distribution internals

  • Column family
  • Keyspace
  • Node
  • Rack
  • Data center
  • Cluster
  • Or table. Cassandra has some confusing names around tables, keys and so on. In the end, a collection of partitions.
  • A collection of CFs. RF is set at this level. A schema or namespace.
  • Could be a host. Every node owns a range of the partition ring, so lookups can be extremely fast.
  • Availability zones in AWS. Your replication strategy should be aware of this and not place replicas on the same rack; otherwise it is pointless.
  • Could be one in the USA and another one in Ireland. RF works like a mirror, and there are CLs that address this scenario (see the sketch below).
  • The whole Cassandra system.
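
A minimal CQL sketch of rack- and data-center-aware replication (the data center names are hypothetical and must match the names reported by the cluster's snitch):

    -- NetworkTopologyStrategy sets RF per data center, and tries to
    -- avoid placing two replicas on the same rack
    CREATE KEYSPACE killrvideo
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'us-east': 3,
                          'eu-west': 3};
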
Cassandra internals

Storage internals: Write path

  • Commit log
  • Memtable
  • SSTable
  • Compaction
  • Tombstones
  • An append-only structure. It serves durability in case of a crash; kind of a backup of the memtable
  • The second write. One per CF. It acts as a first cache in front of the SSTables
  • SSTables are written when memtables are full and get flushed to disk
  • Merging SSTables makes reads faster and removes old and deleted data (see the sketch below)
  • Tombstones are markers for deletion; the actual removal happens on compaction
  • No writes or updates in place. Append-only immutable structures allow sequential disk writes, or in-memory ones. Extremely fast
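
A minimal CQL sketch of the table-level knobs behind this (the table name is a hypothetical example; SizeTieredCompactionStrategy and 864000 seconds, i.e. 10 days, are in fact Cassandra's defaults):

    -- Choose how SSTables are merged, and how long tombstones survive
    ALTER TABLE killrvideo.videos_by_user
      WITH compaction = {'class': 'SizeTieredCompactionStrategy'}
      AND gc_grace_seconds = 864000;  -- grace period before tombstones are dropped
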
Cassandra internals

Storage internals: Read path

  • Timestamps
  • Read from Memtable and SSTables
  • Caches
  • It's important to have synchronised clocks across hosts. Timestamps are used for read resolution.
  • It chooses the latest cells/columns/values depending on the timestamp (see the sketch below).
  • There are a lot of caches that avoid going to disk, making reads extremely fast.
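
A minimal CQL sketch of write timestamps and last-write-wins resolution (keyspace, table and values are hypothetical examples):

    -- Every write carries a timestamp; you can even supply your own
    INSERT INTO killrvideo.videos_by_user (user_id, added_date, video_id, title)
    VALUES (123e4567-e89b-12d3-a456-426614174000,
            '2016-01-01 00:00:00+0000',
            50554d6e-29bb-11e5-b345-feff819cdc9f,
            'Intro to CQL')
    USING TIMESTAMP 1451606400000000;  -- microseconds since the epoch

    -- WRITETIME exposes the timestamp Cassandra stored for a cell;
    -- on reads, the cell with the highest timestamp wins
    SELECT title, WRITETIME(title)
    FROM killrvideo.videos_by_user
    WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
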

3. Data modeling

  • Without proper data modeling, Cassandra won't be able to deliver the expected performance
  • Some of the ideas from the relational world are still valuable
  • In this section we'll understand why Cassandra aligns with the previously presented craftsmanship concepts.
Data modeling

Partition

  • Partition key
  • Clustering key
  • Primary key
  • Remember that a CF is a collection of partitions. The partition key defines on which node, and in which unit of a CF, our data is going to live
  • Cassandra has the concept of nested data. Rows are clustered by this key, which also defines their internal order inside a partition
  • The primary key is composed of the previous keys. It ensures uniqueness of data, but through upserts (see the sketch below).
  • A table is a two-dimensional view of a multi-dimensional column family.
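
A minimal CQL sketch of the three kinds of key (the killrvideo.videos_by_user table is a hypothetical example used throughout these notes):

    CREATE TABLE killrvideo.videos_by_user (
        user_id    uuid,       -- partition key: picks the node and the partition
        added_date timestamp,  -- clustering key: defines row order inside the partition
        video_id   timeuuid,   -- clustering key: breaks ties and completes uniqueness
        title      text,       -- regular column
        PRIMARY KEY ((user_id), added_date, video_id)
    ) WITH CLUSTERING ORDER BY (added_date DESC, video_id ASC);
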
Data modeling

Anti patterns

  • Secondary indexes
  • Join tables
  • Allow filtering
  • We can query by equality on the partition key and by inequality on clustering keys. Using secondary indexes on regular columns carries a big performance penalty, as Cassandra needs to visit different nodes to find the data.
  • Joins are not even possible in Cassandra. You can do application-side joins, but those are discouraged too.
  • A kind of table scan, so only useful for development purposes (both anti-patterns are sketched below)
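
A minimal CQL sketch of both anti-patterns, using the hypothetical table from before:

    -- Anti-pattern: a secondary index on a regular column fans the query
    -- out across many nodes
    CREATE INDEX videos_by_user_title_idx ON killrvideo.videos_by_user (title);

    -- Anti-pattern: ALLOW FILTERING turns the query into a scan
    SELECT *
    FROM killrvideo.videos_by_user
    WHERE title = 'Intro to CQL'
    ALLOW FILTERING;
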
Data modeling

Conceptual Modeling

  • Entity-relationship diagrams
  • Chen notation
  • Technology agnostic
  • Set up entities, relationships, cardinality between them and attributes.
  • This analysis can be done disregarding Cassandra.
Data modeling

Logical Modeling

  • Query based approach
  • Chebotko notation
  • Mapping rules and patterns
  • We need to think about our queries or workloads. Based on that, we can define different tables for different needs (see the sketch below)
  • Different cardinalities usually lead to the same logical designs. Uniqueness, ordering and key queries are easy to extract into patterns too.
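
A minimal CQL sketch of the query-based approach: one table per query, duplicating the data (table names are hypothetical examples):

    -- Q1: "videos added by a given user" → videos_by_user, shown earlier
    -- Q2: "videos carrying a given tag" → a second table over the same data
    CREATE TABLE killrvideo.videos_by_tag (
        tag      text,
        video_id timeuuid,
        title    text,
        PRIMARY KEY ((tag), video_id)
    );
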
Data modeling

Physical modeling

  • Fleshing out details
  • Optimizations
  • Providing types for the columns
  • Partition too big (a bucketing fix is sketched below)
  • Duplication in linear growth
  • Table too big or too small
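
A minimal CQL sketch of one common optimization: when a partition grows too big, split it by adding a bucket to the partition key (names and the bucketing scheme are hypothetical examples):

    CREATE TABLE killrvideo.videos_by_tag_bucketed (
        tag      text,
        bucket   int,       -- e.g. year of upload, or hash(video_id) % N
        video_id timeuuid,
        title    text,
        PRIMARY KEY ((tag, bucket), video_id)  -- composite partition key
    );
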
Data modeling

Mindset change

  • Data consistency
  • Transactions
  • Concurrent access
  • As we have duplication, we need to keep the data consistent. You need to do it at the application level, as Cassandra doesn't give you that.
  • You could use logged batches, but they carry a performance penalty.
  • You could use counters or lightweight transactions. But probably the best option is to duplicate the data in different partitions (batches and lightweight transactions are sketched below).
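
A minimal CQL sketch of both options, reusing the hypothetical tables from earlier:

    -- Logged batch: the writes apply as a unit, at a coordinator-side cost
    BEGIN BATCH
      INSERT INTO killrvideo.videos_by_user (user_id, added_date, video_id, title)
      VALUES (123e4567-e89b-12d3-a456-426614174000,
              '2016-01-01 00:00:00+0000',
              50554d6e-29bb-11e5-b345-feff819cdc9f,
              'Intro to CQL');
      INSERT INTO killrvideo.videos_by_tag (tag, video_id, title)
      VALUES ('cql', 50554d6e-29bb-11e5-b345-feff819cdc9f, 'Intro to CQL');
    APPLY BATCH;

    -- Lightweight transaction: a compare-and-set guarded by a Paxos round
    -- (the users table is another hypothetical example)
    INSERT INTO killrvideo.users (user_id, email)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'felipe@example.com')
    IF NOT EXISTS;
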

Thank you

Any questions?

Cassandra for craftsmen developers Created by Felipe Fernández / @felipefzdz