class: center, middle

# Some Observations About Cassandra

---

# Many data stores are, at heart, key-value

What distinguishes one KVS from another is

- how structured the value is
- what the data access paths are
- whether the KVS was intended to be vertically or horizontally scalable

---

# Completely unstructured values:

The values are binary blobs; the data store totally doesn't care.

- memcached (vertically scalable)
- HBase (horizontally scalable)

---

# Data structures as values:

- Redis

The values are scalars, lists, maps, etc.; Redis provides commands to manipulate the contents of the values.

Vertically scalable.

---

# Tuples/rows as values:

- PostgreSQL

Vertically scalable.

The values are tuples, though the SQL standard calls them rows.

Every time you use an index, the value stored in the index is a TID (tuple ID), which, in turn, is the key to a tuple in a table. You can even look at them!

```
select ctid, * from users;
```

However...

---

# Aside: Why not scan all the values?

## (Hint: sloooooow)

SQL shows that sometimes ignoring the keys and scanning all the values is a valid approach,

- but only for small data sets, or
- for queries where the answer doesn't have to be in real time.

For horizontally-scalable scanning, see Map/Reduce. (Still only suitable for batch, not real-time performance.)

---

# Cassandra!

Horizontally scalable.

Sometimes described as column-oriented.

Early versions allowed mismatched columns:

- every row in the same keyspace could have different columns!

```
key1  colA colB
      valA valB

key2  colA colB colC
      valA valB valC

key3  colX colY colZ
      valX valY valZ
```

---

Now, with CQL, Cassandra enforces the same columns inside a table:

```
key1  colA colB colC
      valA valB valC

key2  colA colB colC
      valA valB valC
```

Though null columns simply aren't recorded:

```
key1  colA colB colC
      valA valB valC

key2  colA      colC
      valA      valC
```

Adding a column to an existing table doesn't re-write existing values:

```
key1  colA colB
      valA valB

key2  colA colB newcol
      valA valB newval
```

---

### On-disk layout and data access paths are key to Cassandra:

```
create table emails (
    user_id int,
    sent timestamp,
    subj text,
    body text,
    primary key (user_id, sent));
```

- user_id is the partition key. user_id1 and user_id2 could be on different nodes.
- sent is the clustering key; it, and the remaining columns, repeat on disk, in order of the clustering key.

```
user_id1  sent1:subj sent1:body sent2:subj sent2:body
          val1       val1       val2       val2

user_id2  sent1:subj sent1:body sent2:subj sent2:body
          val1       val1       val2       val2
```

---

# Key Insight:

Cassandra demands that you think really hard about your data access paths.

(Because all but the primary data access path will either suck or be impossible.)

---

## This is not to be cruel to Cassandra

Cassandra simply makes the tradeoffs of horizontal scaling clear.

Partition/shard any data store over multiple nodes, and guess what? One data access path becomes the "first class citizen", and all the others essentially suck.

(Try sharding Redis or PostgreSQL: same deal.)

So you have to plan for this.

(It also makes you realize how often you access your data in more than one way.)

---

# Cassandra is not an RDBMS Part 1

## You must know your primary data access path

---

# PostgreSQL

```
create table users (
    id bigint constraint users_pk primary key not null,
    first_name text not null,
    last_name text not null);
```

"Gee, I wish lookups on users.last_name were fast."

```
create index users_last_name_idx on users(last_name);
```

:-)

The advantage of a dataset that can be vertically scaled:

- multiple data access paths that are equally performant
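---

# PostgreSQL (continued)

And you can keep doing that. A minimal sketch (the extra index and the queries below are mine, not part of the example above): every new access path is just another index on the same copy of the data.

```
-- hypothetical second access path: lookups by first_name
create index users_first_name_idx on users(first_name);

-- both access paths can now be served by an index
select * from users where last_name = 'Smith';
select * from users where first_name = 'Sue';
```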
---

### Cassandra

```
create table users (
    id int primary key, -- this lookup will be fast
    first_name text,
    last_name text);

insert into users (id, first_name, last_name) values (1, 'Manni', 'Wood');
insert into users (id, first_name, last_name) values (2, 'Bob', 'Smith');
insert into users (id, first_name, last_name) values (3, 'Sue', 'Johnson');
```

I can do a table scan, right?

```
select * from users where last_name = 'Smith';

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot
execute this query as it might involve data filtering and thus may have
unpredictable performance. If you want to execute this query despite the
performance unpredictability, use ALLOW FILTERING"
```

Only if I ask nicely. Also: really bad idea.

```
select * from users where last_name = 'Smith' allow filtering;

 id | first_name | last_name
----+------------+-----------
  2 |        Bob |     Smith
```

---

### Cassandra (continued)

"Gee, I wish lookups on users.last_name were fast."

```
create index users_last_name_idx on users(last_name);
```

Wow, Cassandra lets us create indexes on data that are spread around the cluster!

Of course, there's guidance on [when *not* to use an index](https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html#concept_ds_sgh_yzz_zj__when-no-index)

(Hint: never use indexes.)

---

### In Cassandra (and any other distributed KVS), lookup by key is king.

Not even ranges are acceptable:

```
select * from users where id > 0 and id < 4;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Only EQ
and IN relation are supported on the partition key (unless you use the
token() function)"
```

Lists of keys work, though (bad idea: they bounce around the cluster!):

```
select * from users where id in (1, 2, 3);

 id | first_name | last_name
----+------------+-----------
  1 |      Manni |      Wood
  2 |        Bob |     Smith
  3 |        Sue |   Johnson
```

A single key is the optimal, highly-encouraged use case:

```
select * from users where id = 3;

 id | first_name | last_name
----+------------+-----------
  3 |        Sue |   Johnson
```

---

### Ranges on clustering keys are OK:

```
create table emails (
    user_id int,
    sent int,
    subj text,
    primary key (user_id, sent));

insert into emails (user_id, sent, subj) values (1, 1, 'Hello');
insert into emails (user_id, sent, subj) values (1, 2, 'Hello2');
insert into emails (user_id, sent, subj) values (1, 3, 'Hello3');
insert into emails (user_id, sent, subj) values (2, 1, 'Hello');
insert into emails (user_id, sent, subj) values (2, 2, 'Hello2');
insert into emails (user_id, sent, subj) values (2, 3, 'Hello3');

select * from emails where user_id > 0 and user_id < 3;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Only EQ
and IN relation are supported on the partition key (unless you use the
token() function)"

select * from emails where user_id = 2 and sent > 0 and sent < 4;

 user_id | sent | subj
---------+------+--------
       2 |    1 |  Hello
       2 |    2 | Hello2
       2 |    3 | Hello3
```
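---

### So what do you do instead of an index?

The usual Cassandra answer: write the data once per access path. A minimal sketch, assuming the users table from earlier (the users_by_last_name table and its layout are mine, not from the slides above):

```
-- a second table whose partition key IS the other access path
create table users_by_last_name (
    last_name text,
    id int,
    first_name text,
    primary key (last_name, id));

-- the application writes to both tables...
insert into users              (id, first_name, last_name) values (2, 'Bob', 'Smith');
insert into users_by_last_name (last_name, id, first_name) values ('Smith', 2, 'Bob');

-- ...so the last_name lookup is also just a key lookup
select * from users_by_last_name where last_name = 'Smith';
```

Which is the "think really hard about your data access paths" tax, paid up front at write time.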
---

# Cassandra is not an RDBMS Part 2

## The DB where rows decompose and consistency is eventual

Or: How to invent data with Cassandra

First, set up a three-node cluster.

---

## Node 1

Take down the whole cluster. Start just node 1.

Insert a row at a given key, using consistency level ONE:

```
cqlsh:foo> consistency one;
Consistency level set to ONE.
cqlsh:foo> insert into users (id, first_name, last_name) values (3, 'AAA', 'ZZZ') using timestamp 1449548965888760; -- <--- useful
```

(Being able to force a specific timestamp is really useful for debugging / understanding certain scenarios!)

---

## Node 2

Take down the whole cluster. Start just node 2.

Insert a user with a different first name and last name to the same key, using the same timestamp:

```
cqlsh:foo> consistency one;
Consistency level set to ONE.
cqlsh:foo> insert into users (id, first_name, last_name) values (3, 'BBB', 'YYY') using timestamp 1449548965888760; -- same time as node1
```

---

## Node 3

Take down the whole cluster. Start just node 3.

Insert a user with yet another first name and last name to the same key, using the same timestamp:

```
cqlsh:foo> consistency one;
Consistency level set to ONE.
cqlsh:foo> insert into users (id, first_name, last_name) values (3, 'CCC', 'XXX') using timestamp 1449548965888760; -- same time as node[12]
```

---

## Resolution

Our three conflicting users are

```
AAA ZZZ
BBB YYY
CCC XXX
```

Bring up all three nodes.

```
cqlsh:foo> consistency;
Current consistency level is ONE.
cqlsh:foo> select * from users where id = 3;

 id | first_name | last_name
----+------------+-----------
  3 |        CCC |       ZZZ
```

When timestamps tie exactly, Cassandra resolves the conflict by taking the lexicographically larger value *for each column*.

Counter-intuitive: you would expect an entire row to win.

---

# A few other observations

- Cassandra is an eventually consistent data store, with tunable consistency.
  In the prior example, asking for a higher consistency level would have refused the insert, because only one node was up to acknowledge it.
- Cassandra avoids a single-source-of-truth bottleneck... except for the time, which needs to be agreed upon by every node in the cluster.

---

# Cassandra is not an RDBMS Part 3

## Default durability settings favor throughput over durability

---

# Basic durability test for any data store!

- Set your hard drive to write through (not write back):

```
hdparm -W0 /dev/sda
```

- Create a simple table to hold integers (a sketch of this table appears a few slides ahead).
- Create a client to write increasing integers to the table.
- kill -9 your database.
- Start your database back up and see if all confirmed writes still exist.

---

# PostgreSQL - The Gold Standard

The client is inserting, and then we kill -9 the database!

```
...
inserted 1880
inserted 1881
inserted 1882
Did not insert 1883
```

We power up the database and check the last value written:

```
select max(i) from t;

 max
------
 1882
```

Yay, durability! (Gonna pay a throughput price for that, though.)

---

# Cassandra, default durability settings

The client is inserting, and then we kill -9 the entire cluster!

```
...
inserted 3096
inserted 3097
inserted 3098
Did not insert 3099
```

We power up the cluster and check the last value written:

```
select max(i) from t;

 system.max(i)
---------------
          1825
```

Yay, throughput? (Gonna pay a durability price for that...)

---

# Cassandra, strict durability settings

```
$ cd ${HOME}/apache-cassandra-3.0.0/conf
$ vim cassandra.yaml
```

Turn the defaults...

```
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
#commitlog_sync: batch
#commitlog_sync_batch_window_in_ms: 2
```

...into safer settings:

```
#commitlog_sync: periodic
#commitlog_sync_period_in_ms: 10000
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
```
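---

### Aside: the durability-test table

For concreteness, a sketch of the schema behind the `max(i)` checks (only `select max(i) from t` appears in these slides, so the exact DDL is an assumption; the client is just a loop that prints "inserted N" after each acknowledged write):

```
-- assumed schema for the durability test
create table t (
    i int primary key);

-- the client issues these one at a time, waiting for each acknowledgement
insert into t (i) values (1);
insert into t (i) values (2);
-- ...and so on, until the kill -9
```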
---

# Cassandra, strict durability settings (cont'd)

The client is inserting, and then we kill -9 the entire cluster!

```
...
inserted 1048
inserted 1049
inserted 1050
Did not insert 1051
```

We power up the cluster and check the last value written:

```
select max(i) from t;

 system.max(i)
---------------
          1050
```

Yay, durability! (Gonna pay a throughput price for that, though.)

---

# Cassandra, strict durability settings (cont'd)

Being multi-node means Cassandra has a more complicated durability story than a single-node RDBMS.

You can take more risks and get higher write throughput without giving up much real-world durability.

---

# Fun resources:

- Kyle Kingsbury's "Call Me Maybe" (Jepsen) tests of distributed data stores
  https://aphyr.com/tags/jepsen
- "How do we consume and make sense of all this data?" with Don Haderle
  https://www.youtube.com/watch?v=H6v47sE6MvE
  Fun to watch someone who worked on IBM's DB2 (relational) talk about the longevity and importance of KVS.
- Designing Data-Intensive Applications, Martin Kleppmann, O'Reilly Early Release (no dead tree version yet)
- CockroachDB: SQL over a distributed KVS
  https://github.com/cockroachdb/cockroach/blob/master/docs/design.md
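---

### Appendix: durability knobs also live at the keyspace level

A sketch of mine, not from the test runs above: replication factor and commit-log use are set per keyspace, which is another place the multi-node durability/throughput trade shows up.

```
-- every write is sent to 3 replicas, but durable_writes = false skips the
-- commit log on each node: more throughput, less per-node crash safety
create keyspace risky
    with replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    and durable_writes = false;
```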