class: center, middle

# Some Observations About Cassandra

---

# Many data stores are, at heart, key-value

What distinguishes one KVS from another is

- how structured the value is
- what the data access paths are
- whether the KVS was intended to be vertically or horizontally scalable

---

# Completely unstructured values:

The values are binary blobs; the data store totally doesn't care.

- memcached (vertically scalable)
- HBase (horizontally scalable)

---

# Data structures as values:

- Redis

The values are scalars, lists, maps, etc.; Redis provides commands to manipulate the contents of the values.

Vertically scalable.

---

# Tuples/rows as values:

- PostgreSQL

Vertically scalable.

The values are tuples, though the SQL standard calls them rows.

Every time you use an index, the value stored in the index is a TID (tuple ID), which, in turn, is the key to a tuple in a table. You can even look at them!

```
select ctid, * from users;
```

However...

---

# Aside: Why not scan all the values?

## (Hint: sloooooow)

SQL shows that sometimes ignoring the keys and scanning all the values is a valid approach,

- but only for small data sets, or
- for queries where the answer doesn't have to be in real time.

For horizontally-scalable scanning, see Map/Reduce. (Still only suitable for batch, not real-time performance.)

---

# Cassandra!

Horizontally scalable.

Sometimes described as column-oriented.

Early versions allowed mismatched columns:

- every row in the same keyspace could have different columns!

```
key1  colA colB
      valA valB

key2  colA colB colC
      valA valB valC

key3  colX colY colZ
      valX valY valZ
```

---

Now, with CQL, Cassandra enforces the same columns inside a table:

```
key1  colA colB colC
      valA valB valC

key2  colA colB colC
      valA valB valC
```

Though null columns simply aren't recorded:

```
key1  colA colB colC
      valA valB valC

key2  colA      colC
      valA      valC
```

Adding a column to an existing table doesn't re-write existing values:

```
key1  colA colB
      valA valB

key2  colA colB newcol
      valA valB newval
```

---

### On-disk layout and data access paths are key to Cassandra:

```
create table emails (
    user_id int,
    sent timestamp,
    subj text,
    body text,
    primary key (user_id, sent));
```

- user_id is the partition key. user_id1 and user_id2 could be on different nodes.
- sent is the clustering key; it, and the remaining columns, repeat on disk, in order of the clustering key.

```
user_id1  sent1:subj sent1:body sent2:subj sent2:body
          val1       val1       val2       val2

user_id2  sent1:subj sent1:body sent2:subj sent2:body
          val1       val1       val2       val2
```

---

# Key Insight:

Cassandra demands that you think really hard about your data access paths.

(Because all but the primary data access path will either suck or be impossible.)

---

## This is not to be cruel to Cassandra

Cassandra simply makes the tradeoffs of horizontal scaling clear.

Partition/shard any data store over multiple nodes, and guess what? One data access path becomes the "first class citizen", and all the others essentially suck.

(Try sharding Redis or PostgreSQL: same deal.)

So you have to plan for this.

(It also makes you realize how often you access your data in more than one way.)

---

# Cassandra is not an RDBMS Part 1

## You must know your primary data access path

---

# PostgreSQL

```
create table users (
    id bigint constraint users_pk primary key not null,
    first_name text not null,
    last_name text not null);
```

"Gee, I wish lookups on users.last_name were fast."

```
create index users_last_name_idx on users(last_name);
```

:-)

The advantage of a dataset that can be vertically scaled:

- multiple data access paths that are equally performant
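---

# PostgreSQL (continued)

And you can keep doing that. A minimal sketch (the extra index and the queries below are mine, not part of the example above): every new access path is just another index on the same copy of the data.

```
-- hypothetical second access path: lookups by first_name
create index users_first_name_idx on users(first_name);

-- both access paths can now be served by an index
select * from users where last_name = 'Smith';
select * from users where first_name = 'Sue';
```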
---

### Cassandra

```
create table users (
    id int primary key, -- this lookup will be fast
    first_name text,
    last_name text);

insert into users (id, first_name, last_name) values (1, 'Manni', 'Wood');
insert into users (id, first_name, last_name) values (2, 'Bob', 'Smith');
insert into users (id, first_name, last_name) values (3, 'Sue', 'Johnson');
```

I can do a table scan, right?

```
select * from users where last_name = 'Smith';

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot
execute this query as it might involve data filtering and thus may have
unpredictable performance. If you want to execute this query despite the
performance unpredictability, use ALLOW FILTERING"
```

Only if I ask nicely. Also: really bad idea.

```
select * from users where last_name = 'Smith' allow filtering;

 id | first_name | last_name
----+------------+-----------
  2 |        Bob |     Smith
```

---

### Cassandra (continued)

"Gee, I wish lookups on users.last_name were fast."

```
create index users_last_name_idx on users(last_name);
```

Wow, Cassandra lets us create indexes on data that are spread around the cluster!

Of course, there's guidance on [when *not* to use an index](https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html#concept_ds_sgh_yzz_zj__when-no-index)

(Hint: never use indexes.)

---

### In Cassandra (and any other distributed KVS), lookup by key is king.

Not even ranges are acceptable:

```
select * from users where id > 0 and id < 4;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Only EQ
and IN relation are supported on the partition key (unless you use the
token() function)"
```

Lists of keys work, though (bad idea: they bounce around the cluster!):

```
select * from users where id in (1, 2, 3);

 id | first_name | last_name
----+------------+-----------
  1 |      Manni |      Wood
  2 |        Bob |     Smith
  3 |        Sue |   Johnson
```

A single key is the optimal, highly-encouraged use case:

```
select * from users where id = 3;

 id | first_name | last_name
----+------------+-----------
  3 |        Sue |   Johnson
```

---

### Ranges on clustering keys are OK:

```
create table emails (
    user_id int,
    sent int,
    subj text,
    primary key (user_id, sent));

insert into emails (user_id, sent, subj) values (1, 1, 'Hello');
insert into emails (user_id, sent, subj) values (1, 2, 'Hello2');
insert into emails (user_id, sent, subj) values (1, 3, 'Hello3');
insert into emails (user_id, sent, subj) values (2, 1, 'Hello');
insert into emails (user_id, sent, subj) values (2, 2, 'Hello2');
insert into emails (user_id, sent, subj) values (2, 3, 'Hello3');

select * from emails where user_id > 0 and user_id < 3;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Only EQ
and IN relation are supported on the partition key (unless you use the
token() function)"

select * from emails where user_id = 2 and sent > 0 and sent < 4;

 user_id | sent | subj
---------+------+--------
       2 |    1 |  Hello
       2 |    2 | Hello2
       2 |    3 | Hello3
```
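---

### So what do you do instead of an index?

The usual Cassandra answer: write the data once per access path. A minimal sketch, assuming the users table from earlier (the users_by_last_name table and its layout are mine, not from the slides above):

```
-- a second table whose partition key IS the other access path
create table users_by_last_name (
    last_name text,
    id int,
    first_name text,
    primary key (last_name, id));

-- the application writes to both tables...
insert into users              (id, first_name, last_name) values (2, 'Bob', 'Smith');
insert into users_by_last_name (last_name, id, first_name) values ('Smith', 2, 'Bob');

-- ...so the last_name lookup is also just a key lookup
select * from users_by_last_name where last_name = 'Smith';
```

Which is the "think really hard about your data access paths" tax, paid up front at write time.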
---

# Cassandra is not an RDBMS Part 2

## The DB where rows decompose and consistency is eventual

Or: How to invent data with Cassandra

First, set up a three-node cluster.

---

## Node 1

Take down the whole cluster. Start just node 1.

Insert a row at a given key, using consistency level ONE:

```
cqlsh:foo> consistency one;
Consistency level set to ONE.
cqlsh:foo> insert into users (id, first_name, last_name) values (3, 'AAA', 'ZZZ') using timestamp 1449548965888760; -- <--- useful
```

(Being able to force a specific timestamp is really useful for debugging / understanding certain scenarios!)

---

## Node 2

Take down the whole cluster. Start just node 2.

Insert a user with a different first name and last name to the same key, using the same timestamp:

```
cqlsh:foo> consistency one;
Consistency level set to ONE.
cqlsh:foo> insert into users (id, first_name, last_name) values (3, 'BBB', 'YYY') using timestamp 1449548965888760; -- same time as node1
```

---

## Node 3

Take down the whole cluster. Start just node 3.

Insert a user with yet another first name and last name to the same key, using the same timestamp:

```
cqlsh:foo> consistency one;
Consistency level set to ONE.
cqlsh:foo> insert into users (id, first_name, last_name) values (3, 'CCC', 'XXX') using timestamp 1449548965888760; -- same time as node[12]
```

---

## Resolution

Our three conflicting users are

```
AAA ZZZ
BBB YYY
CCC XXX
```

Bring up all three nodes.

```
cqlsh:foo> consistency;
Current consistency level is ONE.
cqlsh:foo> select * from users where id = 3;

 id | first_name | last_name
----+------------+-----------
  3 |        CCC |       ZZZ
```

When timestamps tie exactly, Cassandra resolves the conflict by taking the lexicographically larger value *for each column*.

Counter-intuitive: you would expect an entire row to win.

---

# A few other observations

- Cassandra is an eventually consistent data store, with tunable consistency.
  In the prior example, asking for a higher consistency level would have refused the insert, because only one node was up to acknowledge it.
- Cassandra avoids a single-source-of-truth bottleneck... except for the time, which needs to be agreed upon by every node in the cluster.

---

# Cassandra is not an RDBMS Part 3

## Default durability settings favor throughput over durability

---

# Basic durability test for any data store!

- Set your hard drive to write through (not write back):

```
hdparm -W0 /dev/sda
```

- Create a simple table to hold integers (a sketch of this table appears a few slides ahead).
- Create a client to write increasing integers to the table.
- kill -9 your database.
- Start your database back up and see if all confirmed writes still exist.

---

# PostgreSQL - The Gold Standard

The client is inserting, and then we kill -9 the database!

```
...
inserted 1880
inserted 1881
inserted 1882
Did not insert 1883
```

We power up the database and check the last value written:

```
select max(i) from t;

 max
------
 1882
```

Yay, durability! (Gonna pay a throughput price for that, though.)

---

# Cassandra, default durability settings

The client is inserting, and then we kill -9 the entire cluster!

```
...
inserted 3096
inserted 3097
inserted 3098
Did not insert 3099
```

We power up the cluster and check the last value written:

```
select max(i) from t;

 system.max(i)
---------------
          1825
```

Yay, throughput? (Gonna pay a durability price for that...)

---

# Cassandra, strict durability settings

```
$ cd ${HOME}/apache-cassandra-3.0.0/conf
$ vim cassandra.yaml
```

Turn the defaults...

```
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
#commitlog_sync: batch
#commitlog_sync_batch_window_in_ms: 2
```

...into safer settings:

```
#commitlog_sync: periodic
#commitlog_sync_period_in_ms: 10000
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
```
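---

### Aside: the durability-test table

For concreteness, a sketch of the schema behind the `max(i)` checks (only `select max(i) from t` appears in these slides, so the exact DDL is an assumption; the client is just a loop that prints "inserted N" after each acknowledged write):

```
-- assumed schema for the durability test
create table t (
    i int primary key);

-- the client issues these one at a time, waiting for each acknowledgement
insert into t (i) values (1);
insert into t (i) values (2);
-- ...and so on, until the kill -9
```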
---

# Cassandra, strict durability settings (cont'd)

The client is inserting, and then we kill -9 the entire cluster!

```
...
inserted 1048
inserted 1049
inserted 1050
Did not insert 1051
```

We power up the cluster and check the last value written:

```
select max(i) from t;

 system.max(i)
---------------
          1050
```

Yay, durability! (Gonna pay a throughput price for that, though.)

---

# Cassandra, strict durability settings (cont'd)

Being multi-node means Cassandra has a more complicated durability story than a single-node RDBMS.

You can take more risks and get higher write throughput without giving up much real-world durability.

---

# Fun resources:

- Kyle Kingsbury's "Call Me Maybe" (Jepsen) tests of distributed data stores
  https://aphyr.com/tags/jepsen
- "How do we consume and make sense of all this data?" with Don Haderle
  https://www.youtube.com/watch?v=H6v47sE6MvE
  Fun to watch someone who worked on IBM's DB2 (relational) talk about the longevity and importance of KVS.
- Designing Data-Intensive Applications, Martin Kleppmann, O'Reilly Early Release (no dead tree version yet)
- CockroachDB: SQL over a distributed KVS
  https://github.com/cockroachdb/cockroach/blob/master/docs/design.md
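---

### Appendix: durability knobs also live at the keyspace level

A sketch of mine, not from the test runs above: replication factor and commit-log use are set per keyspace, which is another place the multi-node durability/throughput trade shows up.

```
-- every write is sent to 3 replicas, but durable_writes = false skips the
-- commit log on each node: more throughput, less per-node crash safety
create keyspace risky
    with replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    and durable_writes = false;
```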