class: center, middle # Some Observations about PostgreSQL --- # INTRODUCTION # What is the Right Way to use PostgreSQL? Surprisingly hard to pin down, because RDBMSs are general purpose. --- # Bad-ish use cases: - Graph data* - Caching layer data† - Completely unstructured data‡ - Big Data (more than 10 TB on a single node as of 2016)§ .footnote[†though see "Our Journey from Graph Databases to PostgreSQL" http://engineering.hipolabs.com/graphdb-to-postgresql/] .footnote[†though I did manage to make pg run in RAM, on a lark: http://manniwood.com/2015_11_15/postgresql_94_in_ram.html] .footnote[‡JSONB] .footnote[§CitusDB] --- # Good-ish use cases: Data that need to be - correct - able to answer many different questions (that is, dosn't optimize answering one question over all others) --- # A Commonly Held Belief: Shared Global Mutable State is Bad # Definition of a Database: Shared Global Mutable State # Problematic --- # PART A: THE VALUE PROPOSITION Why the Relational Data Model? Why SQL? Why PostgreSQL? # What if there was a datastore that 1. only allowed correct data to be input, and 2. could only transition from one completely valid state to another? This could make shared global mutable state a lot easier to manage and reason about. --- ### 1. Only correct data may be input How NOT to leverage the power of SQL: ``` create table taxpayers ( first_name text, last_name text, sex text); insert into taxpayers default values; select * from taxpayers; first_name | last_name | sex ------------+-----------+-------- [NULL] | [NULL] | [NULL] ``` Garbage data insertion... successful! --- ### 1. Only correct data may be input One way to leverage the power of SQL: ``` create table taxpayers ( first_name text not null, last_name text not null, sex text constraint valid_sex check (sex in ('M', 'F')) not null); comment on table taxpayers is 'People or entities that pay taxes'; insert into taxpayers default values; ERROR: null value in column "first_name" violates not-null constraint insert into taxpayers (first_name, last_name) values ('Fred', 'Flintstone'); ERROR: null value in column "sex" violates not-null constraint insert into taxpayers (first_name, last_name, sex) values ('Fred', 'Flintstone', 'Y'); ERROR: new row for relation "taxpayers" violates check constraint "valid_sex" insert into taxpayers (first_name, last_name, sex) values ('Fred', 'Flintstone', 'm'); ERROR: new row for relation "taxpayers" violates check constraint "valid_sex" insert into taxpayers (first_name, last_name, sex) values ('Fred', 'Flintstone', 'M'); INSERT 0 1 ``` --- ### 1. Only correct data may be input Another way to leverage the power of SQL: ![Joe Celko's Data Measurements and Standards in SQL](joe_celko_data_measurements_and_standards_small.png) --- ### 1. Only correct data may be input Another way to leverage the power of SQL: ``` create table sexes ( code int constraint sexes_pk primary key constraint valid_iso_iec_5218_code check (code in (0, 1, 2, 9)) not null, description text not null); comment on table sexes is 'ISO/IEC 5218 codes for representation of human sexes'; insert into sexes (code, description) values (0, 'not known'), -- 0 is nicer than null; more on that later (1, 'male'), (2, 'female'), (9, 'not applicable'); -- corporations are people too ``` --- ### 1. Only correct data may be input Another way to leverage the power of SQL: ``` create table taxpayers ( ssn text constraint taxpayers_pk primary key constraint valid_ssn check (ssn ~ E'\\d{3}-\\d{2}-\\d{4}') not null, -- ^^^ regexps; nice! ^^^ first_name text not null, last_name text not null, sex_code int constraint valid_sex_code references sexes(code) not null); -- ^^^^^^^^ foreign keys; nice! ^^^^^^^^ insert into taxpayers (ssn, first_name, last_name, sex_code) values ('123-45-6789', 'Fred', 'Flintstone', 1); INSERT 0 1 insert into taxpayers (ssn, first_name, last_name, sex_code) values ('123-45-6789', 'Barney', 'Rubble', 1); ERROR: duplicate key value violates unique constraint "taxpayers_pk" DETAIL: Key (ssn)=(123-45-6789) already exists. insert into taxpayers (ssn, first_name, last_name, sex_code) values ('123-45-6780', 'Weyland-Yutani', '', 9); INSERT 0 1 ``` --- ### 1. Only correct data may be input Wow, it's like we have to pass a unit test just to put our data in this database. :-( ... Wow, it's like we have to pass a unit test just to put our data in this database! :-) --- ### 1. Only correct data may be input (fun detour: Singleton) ``` create table configuration ( id int constraint configuraration_pk primary key constraint "there can be only one" check (id = 1) not null, host text not null, port int constraint "port numbers cannot be negative" check (port > 0) not null); -- ^^ human constraint names ^^ yay! insert into configuration (id, host, port) values (1, 'localhost', 8080); INSERT 0 1 insert into configuration (id, host, port) values (1, 'localhost', 8080); ERROR: 23505: duplicate key value violates unique constraint "configuraration_pk" DETAIL: Key (id)=(1) already exists. insert into configuration (id, host, port) values (2, 'localhost', 9090); ERROR: 23514: new row for relation "configuration" violates check constraint "there can be only one" DETAIL: Failing row contains (2, localhost, 9090). update configuration set port = -8080; ERROR: 23514: new row for relation "configuration" violates check constraint "port numbers cannot be negative" DETAIL: Failing row contains (1, localhost, -8080). ``` --- ### 1. Only correct data may be input (nulls) A few thoughts on representing empty data. This guy's chapter on representing empty data is a good read. ![Joe Celko's SQL for Smarties, 5th ed](joe_celko_sql_for_smarties_5th_ed.jpeg) --- ### 1. Only correct data may be input (nulls) This guy thinks putting null in the relational data model was a mistake. ![Database in Depth](c_j_date_database_in_depth.png) --- ### 1. Only correct data may be input (nulls) ``` create table quarterly_results ( year int constraint "year must be unique" primary key not null, q1 int, q2 int, q3 int, q4 int); insert into quarterly_results (year, q1, q2, q3, q4) values (2013, 1, null, null, 2), (2014, 2, 3, 4, 5), (2015, 3, null, null, 2), (2016, 4, 4, 4, 2); ``` --- ### POP QUIZ! ``` year q1 q2 q3 q4 2013 1 [NULL] [NULL] 2 2014 2 3 4 5 2015 3 [NULL] [NULL] 2 2016 4 4 4 2 ``` Annual results for 2013: what is the total? ``` select q1 + q2 + q3 + q4 as annual_results from quarterly_results where year = 2013; ``` All second quarters summed: what is the total? ``` select sum(q2) as all_q2 from quarterly_results; ``` --- ### POP QUIZ! (ANSWERS) ``` year q1 q2 q3 q4 2013 1 [NULL] [NULL] 2 2014 2 3 4 5 2015 3 [NULL] [NULL] 2 2016 4 4 4 2 ``` Annual results for 2013: what is the total? ``` select q1 + q2 + q3 + q4 as annual_results from quarterly_results where year = 2013; NULL ``` All second quarters summed: what is the total? ``` select sum(q2) as all_q2 from quarterly_results; 7 ``` --- ### 1. Only correct data may be input (nulls) SQL null is not other languages' null. #### Ruby: (What are you doing?) ``` irb(main):001:0> 1 + nil TypeError: nil can't be coerced into Fixnum from (irb):2:in `+' from (irb):2 from /usr/bin/irb:11:in `
' ``` #### SQL: (Sure, no problem.) ``` select 1 + null; null ``` SQL has three-valued logic. Try to avoid it. --- ### 1. Only correct data may be input (nulls) Many data types already have a zero value. Number-ish types have zero ``` create table quarterly_results ( year int constraint "year must be unique" primary key not null, q1 int default 0 not null, q2 int default 0 not null, q3 int default 0 not null, q4 int default 0 not null); insert into quarterly_results (year, q1, q2, q3, q4) values (2013, 1, default, default, 2), (2014, 2, 3, 4, 5), (2015, 3, default, default, 2), (2016, 4, 4, 4, 2); ``` --- ### 1. Only correct data may be input (nulls) Many data types already have a zero value. Text-ish types have the empty string ``` create table addresses ( addr1 text not null, -- '' still allowed, but discouraged addr2 text not null default '', -- rarely used, so default to empty string state text not null); insert into addresses (addr1, state) values ('10 Somewhere Street', 'MA'); select addr1 || E'\n' || addr2 || E'\n' || state as addr from addresses; addr ------------------- 10 Somewhere Street MA ``` On the other hand... ``` select '10 A Street ' || null || 'MA'; null ``` --- ### 1. Only correct data may be input (nulls) Temporal types do not have a natural zero value. :-( They do have special values 'infinity' and '-infinity'. Too bad there's no special value 'never'. Sentry value or null? Pick your poison. --- # 2. Only valid states are allowed (Yeah, we're still on the value proposition part...) --- ### 2. Only valid states are allowed Transactions make it super-easy to reason about state! ``` create table accounts( num int constraint account_number primary key not null, owner text not null, type text constraint account_type check(type in ('chequing', 'savings')) not null, balance int not null default 0); insert into accounts (num, owner, type, balance) values (1, 'Mary', 'chequing', 20); insert into accounts (num, owner, type, balance) values (2, 'Mary', 'savings' , 20); MARY: \set AUTOCOMMIT off MARY: update accounts set balance = balance - 10 where num = 1; BOB: select * from accounts; BOB: num owner type balance BOB: --- ----- -------- ------- BOB: 1 Mary chequing 20 BOB: 2 Mary savings 20 MARY: update accounts set balance = balance + 10 where num = 2; MARY: commit; ``` --- ### 2. Only valid states are allowed One fun thing about transactions `now()` is always the same time during a transaction ``` \set AUTOCOMMIT off select now(); now ----------------------------- 2016-05-08 15:52:22.199741-04 ``` Wait 5 seconds ``` select now(); now ----------------------------- 2016-05-08 15:52:22.199741-04 ``` --- ### 2. Only valid states are allowed Another fun thing about transactions You can find out the current transaction id number ``` select txid_current(); txid_current ------------ 682 ``` --- ### 2. Only valid states are allowed A cool under-the-hood thing about transactions You can find out what transactions tuples are visible in (unless, of course, they are invisible to you) ``` select xmin, xmax, * from accounts; xmin xmax num owner type balance ---- ---- --- ----- -------- ------- 680 0 1 Mary chequing 10 680 0 2 Mary savings 30 ^ ^ | | | +-- id of deleting transaction (0 if not deleted) | +-- id of inserting/updating transaction ``` --- # PART B: A FEW (SOMETIMES DEPRESSING) THINGS I'VE NOTICED ABOUT HOW SQL IS USED IN THE REAL WORLD --- ## On the Realization that a Row is Kind of Like an Object Let's use the RDBMS as an Object Store! - Table definitions are like classes! - Rows are like objects of that class, in suspended animation! - I don't have to fit all of my objects in RAM! --- ### On the Realization that a Row is Kind of Like an Object #### Logging in a user, object-style 1. `select * from users where username = #{a_username}` 2. If not found, tell user login failed 3. If found, de-serialize into user object (`theUser`) 4. `if theUser.password != a_password`, tell user login failed 5. If we got this far, user is logged in! --- ### On the Realization that a Row is Kind of Like an Object 1. `select * from users where username = #{a_username}` A lot of data go over the wire when we select every column for a user yet we only need one column (`password`). How many columns does our user table have? Is there a better way? --- ### On the Realization that a Row is Kind of Like an Object Is there a better way? ``` select count(*) as matched from users where username = 'foo' and password = 'bar'; matched ------- 1 select count(*) as matched from users where username = 'foo' and password = 'ZZZ'; matched ------- 0 ``` One column takes fewer bytes going over the wire. Can we do even better? --- ### On the Realization that a Row is Kind of Like an Object Can we do even better? ``` select exists( select count(*) from users where username = 'foo' and password = 'bar' ) as succeeded; succeeded --------- t select exists( select count(*) from users where username = 'foo' and password = 'ZZZ' ) as succeeded; succeeded --------- f ``` We can reduce the amount of data travelling over the wire, *and* we can return the appropriate data type: a boolean answer to a boolean question. --- ### On the Realization that a Row is Kind of Like an Object How about counting the number of users? 1. All users as objects in RAM - `users.size` 2. All users in the database - don't deserialize all the users to objects and count them - `select count(*) from users` - returns one number over the wire As the Big Data people say: don't bring the data to the calculation; bring the calculation to the data. (RDBMS peole have been saying this before the Big Data people, but whatever.) --- ### On the Realization that a Row is Kind of Like an Object Reports All quarterly results as objects in RAM - iterate over them, calculating totals as you go - send off to rendering for pretty alternating row colors, bold totals, etc. Could fetch all results over the wire and do the same thing --- ### On the Realization that a Row is Kind of Like an Object Reports All quarterly results in database: ``` select 0 as srt, year, q1, q2, q3, q4, (q1 + q2 + q3 + q4) as annual from quarterly_results union all select 1 as srt, 0 as year, sum(q1) as q1, sum(q2) as q2, sum(q3) as q3, sum(q4) as q4, sum(q1 + q2 + q3 + q4) as annual from quarterly_results order by srt, year; srt year q1 q2 q3 q4 annual --- ---- -- -- -- -- ------ 0 2013 1 0 0 2 3 0 2014 2 3 4 5 14 0 2015 3 0 0 2 5 0 2016 4 4 4 2 14 1 0 10 7 8 11 36 ``` Send off to rendering for pretty alternating row colors, bold totals, etc. Shear off srt column. --- ### On the Realization that a Row is Kind of Like an Object #### Key insight: the RDBMS is not an Object Store; it's an Answer Engine. Once your data are - correctly represented - protected from incorrect/garbage entries ...you can use a general query language to test and verify assertions about your data. You can ask questions you didn't originally have about your data, and you will have a high probability of getting CORRECT answers. The answer can be - a scalar - a tuple - a table --- ### On the Realization that a Row is Kind of Like an Object #### Key insight: the RDBMS is not an Object Store; it's an Answer Engine. More fun reading! Ted Neward: The Vietnam of Computer Science http://blogs.tedneward.com/post/the-vietnam-of-computer-science/ Martin Fowler: OrmHate http://martinfowler.com/bliki/OrmHate.html Jeff Atwood's take: http://blog.codinghorror.com/object-relational-mapping-is-the-vietnam-of-computer-science/ Stack Overflow: http://stackoverflow.com/questions/404083/is-orm-still-the-vietnam-of-computer-science --- ## A Few Observations on Security ![Little Bobby Tables](http://imgs.xkcd.com/comics/exploits_of_a_mom.png) XKCD by Randall Munroe, xkcd.com Yes, sanitizing your inputs is always smart. However... #### Why exactly are we connecting to the database as such a privileged user? SQL has a very rich permissions system which goes largely ignored. --- ## A Few Observations on Security Let's pretend we have a table for gathering user suggestions: ``` create table suggestions ( id bigint constraint suggestions_pk primary key not null, suggestion text not null, created timestamp with time zone not null default now()); ``` Let's also assume that we have compromised our connection and can execute arbitrary SQL on the database! --- ## A Few Observations on Security Let's also assume that we have compromised our connection and can execute arbitrary SQL on the database! Deleting one tables' worth of data is so uncreative. Let's nuke the whole datatase! ``` drop database postgres; ERROR: 42501: must be owner of database postgres ``` Let's drop the entire public schema! ``` drop schema public cascade; ERROR: 42501: must be owner of schema public ``` Let's drop the suggestions table! ``` drop table suggestions; ERROR: 42501: must be owner of relation suggestions ``` Let's truncate the suggestions table! ``` truncate table suggestions; ERROR: 42501: permission denied for relation suggestions ``` --- ## A Few Observations on Security Let's delete from the suggestions table with no where clause! ``` delete from suggestions; ERROR: 42501: permission denied for relation suggestions ``` Let's update every suggestion so that it was created in the future, and so it won't render correctly! ``` update suggestions set created = 'infinity'; ERROR: 42501: permission denied for relation suggestions ``` Fine, I give up, let's just insert a bunch of nasty comments. ``` insert into suggestions (suggestion) values ('I hate you.'), ('I hate you.'), ('Really, you have no idea how much I hate you'), ('This isn''nt even fun anymore'); ERROR: 42501: permission denied for relation suggestions ``` --- ## A Few Observations on Security Let's read every suggestion and create a denial of service attack! ``` select * from suggestions; ERROR: 42501: permission denied for relation suggestions ``` Can I even insert suggestions at all with this database user? ``` select insert_suggestion('I give up'); insert_suggestion (1 row) ``` ### >:-( --- ## A Few Observations on Security ### The principle of least access: only let the user do what needs to be done, and no more ### (Or: Why the ____ are we letting users connect to our database as root?) --- ## A Few Observations on Security Let's put suggestions in their own schema, and make a `suggester` user who, when connecting to the database, will be "placed" in that schema via `set search_path` on connection creation. ``` create schema suggestions; set search_path = suggestions, pg_catalog; create sequence suggestions_seq; create table suggestions ( id bigint constraint suggestions_pk primary key not null, suggestion text not null, created timestamp with time zone not null default now()); create role suggester login password 'suggester'; ``` --- ## A Few Observations on Security Let's revoke all privileges from user `suggester`, even in the `suggestions` schema, save for one thing: the ability to run the insert_suggestion stored procedure. ``` create or replace function insert_suggestion (a_suggestion text) returns void as $$ begin insert into suggestions.suggestions ( id, suggestion) values ( nextval('suggestions_seq'), a_suggestion); end $$ language plpgsql security definer; -- runs as the user who created the function revoke all privileges on schema public from suggester; grant usage on schema suggestions to suggester; grant execute on function suggestions.insert_suggestion(text) to suggester; ``` --- ## A Few Observations on Security Not only were we able to restrict the `suggester` user to only inserting into `suggestions`; through a stored procedure, we were even able to restrict the *way* the `suggester` user inserted into `suggestions`. --- ## The future of PostgreSQL Continues to adapt to new use patterns and storage types (JSONB). Might need a better horizontal scaling story, moving forward. --- ## The future of SQL "Many companies have tried killing SQL, and all of them have failed." https://www.linkedin.com/pulse/next-generation-analytics-apocalypse-when-spark-drill-john-de-goes SQL exists outside of any one implementation. Even though SQL has traditionally been implemented in vertically scaled solutions, nothing in SQL precludes it from being implemented in a horizontally scalable way. --- ## The future of the relational data model Even when SQL dies, the relational data model will likely live on because it has mathematical/logical underpinnings, and is such a useful way of looking at data. "Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks", E. F. Codd --- # Questions? ![Ava](ava.jpg)