
SQL Conventions


Data layer

Application layer

  • If your API mainly does data persistence, PostgREST is the way to go; implement only the missing parts in another process. You can then compose both APIs behind a reverse proxy.

  • Otherwise, use a data-mapping library (e.g. doobie), not an ORM.

Queries

  • Don’t use BETWEEN (why)

  • Prefer = to LIKE

LIKE compares characters and can be paired with wildcard operators like %, whereas the = operator compares strings and numbers for exact matches. The = operator can take advantage of indexed columns. (source)


  • Prefer EXISTS to IN

If you just need to verify the existence of a value in a table, prefer EXISTS to IN, as the EXISTS process exits as soon as it finds the search value, whereas IN will scan the entire table. IN should be used for finding values in lists. Similarly, prefer NOT EXISTS to NOT IN. (source)

DDL - Data Definition Language

  • SET search_path=pg_catalog to force yourself to explicitly specify schema names in every object declaration (besides triggers). This lowers the bug count and gives developers a better understanding, because explicit > implicit.
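
A minimal sketch of the effect (the app schema and team table are hypothetical):

SET search_path = pg_catalog;

-- select * from team;       -- would fail: relation "team" does not exist
select * from app.team;      -- OK, the schema is explicit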

Tables/Views

Table/Views name

  • singular (e.g. team not teams) (Here is why)

  • snake_case (e.g. block_theme_version)

  • double underscore for n-n tables (e.g. user__organization)
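
A minimal sketch of these naming rules (the app schema and the tables are hypothetical; column conventions are detailed in the next section):

create table app.team (
    team__id uuid primary key default gen_random_uuid()
);

create table app.organization (
    organization__id uuid primary key default gen_random_uuid()
);

-- n-n table: singular table names joined by a double underscore
create table app.team__organization (
    team__id         uuid not null references app.team (team__id) on delete cascade on update cascade,
    organization__id uuid not null references app.organization (organization__id) on delete cascade on update cascade,
    primary key (team__id, organization__id)
);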

Columns

  • snake_case (for example: created_at. Not createdAt or CreatedAt), because in PostgreSQL keywords and unquoted identifiers are case-insensitive, which is the source of many mistakes.

  • double underscore for PK and FK columns (e.g. (PK) user__id, (FK) organization__id)

    • why?

      • leverage using(column__id)

      • it is easier to grasp the table-name part of a PK/FK (the part before __) in snake_case columns

      • quoted column names are case-sensitive in PostgreSQL while unquoted SQL identifiers are case-insensitive

  • NOT NULL by default, NULL is the exception (think of it as the Maybe monad)

  • No abbreviation unless it’s both well-known and very long like i18n

  • No reserved keywords (Complete list)

  • Use UUID as PK and FK (Here is why), (do not use serial) rely on gen_random_uuid() (benchmark)

  • Note that when you use Postgres native UUID v4 type instead of bigserial, table size grows by 25% and insert rate drops to 25%.

  • If you choose bigserial then distinguish internal and external ids (e.g. gitlab's internal schema design uses the column name "iid" for ids that are publicly shared with the end user). Don’t forget to add an index: CREATE UNIQUE INDEX index_issues_on_project_id_and_iid ON public.issues USING btree (project_id, iid);

  • Use text or citext (variable unlimited length) with check constraint instead of varchar(n) or char(n).

  • text type with CHECK constraint allows you to evolve the schema easily compared to character varying or varchar(n) when you have length checks. (source)
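
A minimal sketch, assuming a hypothetical app.customer table:

create table app.customer (
    customer__id uuid primary key default gen_random_uuid(),
    email        text not null check (length(email) <= 320)
);

-- the length rule can evolve later without changing the column type:
-- alter table app.customer drop constraint customer_email_check;
-- alter table app.customer add constraint customer_email_check check (length(email) <= 500);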

Date time management

  • Use timestamptz everywhere you need to store a date (e.g. created_at TIMESTAMPTZ DEFAULT now() (Here is why)) and leverage the BRIN index on it (see the sketch after this list)

  • updated_at TIMESTAMPTZ DEFAULT now() unless you plan to leverage (learn more) event-sourcing

  • deleted_at TIMESTAMPTZ DEFAULT NULL:

    • unless you plan to leverage event-sourcing

    • don’t forget to filter on deleted_at in your queries so soft-deleted rows are excluded

  • Comment each column: explain your rationale and your decisions, in plain English

  • Boolean columns must start with either is or has.

  • Don’t use char(n) even for fixed-length identifiers
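
A minimal sketch combining these rules (the app.member table is hypothetical):

create table app.member (
    member__id uuid        primary key default gen_random_uuid(),
    is_active  boolean     not null default true,
    created_at timestamptz not null default now(),
    updated_at timestamptz not null default now(),
    deleted_at timestamptz default null
);

comment on column app.member.deleted_at is 'Soft-delete marker: a non-null value means the row must be ignored by regular queries';

-- BRIN index: small and efficient on append-mostly, time-ordered columns
create index member_created_at_idx on app.member using brin (created_at);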

Constraints

General rule is: {tablename}_{columnname(s)}_{suffix} (e.g. table_name_column_name_a_pk) where the suffix is one of the following (a full example follows at the end of this section):

  • Primary Key constraint: pk

  • Foreign key: fk

  • Unique constraint: key

  • Check constraint: chk

  • Exclusion constraint: exl

  • Any other kind of index: idx

PK - Primary Key

  • {table_name}_{column_name}_pk in case of a single column PK

  • {table_name}_{column_name1}_{column_name2}_{column_name3}_pk in case of multiple columns as primary key (column_name1, column_name2, column_name3)

FK - Foreign key

  • {from_table_name}_{from_column_name}_{to_table_name}_{to_column_name}__fk

  • Always specify ON DELETE and ON UPDATE behaviour, in order to force yourself to think about the consequences for references

Unique

  • {from_table_name}_{from_column_name}_key in case of a single column unique constraint

  • {from_table_name}_{from_column_name1}_{from_column_name2}_{from_column_name3}__key in case of multiple columns as unique (column_name1, column_name2, column_name3)
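
A minimal sketch of explicitly named constraints following this convention (app.invoice is hypothetical; app.organization comes from the earlier sketch):

create table app.invoice (
    invoice__id      uuid not null,
    organization__id uuid not null,
    reference        text not null,

    constraint invoice_invoice__id_pk primary key (invoice__id),
    constraint invoice_organization__id_organization_organization__id__fk
        foreign key (organization__id)
        references app.organization (organization__id)
        on delete restrict on update cascade,
    constraint invoice_reference_key unique (reference)
);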

Functions

Name

There are 3 types of functions: notify functions, private functions and public functions.

  • notify, format: notify_[schema_name]_[table_name]_[event] (e.g. notify_authentication_user_created(user_id)): should only format the notification message and send it with pg_notify. Beware of the 8000-byte payload limit; only send metadata (ids), data should be fetched by workers through the API. If you really wish to send data then pg_kafka might be a better alternative.

  • private, format: _[_function_name_] (e.g. _reset_failed_login): must never be exposed through the public schema. Used mainly for consistency and business-rules

  • public, format [_function_name_] (e.g. log_in(email, password)): must be exposed through the public schema.

Parameters

Every parameter name must end with $. This will prevent any "Reference to XXX is ambiguous" issue.

Example
create function lib_fsm.transition_create(
  from_state__id$ uuid,
  event$ varchar(30),
  to_state__id$ uuid,
  description$ text default null
)

Types

Enum types

Don’t use enums: you will have issues over time because you cannot remove elements from an enum. If your enum represents various states, leverage a state machine, using a library like lib_fsm.
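
If you do not use a state-machine library, the dedicated-table alternative recommended later in this document looks roughly like this (app.order_status and app.customer_order are hypothetical):

create table app.order_status (
    order_status__id text primary key   -- e.g. 'pending', 'paid', 'cancelled'
);

create table app.customer_order (
    customer_order__id uuid primary key default gen_random_uuid(),
    order_status__id   text not null references app.order_status (order_status__id) on delete restrict on update cascade
);

-- removing a value later is a plain DELETE (after migrating the rows that use it),
-- something ALTER TYPE cannot do for enum values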

Boolean

Always use true and false, without single-quote.

PostgreSQL documentation says that TRUE and FALSE should be preferred because they are more SQL compliant but hey, LET’S STOP YELLING WHEN WE WRITE SQL SHALL WE?

String

  • Multi-line string must be represented with $$my string$$
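
For example, dollar-quoting avoids escaping quotes and preserves line breaks:

select $$a string
that spans
several lines, with 'quotes' left untouched$$ as message;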

JSONB

metadata key-value pairs must be limited in space; you can use a trigger for that:

CREATE OR REPLACE FUNCTION validate_metadata()
RETURNS TRIGGER AS $$
DECLARE
    key TEXT;
    value TEXT;
    keys INT;
BEGIN
    keys := 0;

    FOR key, value IN (SELECT * FROM jsonb_each_text(NEW.metadata))
    LOOP
        keys := keys + 1;

        IF length(key::text) > 40 OR length(value::text) > 500 THEN
            RAISE 'Key and value must be at most 40 and 500 characters long respectively.';
        END IF;

        IF keys > 50 THEN
            RAISE 'A maximum of 50 keys are allowed in the metadata.';
        END IF;
    END LOOP;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER validate_metadata_trigger
BEFORE INSERT OR UPDATE ON your_table
FOR EACH ROW EXECUTE FUNCTION validate_metadata();

Triggers

Name

(translation in progress)

Columns

  • use BCNF (stricter than 3NF) (cf. normal forms)

  • leverage using, so instead of:

select <fields> from
  table_1
  inner join table_2
    on table_1.table_1_id =
       table_2.table_1_id

use:

select <fields> from
  table_1
  inner join table_2
    using (table_1_id)
  • don’t use PostgreSQL enums: you will have issues when you need to remove some values over time. Use a dedicated table instead (see the sketch under Enum types above).

  • use the right PostgreSQL types:

inet (IP address)
timestamp with time zone
point (2D point)
tstzrange (time range)
interval (duration)
  • constraints should live inside your database as much as possible:

create table reservation(
    reservation_id uuid primary key,
    dates tstzrange not null,
    -- the exclusion constraint rejects two reservations with overlapping date ranges
    exclude using gist (dates with &&)
);
  • use row-level security to control read/update/delete access on each table's rows (see the sketch below)

(source)
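
A minimal row-level security sketch, assuming the hypothetical app.invoice table above and an application-defined app.current_organization_id setting:

alter table app.invoice enable row level security;

-- each organization only sees (and writes) its own invoices
create policy invoice_isolation on app.invoice
    using (organization__id = current_setting('app.current_organization_id')::uuid);

-- the application sets the value per connection or transaction, e.g.:
-- set app.current_organization_id = '00000000-0000-0000-0000-000000000000';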

Policies

Name

todo.

SQL Formatter

docker run --rm --network=none guriandoro/sqlparse:0.3.1 "SELECT several, columns from a_table as a join another_table as b where a.id = 1;"

Configuration

statement_timeout

Since we do want to limit everything in space and time, configure statement_timeout per role to let your database abort any statement that takes more than the specified amount of time (in ms).

-- Limit in time SQL queries => improve overall reliability
-- https://www.postgresql.org/docs/current/runtime-config-client.html
-- PostgreSQL WILL ABORT any statement that takes more than the specified amount of time (in milliseconds)
-- If you do have an issue with that, please first (from first to last):
--  - .. check that your query is relying on indices (did you use EXPLAIN (ANALYZE, BUFFERS) ?)
--  - .. consider materialized views
--  - .. ensure pg cache settings are OK
--  - .. ensure the disk is SSD and fast enough
--  - .. ensure the server has enough CPU & RAM
--  - .. check if it's for analytics purposes; if so, querying a postgres replica might be a better idea
-- When all these above points were evaluated *then* we can all talk about increasing the values below :)
alter role APP_ROLE_THAT_DOES_THE_QUERY set statement_timeout to '250ms';

Things to monitor

Your cache hit ratio tells you how often your data is served from in memory vs. having to go to disk. Serving from memory vs. going to disk will be orders of magnitude faster, thus the more you can keep in memory the better. Of course you could provision an instance with as much memory as you have data, but you don’t necessarily have to. Instead watching your cache hit ratio and ensuring it is at 99% is a good metric for proper performance. (Source)

SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM
  pg_statio_user_tables;

Under the covers Postgres is essentially a giant append only log. When you write data it appends to the log, when you update data it marks the old record as invalid and writes a new one, when you delete data it just marks it invalid. Later Postgres comes through and vacuums those dead records (also known as tuples). All those unvacuumed dead tuples are what is known as bloat. Bloat can slow down other writes and create other issues. Paying attention to your bloat and when it is getting out of hand can be key for tuning vacuum on your database. (Source)

WITH constants AS (
  SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 4 AS ma
), bloat_info AS (
  SELECT
    ma,bs,schemaname,tablename,
    (datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
    (maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
  FROM (
    SELECT
      schemaname, tablename, hdr, ma, bs,
      SUM((1-null_frac)*avg_width) AS datawidth,
      MAX(null_frac) AS maxfracsum,
      hdr+(
        SELECT 1+count(*)/8
        FROM pg_stats s2
        WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
      ) AS nullhdr
    FROM pg_stats s, constants
    GROUP BY 1,2,3,4,5
  ) AS foo
), table_bloat AS (
  SELECT
    schemaname, tablename, cc.relpages, bs,
    CEIL((cc.reltuples*((datahdr+ma-
      (CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta
  FROM bloat_info
  JOIN pg_class cc ON cc.relname = bloat_info.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
), index_bloat AS (
  SELECT
    schemaname, tablename, bs,
    COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
    COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols
  FROM bloat_info
  JOIN pg_class cc ON cc.relname = bloat_info.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
  JOIN pg_index i ON indrelid = cc.oid
  JOIN pg_class c2 ON c2.oid = i.indexrelid
)
SELECT
  type, schemaname, object_name, bloat, pg_size_pretty(raw_waste) as waste
FROM
(SELECT
  'table' as type,
  schemaname,
  tablename as object_name,
  ROUND(CASE WHEN otta=0 THEN 0.0 ELSE table_bloat.relpages/otta::numeric END,1) AS bloat,
  CASE WHEN relpages < otta THEN '0' ELSE (bs*(table_bloat.relpages-otta)::bigint)::bigint END AS raw_waste
FROM
  table_bloat
    UNION
SELECT
  'index' as type,
  schemaname,
  tablename || '::' || iname as object_name,
  ROUND(CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages/iotta::numeric END,1) AS bloat,
  CASE WHEN ipages < iotta THEN '0' ELSE (bs*(ipages-iotta))::bigint END AS raw_waste
FROM
  index_bloat) bloat_summary
ORDER BY raw_waste DESC, bloat DESC

Postgres makes it simple to query for unused indexes, so you can easily give yourself back some performance by removing them (Source)

SELECT
            schemaname || '.' || relname AS table,
            indexrelname AS index,
            pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
            idx_scan as index_scans
FROM pg_stat_user_indexes ui
         JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique AND idx_scan < 50 AND pg_relation_size(relid) > 5 * 8192
ORDER BY pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST,
         pg_relation_size(i.indexrelid) DESC;

pg_stat_statements is useful for monitoring your database query performance. It records a lot of valuable stats about which queries are run, how fast they return, how many times they're run, etc. Checking in on this set of queries regularly can tell you where it is best to add indexes or optimize your application so your query calls may not be so excessive. (Source)

SELECT query,
       calls,
       total_time,
       total_time / calls as time_per,
       stddev_time,
       rows,
       rows / calls as rows_per,
       100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
WHERE query not similar to '%pg_%'
and calls > 500
--ORDER BY calls
--ORDER BY total_time
order by time_per
--ORDER BY rows_per
DESC LIMIT 20;

Schema design

Managed PostgreSQL Databases

  • Google Cloud PostgreSQL

    • Pros

    • Cons

      • No support for plv8

      • Any features that require superuser privileges are not supported

      • postgres role is not a superuser

        • Can create roles

        • Can not select from tables that are restricted by default like pg_shadow

        • Thus can not edit pg_catalog.pg_class (in order to change row level security activation for example)

        • Can read from all necessary tables other than pg_authid

  • Scaleway Managed PostgreSQL:

    • Pros

      • multi-schema support

      • configuration options are editable

      • user/role management is self-service

    • Cons

      • /

  • OVH Cloud SQL

    • Pros

      • /

    • Cons

      • no multi-schema support
