Youtube: It's Time for Operating Systems to Rediscover Hardware

EBS Snapshot Pitfalls: Does your backup withstand reality?

Does your disaster recovery plan deliver what it promises? Here are three reasons why your plan won’t stand up to reality. Learn about common pitfalls when backing up EC2 instances with the help of EBS snapshots.

EBS Snapshot Pitfalls

A crash-consistent snapshot leads to data corruption

AWS describes EBS snapshots as crash-consistent. What does that mean? Imagine that a machine suddenly breaks. In what condition is the data on the hard disk? We don’t know whether the application and operating system wrote a consistent state to disk before the interruption. In other words: a crash-consistent snapshot is worthless for disaster recovery. You might restore corrupt or inconsistent data.

EBS Snapshot Pitfalls #1

Why is that? All EBS cares about are data blocks; it reads and writes ones and zeros. The operating system running on EC2 is responsible for persisting data on behalf of the application.

Be aware that even official tools like AWS Backup do not solve this problem, and neither do the countless third-party providers.

Solution: Before creating a snapshot, tell the application and operating system to write a consistent state to disk. On Linux, for example, use AWS Systems Manager Automation to halt the application and ask the operating system to flush caches before creating an EBS snapshot. Running on Windows? Check out Create a VSS application-consistent snapshot.

A restored EBS volume requires initialization

Restoring a volume based on an EBS snapshot typically takes only a few seconds. However, the data is not immediately available. Instead, EBS restores the data asynchronously in the background, and you will notice high latencies during that period.

EBS Snapshot Pitfalls #2

If latency matters to your system, you should initialize a restored volume before ramping up traffic. To do so, make sure to read all blocks from the volume once. The following command does the trick on Linux. See Initialize Amazon EBS volumes for more detailed information.

fio --filename=/dev/xvda --rw=read --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --name=ebs-restore

Depending on the volume size, the volume type, and the EC2 instance type, initializing the volume might take a while. For example, it will take about 23 minutes to initialize an EBS volume of type gp3 with 500 GB connected to an EC2 instance of type m5a.large. So make sure to add the initialization phase when planning the Recovery Time Objective (RTO). The bottleneck is the maximum throughput from EC2 instance to EBS volume of 360 MB/s. See Amazon EBS–optimized instances for more details.

It should also be mentioned that AWS offers a feature called Amazon EBS fast snapshot restore. By enabling fast restore, it is no longer necessary to initialize a restored volume as described above. The volume is ready for latency-critical workloads from the start. However, the pricing for fast snapshot restore makes clear that this feature is not intended for this use case: $500 per snapshot and availability zone.

Solution: Consider the time it takes to initialize a recovered volume when planning for disaster recovery. While creating an EBS volume based on a snapshot usually takes only a few seconds, it is necessary to read all blocks of the restored volume to ensure low-latency access.

An SLA is missing for restoring EBS volumes

When someone else is operating our infrastructure, we need to rely on SLAs to ensure the provider complies with our requirements. Therefore, most AWS services come with a well-defined SLA.

Unfortunately, the SLA for EC2 and EBS is very unspecific. AWS does not specify any objectives when it comes to restoring EBS snapshots. Therefore, it isn’t easy to evaluate to what extent the system can be relied upon. For example, what will happen during a significant outage when many customers decide to restore machines from snapshots to recover machines in another availability zone or region?

Solution: There is no technical solution to this problem. Talk to your AWS representative about this and ask them to be more specific about their SLA.

Summary

To ensure that you can recover all data in an emergency, there are a few stumbling blocks to avoid.

  1. By default, creating an EBS snapshot results in a crash-consistent backup. Restoring the snapshot might lead to corrupt or inconsistent data. Make sure to halt the application and flush caches before creating a snapshot to avoid that.
  2. EBS restores data from snapshots asynchronously. Therefore, you should initialize the volume by reading all blocks before ramping up your workload.
  3. Unfortunately, AWS does not publish any information about the extent to which we can rely on the recovery process of EBS snapshots. Especially if you want to plan for significant outages on AWS, this is very unsatisfactory. AWS needs to improve here.

By the way, are you interested in an example of how to use AWS Systems Manager Automation to create application-consistent snapshots on Linux? Please let me know! I'm considering writing a blog post about that.

Lesser Known PostgreSQL Features

In 2006 Microsoft conducted a customer survey to find out what new features users wanted in new versions of Microsoft Office. To their surprise, more than 90% of what users asked for already existed; they just didn't know about it. To address the "discoverability" issue, they came up with the "Ribbon UI" that we know from Microsoft Office products today.

Office is not unique in this sense. Most of us are not aware of all the features in the tools we use on a daily basis, especially if they are as big and extensive as PostgreSQL. With PostgreSQL 14 released just a few weeks ago, what better opportunity to shed light on some lesser known features that already exist in PostgreSQL but that you may not know about.

In this article I present lesser known features of PostgreSQL.

Illustration by Eleanor Wright

Get the Number of Updated and Inserted Rows in an Upsert

INSERT ON CONFLICT, also known as "merge" (in Oracle) or "upsert" (a mashup of UPDATE and INSERT), is a very useful command, especially in ETL processes. Using the ON CONFLICT clause of an INSERT statement, you can tell the database what to do when a collision is detected in one or more key columns.

For example, here is a query to sync data in an employees table:

db=# WITH new_employees AS (
    SELECT * FROM (VALUES
        ('George', 'Sales',    'Manager',   1000),
        ('Jane',   'R&D',      'Developer', 1200)
    ) AS t(
         name,      department, role,       salary
    )
)
INSERT INTO employees (name, department, role, salary)
SELECT name, department, role, salary
FROM new_employees
ON CONFLICT (name) DO UPDATE SET
    department = EXCLUDED.department,
    role = EXCLUDED.role,
    salary = EXCLUDED.salary
RETURNING *;

  name  │ department │   role    │ salary
────────┼────────────┼───────────┼────────
 George │ Sales      │ Manager   │   1000
 Jane   │ R&D        │ Developer │   1200
INSERT 0 2

The query inserts new employee data to the table. If there is an attempt to add an employee with a name that already exists, the query will update that row instead.

You can see from the output of the command above, INSERT 0 2, that two employees were affected. But how many were inserted, and how many were updated? The output is not giving us any clue!

While I was looking for a way to improve the logging of an ETL process that used such a query, I stumbled upon this Stack Overflow answer that suggested a pretty clever solution to this exact problem:

db=# WITH new_employees AS (
    SELECT * FROM (VALUES
        ('George', 'Sales',    'Manager',   1000),
        ('Jane',   'R&D',      'Developer', 1200)
    ) AS t(
         name,      department, role,       salary
    )
)
INSERT INTO employees (name, department, role, salary)
SELECT name, department, role, salary
FROM new_employees
ON CONFLICT (name) DO UPDATE SET
    department = EXCLUDED.department,
    role = EXCLUDED.role,
    salary = EXCLUDED.salary
RETURNING *, (xmax = 0) AS inserted;

  name  │ department │   role    │ salary │ inserted
────────┼────────────┼───────────┼────────┼──────────
 Jane   │ R&D        │ Developer │   1200 │ t
 George │ Sales      │ Manager   │   1000 │ f
INSERT 0 2

Notice the difference in the RETURNING clause. It includes the calculated field inserted, which uses the special system column xmax to determine whether each row was inserted or updated. From the data returned by the command, you can spot that a new row was inserted for "Jane", but "George" was already in the table, so that row was updated.

The xmax column is a special system column:

The identity (transaction ID) of the deleting transaction, or zero for an undeleted row version.

In PostgreSQL, when a row is updated, the previous version is deleted, and xmax holds the ID of the deleting transaction. When the row is inserted, no previous row is deleted, so xmax is zero. This "trick" is cleverly using this behavior to distinguish between updated and inserted rows.
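
If you want total counts rather than a per-row flag, you can wrap the upsert in a data-modifying CTE and aggregate over the inserted flag. Here is a minimal sketch along the lines of the query above (the names rows_inserted and rows_updated are mine, not from the original answer):

db=# WITH new_employees AS (
    SELECT * FROM (VALUES
        ('George', 'Sales',    'Manager',   1000),
        ('Jane',   'R&D',      'Developer', 1200)
    ) AS t(
         name,      department, role,       salary
    )
), upserted AS (
    INSERT INTO employees (name, department, role, salary)
    SELECT name, department, role, salary
    FROM new_employees
    ON CONFLICT (name) DO UPDATE SET
        department = EXCLUDED.department,
        role = EXCLUDED.role,
        salary = EXCLUDED.salary
    RETURNING (xmax = 0) AS inserted
)
SELECT
    count(*) FILTER (WHERE inserted)     AS rows_inserted,
    count(*) FILTER (WHERE NOT inserted) AS rows_updated
FROM upserted;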


Grant Permissions on Specific Columns

Say you have a users table that contains sensitive information such as credentials, passwords or PII:

db=# CREATE TABLE users (
    id INT,
    username VARCHAR(20),
    personal_id VARCHAR(10),
    password_hash VARCHAR(256)
);
CREATE TABLE

db=# INSERT INTO users VALUES (1, 'haki', '12222227', 'super-secret-hash');
INSERT 0 1

The table is used by different people in your organization, such as analysts, to access data and produce ad-hoc reports. To allow access to analysts, you add a special user in the database:

db=# CREATE USER analyst;
CREATE USER

db=# GRANT SELECT ON users TO analyst;
GRANT

The user analyst can now access the users table:

db=# \connect db analyst
You are now connected to database "db" as user "analyst".

db=> SELECT * FROM users;
 id │ username │ personal_id │   password_hash
────┼──────────┼─────────────┼───────────────────
  1 │ haki     │ 12222227    │ super-secret-hash

As mentioned previously, analysts access users data to produce reports and conduct analysis, but they should not have access to sensitive information or PII.

To provide granular control over which data a user can access in a table, PostgreSQL allows you to grant permissions only on specific columns of a table:

db=# \connect db postgres
You are now connected to database "db" as user "postgres".

db=# REVOKE SELECT ON users FROM analyst;
REVOKE

db=# GRANT SELECT (id, username) ON users TO analyst;
GRANT

After revoking the existing select permission on the table, you granted analyst select permission only on the id and username columns. Now, analyst can no longer access any of the other columns:

db=# \connect db analyst
You are now connected to database "db" as user "analyst".

db=> SELECT * FROM users;
ERROR:  permission denied for table users

db=> SELECT id, username, personal_id FROM users;
ERROR:  permission denied for table users

db=> SELECT id, username FROM users;
 id │ username
────┼──────────
  1 │ haki

Notice that when the user analyst attempts to access any of the restricted columns, either explicitly or implicitly using *, they get a "permission denied" error.
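
Column-level privileges are not limited to SELECT. If, for example, you wanted to let analyst fix typos in usernames but nothing else, you could grant UPDATE on that single column. A small sketch, continuing the example above:

db=# \connect db postgres
You are now connected to database "db" as user "postgres".

db=# GRANT UPDATE (username) ON users TO analyst;
GRANT

db=# \connect db analyst
You are now connected to database "db" as user "analyst".

db=> UPDATE users SET username = 'haki_b' WHERE id = 1;
UPDATE 1

db=> UPDATE users SET personal_id = 'x' WHERE id = 1;
ERROR:  permission denied for table users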


Match Against Multiple Patterns

It's not uncommon to use pattern matching in SQL. For example, here is a query to find users with a "gmail.com" email account:

SELECT *
FROM users
WHERE email LIKE '%@gmail.com';

This query uses the wildcard '%' to find users with emails that end with "@gmail.com". What if, for example, in the same query you also want to find users with a "yahoo.com" email account?

SELECT *
FROM users
WHERE
    email LIKE '%@gmail.com'
    OR email LIKE '%@yahoo.com'

To match against either one of these patterns, you can construct an OR condition. In PostgreSQL however, there is another way to match against multiple patterns:

SELECT *
FROM users
WHERE email SIMILAR TO '%@gmail.com|%@yahoo.com'

Using SIMILAR TO you can match against multiple patterns and keep the query simple.

Another way to match against multiple patterns is using regexp:

SELECT *
FROM users
WHERE email ~ '@gmail\.com$|@yahoo\.com$'

When using regexps you need to be a bit more cautious. A period "." matches any character, so to match the literal period in gmail.com or yahoo.com, you need to escape it as "\.".

When I posted this on Twitter I got some interesting responses. One comment from the official account of psycopg, a PostgreSQL driver for Python, suggested another way:

SELECT *
FROM users
WHERE email ~ ANY('{@gmail\.com$|@yahoo\.com$}')

This query uses the ANY operator to match against an array of patterns. If an email matches any of the patterns, the condition will be true. This approach is easier to work with from a host language such as Python:

with connection.cursor() as cursor:
    # Pass the patterns as a bound parameter; psycopg adapts the Python list
    # to a PostgreSQL array, so no manual string formatting is needed.
    cursor.execute('''
        SELECT *
        FROM users
        WHERE email ~ ANY(%(patterns)s)
    ''', {
        'patterns': [
            r'@gmail\.com$',
            r'@yahoo\.com$',
        ],
    })

Unlike the previous approach that used SIMILAR TO, using ANY you can bind a list of patterns to a single variable.


Find the Current Value of a Sequence Without Advancing It

If you ever needed to find the current value of a sequence, your first attempt was most likely using currval:

db=# SELECT currval('sale_id_seq');
ERROR:  currval of sequence "sale_id_seq" is not yet defined in this session

Just like me, you probably found that currval only works if nextval was already called on the sequence in the current session. Advancing a sequence for no good reason is usually not something you want to do, so this is not an acceptable solution.

In PostgreSQL 10 the pg_sequences view was added to provide easy access to information about sequences:

db=# SELECT * FROM pg_sequences WHERE sequencename = 'sale_id_seq';
─[ RECORD 1 ]─┬────────────
schemaname    │ public
sequencename  │ sale_id_seq
sequenceowner │ db
data_type     │ integer
start_value   │ 1
min_value     │ 1
max_value     │ 2147483647
increment_by  │ 1
cycle         │ f
cache_size    │ 1
last_value    │ 155

This view can answer your question, but it's not really a "lesser known feature"; it's just another system catalog view.

Another way to get the current value of a sequence is using the undocumented function pg_sequence_last_value:

db=# SELECT pg_sequence_last_value('sale_id_seq');
 pg_sequence_last_value
────────────────────────
                   155

It's not clear why this function is undocumented; I couldn't find any mention of it in the official documentation. Take that into consideration if you decide to use it.

Another interesting thing I found while I was researching this, is that you can query a sequence, just like you would a table:

db=# SELECT * FROM sale_id_seq;

 last_value │ log_cnt │ is_called
────────────┼─────────┼───────────
        155 │      10 │ t

This really makes you wonder what other types of objects you can query in PostgreSQL, and what you'll get in return.

It's important to note that this feature should not be used for anything except getting a cursory look at a sequence. You should not try to set IDs based on values from this output; for that, you should use nextval.
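
To make the distinction concrete, here is a small sketch (the values are illustrative): the catalog views above only peek at the sequence, while nextval actually advances it and is the right tool when you need a new value:

db=# SELECT last_value FROM sale_id_seq;
 last_value
────────────
        155

db=# SELECT nextval('sale_id_seq');
 nextval
─────────
     156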


Use \copy With Multi-line SQL

If you work with psql a lot you probably use \COPY very often to export data from the database. I know I do. One of the most annoying things about \COPY is that it does not allow multi-line queries:

db=# \COPY (
\copy: parse error at end of line

When you try to add a new line to a \copy command you get this error message.

To overcome this restriction, my first idea was to use a view:

db=# CREATE VIEW v_department_dbas AS
    SELECT department, count(*) AS employees
    FROM emp
    WHERE role = 'dba'
    GROUP BY department
    ORDER BY employees;
CREATE VIEW

db=# \COPY (SELECT * FROM v_department_dbas) TO department_dbas.csv WITH CSV HEADER;
COPY 5

db=# DROP VIEW v_department_dbas;
DROP VIEW

This works, but if something fails in the middle it can leave views lying around. I like to keep my schema tidy, so I looked for a way to automatically clean up after myself. A quick search brought up temporary views:

db=# CREATE TEMPORARY VIEW v_department_dbas AS # ...
CREATE VIEW

db=# \COPY (SELECT * FROM v_department_dbas) TO department_dbas.csv WITH CSV HEADER;
COPY 5

Using temporary views I no longer had to clean up after myself, because temporary views are automatically dropped when the session terminates.

I used temporary views for a while, until I stumbled upon this little gem in the psql documentation:

db=# COPY (
    SELECT department, count(*) AS employees
    FROM emp
    WHERE role = 'dba'
    GROUP BY department
    ORDER BY employees
) TO STDOUT WITH CSV HEADER \g department_dbas.csv
COPY 5

Nice, right? Let's break it down:

  • Use COPY instead of \COPY: the COPY command is a server command executed on the server, and \COPY is a psql command with the same interface. So while \COPY does not support multi-line queries, COPY does!

  • Write results to STDOUT: Using COPY we can write results to a file on the server, or write results to the standard output using TO STDOUT.

  • Use \g to write STDOUT to a local file: Finally, psql's \g meta-command writes the output from standard output to a file.

Combining these three features did exactly what I wanted.
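
A related variation worth knowing (my own addition, but documented psql behavior in reasonably recent versions): \g also accepts a pipe, so you can post-process the export on the fly, for example compressing it with gzip:

db=# COPY (
    SELECT department, count(*) AS employees
    FROM emp
    WHERE role = 'dba'
    GROUP BY department
) TO STDOUT WITH CSV HEADER \g |gzip > department_dbas.csv.gz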

Copy expert

If you move a lot of data around, don't miss the fastest way to load data into PostgreSQL using Python.


Prevent Setting the Value of an Auto Generated Key

If you are using auto generated primary keys in PostgreSQL, it's possible you are still using the SERIAL datatype:

CREATE TABLE sale (
    id SERIAL PRIMARY KEY,
    sold_at TIMESTAMPTZ,
    amount INT
);

Behind the scenes, PostgreSQL creates a sequence to use when rows are added:

db=# INSERT INTO sale (sold_at, amount) VALUES (now(), 1000);
INSERT 0 1

db=# SELECT * FROM sale;
 id │           sold_at             │ amount
────┼───────────────────────────────┼────────
  1 │ 2021-09-25 10:06:56.646298+03 │   1000

The SERIAL data type is unique to PostgreSQL and has some known problems, so starting at version 10, the SERIAL datatype was softly deprecated in favor of identity columns:

CREATE TABLE sale (
    id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    sold_at TIMESTAMPTZ,
    amount INT
);

Identity columns work very similarly to the SERIAL datatype:

db=# INSERT INTO sale (sold_at, amount) VALUES (now(), 1000);
INSERT 0 1

db=# SELECT * FROM sale;
 id │           sold_at             │ amount
────┼───────────────────────────────┼────────
  1 │ 2021-09-25 10:11:57.771121+03 │   1000

But, consider this scenario:

db=# INSERT INTO sale (id, sold_at, amount) VALUES (2, now(), 1000);
INSERT 0 1

db=# INSERT INTO sale (sold_at, amount) VALUES (now(), 1000);
ERROR:  duplicate key value violates unique constraint "sale_pkey"
DETAIL:  Key (id)=(2) already exists.

Why did it fail?

  • The first INSERT command explicitly provides the value 2 for the id column, so the sequence was not used.
  • The second INSERT command does not provide a value for id, so the sequence is used. The next value of the sequence happened to be 2, so the command failed with a unique constraint violation.

Auto-incrementing IDs rarely need to be set manually, and doing so can cause a mess. So how can you prevent users from setting them?

CREATE TABLE sale (
    id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sold_at TIMESTAMPTZ,
    amount INT
);

Instead of using GENERATED BY DEFAULT, use GENERATED ALWAYS. To understand the difference, try the same scenario again:

db=# INSERT INTO sale (sold_at, amount) VALUES (now(), 1000);
INSERT 0 1

db=# INSERT INTO sale (id, sold_at, amount) VALUES (2, now(), 1000);
ERROR:  cannot insert into column "id"
DETAIL:  Column "id" is an identity column defined as GENERATED ALWAYS.
HINT:  Use OVERRIDING SYSTEM VALUE to override.

What changed?

  • The first INSERT does not provide a value for id and completes successfully.
  • The second INSERT command however, attempts to set the value 2 for id and fails!

In the error message, PostgreSQL is kind enough to offer a solution for when you actually do want to set the value for an identity column explicitly:

db=# INSERT INTO sale (id, sold_at, amount)
OVERRIDING SYSTEM VALUE VALUES (2, now(), 1000);

INSERT 0 1

By adding OVERRIDING SYSTEM VALUE to the INSERT command you explicitly instruct PostgreSQL to allow you to set the value of an identity column. You still have to handle a possible unique constraint violation, but you can no longer blame PostgreSQL for it!
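
If you do insert explicit ids this way, one way to avoid the unique constraint violation later (my own addition, not part of the original article) is to bump the identity sequence past the highest id in the table before normal inserts resume:

db=# SELECT setval(
    pg_get_serial_sequence('sale', 'id'),
    (SELECT max(id) FROM sale)   -- assumes the table is not empty
);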


Two More Ways to Produce a Pivot Table

In one of my previous articles I demonstrated how to produce pivot tables using conditional aggregates. After writing the article, I found two more ways to generate pivot tables in PostgreSQL.

Say you want to get the number of employees, at each role, in each department:

db=# WITH employees AS (
    SELECT * FROM (VALUES
        ('Haki',    'R&D',      'Manager'),
        ('Dan',     'R&D',      'Developer'),
        ('Jax',     'R&D',      'Developer'),
        ('George',  'Sales',    'Manager'),
        ('Bill',    'Sales',    'Developer'),
        ('David',   'Sales',    'Developer')
    ) AS t(
        name,       department,  role
    )
)
SELECT role, department, count(*)
FROM employees
GROUP BY role, department;

   role    │ department │ count
───────────┼────────────┼───────
 Developer │ Sales      │     2
 Manager   │ Sales      │     1
 Manager   │ R&D        │     1
 Developer │ R&D        │     2

A better way of viewing this would be as a pivot table. In psql you can use the \crosstabview command to transform the results of the last query to a pivot table:

db=# \crosstabview

   role    │ Sales │ R&D
───────────┼───────┼─────
 Developer │     2 │   2
 Manager   │     1 │   1

Magic!

By default, the command will produce the pivot table from the first two columns, but you can control that with arguments:

db=# \crosstabview department role

 department │ Developer │ Manager
────────────┼───────────┼─────────
 Sales      │         2 │       1
 R&D        │         2 │       1

Another, slightly less magical way to produce a pivot table is using the built-in tablefunc extension:

db=# CREATE EXTENSION tablefunc;
CREATE EXTENSION

db=# SELECT * FROM crosstab('
    SELECT role, department, count(*) AS employees
    FROM employees
    GROUP BY 1, 2
    ORDER BY role
', '
    SELECT DISTINCT department
    FROM employees
    ORDER BY 1
') AS t(role text, rnd int, sales int);

   role    │ rnd │ sales
───────────┼─────┼───────
 Developer │   2 │     2
 Manager   │   1 │     1

Using the function crosstab you can produce a pivot table. The downside of this method is that you need to define the output columns in advance. The advantage, however, is that crosstab produces a table, which you can use as a sub-query for further processing.
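
For example, because crosstab returns a regular row set, you can filter it or join it like any other table. A small sketch (note that the column order follows the categories query, so R&D comes before Sales):

db=# SELECT role, rnd, sales
FROM crosstab('
    SELECT role, department, count(*) AS employees
    FROM employees
    GROUP BY 1, 2
    ORDER BY role
', '
    SELECT DISTINCT department
    FROM employees
    ORDER BY 1
') AS t(role text, rnd int, sales int)
WHERE rnd >= 1 AND sales >= 1;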


Dollar Quoting

If you store text fields in your database, especially entire paragraphs, you are probably familiar with escape characters. For example, to include a single quote ' in a text literal you need to escape it using another single quote '':

db=# SELECT 'John''s Pizza';
   ?column?
──────────────
 John's Pizza

When text starts to get bigger and includes characters like backslashes and newlines, it can get pretty annoying to add escape characters. To address this, PostgreSQL provides another way to write string constants:

db=# SELECT $$a long
string with new lines
and 'single quotes'
and "double quotes

PostgreSQL doesn't mind ;)$$ AS text;
           text
───────────────────────────
 a long                   ↵
 string with new lines    ↵
 and 'single quotes'      ↵
 and "double quotes       ↵

 PostgreSQL doesn't mind ;)

Notice the dollar signs $$ at the beginning and end of the string. Anything in between $$ is treated as a string. PostgreSQL calls this "Dollar Quoting".

But there is more: if you happen to need the $$ sign inside the text, you can add a tag between the dollar signs, which makes this even more useful. For example:

db=# SELECT $JSON${
    "name": "John's Pizza",
    "tagline": "Best value for your $$"
}$JSON$ AS json;

                  json
─────────────────────────────────────────
 {                                      ↵
     "name": "John's Pizza",            ↵
     "tagline": "Best value for your $$"↵
 }

Notice that we chose to tag this block with $JSON$, so the "$$" sign was included as-is in the output.

You can also use this to quickly generate jsonb objects that include special characters:

db=# SELECT $JSON${
    "name": "John's Pizza",
    "tagline": "Best value for your $$"
}$JSON$::jsonb AS json;
                              json
────────────────────────────────────────────────────────────────
 {"name": "John's Pizza", "tagline": "Best value for your $$"}

The value is now a jsonb object which you can manipulate as you wish!


Comment on Database Objects

PostgreSQL has this nice little feature where you can add a comment on just about every database object. For example, adding a comment on a table:

db=# COMMENT ON TABLE sale IS 'Sales made in the system';
COMMENT

You can now view this comment in psql (and probably other IDEs):

db=# \dt+ sale
                                  List of relations
 Schema │ Name │ Type  │ Owner │ Persistence │    Size    │       Description
────────┼──────┼───────┼───────┼─────────────┼────────────┼──────────────────────────
 public │ sale │ table │ haki  │ permanent   │ 8192 bytes │ Sales made in the system

You can also add comments on table columns, and view them when using extended describe:

db=# COMMENT ON COLUMN sale.sold_at IS 'When was the sale finalized';
COMMENT

db=# \d+ sale
  Column  │           Type           │         Description
──────────┼──────────────────────────┼─────────────────────────────
 id       │ integer                  │
 sold_at  │ timestamp with time zone │ When was the sale finalized
 amount   │ integer                  │

You can also combine the COMMENT command with dollar quoting to include longer and more meaningful descriptions of, for example, functions:

COMMENT ON FUNCTION generate_random_string IS $docstring$
Generate a random string at a given length from a list of possible characters.

Parameters:

    - length (int): length of the output string
    - characters (text): possible characters to choose from

Example:

    db=# SELECT generate_random_string(10);
     generate_random_string
    ────────────────────────
     o0QsrMYRvp

    db=# SELECT generate_random_string(3, 'AB');
     generate_random_string
    ────────────────────────
     ABB
$docstring$;

This is a function I used in the past to demonstrate the impact of medium-sized texts on performance. Now I no longer have to go back to the article to remember how to use the function; I have the docstring right there in the comments:

db=# \df+ generate_random_string
List of functions
────────────┬────────────────────────────────────────────────────────────────────────────────
Schema      │ public
Name        │ generate_random_string
/* ... */
Description │ Generate a random string at a given length from a list of possible characters.↵
            │                                                                               ↵
            │ Parameters:                                                                   ↵
            │                                                                               ↵
            │     - length (int): length of the output string                               ↵
            │     - characters (text): possible characters to choose from                   ↵
            │                                                                               ↵
            │ Example:                                                                      ↵
            │                                                                               ↵
            │     db=# SELECT generate_random_string(10);                                   ↵
            │      generate_random_string                                                   ↵
            │     ────────────────────────                                                  ↵
            │      o0QsrMYRvp                                                               ↵
            │                                                                               ↵
            │     db=# SELECT generate_random_string(3, 'AB');                              ↵
            │      generate_random_string                                                   ↵
            │     ────────────────────────                                                  ↵
            │      ABB                                                                      ↵


Keep a Separate History File Per Database

If you are working with CLI tools you probably use the ability to search past commands very often. In bash and psql, a reverse search is usually available by hitting CTRL + R.

If in addition to working with the terminal, you also work with multiple databases, you might find it useful to keep a separate history file per database:

db=# \set HISTFILE ~/.psql_history- :DBNAME

This way, you are more likely to find a relevant match for the database you are currently connected to. You can drop this in your ~/.psqlrc file to make it persistent.


Autocomplete Reserved Words in Uppercase

There is always a lot of debate (and jokes!) on whether keywords in SQL should be in lower or upper case. I think my opinion on this subject is pretty clear.

If, like me, you like using uppercase keywords in SQL, there is an option in psql to autocomplete keywords in uppercase:

db=# selec <tab>
db=# select

db=# \set COMP_KEYWORD_UPPER upper
db=# selec <tab>
db=# SELECT

After setting COMP_KEYWORD_UPPER to upper, when you hit TAB for autocomplete, keywords will be autocompleted in uppercase.


Sleep for Interval

Delaying the execution of a program can be pretty useful for things like testing or throttling. To delay the execution of a program in PostgreSQL, the go-to function is usually pg_sleep:

db=# \timing
Timing is on.

db=# SELECT pg_sleep(3);
 pg_sleep
──────────

(1 row)

Time: 3014.913 ms (00:03.015)

The function sleeps for the given number of seconds. However, when you need to sleep for longer than just a few seconds, calculating the number of seconds can be annoying, for example:

db=# SELECT pg_sleep(14400);

How long will this function sleep for? Don't take out the calculator; it will sleep for 4 hours.

To make it more convenient to sleep for longer periods of time, PostgreSQL offers another function:

db=# SELECT pg_sleep_for('4 hours');

Unlike its sibling pg_sleep, the function pg_sleep_for accepts an interval, which is much more natural to read and understand than the number of seconds.


Get the First or Last Row in a Group Without Sub-Queries

When I initially compiled this list I did not think of this feature as a lesser known one, mostly because I use it all the time. But to my surprise, I keep running into weird solutions to this problem that can be easily solved with what I'm about to show you, so I figured it deserves a place on the list!

Say you have this table of students:

db=# SELECT * FROM students;

  name  │ class │ height
────────┼───────┼────────
 Haki   │ A     │    186
 Dan    │ A     │    175
 Jax    │ A     │    182
 George │ B     │    178
 Bill   │ B     │    167
 David  │ B     │    178

⚙ Table data

You can use the following CTE to reproduce the queries in this section:

WITH students AS (
    SELECT * FROM (VALUES
        ('Haki',    'A',    186),
        ('Dan',     'A',    175),
        ('Jax',     'A',    182),
        ('George',  'B',    178),
        ('Bill',    'B',    167),
        ('David',   'B',    178)
    ) AS t(
        name,       class,  height
    )
)
SELECT * FROM students;

How would you get the entire row of the tallest student in each class?

On first thought you might try something like this:

SELECT class, max(height) as tallest
FROM students
GROUP BY class;

 class │ tallest
───────┼─────────
 A     │     186
 B     │     178

This gets you the height, but it doesn't get you the name of the student. As a second attempt you might try to find the tallest student based on their height, using a sub-query:

SELECT *
FROM students
WHERE (class, height) IN (
    SELECT class, max(height) as tallest
    FROM students
    GROUP BY class
);

  name  │ class │ height
────────┼───────┼────────
 Haki   │ A     │    186
 George │ B     │    178
 David  │ B     │    178

Now you have all the information about the tallest students in each class, but there is another problem.

side note

The ability to match a set of records like in the previous query ((class, height) IN (...)) is another lesser known but very powerful feature of PostgreSQL.

In class "B", there are two students with the same height, which also happen to be the tallest. Using the aggregate function MAX you only get the height, so you may encounter this type of situation.

The challenge with using MAX is that rows are chosen based on the height alone, which makes perfect sense in this case, but you still need to pick just one student. A different approach that lets you "rank" rows based on more than one column is using a window function:

SELECT
    students.*,
    ROW_NUMBER() OVER (
        PARTITION BY class
        ORDER BY height DESC, name
    ) AS rn
FROM
    students;

  name  │ class │ height │ rn
────────┼───────┼────────┼────
 Haki   │ A     │    186 │  1
 Jax    │ A     │    182 │  2
 Dan    │ A     │    175 │  3
 David  │ B     │    178 │  1
 George │ B     │    178 │  2
 Bill   │ B     │    167 │  3

To "rank" students based on their height you can attach a row number to each row. The row number is determined for each class (PARTITION BY class) and ranked first by height in descending order, and then by the student's name (ORDER BY height DESC, name). Adding the student name in addition to the height makes the results deterministic (assuming the name is unique).

To get the rows of only the tallest student in each class you can use a sub-query:

SELECT
    name, class, height
FROM (
    SELECT
        students.*,
        ROW_NUMBER() OVER (
            PARTITION BY class
            ORDER BY height DESC, name
        ) AS rn
    FROM
        students
) AS ranked
WHERE
    rn = 1;

 name  │ class │ height
───────┼───────┼────────
 Haki  │ A     │    186
 David │ B     │    178

You made it! This is the entire row for the tallest student in each class.

Using DISTINCT ON

Now that you went through all of this trouble, let me show you an easier way:

SELECT DISTINCT ON (class)
    *
FROM
    students
ORDER BY
    class, height DESC, name;

 name  │ class │ height
───────┼───────┼────────
 Haki  │ A     │    186
 David │ B     │    178

Pretty nice, right? I was blown away when I first discovered DISTINCT ON. Coming from Oracle, there was nothing like it, and as far as I know, no other database has it.

Intuitively understand DISTINCT ON

To understand how DISTINCT ON works, let's go over what it does step by step. This is the raw data in the table:

SELECT *
FROM students;

  name  │ class │ height
────────┼───────┼────────
 Haki   │ A     │    186
 Dan    │ A     │    175
 Jax    │ A     │    182
 George │ B     │    178
 Bill   │ B     │    167
 David  │ B     │    178

Next, sort the data:

SELECT *
FROM students
ORDER BY class, height DESC, name;

  name  │ class │ height
────────┼───────┼────────
 Haki   │ A     │    186
 Jax    │ A     │    182
 Dan    │ A     │    175
 David  │ B     │    178
 George │ B     │    178
 Bill   │ B     │    167

Then, add the DISTINCT ON clause:

SELECT DISTINCT ON (class) *
FROM students
ORDER BY class, height DESC, name;

To understand what DISTINCT ON does at this point, we need to take two steps.

First, split the data to groups based on the columns in the DISTINCT ON clause, in this case by class:

  name  │ class │ height
─────────────────────────
 Haki   │ A     │    186  ┓
 Jax    │ A     │    182  ┣━━ class=A
 Dan    │ A     │    175  ┛

 David  │ B     │    178  ┓
 George │ B     │    178  ┣━━ class=B
 Bill   │ B     │    167  ┛

Next, keep only the first row in each group:

  name  │ class │ height
─────────────────────────
 Haki   │ A     │    186  ┣━━ class=A
 David  │ B     │    178  ┣━━ class=B

And there you have it! The tallest student in each class.

The only requirement DISTINCT ON has is that the leading columns in the ORDER BY clause match the columns in the DISTINCT ON clause. The remaining columns in the ORDER BY clause are used to determine which row is selected for each group.
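
If the leading ORDER BY columns do not match, PostgreSQL rejects the query. A quick sketch (the exact error wording may vary slightly between versions):

SELECT DISTINCT ON (class) *
FROM students
ORDER BY height DESC, class, name;

ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions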

To illustrate how the ORDER BY affects the results, consider this query to find the shortest student in each class:

SELECT DISTINCT ON (class)
    *
FROM
    students
ORDER BY
    class, height, name;

 name │ class │ height
──────┼───────┼────────
 Dan  │ A     │    175
 Bill │ B     │    167

To pick the shortest student in each class, you only have to change the sort order, so that the first row of each group is the shortest student.


Generate UUID Without Extensions

To generate UUIDs in PostgreSQL prior to version 13 you probably used the uuid-ossp extension:

db=# CREATE EXTENSION "uuid-ossp";
CREATE EXTENSION

db=# SELECT uuid_generate_v4() AS uuid;
                 uuid
──────────────────────────────────────
 8e55146d-0ce5-40ab-a346-5dbd466ff5f2

Starting at version 13 there is a built-in function to generate random (version 4) UUIDs:

db=# SELECT gen_random_uuid() AS uuid;
                 uuid
──────────────────────────────────────
 ba1ac0f5-5d4d-4d80-974d-521dbdcca2b2

The uuid-ossp extension is still needed if you want to generate UUIDs other than version 4.


Generate Reproducible Random Data

Generating random data is very useful for many things, such as demonstrations or testing. In both cases, it's also useful to be able to reproduce the "random" data.

Using PostgreSQL's random function you can produce different types of random data. For example:

db=# SELECT
    random() AS random_float,
    ceil(random() * 10) AS random_int_0_10,
    '2022-01-01'::date + interval '1 days' * ceil(random() * 365) AS random_day_in_2022;

─[ RECORD 1 ]──────┬────────────────────
random_float       │ 0.6031888056092001
random_int_0_10    │ 3
random_day_in_2022 │ 2022-11-10 00:00:00

If you execute this query again, you will get different results:

db=# SELECT
    random() AS random_float,
    ceil(random() * 10) AS random_int_0_10,
    '2022-01-01'::date + interval '1 days' * ceil(random() * 365) AS random_day_in_2022;

─[ RECORD 1 ]──────┬────────────────────
random_float       │ 0.7363406030115378
random_int_0_10    │ 2
random_day_in_2022 │ 2022-02-23 00:00:00

To generate reproducible random data, you can use setseed:

db=# SELECT setseed(0.4050);
 setseed
─────────

(1 row)

db=# SELECT
    random() AS random_float,
    ceil(random() * 10) AS random_int_0_10,
    '2022-01-01'::date + interval '1 days' * ceil(random() * 365) AS random_day_in_2022
FROM
    generate_series(1, 2);

    random_float    │ random_int_0_10 │ random_day_in_2022
────────────────────┼─────────────────┼─────────────────────
 0.1924247516794324 │               9 │ 2022-12-17 00:00:00
 0.9720620908236377 │               5 │ 2022-06-13 00:00:00

If you execute the same block again in a new session, even in a different database, it will produce the exact same results:

otherdb=# SELECT setseed(0.4050);
 setseed
─────────

(1 row)

otherdb=# SELECT
    random() AS random_float,
    ceil(random() * 10) AS random_int_0_10,
    '2022-01-01'::date + interval '1 days' * ceil(random() * 365) AS random_day_in_2022
FROM
    generate_series(1, 2);

    random_float    │ random_int_0_10 │ random_day_in_2022
────────────────────┼─────────────────┼─────────────────────
 0.1924247516794324 │               9 │ 2022-12-17 00:00:00
 0.9720620908236377 │               5 │ 2022-06-13 00:00:00

Notice how the results are random, but still exactly the same. The next time you do a demonstration or share a script, make sure to include setseed so your results can be easily reproduced.


Add Constraints Without Validating Immediately

Constraints are an integral part of any RDBMS. They keep data clean and reliable, and should be used whenever possible. In living, breathing systems, you often need to add new constraints, and adding certain types of constraints may require very restrictive locks that interfere with the operation of the live system.

To illustrate, add a simple check constraint on a large table:

db=# ALTER TABLE orders ADD CONSTRAINT check_price_gt_zero CHECK (price >= 0);
ALTER TABLE
Time: 10745.662 ms (00:10.746)

This statement adds a check constraint on the price of an order, to make sure it's greater than or equal to zero. In the process of adding the constraint, the database scanned the entire table to make sure the constraint is valid for all the existing rows. The process took ~10s, and during that time, the table was locked.

In PostgreSQL, you can split the process of adding a constraint into two steps.

First, add the constraint and only validate new data, but don't check that existing data is valid:

db=# ALTER TABLE orders ADD CONSTRAINT check_price_gt_zero CHECK (price >= 0) NOT VALID;
ALTER TABLE
Time: 13.590 ms

The NOT VALID at the end tells PostgreSQL not to validate the new constraint for existing rows. This means the database does not have to scan the entire table. Notice how this statement took significantly less time than the previous one; it was almost instantaneous.

Next, validate the constraint for the existing data with a much more permissive lock that allows other operations on the table:

db=# ALTER TABLE orders VALIDATE CONSTRAINT check_price_gt_zero;
ALTER TABLE
Time: 11231.189 ms (00:11.231)

Notice how validating the constraint took roughly the same time as the first example, which added and validated the constraint. This reaffirms that when adding a constraint to an existing table, most time is spent validating existing rows. Splitting the process into two steps allows you to reduce the time the table is locked.

The documentation also mentions another use case for NOT VALID: enforcing a constraint only on future inserts and updates, even if there are some existing bad values. That is, you would add the constraint with NOT VALID and never run VALIDATE.
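
A small sketch of that second use case (my own illustration, not from the article): suppose orders already contains a few negative prices that you cannot clean up yet. With NOT VALID the existing rows are tolerated, but new writes are checked immediately:

db=# ALTER TABLE orders ADD CONSTRAINT check_price_gt_zero CHECK (price >= 0) NOT VALID;
ALTER TABLE

db=# INSERT INTO orders (price) VALUES (-10);
ERROR:  new row for relation "orders" violates check constraint "check_price_gt_zero"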

Check out this great article from the engineering team at Paypal about making schema changes without downtime, and my own tip to disable constraints and indexes during bulk loads.


Synonyms in PostgreSQL

Synonyms are a way to reference objects by another name, similar to symlinks in Linux. If you're coming from Oracle you are probably familiar with synonyms, but otherwise you may have never heard of them. PostgreSQL does not have a feature called "synonyms", but that doesn't mean the same effect isn't possible.

To have a name reference a different database object, you first need to understand how PostgreSQL resolves unqualified names. For example, if you are connected to the database with the user haki, and you reference a table foo, PostgreSQL will search for the following objects, in this order:

  1. haki.foo
  2. public.foo

This order is determined by the search_path parameter:

db=# SHOW search_path;
   search_path
─────────────────
 "$user", public

The first value, "$user" is a special value that resolves to the name of the currently connected user. The second value, public, is the name of the default schema.

To demonstrate some of the things you can do with search path, create a table foo in database db:

db=# CREATE TABLE foo (value TEXT);
CREATE TABLE

db=# INSERT INTO foo VALUES ('A');
INSERT 0 1

db=# SELECT * FROM foo;
 value
───────
 A
(1 row)

If for some reason you want the user haki to view a different object when they reference the name foo, you have two options:

1. Create an object named foo in a schema called haki:

db=# CREATE SCHEMA haki;
CREATE SCHEMA

db=# CREATE TABLE haki.foo (value text);
CREATE TABLE

db=# INSERT INTO haki.foo VALUES ('B');
INSERT 0 1

db=# \conninfo
You are connected to database "db" as user "haki"

db=# SELECT * FROM foo;
value
───────
B

Notice how when the user haki referenced the name foo, PostgreSQL resolved the name to haki.foo and not public.foo. This is because the schema haki comes before public in the search path.

2. Update the search path:

db=# CREATE SCHEMA synonyms;
CREATE SCHEMA

db=# CREATE TABLE synonyms.foo (value text);
CREATE TABLE

db=# INSERT INTO synonyms.foo VALUES ('C');
INSERT 0 1

db=# SHOW search_path;
   search_path
─────────────────
 "$user", public

db=# SELECT * FROM foo;
 value
───────
 A

db=# SET search_path TO synonyms, "$user", public;
SET

db=# SELECT * FROM foo;
 value
───────
 C

Notice how after changing the search path to include the schema synonyms, PostgreSQL resolved the name foo to synonyms.foo.

When are synonyms useful?

I used to think that synonyms are a code smell that should be avoided, but over time I found a few valid use cases for them. One of those use cases is zero-downtime migrations.

When you are making changes to a table on a live system, you often need to support both the new and the old version of the application at the same time. This poses a challenge, because each version of the application expects the table to have a different structure.

Take for example a migration to remove a column from a table. While the migration is running, the old version of the application is active, and it expects the column to exist in the table, so you can't simply remove it. One way to deal with this is to release the new version in two stages - the first ignores the field, and the second removes it.

If however, you need to make the change in a single release, you can provide the old version with a view of the table that includes the column, and only then remove it. For that, you can use a "synonym":

db=# \conninfo
You are now connected to database "db" as user "app".

db=# SELECT * FROM users;
 username │ active
──────────┼────────
 haki     │ t

The application is connected to database db with the user app. You want to remove the column active, but the application is using this column. To safely apply the migration you need to "fool" the user app into thinking the column is still there while the old version is active:

db=# \conninfo
You are now connected to database "db" as user "admin".

db=# CREATE SCHEMA app;
CREATE SCHEMA

db=# GRANT USAGE ON SCHEMA app TO app;
GRANT

db=# CREATE VIEW app.users AS SELECT username, true AS active FROM public.users;
CREATE VIEW

db=# GRANT SELECT ON app.users TO app;
GRANT

To "fool" the user app, you created a schema by the name of the user, and a view with a calculated field active. Now, when the application is connected with user app, it will see the view and not the table, so it's safe to remove the column:

db=# \conninfo
You are now connected to database "db" as user "admin".

db=# ALTER TABLE users DROP COLUMN active;
ALTER TABLE

db=# \connect db app
You are now connected to database "db" as user "app".

db=# SELECT * FROM users;
 username │ active
──────────┼────────
 haki     │ t

You dropped the column, and the application sees the calculated field instead! All that is left is some cleanup, and you are done.
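
For completeness, a sketch of that cleanup once the old application version is retired (names follow the example above): drop the shim view and schema, and the app user resolves users to the real table again:

db=# DROP VIEW app.users;
DROP VIEW

db=# DROP SCHEMA app;
DROP SCHEMA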


Find Overlapping Ranges

Say you have a table of meetings:

db=# SELECT * FROM meetings;
       starts_at     │        ends_at
─────────────────────┼─────────────────────
 2021-10-01 10:00:00 │ 2021-10-01 10:30:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00
 2021-10-01 12:30:00 │ 2021-10-01 12:45:00

⚙ Table data

You can use the following CTE to reproduce the queries in this section:

WITH meetings AS (
    SELECT
        starts_at::timestamptz AS starts_at,
        ends_at::timestamptz AS ends_at
    FROM (VALUES
        ('2021-10-01 10:00 UTC', '2021-10-01 10:30 UTC'),
        ('2021-10-01 11:15 UTC', '2021-10-01 12:00 UTC'),
        ('2021-10-01 12:30 UTC', '2021-10-01 12:45 UTC')
    ) AS t(
        starts_at,               ends_at)
)
SELECT * FROM meetings;

You want to schedule a new meeting, but before you do that, you want to make sure it does not overlap with another meeting. There are several scenarios you need to consider:

  • [A] New meeting ends after an existing meeting starts
|-------NEW MEETING--------|
    |*******EXISTING MEETING*******|
  • [B] New meeting starts before an existing meetings ends
        |-------NEW MEETING--------|
|*******EXISTING MEETING*******|
  • [C] New meeting takes place during an existing meeting
    |----NEW MEETING----|
|*******EXISTING MEETING*******|
  • [D] Existing meeting takes place while the new meeting is scheduled
|--------NEW MEETING--------|
    |**EXISTING MEETING**|
  • [E] New meeting is scheduled at exactly the same time as an existing meeting
|--------NEW MEETING--------|
|*****EXISTING MEETING******|

To test a query that checks for overlaps, you can prepare a table with all the scenarios above and try a simple condition:

WITH new_meetings AS (
    SELECT
        id,
        starts_at::timestamptz as starts_at,
        ends_at::timestamptz as ends_at
    FROM (VALUES
        ('A', '2021-10-01 11:10 UTC', '2021-10-01 11:55 UTC'),
        ('B', '2021-10-01 11:20 UTC', '2021-10-01 12:05 UTC'),
        ('C', '2021-10-01 11:20 UTC', '2021-10-01 11:55 UTC'),
        ('D', '2021-10-01 11:10 UTC', '2021-10-01 12:05 UTC'),
        ('E', '2021-10-01 11:15 UTC', '2021-10-01 12:00 UTC')
    ) as t(
        id,   starts_at,               ends_at
    )
)
SELECT
    *
FROM
    meetings, new_meetings
WHERE
    new_meetings.starts_at BETWEEN meetings.starts_at and meetings.ends_at
    OR new_meetings.ends_at BETWEEN meetings.starts_at and meetings.ends_at;

       starts_at     │        ends_at      │ id │       starts_at     │        ends_at
─────────────────────┼─────────────────────┼────┼─────────────────────┼────────────────────
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ A  │ 2021-10-01 11:10:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ B  │ 2021-10-01 11:20:00 │ 2021-10-01 12:05:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ C  │ 2021-10-01 11:20:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ E  │ 2021-10-01 11:15:00 │ 2021-10-01 12:00:00

The first attempt found an overlap in 4 out of 5 scenarios. It did not detect the overlap for scenario D, where the new meeting starts before and ends after an existing meeting. To handle this scenario as well, you need to make the condition a bit longer:

WITH new_meetings AS (/* ... */)
SELECT
    *
FROM
    meetings, new_meetings
WHERE
    new_meetings.starts_at BETWEEN meetings.starts_at and meetings.ends_at
    OR new_meetings.ends_at BETWEEN meetings.starts_at and meetings.ends_at
    OR meetings.starts_at BETWEEN new_meetings.starts_at and new_meetings.ends_at
    OR meetings.ends_at BETWEEN new_meetings.starts_at and new_meetings.ends_at;


       starts_at     │        ends_at      │ id │       starts_at     │        ends_at
─────────────────────┼─────────────────────┼────┼─────────────────────┼────────────────────
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ A  │ 2021-10-01 11:10:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ B  │ 2021-10-01 11:20:00 │ 2021-10-01 12:05:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ C  │ 2021-10-01 11:20:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ D  │ 2021-10-01 11:10:00 │ 2021-10-01 12:05:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ E  │ 2021-10-01 11:15:00 │ 2021-10-01 12:00:00

The query now detects an overlap in all 5 scenarios, but, consider these additional scenarios:

  • [F] New meeting is scheduled immediately after an existing meetings
                            |--------NEW MEETING--------|
|*****EXISTING MEETING******|
  • [G] New meeting is scheduled to end immediately when an existing meeting starts
|--------NEW MEETING--------|
                            |*****EXISTING MEETING******|

Back-to-back meetings are very common, and they should not be detected as an overlap. Add the two scenarios to the test and try the query:

WITH new_meetings AS (
    SELECT
        id,
        starts_at::timestamptz as starts_at,
        ends_at::timestamptz as ends_at
    FROM (VALUES
        ('A', '2021-10-01 11:10 UTC', '2021-10-01 11:55 UTC'),
        ('B', '2021-10-01 11:20 UTC', '2021-10-01 12:05 UTC'),
        ('C', '2021-10-01 11:20 UTC', '2021-10-01 11:55 UTC'),
        ('D', '2021-10-01 11:10 UTC', '2021-10-01 12:05 UTC'),
        ('E', '2021-10-01 11:15 UTC', '2021-10-01 12:00 UTC'),
        ('F', '2021-10-01 12:00 UTC', '2021-10-01 12:10 UTC'),
        ('G', '2021-10-01 11:00 UTC', '2021-10-01 11:15 UTC')
    ) as t(
        id,   starts_at,               ends_at
    )
)
SELECT
    *
FROM
    meetings, new_meetings
WHERE
    new_meetings.starts_at BETWEEN meetings.starts_at and meetings.ends_at
    OR new_meetings.ends_at BETWEEN meetings.starts_at and meetings.ends_at
    OR meetings.starts_at BETWEEN new_meetings.starts_at and new_meetings.ends_at
    OR meetings.ends_at BETWEEN new_meetings.starts_at and new_meetings.ends_at;

       starts_at     │        ends_at      │ id │       starts_at     │        ends_at
─────────────────────┼─────────────────────┼────┼─────────────────────┼────────────────────
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ A  │ 2021-10-01 11:10:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ B  │ 2021-10-01 11:20:00 │ 2021-10-01 12:05:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ C  │ 2021-10-01 11:20:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ D  │ 2021-10-01 11:10:00 │ 2021-10-01 12:05:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ E  │ 2021-10-01 11:15:00 │ 2021-10-01 12:00:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ F  │ 2021-10-01 12:00:00 │ 2021-10-01 12:10:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ G  │ 2021-10-01 11:00:00 │ 2021-10-01 11:15:00

The two back-to-back meetings, scenarios F and G, are incorrectly classified as overlaps. This is because the BETWEEN operator is inclusive. To implement this condition without BETWEEN, you would have to do something like this:

WITH new_meetings AS (/* ... */)
SELECT
    *
FROM
    meetings, new_meetings
WHERE
    (new_meetings.starts_at > meetings.starts_at AND new_meetings.starts_at < meetings.ends_at)
    OR
    (new_meetings.ends_at > meetings.starts_at AND new_meetings.ends_at < meetings.ends_at)
    OR
    (meetings.starts_at > new_meetings.starts_at AND meetings.starts_at < new_meetings.ends_at)
    OR
    (meetings.ends_at > new_meetings.starts_at AND meetings.ends_at < new_meetings.ends_at)
    OR
    (meetings.starts_at = new_meetings.starts_at AND meetings.ends_at = new_meetings.ends_at);

       starts_at     │        ends_at      │ id │       starts_at     │        ends_at
─────────────────────┼─────────────────────┼────┼─────────────────────┼────────────────────
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ A  │ 2021-10-01 11:10:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ B  │ 2021-10-01 11:20:00 │ 2021-10-01 12:05:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ C  │ 2021-10-01 11:20:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ D  │ 2021-10-01 11:10:00 │ 2021-10-01 12:05:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ E  │ 2021-10-01 11:15:00 │ 2021-10-01 12:00:00

The query correctly identifies scenarios A - E as overlaps, and does not identify the back-to-back scenarios F and G as overlaps. This is what you wanted. However, this condition is pretty crazy! It can easily get out of control.

This is where the following operator in PostgreSQL proves itself as extremely valuable:

WITH new_meetings AS (
    SELECT
        id,
        starts_at::timestamptz as starts_at,
        ends_at::timestamptz as ends_at
    FROM (VALUES
        ('A', '2021-10-01 11:10 UTC', '2021-10-01 11:55 UTC'),
        ('B', '2021-10-01 11:20 UTC', '2021-10-01 12:05 UTC'),
        ('C', '2021-10-01 11:20 UTC', '2021-10-01 11:55 UTC'),
        ('D', '2021-10-01 11:10 UTC', '2021-10-01 12:05 UTC'),
        ('E', '2021-10-01 11:15 UTC', '2021-10-01 12:00 UTC'),
        ('F', '2021-10-01 12:00 UTC', '2021-10-01 12:10 UTC'),
        ('G', '2021-10-01 11:00 UTC', '2021-10-01 11:15 UTC')
    ) as t(
        id,   starts_at,               ends_at
    )
)
SELECT
    *
FROM
    meetings, new_meetings
WHERE
    (new_meetings.starts_at, new_meetings.ends_at)
        OVERLAPS (meetings.starts_at, meetings.ends_at);

       starts_at     │        ends_at      │ id │       starts_at     │        ends_at
─────────────────────┼─────────────────────┼────┼─────────────────────┼────────────────────
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ A  │ 2021-10-01 11:10:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ B  │ 2021-10-01 11:20:00 │ 2021-10-01 12:05:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ C  │ 2021-10-01 11:20:00 │ 2021-10-01 11:55:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ D  │ 2021-10-01 11:10:00 │ 2021-10-01 12:05:00
 2021-10-01 11:15:00 │ 2021-10-01 12:00:00 │ E  │ 2021-10-01 11:15:00 │ 2021-10-01 12:00:00

This is it! Using the OVERLAPS operator you can replace those 5 complicated conditions, and keep the query short and simple to read and understand.

Developer Tools secrets that shouldn’t be secrets

Update: As this is blowing up on Hacker News, I added information to each of the tips about which environments it is supported in, in parentheses after each heading. When I state “Chromium browsers”, this refers to all browsers that use the Chromium core and also feature all the Developer Tools. These are Chrome, Microsoft Edge, Brave and many more. As a reminder: Microsoft Edge is the browser that comes with Windows 10/11; it is based on Chromium and thus, from a platform perspective, similar to Chrome. They differ in UX and services around the core. The Edge Developer Tools team works closely with Google on bringing the work we add to the product back into the Chromium core. But some of the things I am talking about here are experiments and exclusive to Microsoft Edge, which is available on Windows, Mac and Linux.

This is a talk that I gave at CityJS this September. I am a principal product manager for developer tools in Microsoft Edge, and these are things I encountered while working on the tools, documenting them and going through user feedback.

You can watch the recording of the talk on YouTube.

Here’s a write-up of all the things I covered:

1. Console is much more than `log()`!
(All browsers with developer tools following the standard)

There is no doubt that, besides the Elements tool, Console is the most used part of the browser developer tools. Specifically, people love to debug by putting a `console.log()` in their code to learn what’s going on. There are a few problems with that, and there are better ways to debug scripts, but as this is what people do, let’s talk about how to make that experience better.

The first problem is log messages that aren’t removed when a product goes live, clogging up the Console. Finding the information you’re looking for becomes daunting, and the best way to deal with that is to learn about the console filtering options available to you. Using these, you can filter the reporting of the console down to the things you care about and block out a lot of the noise.

filtering options in the console tool

What is it that you’re logging?

The next problem with using `console.log()` is that we seem to only log values and forget to add where they come from. For example, when you use the following code, you get a list of numbers, but you don’t know what is what.

console.log(width)
console.log(height)

The easiest way to work around that issue is to wrap the things you want to log in curly braces. The console then logs both the name and the value of what you want to know about.

console.log({width})
console.log({height})

Using curly braces around variables in log messages logs their name and their value

Adding to your console vocabulary

Examples of warn, info and error messages and how they are displayed in the console

In addition to `console.log()` you have a lot more methods you can use. For example, `console.warn()` logs a warning, `console.info()` an informational message, and `console.error()` an error message. This not only results in slightly different displays in the console, but it also gives your messages a different log level, which makes it easier to filter for them.
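
For example, with some made-up messages, each method maps to its own log level like this:

// Each method gets its own log level, so the console filters can single them out later.
console.info('Cache warmed up');               // informational message
console.warn('API quota is at 90%');           // warning
console.error('Failed to load user profile');  // error, usually highlighted in red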

Errors and assertions in Console

The error method of console shows an error, and assert is a shortcut for an if statement with a console.log inside

Displaying an error in the console is different to throwing an error, but it still is a good idea to show the severity of an issue to the person maintaining or debugging the product. Another interesting method is `console.assert()`, which only logs a message when a certain condition is not met. Often you find yourself writing an `if` statement with a `console.log()` inside. Using `assert()` makes that `if` redundant, and you have one less thing to worry about when cleaning up your debugging code.
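
As a small sketch (the `cart` object here is made up purely for illustration), the two approaches compare like this:

// Instead of wrapping the log in a condition yourself...
if (cart.items.length === 0) {
  console.log('cart is empty');
}
// ...let assert() do the check: it only logs when the condition is falsy.
console.assert(cart.items.length > 0, 'cart is empty');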

Tracing where something came from

Example of using console.trace() to track back where a call came from

Often you find yourself adding a `console.log('called')` or similar to test if a certain functionality is even triggered. Once you have that, the next thing you normally want to find out is what called that method. That’s what `console.trace()` is for, as it doesn’t only tell you that something was called, but also where the call came from.
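
A minimal sketch (the function names are made up) of what `trace()` reports:

// trace() logs the message plus the call stack that led to it,
// in this case doSomething <- handleClick <- init.
function doSomething() {
  console.trace('doSomething was called');
}
function handleClick() { doSomething(); }
function init() { handleClick(); }
init();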

Grouping console messages

If you have a lot to log, you can use `console.group('name')` and `console.groupEnd()` to wrap the messages in collapsible and expandable groups in the Console. You can even define whether the groups should be expanded or collapsed by default.
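
A quick sketch with placeholder labels; `console.groupCollapsed()` starts a group that is folded by default:

// An expanded group...
console.group('Form validation');
console.log('email: ok');
console.log('password: too short');
console.groupEnd();

// ...and one that starts out collapsed.
console.groupCollapsed('Network details');
console.log('status: 200');
console.groupEnd();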

An example of defining groups in the console

Displaying and filtering lots of information in the console as tables

If you want to display a lot of information as a log, it can become daunting to read. The `console.table()` method displays array-like data as a table in the console, and you can filter what you want to display by giving it an array of the properties you want to see.

For example, you can use `let elms = document.querySelectorAll(':is(h1, p, script)')` to get all H1, paragraph and script elements from the document and `console.table(elms)` to display this information as a table. As the different elements have a boatload of attributes and properties, the resulting table is pretty unreadable. If you filter down to what you are interested in by using `console.table(elms, ['nodeName', 'innerText', 'offsetHeight'])` you get a table with only these properties and their values.
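
Put together as a runnable example you can paste into the Console on any page:

// Grab all h1, paragraph and script elements from the document...
let elms = document.querySelectorAll(':is(h1, p, script)');
// ...and show only the properties you care about as table columns.
console.table(elms, ['nodeName', 'innerText', 'offsetHeight']);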

Code example using console.table() and its filtering options

The table structure is maintained when you copy and paste this information, which makes it a wonderful tool to get data into Excel or Word, for example.

Blinging it up: $() and $$()

The console comes with a lot of convenience methods you can use, called the Console Utilities. Two very useful ones are `$()` and `$$()`, which are replacements for `document.querySelector()` and `document.querySelectorAll()` respectively. `$$()` not only returns the elements you expect, it casts the result to an array, which means you can use `map()` and `filter()` on the results directly. The following code would grab all the links of the current document and return an array with objects that contain only the `href` and `innerText` properties of each link as `url` and `text` properties.

$$('a').map(a => {
  return {url: a.href, text: a.innerText}
})

An example how the $$ function returns a collection of HTML elements that you can filter like any other array

2. You can log without source access – live expressions and logpoints
(Chromium browsers)

The normal way to add a `console.log()` is to put it inside your code at the place you want to get the information. But you can also get insights into code you can’t access and change. Live expressions are a great way to log information without changing your code. They are also great for logging values that change constantly, without flooding the console and thus slowing down your product. You can see the difference in the following screencast:

Logpoints are a special kind of breakpoint. You can right-click any line in a JavaScript file in the Sources tool of the Developer Tools and set a logpoint. You get asked to provide an expression you’d like to log, and you will get its value in the console when the line of code is executed. This means you can technically inject a `console.log()` anywhere on the web. I wrote about logpoints back in August and you can see a demo in the following screencast:

3. You can log outside the browser – VS Code debugger
(Chromium Browsers and VS Code)

When you start a debugging session in Visual Studio Code, you can spawn a browser instance, and the Debug Console becomes the Console you are used to from the browser developer tools. I blogged about this in July in detail, so you can read up there on how to do that. There is also more in the official documentation.

You can also watch this one minute video of me showing the functionality:

4. You can inject code into any site – snippets and overrides.
(Chromium Browsers)

Snippets are a way in Developer Tools to run a script against the current web site. You can use the Console Utilities in these scripts and it is a great way to write and store complex DOM manipulation scripts you normally execute in the Console. You can run your scripts in the window context of the current document either from the snippets editor or from the command menu. In the latter case, start your command with an ! and type the name of the snippet you want to run.
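
For example, a hypothetical snippet using the Console Utilities mentioned above could list all images on the current page that are missing alt text:

// Example snippet: find images without alt text.
// $$() is the Console Utilities shortcut that returns an array of matches.
const missingAlt = $$('img').filter(img => !img.getAttribute('alt'));
console.table(missingAlt, ['src']);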

Overrides allow you to store local copies of remote scripts and override them when the page loads. This is great if you have, for example, a slow build process for your whole application and you want to try something out. It is also a great tool to replace annoying scripts from third party web sites without having to use a browser extension.

5. You can inspect and debug much more than you know!
(Chromium Browsers)

You may know the Chromium developer tools from browsers like Google Chrome, Brave or Microsoft Edge, but they are available in a lot more environments. Any app that’s based on Electron can have them enabled and you can use the Tools to peek under the hood and see how the product was done. This works, for example, in GitHub Desktop, Visual Studio Code, or you can even debug the Developer Tools of the browser using Developer Tools!

If you inspect the Developer Tools, you will see that they are written in HTML, CSS and TypeScript. It is an exciting environment to use these technologies, as you know the rendering engine your code will run in – something you never know on the web.

Inspecting the Chromium Developer tools with another instance of the developer tools

Edge Developer Tools in Visual Studio Code
(Microsoft Edge via a VS Code extension)

The embeddable nature of the tools also allowed us to offer you a way to use them outside the browser. The Microsoft Edge Tools for Visual Studio Code extension brings the tools to Visual Studio Code. That way you can use the visual debugging tools right next to your code editor and you don’t need to jump between the two all the time. This also ties in with the “Console in Visual Studio Code” trick mentioned earlier. When you start a debugging session and you click the Developer Tools icon, the tools will open or – the first time – you will be prompted to install the extension.

Inspect button in the debug bar of Visual Studio Code

Microsoft Edge Developer tools open in an instance of Visual Studio Code

6. Some dirty secrets…

Working intimately with developer tools and getting feedback and usage information taught me a few dirty secrets. The first one is that whilst we are all super excited about all the amazing features of developer tools, users only use a very small percentage of them. Many things heralded as the best thing since sliced bread in presentations and video tutorials are hardly ever opened, let alone used. I thought this was because of a lack of documentation, and we spent a massive amount of time updating the DevTools documentation to ensure everything in the tools is described and explained, but that wasn’t it. Documentation is something people seem to go to as a last resort when they are stuck and Google/Stack Overflow/social channels didn’t yield any results.

Developer tools have become complex and are overwhelming – a few ideas how to fix that
(Microsoft Edge)

The plain fact might be that the Developer Tools of browsers grew organically over the years and can be incredibly overwhelming to look at. That bothers me, and I think we should do better. Here’s my mantra when it comes to tools for developers:

Developer tools should not expect people to be experts but turn them into experts over time.

We’re working on a few ideas to make that easier, and you will soon see those in Microsoft Edge. One idea we had is a “Focus Mode”. Instead of showing you all the tools and tabs, we sorted the tools into different use cases, like “Elements/CSS debugging”, “Sources/JavaScript Debugging” or “Network inspection”. We then show only the relevant tools and hide all the ones that may be confusing or in the way.

Developer tools in focus mode, showing only what's needed in the current context

Another feature we are working on is “informational overlays”. You get a help button that allows you to turn on overlays for the developer tools, explaining what each of the tools is, how to use it, and providing links to the documentation. We hope that this will make it easier for people to learn about more features.

Developer tools covered by overlays explaining what each of them is

There is still a disconnect between authoring code and debugging the outcome
(Microsoft Edge)

Whilst it is amazing what tools provide us with these days, there is still a disconnect between authoring and debugging. Most of the time we write our code, create the app, and then go to the browser to see what doesn’t work. We then use the browser developer tools to tweak and fix these issues. And then comes the big issue we still need to fix: how do you get the changes you created using the browser developer tools back into your code? Most of the time, the answer is “copy and paste or try to remember what needs changing”.

We’re currently working on two ways to make this easier. One is to replace the in-devtools editor with Visual Studio Code when it is available and to change files on the hard drive as you use the browser developer tools. The other is part of the VS Code extension and changes the source code in the editor as you use the developer tools but still gives you the final say in changing the file on disk. I described the problem and the possible solutions on the Edge blog or you can watch the following two screencasts to see them in action.

CSS Mirroring in Visual Studio Code:

What if… Visual Studio Code became the editor of in-browser Developer Tools?

7. You’re the audience and the clients of Developer Tools!
(Applies to all browsers, but channels shown here are Microsoft Edge only)

As a developer, you are the main audience for Developer Tools. We are open to your feedback, and many of the recent changes to the tools are direct results of requests from outside developers. We try to make this as easy as possible by providing in-context ways to contact us directly. For example, the Visual Studio Code extension has prominent links and buttons for you to report issues and request features.

Screenshot of the in-context links provided in the VS Code extension to request new features, file bugs and learn about experiments

The source code of the extension is also on GitHub and you can file issues there.

The in-browser developer tools also have a direct button to give us feedback. To make it easier for you to provide actionable feedback, the button includes a lot of automatic information.

The feedback tool built into the browser developer tools of Microsoft Edge

It automatically records the URL the issue happened on, takes a screenshot to include, and offers to send diagnostic data. We also ask you to provide an email address in case we need more information, and you can add attachments and information on how to recreate the issue. We check this feedback daily, and a lot of great improvements and bug fixes came from that source.

Assess resource consumption with Ansible callback plugins

When you need metrics about playbook execution, callback plugins can help you drill down into the data and troubleshoot issues.
Alexon Oliveira Thu, 10/7/2021 at 4:50pm

You can use Ansible callback plugins to get detailed information about your playbooks' metrics and machine resource consumption. Having this knowledge and the associated tools can be very useful for troubleshooting and getting a deeper understanding of Ansible plugins.

55 GiB/s FizzBuzz

This program is most conveniently built using gcc. Save it as fizzbuzz.S (that's a capital S as the extension), and build using the commands

gcc -mavx2 -c fizzbuzz.S
ld -o fizzbuzz fizzbuzz.o

Run as ./fizzbuzz piped into one command, e.g. ./fizzbuzz | pv > /dev/null (as suggested in the question), ./fizzbuzz | cat, or ./fizzbuzz | less. To simplify the I/O, this will not work (producing an error on startup) if you try to output to a file/terminal/device rather than a pipe. Additionally, this program may produce incorrect output if piped into two commands (e.g. ./fizzbuzz | pv | cat > fizzbuzz.txt), but only in the case where the middle command uses the splice system call; this is either a bug in Linux (very possible with system calls this obscure!) or a mistake in the documentation of the system calls in question (also possible). However, it should work correctly for the use case in the question, which is all that matters on CGCC.

This program is somewhat system-specific; it requires the operating system to be a non-ancient version of Linux, and the processor to be an x86-64 implementation that supports AVX2. (Most moderately recent processors by Intel and AMD have AVX2 support, including the Ryzen 9 mentioned in the question, and almost all use the x86-64 instruction set.) However, it avoids assumptions about the system it's running on beyond those mentioned in the header, so there's a decent chance that if you can run Linux, you can run this.

The program outputs a quintillion lines of FizzBuzz and then exits (going further runs into problems related to the sizes of registers). This would take tens of years to accomplish, so hopefully counts as "a very high astronomical number" (although it astonishes me that it's a small enough timespan that it might be theoretically possible to reach a number as large as a quintillion without the computer breaking).

As a note: this program's performance is dependent on whether it and the program it outputs to are running on sibling CPUs or not, something which will be determined arbitrarily by the kernel when you start it. If you want to compare the two possible timings, use taskset to force the programs onto particular CPUs: taskset 1 ./fizzbuzz | taskset 2 pv > /dev/null versus taskset 1 ./fizzbuzz | taskset 4 pv > /dev/null. (The former will probably run faster, but might be slower on some CPU configurations.)

Discussion

I've spent months working on this program. I've long thought that "how fast can you make a FizzBuzz" would be a really interesting question for learning about high-performance programming, and when I subsequently saw this question posted on CGCC, I pretty much had to try.

This program aims for the maximum possible single-threaded performance. In terms of the FizzBuzz calculation itself, it is intended to sustain a performance of 64 bytes of FizzBuzz per 4 clock cycles (and is future-proofed where possible to be able to run faster if the relevant processor bottleneck – L2 cache write speed – is ever removed). This is faster than a number of standard functions. In particular, it's faster than memcpy, which presents interesting challenges when it comes to I/O (if you try to output using write then the copies in write will take up almost all the runtime – replacing the I/O routine here with write causes the performance on my CPU to drop by a factor of 5). As such, I needed to use much more obscure system calls to keep I/O-related copies to a minimum (in particular, the generated FizzBuzz text is only sent to main memory if absolutely necessary; most of the time it's stored in the processor's L2 cache and piped into the target program from there, which is why reading it from a sibling CPU can boost performance – the physical connection to the L2 cache is shorter and higher bandwidth than it would be to a more distant CPU).

On my computer (which has a fairly recent, but not particularly powerful, Intel processor), this program generates around 31GiB of FizzBuzz per second. I'll be interested to see how it does on the OP's computer.

I did experiment with multithreaded versions of the program, but was unable to gain any speed. Experiments with simpler programs show that it could be possible, but any gains may be small; the cost of communication between CPUs is sufficiently high to negate most of the gains you could get by doing work in parallel, assuming that you only have one program reading the resulting FizzBuzz (and anything that writes to memory will be limited by the write speed of main memory, which is slower than the speed with which the FizzBuzz can be generated).

The program

This isn't code-golf, so my explanation of the program and its algorithm is given as comments in the program itself. (I still had to lightly golf the program, and especially the explanation, to fit this post within the 65536 byte size limit.)

The program is written in a "literate" assembly style; it will be easiest to understand if you read it in order, from start to end. (I also added a number of otherwise useless line labels to separate the program into logical groups of instructions, in order to make the disassembly easier to read, if you're one of the people who prefers to read assembly code like that.)

.intel_syntax prefix

// Header files.
#include <asm/errno.h>
#include <asm/mman.h>
#include <asm/unistd.h>
#define F_SETPIPE_SZ 1031 // not in asm headers, define it manually

// The Linux system call API (limited to 4 arguments, the most this
// program uses). 64-bit registers are unsuffixed; 32-bit have an "e"
// suffix.
#define ARG1 %rdi
#define ARG1e %edi
#define ARG2 %rsi
#define ARG2e %esi
#define ARG3 %rdx
#define ARG3e %edx
#define ARG4 %r10
#define ARG4e %r10d
#define SYSCALL_RETURN %rax
#define SYSCALL_RETURNe %eax
#define SYSCALL_NUMBER %eax

// %rax, %rcx, %rdx, %ymm0-3 are general-purpose temporaries. Every
// other register is used for just one or two defined purposes; define
// symbolic names for them for readability. (Bear in mind that some of
// these will be clobbered sometimes, e.g. OUTPUT_LIMIT is clobbered
// by `syscall` because it's %r11.)
#define OUTPUT_PTR %rbx
#define BYTECODE_IP %rbp
#define SPILL %rsi
#define BYTECODE_GEN_PTR %rdi
#define REGEN_TRIGGER %r8
#define REGEN_TRIGGERe %r8d
#define YMMS_AT_WIDTH %r9
#define YMMS_AT_WIDTHe %r9d
#define BUZZ %r10
#define BYTECODE_NEG_LEN %r10
#define FIZZ %r11
#define FIZZe %r11d
#define OUTPUT_LIMIT %r11
#define BYTECODE_END %r12
#define BYTECODE_START %r13
#define BYTECODE_STARTe %r13d
#define PIPE_SIZE %r13
#define LINENO_WIDTH %r14
#define LINENO_WIDTHe %r14d
#define GROUPS_OF_15 %r15
#define GROUPS_OF_15e %r15d
#define LINENO_LOW %ymm4
#define LINENO_MID %ymm5
#define LINENO_MIDx %xmm5
#define LINENO_TOP %ymm6
#define LINENO_TOPx %xmm6
#define LINENO_MID_TEMP %ymm7
#define ENDIAN_SHUFFLE %ymm8
#define ENDIAN_SHUFFLEx %xmm8
#define LINENO_LOW_INCR %ymm9
#define LINENO_LOW_INCRx %xmm9

// The last six vector registers are used to store constants, to avoid
// polluting the cache by loading their values from memory.
#define LINENO_LOW_INIT %ymm10
#define LINENO_MID_BASE %ymm11
#define LINENO_TOP_MAX %ymm12
#define ASCII_OFFSET %ymm13
#define ASCII_OFFSETx %xmm13
#define BIASCII_OFFSET %ymm14
#define BASCII_OFFSET %ymm15


// Global variables.
.bss
.align 4 << 20
// The most important global variables are the IO buffers. There are
// two of these, each with 2MiB of memory allocated (not all of it is
// used, but putting them 2MiB apart allows us to simplify the page
// table; this gives a 30% speedup because page table contention is
// one of the main limiting factors on the performance).
io_buffers:
.zero 2 * (2 << 20)
// The remaining 2MiB of memory stores everything else:
iovec_base:          // I/O config buffer for vmsplice(2) system call
.zero 16
error_write_buffer:  // I/O data buffer for write(2) system call
.zero 1
.p2align 9,0
bytecode_storage:    // the rest is a buffer for storing bytecode
.zero (2 << 20) - 512


// The program starts here. It doesn't use the standard library (or
// indeed any libraries), so the start point is _start, not main.
.text
.globl _start
_start:

// This is an AVX2 program, so check for AVX2 support by running an
// AVX2 command. This is a no-op, but generates SIGILL if AVX2 isn't
// supported.
vpand %ymm0, %ymm0, %ymm0

// Initialize constant registers to their constant values.
vmovdqa LINENO_LOW_INIT, [%rip + lineno_low_init]
vmovdqa LINENO_MID_BASE, [%rip + lineno_mid_base]
vmovdqa LINENO_TOP_MAX, [%rip + lineno_top_max]
vmovdqa ASCII_OFFSET, [%rip + ascii_offset]
vmovdqa BIASCII_OFFSET, [%rip + biascii_offset]
vmovdqa BASCII_OFFSET, [%rip + bascii_offset]

// Initialize global variables to their initial values.
vmovdqa ENDIAN_SHUFFLE, [%rip + endian_shuffle_init]
vmovdqa LINENO_TOP, [%rip + lineno_top_init]

// Check the size of the L2 cache.
//
// This uses the CPUID interface. To use it safely, check what range
// of command numbers is legal; commands above the legal range have
// undefined behaviour, commands within the range might not be
// implemented but will return all-zeros rather than undefined values.
// CPUID clobbers a lot of registers, including some that are normally
// call-preserved, so this must be done first.
mov %eax, 0x80000000 // asks which CPUID extended commands exist
cpuid                // returns the highest supported command in %eax
cmp %eax, 0x80000006 // does 0x80000006 give defined results?
jb bad_cpuid_error

mov %eax, 0x80000006 // asks about the L2 cache size
cpuid                // returns size in KiB in the top half of %ecx
shr %ecx, 16
jz bad_cpuid_error   // unsupported commands return all-0s

// Calculate the desired pipe size, half the size of the L2 cache.
// This value is chosen so that the processor can hold a pipeful of
// data being output, plus a pipeful of data being calculated, without
// needing to resort to slow L3 memory operations.
shl %ecx, 10 - 1     // convert KiB to bytes, then halve
mov PIPE_SIZE, %rcx

// Ask the kernel to resize the pipe on standard output.
mov ARG1e, 1
mov ARG2e, F_SETPIPE_SZ
mov ARG3e, %ecx
mov SYSCALL_NUMBER, __NR_fcntl
syscall
cmp SYSCALL_RETURNe, -EBADF
je pipe_error
cmp SYSCALL_RETURNe, -EPERM
je pipe_perm_error
call exit_on_error
cmp SYSCALL_RETURN, PIPE_SIZE
jne pipe_size_mismatch_error

// Ask the kernel to defragment the physical memory backing the BSS
// (read-write data) segment. This simplifies the calculations needed
// to find physical memory addresses, something that both the kernel
// and processor would otherwise spend a lot of time doing, and
// speeding the program up by 30%.
lea ARG1, [%rip + io_buffers]
mov ARG2e, 3 * (2 << 20)
mov ARG3e, MADV_HUGEPAGE
mov SYSCALL_NUMBER, __NR_madvise
syscall
call exit_on_error

// From now on, OUTPUT_PTR is permanently set to the memory location
// where the output is being written. This starts at the start of the
// first I/O buffer.
lea OUTPUT_PTR, [%rip + io_buffers]


///// First phase of output
//
// The FizzBuzz output is produced in three distinct phases. The first
// phase is trivial; just a hardcoded string, that's left in the
// output buffer, to be output at the end of the second phase.

first_phase:

.section .rodata
fizzbuzz_intro:
.ascii "1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\n"
.text
vmovdqu %ymm0, [%rip + fizzbuzz_intro]
vmovdqu [OUTPUT_PTR], %ymm0
add OUTPUT_PTR, 30


///// Second phase of output
//
// This is a routine implementing FizzBuzz in x86-64+AVX2 assembler in
// a fairly straightforward and efficient way. This isn't as fast as
// the third-phase algorithm, and can't handle large numbers, but will
// introduce some of the basic techniques this program uses.

second_phase_init:

// The outer loop of the whole program breaks the FizzBuzz output into
// sections where all the line numbers contain the same number of
// digits. From now on, LINENO_WIDTH tracks the number of digits in
// the line number. This is currently 2; it ranges from 2-digit
// numbers to 18-digit numbers, and then the program ends.
mov LINENO_WIDTHe, 2

// GROUPS_OF_15 is permanently set to the number of groups of 15 lines
// that exist at this line number width; it's multiplied by 10 whenever
// LINENO_WIDTH is incremented.
//
// A general note about style: often the program uses numbers that are
// statically known to fit into 32 bits, even in a register that's
// conceptually 64 bits wide (like this one). In such cases, the
// 32-bit and 64-bit versions of a command will be equivalent (as the
// 32-bit version zero-extends to 64-bits on a 64-bit processor); this
// program generally uses the 32-bit version, both because it
// sometimes encodes to fewer bytes (saving cache pressure), and
// because some processors recognise zeroing idioms only if they're 32
// bits wide.
mov GROUPS_OF_15e, 6

// Some constants used throughout the second phase, which permanently
// stay in their registers. Note that short string literals can be
// stored in normal integer registers - the processor doesn't care.
mov FIZZ, 0x0a7a7a6946  // "Fizz\n"
mov BUZZ, 0x0a7a7a7542  // "Buzz\n"

.section .rodata
.p2align 5, 0
second_phase_constants:
.byte 0, 0, 0, 0, 0, 0, 0, 0
.byte 1, 0, 0, 0, 0, 0, 0, 0
.text
vmovdqa %xmm3, [%rip + second_phase_constants]

// This program makes extensive use of a number format that I call
// "high-decimal". This is a version of decimal where the digit 0 is
// encoded as the byte 246, the digit 1 as the byte 247, ..., the
// digit 9 as the byte 255. The bytes are stored in the normal
// endianness for the processor (i.e. least significant first), and
// padded to a known length (typically 8 digits) with leading zeroes.
//
// The point of high-decimal is that it allows us to use arithmetic
// operators intended for binary on high-decimal numbers, and the
// carries will work the same way (i.e. the same digits will carry,
// although carries will be 0-based rather than 246-based); all that's
// required is to identify the digits that carried and add 246 to
// them. That means that the processor's binary ALU can be used to do
// additions directly in decimal - there's no need for loops or
// anything like that, and no need to do binary/decimal conversions.
//
// The first use for high-decimal is to store the line number during
// the second phase (it's stored differently in the third phase).
// It's stored in the top half of %xmm1 (although it's only 64 bits
// wide, it needs to be in a vector register so that it can be
// interpreted as 8 x 8 bits when necessary; general-purpose registers
// can't do that). The bottom half of %xmm1 is unused, and frequently
// overwritten with arbitrary data.
.section .rodata
line_number_init:
#define REP8(x) x,x,x,x,x,x,x,x
.byte REP8(0)
.byte 246, 247, 246, 246, 246, 246, 246, 246
.text
vmovdqa %xmm1, [%rip + line_number_init]

// Writing line numbers is nontrivial because x86-64 is little-endian
// but FizzBuzz output is big-endian; also, leading zeroes aren't
// allowed. ENDIAN_SHUFFLE is used to fix both these problems; when
// used to control the vector shuffler, it reverses the order of a
// vector register, and rotates the elements to put the first digit
// (based on LINENO_WIDTH) into the first byte. (This method is used
// by both the second and third phases; the second phase uses only the
// bottom half, with the top half used by the third phase, but they
// are both initialized together.)
.section .rodata
endian_shuffle_init:
.byte 9, 8, 7, 6, 5, 4, 3, 2
.byte 1, 0, 255, 254, 253, 252, 251, 250
.byte 3, 2, 1, 0, 255, 254, 253, 252
.byte 251, 250, 249, 248, 247, 246, 245, 244
.text


second_phase_per_width_init:

// The second phase writing routines are macros.
//
// Fizz and Buzz are trivial. (This writes a little beyond the end of
// the string, but that's OK; the next line will overwrite them.)
#define WRITE_FIZZ   mov [OUTPUT_PTR], FIZZ; add OUTPUT_PTR, 5
#define WRITE_BUZZ   mov [OUTPUT_PTR], BUZZ; add OUTPUT_PTR, 5

// For FizzBuzz, output 32 bits of FIZZ to write "Fizz" with no
// newline, then write a "Buzz" after that.
#define WRITE_FIZZBUZZ \
  mov [OUTPUT_PTR], FIZZe; mov [OUTPUT_PTR + 4], BUZZ; \
  add OUTPUT_PTR, 9

// To write a line number, add 58 to each byte of the line number
// %xmm1, fix the endianness and width with a shuffle, and write a
// final newline.
.section .rodata
ascii_offset:
.byte REP8(58), REP8(58), REP8(58), REP8(58)
.text
#define WRITE_LINENO \
  vpaddb %xmm0, ASCII_OFFSETx, %xmm1; \
  vpshufb %xmm0, %xmm0, ENDIAN_SHUFFLEx; \
  vmovdqu [OUTPUT_PTR], %xmm0; \
  lea OUTPUT_PTR, [OUTPUT_PTR + LINENO_WIDTH + 1]; \
  mov byte ptr [OUTPUT_PTR - 1], 10  // 10 = newline

// Incrementing the line number is fairly easy: add 1 (in the usual
// binary notation, taken from %xmm3) to the high-decimal number, then
// convert any bytes that produced a carry to high-decimal 0s by
// max-ing with 246.
//
// Normally I'd use a separate constant for this, but there randomly
// happens to be an %xmm register with 246s in its top half already
// (it's intended for an entirely different purpose, but it'll do for
// this one too).
#define INC_LINENO \
  vpaddq %xmm1, %xmm3, %xmm1; vpmaxub %xmm1, LINENO_TOPx, %xmm1

// Avoid modulus tests by unrolling the FizzBuzz by 15. (Bear in mind
// that this starts at 10, not 0, so the pattern will have a different
// phase than usual.)
mov %ecx, GROUPS_OF_15e
fifteen_second_phase_fizzbuzz_lines:
WRITE_BUZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_FIZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_FIZZBUZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_FIZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_BUZZ; INC_LINENO
WRITE_FIZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_FIZZ; INC_LINENO
dec %ecx
jnz fifteen_second_phase_fizzbuzz_lines

second_phase_increment_width:

lea GROUPS_OF_15e, [GROUPS_OF_15 + GROUPS_OF_15 * 4]
add GROUPS_OF_15e, GROUPS_OF_15e
inc LINENO_WIDTHe

// Increment every element of the low half of ENDIAN_SHUFFLE to
// adjust it for the new width, while leaving the top half unchanged.
vpcmpeqb %xmm0, %xmm0, %xmm0
vpsubb ENDIAN_SHUFFLE, ENDIAN_SHUFFLE, %ymm0

// The second phase handles line numbers with 2 to 5 digits.
cmp LINENO_WIDTHe, 6
jne second_phase_per_width_init

///// The output routine
//
// Most FizzBuzz routines produce output with `write` or a similar
// system call, but these have the disadvantage that they need to copy
// the data being output from userspace into kernelspace. It turns out
// that when running full speed (as seen in the third phase), FizzBuzz
// actually runs faster than `memcpy` does, so `write` and friends are
// unusable when aiming for performance - this program runs five times
// faster than an equivalent that uses `write`-like system calls.
//
// To produce output without losing speed, the program therefore needs
// to avoid copies, or at least do them in parallel with calculating
// the next block of output. This can be accomplished with the
// `vmsplice` system call, which tells the kernel to place a reference
// to a buffer into a pipe (as opposed to copying the data into the
// pipe); the program at the other end of this pipe will then be able
// to read the output directly out of this program's memory, with no
// need to copy the data into kernelspace and then back into
// userspace. In fact, it will be reading out of this program's
// processor's L2 cache, without main memory being touched at all;
// this is the secret to high-performance programming, because the
// cache is much faster than main memory is.
//
// Of course, it's therefore important to avoid changing the output
// buffer until the program connected to standard output has actually
// read it all. This is why the pipe size needed to be set earlier; as
// long as the amount of output is always at least as large as the
// pipe size, successfully outputting one buffer will ensure that none
// of the other buffer is left in the pipe, and thus it's safe to
// overwrite the memory that was previously output. There is some need
// to jump through hoops later on to make sure that `swap_buffers` is
// never called with less than one pipeful of data, but it's worth it
// to get the huge performance boost.

mov %rdx, OUTPUT_PTR
and %edx, (2 << 20) - 1

call swap_buffers
jmp third_phase_init

// Takes the amount of data to output in %rdx, and outputs from the
// buffer containing OUTPUT_PTR.
swap_buffers:
and OUTPUT_PTR, -(2 << 20)  // rewind to the start of the buffer
mov [%rip + iovec_base], OUTPUT_PTR
mov [%rip + iovec_base + 8], %rdx
mov ARG1e, 1
lea ARG2, [%rip + iovec_base]
mov ARG3e, 1
xor ARG4e, ARG4e

// As with most output commands, vmsplice can do a short write
// sometimes, so it needs to be called in a loop in order to ensure
// that all the output is actually sent.
1: mov SYSCALL_NUMBER, __NR_vmsplice
syscall
call exit_on_error
add [ARG2], SYSCALL_RETURN
sub [ARG2 + 8], SYSCALL_RETURN
jnz 1b

xor OUTPUT_PTR, (2 << 20)  // swap to the other buffer
ret


///// Third phase of output
//
// This is the heart of this program. It aims to be able to produce a
// sustained output rate of 64 bytes of FizzBuzz per four clock cycles
// in its main loop (with frequent breaks to do I/O, and rare breaks
// to do more expensive calculations).
//
// The third phase operates primarily using a bytecode interpreter; it
// generates a program in "FizzBuzz bytecode", for which each byte of
// bytecode generates one byte of output. The bytecode language is
// designed so that it can be interpreted using SIMD instructions; 32
// bytes of bytecode can be loaded from memory, interpreted, and have
// its output stored back into memory using just four machine
// instructions. This makes it possible to speed up the FizzBuzz
// calculations by hardcoding some of the calculations into the
// bytecode (this is similar to how JIT compilers can create a version
// of the program with some variables hardcoded, and throw it away on
// the rare occasions that those variables' values change).

third_phase_init:

// Reinitialize ENDIAN_SHUFFLE by copying the initializer stored in
// its high half to both halves. This works in the same way as in the
// second phase.
vpermq ENDIAN_SHUFFLE, ENDIAN_SHUFFLE, 0xEE

// Up to this point, PIPE_SIZE has held the size of the pipe. In order
// to save on registers, the pipe size is from now on encoded via the
// location in which the bytecode program is stored; the bytecode is
// started at iovec_base + PIPE_SIZE (which will be somewhere within
// bytecode_storage), so the same register can be used to find the
// bytecode and to remember the pipe size.
lea %rax, [%rip + iovec_base]
add BYTECODE_START, %rax  // BYTECODE_START is a synonym for PIPE_SIZE

// The bytecode program always holds instructions to produce exactly
// 600 lines of FizzBuzz. At width 6, those come to 3800 bytes long.
lea BYTECODE_END, [BYTECODE_START + 3800]

mov REGEN_TRIGGER, -1  // irrelevant until much later, explained there


third_phase_per_width_init:

// Calculate the amount of output at this LINENO_WIDTH. The result
// will always be divisible by 32, and thus is stored as the number of
// 32-byte units at this width; storing it in bytes would be more
// convenient, but sadly would overflow a 64-bit integer towards the
// end of the program.
lea %ecx, [LINENO_WIDTH * 8 + 47]   // bytes per 15 lines
mov YMMS_AT_WIDTH, GROUPS_OF_15
shr YMMS_AT_WIDTH, 5   // to avoid overflow, divide by 32 first
imul YMMS_AT_WIDTH, %rcx

// This program aims to output 64 bytes of output per four clock
// cycles, which it achieves via a continuous stream of 32-byte writes
// calculated by the bytecode program. One major complication here is
// that the 32-byte writes won't correspond to lines of FizzBuzz; a
// single processor instruction may end up outputting multiple
// different line numbers. So it's no longer possible to have a simple
// line number register, like it was in the second phase.
//
// Instead, the program stores an *approximation* of the line number,
// which is never allowed to differ by 100 or more from the "actual"
// line number; the bytecode program is responsible for fixing up the
// approximation to work out the correct line number to output (this
// allows the same CPU instruction to output digits from multiple
// different line numbers, because the bytecode is being interpreted
// in a SIMD way and thus different parts of the bytecode can fix the
// line number up differently within a single instruction).
//
// The line number is split over three processor registers:
// - LINENO_LOW: stores the line number modulo 200
// - LINENO_MID: stores the hundreds to billions digits
// - LINENO_TOP: stores the ten-billions and more significant digits
// (The parity of the 100s digit is duplicated between LINENO_MID and
// LINENO_LOW; this allows a faster algorithm for LINENO_MID updates.)
//
// Because there's only a need to be within 100 of the real line
// number, the algorithm for updating the line numbers doesn't need to
// run all that often (saving processor cycles); it runs once every
// 512 bytes of output, by simply adding a precalculated value
// (LINENO_LOW_INCR) to LINENO_LOW, then processing the carry to
// LINENO_MID (see later for LINENO_TOP). The amount by which the line
// number increases per 512 bytes of output is not normally going to
// be an integer; LINENO_LOW is therefore stored as a 64-bit fixpoint
// number (in which 2**64 represents "200", e.g. 2**63 would be the
// representation of "the line number is 100 mod 200"), in order to
// delay the accumulation of rounding errors as long as possible. It's
// being stored in a vector register, so there are four copies of its
// value; two of them have 50 (i.e 2**62) added, and two of them have
// 50 subtracted, in order to allow for more efficient code to handle
// the carry to LINENO_MID. Additionally, LINENO_LOW is interpreted as
// a signed number (an older version of this program was better at
// checking for signed than unsigned overflow and I had no reason to
// change).
//
// LINENO_LOW and LINENO_MID are reset every LINENO_WIDTH increase
// (this is because the program can calculate "past" the width
// increase due to not being able to break out of every instruction of
// the main loop, which may cause unwanted carries into LINENO_MID and
// force a reset).

.section .rodata
lineno_low_init:
.byte 0, 0, 0, 0, 0, 0, 0, 192
.byte 0, 0, 0, 0, 0, 0, 0, 64
.byte 0, 0, 0, 0, 0, 0, 0, 192
.byte 0, 0, 0, 0, 0, 0, 0, 64
.text
vmovdqa LINENO_LOW, LINENO_LOW_INIT

// %ecx is the number of bytes in 15 lines. That means that the number
// of 200-line units in 512 bytes is 38.4/%ecx, i.e. 384/(%ecx*10).
// Multiply by 2**64 (i.e. 384*2**64/(%ecx*10) to get LINENO_LOW_INCR.
lea %ecx, [%rcx + %rcx * 4]
add %ecx, %ecx
mov %edx, 384
xor %eax, %eax
div %rcx  // 128-bit divide, %rax = %rdx%rax / %rcx
vpxor LINENO_LOW_INCR, LINENO_LOW_INCR, LINENO_LOW_INCR
vpinsrq LINENO_LOW_INCRx, LINENO_LOW_INCRx, %rax, 0
vpermq LINENO_LOW_INCR, LINENO_LOW_INCR, 0

// LINENO_MID is almost stored in high-decimal, as four eight-digit
// numbers. However, the number represented is the closest line number
// that's 50 mod 100, stored as the two closest multiples of 100 (e.g.
// if the true line number is 235, it's approximated as 250 and then
// stored using the representations for 200 and 300), which is why
// LINENO_LOW needs the offsets of 50 and -50 to easily do a carry. A
// ymm vector holds four 64-bit numbers, two of which hold the value
// that's 0 mod 200, two which hold the value that's 100 mod 200. So
// carries on it are handled using a vector of mostly 246s, with 247s
// in the two locations which are always odd.
.section .rodata
lineno_mid_base:
.byte 246, 246, 246, 246, 246, 246, 246, 246
.byte 247, 246, 246, 246, 246, 246, 246, 246
.byte 246, 246, 246, 246, 246, 246, 246, 246
.byte 247, 246, 246, 246, 246, 246, 246, 246
.text

// This code is some fairly complex vector manipulation to initialise
// LINENO_MID to a power of 10 (handling the case where LINENO_WIDTH
// is so high that the hundreds to billions digits are all zeroes).
mov %edx, 1
mov %eax, 11
sub %eax, LINENO_WIDTHe
cmovbe %eax, %edx
shl %eax, 3
vpxor %xmm0, %xmm0, %xmm0
vpinsrq %xmm0, %xmm0, %rax, 0
vpermq %ymm0, %ymm0, 0
vpcmpeqb LINENO_MID, LINENO_MID, LINENO_MID
vpsrlq LINENO_MID, LINENO_MID, %xmm0
vpmaxub LINENO_MID, LINENO_MID_BASE, LINENO_MID
vpermq %ymm0, LINENO_MID_BASE, 0x55
vpsubb %ymm0, %ymm0, LINENO_MID_BASE
vpaddq LINENO_MID, LINENO_MID, %ymm0
vpmaxub LINENO_MID, LINENO_MID_BASE, LINENO_MID

// LINENO_TOP doesn't need to be initialized for new widths, because
// an overrun by 100 lines is possible, but by 10 billion lines isn't.
// The format consists of two 64-bit sections that hold high-decimal
// numbers (these are always the same as each other), and two that
// hold constants that are used by the bytecode generator.
.section .rodata
lineno_top_init:
.byte 198, 197, 196, 195, 194, 193, 192, 191
.byte 246, 246, 246, 246, 246, 246, 246, 246
.byte 190, 189, 188, 187, 186, 185, 184, 183
.byte 246, 246, 246, 246, 246, 246, 246, 246
.text

// When moving onto a new width, start at the start of the bytecode
// program.
mov BYTECODE_IP, BYTECODE_START


// Generating the bytecode program
//
// The bytecode format is very simple (in order to allow it to be
// interpreted in just a couple of machine instructions):
// - A negative byte represents a literal character (e.g. to produce
//   a literal 'F', you use the bytecode -'F', i.e. -70 = 0xba)
// - A byte 0..7 represents the hundreds..billions digit of the line
//   number respectively, and asserts that the hundreds digit of the
//   line number is even
// - A byte 8..15 represents the hundreds..billions digit of the line
//   number respectively, and asserts that the hundreds digit of the
//   line number is odd
//
// In other words, the bytecode program only ever needs to read from
// LINENO_MID; the information stored in LINENO_LOW and LINENO_TOP
// therefore has to be hardcoded into it. The program therefore needs
// to be able to generate 600 lines of output (as the smallest number
// that's divisible by 100 to be able to hardcode the two low digits,
// 200 to be able to get the assertions about the hundreds digits
// correct, and 3 and 5 to get the Fizzes and Buzzes in the right
// place).

generate_bytecode:

mov BYTECODE_GEN_PTR, BYTECODE_START

// FIZZ and BUZZ work just like in the second phase, except that they
// are now bytecode programs rather than ASCII.
mov FIZZ, 0xf6868697ba  // -"Fizz\n"
mov BUZZ, 0xf686868bbe  // -"Buzz\n"

// %ymm2 holds the bytecode for outputting the hundreds and more
// significant digits of a line number. The most significant digits of
// this can be obtained by converting LINENO_TOP from high-decimal to
// the corresponding bytecode, which is accomplished by subtracting
// from 198 (i.e. 256 - 10 - '0'). The constant parts of LINENO_TOP
// are 198 minus the bytecode for outputting the hundreds to billions
// digit of a number; this makes it possible for a single endian
// shuffle to deal with all 16 of the mid and high digits at once.
.section .rodata
bascii_offset:
.byte REP8(198), REP8(198), REP8(198), REP8(198)
.text
vpsubb %ymm2, BASCII_OFFSET, LINENO_TOP
vpshufb %ymm2, %ymm2, ENDIAN_SHUFFLE

#define GEN_FIZZ  mov [BYTECODE_GEN_PTR], FIZZ; add BYTECODE_GEN_PTR, 5
#define GEN_BUZZ  mov [BYTECODE_GEN_PTR], BUZZ; add BYTECODE_GEN_PTR, 5
#define GEN_FIZZBUZZ \
  mov [BYTECODE_GEN_PTR], FIZZe; \
  mov [BYTECODE_GEN_PTR + 4], BUZZ; add BYTECODE_GEN_PTR, 9
#define GEN_LINENO(units_digit) \
  vmovdqu [BYTECODE_GEN_PTR], %xmm2; \
  lea BYTECODE_GEN_PTR, [BYTECODE_GEN_PTR + LINENO_WIDTH + 1]; \
  mov [BYTECODE_GEN_PTR - 3], %al; \
  mov word ptr [BYTECODE_GEN_PTR - 2], 0xf6d0 - units_digit

// The bytecode generation loop is unrolled to depth 30, allowing the
// units digits to be hardcoded. The tens digit is stored in %al, and
// incremented every ten lines of output. The parity of the hundreds
// digit is stored in %ymm2: one half predicts the hundreds digit to
// be even, the other to be odd, and the halves are swapped every time
// the tens digit carries (ensuring the predictions are correct).
mov %eax, 0xd0
jmp 2f
inc_tens_digit:
cmp %al, 0xc7
je 1f  // jumps every 10th execution, therefore predicts perfectly
dec %eax
ret
1: mov %eax, 0xd0
vpermq %ymm2, %ymm2, 0x4e
ret

2: mov %ecx, 20
thirty_bytecode_lines:
GEN_BUZZ
GEN_LINENO(1)
GEN_FIZZ
GEN_LINENO(3)
GEN_LINENO(4)
GEN_FIZZBUZZ
GEN_LINENO(6)
GEN_LINENO(7)
GEN_FIZZ
GEN_LINENO(9)
call inc_tens_digit
GEN_BUZZ
GEN_FIZZ
GEN_LINENO(2)
GEN_LINENO(3)
GEN_FIZZ
GEN_BUZZ
GEN_LINENO(6)
GEN_FIZZ
GEN_LINENO(8)
GEN_LINENO(9)
call inc_tens_digit
GEN_FIZZBUZZ
GEN_LINENO(1)
GEN_LINENO(2)
GEN_FIZZ
GEN_LINENO(4)
GEN_BUZZ
GEN_FIZZ
GEN_LINENO(7)
GEN_LINENO(8)
GEN_FIZZ
call inc_tens_digit
dec %ecx
jnz thirty_bytecode_lines

generate_bytecode_overrun_area:

// Duplicate the first 512 bytes of the bytecode program at the end,
// so that there's no need to check to see whether BYTECODE_IP needs
// to be looped back to the start of the program any more than once
// per 512 bytes
mov %rax, BYTECODE_START
#define COPY_64_BYTECODE_BYTES(offset) \
vmovdqa %ymm0, [%rax + offset]; \
vmovdqa %ymm3, [%rax + (offset + 32)]; \
vmovdqu [BYTECODE_GEN_PTR + offset], %ymm0; \
vmovdqu [BYTECODE_GEN_PTR + (offset + 32)], %ymm3
COPY_64_BYTECODE_BYTES(0)
COPY_64_BYTECODE_BYTES(64)
COPY_64_BYTECODE_BYTES(128)
COPY_64_BYTECODE_BYTES(192)
COPY_64_BYTECODE_BYTES(256)
COPY_64_BYTECODE_BYTES(320)
COPY_64_BYTECODE_BYTES(384)
COPY_64_BYTECODE_BYTES(448)


// Preparing for the main loop
//
// Work out how long the main loop is going to iterate for.
// OUTPUT_LIMIT holds the address just beyond the end of the output
// that the main loop should produce. The aim here is to produce
// exactly one pipeful of data if possible, but to stop earlier if
// there's a change in digit width (because any output beyond that
// point will be useless: the bytecode will give it the wrong number
// of digits).
calculate_main_loop_iterations:

// Extract the pipe size from BYTECODE_START, in 32-byte units.
// During this calculation, OUTPUT_LIMIT holds the amount of output
// produced, rather than an address like normal.
mov OUTPUT_LIMIT, BYTECODE_START
lea %rdx, [%rip + iovec_base]
sub OUTPUT_LIMIT, %rdx
shr OUTPUT_LIMIT, 5

// Reduce the output limit to the end of this width, if it would be
// higher than that.
cmp OUTPUT_LIMIT, YMMS_AT_WIDTH
cmovae OUTPUT_LIMIT, YMMS_AT_WIDTH

// If there's already some output in the buffer, reduce the amount
// of additional output produced accordingly (whilst ensuring that
// a multiple of 512 bytes of output is produced).
//
// This would be buggy if the YMMS_AT_WIDTH limit were hit at the
// same time, but that never occurs as it would require two width
// changes within one pipeful of each other, and 9000000 lines of
// FizzBuzz is much more than a pipeful in size.
mov %rax, OUTPUT_PTR
and %eax, ((2 << 20) - 1) & -512
shr %eax, 5
sub OUTPUT_LIMIT, %rax

// The amount of output to produce is available now, and won't be
// later, so subtract it from the amount of output that needs to
// be produced now.
sub YMMS_AT_WIDTH, OUTPUT_LIMIT

// Return OUTPUT_LIMIT back to being a pointer, not an amount.
shl OUTPUT_LIMIT, 5
add OUTPUT_LIMIT, OUTPUT_PTR

prepare_main_loop_invariants:

// To save one instruction in the bytecode interpreter (which is very
// valuable, as it runs every second CPU cycle), LINENO_MID_TEMP is
// used to store a reformatted version of LINENO_MID, in which each
// byte is translated from high-decimal to ASCII, and the bytecode
// command that would access that byte is added to the result (e.g.
// the thousands digit for the hundreds-digits-odd version has 10
// added to convert from high-decimal to a pure number, '0' added to
// convert to ASCII, then 9 added because that's the bytecode command
// to access the thousands digit when the hundreds digit is odd, so
// the amount added is 10 + '0' + 9 = 57).
//
// LINENO_MID_TEMP is updated within the main loop, immediately after
// updating LINENO_MID, but because the bytecode interpreter reads
// from it it needs a valid value at the start of the loop.
.section .rodata
biascii_offset:
.byte 58, 59, 60, 61, 62, 63, 64, 65
.byte 66, 67, 68, 69, 70, 71, 72, 73
.byte 58, 59, 60, 61, 62, 63, 64, 65
.byte 66, 67, 68, 69, 70, 71, 72, 73
.text
vpaddb LINENO_MID_TEMP, BIASCII_OFFSET, LINENO_MID

// To save an instruction, precalculate minus the length of the
// bytecode. (Although the value of this is determined entirely by
// LINENO_WIDTH, the register it's stored in gets clobbered by
// system calls and thus needs to be recalculated each time.)
mov BYTECODE_NEG_LEN, BYTECODE_START
sub BYTECODE_NEG_LEN, BYTECODE_END


// The main loop

// The bytecode interpreter consists of four instructions:
// 1. Load the bytecode from memory into %ymm2;
// 2. Use it as a shuffle mask to shuffle LINENO_MID_TEMP;
// 3. Subtract the bytecode from the shuffle result;
// 4. Output the result of the subtraction.
//
// To see why this works, consider two cases. If the bytecode wants to
// output a literal character, then the shuffle will produce 0 for
// that byte (in AVX2, a shuffle with a negative index produces an
// output of 0), and subtracting the bytecode from 0 then produces the
// character (because the bytecode encoded minus the character). If
// the bytecode instead wants to output a digit, then the shuffle will
// fetch the relevant digit from LINENO_MID_TEMP (which is the desired
// ASCII character plus the bytecode instruction that produces it),
// and subtract the bytecode instruction to just produce the character
// on its own.
//
// This produces an exactly correct line number as long as the line
// number approximation is within 100 of the true value: it will be
// correct as long as the relevant part of LINENO_MID is correct, and
// the worst case is for LINENO_MID to be storing, say, 200 and 300
// (the representation of 250) when the true line number is 400. The
// value in LINENO_MID specifically can be up to 50 away from the
// value of the line number as recorded by LINENO_MID and LINENO_LOW
// together, so as long as the line number registers are within 100,
// LINENO_MID will be within 150 (which is what is required).
//
// This doesn't update the bytecode instruction pointer or the pointer
// into the output buffer; those are updated once every 512 bytes (and
// to "advance the instruction pointer" the rest of the time, the main
// loop is unrolled, using hardcoded offsets with the pointer updates
// baked in).
//
// The bytecode instruction pointer itself is read from %rdx, not
// BYTECODE_IP, so that mid-loop arithmetic on BYTECODE_IP won't cause
// the interpreter to break.
//
// It's important to note one potential performance issue with this
// code: the read of the bytecode from memory is not only misalignable
// (`vmovdqu`); it splits a cache line 3/8 of the time. This causes L1
// split-load penalties on the 3/8 cycles where it occurs. I am not
// sure whether this actually reduces the program's performance in
// practice, or whether the split loads can be absorbed while waiting
// for writes to go through to the L2 cache. However, even if it does
// have a genuine performance cost, it seems like the least costly way
// to read the bytecode; structuring the bytecode to avoid split loads
// makes it take up substantially more memory, and the less cache that
// is used for the bytecode, the more that can be used for the output
// buffers. (In particular, increasing the bytecode to 2400 lines so
// that it's available at all four of the alignments required of it
// does not pay off: it then becomes so large that the processor
// struggles to keep it in L1 cache - it only just fits, and the
// processor has no way to know which parts of the cache are meant to
// stay in L1 and which are meant to be evicted to L2, so there's a
// large slowdown when it guesses wrong.)
#define INTERPRET_BYTECODE(bc_offset, buf_offset) \
  vmovdqu %ymm2, [%rdx + bc_offset]; \
  vpshufb %ymm0, LINENO_MID_TEMP, %ymm2; \
  vpsubb %ymm0, %ymm0, %ymm2; \
  vmovdqa [OUTPUT_PTR + buf_offset], %ymm0
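
// For readers who think in intrinsics rather than in assembly, one
// interpreter step corresponds roughly to the following standalone C
// sketch (illustrative names, compiled separately with AVX2 enabled;
// the real loop of course stays in assembly):
//
//   #include <immintrin.h>
//   #include <stdint.h>
//
//   // One step: read 32 bytecode bytes, write 32 output bytes.
//   // lineno_mid_temp holds "ASCII digit + command number" in each
//   // digit byte, as set up above.
//   static inline void interpret_step(const uint8_t *bytecode,
//                                     __m256i lineno_mid_temp,
//                                     uint8_t *output) {
//       __m256i bc = _mm256_loadu_si256((const __m256i *)bytecode);
//       // bytes with the high bit set (literal characters, stored
//       // negated) shuffle to 0; the others fetch a digit byte
//       __m256i shuffled = _mm256_shuffle_epi8(lineno_mid_temp, bc);
//       // literal: 0 - (-char) = char; digit: (ASCII + cmd) - cmd = ASCII
//       __m256i out = _mm256_sub_epi8(shuffled, bc);
//       _mm256_storeu_si256((__m256i *)output, out);
//   }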

// The main loop itself consists of sixteen uses of the bytecode
// interpreter, interleaved (to give the reorder buffer maximum
// flexibility) with all the other logic needed in the main loop.
// (Most modern processors can handle 4-6 instructions per clock cycle
// as long as they don't step on each other's toes; thus this loop's
// performance will be limited by the throughput of the L2 cache, with
// all the other work (bytecode interpretation, instruction decoding,
// miscellaneous other instructions, etc.) fitting into the gaps while
// the processor is waiting for the L2 cache to do its work.)

.p2align 5
main_loop:
// %rdx caches BYTECODE_IP's value at the start of the loop
mov %rdx, BYTECODE_IP
INTERPRET_BYTECODE(0, 0)

// %ymm1 caches LINENO_LOW's value at the start of the loop
vmovdqa %ymm1, LINENO_LOW
INTERPRET_BYTECODE(32, 32)

// Add LINENO_LOW_INCR to LINENO_LOW, checking for carry; it carried
// if the sign bit changed from 0 to 1. (vpandn is unintuitive; this
// is ~%ymm1 & LINENO_LOW, not %ymm1 & ~LINENO_LOW like the name
// suggests.)
vpaddq LINENO_LOW, LINENO_LOW_INCR, LINENO_LOW
INTERPRET_BYTECODE(64, 64)

vpandn %ymm3, %ymm1, LINENO_LOW
INTERPRET_BYTECODE(96, 96)

vpsrlq %ymm3, %ymm3, 63
INTERPRET_BYTECODE(128, 128)

// Add the carry to LINENO_MID (doubling it; LINENO_MID counts in
// units of 100 but a LINENO_LOW carry means 200).
vpaddb %ymm3, %ymm3, %ymm3
INTERPRET_BYTECODE(160, 160)

vpaddq LINENO_MID, LINENO_MID, %ymm3
INTERPRET_BYTECODE(192, 192)

vpmaxub LINENO_MID, LINENO_MID_BASE, LINENO_MID
INTERPRET_BYTECODE(224, 224)

// Update LINENO_MID_TEMP with the new value from LINENO_MID; this is
// the point at which the new value takes effect. This is done at the
// exact midpoint of the loop, in order to reduce the errors from
// updating once every 512 bytes as far as possible.
vpaddb LINENO_MID_TEMP, BIASCII_OFFSET, LINENO_MID
INTERPRET_BYTECODE(256, 256)

// Update the output and bytecode instruction pointers. The change to
// the output pointer kicks in immediately, but is cancelled out via
// the use of a negative offset until the end of the loop.
add OUTPUT_PTR, 512
INTERPRET_BYTECODE(288, -224)

add BYTECODE_IP, 512
INTERPRET_BYTECODE(320, -192)

// The change to the bytecode instruction pointer doesn't kick in
// immediately, because it might need to wrap back to the start (this
// can be done by adding BYTECODE_NEG_LEN to it); this is why the
// interpreter has a cached copy of it in %rdx.
lea %rax, [BYTECODE_IP + BYTECODE_NEG_LEN]
INTERPRET_BYTECODE(352, -160)

INTERPRET_BYTECODE(384, -128)
// Some modern processors can optimise `cmp` better if it appears
// immediately before the command that uses the comparison result, so
// a couple of commands have been moved slightly to put the `cmp` next
// to the use of its result. With modern out-of-order processors,
// there is only a marginal advantage to manually interleaving the
// instructions being used, and the `cmp` advantage outweighs that.
cmp BYTECODE_IP, BYTECODE_END

cmovae BYTECODE_IP, %rax
INTERPRET_BYTECODE(416, -96)

INTERPRET_BYTECODE(448, -64)

INTERPRET_BYTECODE(480, -32)
cmp OUTPUT_PTR, OUTPUT_LIMIT
jb main_loop
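
// The carry test that's interleaved through the loop above (vpaddq,
// vpandn, vpsrlq) can be checked in isolation with a scalar C sketch
// (separate from this file, illustrative names; the doubling and the
// add into LINENO_MID aren't modelled here):
//
//   #include <assert.h>
//   #include <stdint.h>
//
//   // vpandn computes ~a & b, so (~old & new) keeps exactly the bits
//   // that changed from 0 to 1; shifting right by 63 isolates the
//   // sign bit, giving 1 on the iterations where LINENO_LOW carried.
//   static uint64_t carried(uint64_t old, uint64_t new_) {
//       return (~old & new_) >> 63;
//   }
//
//   int main(void) {
//       assert(carried(0x7fffffffffffffffull, 0x8000000000000000ull) == 1);
//       assert(carried(0x8000000000000000ull, 0x8000000000000001ull) == 0);
//       return 0;
//   }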

after_main_loop:
// There are two reasons the main loop might terminate: either there's
// a pipeful of output, or the line number has increased in width
// (forcing the generation of new bytecode to put more digits in the
// numbers being printed). In the latter case, a) the output may have
// overrun slightly, and OUTPUT_PTR needs to be moved back to
// OUTPUT_LIMIT:
mov OUTPUT_PTR, OUTPUT_LIMIT
// and b) there may be less than a pipeful of output, in which case it
// wouldn't be safe to output it and the swap_buffers call needs to be
// skipped. Calculate the pipe size into %rax, the amount of output
// into %rdx (swap_buffers needs it there anyway), and compare.
lea %rax, [%rip + iovec_base]
sub %rax, BYTECODE_START
neg %eax
mov %rdx, OUTPUT_PTR
and %edx, (2 << 20) - 1
cmp %edx, %eax
jb 1f
call swap_buffers

// If all the lines at this width have been exhausted, move to the
// next width.
1: test YMMS_AT_WIDTH, YMMS_AT_WIDTH
jnz check_lineno_top_carry

cmp LINENO_WIDTHe, 18  // third phase handles at most 18 digits
je fourth_phase

inc LINENO_WIDTHe

// vpcmpeqb of a register with itself gives -1 in every byte, so the
// vpsubb below adds 1 to every byte of ENDIAN_SHUFFLE.
vpcmpeqb %ymm0, %ymm0, %ymm0
vpsubb ENDIAN_SHUFFLE, ENDIAN_SHUFFLE, %ymm0

// Multiply GROUPS_OF_15 by 10 (times 5 via the lea, then times 2):
// there are ten times as many lines at the new width.
lea GROUPS_OF_15, [GROUPS_OF_15 + GROUPS_OF_15 * 4]
add GROUPS_OF_15, GROUPS_OF_15

// The bytecode is longer at the new width (each printed line number
// gains a digit).
add BYTECODE_END, 320

jmp third_phase_per_width_init

// So far, the code has kept LINENO_MID and LINENO_LOW updated, but
// not LINENO_TOP. Because 10 billion lines of FizzBuzz don't normally
// have a length that's divisible by 512 (and indeed, vary in size a
// little because 10 billion isn't divisible by 15), it's possible for
// the 10-billions and higher digits to need to change in the middle
// of a main loop iteration - indeed, even in the middle of a single
// CPU instruction!
//
// It turns out that when discussing the line number registers above,
// I lied a little about the format. The bottom seven bytes of
// LINENO_MID do indeed represent the hundreds to hundred millions
// digits. However, the eighth changes in meaning over the course of
// the program. It does indeed represent the billions digit most of
// the time; but when the line number is getting close to a multiple
// of 10 billion, the billions and hundred-millions digits will always
// be the same as each other (either both 9s or both 0s). When this
// happens, the format changes: the hundred-millions digit of
// LINENO_MID represents *both* the hundred-millions and billions
// digits of the line number, and the top byte then represents the
// ten-billions digit. Because incrementing a number causes a row of
// consecutive 9s to either stay untouched, or all roll over to 0s at
// once, this effectively lets us do maths on more than 8 digits,
// meaning that the normal arithmetic code within the main loop can
// handle the ten-billions digit in addition to the digits below.
//
// Of course, the number printing code also needs to handle the new
// representation, but the number printing is done by a bytecode
// program, which can be made to output some of the digits being
// printed multiple times by repeating "print digit from LINENO_MID"
// commands within it. Those commands are generated from COUNTER_TOP
// anyway, so the program just changes the constant portion of
// COUNTER_TOP (and moves print-digit commands into the top half) in
// order to produce the appropriate bytecode changes.
//
// A similar method is used to handle carries in the hundred-billions,
// trillions, etc. digits.
//
// Incidentally, did you notice the apparent off-by-one in the
// initialisation of LINENO_MID within third_phase_per_width_init? It
// causes the "billions" digit to be initialised to 1 (not 0) when the
// line number width is 11 or higher. That's because the alternate
// representation will be in use during a line number width change (as
// higher powers of 10 are close to multiples of 10 billion), so the
// digit that's represented by that byte of LINENO_MID genuinely is a
// 1 rather than a 0.
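
// The decimal fact this all relies on can be spelled out in a tiny
// standalone C check (a sketch, separate from this file):
//
//   #include <assert.h>
//   #include <stdint.h>
//
//   int main(void) {
//       // Incrementing a decimal number either leaves a run of
//       // trailing 9s untouched or rolls the whole run over at once:
//       assert(3998 + 1 == 3999);  // the 9s untouched
//       assert(3999 + 1 == 4000);  // all of them roll over together
//
//       // So while the line number is close to a multiple of 10
//       // billion, its billions and hundred-millions digits agree
//       // (both 9 just below it, both 0 just above it), and a single
//       // stored byte can safely stand in for both of them.
//       uint64_t below = 9999999999ull, above = 10000000000ull;
//       assert(below / 1000000000 % 10 == below / 100000000 % 10);
//       assert(above / 1000000000 % 10 == above / 100000000 % 10);
//       return 0;
//   }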
check_lineno_top_carry:

// The condition to change line number format is:
// a) The line number is in normal format, and the hundred-millions
//    and billions digits are both 9; or
// b) The line number is in alternate format, and the hundred-millions
//    digit is 0.
// To avoid branchy code in the common case (when no format change is
// needed), REGEN_TRIGGER is used to store the specific values of the
// hundred-millions and billions digits that mean a change is needed,
// formatted as two repeats of billions, hundred-millions, 9, 9 in
// high-decimal (thus, when using normal format, REGEN_TRIGGER is
// high-decimal 99999999, i.e. -1 when interpreted as binary). The 9s
// are because vpshufd doesn't have very good resolution: the millions
// and ten-millions digits get read too, but can simply be masked
// out. The two repeats are to ensure that both halves of LINENO_MID
// (the even-hundreds-digit and odd-hundreds-digit halves) have the
// correct value while changing (changing the format while half the
// register still ended ...98999999 would produce incorrect output).
vpshufd %xmm0, LINENO_MIDx, 0xED
vpextrq %rax, %xmm0, 0
mov %rdx, 0x0000ffff0000ffff
or %rax, %rdx
cmp %rax, REGEN_TRIGGER
jne calculate_main_loop_iterations
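
// The effect of the masking can be modelled for a single half with a
// small standalone C sketch (illustrative names; the real check
// compares both halves at once, packed into one 64-bit register):
//
//   #include <assert.h>
//   #include <stdint.h>
//
//   // Pack the millions..billions bytes of one half of LINENO_MID
//   // (high-decimal: digit d is stored as (uint8_t)(d - 10)) into a
//   // little-endian dword.
//   static uint32_t pack_half(int mil, int ten_mil, int hundred_mil, int bil) {
//       return  (uint32_t)(uint8_t)(mil - 10)
//            | ((uint32_t)(uint8_t)(ten_mil - 10) << 8)
//            | ((uint32_t)(uint8_t)(hundred_mil - 10) << 16)
//            | ((uint32_t)(uint8_t)(bil - 10) << 24);
//   }
//
//   int main(void) {
//       // Normal format: the trigger fires when the hundred-millions
//       // and billions digits are both 9; the OR forces the two
//       // lower digits to match the 9s in REGEN_TRIGGER.
//       assert((pack_half(3, 7, 9, 9) | 0x0000ffff) == 0xffffffff);
//       assert((pack_half(3, 7, 8, 9) | 0x0000ffff) != 0xffffffff);
//       return 0;
//   }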

cmp REGEN_TRIGGER, -1
jne switch_to_normal_representation


switch_to_alternate_representation:
// Count the number of 9s at the end of LINENO_TOP. To fix an edge
// case, the top bit of LINENO_TOP is interpreted as a 0, preventing
// a 9 being recognised there (this causes 10**18-1 to increment to
// 10**17 rather than 10**18, but the program immediately exits
// before this can become a problem).
vpextrq %rdx, LINENO_TOPx, 1
mov SPILL, %rdx
shl %rdx, 1
shr %rdx, 1
not %rdx
bsf %rcx, %rdx
and %rcx, -8
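
// The same count can be written as a scalar C sketch (separate from
// this file; GCC's __builtin_ctzll stands in for bsf):
//
//   #include <assert.h>
//   #include <stdint.h>
//
//   // Return the bit offset of the lowest byte that is not a
//   // high-decimal 9 (0xFF), treating the top bit as 0, the way the
//   // shl/shr/not/bsf/and sequence above does.
//   static int first_non_nine_bit(uint64_t x) {
//       x &= ~(1ull << 63);               // pretend the top bit is 0
//       return __builtin_ctzll(~x) & ~7;  // first clear bit, rounded down to a byte
//   }
//
//   int main(void) {
//       assert(first_non_nine_bit(0x00000000c7ffffffull) == 24);  // three trailing 9s
//       assert(first_non_nine_bit(0x00000000000000c7ull) == 0);   // no trailing 9s
//       return 0;
//   }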

// Change the format of LINENO_TOP so that the digit above the
// consecutive 9s becomes a reference to the top byte of LINENO_MID,
// and the 9s themselves references to the hundred-millions digit.
// This is done via a lookup table that specifies how to move the
// bytes around.
.section .rodata
alternate_representation_lookup_table:
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 7, 9, 10, 11, 12, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 7, 9, 10, 11, 12, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 7, 10, 11, 12, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 7, 10, 11, 12, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 7, 11, 12, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 7, 11, 12, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 7, 12, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 7, 12, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 7, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 7, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 7, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 7, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 6, 7, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 6, 7, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 6, 6, 7
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 6, 6, 7
.text

lea %rax, [%rip + alternate_representation_lookup_table]
vpshufb LINENO_TOP, LINENO_TOP, [%rax + 4 * %rcx]

// The top byte of LINENO_MID also needs the appropriate digit of
// LINENO_TOP placed there.
mov %rdx, SPILL
shr %rdx, %cl
vpinsrb LINENO_MIDx, LINENO_MIDx, %edx, 7
vpinsrb LINENO_MIDx, LINENO_MIDx, %edx, 15
vpermq LINENO_MID, LINENO_MID, 0x44

// Finally, REGEN_TRIGGER needs to store the pattern of digits that
// will prompt a shift back to the normal representation (the hundred-
// millions digit must be 0, and the value of the billions digit will
// be predictable).
inc %edx
shl %edx, 24
or %edx, 0xF6FFFF
mov REGEN_TRIGGERe, %edx
shl %rdx, 32
or REGEN_TRIGGER, %rdx
jmp generate_bytecode


switch_to_normal_representation:
// Switching back is fairly easy: LINENO_TOP can almost be converted
// back into its usual format by running the bytecode program stored
// there to remove any unusual references into LINENO_MID, then
// restoring the usual references manually. Running the program will
// unfortunately convert high-decimal to ASCII (or in this case zeroes
// because there's no need to do the subtraction), but that can be
// worked around by taking the bytewise maximum of the converted and
// original LINENO_TOP values (high-decimal is higher than bytecode
// references and much higher than zero).
vpsubb %ymm2, BASCII_OFFSET, LINENO_TOP
vpshufb %ymm0, LINENO_MID, %ymm2
vpmaxub LINENO_TOP, LINENO_TOP, %ymm0

// Manually fix the constant parts of lineno_top to contain their
// usual constant values
.section .rodata
lineno_top_max:
.byte 198, 197, 196, 195, 194, 193, 192, 191
.byte 255, 255, 255, 255, 255, 255, 255, 255
.byte 190, 189, 188, 187, 186, 185, 184, 183
.byte 255, 255, 255, 255, 255, 255, 255, 255
.text
vpminub LINENO_TOP, LINENO_TOP_MAX, LINENO_TOP

// The billions digit of LINENO_MID needs to be set back to 0 (which
// is its true value at this point: the same as the hundred-millions
// digit, which is also 0).
vpsllq LINENO_MID, LINENO_MID, 8
vpsrlq LINENO_MID, LINENO_MID, 8
vpmaxub LINENO_MID, LINENO_MID_BASE, LINENO_MID

mov REGEN_TRIGGER, -1

jmp generate_bytecode


///// Fourth phase
//
// Ending at 999999999999999999 lines would be a little unsatisfying,
// so here's a routine to write the quintillionth line and exit.
//
// It's a "Buzz", which we can steal from the first phase's constant.

fourth_phase:

mov ARG1e, 1
lea ARG2, [%rip + fizzbuzz_intro + 11]
mov ARG3, 5
mov SYSCALL_NUMBER, __NR_write
syscall
call exit_on_error
xor ARG1e, ARG1e
jmp exit


///// Error handling code
//
// This doesn't run in a normal execution of the program, and isn't
// particularly optimised; I didn't comment it much because it isn't
// very interesting and also is fairly self-explanatory.

write_stderr:
mov ARG1e, 2
mov SYSCALL_NUMBER, __NR_write
syscall
ret

inefficiently_write_as_hex:
push %rax
push %rcx
shr %rax, %cl
and %rax, 0xF
.section .rodata
hexdigits: .ascii "0123456789ABCDEF"
.text
lea %rcx, [%rip + hexdigits]
movzx %rax, byte ptr [%rcx + %rax]
mov [%rip + error_write_buffer], %al
lea ARG2, [%rip + error_write_buffer]
mov ARG3e, 1
call write_stderr
pop %rcx
pop %rax
sub %ecx, 4
jns inefficiently_write_as_hex
ret

exit_on_error:
test SYSCALL_RETURN, SYSCALL_RETURN
js 1f
ret

.section .rodata
error_message_part_1: .ascii "Encountered OS error 0x"
error_message_part_2: .ascii " at RIP 0x"
error_message_part_3: .ascii ", exiting program.\n"
.text

1: push SYSCALL_RETURN
lea ARG2, [%rip + error_message_part_1]
mov ARG3e, 23
call write_stderr
pop SYSCALL_RETURN
neg SYSCALL_RETURN
mov %rcx, 8
call inefficiently_write_as_hex
lea ARG2, [%rip + error_message_part_2]
mov ARG3e, 10
call write_stderr
pop %rax  // find the caller's %rip from the stack
sub %rax, 5  // `call exit_on_error` compiles to 5 bytes
mov %rcx, 60
call inefficiently_write_as_hex
lea ARG2, [%rip + error_message_part_3]
mov ARG3e, 19
call write_stderr
mov ARG1e, 74
// fall through

exit:
mov SYSCALL_NUMBER, __NR_exit_group
syscall
ud2

.section .rodata
cpuid_error_message:
.ascii "Error: your CPUID command does not support command "
.ascii "0x80000006 (AMD-style L2 cache information).\n"
.text
bad_cpuid_error:
lea ARG2, [%rip + cpuid_error_message]
mov ARG3e, 96
call write_stderr
mov ARG1e, 59
jmp exit

.section .rodata
pipe_error_message:
.ascii "This program can only output to a pipe "
.ascii "(try piping into `cat`?)\n"
.text
pipe_error:
lea ARG2, [%rip + pipe_error_message]
mov ARG3e, 64
call write_stderr
mov ARG1e, 73
jmp exit

.section .rodata
pipe_perm_error_message_part_1:
.ascii "Cannot allocate a sufficiently large kernel buffer.\n"
.ascii "Try setting /proc/sys/fs/pipe-max-size to 0x"
pipe_perm_error_message_part_2: .ascii ".\n"
.text
pipe_perm_error:
lea ARG2, [%rip + pipe_perm_error_message_part_1]
mov ARG3e, 96
call write_stderr
mov %rax, PIPE_SIZE
mov %ecx, 28
call inefficiently_write_as_hex
lea ARG2, [%rip + pipe_perm_error_message_part_2]
mov ARG3e, 2
call write_stderr
mov ARG1e, 77
jmp exit

.section .rodata
pipe_size_error_message_part_1:
.ascii "Failed to resize the kernel pipe buffer.\n"
.ascii "Requested size: 0x"
pipe_size_error_message_part_2: .ascii "\nActual size: 0x"
pipe_size_error_message_part_3:
.ascii "\n(If the buffer is too large, this may cause errors;"
.ascii "\nthe program could run too far ahead and overwrite"
.ascii "\nmemory before it had been read from.)\n"
.text
pipe_size_mismatch_error:
push SYSCALL_RETURN
lea ARG2, [%rip + pipe_size_error_message_part_1]
mov ARG3e, 59
call write_stderr
mov %rax, PIPE_SIZE
mov %ecx, 28
call inefficiently_write_as_hex
lea ARG2, [%rip + pipe_size_error_message_part_2]
mov ARG3e, 16
call write_stderr
pop %rax
mov %ecx, 28
call inefficiently_write_as_hex
lea ARG2, [%rip + pipe_size_error_message_part_3]
mov ARG3e, 141
call write_stderr
mov ARG1e, 73
jmp exit