Hello again, kiddies. Today's post describes a particular behavior of PostgreSQL that has bitten me on the bum more than once. Do you have any queries that should be using an index, but aren't? Read on for a possible explanation.
I've been working with the UMLS a lot lately. For those who are unfamiliar with this acronym, it stands for the "Unified Medical Language System" and is a monstrously large database containing millions of medical terms and concepts in dozens of languages. While it can be a wonderful resource, the adjectives I'd use to describe it do not include "easy to use." One of its nifty features is that, in addition to simply storing medical terms, it stores a little bit of
semantic information about the terms. In other words, if I know the UMLS code number for "Weil's Disease", I can (in theory) find out that it is a type of "Leptospirosis", and that it is therefore a "Zoonotic bacterial infection" and a "Spirochetal infection".
As you might imagine, this involves writing an SQL query that JOINs data from several tables on a particular set of specified values. Now, given the size of the UMLS, some of these tables have millions of rows. As anybody out there who's done anything with SQL will tell you, you
really want the database engine to do that join using an index scan (as opposed to a simple sequential scan, which would have to visit each and every one of those millions of rows). Luckily, in PostgreSQL, if you are trying to join two tables,
and the column you're joining on is indexed in both of those tables, the system is usually smart enough to do an index scan, and life is good.
Note the use of that all-important word, "usually". There are certain circumstances in which, in spite of both columns being indexed, the database will insist on doing a sequential scan on one or both of the tables. You'll know when this happens because a query that should take under one second will instead take ninety (depending, of course, on how large your tables are). The first step for diagnosing a slow query should always be to use the "
explain
" command. When you do
explain
your mysteriously slow query, the odds are good that you'll see the dreaded "
Seq Scan on...
" listed as one of the steps instead of "
Index Scan using...
".
There are all sorts of reasons why Postgres could be choosing not to use an index, but the reason that has gotten me most often over the years is that the columns involved in comparisons (i.e., the columns that are being specified as the join conditions) are not using the same datatype. The first time I ran into this problem, I was using two tables that were involved in storing customer shopping cart data. For some reason that I've long since forgotten, we'd used
bigint
(i.e.,
int8
) as the datatype for a column in one table, and only used
int
(
int4
) for the corresponding column in the other table. Even though each table had a column index for the relevant column, Postgres was doing a sequential scan, which began to absolutely murder our query's performance once one of the tables got to having more than 100,000 rows or so.
The solution was refreshingly easy: in the query, simply cast one of the columns to the appropriate datatype. In this case, we casted the
int4
column to
int8
, and *boom*, the query ran thousands of times faster.
Today, in the UMLS, I found myself running into the same problem trying to do a join between mrhier (~8,000,000 rows) and mrconso (~ 4,000,000 rows): for some reason, the
mrhier
table's
paui
column is set to
varchar(10)
, whereas every other column storing an AUI is
varchar(9)
. I forced my query to start casting to varchar(9), and things suddenly started using their appropriate indices. I don't know if the difference in data types is a bug in the UMLS SQL load scripts, or what— MRCOLS.RRF tells me that the PAUI column should have a maximum of 9 characters, so that spare character is almost certainly anomalous.
For example, consider the following query:
select str from mrconso where aui in (select distinct mh.paui from mrhier mh where mh.aui = 'A2883423');
Note that I'm using a subquery mostly out of convenience.
explain analyze
gives the following output: (Blogger's screwed up the formatting, you might need to copy and paste into Text Edit or something)
umls=# explain analyze select str from mrconso where aui in (select distinct mh.paui from mrhier mh where mh.aui = 'A2883423');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash IN Join (cost=164014.37..352398.59 rows=43224 width=32) (actual time=2008.163..50013.062 rows=2 loops=1)
Hash Cond: (("outer".aui)::bpchar = "inner".paui)
-> Seq Scan on mrconso (cost=0.00..166340.32 rows=4322332 width=63) (actual time=0.100..43613.254 rows=4322332 loops=1)
-> Hash (cost=164014.36..164014.36 rows=2 width=34) (actual time=0.205..0.205 rows=0 loops=1)
-> Subquery Scan "IN_subquery" (cost=163814.12..164014.36 rows=2 width=34) (actual time=0.189..0.197 rows=2 loops=1)
-> Unique (cost=163814.12..164014.34 rows=2 width=34) (actual time=0.185..0.192 rows=2 loops=1)
-> Sort (cost=163814.12..163914.23 rows=40045 width=34) (actual time=0.184..0.185 rows=3 loops=1)
Sort Key: paui
-> Index Scan using x_mrhier_aui on mrhier mh (cost=0.00..160080.82 rows=40045 width=34) (actual time=0.075..0.096 rows=3 loops=1)
Index Cond: ((aui)::text = 'A2883423'::text)
Total runtime: 50049.380 ms
Crikey! This thing took almost a minute to run! I've highlighted the culprit in red: that dang sequential scan. Let's change the query a little bit, and use the casting trick, like so:
select str from mrconso where aui in (select distinct mh.paui::varchar(9) from mrhier mh where mh.aui = 'A2883423');
I've highlighted the cast. Let's run
explain analyze
again:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=164014.34..337010.71 rows=43224 width=32) (actual time=40.762..87.066 rows=2 loops=1)
-> Subquery Scan "IN_subquery" (cost=164014.34..164214.59 rows=2 width=31) (actual time=0.232..0.261 rows=2 loops=1)
-> Unique (cost=164014.34..164214.57 rows=2 width=34) (actual time=0.229..0.249 rows=2 loops=1)
-> Sort (cost=164014.34..164114.46 rows=40045 width=34) (actual time=0.227..0.232 rows=3 loops=1)
Sort Key: (paui)::character varying(9)
-> Index Scan using x_mrhier_aui on mrhier mh (cost=0.00..160281.04 rows=40045 width=34) (actual time=0.091..0.111 rows=3 loops=1)
Index Cond: ((aui)::text = 'A2883423'::text)
-> Index Scan using x_mrconso_aui on mrconso (cost=0.00..86127.91 rows=21612 width=63) (actual time=43.368..43.382 rows=1 loops=2)
Index Cond: ((mrconso.aui)::text = ("outer".paui)::text)
Total runtime: 87.286 ms
Look, ma, no sequential scans! Pretty remarkable performance improvement, eh? Sub-100-ms vs 50,000 ms! Amazing what a good index will do on sufficiently large tables.
So, long story short, Postgres' query planner is very picky about column data types, even down to how many characters a
varchar
is given. If a query is misbehaving, try experimenting with some casting and see what you get.
Update: For the sake of completeness, I should mention that the PAUI column is not supposed to be 10 characters long- it's supposed to be 9 characters long, but there's a bug in the sql load file that the UMLS gives out wherein the column is set to 8 characters by mistake. To "fix" this bug, I set it to 10, thereby introducing this other bug. This all happened before I had found the master data dictionary for the UMLS; if I were to run into this problem today, I would have been able to discover that the correct column width for PAUI is 9 characters and this never would have happened. However, the larger point of the article--- that it's important to JOIN on similarly-typed columns--- is still valid.