Why is your dbms not using your index?

So recently I was trying to gain some knowledge about indexes using Postgres, going beyond the tutorials I’ve done and so I attempted to make some queries faster using b-tree indexes on fish count data. 

However, before getting into any complex queries I tried out a basic one:

select * from chinook where year > ‘2000’;

When I used explain analyze on this query it gave me:

                                                     QUERY PLAN                                                      

———————————————————————————————————————————————

 Seq Scan on yearly_chinook_sum_view  (cost=0.00..1.15 rows=4 width=40) (actual time=12.559..12.584 rows=12 loops=1)

   Filter: ((year)::text > ‘2000’::text)

 Total runtime: 12.669 ms

(3 rows)

So why is it still doing a sequential scan?

First I looked at my table to make sure I really had added a b-tree index on the table named chinook using \d chinook which confirmed that yes, I have a b-tree index on the year column.

Then I started searching for the issue and found these articles:

http://www.depesz.com/2010/09/09/why-is-my-index-not-being-used/

http://www.postgresonline.com/journal/archives/78-Why-is-my-index-not-being-used.html

There were a few possible reasons why my query wasn’t using the index, but the one that seemed most likely to me was that it wasn’t efficient due to table size. 

Next step, how big is the Chinook table?

https://wiki.postgresql.org/wiki/Disk_Usage

So from Wiki I used this join to find my total disk usage including indexes:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 20;

And for Chinook it gave me:

public.chinook                        | 160 kB

So this is fairly small, but I don’t really feel like doing tests to see where the threshold is. What I did test were queries that would return one row versus a larger percentage of rows.

kenai_fishruns=# explain analyze select * from chinook where year = ‘2009’;

                                                           QUERY PLAN                                                            

———————————————————————————————————————————————

 Index Scan using chinook_year_index on chinook  (cost=0.28..9.67 rows=48 width=44) (actual time=30.929..31.151 rows=48 loops=1)

   Index Cond: ((year)::text = ‘2009’::text)

 Total runtime: 31.244 ms

(3 rows)

VERSUS:

kenai_fishruns=# explain analyze select * from chinook where year > ‘2009’;

                                              QUERY PLAN                                               

———————————————————————————————————————————————

 Seq Scan on chinook  (cost=0.00..13.45 rows=203 width=44) (actual time=0.033..1.258 rows=203 loops=1)

   Filter: ((year)::text > ‘2009’::text)

   Rows Removed by Filter: 393

 Total runtime: 1.350 ms

(4 rows)

So this started pointing me in the right direction, I figured it had to do with the amount of rows returned versus some error in how I indexed my tables, so I googled and stackoverflow came to the rescue again:

http://stackoverflow.com/questions/5203755/why-does-postgresql-perform-sequential-scan-on-indexed-column

Basically the answer to this post says that if a query returns approximately more than 5-10% (roughly) of the total rows, then it will use a sequential scan versus an index scan.

This makes sense with my little test between the queries returning the one row versus the many rows it would be returning for the second query. Unfortunately this stackoverflow post didn’t cite where to find this info in the docs, but that’s research for another day. At least now I can rest easy knowing the general why behind Postgres choosing to use a sequential scan versus an index scan.

Big O and other algorithm resources

Here are some resources for us non-CS majors who became programmers anyway:

Big O:

http://www.perlmonks.org/?node_id=227909

https://www.youtube.com/watch?v=V6mKVRU1evU

Basic Search Algorithms:

http://algorithms.openmymind.net/

http://interactivepython.org/runestone/static/pythonds/index.html

I will add resources as I come upon them and feel free to add yours in the notes and I’ll add them to the post.

Homebrew should create a data folder or at least a pg_log folder

So I was following some advice from one of the tutorials I took at PyCon2014 called Postgres for Python People and since it was more of a list of stuff you should implement when using Postgres rather than a hands on tutorial, I thought I’d better get to it a month later.  I have Postgres installed via Hombrew on my Mac and realized there was no folder for my log files, it’s just plopped in under the Postgres folder:  /usr/local/var/postgres/server.log

So I created a pg_log folder under /usr/local/var/postgres/

then changed my homebrew.mxcl.postgresql.plist (which is a simple xml file) to point to the pg_log folder:

<key>StandardErrorPath</key>
<string>/usr/local/var/postgres/pg_log/server.log</string>

Then I shut down the Postgres server via launchctl:

launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist

Then restarted it again via launchctl:

launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist

And now your server.log files are under pg_log folder!

Find out what those flags in grep mean

If you don’t want to sort through a long blog post on grep flags and simply want a more interactive solution, you won’t get any help by typing (bad pun intended):

grep —help

or any other combo of help in the command line. So, I recently learned that you use:

man grep

and that will give you all the dirty on grep in a very readable format.  SO next time you’re typing in something like:

ps auxw | grep postgres | grep — -D

and want to know what the heck the -D flag is for just man grep and you’ll find out this: -D action—devices=action

man grep

Specify the demanded action for devices, FIFOs and sockets.  The default action is `read’, which means, that they are read as if they were normal files.  If the action is set to `skip’, devices will be silently skipped.