I wanted to use PostgreSQL‘s array_agg function from Django to help bound the number of queries a complex page makes.

I managed to make it work, but it was a bit painful. I’d love to hear from anyone else that’s managed to make it work, especially if you didn’t have to hack as hard as me :-)

First, let’s define some models for us to play with:

from django.db import models

class Location(models.Model):
    name = models.CharField(max_length=1024)

class Hurricane(models.Model):
    year = models.IntegerField()
    name = models.CharField(max_length=1024)
    location = models.ForeignKey(Location)
 

The problem I want to solve is: for a given Location, show the hurricanes that occurred for the last ten years in which there were hurricanes.

In this simplified case, this is probably quite easy to solve without being as fancy as I am about to be, but the thing that makes it non-trivial is the bit at the end: displaying from the last ten years is easy (“WHERE YEAR > 2002“) but there may be years where there were not hurricanes. The trick I want to use is to GROUP ON year, use array_agg to collect the ids of the hurricanes for each year, and use LIMIT to ensure I only get 10 years worth. Then I can gather the Hurricane ids up in Python code and issue another query to fetch them all in a single query. I’m aiming for something like this:

hurr=> SELECT year, array_agg(id) FROM hurr_hurricane
hurr-> WHERE hurr_hurricane.location_id = 1
hurr-> GROUP BY year ORDER BY year LIMIT 10;
 year | array_agg
------+-----------
 2004 | {2,1}
 2006 | {3}
 2007 | {4}
(3 rows)
 

I’ve never quite gotten my head around Django’s support for aggregates, but it’s not hard to get Django to emit a query of the right shape:

>>> qs = Hurricane.objects.filter(
...     location=loc).values('year').annotate(Sum('id'))[:10]
>>> print qs.query
SELECT "hurr_hurricane"."year", SUM("hurr_hurricane"."id") AS "id__sum"
FROM "hurr_hurricane" WHERE "hurr_hurricane"."location_id" = 1
GROUP BY "hurr_hurricane"."year", "hurr_hurricane"."year"
ORDER BY "hurr_hurricane"."year" LIMIT 10
>>> pprint(list(qs))
[{'id__sum': 3, 'year': 2004},
 {'id__sum': 3, 'year': 2006},
 {'id__sum': 4, 'year': 2007}]
 

We don’t want to SUM though, we want to array_agg. I don’t really know to what extent it’s actually supported, but googling can find enough of a clue on how to use custom aggregates with Django. Culting the approprate cargo leads to:

from django.db.models.sql.aggregates import Aggregate as SQLAggregate
from django.db.models import Aggregate

class SQLArrayAgg(SQLAggregate):
    sql_function = 'array_agg'

class ArrayAgg(Aggregate):
    name = 'ArrayAgg'
    def add_to_query(self, query, alias, col, source, is_summary):
        klass = SQLArrayAgg
        aggregate = klass(
            col, source=source, is_summary=is_summary, **self.extra)
        query.aggregates[alias] = aggregate
 

and then:

>>> qs = Hurricane.objects.filter(
...     location=loc).values('year').annotate(ArrayAgg('id')).order_by('year')[:10]
>>> print qs.query
SELECT "hurr_hurricane"."year", array_agg("hurr_hurricane"."id") AS "id__arrayagg"
FROM "hurr_hurricane" WHERE "hurr_hurricane"."location_id" = 1
GROUP BY "hurr_hurricane"."year", "hurr_hurricane"."year"
ORDER BY "hurr_hurricane"."year" ASC LIMIT 10
 

Yay! Except:

>>> list(qs)
[]
 

Huh? This sent me around the houses for ages and ages, but it turns out that there is an easy way of seeing the problem:

>>> for hurricane in qs:
...    print hurricane
Traceback (most recent call last):
    ...
    ...
File "/srv/lava/.cache/eggs/Django-1.4.1-py2.7.egg/django/db/models/sql/query.py", line 316, in convert_values
    return connection.ops.convert_values(value, field)
File "/srv/lava/.cache/eggs/Django-1.4.1-py2.7.egg/django/db/backends/__init__.py", line 843, in convert_values
    return int(value)
TypeError: int() argument must be a string or a number, not 'list'
 

(Something in executing list(qs) is swallowing the exception – I presume this is a bug, I’ll file it soon if it’s unreported.)

So let’s look at where the exception comes from (django/db/backends/__init__.py, Django version 1.4.1 below):

def convert_values(self, value, field):
    """Coerce the value returned by the database backend into a consistent type that
    is compatible with the field type.
    """
    internal_type = field.get_internal_type()
    if internal_type == 'DecimalField':
        return value
    elif internal_type and internal_type.endswith('IntegerField') or internal_type == 'AutoField':
        return int(value)
    elif internal_type in ('DateField', 'DateTimeField', 'TimeField'):
        return value
    # No field, or the field isn't known to be a decimal or integer
    # Default to a float
    return float(value)
 

Django appears to assume that all aggregates return numeric types. This is a bit annoying (another bug to file?), but there is a slimy hack we can pull: DecimalFields are assumed to be the correct type already and do not get converted. So here’s the final, working, array_agg support:

from django.db.models import DecimalField

class SQLArrayAgg(SQLAggregate):
    sql_function = 'array_agg'

class ArrayAgg(Aggregate):
    name = 'ArrayAgg'
    def add_to_query(self, query, alias, col, source, is_summary):
        klass = SQLArrayAgg
        aggregate = klass(
            col, source=source, is_summary=is_summary, **self.extra)
        aggregate.field = DecimalField() # vomit
        query.aggregates[alias] = aggregate
 

And to prove it works:

>>> pprint(list(qs))
[{'id__arrayagg': [2, 1], 'year': 2004},
 {'id__arrayagg': [3], 'year': 2006},
 {'id__arrayagg': [4], 'year': 2007}]
 

Yay! Now: does it really have to be this hard?