Django ORM Is All I Need... – NOT! – Unexpected results



Django ORM Is All I Need... – NOT! – Unexpected results

0 0


doiainn

Django ORM Is All I Need... NOT

On Github bbenko / doiainn

Django ORM Is All I Need...

NOT!

By Božidar Benko / @bbenko / CTO @VuMedi

Django ORM is great

I love it and use it all the time.

I'm productive in it (more than in SQL).

2 (Wrong) Mantras

All CRUD should be done with stored procedures.

old school

ORM is all I need.

site owners with no users

Pareto 80-20

  • Use ORM whenever I can
  • Know it's limitations
  • Know which (SQL) code it produces
  • Know the database
  • Tweak the db
  • Write SQL

ORM Limitations

  • Unexpected results
  • Slow execution
  • Some things not supported

Unexpected results

At VuMedi, we have users in various groups.

Most interesting for us are 2 surgeon groups:

  • Surgeon
  • Flagged

How many surgeons do we have?

Easy

User.objects.filter(groups__name__in=['Surgeon', 'Flagged']).count()

Too Much

Hey BB, are you sure? That seems a bit too much.

- Off course I'm sure, but OK, to check, I'll write the same thing using Qs.

User.objects.filter(Q(groups__name='Surgeon') | \
                    Q(groups__name='Flagged')).count()

One meeelion again!

Counting Twice?

No, seriously, that's too much.

Some users are both 'Surgeon' and 'Flagged'.

Maybe you are counting those twice.

Trust me, I'm an engineer

Hey, I know Boolean algebra.

1 OR 1 == 1

Not 2!

But I'll write some tests.

The Test

house = User.objects.create_user(username='house',
                                 email='house@vumedi.com',
                                 password='house123')
self.assertEqual(User.objects.count(), 1)  # only House

# House is surgeon and flagged
house.groups.add(self.surgeon, self.flagged)
self.assertEqual(User.objects.filter(groups=self.surgeon).count(), 1)
self.assertEqual(User.objects.filter(groups=self.flagged).count(), 1)

# how many users that are surgeon or flagged?
# only one user in db, it must be one
surgeons = User.objects.filter(groups__in=[self.surgeon,
                                           self.flagged])
self.assertEqual(surgeons.count(), 1)
						

The Fail

(doiainn)➜  doiainn git:(master) ✗ ./manage.py test
Creating test database for alias 'default'...
F
======================================================================
FAIL: test_m2m_count (app.tests.AppTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/bb/code/doiainn/code/doiainn/app/tests.py", line 27, in test_m2m_count
    self.assertEqual(surgeons.count(), 1)
AssertionError: 2 != 1

----------------------------------------------------------------------
Ran 1 test in 0.137s

FAILED (failures=1)
Destroying test database for alias 'default'...						

Same with Qs

        surgeons = User.objects.filter(Q(groups=self.surgeon) |
                                       Q(groups=self.flagged))
        self.assertEqual(surgeons.count(), 2)
						

SQL produced

SELECT "auth_user"."id",
       "auth_user"."password",
       "auth_user"."last_login",
       "auth_user"."is_superuser",
       "auth_user"."username",
       ...
FROM "auth_user"
INNER JOIN "auth_user_groups" ON
	("auth_user"."id" = "auth_user_groups"."user_id")
WHERE ("auth_user_groups"."group_id" = 1
       OR "auth_user_groups"."group_id" = 2);

1|pbkdf2_sha256$20000$SMZHa8Fn8nG1$cGdKi...
1|pbkdf2_sha256$20000$SMZHa8Fn8nG1$cGdKi...

... because ...

SELECT COUNT(*) FROM auth_user_groups WHERE user_id=1;
2
					

Distinct helps

self.assertEqual(surgeons.distinct().count(), 1)
						
(doiainn)➜  doiainn git:(master) ✗ ./manage.py test
Creating test database for alias 'default'...
.
----------------------------------------------------------------------
Ran 1 test in 0.153s

OK
Destroying test database for alias 'default'...
						

Slow Execution

Django Admin is great. You use it to build your MVP.

Your table grows beyond 12M rows.

Change list view and search are now slow.

Total Count

Admin displays total number of records in change list view.

Having 12M+ rows, total count doesn't have to be exact.

When filtered, it should be exact.

Postgres Has Count Approximation

SELECT COUNT(*) FROM myapp_mymodel;
count
----------
12871557
(1 row)
Time: 47299.633 ms

SELECT reltuples FROM pg_class WHERE relname = 'myapp_mymodel';
reltuples
-------------
 1.28186e+07
(1 row)
Time: 0.410 ms
						

DjangoSnippets have the code

Django Admin Speedup For Big Tables on Postgres

class LargeTableChangeList(ChangeList):
    def get_results(self, request):
        ...
        if not self.query_set.query.where:
            full_result_count = result_count
        else:
            try:
                cursor = connection.cursor()
                cursor.execute("SELECT reltuples
            		    FROM pg_class WHERE relname = %s",
                    [self.root_query_set.query.model._meta.db_table])
                full_result_count = int(cursor.fetchone()[0])
            except:
                full_result_count = self.root_query_set.count()
        ...
						

Django 1.8 introduced ModelAdmin.show_full_result_count to disable count.

But search is still slow

MDN field

It was a traffic usage table with MDN field.

MDN is 10 digit number, but can have leading zeros, so I used CharField.

mdn = models.CharField(...)
						

Admin default search

class MyModel(admin.ModelAdmin):
    search_fields = ['mdn', ]
						

does icontains - case insensitive contains

I don't need contains

class MyModel(admin.ModelAdmin):
    search_fields = ['=mdn', ]
						

does iexact - case insensitive exact

MDN Is Digits Only

I don't need case sensitivity.

But, at the time, you could not further customize the search.

# Apply keyword searches.
def construct_search(field_name):
    ...
    elif field_name.startswith('='):
        return "%s__iexact" % field_name[1:]
						

From Django 1.6, you can use ModelAdmin.get_search_results.

SQL DEBUGGING

SELECT "myapp_mymodel"."id", "myapp_mymodel"."mdn”, ...
FROM "myapp_mymodel"
WHERE UPPER("myapp_mymodel"."mdn"::text) = UPPER(‘1234567890’)
LIMIT 100;
						

UPPER does case insensitivity.

It took ~7.5 seconds to execute this query.

Without UPPER, it took ~6 milliseconds (there was a btree index on a field).

Index On Expressions

Can't remove UPPER.

But you can do indexes on functions in Postgres.

Even better, you can do index on any expression.

Postgres Rocks

I can create an index on UPPER(mdn)

CREATE INDEX myapp_mymodel_mdn_upper ON myapp_mymodel(UPPER(mdn));
CREATE INDEX
Time: 377055.345 ms
						

The whole admin view with search results now comes back in under 2 seconds!

Slow Execution II

@VuMedi, our impression table is big.

=> SELECT COUNT(*) FROM tracking_impression;
   count
-----------
 244116752
(1 row)
						

System and customer analytics (mostly) need only the last month of that data.

Quering this table is pretty slow.

Partitioning To The Rescue

Ordinary indexes don't help.

But, I can partition the data by months.

Postgres Partitioning

Splitting one large table into smaller physical pieces.

  • improved query performance
  • reduced index size for the heavily used parts (fit in RAM)
  • faster bulk loads and deletes
  • seldom-used data can be migrated to cheap storage

Partitioning Implementation

  • via table inheritence
  • one master table
  • each partition is created as child of the master table
  • master table is empty
  • table constraints can be added to define allowed keys in each partition
  • index on key column(s) can be created in each child
  • insert trigger to redirect data from master to appropriate partition

Partitioning By Months

Master table / Django Model

class Impression(models.Model):
    user = models.ForeignKey(User)
    when = models.DateTimeField(auto_now_add=True)
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey()
    ...
						
CREATE TABLE "tracking_impression" (
     "id" serial NOT NULL PRIMARY KEY,
     "user_id" integer NOT NULL REFERENCES "auth_user" ("id") DEFERRABLE INITIALLY DEFERRED,
     "when" timestamp with time zone NOT NULL,
     "content_type_id" integer NOT NULL
     		   REFERENCES "django_content_type" ("id")
     		   DEFERRABLE INITIALLY DEFERRED,
     "object_id" integer CHECK ("object_id" >= 0) NOT NULL,
     ...
);
						

Create Partition For Each Month

CREATE TABLE tracking_impression_2015_09 (
    CHECK ( "when" >= '2015-09-01' AND "when" < '2015-10-01' )
) INHERITS (tracking_impression);

CREATE INDEX ON tracking_impression_2015_09 USING btree(user_id);
CREATE INDEX ON tracking_impression_2015_09 USING btree("when");
						

Insert Trigger

We want to say:

INSERT INTO tracking_impression ...
						

and have the data be redirected into the appropriate partition table.

CREATE OR REPLACE FUNCTION tracking_impression_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW."when" >= '2015-09-01' AND
         NEW."when" < '2015-10-01' ) THEN
        INSERT INTO tracking_impression_2015_09 VALUES (NEW.*);
    ELSIF ( NEW."when" >= '2015-08-01' AND
            NEW."when" < '2015-09-01' ) THEN
        INSERT INTO tracking_impression_2015_08 VALUES (NEW.*);
    ...
    ELSE
        RAISE EXCEPTION 'Date out of range.';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;
						

Smart Partitioning Trigger

CREATE OR REPLACE FUNCTION trg_tracking_impression_partition()
  RETURNS trigger AS $func$
DECLARE
   _tablename text := 'tracking_impression_' ||
                      to_char(NEW."when", 'YYYY_MM');
BEGIN
   IF NOT EXISTS (
      SELECT 1
      FROM   pg_catalog.pg_class c
      JOIN   pg_catalog.pg_namespace n ON n.oid = c.relnamespace
      WHERE  n.nspname = 'public'
      AND    c.relname = _tablename
      ) THEN ...
  						
      ) THEN ...
       EXECUTE format('CREATE TABLE %I
                       (CHECK ("when" >= %L
                       AND "when" <  %L))
                       INHERITS (public.tracking_impression)'
        , _tablename
        , to_char(NEW."when",     'YYYY-MM-01')
        , to_char((date_trunc('MONTH', NEW."when")
          + INTERVAL '1 MONTH')::date, 'YYYY-MM-DD')
        );

       EXECUTE format('CREATE INDEX ON %I USING btree(user_id)',
                      _tablename);
       EXECUTE format('CREATE INDEX ON %I USING btree("when")',
                      _tablename);
   END IF; ...
						
   ...
   EXECUTE 'INSERT INTO ' || quote_ident(_tablename)
           || ' VALUES ($1.*) RETURNING id' USING NEW;
   RETURN NULL;
END
$func$ LANGUAGE plpgsql SET search_path = public;

-- trigger
CREATE TRIGGER tracking_impression_insert_trigger
BEFORE INSERT ON tracking_impression
FOR EACH ROW EXECUTE PROCEDURE trg_tracking_impression_partition();
						

The Data And The Speed

INSERT INTO
tracking_impression(user_id, "when", content_type_id, object_id, ..)
SELECT user_id, "when", content_type_id, object_id, ..
FROM  tracking_impression_np;
						
-- old non partitioned table
=> SELECT "tracking_impression".*
FROM "tracking_impression_np"
WHERE ("tracking_impression_np"."when"
BETWEEN '2014-09-05 00:00:00-07:00'
AND '2014-10-04 00:00:00-07:00' AND ...;
Time: 2983993.619 ms
-- new partitioned table
=> SELECT "tracking_impression".*
FROM "tracking_impression"
WHERE ("tracking_impression"."when"
BETWEEN '2014-09-05 00:00:00-07:00'
AND '2014-10-04 00:00:00-07:00' AND ...;
Time: 263.702 ms
						

Django Compatibility

Almost (everything) works out of the box.

return query.get_compiler(using=using).execute_sql(return_id)
.../django/db/models/sql/compiler.py", line 914, in execute_sql
    return self.connection.ops.fetch_returned_insert_id(cursor)
.../django/db/backends/__init__.py", line 529,
in fetch_returned_insert_id
    return cursor.fetchone()[0]
TypeError: 'NoneType' object is unsubscriptable
						
class Impression(models.Model):
	...
    def save(self, *args, **kwargs):
        from django.db import connection
        connection.features.can_return_id_from_insert = False
        super(Impression, self).save(*args, **kwargs)
        connection.features.can_return_id_from_insert = True
						

That's My Man, Postgres

Some Things Not Supported

Hey BB, there are duplicates when I sort in model's admin

by due_date field from not closed event.

Duplicates again

class Model(models.Model):
    ...

class ModelEvent(models.Model):
    model = models.ForeignKey(Model)
    due_date = models.DateField(...)
    closed = models.BooleanField(default=False)
						

This is how admin prepares queryset:

In [7]: Model.objects.all()\
                     .order_by('modelevent__due_date').distinct()
Out[7]: [<Model: A>, <Model: B>, <Model: B>, <Model: C>, ...
						

2 events for model B --> model B twice in queryset.

Not even distinct helps

SELECT DISTINCT "model"."id",
                ...,
                "modelevent"."due_date"
FROM "model"
LEFT OUTER JOIN "modelevent" ON ("model"."id" = "modelevent"."model_id")
ORDER BY "modelevent"."due_date" ASC
                        

Take Only The First Event

Annotate on min due_date.

In [7]: Model.objects.all()\
                     .annotate(Min('modelevent__due_date'))\
                     .order_by('modelevent__due_date__min')\
                     .distinct()
Out[7]: [<Model: A>, <Model: B>, <Model: C>, ...
						

That solved the duplicates.

But we need to use only closed=False in event.

No Filter In Annotate

Write SQL. Use extra.

Model.objects.all().extra(select={'modelevent__due_date__min':
                            """SELECT MIN(due_date) FROM app_model
                               WHERE model_id=app_model.id
                               AND closed=false"""})
						

Questions?

VuMedi is hiring.
Django ORM Is All I Need... NOT! By Božidar Benko / @bbenko / CTO @VuMedi