On Github bbenko / doiainn
By Božidar Benko / @bbenko / CTO @VuMedi
I love it and use it all the time.
I'm productive in it (more than in SQL).
All CRUD should be done with stored procedures.
old schoolORM is all I need.
site owners with no usersAt VuMedi, we have users in various groups.
Most interesting for us are 2 surgeon groups:
How many surgeons do we have?
User.objects.filter(groups__name__in=['Surgeon', 'Flagged']).count()
Hey BB, are you sure? That seems a bit too much.
- Off course I'm sure, but OK, to check, I'll write the same thing using Qs.
User.objects.filter(Q(groups__name='Surgeon') | \ Q(groups__name='Flagged')).count()
One meeelion again!
No, seriously, that's too much.
Some users are both 'Surgeon' and 'Flagged'.
Maybe you are counting those twice.
Hey, I know Boolean algebra.
1 OR 1 == 1
Not 2!
But I'll write some tests.
house = User.objects.create_user(username='house', email='house@vumedi.com', password='house123') self.assertEqual(User.objects.count(), 1) # only House # House is surgeon and flagged house.groups.add(self.surgeon, self.flagged) self.assertEqual(User.objects.filter(groups=self.surgeon).count(), 1) self.assertEqual(User.objects.filter(groups=self.flagged).count(), 1) # how many users that are surgeon or flagged? # only one user in db, it must be one surgeons = User.objects.filter(groups__in=[self.surgeon, self.flagged]) self.assertEqual(surgeons.count(), 1)
(doiainn)➜ doiainn git:(master) ✗ ./manage.py test Creating test database for alias 'default'... F ====================================================================== FAIL: test_m2m_count (app.tests.AppTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/bb/code/doiainn/code/doiainn/app/tests.py", line 27, in test_m2m_count self.assertEqual(surgeons.count(), 1) AssertionError: 2 != 1 ---------------------------------------------------------------------- Ran 1 test in 0.137s FAILED (failures=1) Destroying test database for alias 'default'...
surgeons = User.objects.filter(Q(groups=self.surgeon) | Q(groups=self.flagged)) self.assertEqual(surgeons.count(), 2)
SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", ... FROM "auth_user" INNER JOIN "auth_user_groups" ON ("auth_user"."id" = "auth_user_groups"."user_id") WHERE ("auth_user_groups"."group_id" = 1 OR "auth_user_groups"."group_id" = 2); 1|pbkdf2_sha256$20000$SMZHa8Fn8nG1$cGdKi... 1|pbkdf2_sha256$20000$SMZHa8Fn8nG1$cGdKi...
... because ...
SELECT COUNT(*) FROM auth_user_groups WHERE user_id=1; 2
self.assertEqual(surgeons.distinct().count(), 1)
(doiainn)➜ doiainn git:(master) ✗ ./manage.py test Creating test database for alias 'default'... . ---------------------------------------------------------------------- Ran 1 test in 0.153s OK Destroying test database for alias 'default'...
Django Admin is great. You use it to build your MVP.
Your table grows beyond 12M rows.
Change list view and search are now slow.
Admin displays total number of records in change list view.
Having 12M+ rows, total count doesn't have to be exact.
When filtered, it should be exact.
SELECT COUNT(*) FROM myapp_mymodel; count ---------- 12871557 (1 row) Time: 47299.633 ms SELECT reltuples FROM pg_class WHERE relname = 'myapp_mymodel'; reltuples ------------- 1.28186e+07 (1 row) Time: 0.410 ms
Django Admin Speedup For Big Tables on Postgres
class LargeTableChangeList(ChangeList): def get_results(self, request): ... if not self.query_set.query.where: full_result_count = result_count else: try: cursor = connection.cursor() cursor.execute("SELECT reltuples FROM pg_class WHERE relname = %s", [self.root_query_set.query.model._meta.db_table]) full_result_count = int(cursor.fetchone()[0]) except: full_result_count = self.root_query_set.count() ...
Django 1.8 introduced ModelAdmin.show_full_result_count to disable count.
It was a traffic usage table with MDN field.
MDN is 10 digit number, but can have leading zeros, so I used CharField.
mdn = models.CharField(...)
class MyModel(admin.ModelAdmin): search_fields = ['mdn', ]
does icontains - case insensitive contains
class MyModel(admin.ModelAdmin): search_fields = ['=mdn', ]
does iexact - case insensitive exact
I don't need case sensitivity.
But, at the time, you could not further customize the search.
# Apply keyword searches. def construct_search(field_name): ... elif field_name.startswith('='): return "%s__iexact" % field_name[1:]
From Django 1.6, you can use ModelAdmin.get_search_results.
SELECT "myapp_mymodel"."id", "myapp_mymodel"."mdn”, ... FROM "myapp_mymodel" WHERE UPPER("myapp_mymodel"."mdn"::text) = UPPER(‘1234567890’) LIMIT 100;
UPPER does case insensitivity.
It took ~7.5 seconds to execute this query.
Without UPPER, it took ~6 milliseconds (there was a btree index on a field).
Can't remove UPPER.
But you can do indexes on functions in Postgres.
Even better, you can do index on any expression.
I can create an index on UPPER(mdn)
CREATE INDEX myapp_mymodel_mdn_upper ON myapp_mymodel(UPPER(mdn)); CREATE INDEX Time: 377055.345 ms
The whole admin view with search results now comes back in under 2 seconds!
@VuMedi, our impression table is big.
=> SELECT COUNT(*) FROM tracking_impression; count ----------- 244116752 (1 row)
System and customer analytics (mostly) need only the last month of that data.
Quering this table is pretty slow.
Ordinary indexes don't help.
But, I can partition the data by months.
Splitting one large table into smaller physical pieces.
Master table / Django Model
class Impression(models.Model): user = models.ForeignKey(User) when = models.DateTimeField(auto_now_add=True) content_type = models.ForeignKey(ContentType) object_id = models.PositiveIntegerField() content_object = generic.GenericForeignKey() ...
CREATE TABLE "tracking_impression" ( "id" serial NOT NULL PRIMARY KEY, "user_id" integer NOT NULL REFERENCES "auth_user" ("id") DEFERRABLE INITIALLY DEFERRED, "when" timestamp with time zone NOT NULL, "content_type_id" integer NOT NULL REFERENCES "django_content_type" ("id") DEFERRABLE INITIALLY DEFERRED, "object_id" integer CHECK ("object_id" >= 0) NOT NULL, ... );
CREATE TABLE tracking_impression_2015_09 ( CHECK ( "when" >= '2015-09-01' AND "when" < '2015-10-01' ) ) INHERITS (tracking_impression); CREATE INDEX ON tracking_impression_2015_09 USING btree(user_id); CREATE INDEX ON tracking_impression_2015_09 USING btree("when");
We want to say:
INSERT INTO tracking_impression ...
and have the data be redirected into the appropriate partition table.
CREATE OR REPLACE FUNCTION tracking_impression_insert_trigger() RETURNS TRIGGER AS $$ BEGIN IF ( NEW."when" >= '2015-09-01' AND NEW."when" < '2015-10-01' ) THEN INSERT INTO tracking_impression_2015_09 VALUES (NEW.*); ELSIF ( NEW."when" >= '2015-08-01' AND NEW."when" < '2015-09-01' ) THEN INSERT INTO tracking_impression_2015_08 VALUES (NEW.*); ... ELSE RAISE EXCEPTION 'Date out of range.'; END IF; RETURN NULL; END; $$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION trg_tracking_impression_partition() RETURNS trigger AS $func$ DECLARE _tablename text := 'tracking_impression_' || to_char(NEW."when", 'YYYY_MM'); BEGIN IF NOT EXISTS ( SELECT 1 FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = 'public' AND c.relname = _tablename ) THEN ...
) THEN ... EXECUTE format('CREATE TABLE %I (CHECK ("when" >= %L AND "when" < %L)) INHERITS (public.tracking_impression)' , _tablename , to_char(NEW."when", 'YYYY-MM-01') , to_char((date_trunc('MONTH', NEW."when") + INTERVAL '1 MONTH')::date, 'YYYY-MM-DD') ); EXECUTE format('CREATE INDEX ON %I USING btree(user_id)', _tablename); EXECUTE format('CREATE INDEX ON %I USING btree("when")', _tablename); END IF; ...
... EXECUTE 'INSERT INTO ' || quote_ident(_tablename) || ' VALUES ($1.*) RETURNING id' USING NEW; RETURN NULL; END $func$ LANGUAGE plpgsql SET search_path = public; -- trigger CREATE TRIGGER tracking_impression_insert_trigger BEFORE INSERT ON tracking_impression FOR EACH ROW EXECUTE PROCEDURE trg_tracking_impression_partition();
INSERT INTO tracking_impression(user_id, "when", content_type_id, object_id, ..) SELECT user_id, "when", content_type_id, object_id, .. FROM tracking_impression_np;
-- old non partitioned table => SELECT "tracking_impression".* FROM "tracking_impression_np" WHERE ("tracking_impression_np"."when" BETWEEN '2014-09-05 00:00:00-07:00' AND '2014-10-04 00:00:00-07:00' AND ...; Time: 2983993.619 ms -- new partitioned table => SELECT "tracking_impression".* FROM "tracking_impression" WHERE ("tracking_impression"."when" BETWEEN '2014-09-05 00:00:00-07:00' AND '2014-10-04 00:00:00-07:00' AND ...; Time: 263.702 ms
Almost (everything) works out of the box.
return query.get_compiler(using=using).execute_sql(return_id) .../django/db/models/sql/compiler.py", line 914, in execute_sql return self.connection.ops.fetch_returned_insert_id(cursor) .../django/db/backends/__init__.py", line 529, in fetch_returned_insert_id return cursor.fetchone()[0] TypeError: 'NoneType' object is unsubscriptable
class Impression(models.Model): ... def save(self, *args, **kwargs): from django.db import connection connection.features.can_return_id_from_insert = False super(Impression, self).save(*args, **kwargs) connection.features.can_return_id_from_insert = True
Hey BB, there are duplicates when I sort in model's admin
by due_date field from not closed event.
class Model(models.Model): ... class ModelEvent(models.Model): model = models.ForeignKey(Model) due_date = models.DateField(...) closed = models.BooleanField(default=False)
This is how admin prepares queryset:
In [7]: Model.objects.all()\ .order_by('modelevent__due_date').distinct() Out[7]: [<Model: A>, <Model: B>, <Model: B>, <Model: C>, ...
2 events for model B --> model B twice in queryset.
SELECT DISTINCT "model"."id", ..., "modelevent"."due_date" FROM "model" LEFT OUTER JOIN "modelevent" ON ("model"."id" = "modelevent"."model_id") ORDER BY "modelevent"."due_date" ASC
Annotate on min due_date.
In [7]: Model.objects.all()\ .annotate(Min('modelevent__due_date'))\ .order_by('modelevent__due_date__min')\ .distinct() Out[7]: [<Model: A>, <Model: B>, <Model: C>, ...
That solved the duplicates.
But we need to use only closed=False in event.
Write SQL. Use extra.
Model.objects.all().extra(select={'modelevent__due_date__min': """SELECT MIN(due_date) FROM app_model WHERE model_id=app_model.id AND closed=false"""})