Canonical Voices

What David Owen talks about

David Owen

This last Saturday, I gave a presentation on classification, starting with a basic linear classifier and building it up to k-class, n-dimensional classification. I've been asked for the material I used, so here are the links:
Continue reading "Material from my presentation on Classification"
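As a taste of where such a presentation starts, a basic two-class linear classifier can be sketched in a few lines of Python (this is my own illustrative sketch, not the presentation material itself):

```python
def linear_classify(w, b, x):
    """Classify a point x against the hyperplane w . x + b = 0.

    Returns +1 if x lies on the positive side, -1 otherwise.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# A 2-D example: the line x + y = 1 separates the two classes.
w, b = (1.0, 1.0), -1.0
print(linear_classify(w, b, (2.0, 2.0)))  # 1  (above the line)
print(linear_classify(w, b, (0.0, 0.0)))  # -1 (below the line)
```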

David Owen

Working with wrap-around

Sometimes we need to work with numbers that wrap around, such as angles or indexes into a circular array. There are some tricks for dealing with these cases.
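As a flavor of the tricks involved, here is one common pattern, sketched in Python (my own illustration): modular arithmetic for circular indexes, and computing the shortest signed difference between two angles.

```python
def wrap_index(i, n):
    """Wrap an index into [0, n), so that -1 becomes n - 1, etc."""
    return i % n

def angle_diff(a, b):
    """Shortest signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180) % 360 - 180

print(wrap_index(-1, 10))   # 9
print(angle_diff(350, 10))  # -20, the short way around
```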
Continue reading "Working with wrap-around"

David Owen

A circular sector is the part of a circle bounded by two radii and an arc connecting them. Determining whether a point lies in a sector is not quite as easy as rectangular bounds-checking. There are a few different approaches: θ-comparison, line-sides, and dot-products.
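Of the approaches mentioned, θ-comparison is the most direct to sketch (my own illustration, in Python; it assumes the sector spans less than a full circle):

```python
import math

def point_in_sector(p, center, radius, start_angle, end_angle):
    """Test whether point p lies in the circular sector centered at
    `center`, with the given radius, spanning counter-clockwise from
    start_angle to end_angle (radians): the theta-comparison approach.
    """
    dx, dy = p[0] - center[0], p[1] - center[1]
    if dx * dx + dy * dy > radius * radius:
        return False
    theta = math.atan2(dy, dx)
    # Wrap both the sector's span and the point's angle so the sector
    # starts at 0; then a single comparison suffices.
    span = (end_angle - start_angle) % (2 * math.pi)
    offset = (theta - start_angle) % (2 * math.pi)
    return offset <= span

print(point_in_sector((1, 1), (0, 0), 2, 0, math.pi / 2))   # True
print(point_in_sector((1, -1), (0, 0), 2, 0, math.pi / 2))  # False
```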
Continue reading "Bounds-checking with a circular sector"

David Owen

There's a simple process you can use to hit your big goals, reliably.

Perhaps you've tried and failed in the past. Maybe you can hit smaller goals, but have trouble maintaining focus on bigger goals, or connecting them to your day-to-day life.

I used to struggle with this, too, until I figured out the right process. Then, I hit every single big goal I had for the next several years—and all of them ahead of schedule!

And, the process is pretty simple.


Continue reading "How to hit your goals"

David Owen

I recently finished up the main development effort of a new project. It incorporates a lot of database techniques I'd used in previous projects, and several that I hadn't.

Description

The application is written in Django, but uses SQLAlchemy to the nearly complete exclusion of Django's own ORM. The only thing that touches Django's ORM is the session module.

The view-handling functions perform first-level security checks, scrub and coerce input from strings to other types, and report errors to the user. These functions then call into logic functions, which perform most of the interaction with the database. Database interactions use SQLAlchemy as essentially a wrapper over the DBAPI, with queries written as strings and parameters passed positionally.
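The layering described above might look like the following sketch. This is illustrative only: it uses the stdlib sqlite3 module in place of SQLAlchemy, and the schema and function names are invented.

```python
import sqlite3

def fetch_recent_orders(conn, customer_id, limit):
    """A 'logic function' in this sense: it owns the SQL string and
    takes already-scrubbed, already-coerced parameters from the view
    layer. Parameters are passed positionally, never interpolated
    into the query string.
    """
    query = (
        "SELECT id, total FROM orders "
        "WHERE customer_id = ? "
        "ORDER BY id DESC LIMIT ?"
    )
    return conn.execute(query, (customer_id, limit)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 7, 9.5), (2, 7, 3.0), (3, 8, 1.0)])
print(fetch_recent_orders(conn, 7, 10))  # [(2, 3.0), (1, 9.5)]
```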

The database is PostgreSQL (which has an excellent set of features). Database views and database functions do all the heavy lifting; the logic functions themselves mostly use their own parameters to filter or sort view results (at the database), do a simple join or two, or perform data updates. The views do a lot of aggregation, including statistics, modeling, and forecasts; much of the aggregation uses window functions.
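The pattern of freezing aggregation into named views, leaving application code to merely filter and sort, can be illustrated with the stdlib sqlite3 module (the schema and names here are invented, and a plain GROUP BY stands in for the window functions mentioned above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 10), ('north', 5), ('south', 7);

    -- The aggregation lives in the view, not in application code.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total, COUNT(*) AS n
        FROM sales GROUP BY region;
""")

# The 'logic function' only filters and orders the view's results.
rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY total DESC"
).fetchall()
print(rows)  # [('north', 15.0), ('south', 7.0)]
```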

A migration system handles all schema updates, which include: data structure (CREATE/ALTER/DROP TABLE), data (UPDATE), views (CREATE/DROP VIEW), functions (CREATE/DROP FUNCTION), and grants (GRANT/REVOKE).

Migrations are managed by a small set of simple bash scripts called pg-migrate. Migrations have a single total order, with individual migrations identified by contiguous positive ordinals. A migration script is a single SQL file, named with its ordinal, a hyphen, some brief descriptive text (restricted to characters allowed in filenames), and a .sql extension. The bash scripts automatically wrap individual migrations in transaction blocks during execution.
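pg-migrate itself is bash, but purely to illustrate the naming and ordering scheme just described, here is a Python sketch (the filenames are invented) that sorts migration files by ordinal and checks that the ordinals are contiguous:

```python
import re

def order_migrations(filenames):
    """Sort migration filenames of the form '<ordinal>-<description>.sql'
    by numeric ordinal, verifying the ordinals are contiguous positive
    integers starting at 1.
    """
    pattern = re.compile(r"^(\d+)-.*\.sql$")
    parsed = []
    for name in filenames:
        m = pattern.match(name)
        if not m:
            raise ValueError("not a migration filename: %s" % name)
        parsed.append((int(m.group(1)), name))
    parsed.sort()
    ordinals = [n for n, _ in parsed]
    if ordinals != list(range(1, len(parsed) + 1)):
        raise ValueError("ordinals are not contiguous from 1")
    return [name for _, name in parsed]

print(order_migrations(["2-add-users.sql", "1-init.sql", "3-grants.sql"]))
# ['1-init.sql', '2-add-users.sql', '3-grants.sql']
```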

What went well

Using database views is a huge win for abstraction.

Views are the first level of named abstraction for queries, like functions are for code in general-purpose languages. By giving names to parts of complicated queries, they become much easier to understand.

It's very easy to play with queries at psql's REPL until they're right, then freeze them into views. If you decide that a query's grouping, partitioning, or aggregating must be changed, it's often easier to modify a query than it would be to modify the equivalent nested and sequential loops in non-declarative code.

With views, program logic becomes very simple—just selecting from a view, maybe filtering and ordering, but very little post-processing.

Views don't lose optimization. The query planner essentially expands all views used in a query to derive one base-level query, then applies its optimization heuristics over that. Thus, a database can optimize across named-view boundaries. Typically, a procedural language cannot optimize across calls between different compilation units, and an object-oriented language cannot even optimize across any dynamic dispatch calls (i.e. invoking a method on an object). What might require looping over the data dozens of times in a typically-factored object-oriented solution, when translated into equivalent database views, could very well be accomplished in a single pass over the data.

Migrations were also a big success. Deploys were quick—I could do several a day. Many deploys happened with zero downtime, without using any kind of database or application fail-over.

Migrations were designed to stop at useful points if a human needed to update data before continuing.

What went poorly

I put site-specific data into some early migrations. Major mistake! I had to fix-up early migrations in order to cleanly build databases for new sites.

Modifying existing views is sometimes difficult. Changing the output columns or column-types of a view means you must DROP it first, then CREATE it over again. Otherwise, you can get away with CREATE OR REPLACE. In either case, you must pull up the definition from the last migration (the view's definition in a database dump is annotated to show type information and scope-resolution, but this is too clumsy to edit as a primary source).

If a view must be DROP'd for an update, every view that references it must also be DROP'd then CREATE'd again. Because of this, a migration which changes a single view can get very large by pulling in every dependent view.

In this project, editing views was different from editing the program. The source files for a program are a snapshot of the entire program—to change the program, change the snapshot, then reload the program. Editing the views, on the other hand, required applying changes through a diff (the migration). The snapshot could not be edited in place. Diffs naturally have a different language than snapshots, as they express how to modify something that already exists while maintaining integrity, whereas snapshots express the structure that will exist, ex nihilo.

I also lacked any sort of rigor in defining views near the beginning of the project. Some views unnecessarily included all of the underlying tables' fields. When the structure of those tables changed, the views had to change as well.

Queries in Python are painful. They're all strings, so they don't wrap like code, and can't bind directly with variables.

Paths to improvement

A migration should be named with only a number. Comments at the beginning of the file can provide explanatory text. The migration scripts would be augmented to parse these leading comments and display them to the user, when requested.
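Parsing those leading comments could work along these lines (a hypothetical sketch; pg-migrate has no such feature as described):

```python
def leading_comment(sql_text):
    """Collect the leading '--' comment lines of a migration file as
    its description, stopping at the first non-comment, non-blank line.
    """
    lines = []
    for line in sql_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("--"):
            lines.append(stripped[2:].strip())
        elif stripped == "":
            continue
        else:
            break
    return " ".join(lines)

migration = """-- Add the users table
-- and its grants.
CREATE TABLE users (id serial PRIMARY KEY);
"""
print(leading_comment(migration))  # Add the users table and its grants.
```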

Split schema definition into two parts, tentatively called structure and world. Structure would be expressed in migration files, and include operations to alter domains, tables, table constraints, or to update existing data to accommodate changing structure or constraints. It would also include any data that is not site-specific (site-specific data being banned from migrations). World would be a snapshot description of everything else—views, functions, etc.

To perform a migration, the migration scripts would drop everything in world, apply any pending migration scripts, then reload everything in world. Consideration would have to be given to reloading world in the event that a migration fails. Dropping world would also be disadvantageous to migrations that would benefit from defined views in order to update existing data.

The system should allow world to be split into multiple files, for some modicum of modularity. The system would have an option to discover a working load order and save that order to a file, so that for production migrations the modules could be reloaded quickly, without trial and error.
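Discovering a working load order can be done by exactly that trial and error: repeatedly attempt to load each remaining module, keep the ones that succeed, and stop when a full pass makes no progress. A sketch (the `load` callback here is a hypothetical stand-in for executing a SQL file):

```python
def discover_load_order(modules, load):
    """Find an order in which every module loads successfully.

    `load(name)` must raise when its dependencies are not yet present.
    Returns the successful order, or raises if no complete order exists.
    """
    remaining = list(modules)
    order = []
    while remaining:
        progressed = False
        for name in list(remaining):
            try:
                load(name)
            except Exception:
                continue
            order.append(name)
            remaining.remove(name)
            progressed = True
        if not progressed:
            raise RuntimeError("unloadable modules: %s" % remaining)
    return order

# Toy dependency graph: b needs a, c needs b.
deps = {"a": [], "b": ["a"], "c": ["b"]}
loaded = set()
def load(name):
    if not all(d in loaded for d in deps[name]):
        raise RuntimeError("missing dependency")
    loaded.add(name)

order = discover_load_order(["c", "b", "a"], load)
print(order)  # ['a', 'b', 'c']
```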

Views should be classified by type. I have discovered three basic types:

  1. An extension view, which extends a table with some extra, easy-to-calculate or easy-to-join fields. Such views seldom have aggregations, and joins are usually one-to-one. Extension views should include every field of the base table. Code will usually query an extension view instead of its underlying table, to get the extension fields in addition to the base fields. Extension views are named after the underlying table, with _ext appended.
  2. A processing view. These views may perform significant aggregation and joining over several tables, to provide some well-defined, narrowly-scoped, output. Processing views include only the applicable primary keys (or candidate keys) from the underlying tables, plus their own contributions. Processing views are named according to their purpose.
  3. A summary view. Summary views are based on a single table or single extension view, but join information from related processing views. They include most or all fields from all data sources. Summary views are named with an appended _summary.

This scheme maximally decouples the processing views, which are the most complex, from changes in underlying tables. Extension and summary views have simple definitions, often using * in their select lists, so they are less annoying to update to match participating tables.

I would also consider using SQLAlchemy's table-definition features, which would give IDEs better code formatting and comprehension of table and field names. However, they can't be the definitive source from which the database is built, because migrations can't be expressed there: it's a snapshot language, not a diff language. Maintaining the definitions in both places would also violate the DRY principle (don't repeat yourself).

David Owen

Style-checkers check code for certain statically-determinable properties. These properties may affect dynamic behavior (e.g. all references are definitely initialized before use), or they may be purely textual (e.g. code is indented in a particular way).

Teams use style-checkers to ensure some typical dynamic and textual standard in the code base. The idea is that common errors can be prevented, and that a common textual style will aid readability, especially between code from different authors.1

In reality, style-checking is far from ideal, especially regarding textual consistency. Some section of code may have notably worse readability if wrapped to the project’s standard of 80 columns. Some object allocation may not have a matching deällocation because it is live for the program’s entire run.

No matter the reason, exceptions to the rules always come up, and this is where style-checking becomes really difficult: how do you handle the exceptions? If you do nothing, exceptions will accumulate until the style-checker’s report is overwhelming. At best, the report will be ignored; at worst, it will become demoralizing to developers. In such a situation, style-checking may be worse than no style-checking at all.

To keep the report positively motivating, you must make a change somewhere:

  • You could alter the code to make it compliant with the style rules. This may directly compromise either the design integrity or actual readability of the code.
  • You could alter the style rules so that the case is no longer exceptional. But the new rule will probably be more complicated, have more edge cases, and still have other (and new) exceptions. Besides, which is better: writing rules, or writing code?
  • You could mark the exception in the code with some kind of meta-comment, so that the style-checker ignores it on future runs; but, this adds clutter to the code and runs counter to the goal of readability.
  • You could mark the exception somewhere else. This keeps the code clean, and the developers in control, but this database would have to be intelligent enough to keep up with code motion as the project develops. The database should also be version-controlled with the source that it annotates.

This last approach would yield a “perfect” solution: it would only ever mark newly-introduced problems in the code. A “conservative,” approximate solution (one that marks every new problem, and might mark a few old) is possible if we make the following assumptions:

  • Code is under version-control;
  • The style-checker is run before or as part of the process of committing code to the main branch; and,
  • Genuine problems, when reported, will actually be addressed.

Given these constraints, a style-checker could report only the errors associated with code features that are new or changed in the current version, relative to the latest main-line version. It could then assume that any style exception already in the main line is there intentionally and need not be reported again (it would have been reported when it was first committed), so no separate exception database is needed. I call this “differential style-checking.”

The code features that the style-checker tracks could be either lines of text, or nodes of parse-trees. The latter might be more accurate, but the former might be easier to implement and good enough in most cases.

Here is a proof-of-concept. It works on only one Python file at a time, and the output is messy, but it demonstrates the essential principles. The style-checker is pylint version 0.19.0, as packaged in Ubuntu 10.04; I had no experience with it prior to writing this.

#!/bin/bash


usage() {
    echo $0 trunk_dir proposed_dir file
    exit 1
}


comment() {
    # $1 the file to comment
    # $2 the line number
    # $3 the comment to append
    head -n $(( $2 - 1 )) $1 >/tmp/tmp.py
    local target_line=$(head -n $2 $1 |tail -n1)
    echo "$target_line #{$3}" >>/tmp/tmp.py
    tail -n +$(( $2 + 1 )) $1 >>/tmp/tmp.py
    mv /tmp/tmp.py $1
}


process() {
    local type
    local line
    local msg
    local old_ifs="$IFS"
    IFS=":"
    while read type line msg
    do
        [ "$msg" ] && comment $1 $line "$type:$msg"
    done
    IFS="$old_ifs"
}


TRUNK=$1; shift
PROPOSED=$1; shift
FILE=$1; shift

[ -d "$TRUNK" -a -d "$PROPOSED" -a "$FILE" ] || usage
[ "$1" ] && usage

STRIPPED=${FILE%.py}
MODULE=${STRIPPED//\//.}

pushd $TRUNK >/dev/null
cp $FILE /tmp/left.py
pylint -rn $MODULE 2>/dev/null |process /tmp/left.py
popd >/dev/null

pushd $PROPOSED >/dev/null
cp $FILE /tmp/right.py
pylint -rn $MODULE 2>/dev/null |process /tmp/right.py
popd >/dev/null

diff -u /tmp/{left,right}.py |egrep '^[+@]'

For a sample run, I will use the Trac project, which I greatly respect. At revision 10114, the PostgreSQL backend received a change, which we will analyze.


cd /tmp
svn co -r10113 http://svn.edgewall.org/repos/trac/branches/0.12-stable 10113
svn co -r10114 http://svn.edgewall.org/repos/trac/branches/0.12-stable 10114
diffstyle 10113 10114 trac/db/postgres_backend.py

The (wrapped) output is:

+++ /tmp/right.py       2011-04-09 09:59:48.274189942 -0600
@@ -237,6 +237,11 @@
+    def update_sequence(self, cursor, table, column='id'): #{C:PostgreSQLConnection.update_sequence: Missing docstring} #{R:PostgreSQLConnection.update_sequence: Method could be a function}
+        cursor.execute("""
+            SELECT setval('%s_%s_seq', (SELECT MAX(id) FROM %s))
+            """ % (table, column, table))
+

We see that, according to pylint’s default checks, two new (potential) problems were introduced: the new method doesn’t have a docstring, and the new method could be a function outside of a class.2 Running pylint normally on postgres_backend.py revision 10114 produces 47 warnings. (The entire Trac package yields 10065.)

pylint -rn trac.db.postgres_backend |egrep '^[A-Z]' |wc -l
pylint -rn trac |egrep '^[A-Z]' |wc -l
End notes
  1. I believe that style-checkers can effectively guard against some types of runtime errors, just as compilers can. I don’t believe that style-checkers can be an effective judge of textual style.
  2. In this case, the style-checker got it wrong. This must be a method, because it is invoked polymorphically; although, it need not be an instance method—it could be a class method.

David Owen

During most of college, I had an awesome sleeping habit. My alarm was set for 7:30 am every day, without exception. And, every day, without exception, I would wake up at 7:28 am, two minutes ahead of the alarm.

It didn’t matter when I went to bed the night before. 10:00 pm, 2:00 am, it didn’t matter; though, if I had stayed up late, I would get sleepy sooner and go to bed earlier the next time.

But near the end of college, that magic broke. I started struggling to maintain a decent wake schedule, waking up closer to 10:00 am, even with a normal day-time job.

Although I still don’t have the amazing abilities I had in college, I was able to get myself back to a good wake schedule. Here’s how I do it.

Reinforce your alarm response

On a day that you have some extra time, set your normal alarm for a few minutes from now. Get in bed, close your eyes, and relax. As soon as your alarm goes off, jump out of bed. Yes, really: jump. Don’t just roll out—be active. Repeat this several times.

This is especially useful if you find yourself sleeping through your alarm, and not remembering that it even went off.

Align your eating schedule

Your sleeping rhythm is affected by several inputs, one of those being digestion.

Eat breakfast soon after waking. I usually eat something within 10 minutes of waking, although an hour or so might be just as good. Don’t put off breakfast! It tells your body that this is a time to be awake.

Don’t eat right before sleeping. If you must eat something, keep it small and light. Fat and protein extend your digestion, and will make you sleep in. Avoid simple carbohydrates as well, as they will keep you awake.

Fast to make drastic adjustments

If your sleep schedule is way off of where you want it to be, skip dinner. The lack of digestion lets your body sleep more easily, and you’ll be more likely to eat a larger breakfast in the morning, kicking your digestion into full gear.

If you constantly struggle to keep your sleep schedule set, you might consider fasting (dinner) on a regular basis.

Get plenty of sunshine

Sun shining on your skin creates vitamin D and other chemicals that are essential to effective use of other nutrients, good mood, and even your sleep cycle.

If you work inside most of the day, take regular breaks to go outside. Walk around the building at the least, or find some nearby landscaping that you like and lie down in the sun.

If you really can’t get natural sunlight, many people recommend full-spectrum lamps as fairly good substitutes.

Have something to look forward to

If you go to sleep with nothing but worries, you’re much more likely to want to sleep in. The day has nothing positive, so why not put it off, right?

So, about 30-60 minutes before you go to bed, think of one or two very positive things for the coming day. Then, make sure to visualize those as vividly as possible as you fall asleep. When your alarm goes off in the morning, you’ll wake up excited, even if you’re not consciously thinking about those things.

Don’t try to come up with something positive once you’re already in bed—you’re likely to fall asleep before you do, or to not have enough visualization time for them to sink in well enough. Take the time to prepare for tomorrow!

You might be in genuinely stressful or negative situations right now that are near impossible to avoid thinking about. That’s okay; acknowledge it, but always keep it in terms of some form of improvement, and make something to look forward to. Your life is your own, to make better, or not, as you wish.

Be consistent

Set your alarm, get out of bed on time, eat early, enjoy the sun, and think positive, every single day, even if it’s a weekend, holiday, or vacation. Good, well-chosen habits are a key to success.

David Owen

Test

Just testing my group’s aggregator…
