Decimail Server - Using the Database

Using the Database

1. Message Storage

Messages are stored in the table "messages". This stores one message per row, with the headers and body each stored in their own binary field. PostgreSQL copes with arbitary-size fields like this automatically using something called "TOAST", which work as far as I can see; I have stored several multi-megabyte messages.

The primary key to the messages table is the msg_id. msg_ids are integers and are allocated from the sequence "msg_ids".

Other fields in the messages table cache values from the message headers that might be wanted for queries. These are the subject, date, from address and rfc822_messageid (not to be confused with the msg_id; this is the value of the Message-ID: header from the message, which is needed to track replies.). A separate table, "recipients", records the addresses from the To:, Cc: and Bcc: fields. It includes the address, the header type and the corresponding msg_id.

Here is an example. Given these two messages:

From: a@foo.com To: b@blah.com, c@def.com Subject: Hello Date: 1/1/2004 Hello

From: b@blah.com To: a@foo.com Cc: x@y.com Subject: Re: Hello Date: 2/1/2004 Hello to you too!

We store the following in the tables (note that I have omitted the rfc822_messageid column):

messages
msg_id	subject	msgdate	from_addr	headers	body
1	Hello	1/1/2004	a@foo.com	From: a@foo.com To: b@blah.com, c@def.com Subject: Hello Date: 1/1/2004	Hello
2	Re: Hello	2/1/2004	b@blah.com	From: b@blah.com To: a@foo.com Cc: x@y.com Subject: Re: Hello Date: 2/1/2004	Hello to you too!

recipients
recip_type	addr	msg_id
to	b@blah.com	1
to	c@def.com	1
to	a@foo.com	2
cc	x@y.com	2

These tables are created by the create_db.sql script. You should only run this script once, to create a new empty message store. If you run it again, ALL YOUR MESSAGES WILL BE DELETED. If, for example, a new version of Decimail redefines these tables, rather than re-run this script you should issue ALTER TABLE or similar commands to bring your existing tables' formats up to date.

2. tsearch2 Indexing

To support mailboxes that contain messages matching certain kerwords in their content, messages are indexed using the tsearch2 text searching extension as they are inserted. You should certainly read the tsearch2 documentation in order to understand what is going on.

At present, the first 32kbytes of the first text part of each message are indexed. To support future expansion, the index table, part_tsearch, is keyed on the message id and the part name in IMAP format.

3. Flags

IMAP defines various flags that can be associated with messages to inidcate states such as seen/unseen, deleted, replied-to etc. Decimail stores these flags in the table imap_message_flags, with each row recording one flag for one message.

IMAP defines a "Seen" flag. Since generally more messages are Seen than Unseen, Decimail choses to instead store an Unseen flag, and to reverse the logic when necessary.

4. Mailboxes

Decimail mailboxes are defined in terms of an SQL query that returns the msg_ids of the messages that belong in that mailbox. These queries are stored in a table called "mailboxes".

Defining interesting queries to go in this table is the main thing that makes Decimail more "intelligent" than other mail servers. This section describes some possible queries.

The "mailboxes" table stores the queries along with the names of the mailboxes that they define. Generally the mailboxes are not defined individually but rather a group of mailboxes are defined automatically in some way. For example a group of mailboxes for individual correspondents can be defined based on a table of names and email addresses. The mailboxes table is defined as the union of a collection of such tables.

When storing SQL query strings in an SQL table, care is needed with quoting. The PostgreSQL documentation has some ghastly examples where eight or ten quotes seem to be needed. Luckily none of the following examples need anything that extreme.

4.1. Chronological Mailboxes

One of the most obvious ways to group messages is by date. To do this one simply defines queries such as the following:

select msg_id from messages where msgdate between timestamp with time zone '2004-1-1 00:00:00' and timestamp with time zone '2004-2-1-00:00:00'

This one selects all messages from January 2004. Clearly a series of such mailboxes can be generated automatically.

4.2. Recent Messages

It is possible to select all of today's messages using a query like the following:

select msg_id from messages where msgdate > now()::timestamp with time zone - '1 day'::interval

mailboxes.sql creates similar mailboxes for this week and this month. An important issue here is that the value of now() changes (obviously!). The consequence is that messages expire from the definition of "today" as they become more than 24 hours old. This is fine in principle, but it confuses IMAP, because it does not expect messages' positions in the mailbox to change. Decimail fixes this as follows. When a mailbox query is read the string "now()" is replaced by the current date and time. This value continues to be used for as long as the mailbox is kept open. A consequence of this is that you must not use other functions that indirectly access now(), such as age().

4.3. Per-Correspondent Mailboxes

Another obvious way to group messages is by correspondent. Messages from a particular address can be found as follows:

select msg_id from messages where from_addr='xxx@yyy.com';

Messages to an address can be found using the recipients table:

select msg_id from recipients where addr='xxx@yyy.com';

To find all messages to or from an address (including cc and bcc messages), one can use find the union of these queries:

select msg_id from messages where from_addr='xxx@yyy.com' union select msg_id from recipients where addr='xxx@yyy.com';

If an address book table is defined, per-correspondent mailboxes can be created automatically.

5. Efficiency and Indexes

It's fairly important that mailbox queries can be executed reasonably quickly as the IMAP daemon runs one each time a mailbox is selected, and users perceive the execution time as a delay between selecting a mailbox and seeing a listing of its content.

The easiest way to measure the execution time of a query is using the EXPLAIN ANALYSE command from the psql command line. For example, here is the query for "Today's messages" again:

select msg_id from messages where msgdate > now()::timestamp with time zone - '1 day'::interval

You can time this as follows: (note that the output below is not completely up to date with the current database structure)

$ psql -d decimail -U decimail => explain analyse select msg_id from messages -> where msgdate > now()::timestamp with time zone - '1 day'::interval; QUERY PLAN ------------------------------------------------------------------------------------- Index Scan using messages_by_date on messages (cost=0.00..858.75 rows=1227 width=4) (actual time=0.249..0.508 rows=27 loops=1) Index Cond: (msgdate > (now() - '1 day'::interval)) Total runtime: 1.001 ms (3 rows)

The important bit is at the bottom where it reports the total execution time as 1 millisecond. (This is with about 5000 messages to consider on a year-2000-vintage Celeron 500 machine.) I don't think that many users would notice that latency.

Here is a run for a slightly more complex query which selects all of today's non-mailing-list messages:

=> explain analyse select msg_id from messages m -> where m.msgdate > now()::timestamp with time zone - '1 day'::interval -> and not exists ( -> select 1 from to_addrs ta -> join mailing_lists ml on (ta.addr = ml.submit_addr) -> where ta.msg_id = m.msg_id -> ); QUERY PLAN ---------------------------------------------------------------------------------------- Index Scan using messages_by_date on messages m (cost=0.00..41478.35 rows=614 width=4) (actual time=0.439..4.073 rows=25 loops=1) Index Cond: (msgdate > (now() - '1 day'::interval)) Filter: (NOT (subplan)) SubPlan -> Hash Join (cost=3.11..33.12 rows=1 width=0) (actual time=0.110..0.110 rows=0 loops=26) Hash Cond: ("outer".submit_addr = "inner".addr) -> Seq Scan on mailing_lists ml (cost=0.00..20.00 rows=1000 width=32) (actual time=0.008..0.013 rows=2 loops=26) -> Hash (cost=3.11..3.11 rows=3 width=25) (actual time=0.054..0.054 rows=0 loops=26) -> Index Scan using to_addrs_by_msg_id on to_addrs ta (cost=0.00..3.11 rows=3 width=25) (actual time=0.035..0.041 rows=1 loops=26) Index Cond: (msg_id = $0) Total runtime: 4.619 ms (11 rows)

This takes longer to run as for each message it looks up its To: address to see if it's in the table of mailing lists. But 4.7ms is still quick enough that you don't need to worry.

Getting these quick execution times depends on a couple of things. First, there can often be more than one way of writing a query in SQL, and some can be hugely more efficient than others. Second, you need to make sure that suitable indexes are present.

Considering the first point, I will not try to present a tutorial on writing efficient queries here, but the PostgreSQL documentation and mailing lists will often have the answers. Just one suggestion is to avoid the INTERSECT and EXCEPT operators. For example the query above for "all today's messages except mailing list messages" could be coded literally as "(all today's messages) EXCEPT (all mailing list messages)". Look how slow that is:

=> explain analyse select msg_id from messages m -> where m.msgdate > now()::timestamp with time zone - '1 day'::interval -> EXCEPT select msg_id from to_addrs ta -> join mailing_lists ml on (ta.addr = ml.submit_addr); QUERY PLAN ---------------------------------------------------------------------------------------- SetOp Except (cost=2467.36..2491.13 rows=476 width=4) (actual time=156.049..156.175 rows=25 loops=1) -> Sort (cost=2467.36..2479.24 rows=4755 width=4) (actual time=155.789..155.910 rows=94 loops=1) Sort Key: msg_id -> Append (cost=0.00..2176.94 rows=4755 width=4) (actual time=0.192..155.356 rows=94 loops=1) -> Subquery Scan "*SELECT* 1" (cost=0.00..871.02 rows=1227 width=4) (actual time=0.188..0.560 rows=26 loops=1) -> Index Scan using messages_by_date on messages m (cost=0.00..858.75 rows=1227 width=4) (actual time=0.172..0.424 rows=26 loops=1) Index Cond: (msgdate > (now() - '1 day'::interval)) -> Subquery Scan "*SELECT* 2" (cost=69.83..1305.92 rows=3528 width=4) (actual time=112.279..154.560 rows=68 loops=1) -> Merge Join (cost=69.83..1270.64 rows=3528 width=4) (actual time=112.263..154.241 rows=68 loops=1) Merge Cond: ("outer".addr = "inner".submit_addr) -> Index Scan using to_addrs_by_addr on to_addrs ta (cost=0.00..1117.92 rows=11987 width=29) (actual time=0.033..116.033 rows=11795 loops=1) -> Sort (cost=69.83..72.33 rows=1000 width=32) (actual time=0.105..0.197 rows=46 loops=1) Sort Key: ml.submit_addr -> Seq Scan on mailing_lists ml (cost=0.00..20.00 rows=1000 width=32) (actual time=0.027..0.038 rows=2 loops=1) Total runtime: 156.833 ms (15 rows)

A few points about using EXPLAIN ANALYSE:

The results are meaningless until you have a reasonable number of messages in your database.
PostgreSQL needs to analyse the database periodically. This may happen automatically as the result of a cron job or similar. If the database is not being analysed it will not find the best way to execute the query.
If you have relatively few messages and plenty of memory you will see an improvement running EXPLAIN ANALYSE a second or subsequent time due to caching. It is up to you whether to use the from-disk or the from-memory numbers, but be consistent.

The other consideration is indexes. They speed up queries by replacing linear searches through a table with logarithmic searches in a B tree. A number of indexes are created by the indexes.sql script but you may need others if you use different attributes in your queries.

To measure the benefit of an index, try dropping (deleting) it. (You can always recreate indexes.) Here is the example of "all today's messages" again:

=> drop index messages_by_date; => explain analyse select msg_id from messages -> where msgdate > now()::timestamp with time zone - '1 day'::interval; QUERY PLAN ---------------------------------------------------------------------------------------- Seq Scan on messages (cost=0.00..2426.21 rows=1319 width=4) (actual time=478.630..479.568 rows=26 loops=1) Filter: (msgdate > (now() - '1 day'::interval)) Total runtime: 479.814 ms (3 rows)

The query time is increased from 1ms to 480ms. The thing to look out for is anything that says "Seq Scan" - it could probably benefit from adding an index. See the PostgreSQL documentation for details.

6. Multiple Users

The preceding sections have all glossed over the fact that the database stores messages for more than one user. This section describes the implications of having more than one person using Decimail.

You should be aware that Decimail is weak in some of these areas because I use it as an essentially single-user system, and so don't need most of these features.

6.1. Identifying Recipients

The SMTP daemon is responsible for working out which user an incoming message is for; it does this by consulting the incoming_addresses table. This gives a series of address paterns and corresponding usernames. By using patterns rather than fixed address strings it is possible to indicate that all messages to a particular domain go to one user. There can be more than one row per user.

6.2. Multiple Recipients

The messages table includes an owner attribute that indicates which user owns the message.

One potential feature that is not implemented is the ability to store a message only once when it has been sent to multiple recipients. This is particularly useful in a corporate context where large documents can be circulated to large numbers of recipients. Storage requirements could be substantially reduced if such messages were stored only once. To achieve this it would be necessary to factor-out the owner attribute into a separate table.

In this scenario it is necessary for each user to have their own set of flags for a message so that they each have their own idea of whether the message has been read or deleted. This is implemented; the flags table records a username as well as a msg_id. (Expunging messages poses a challenge that has not yet been considered.)

6.3. Mailbox Queries

Mailbox queries need to be limited to those messages owned by the current user: when users can define their own mailbox queries it is undesireable for users to be able to define queries that allow them to read other people's email. This is done by means of a set of views created when the IMAP LOGIN command is executed. For each of the tables that the mailbox query may use (i.e. messages, recipients, imap_message_flags and also actions and imap_mailbox_flags) a view with the same name but with "u_" prepended (i.e. u_messages etc.) is created. The mailbox queries are written in terms of these views.

At present there is nothing to stop mailbox queries from reading the raw tables and hence other users' messages. In the longer term I will add some sort of permissions mechanism so that only the views are accessible.

6.4. Common Mailboxes

In a multi-user system, users can share a set of "common" mailbox definitions. I don't mean that they actually share a mailbox (though this would also be possible in principle), but rather that every user has a mailbox called, for example, "January 2004", containing their own messages from that month. One can envisage an arangement where a shared address book plus private address books exist and all users have by-correspondent mailboxes for those addresses in the shared address book.

mailboxes.sql defines these common mailboxes in the table common_mailboxes (actually a view which is a union of the different types of common mailbox), and then near the end of the file it instantiates them per user using this bit of SQL:

create view common_mailboxes_per_user as select username, mailbox_name, query from users cross join common_mailboxes;

7. Setting Up the Database

During installation you need to set up the database to suit your preferences and local environment. If you want to get up and running quickly I suggest that:

You run the scripts in the sql directory as described in the Installation document.
You make a copy of example.sql and edit it to define a username and password, incoming email address pattern and so on; you can invent the details if you like.
Run this script. The mailboxes table should be populated automatically. You can check if this has done the right thing by running psql interactively and looking at the mailboxes table.
As mentioned in example.sql you probably don't want to clear and regenerate the users table very often, so having run your script once I suggest that you comment-out the line(s) that add users.

Subsequently you'll probably want to make greater changes. I'm not sure to what extent I should supply "standard" or "example" database setups. Feedback is invited, along with your examples of useful mailbox queries.