Messages are stored in the table "messages". This stores one message per row, with the headers and body each stored in their own binary field. PostgreSQL copes with arbitary-size fields like this automatically using something called "TOAST", which work as far as I can see; I have stored several multi-megabyte messages.
The primary key to the messages table is the msg_id. msg_ids are integers and are allocated from the sequence "msg_ids".
Other fields in the messages table cache values from the message headers that might be wanted for queries. These are the subject, date, from address and rfc822_messageid (not to be confused with the msg_id; this is the value of the Message-ID: header from the message, which is needed to track replies.). A separate table, "recipients", records the addresses from the To:, Cc: and Bcc: fields. It includes the address, the header type and the corresponding msg_id.
Here is an example. Given these two messages:
We store the following in the tables (note that I have omitted the rfc822_messageid column):
messages | |||||
---|---|---|---|---|---|
msg_id | subject | msgdate | from_addr | headers | body |
1 | Hello | 1/1/2004 | a@foo.com | From: a@foo.com To: b@blah.com, c@def.com Subject: Hello Date: 1/1/2004 | Hello |
2 | Re: Hello | 2/1/2004 | b@blah.com | From: b@blah.com To: a@foo.com Cc: x@y.com Subject: Re: Hello Date: 2/1/2004 | Hello to you too! |
recipients | ||
---|---|---|
recip_type | addr | msg_id |
to | b@blah.com | 1 |
to | c@def.com | 1 |
to | a@foo.com | 2 |
cc | x@y.com | 2 |
These tables are created by the create_db.sql script. You should only run this script once, to create a new empty message store. If you run it again, ALL YOUR MESSAGES WILL BE DELETED. If, for example, a new version of Decimail redefines these tables, rather than re-run this script you should issue ALTER TABLE or similar commands to bring your existing tables' formats up to date.
To support mailboxes that contain messages matching certain kerwords in their content, messages are indexed using the tsearch2 text searching extension as they are inserted. You should certainly read the tsearch2 documentation in order to understand what is going on.
At present, the first 32kbytes of the first text part of each message are indexed. To support future expansion, the index table, part_tsearch, is keyed on the message id and the part name in IMAP format.
IMAP defines various flags that can be associated with messages to inidcate states such as seen/unseen, deleted, replied-to etc. Decimail stores these flags in the table imap_message_flags, with each row recording one flag for one message.
IMAP defines a "Seen" flag. Since generally more messages are Seen than Unseen, Decimail choses to instead store an Unseen flag, and to reverse the logic when necessary.
Decimail mailboxes are defined in terms of an SQL query that returns the msg_ids of the messages that belong in that mailbox. These queries are stored in a table called "mailboxes".
Defining interesting queries to go in this table is the main thing that makes Decimail more "intelligent" than other mail servers. This section describes some possible queries.
The "mailboxes" table stores the queries along with the names of the mailboxes that they define. Generally the mailboxes are not defined individually but rather a group of mailboxes are defined automatically in some way. For example a group of mailboxes for individual correspondents can be defined based on a table of names and email addresses. The mailboxes table is defined as the union of a collection of such tables.
When storing SQL query strings in an SQL table, care is needed with quoting. The PostgreSQL documentation has some ghastly examples where eight or ten quotes seem to be needed. Luckily none of the following examples need anything that extreme.
One of the most obvious ways to group messages is by date. To do this one simply defines queries such as the following:
This one selects all messages from January 2004. Clearly a series of such mailboxes can be generated automatically.
It is possible to select all of today's messages using a query like the following:
mailboxes.sql creates similar mailboxes for this week and this month. An important issue here is that the value of now() changes (obviously!). The consequence is that messages expire from the definition of "today" as they become more than 24 hours old. This is fine in principle, but it confuses IMAP, because it does not expect messages' positions in the mailbox to change. Decimail fixes this as follows. When a mailbox query is read the string "now()" is replaced by the current date and time. This value continues to be used for as long as the mailbox is kept open. A consequence of this is that you must not use other functions that indirectly access now(), such as age().
Another obvious way to group messages is by correspondent. Messages from a particular address can be found as follows:
Messages to an address can be found using the recipients table:
To find all messages to or from an address (including cc and bcc messages), one can use find the union of these queries:
If an address book table is defined, per-correspondent mailboxes can be created automatically.
It's fairly important that mailbox queries can be executed reasonably quickly as the IMAP daemon runs one each time a mailbox is selected, and users perceive the execution time as a delay between selecting a mailbox and seeing a listing of its content.
The easiest way to measure the execution time of a query is using the EXPLAIN ANALYSE command from the psql command line. For example, here is the query for "Today's messages" again:
You can time this as follows: (note that the output below is not completely up to date with the current database structure)
The important bit is at the bottom where it reports the total execution time as 1 millisecond. (This is with about 5000 messages to consider on a year-2000-vintage Celeron 500 machine.) I don't think that many users would notice that latency.
Here is a run for a slightly more complex query which selects all of today's non-mailing-list messages:
This takes longer to run as for each message it looks up its To: address to see if it's in the table of mailing lists. But 4.7ms is still quick enough that you don't need to worry.
Getting these quick execution times depends on a couple of things. First, there can often be more than one way of writing a query in SQL, and some can be hugely more efficient than others. Second, you need to make sure that suitable indexes are present.
Considering the first point, I will not try to present a tutorial on writing efficient queries here, but the PostgreSQL documentation and mailing lists will often have the answers. Just one suggestion is to avoid the INTERSECT and EXCEPT operators. For example the query above for "all today's messages except mailing list messages" could be coded literally as "(all today's messages) EXCEPT (all mailing list messages)". Look how slow that is:
A few points about using EXPLAIN ANALYSE:
The other consideration is indexes. They speed up queries by replacing linear searches through a table with logarithmic searches in a B tree. A number of indexes are created by the indexes.sql script but you may need others if you use different attributes in your queries.
To measure the benefit of an index, try dropping (deleting) it. (You can always recreate indexes.) Here is the example of "all today's messages" again:
The query time is increased from 1ms to 480ms. The thing to look out for is anything that says "Seq Scan" - it could probably benefit from adding an index. See the PostgreSQL documentation for details.
The preceding sections have all glossed over the fact that the database stores messages for more than one user. This section describes the implications of having more than one person using Decimail.
You should be aware that Decimail is weak in some of these areas because I use it as an essentially single-user system, and so don't need most of these features.
The SMTP daemon is responsible for working out which user an incoming message is for; it does this by consulting the incoming_addresses table. This gives a series of address paterns and corresponding usernames. By using patterns rather than fixed address strings it is possible to indicate that all messages to a particular domain go to one user. There can be more than one row per user.
The messages table includes an owner attribute that indicates which user owns the message.
One potential feature that is not implemented is the ability to store a message only once when it has been sent to multiple recipients. This is particularly useful in a corporate context where large documents can be circulated to large numbers of recipients. Storage requirements could be substantially reduced if such messages were stored only once. To achieve this it would be necessary to factor-out the owner attribute into a separate table.
In this scenario it is necessary for each user to have their own set of flags for a message so that they each have their own idea of whether the message has been read or deleted. This is implemented; the flags table records a username as well as a msg_id. (Expunging messages poses a challenge that has not yet been considered.)
Mailbox queries need to be limited to those messages owned by the current user: when users can define their own mailbox queries it is undesireable for users to be able to define queries that allow them to read other people's email. This is done by means of a set of views created when the IMAP LOGIN command is executed. For each of the tables that the mailbox query may use (i.e. messages, recipients, imap_message_flags and also actions and imap_mailbox_flags) a view with the same name but with "u_" prepended (i.e. u_messages etc.) is created. The mailbox queries are written in terms of these views.
At present there is nothing to stop mailbox queries from reading the raw tables and hence other users' messages. In the longer term I will add some sort of permissions mechanism so that only the views are accessible.
In a multi-user system, users can share a set of "common" mailbox definitions. I don't mean that they actually share a mailbox (though this would also be possible in principle), but rather that every user has a mailbox called, for example, "January 2004", containing their own messages from that month. One can envisage an arangement where a shared address book plus private address books exist and all users have by-correspondent mailboxes for those addresses in the shared address book.
mailboxes.sql defines these common mailboxes in the table common_mailboxes (actually a view which is a union of the different types of common mailbox), and then near the end of the file it instantiates them per user using this bit of SQL:
During installation you need to set up the database to suit your preferences and local environment. If you want to get up and running quickly I suggest that:
Subsequently you'll probably want to make greater changes. I'm not sure to what extent I should supply "standard" or "example" database setups. Feedback is invited, along with your examples of useful mailbox queries.