Email Reloaded

So the long and short of it is, I’m loading all the email I receive into a database using a fun combination of Procmail, Spam Assassin, and a sprinkling of command line PHP. I’m very excited about this, more excited than I’ve been about a new project in a while. For me, email has been steadily waning in utility for the past year, and I want to breathe new life into it. I’m tired of folders. I’m tired of slow searching. I don’t want to hand my email over to someone else, even if it’s Google. I don’t want to deal with mbox or IMAP or maildir or any of that junk. Those are implementation details of various servers and clients.

Mirroring my email into a MySQL database has some interesting ramifications. Imagine instant Gmail-type searching using FULLTEXT or LIKE. Imagine instant email backup using MySQL replication. Think email RSS feeds, keyed on searches or senders or anything. Don’t forget the interesting metrics that can be extracted from this as well. Right now I’ve replaced my timely dozen with a counter running since this morning. If you send me an email, you’ll see it increment live. If it increments the spam counter you may want to resend it and reword your mortgage suggestion. This is the most basic of a hundred interesting things that can be culled from this data.
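
To make the search and counter ideas concrete, here is a minimal sketch, not the actual code behind this site. It assumes SQLite as a stand-in for MySQL so it runs without a server, and the `mail` table, its columns, and the `search` and `counters` helpers are all hypothetical names made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so this runs without a server.
# The table and column names are hypothetical, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mail (
        id      INTEGER PRIMARY KEY,
        sender  TEXT NOT NULL,
        subject TEXT NOT NULL,
        body    TEXT NOT NULL,
        is_spam INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO mail (sender, subject, body, is_spam) VALUES (?, ?, ?, ?)",
    [
        ("alice@example.com", "Lunch on Friday?", "Are you free at noon?", 0),
        ("bob@example.com", "Re: schema", "I think the date columns need an index.", 0),
        ("deals@example.net", "Lower your mortgage today", "Refinance now!", 1),
    ],
)


def search(conn, term):
    """Return (id, sender, subject) for messages whose subject or body contains term."""
    pattern = f"%{term}%"  # note: % and _ inside term are not escaped in this sketch
    return conn.execute(
        "SELECT id, sender, subject FROM mail "
        "WHERE subject LIKE ? OR body LIKE ? ORDER BY id",
        (pattern, pattern),
    ).fetchall()


def counters(conn):
    """Return the non-spam and spam message counts, as a sidebar counter might show them."""
    total, spam = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(is_spam), 0) FROM mail"
    ).fetchone()
    return {"ham": total - spam, "spam": spam}
```

Calling `search(conn, "mortgage")` scans both columns with LIKE, which is fine for a small mailbox. On MySQL the same query could instead use a FULLTEXT index with `MATCH (subject, body) AGAINST (...)`, which should scale better as the table grows.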

I want to hear your wildest dreams. Besides the obvious search, backup, and statistics benefits, what can you imagine this system doing? What would you like email to address? (groan…) What email metadata is interesting? (I’m currently tracking subject, date sent, date received, from, the message itself, and spam status.) What statistics would be interesting to you? Is anyone even interested in this or am I just spinning my wheels?
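
For anyone curious what capturing those fields might involve, the following is a rough sketch under stated assumptions rather than the real script. It uses Python's standard `email` package instead of command line PHP, SQLite instead of MySQL, a hypothetical `mail` table, and it assumes Spam Assassin has already tagged the message with an `X-Spam-Flag` header before it is stored. The "date received" is simply taken as the time the script runs.

```python
import sqlite3
from datetime import datetime, timezone
from email import message_from_string, policy
from email.utils import parsedate_to_datetime

# Hypothetical schema covering the fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS mail (
    id            INTEGER PRIMARY KEY,
    subject       TEXT,
    date_sent     TEXT,
    date_received TEXT,
    sender        TEXT,
    message       TEXT,
    is_spam       INTEGER
)
"""

# A made-up message used only to exercise the sketch.
RAW = (
    "From: Alice <alice@example.com>\n"
    "Subject: Lunch on Friday?\n"
    "Date: Mon, 05 Jul 2004 10:15:00 -0700\n"
    "X-Spam-Flag: NO\n"
    "\n"
    "Are you free at noon?\n"
)


def parse(raw, received_at=None):
    """Extract the tracked fields from one raw message (malformed dates are not handled)."""
    msg = message_from_string(raw, policy=policy.default)
    sent = None
    if msg["Date"] is not None:
        # Normalize the sender's timestamp to UTC so rows sort consistently.
        sent = parsedate_to_datetime(str(msg["Date"])).astimezone(timezone.utc).isoformat()
    received = (received_at or datetime.now(timezone.utc)).isoformat()
    part = msg.get_body(preferencelist=("plain",))  # plain-text part only
    return {
        "subject": str(msg["Subject"] or ""),
        "date_sent": sent,
        "date_received": received,
        "sender": str(msg["From"] or ""),
        "message": part.get_content() if part is not None else "",
        "is_spam": int(str(msg.get("X-Spam-Flag", "")).strip().upper() == "YES"),
    }


def store(conn, raw, received_at=None):
    """Insert one message and return its row id."""
    cur = conn.execute(
        "INSERT INTO mail (subject, date_sent, date_received, sender, message, is_spam) "
        "VALUES (:subject, :date_sent, :date_received, :sender, :message, :is_spam)",
        parse(raw, received_at),
    )
    conn.commit()
    return cur.lastrowid
```

In a setup like the one described, Procmail would pipe each delivered message to a script that reads standard input and passes the text to something like `store`. Attachments and non-plain-text parts are ignored here.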

Today my mail lives in 400 MB of mbox folders I access using IMAP. Tomorrow I want something better.

42 thoughts on “Email Reloaded”

  1. Interesting. I’m curious as to how the db is organized. I’m immediately thinking of linking tables that validate email addresses to one individual for starters.

  2. Sounds like this.
    See also dbmail.

    I’ve always thought that email was a fairly “solved” problem. The dominant infrastructure(s) work well enough, even for folks who save every piece of mail (for C-Y-A, or other reasons). I rarely have need to check messages greater than six months old, so a regular archiving strategy (ending in long-term storage on CD) serves me well enough. I keep IMAP folders open for the current year. I have on disk, but unsubscribed, last year’s folders. Everything (current and old) gets burned to fresh CDs on a regular basis.

    About the only truly convenient thing I can think of would be a combination GMail “conversations” plus web page archive (a la MailMan’s archives), so I could review entire conversations in one go, without having to click through individual messages. htaccess or other restrictions could keep the private conversations private.

    I suppose being able to annotate (a la wiki editing) saved emails might be of value for some classes of saved mail.

  3. This reminds me of ZOE, but if it was in PHP I’d most likely buy a copy (depending on the price) or install it here at my home.

    1. Auto-Labelling with ‘filters’
    2. Mail flow stats [average in/out]
    3. Keep track of X-Headers – MDaemon and other mail servers add x-headers that are great for filters and inserting other useful information
    4. In a message view links to other messages by person
    5. In a message view links to other messages on this date
    6. In the message view links to other messages by other recipients
    7. List of links in a message
  4. I’d definitely like to see that database schema and what tools you’re using against it… What’s needed now is a database-backed IMAP server. Imagine defining virtual folders based on a db query! dbmail’s a good start, but always seemed lacking somehow…

  5. Certainly this gets my wheels turning, and – after only a few moments of meditation, I’ve thought of a few useful things that I can see coming of this.

    1. The extensibility of having custom fields. I mean, “Spam_status” is one that you’ve added, but I can definitely see use for things like “flags”, “importance”, “to-do” etc. There are limitless possibilities there.

    2. The simple fact that anyone with a basic knowledge of PHP and mySQL could put together a UI for this system in a matter of hours.

    3. The amount of filters that could be derived from simple SQL queries is staggering.

    4. For businesses: Having specific email messages being public while most being private.

    5. For businesses: Not ever having to worry about the “Leave a copy of message on server.” being checked because all messages reside on the server no matter what.

    6. Limitless space. I’m running OSX and XP Pro at home, both of which I can run mySQL on. With this system I could “host” my own email utilizing Gigs upon Gigs of space.

    7. Portability. Enough said.

    8. Admin level / user level email access. Being able to restrict your secretary to reading your email, but not being able to reply to it. A lot of possibilities here.

    9. Encrypting attachments galore.

    10. Need a new email address? Simply replicate the table. cdevroe_email, info_email, spam_email, whatever.

    11. Integration with WordPress!!!

    12. Automatic photogalleries based on Photo attachments in email – rather than needing to save all the photos and/or opening each photo one at a time.

    There’s a dozen. I’ll think of more, and post them on my site at some time.

  6. I just wish I could continue using all the different email addresses I have and still access them from one place, one application that runs on my server, entirely. The different accounts are of different types – yahoo, IMAP, POP3, for starters. Forwarding them all to one address is not what I want to do. I want it on different servers, but there should be a central repository for all my email, for when I am not using my Laptop, and Thunderbird. Maybe it’s already possible, maybe it’s not, yet.

  7. Pingback: One Fine Jay
  8. Hmm… certainly a very interesting idea! I would love to have something like that over my pitiful little 1GB Gmail account 😉 Also, anyone besides me noticed that he’s now the number 1 Matt on the internet according to Google? Heheh… according to your about section, I think it’s time to take all your sites offline. But that would be a horrible thing…

  9. I definitely want to get the code out there, but I want to get the DB schema firmed up first so we don’t have to deal with upgrades and handling funky data.

    Skippy, I don’t need to go back very far very often, but when I do need to go back I want it right then without any waiting. Often I’ll spend forever trying to find what folder the message is in or if it’s on a different account or whatever. I can’t imagine how long it’d take me if I had stuff scattered on CDs. Thanks for pointing out those two projects, I’ll check them out to see if there’s anything good that can be used.

    Randy, I always loved the idea of ZOE but hated the implementation. It was huge and slow and brought my computer to a crawl while it “indexed.” This would be instantaneous. Thanks for the ideas!

  10. Matt, If you want any help on any web-based front-end code, let me know. This sounds like a project I’d love to be involved with.

    Tracking the Message-ID, References, and In-Reply-To headers is a must so that you can build threaded lists. Also, there should be some mechanism for adding messages you send from another mail client–the best way would be to trigger a script in your MTA that would add outgoing messages to the DB rather than having to add some address to the CC on every email.

  11. Pingback: Transient Savant
  12. Colin, thanks for the cool ideas!

    Carthik, there’s nothing to keep you from mirroring the database in any number of locations or from feeding the database from any number of email addresses. I could see this being very useful to businesses doing support or similar. I’m going to adapt something like this to do better archives for the WordPress mailing lists.

    Alan, you missed the party.

    KillAllDash9, I actually haven’t tackled threading yet. I need to look at it closer.

  13. I’ve been archiving my e-mail in a MySQL database for appr. 2 years now (over 200.000 e-mails). Procmail inserts them using a home-brewed script. It also does automatic mailing list detection (i.e. scans various headers such as x-mailing-list-name) and the like. Parsing e-mails is easy – the difficulty is handling attachments and different character encodings.

  14. Sounds like ZOE did it first, but like you, I hated the implementation. Ultimately, I’d like such a system to become my single-point “knowledge base”, as far as email is concerned anyway. It should be able to deal with multiple accounts, so I could stuff my yahoo/hotmail/whatever into it and access it from a single location. I think RSS/Atom feeds for individual “conversations” ala Gmail would be neat. Then later, why bother with SMTP/POP3/IMAP at all? XML-RPC-based email clients could do everything over port 80 and everything would happen from a single location. Everyone could eventually have their own personal “mail server”.

    I think I need a towel.

  15. Matt,
    This sounds way cool. At one of the Supernova sessions, Karl Jacobs of Cloudmark said something really insightful. I can’t quote exactly, but it was something along the lines of “our email clients know so much about us and our social networks, we should be data mining them and using that information to help users deal with email.” I’ve been thinking about it ever since, and your project plays right into it.

    Metadata I would be interested in:

    How quickly I respond to email, sorted by length of email, number of addresses, name — that would help my email client inbox decide how to present new mail to me, and what to put on the top of the “Action” list.

    Email I don’t respond to, sorted by similar factors, so email client could start making good guesses as to what goes into the “News” or “FYI” or “Spam” boxes.

    Social network diagrams from my email. Who is the person most often copied on email from me? To me? sorted by other addressees – could be used to prepopulate send to and cc to fields.

    Then start mixing it in with logs of IM/IRC conversations stored in MySQL, for similar analysis.

    More thoughts available if you are interested.

    At the moment I use 7+ different incoming email addresses plus filters to do a mediocre job of organizing my email as it arrives, but your approach sounds a lot more promising.

    I would love to be an Alpha tester.


  16. I would love to see labels ala Gmail. I would love to see auto sorting by Author ala Opera. I would love to be able to build my own rules. A great thread tracking system would be cool too. Gmail has a handle on threading, but IBM had a couple of pretty nifty ideas about email as well (Remail: specifically threading).

    Your biggest challenge will be the UI. Although after looking at your website and WordPress I don’t think you will have much of a problem with UI 🙂

  17. Matt,
    If you don’t need to run back to older archives with any regularity, then near-line storage (CD, or other removable long-term storage media) is where it belongs (imho). Chances are, if you need to retrieve an ancient email, that you know enough about that message to be able to quickly grep the mbox or Maildir stored on the CD. If you don’t know what you’re grepping for then SQL SELECTs and LIKEs aren’t going to help too much more (although I’ll grant that the _output_ from these searches can be better presented, to help find the needle in the haystack).
    Tushar Burman says:

    XML-RPC-based email clients could do everything over port 80 and everything would happen from a single location.

    I shudder to think what kind of spamming potential this mechanism would have; but I also recognize that by comparing incoming “emails” of this type against the body of known senders is suddenly _much_ easier.

  18. I shudder to think what kind of spamming potential this mechanism would have; but I also recognize that by comparing incoming “emails” of this type against the body of known senders is suddenly _much_ easier.

    I didn’t mean using XML-RPC over SMTP; just using it to relay outgoing mail (authenticated, of course). This way, outgoing mail would also be archived, regardless of physical location.

  19. The only thing that concerns me here is scalability — how much email can one person reasonably store on a hosting account? Mind you, this sort of thing is for power users, and I’m sure a power user will have a dedicated box available [whether leased or a home-brewed box with third-party DNS pointing at it], but, people’s comments to a six-month threshold notwithstanding, the volume of email that I get would get > 100 MB in a short time.

  20. You could track all the contacts (recipients, or inline of message). Then link them with others and set up a social network by sharing out your list. Maybe using a hash for the email address instead of the address itself to keep people from knowing an email address. Attach a blog URI to the email addy and socialize some more.

  21. More to the point…will this help me find a woman?

    Actually, for my simple needs, the ability to quickly find a particular item is paramount. As well, a system of smart filters, which become more refined over time, would be nice to store and relate each item.

    If you look at iTunes, there are myriad of ways to relate your music to create a specific play list. No reason why email can’t be done in the same way.

    Right now, virtually every email client that I have used sucks big time in the department of ease of search. Even Mozilla Thunderbird’s search is lame.

    If having MullenMail on my colophon gets me chicks, then I’m 100% for it.

  22. Check out the CPAN Perl module Mail::Store for a similar way of doing this. It may be useful for ideas on object abstraction as well as database schema needed. I’ve been thinking about doing something similar with that module very soon. Combine with Spam Assassin and some easy Template toolkit code and you’ve got a nice component based spam proof webmail system.

  23. It’s a very interesting idea but it almost seems overly redundant to a point. Depending on the type of e-mail you get you could eat up a ton of space very quickly. How are you addressing attachments? Will they be stored in the file system or as a blob in the database?

  24. I am very interested in seeing some code for the database part of this. I have recently been looking around for a good webmail client that would suit my needs and have not found anything so I’ve decided to make my own and this would make my job like 20 times easier seeing as I know nothing about email. Do you have a timeline in mind for rolling out with something?


  25. The first thing I thought of was conversations like newsgroups/forums/threads. But I’d want to have it sortable in a myriad of ways. I wouldn’t want to be restricted to a “sort by [this]” pre-set options. I’d want to specify how I want it sorted myself, using things like “sort by [how many times a word shows up in an email]”, or whatever. I’m dreaming here, not thinking of the logistical headache it would be to create (maybe plugins would be something to allow).

  26. Hello Matt,

    I’ve been googling around for this for a long time. I definitely want to store email in a MySQL db for a LONG time, but didn’t find a suitable solution until now. So do you have the scripts for that? How is the project going now? I will definitely try this.
    In my opinion, storing the emails in database could change email and its applications A LOT.
    Best of luck, and tell me if you need any help.