James Mailbox / MAILBOX-44

[gsoc2011] Design and implement a distributed mailbox using Hadoop

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4
    • Component/s: None
    • Labels:

      Description

      Context: The mailbox subproject (http://james.apache.org/mailbox/) supports maildir, SQL database (via JPA) and Java Content Repository (JCR) as technologies for mail storage. This flexibility is achieved thanks to an API design that abstracts mail storage from the mail protocols.

      Task: We need to implement mailbox storage as a distributed system on top of Hadoop HDFS. The James mailbox API will be used. A first step is to design how to interact with Hadoop (native api, gora incubator at apache,...) and deal with specific performance questions related to mail loading/parsing in a distributed system (use map/reduce or not, use existing local lucene indexes for search,...). The second step is to implement the HDFS mailbox (the maildir mailbox is similar because it stores each mail as a file, and can be an inspiration). A single James server will still be deployed because we don't have any distributed UID generation.

      Mentor: eric at apache dot org

      Complexity: medium

        Activity

        Eric Charles added a comment -

        ... and Ioan, don't forget to delete the apache-extra repo (http://code.google.com/a/apache-extras.org/p/mailbox-hdfs/) or to clearly indicate on the project home page that the code is now part of the official Apache James mailbox repo (if you want to keep the hg repo as a reminder).

        Thx again

        Eric

        Eric Charles added a comment -

        Well done Ioan!

        Ioan Eugen Stan added a comment -

        I think you can close this one. I have just committed the code base to trunk and it looks ok. I also committed the integration tests, now going for improvements and finishing integration.

        Eric Charles added a comment -

        I've created http://wiki.apache.org/james/GSoC2011HBaseMailbox
        Feel free to comment further on this JIRA, on MAILBOX-72 (requirements for a distributed mailbox), or to update the wiki page.
        The final goal is to have a sufficiently detailed wiki page with the data model...

        Norman Maurer added a comment - - edited

        @Stack:

        Hope this makes it more clear:

        messagesMetaData(CF): {
          mailboxId/uid: {
            uid: 1,
            mailboxId: 184e-ske1-igk2-gj71,
            flags.recent: true,
            flags.deleted: true,
            flags.seen: true,
            flags.deleted: false,
            flags.seen: false,
            flags.flagged: true,
            bodyOctets: 19484,
            fullContentOctets: 10304,
            properties: namespace::localname::value;;namespace2::localname2::value2,
            headers: byte[],
            mediaType: text,
            subType: plain,
            textualLineCount: 24
          }
        }

        messagesContent(CF): {
          mailboxId/uid: { 1: byte[], 2: byte[], 3: byte[] }
        }

        Then I have secondary indexes on the messagesMetaData CF to be able to get all messages which belong to mailbox X and have the deleted flag set, etc.

        I used RP (the random partitioner) and used the secondary indexes to "filter" the right messages.

        Does that explain it a bit more?
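
        As a rough illustration of the secondary-index idea described above, here is a minimal in-memory sketch in Java. It is not the Cassandra prototype's code: the class name FlagIndex, its methods, and the "mailboxId/flag" index key are assumptions made for this example. It only shows how an index keyed by mailbox and flag lets you list matching UIDs without scanning every message row.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for a secondary index on message flags. */
final class FlagIndex {
    // Index key "mailboxId/flag" -> UIDs of the messages that currently have the flag set.
    private final Map<String, SortedSet<Long>> uidsByKey = new HashMap<>();

    private static String key(String mailboxId, String flag) {
        return mailboxId + "/" + flag;
    }

    /** Keep the index in step with a flag change on one message. */
    void setFlag(String mailboxId, long uid, String flag, boolean value) {
        String k = key(mailboxId, flag);
        if (value) {
            uidsByKey.computeIfAbsent(k, x -> new TreeSet<>()).add(uid);
        } else {
            SortedSet<Long> uids = uidsByKey.get(k);
            if (uids != null) {
                uids.remove(uid);
                if (uids.isEmpty()) {
                    uidsByKey.remove(k);
                }
            }
        }
    }

    /** UIDs in the mailbox with the flag set, in ascending order. */
    SortedSet<Long> find(String mailboxId, String flag) {
        SortedSet<Long> uids = uidsByKey.get(key(mailboxId, flag));
        return uids == null
                ? Collections.<Long>emptySortedSet()
                : Collections.unmodifiableSortedSet(uids);
    }
}
```

        In a real store the index would live in its own column family or table and would have to be updated together with the metadata row; the sketch ignores that consistency question.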

        Eric Charles added a comment -

        Hi there, and thanks to Stack for joining and helping us with this design.

        I've added some food for thought on MAILBOX-72.

        You can see on https://issues.apache.org/jira/secure/attachment/12482691/Datamodel-mailbox-0.2.png the interfaces that the HBase store will have to implement.
        Those interfaces are fixed, but the implementation is free to realize them however it wants.

        First the tables:

        • If you look at the classes, we could have Mailbox, Subscription and Message tables.
        • A row per mailbox, subscription and message
        • The unanswered questions are: 1. The structure of the rowkey? 2. Header and Property as separate tables or as additional columns in the message row?

        Second the queries:

        • The implemented SQL queries are on https://issues.apache.org/jira/browse/MAILBOX-72?focusedCommentId=13049883&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13049883
        • Some are simple Gets (efficient), some are not. We will need to use the HBase scanners (existing ones, maybe also specific ones we will have to implement).
        • For the IMAP built queries (especially for search), this can lead to a full scan of the table (see following point)

        Finally the index to help optimize the search:

        • Solr can come to the rescue
        • I like the ongoing Lucene-on-HBase work, especially once it is done
        • In the meantime, we could rely on custom HBase scanners (inefficient due to full table scans)

        Waiting for your feedback.
        Thanks,

        Eric
        stack added a comment -

        First off, welcome

        Thank you.

        That's sweet that you have prior experience hacking this on top of a store already. I defer to experience!

        Why a separate row for message metadata and content, if you don't mind me asking, rather than a message per row with, say, content in one column family and metadata in another? (Probably best to have cells no bigger than N MB in HBase too... we say > 10MB is usually to be avoided, so the splitting across cells probably applies to HBase too.)

        Did you use the order-preserving partitioner?

        Random IMAP queries sound ugly.

        Norman Maurer added a comment -

        @stack:

        First off, welcome

        I wrote a few of the other mailbox implementations in JAMES, so maybe I can answer your questions (concerns). I also wrote a prototype for a mailbox on top of Cassandra, which is not too different in terms of "limitations".

        So here we go:

        I think putting all the mail in one row for a mailbox will not work, as really big mailboxes are quite common these days. This will just limit the distribution a lot (as you already pointed out). So let me try to explain how I did the schema for Cassandra; maybe it also fits HBase (I have not had the time to dig deeper into it).

        • one row for the mailbox metadata (mailboxId, uidvalidity, namespace, username ...).
        • one row for the message metadata (mailboxId, uid, size, headers, flags, messagecontentId...).
        • one row per message content, where I split the message content into 1 MB parts and put each "raw" byte[] in a new column. This makes sure we don't get too big a column (not sure if this is also needed for HBase; in Cassandra big columns are a problem)
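
        To make the 1 MB splitting concrete, here is a minimal Java sketch of the idea, not the prototype's actual code. The class name ContentChunker and its methods are assumptions for illustration; each returned part would go into its own column (1, 2, 3, ...) of the message content row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that splits raw message bytes into fixed-size parts. */
final class ContentChunker {
    static final int ONE_MB = 1024 * 1024;

    private ContentChunker() {}

    /** Split content into parts of at most chunkSize bytes, in order. */
    static List<byte[]> split(byte[] content, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> parts = new ArrayList<>();
        int offset = 0;
        while (offset < content.length) {
            // The last part may be shorter than chunkSize.
            int length = Math.min(chunkSize, content.length - offset);
            parts.add(Arrays.copyOfRange(content, offset, offset + length));
            offset += length;
        }
        return parts;
    }

    /** Reassemble the parts, read back in column order, into the original bytes. */
    static byte[] join(List<byte[]> parts) {
        int total = 0;
        for (byte[] part : parts) {
            total += part.length;
        }
        byte[] content = new byte[total];
        int position = 0;
        for (byte[] part : parts) {
            System.arraycopy(part, 0, content, position, part.length);
            position += part.length;
        }
        return content;
    }
}
```

        Calling ContentChunker.split(bytes, ContentChunker.ONE_MB) on a message of 2 MB plus a few bytes yields three parts, the last one holding only the remainder.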

        The queries are the following:

        • retrieve all messages which have the recent flag set
        • retrieve all messages which have the sent flag set
        • retrieve all messages with uid <=> X
        • retrieve all messages with the deleted flag set
        • retrieve all mailboxes with name like '%X%'

        Then IMAP also allows you to build your own search query, which is really problematic with NoSQL stores or even with SQL stores, as it basically allows the user to do any kind of filtering, which in fact just sucks when you don't have the indexes set. So we have a Lucene index for that atm. I plan to write one in Solr too.

        Threading is not supported atm but is on my todo list.

        Hope this helps, just ask if you need more info

        stack added a comment -

        All mail in a single row in hbase would mean that the mailbox would be changed 'atomically', since row updates in hbase are atomic, but the downside might be that some users would have really big mailboxes and gigabyte-sized rows; this might mess w/ balance and distribution across the cluster (perhaps).

        If you did put them all in a single row, in hbase columns are sorted too; if the column qualifier were a reverse order date you could encounter mail in order of newest first. HBase has versioning too so you could stamp mail into hbase and write the mail receipt date as the cell version. Naturally it returns versions in order of newest first.

        How would you do threading? Does James support this? What else does James support that you expect the db to provide?

        Ioan Eugen Stan added a comment -

        Thank you for the input, I appreciate it and I will look into it, it seems very promising.
        My first idea was to store all of a user's emails in a single row, but I couldn't figure out how to access the emails in an efficient manner.
        I hope I will get my hands on that book soon, but until then I will see what I can get from other sources.

        We are currently discussing the requirements and constraints about building a NoSQL storage here: https://issues.apache.org/jira/browse/MAILBOX-72. For now, the discussion is targeting HBase, but I think it can be adapted to other NoSQL implementations. We will publish the schema details there.

        stack added a comment -

        @Ioan Going the Gora route will allow you to swap stores. I've not used it so am not up on the costs that come with the indirection (if any).

        You'll need to figure out a schema design for your store. I'd suggest you study how James does queries currently and make a list. This will be the key input feeding your schema design. For example, in the coming "HBase: The Definitive Guide", Lars has some discussion of HBase as a mail store. Rows are sorted in HBase so he arrives at a row key schema that looks like this:

        <userid><date in reversed chronological order so you see newest mail first><message-id><attachment-id>
        

        You can start up a scan to see all mail from a user and you'll see the latest first. Mail will be grouped by mail id. If attachment ids are their sequence numbers, then they'll be encountered in order (you'll probably need to zero pad some of the attributes above). This is just an example. You may end up w/ a different row key design after you've studied the James queries.
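
        A minimal Java sketch of such a row key follows, to show why the reversed date and the zero padding matter. It is only an illustration under stated assumptions: the class name MailRowKey, the "/" separator, the numeric message id and the field widths are all hypothetical, and user ids are assumed not to contain "/". HBase compares row keys byte by byte, so fixed-width decimal fields sort in numeric order.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical row key builder: userid / reversed date / message-id / attachment-id. */
final class MailRowKey {
    private MailRowKey() {}

    static String build(String userId, long receivedMillis, long messageId, int attachmentIndex) {
        if (receivedMillis < 0 || messageId < 0 || attachmentIndex < 0) {
            throw new IllegalArgumentException("numeric key parts must be non-negative");
        }
        // Subtracting from Long.MAX_VALUE makes newer timestamps produce smaller keys.
        long reversed = Long.MAX_VALUE - receivedMillis;
        // Long.MAX_VALUE has 19 decimal digits, so 19 is the padded width for long fields.
        return String.format(Locale.ROOT, "%s/%019d/%019d/%05d",
                userId, reversed, messageId, attachmentIndex);
    }

    /** Bytes to hand to the store as the row key. */
    static byte[] toBytes(String rowKey) {
        return rowKey.getBytes(StandardCharsets.UTF_8);
    }
}
```

        With this layout, a scan over the keys starting with "alice/" meets the newest mail first, and attachment 2 sorts before attachment 10 because both are padded to the same width.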

        Eric Charles added a comment -

        Further to [1] and [2], an extra layer (HBase) upon Hadoop will be used
        [1] http://markmail.org/message/hojdn5ugyxsq2pft
        [2] http://markmail.org/message/5q7hixtxiioa6rse

        Eric Charles added a comment -

        Ioan, we will use apache-extras (backed by Google) for your source code repository.
        You can create one via http://code.google.com/a/apache-extras.org/hosting/createProject
        It provides:
        • Mercurial and Subversion code hosting
        ==> Apache is SVN for now, but I think you know Mercurial better. So choose what you want. Choose the Apache 2 License.
        • Download/release hosting
        • Integrated source code browsing and code review tools
        • An issue tracker and project wiki
        ==> Don't use this, use the Apache James JIRA and Apache James Wiki.
        For the pom, just take inspiration from the mailbox-jpa pom (rename it to mailbox-hdfs, update dependencies,...)

        Eric Charles added a comment -

        So you have built and run James, and you've got Hadoop set up. Cool!
        The next step would be to create a maven project, declare the needed dependencies on the James mailbox and Hadoop libraries, and make a few attempts:
        1. Access the mailbox (create session,...) from a java test case (see [1] and [2] for inspiration, I will try to commit more focused examples tomorrow)
        2. Access a hadoop cluster based on Mini(MR)Cluster: these are the classes hadoop uses for testing without having to deploy a real cluster.

        Also have a look at the gora documentation. This will be useful when we have to decide on how to access the hdfs files,... and don't forget to subscribe to the hadoop and gora mailing lists.

        [1] https://svn.apache.org/repos/asf/james/mailbox/trunk/jpa/src/test/java/org/apache/james/mailbox/jpa/JPAMailboxManagerTest.java
        [2] https://svn.apache.org/repos/asf/james/server/trunk/container-spring/src/main/java/org/apache/james/container/spring/tool/James23Importer.java

        Ioan Eugen Stan added a comment -

        I have installed Hadoop on my machine and run the wordcount example. Now all I have to do is figure out how to put all the things together. I guess I will have to get to know a little bit of the Hadoop API.

        Eric Charles added a comment -

        The mailbox component injection is achieved by the server project with context files you can find in
        https://svn.apache.org/repos/asf/james/server/trunk/container-spring/src/main/config/james/context/

        There are some functional tests for the different mailbox implementations in
        http://svn.apache.org/repos/asf/james/mailbox-integration-tester/trunk/
        These are tests for the imap protocol using the mailbox implementations.

        The mailbox project itself only contains some basic testing for now.

        Having a dependency injection module in the mailbox project (without the need to have a server) is on my todo list.

        Btw, which missing plugin exceptions have you received? If needed, you may remove the bad plugins from your local maven repo ($HOME/.m2/repository/...); your next build should download them again from the internet maven repositories.

        Norman Maurer added a comment -

        Hi there,

        for the mailbox api part I suggest you have a look at the jpa implementation. This will give you a feeling for what needs to get done. After that have a look at the store module, which contains everything you need to write your implementation. It already has many abstract base classes which just need to be extended. Once you get the idea, it's really straightforward.

        And yes, James uses Spring to load the right classes depending on the .xml files

        Ioan Eugen Stan added a comment -

        I installed James 3.0m2 on localhost. Installation was easy, I just had to disable exim so james could bind to port 25. I successfully sent an email and configured Icedove (Mozilla Thunderbird) to get the mail by IMAP.

        I also successfully built james (trunk) on my machine but I had to disable the test building because maven complained about missing plug-ins on tests.

        Last, I had a look at the James mailbox API. Didn't know where to start, but got it: it's big.

        My first try was to find the mailbox dependency in James Server but I couldn't find it. Luckily I had just read about Dependency Injection (http://martinfowler.com/articles/injection.html) and spring. James is using dependency injection, fitting in the right mailbox API at runtime, based on config files. Right?

        Eric Charles added a comment -

        My favorite tutorial for hadoop setup: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

        Ioan Eugen Stan added a comment -

        I have read RFC 5322 and I'm starting on HDFS and Hadoop.

        Eric Charles added a comment -

        +1
        I just read your application on google-melange and it's ok to me.
        Good job

        Ioan Eugen Stan added a comment -

        I have added your recommendations to my application. Thanks for all the help.
        Please keep in mind that many of these are new to me and right now I am a bit overwhelmed by that.
        I see a lot of new names and it's a bit discouraging.
        I wish to see the project complete so please keep things simple for me until I can manage all this information.
        First make it run and then make it run fast.

        Ioan
        Eric Charles added a comment -

        Robert,

        Regarding distributed uid generation, we have defined https://issues.apache.org/jira/browse/IMAP-271 [gsoc2011] Design and implement Distributed UID generation

        I must reread your post and reread the RFCs to have a better idea.

        I suppose this doesn't change anything in the scope of Ioan's application. If mails are persisted, and we have a solution for uid, we have a distributed james. But uid is not in scope here. wdyt?

        Eric
        Robert Burrell Donkin added a comment -

        A distributed email server is an interesting topic

        There are a number of different ways one might reasonably approach the problem. Take a look at the way UIDs are defined in IMAP [1]. The strong uniqueness qualities may only be required within a mailbox, not universally. Though mailboxes can be shared, the requirement to maintain message sequence numbers limits how well concurrent access to a single mailbox will scale.

        This suggests to me that the framers of the IMAP standard considered the possibility that distribution might happen between the protocol and mailbox tiers. In that scenario, the servers handling client connections and the servers handling mailboxes would operate in separate processes, potentially separated by a network. Each mailbox could then be located close to dedicated storage.

        I believe that a consequence of this engineering decision by the standards group may be that a fully distributed UID may not really be necessary. I suspect that using HBase[3] or Cassandra [4] to store UIDVALIDITY+UID keyed by mailbox name (perhaps using Gora[5]) would be good enough.
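
        A minimal Java sketch of UIDs scoped per mailbox, as suggested above, might look like the following. It is an in-memory illustration only: the class name MailboxUidAllocator and its methods are assumptions, and in a distributed deployment the counter would have to be an atomic counter held in the store (keyed by mailbox name) rather than in one JVM.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-mailbox UID allocator; UIDs are unique only within one mailbox. */
final class MailboxUidAllocator {
    /** IMAP UIDs and UIDVALIDITY values are 32-bit unsigned numbers. */
    static final long MAX_32_BIT = 0xFFFFFFFFL;

    // Last UID handed out, keyed by mailbox name.
    private final ConcurrentMap<String, AtomicLong> lastUidByMailbox = new ConcurrentHashMap<>();

    /** Return the next strictly ascending UID for the given mailbox, starting at 1. */
    long nextUid(String mailboxName) {
        AtomicLong last = lastUidByMailbox.computeIfAbsent(mailboxName, name -> new AtomicLong(0));
        long uid = last.incrementAndGet();
        if (uid > MAX_32_BIT) {
            throw new IllegalStateException("UID space exhausted for mailbox " + mailboxName);
        }
        return uid;
    }

    /** Combine UIDVALIDITY and UID into the 64-bit value described in RFC 3501. */
    static long combine(long uidValidity, long uid) {
        if (uidValidity < 0 || uidValidity > MAX_32_BIT || uid < 0 || uid > MAX_32_BIT) {
            throw new IllegalArgumentException("values must fit in 32 bits");
        }
        return (uidValidity << 32) | uid;
    }
}
```

        Because each mailbox has its own counter, two mailboxes can hand out the same UID without conflict; only the pair of UIDVALIDITY and UID within one mailbox has to stay unique.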

        [1] http://tools.ietf.org/html/rfc3501

        2.3.1.1. Unique Identifier (UID) Message Attribute

        A 32-bit value assigned to each message, which when used with the
        unique identifier validity value (see below) forms a 64-bit value
        that MUST NOT refer to any other message in the mailbox or any
        subsequent mailbox with the same name forever. Unique identifiers
        are assigned in a strictly ascending fashion in the mailbox; as each
        message is added to the mailbox it is assigned a higher UID than the
        message(s) which were added previously. Unlike message sequence
        numbers, unique identifiers are not necessarily contiguous.

        The unique identifier of a message MUST NOT change during the
        session, and SHOULD NOT change between sessions. Any change of
        unique identifiers between sessions MUST be detectable using the
        UIDVALIDITY mechanism discussed below. Persistent unique identifiers
        are required for a client to resynchronize its state from a previous
        session with the server (e.g., disconnected or offline access
        clients); this is discussed further in [IMAP-DISC].

        Associated with every mailbox are two values which aid in unique
        identifier handling: the next unique identifier value and the unique
        identifier validity value.

        The next unique identifier value is the predicted value that will be
        assigned to a new message in the mailbox. Unless the unique
        identifier validity also changes (see below), the next unique
        identifier value MUST have the following two characteristics. First,
        the next unique identifier value MUST NOT change unless new messages
        are added to the mailbox; and second, the next unique identifier
        value MUST change whenever new messages are added to the mailbox,
        even if those new messages are subsequently expunged.

        Note: The next unique identifier value is intended to
        provide a means for a client to determine whether any
        messages have been delivered to the mailbox since the
        previous time it checked this value. It is not intended to
        provide any guarantee that any message will have this
        unique identifier. A client can only assume, at the time
        that it obtains the next unique identifier value, that
        messages arriving after that time will have a UID greater
        than or equal to that value.

        The unique identifier validity value is sent in a UIDVALIDITY
        response code in an OK untagged response at mailbox selection time.
        If unique identifiers from an earlier session fail to persist in this
        session, the unique identifier validity value MUST be greater than
        the one used in the earlier session.

        Note: Ideally, unique identifiers SHOULD persist at all
        times. Although this specification recognizes that failure
        to persist can be unavoidable in certain server
        environments, it STRONGLY ENCOURAGES message store
        implementation techniques that avoid this problem. For
        example:

        1) Unique identifiers MUST be strictly ascending in the
        mailbox at all times. If the physical message store is
        re-ordered by a non-IMAP agent, this requires that the
        unique identifiers in the mailbox be regenerated, since
        the former unique identifiers are no longer strictly
        ascending as a result of the re-ordering.

        2) If the message store has no mechanism to store unique
        identifiers, it must regenerate unique identifiers at
        each session, and each session must have a unique
        UIDVALIDITY value.

        3) If the mailbox is deleted and a new mailbox with the
        same name is created at a later date, the server must
        either keep track of unique identifiers from the
        previous instance of the mailbox, or it must assign a
        new UIDVALIDITY value to the new instance of the
        mailbox. A good UIDVALIDITY value to use in this case
        is a 32-bit representation of the creation date/time of
        the mailbox. It is alright to use a constant such as
        1, but only if it guaranteed that unique identifiers
        will never be reused, even in the case of a mailbox
        being deleted (or renamed) and a new mailbox by the
        same name created at some future time.

        4) The combination of mailbox name, UIDVALIDITY, and UID
        must refer to a single immutable message on that server
        forever. In particular, the internal date, [RFC-2822]
        size, envelope, body structure, and message texts
        (RFC822, RFC822.HEADER, RFC822.TEXT, and all BODY[...]
        fetch data items) must never change. This does not
        include message numbers, nor does it include attributes
        that can be set by a STORE command (e.g., FLAGS).

        [2] http://tools.ietf.org/html/rfc3501

        2.3.1.2. Message Sequence Number Message Attribute

        A relative position from 1 to the number of messages in the mailbox.
        This position MUST be ordered by ascending unique identifier. As
        each new message is added, it is assigned a message sequence number
        that is 1 higher than the number of messages in the mailbox before
        that new message was added.

        Message sequence numbers can be reassigned during the session. For
        example, when a message is permanently removed (expunged) from the
        mailbox, the message sequence number for all subsequent messages is
        decremented. The number of messages in the mailbox is also
        decremented. Similarly, a new message can be assigned a message
        sequence number that was once held by some other message prior to an
        expunge.

        In addition to accessing messages by relative position in the
        mailbox, message sequence numbers can be used in mathematical
        calculations. For example, if an untagged "11 EXISTS" is received,
        and previously an untagged "8 EXISTS" was received, three new
        messages have arrived with message sequence numbers of 9, 10, and 11.
        Another example, if message 287 in a 523 message mailbox has UID
        12345, there are exactly 286 messages which have lesser UIDs and 236
        messages which have greater UIDs.
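
The relationship between UIDs and message sequence numbers quoted above can be sketched in a few lines. This is an illustrative model only (`MailboxView` is a hypothetical name, not part of the James API): the sequence number is simply the 1-based position of a UID in the ascending UID list, which is why an expunge renumbers every later message.

```python
import bisect


class MailboxView:
    """Keeps the UIDs of one mailbox in ascending order."""

    def __init__(self, uids=()):
        self._uids = sorted(uids)

    def sequence_number(self, uid):
        # 1-based position of the UID in the ascending UID list.
        i = bisect.bisect_left(self._uids, uid)
        if i == len(self._uids) or self._uids[i] != uid:
            raise KeyError(uid)
        return i + 1

    def append(self, uid):
        # New messages must receive a UID higher than all existing ones.
        if self._uids and uid <= self._uids[-1]:
            raise ValueError("UIDs must be strictly ascending")
        self._uids.append(uid)
        return len(self._uids)  # sequence number of the new message

    def expunge(self, uid):
        # Removing a message shifts the sequence number of every
        # later message down by one; UIDs are untouched.
        seq = self.sequence_number(uid)
        del self._uids[seq - 1]
        return seq
```

Because every expunge changes the numbering seen by all sessions on that mailbox, the sequence numbers are the part that needs coordination, while UIDs stay stable.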

        [3] http://hbase.apache.org/
        [4] http://cassandra.apache.org/
        [5] http://incubator.apache.org/gora/

        Robert Burrell Donkin added a comment -

        IMHO JSON is an interesting option for email storage, and a Mime4J module parsing a MIME mail into JSON would be useful for much more than just Avro.
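
As a rough illustration of what a MIME-to-JSON mapping could look like, here is a sketch that uses Python's standard `email` package rather than Mime4J. The JSON layout (`headers`, `content_type`, `parts`, `body`, `body_base64`) is an assumption made for this example, not a format proposed in this issue.

```python
import base64
import json
from email import message_from_string
from email.policy import default


def part_to_dict(part):
    """Convert one MIME entity into a JSON-friendly dictionary."""
    node = {
        # A list of pairs keeps repeated headers and their order.
        "headers": [[name, str(value)] for name, value in part.items()],
        "content_type": part.get_content_type(),
    }
    if part.is_multipart():
        node["parts"] = [part_to_dict(p) for p in part.iter_parts()]
    elif part.get_content_maintype() == "text":
        node["body"] = part.get_content()
    else:
        # Binary content is not valid JSON text, so base64-encode it.
        raw = part.get_payload(decode=True) or b""
        node["body_base64"] = base64.b64encode(raw).decode("ascii")
    return node


def mime_to_json(raw_message):
    msg = message_from_string(raw_message, policy=default)
    return json.dumps(part_to_dict(msg), indent=2)
```

A document of this shape could then be fed to any JSON-aware store or serializer, which is the point of the comment above.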

        Eric Charles added a comment -

        Regarding "Another problem to settle is the format and compression of the HDFS files to store the emails": one option would be Avro (other options would be to use one of the native HDFS file types, or to develop a MailHadoopFile).

        From http://avro.apache.org/docs/current/, Avro provides:

        Rich data structures.
        A compact, fast, binary data format.
        A container file, to store persistent data.
        Remote procedure call (RPC).
        Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.

        The nice thing is that you define your format in JSON and you get the persistence of your objects in Hadoop for free (directly and via map/reduce).

        Twitter, for example, uses a similar mechanism to store tweets (very small objects) in its distributed store.
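
For illustration, an Avro record schema for a stored mail might look roughly like the one below. The record name, namespace and fields are all hypothetical; only the general shape (a JSON object with `type`, `name` and `fields`) follows the Avro schema format. The sketch builds the schema as a plain dictionary and serializes it with the standard library, so no Avro library is needed to run it.

```python
import json

# Hypothetical schema: field names are assumptions for this example.
MAIL_SCHEMA = {
    "type": "record",
    "name": "MailMessage",
    "namespace": "org.example.mailbox",
    "fields": [
        {"name": "mailbox", "type": "string"},
        {"name": "uidValidity", "type": "long"},
        {"name": "uid", "type": "long"},
        {"name": "internalDate", "type": "long"},
        {"name": "flags", "type": {"type": "array", "items": "string"}},
        # A map loses repeated headers (e.g. Received); a real design
        # would probably need an array of name/value records instead.
        {"name": "headers", "type": {"type": "map", "values": "string"}},
        {"name": "body", "type": "bytes"},
    ],
}

# The JSON text is what would be handed to an Avro implementation.
schema_json = json.dumps(MAIL_SCHEMA, indent=2)
```

Whether a single record per mail or a split between metadata and body works better is exactly the kind of thing that would need testing against the alternatives.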

        To be tested/compared with other alternatives...

        It would be cool to work this into your application. Tks,

        Eric Charles added a comment -

        Ioan, please link to https://issues.apache.org/jira/browse/MAILBOX-44 from your google-melange application. tks.

        Robert Burrell Donkin added a comment -

        HDFS is good at (relatively) small numbers of large files. For small files, the main limitation was block size. Hadoop moves fast. Need to establish early the current state of the art, and what tuning would be required.
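
One general way to fit many small mails into a store that prefers a few large files is to pack them into a container and keep an index of offsets. The sketch below shows only that idea, in memory and without touching HDFS; `pack_messages` and `read_message` are hypothetical helpers, and whether such packing is needed with a current Hadoop release is part of what would have to be established.

```python
import io


def pack_messages(messages):
    """Concatenate (uid, payload) pairs into one blob plus an index.

    The index maps each uid to (offset, length) inside the blob.
    """
    out = io.BytesIO()
    index = {}
    for uid, payload in messages:
        index[uid] = (out.tell(), len(payload))
        out.write(payload)
    return out.getvalue(), index


def read_message(blob, index, uid):
    """Return the payload of one message from a packed blob."""
    offset, length = index[uid]
    return blob[offset:offset + length]
```

The trade-off is that deleting or rewriting a single message now means rewriting or compacting the container, so this suits append-mostly mail storage better than heavily edited data.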

        Robert Burrell Donkin added a comment -

        Yep

        IMHO there's an art to RFCs. Implementation requires lots of reading and re-reading but you don't need to do that if you just want to use them. Aim to skim read them, so you know where to find information rather than retain any details.

        Ioan Eugen Stan added a comment -

        A lot of history to catch up on.

        Robert Burrell Donkin added a comment -

        The Structure Of A Mail
        -----------------------
        Numerous RFCs describe the structure which emails should have. Though in the wild, wild web variations are encountered, it's important to read these standards to start to understand the data structure used by mail.

        Take a look at the Mime4J mail parser (http://james.apache.org/mime4j/index.html) and here's a selection of RFCs to skim:

        http://tools.ietf.org/html/rfc5322
        http://tools.ietf.org/html/rfc5335
        (and for historic reasons also:
        http://tools.ietf.org/html/rfc5335
        http://tools.ietf.org/html/rfc2822
        http://tools.ietf.org/html/rfc822)

        http://tools.ietf.org/html/rfc2045
        http://tools.ietf.org/html/rfc2184
        http://tools.ietf.org/html/rfc2231
        http://tools.ietf.org/html/rfc2046
        http://tools.ietf.org/html/rfc2646
        http://tools.ietf.org/html/rfc3676
        http://tools.ietf.org/html/rfc3798
        http://tools.ietf.org/html/rfc5147


          People

          • Assignee: Norman Maurer
          • Reporter: Eric Charles
          • Votes: 0
          • Watchers: 5
