James Mailbox / MAILBOX-173

[gsoc2012] Distributed mailbox indexing over HBase/HDFS

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: hbase, lucene, store
    • Labels:

      Description

      James provides a module called Lucene Mailbox Index that knows how to index emails. Indexing is done by providing a suitable Lucene Directory implementation that stores the index and allows searching. Lucene comes with a file-system Directory, a JDBC Directory, and a few other implementations that store the index in a file system or in a database.

      To provide distributed search, we should write a Directory implementation that stores the index in HBase. Such an implementation is described well in [1].

      [1] http://www.infoq.com/articles/LuceneHbase

      1. MAILBOX-173.patch
        139 kB
        Mihai Soloi

        Activity

        Ioan Eugen Stan added a comment -

        The directory implementation should accept a mailbox-id as parameter and use it to prefix the index for a mailbox to limit the search space to a single mailbox.
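A minimal sketch of the prefixing idea, for illustration only. It is not code from the patch and does not use the HBase or Lucene APIs: a `TreeMap` stands in for a store that keeps keys sorted, and the key layout `mailboxId/fileName`, the class name and the method name are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class MailboxPrefixScanSketch {

    // Returns only the keys that belong to one mailbox.
    // Assumes keys are plain strings of the (hypothetical) form "<mailboxId>/<name>".
    static List<String> keysForMailbox(NavigableMap<String, String> index, String mailboxId) {
        String prefix = mailboxId + "/";
        // In a sorted map, every key starting with the prefix lies in the
        // range [prefix, prefix + '\uffff'), so one range lookup is enough.
        return new ArrayList<>(
                index.subMap(prefix, true, prefix + Character.MAX_VALUE, false).keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> index = new TreeMap<>();
        index.put("mailbox-a/segments_1", "data");
        index.put("mailbox-a/_0.cfs", "data");
        index.put("mailbox-b/segments_1", "data");

        // Only the two entries of mailbox-a are visited; mailbox-b is never touched.
        System.out.println(keysForMailbox(index, "mailbox-a"));
    }
}
```

Because the mailbox id comes first in the key, a search for one mailbox reads one contiguous key range instead of the whole index.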

        Mihai Soloi added a comment -

        Hi Eugen, I've read the article and I am interested in this subject. I am currently experimenting with the James Server and will look into how search should be implemented.

        Ioan Eugen Stan added a comment -

        Hi,

        You should check out the mailbox project. lucene-mailbox is responsible for indexing and searching. It exposes an API that James server calls when a search is performed (usually triggered by an IMAP SEARCH command). An indexing mailet is usually responsible for indexing the document.

        Good luck,

        Mihai Soloi added a comment -

        Proposal submitted! Please take a look at it on Google Melange: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/mihaisoloi/1

        Thank you Eugen for the help so far!

        Mihai Soloi added a comment -

        Any input or suggestions are highly appreciated.

        Mihai Soloi added a comment -

        Implemented an HBaseDirectory along with HBase-backed IndexInput and IndexOutput. I am looking into HBASE-3529 for a better approach to distributed searching. I also emailed the Lucene dev mailing list about problems with the checksum when trying to get an already open IndexReader.

        Mihai Soloi added a comment -

        This patch implements an inverted index in an HBase table for searching through the mails in a mailbox.

        The structure of the index is as follows.

        1. The mailboxID is a java.util.UUID.
        2. The fields are now enums; what is stored is a single byte that identifies the enum field.
        3. The terms in each field are tokenized using Lucene's org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer, but some fields are not tokenized due to their nature (SENT_DATE, for example).

        The row key is composed of all the above byte arrays concatenated, so that searching through the HBase table is fast, as is lookup on a specific mailbox and field of a mail. The mail ID is the qualifier in the single, static column family, so mail IDs are easy to find.
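The concatenation described above could look roughly like the following sketch. It is not taken from the patch: the `Field` enum, its byte ids, the UTF-8 encoding of terms and the exact ordering of the parts are assumptions made for illustration, and only the JDK is used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class RowKeySketch {

    // Hypothetical field enum; the real enum and its byte ids are not shown in this issue.
    enum Field {
        SUBJECT((byte) 1), BODY((byte) 2), SENT_DATE((byte) 3);

        final byte id;

        Field(byte id) {
            this.id = id;
        }
    }

    // Assumed layout: 16-byte mailbox UUID, then 1-byte field id, then the term bytes.
    static byte[] rowKey(UUID mailboxId, Field field, String term) {
        byte[] termBytes = term.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(16 + 1 + termBytes.length)
                .putLong(mailboxId.getMostSignificantBits())
                .putLong(mailboxId.getLeastSignificantBits())
                .put(field.id)
                .put(termBytes)
                .array();
    }

    public static void main(String[] args) {
        UUID mailbox = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        byte[] key = rowKey(mailbox, Field.SUBJECT, "hbase");
        // 16 (UUID) + 1 (field id) + 5 (term bytes) = 22
        System.out.println(key.length);
    }
}
```

With this layout, all rows of one mailbox share a 16-byte prefix and all rows of one field within that mailbox share a 17-byte prefix, which is what makes lookups by mailbox and field a contiguous range in the table.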

        This is for the mail document itself. The flags are stored in a single row in the table (one row per mailbox) and can be found easily by a scan. Each row currently has an empty value, where in the future we may be able to store data related to term frequency in the document.

        What currently works are searches based on text, flags, headers, all criteria, UID and UID ranges. These are implemented using Filters inside an Endpoint Coprocessor, which reduces data transfer over the network and distributes processing across the regions.

        Ioan Eugen Stan added a comment -

        Hello Mihai,

        Thanks, and great job finishing it. I'll have a look and try to merge it, but I am quite busy at the moment, so it will be a while before I have time to look at it.

        Cheers,


          People

          • Assignee: Ioan Eugen Stan
          • Reporter: Ioan Eugen Stan
          • Votes: 0
          • Watchers: 2
