Solr / SOLR-934

Enable importing of mails into a Solr index through DIH.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Labels: None

      Description

      Enable importing of mails into Solr through DIH. Take one or more mailbox credentials, then download and index their content along with the content of attachments. The folders to fetch can be made configurable based on various criteria. Apache Tika is used for extracting content from different kinds of attachments. JavaMail is used for mailbox-related operations such as fetching and filtering mails.

      The basic configuration for one mailbox is as below:

      <document>
         <entity processor="MailEntityProcessor" user="somebody@gmail.com" 
                      password="something" host="imap.gmail.com" protocol="imaps"/>
      </document>
      

      Below is the list of all available configuration options:

      Required
      ---------
      user
      pwd
      protocol (only "imaps" supported now)
      host

      Optional
      ---------
      folders - comma separated list of folders.
      If not specified, the default folder is used. Nested folders can be specified like a/b/c
      recurse - index subfolders. Defaults to true.
      exclude - comma separated list of patterns.
      include - comma separated list of patterns.
      batchSize - number of mails to fetch at once from a given folder.
      Only headers can be prefetched in JavaMail IMAP.
      readTimeout - defaults to 60000 ms
      connectTimeout - defaults to 30000 ms
      fetchSize - IMAP fetch size. Defaults to 32 KB
      fetchMailsSince -
      date/time in "yyyy-MM-dd HH:mm:ss" format; mails received after this time will be fetched. Useful for delta import.
      customFilter - fully qualified name of a class implementing MailEntityProcessor.CustomFilter
      (a sample implementation is sketched after the field list below):

      import javax.mail.Folder;
      import javax.mail.search.SearchTerm;

      public interface CustomFilter {
        SearchTerm getCustomSearch(Folder folder);
      }
      

      processAttachement - defaults to true
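
      For illustration, a configuration that exercises several of the optional attributes might look like the one below; the folder names, patterns, date, and batch size are placeholders, not defaults:

      <document>
         <entity processor="MailEntityProcessor" user="somebody@gmail.com"
                      password="something" host="imap.gmail.com" protocol="imaps"
                      folders="inbox,work/reports" recurse="true"
                      exclude=".*drafts.*" batchSize="20"
                      fetchMailsSince="2009-01-01 00:00:00"
                      processAttachement="true"/>
      </document>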

      Below are the indexed fields.

        // Fields To Index
        // single valued
        private static final String SUBJECT = "subject";
        private static final String FROM = "from";
        private static final String SENT_DATE = "sentDate";
        private static final String XMAILER = "xMailer";
        // multi valued
        private static final String TO_CC_BCC = "allTo";
        private static final String FLAGS = "flags";
        private static final String CONTENT = "content";
        private static final String ATTACHMENT = "attachement";
        private static final String ATTACHMENT_NAMES = "attachementNames";
        // flag values
        private static final String FLAG_ANSWERED = "answered";
        private static final String FLAG_DELETED = "deleted";
        private static final String FLAG_DRAFT = "draft";
        private static final String FLAG_FLAGGED = "flagged";
        private static final String FLAG_RECENT = "recent";
        private static final String FLAG_SEEN = "seen";
      
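      As referenced above, the customFilter attribute names a class implementing MailEntityProcessor.CustomFilter. A minimal sketch follows; the class name and the seven-day cut-off are hypothetical, and the processor is assumed to live in the usual DIH package:

        import java.util.Date;

        import javax.mail.Folder;
        import javax.mail.search.ComparisonTerm;
        import javax.mail.search.ReceivedDateTerm;
        import javax.mail.search.SearchTerm;

        import org.apache.solr.handler.dataimport.MailEntityProcessor;

        // Hypothetical filter: only fetch mails received in roughly the last seven days.
        public class RecentMailFilter implements MailEntityProcessor.CustomFilter {
          public SearchTerm getCustomSearch(Folder folder) {
            Date aWeekAgo = new Date(System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000);
            return new ReceivedDateTerm(ComparisonTerm.GE, aWeekAgo);
          }
        }
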
      1. SOLR-934.patch
        2 kB
        Shalin Shekhar Mangar
      2. SOLR-934.patch
        118 kB
        Shalin Shekhar Mangar
      3. SOLR-934.patch
        129 kB
        Shalin Shekhar Mangar
      4. SOLR-934.patch
        37 kB
        Shalin Shekhar Mangar
      5. SOLR-934.patch
        31 kB
        Preetam Rao
      6. SOLR-934.patch
        31 kB
        Preetam Rao
      7. SOLR-934.patch
        15 kB
        Preetam Rao

        Activity

        Preetam Rao added a comment -

        Rough cut version. Tested with sample mails from my gmail account.

        • Indexes one folder from IMAP account.
        • Indexes attachments from various types like ppt, word, txt, and anything that Tika supports.

        TODO
        --------

        • recurse into folders
        • performance tuning
        • support filter criteria for folders
        • support more than one mailbox
        • support pop3

        USAGE
        ----------

        For each mail it creates a document with the following attributes:
        // Created fields
        // single valued
        "subject"
        "from"
        "sent_date"
        "sent_date_display"
        "X_Mailer"
        // multi valued
        "all_to"
        "flags"
        "content"
        "Attachement"

        // flag values
        "answered"
        "deleted"
        "draft"
        "flagged"
        "recent"
        "seen"

        COMPILE
        -------------
        Dependencies:
        JavaMail API jar
        Activation jar
        Tika and its dependent jars

        How should we go about adding these dependencies?

        Shalin Shekhar Mangar added a comment -

        Thanks for this Preetam, looks great!

        A few suggestions:

        1. Use the Lucene code style – you can get a codestyle for Eclipse/Idea from http://wiki.apache.org/solr/HowToContribute
        2. Let us use the Java variable naming convention for the fields, e.g. sent_date becomes sentDate
        3. I don't think we need the sent_date_display, people can always format the date and display it as they want
        4. All the attributes for the entity processor should be templatized, e.g. user="${dataimporter.request.user}" and so on. You'd need to use context.getVariableResolver().replaceTokens(attr) (a sketch follows below)

        5. The Profile class looks unnecessary. The values can be stored directly as private variables
        6. Attachment names can be another multi-valued field
        7. Exceptions while connecting must be propagated so that users know why the connection is failing.
        8. For delta imports, we can just provide an olderThan and newerThan syntax. That should be enough
        9. Streaming is recommended instead of calling folder.getMessages(). We can use getMessages(int start, int end) and the batchSize can be a configurable parameter with some sane default.

        Support for recursive folders will be awesome.
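
        As a sketch of the templatization suggested in point 4 above (the handler path and request parameter names are assumptions; users can pass whichever parameters the config references):

        <document>
           <entity processor="MailEntityProcessor" protocol="imaps" host="imap.gmail.com"
                        user="${dataimporter.request.user}"
                        password="${dataimporter.request.password}"/>
        </document>

        The values would then be supplied on the import request, e.g.
        http://localhost:8983/solr/dataimport?command=full-import&user=somebody@gmail.com&password=something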

        Preetam Rao added a comment -

        I agree with all the comments... Will incorporate them soon...

        Preetam Rao added a comment - edited

        Most of the features are implemented now.
        Test cases also updated.

        • recursion supported.
        • folders can be selected/excluded by list of comma separated patterns
        • mails can be fetched since a predefined receive date/time
        • custom filters can be plugged in
        • batching supported

        TODO

        • currently the testbed needs to be set up manually. Create folders in the test case's setup().
        • support POP3
        • any reviews/feedback/cleanup

        Attaching all the dependency jars as an attachment so that one does not have to search for them. Maybe it should be integrated through ant-maven tasks or maven directly.

        Noble Paul added a comment - edited

        looks good. A few observations.

        • the init must call super.init()
        • Right before returning from nextRow(), call super.applyTransformer(row)
        • Returning null signals the end of rows. Close any connections or do cleanup
        • 'exclude' and 'include' should either allow for escaping the comma (between multiple regexes) or just take one regex for the time being
        • For fetchMailsSince use the format yyyy-MM-dd HH:mm:ss. There is already an instance DataImporter.DATE_TIME_FORMAT
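
        A rough sketch of the init()/nextRow() contract described above (this is not the patch itself; the class name and helper methods are hypothetical):

        import java.util.Map;

        import org.apache.solr.handler.dataimport.Context;
        import org.apache.solr.handler.dataimport.EntityProcessorBase;

        public class MailEntityProcessorSketch extends EntityProcessorBase {
          public void init(Context context) {
            super.init(context);                            // required: let the base class initialize
            // read entity attributes, open the mail store, etc.
          }

          public Map<String, Object> nextRow() {
            Map<String, Object> row = fetchNextMailRow();   // hypothetical helper
            if (row == null) {
              closeConnections();                           // hypothetical cleanup before signalling the end
              return null;                                  // null signals the end of rows to DIH
            }
            return super.applyTransformer(row);             // apply transformers just before returning
          }

          private Map<String, Object> fetchNextMailRow() { return null; }  // elided
          private void closeConnections() { }                              // elided
        }
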
        Grant Ingersoll added a comment -

        Would it make more sense for DIH to farm out its content acquisition to a library like Droids? Then we could have real crawling, etc., all through a pluggable connector framework.

        Noble Paul added a comment -

        Would it make more sense for DIH to farm out it's content acquisition to a library like Droids

        It would be great. It should be possible to have a DroidEntityProcessor one day.

        Preetam Rao added a comment -

        Regarding comma separated list of patterns:

        Folder names usually won't contain commas. The regex construct which contains commas is the
        {M,N} quantifier for limiting the number of occurrences, which also does not seem very
        useful for restricting folder names.

        Can we leave it as it is till the need arises? If not, what would be a good escape character or replacement for the comma?

        Noble Paul added a comment -

        This is a trivial thing. Other suggestions are really important

        Preetam Rao added a comment -

        Thanks for comments and feedback Noble and Shalin.

        Attached is the latest patch which calls init() as well as applyTransformer(). Receives fetchTimeSince in yyyy-MM-dd HH:mm:ss format.

        The exclude/include pattern is still comma separated.

        Cleanup is already being handled in FolderIterator when it learns that all folders have been exhausted.

        Could not attach the dependency jars (13 MB). Both a single part and multiple smaller parts fail...

        Preetam Rao added a comment -

        updated date format

        Shalin Shekhar Mangar added a comment -

        MailEntityProcessor and its dependencies must be kept in one place – either in WEB-INF/lib or $solr_home/lib. We can't keep just the MailEntityProcessor in the war because it won't be able to load the dependencies from $solr_home/lib (due to the classloader being different) and asking the user to drop the dependencies to WEB-INF/lib does not sound good. It is impractical to keep all these dependencies in the solr war itself because most users will not need this functionality.

        I guess this needs to go into a separate contrib area. Thoughts?

        PS: a contrib for a contrib, cool!

        Noble Paul added a comment -

        How about a new contrib called 'dih-ext'? All future DIH enhancements which require external dependencies could go there (like a TikaEntityProcessor).

        Shalin Shekhar Mangar added a comment -
        1. Brought patch in sync with trunk
        2. Created a 'lib' directory inside contrib/dataimporthandler which will have the mail and activation jars
        3. Created an 'extras' directory inside src which will hold DIH components that have extra dependencies
        4. Added ant targets to work on the extras in DIH build.xml

        TODO:

        1. Need to find out the licenses for the additional dependencies
        2. Need to add info about these dependencies into LICENSE.txt and NOTICE.txt
        3. Test some more (perhaps index my gmail account?) and create a demo?
        4. Add info to the wiki
        5. Commit
        Shalin Shekhar Mangar added a comment -

        Changes:

        1. Parse and store the fetchMailsSince string during init.
        2. Return the sentDate as a Date object rather than as a long timestamp
        3. Removed context as an argument from the getXFromContext methods
        4. Removed unused getLongFromContext method

        I just indexed a month's worth of my gmail inbox. Works great!

        One question, what is the uniqueKey that we should use when indexing emails? I couldn't figure out so I removed the uniqueKey from my schema to try this out.

        Next steps:

        1. Enhance the ant build file to copy the dependencies to example/solr/lib just like Solr Cell does.
        2. Add a wiki page with instructions to setup, list of dependencies, example schema and data-config.xml
        Hoss Man added a comment -

        One question, what is the uniqueKey that we should use when indexing emails? I couldn't figure out so I removed the uniqueKey from my schema to try this out.

        FWIW: "Message-ID", while common, is not mandatory (see sections 3.6 and 3.6.4 of RFC 2822 and RFC 5322)

        Ryan McKinley added a comment -

        FWIW: "Message-ID", while common, is not mandatory (see sections 3.6 and 3.6.4 of RFC 2822 and RFC 5322)

        In practice you cannot rely on the "Message-ID" to be unique. Most modern mail servers do a good job of making sure each value is unique, but some old MS mail servers sent the same message ID for every message!

        Noble Paul added a comment -

        One question, what is the uniqueKey that we should use when indexing emails?

        The "Message-ID" can be emitted by the EntityProcessor it can be left to the discretion of the user whether to use that as a uniqueKey or not.

        Shalin Shekhar Mangar added a comment -

        Changes

        1. Added messageId as another field
        2. Added another core to example-DIH for indexing mails. When the example target is run, it copies over the tika libs, mail.jar, activation.jar and extras.jar into example/example-DIH/solr/mail/lib directory.
        3. Added a maven pom template for extras jar
        4. Updated maven related targets in the main build.xml for the new pom
        5. Added licenses for mail.jar and activation.jar in LICENSE.txt

        I'm not sure what needs to be added to NOTICE.txt, can anybody help?

        To run this:

        1. Apply this patch
        2. Create a directory called lib inside contrib/dataimporthandler
        3. Download and add mail.jar and activation.jar in the above directory
        4. Update example/example-DIH/solr/mail/conf/data-config.xml with your mail server and login details
        5. Run ant clean example
        6. cd example
        7. java -Dsolr.solr.home=./example-DIH/solr -jar start.jar
        8. Hit http://localhost:8983/solr/mail/dataimport?command=full-import

        I'll let people try this out before committing this in a day or two.

        This will probably need some more enhancements which can be done through additional issues. Some that I can think of are:

        1. Pluggable CustomFilter implementations
        2. Making fields/methods inside MailEntityProcessor protected so functionality can be enhanced/overridden
        3. Attachments are stored as two fields, attachment and attachmentNames; we need a way to associate one with the other. I recall some discussion on the LocalSolr issue about something similar for multiple lat/long pairs.
        4. Enhance example configuration to be able to run a mailing list search service out-of-the-box
        Shalin Shekhar Mangar added a comment -

        Updated NOTICE.txt and LICENSE.txt with the license information given at the following:

        http://repo2.maven.org/maven2/javax/activation/activation/1.1/activation-1.1.pom
        http://repo2.maven.org/maven2/javax/mail/mail/1.4.1/mail-1.4.1.pom

        I'll commit this shortly.

        Shalin Shekhar Mangar added a comment -

        Committed revision 764601.

        Thanks Preetam!

        Shalin Shekhar Mangar added a comment -

        A few changes in this patch

        1. Made the CustomFilter interface static
        2. Removed logRow method. LogTransformer can be used if needed
        3. logConfig first checks if info level is enabled or not

        I'll commit shortly.

        Shalin Shekhar Mangar added a comment -

        Committed revision 764691.

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4


          People

          • Assignee: Shalin Shekhar Mangar
          • Reporter: Preetam Rao
          • Votes: 2
          • Watchers: 2


              Time Tracking

              • Original Estimate: 24h
              • Remaining Estimate: 24h
              • Time Spent: Not Specified
