Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels:
      None

      Description

      Attached is a patch for a first-cut at a parser that handles mailbox (.mbox, application/mbox) files.

      • The first email headers are used to fill in metadata. Subsequent email headers are tossed.
      • Charset handling needs to be fixed up. It's unclear (not spec'd) whether each email uses the charset specified in its own headers, or whether the entire file should be re-encoded (with the encoding sent in the response header or auto-detected).
      • Multi-part emails won't be handled properly, though it's unclear what should be done in that case (if anything).
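
      To illustrate the behaviour described in the bullets above, here is a minimal sketch. It is not the code in tika-295.patch. The class name MboxSketch, the Result type and the scan method are hypothetical names made up for this example. It assumes that each message starts with a separator line beginning with "From " and that the header block ends at the first blank line. Charset handling and multi-part messages are deliberately left out.

      ```java
      import java.util.Arrays;
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Locale;
      import java.util.Map;

      /** Hypothetical sketch of the described behaviour; not the committed parser. */
      class MboxSketch {

          /** Headers of the first message plus the number of messages seen. */
          static final class Result {
              final Map<String, String> metadata = new LinkedHashMap<>();
              int messageCount = 0;
          }

          static Result scan(List<String> lines) {
              Result result = new Result();
              boolean inHeaders = false;
              String lastName = null; // header that a folded line would continue

              for (String line : lines) {
                  if (line.startsWith("From ")) {
                      // Assumed mbox separator line: a new message begins here.
                      result.messageCount++;
                      inHeaders = true;
                      lastName = null;
                      continue;
                  }
                  if (!inHeaders) {
                      continue; // message body (or junk before the first separator)
                  }
                  if (line.isEmpty()) {
                      inHeaders = false; // blank line ends the header block
                      continue;
                  }
                  if (result.messageCount != 1) {
                      continue; // headers of subsequent emails are tossed
                  }
                  boolean folded = line.startsWith(" ") || line.startsWith("\t");
                  if (folded && lastName != null) {
                      // Folded header: append the continuation to the stored value.
                      result.metadata.merge(lastName, line.trim(), (a, b) -> a + " " + b);
                      continue;
                  }
                  int colon = line.indexOf(':');
                  if (colon > 0) {
                      String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                      if (result.metadata.containsKey(name)) {
                          lastName = null; // keep only the first value of a repeated header
                      } else {
                          result.metadata.put(name, line.substring(colon + 1).trim());
                          lastName = name;
                      }
                  }
              }
              return result;
          }

          public static void main(String[] args) {
              Result r = scan(Arrays.asList(
                      "From alice@example.com Mon Sep 21 10:00:00 2009",
                      "Subject: First",
                      "",
                      "Body one",
                      "From bob@example.com Mon Sep 21 11:00:00 2009",
                      "Subject: Second",
                      "",
                      "Body two"));
              System.out.println(r.messageCount + " messages, metadata: " + r.metadata);
          }
      }
      ```

      A body line that itself begins with "From " would be mistaken for a separator in this sketch. Common mbox variants quote such lines as ">From ", so the sketch relies on the writer having done that.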
      1. tika-295.patch
        33 kB
        Ken Krugler

        Activity

        Ken Krugler added a comment -

        Hi Alex - thanks for looking into the formatting issues. Maybe I should open a Jira issue to create an Eclipse formatter file.

        Re additional work done on this parser - nothing more yet, it's working for what I currently need, sorry.

        Ken Krugler added a comment -

        Hi Thilo - I also looked at mstor, but trying to figure out the license issues and JavaMail dependencies gave me a headache.

        And the mbox format itself is trivial - the hard part is properly parsing the mail messages themselves, which is where (I think) mime4j would be a good option.

        But if there aren't any license issues, and it's easy to separate mstor, then I agree that's a good candidate.

        Thilo Goetz added a comment -

        I have used mstor in the past, which is under a BSD license and worked well for me. It drags in a whole boatload of dependencies (and I didn't check all the licenses), but I suspect that just for MBOX parsing you won't need most of them. It might be worth checking out mstor before writing our own mbox parser.

        Alex Baranau added a comment - edited

        I guess since Tika is a subproject of Lucene, you should use the same format as for other Lucene projects:

        http://wiki.apache.org/lucene-java/HowToContribute
        http://wiki.apache.org/solr/HowToContribute
        (at the end of the pages).

        [Edited: well, it turned out that the Tika project uses a different coding style. At least the indent is 4 spaces instead of 2...]

        One question about the parser - are you still working on it? Any progress since the first draft?

        Ken Krugler added a comment -

        Hi Jukka,

        Is there an Eclipse formatter file that defines the Tika project's target format?

        Thanks,

        – Ken

        Jukka Zitting added a comment -

        Nice work, thanks! I committed the patch (with tabs->spaces changes and an added license header for the test case) in revision 820967.

        For further work on this I would suggest using the Mime4J library [1] from Apache James, as they've already dealt with many of the questions you raise above.

        I'm resolving this as Fixed as the basic feature is now there thanks to the patch. Please file additional issues on any future improvements.

        [1] http://james.apache.org/mime4j/

        Ken Krugler added a comment -

        This patch also relies on using Mockito for unit tests, so there's a modified pom.xml that adds this as a dependency.

        I'm hoping it's OK to add Mockito to the test scope.


          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Ken Krugler
          • Votes:
            0
            Watchers:
            2
