Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6
    • Component/s: parser
    • Labels:
      None

      Description

      Hello everyone,

      As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/

      I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika.

      Best regards
      Tran Nam Quang

        Activity

        Hong-Thai Nguyen added a comment -

        Improvement: extract each mail as an attachment document, recursing into folders, subfolders, and the attachments inside each mail.
        Committed at r1584574

        Luis Filipe Nassif added a comment -

        A possible improvement could be recursing down to attached messages (if present) and parsing them and their attachments through PSTAttachment.getEmbeddedPSTMessage(). Setting a relationship id between messages and attachments would be very nice too.

        Hong-Thai Nguyen added a comment -

        Luis Filipe Nassif, binary attachments are handled with embeddedExtractor. BTW, I agree that we can split each mail into a separate unit.
        Tim Allison, we couldn't fix .pst and .msg (msg is already handled as part of OfficeParser); feel free to finish this issue properly as you can.

        Tim Allison added a comment -

        Agreed. Is there any way to reuse OutlookParser, or to refactor so that we're using the same lib for an email, whether .pst or .msg? There are lots of lessons learned embedded in the OutlookParser. I'll be happy to chip in as I can. Hong-Thai Nguyen, thank you for getting this rolling!

        Luis Filipe Nassif added a comment -

        Good job. I think a possible improvement would be to generate an HTML document for each email, containing its metadata and content, and to call the embeddedExtractor to process the generated HTML, instead of printing all emails directly to xhtmlContentHandler. That way, in addition to attachments, the emails themselves could be extracted from PST files if that is the goal of the application. What do you think?

        Hong-Thai Nguyen added a comment -

        Committed at r1574411

        Hong-Thai Nguyen added a comment - - edited

        java-libpst-0.7 has been uploaded to oss sonatype nexus: https://issues.sonatype.org/browse/OSSRH-8965
        If there's no objection, I'll refactor the attached parser to provide output like this:

        <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
        <meta name="Content-Length" content="271360" />
        <meta name="isValid" content="true" />
        <meta name="Content-Type" content="application/vnd.ms-outlook" />
        <title></title>
        </head>
        <body>
        	<div class="email-folder">
        		<h1>Début du fichier de données Outlook</h1>
        		<div class="email-entry">
        			<h1>&lt;530D9CAC.5080901@gmail.com&gt;</h1>
        			<meta subject="Re: Feature Generators" />
        			<meta internetMessageId="&lt;530D9CAC.5080901@gmail.com&gt;" />
        			<meta descriptorNodeId="2097188" />
        			<meta lastModificationTime="1393418263291" />
        			<meta senderName="Jörn Kottmann" />
        			<meta senderEmailAddress="kottmann@gmail.com" />
        			<meta recipients="No recipients table!" />
        			<p>mail content</p>
        		</div>
        		<div class="email-folder">
        			<h1>Éléments supprimés</h1>
        		</div>
        	</div>
        	<div class="email-folder">
        		<h1>Racine (pour la recherche)</h1>
        	</div>
        	<div class="email-folder">
        		<h1>SPAM Search Folder 2</h1>
        	</div>
        </body>
        </html>
        
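The nested layout above can be produced by a plain recursive walk over the folder tree. The following is a minimal sketch under stated assumptions, not the committed parser: `FolderNode` and `FolderTreeRenderer` are hypothetical stand-ins that use only the JDK, not java-libpst or Tika types, and only the folder and entry headings are emitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a PST folder; not a java-libpst type.
class FolderNode {
    final String name;
    final List<String> messageIds = new ArrayList<>();
    final List<FolderNode> subFolders = new ArrayList<>();

    FolderNode(String name) {
        this.name = name;
    }
}

class FolderTreeRenderer {
    // Escape the characters that are significant in XHTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Depth-first walk: one div per folder, one nested div per message,
    // then recurse into subfolders so the nesting mirrors the folder tree.
    static void render(FolderNode folder, StringBuilder out) {
        out.append("<div class=\"email-folder\">");
        out.append("<h1>").append(escape(folder.name)).append("</h1>");
        for (String id : folder.messageIds) {
            out.append("<div class=\"email-entry\"><h1>")
               .append(escape(id))
               .append("</h1></div>");
        }
        for (FolderNode sub : folder.subFolders) {
            render(sub, out);
        }
        out.append("</div>");
    }

    static String render(FolderNode root) {
        StringBuilder out = new StringBuilder();
        render(root, out);
        return out.toString();
    }
}
```

Keeping the tree structure in the output, rather than flattening it to a linear stream of emails, lets a consumer recover the folder of each message from the nesting alone.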
        Jim Kay added a comment -

        I also would like to see this capability added.

        Gary Gregory added a comment -

        Did anyone ever push java-libpst to Maven Central? Searching for 'java-libpst' yields 0 results.

        Jukka Zitting added a comment -

        Is there some way to proceed here without requiring libpst be mavenized?

        Certainly. The only thing we'd need is to have the library available as a dependency on the central repository (otherwise we can't push out a Tika release with such a dependency). This requires no changes to the upstream library, just some extra metadata and appropriate -sources and -javadoc jars to accompany the upload. See https://docs.sonatype.org/display/Repository/Uploading+3rd-party+Artifacts+to+The+Central+Repository for details.

        Anyone can volunteer to take care of this. See for example https://groups.google.com/d/topic/tagsoup-friends/vIUe_jSR5YQ/discussion for a thread where I volunteered and did this for a recent release of the TagSoup library.

        Michael McCandless added a comment -

        Is there some way to proceed here without requiring libpst be
        mavenized? Ie, is that really a blocker? Are we unable to add a
        simple JAR into Tika, here, and then open a follow-on issue for
        mavenizing libpst?

        If it is a blocker.... can one of the maven gurus please step in and
        help?

        I don't think we should push "mavenizing" responsibilities onto
        contributors if we can possibly help it... it's already wonderful
        enough that Richard created libpst, relicensed it so we could
        incorporate it, fixed bugs, that Tran created an initial Tika parser,
        that Mark is pushing things forward.

        Andrzej Bialecki added a comment -

        Mark, visiting the GitHub link to the project results in 404 Not Found... Are you still working on this? PST support would surely be a nice addition to Tika, so to answer your question: yes, please continue. It doesn't have to be ideal, but as soon as it's in Maven it's more likely that the Tika parser glue that Tran created can be fleshed out and added.

        Mark Kerzner added a comment -

        Hi, everybody,

        I have forked Richard Johnson's java-libpst project here on GitHub https://github.com/markkerzner/JavaLibpst. My reasons for doing this are as follows:

        1. I need java-libpst parsing capabilities for my FreeEed project https://github.com/markkerzner/FreeEed
        2. I want it in Maven, for FreeEed's purposes, and later on I would be happy to see it included in Tika, which also needs it in Maven;
        3. I want it in active development, and Richard told me that he has less time for it than before.
        4. By no means do I want to take the glory or the project away from Richard, but it is one of the keys for FreeEed's adoption in Windows.

        I am in touch with Richard on all that, but I want the community feedback. Should I continue? Should I bring it into some Maven repository? I have been working with Carl Byington and know his libpst somewhat, so that additional qualification should help. Therefore, please, how am I to proceed?

        Thank you.

        Tran Nam Quang added a comment -

        Okay, I finished the basic PST Tika parser. Emphasis is on "basic". A lot of lines in the parser are marked as TODO, especially the metadata and content handling, simply because I had no idea what to do. Hope somebody will clean this up.

        Tran Nam Quang added a comment -

        The PST file is basically a folder tree with emails and other stuff in it. Is there some sort of specification out there that tells me how to map this tree to specific XHTML elements?

        More specifically, what XML tags should I use to separate the emails from one another? And should the output be just a linear stream of emails, or should the tree structure be included in the output as well?

        Richard Johnson added a comment -

        I'll start working on getting the library into Maven Central, thanks for those links Nick.

        Richard Johnson added a comment -

        getDescriptorNodeId() is most likely the one you want for a unique identifier. Descriptor node ids are for internal use; however, they are guaranteed unique per PST file and never change (they are allocated incrementally and not reused).

        Internet Message Ids are the ones from RFC 2822, so not all PST objects (such as unsent emails) have them.

        I'll get this updated in the javadocs.
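
One way to act on this distinction is to key on the descriptor node id and treat the Internet Message Id as optional. The sketch below is only an illustration under that assumption: `MessageKey` is a hypothetical JDK-only holder, not a java-libpst class, and the `long` type and the `pstName` qualifier are choices made for the example.

```java
import java.util.Optional;

// Hypothetical holder for the two identifiers discussed above; the field
// names mirror the getter names, but this is not a java-libpst class.
class MessageKey {
    final long descriptorNodeId;               // unique within one PST file
    final Optional<String> internetMessageId;  // RFC 2822 Message-ID, may be absent

    MessageKey(long descriptorNodeId, String internetMessageId) {
        this.descriptorNodeId = descriptorNodeId;
        // Treat null or empty as "no Message-ID", e.g. for unsent emails.
        this.internetMessageId = Optional.ofNullable(internetMessageId)
                .filter(id -> !id.isEmpty());
    }

    // Node ids are only unique per file, so qualify them with a
    // caller-supplied name for the PST file to build a usable key.
    String keyWithin(String pstName) {
        return pstName + "#" + descriptorNodeId;
    }
}
```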

        Nick Burch added a comment -

        Tran - wrap it in a TikaInputStream, which handles converting between Files and InputStreams as required by the underlying libraries.
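
The general idea behind that advice, when a library insists on a file but the caller only has a stream, is to spool the stream to a temporary file first. The following JDK-only sketch illustrates that idea; it deliberately does not reproduce the TikaInputStream API, and `StreamSpooler`, the temp-file prefix, and the 4-byte header read are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StreamSpooler {
    // Copy the stream into a temporary file so that file-only APIs can open it.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".pst");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Example consumer that needs random access: read the first 4 bytes.
    static byte[] readHeader(InputStream in) throws IOException {
        Path tmp = spoolToTempFile(in);
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            byte[] header = new byte[4];
            raf.readFully(header);
            return header;
        } finally {
            // Runs after the RandomAccessFile is closed; remove the temp file.
            Files.deleteIfExists(tmp);
        }
    }
}
```

Whoever creates the temporary file also has to delete it, which is one reason to leave this bookkeeping to a shared wrapper rather than repeating it in every parser.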

        Tran Nam Quang added a comment - - edited

        I started work on the Tika parser, but got stuck with the following problem: In order to access the Outlook PST file, I need to create a PSTFile instance. Now, the PSTFile constructor requires either a File or a String argument that points at the PST file. The constructor then takes either of these arguments to create a RandomAccessFile internally. However, Tika's Parser interface gives me an InputStream. What do I do?

        Tran Nam Quang added a comment - - edited

        Cool! I'll start writing the Tika parser as soon as I can. Could take a couple of days though.

        Richard, I have one question regarding the API: PSTMessage has two methods, getDescriptorNodeId() and getInternetMessageId(). Both return identifiers, apparently. My question is: which one is a unique identifier that will never, ever change? I wouldn't want the Tika parser to extract identifiers that are "internal-only" and not unique.

        Btw, maybe it's a good idea to also clarify this in the Javadoc.

        Nick Burch added a comment -

        Great news Richard.

        Are you happy to start the process of getting the new release into Maven Central? The process should be largely the same as Ken did with TIKA-462, and Sonatype seem to have a very handy walkthrough of the process at https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide

        Richard Johnson added a comment -

        Hey Guys,

        I've just uploaded a new version with some cleanups, bug fixes and most importantly a new License.

        Kind Regards,

        Richard

        Ken Krugler added a comment -

        I've got comments on TIKA-462 that talk about what I did to get Boilerpipe into Maven Central, via Sonatype.

        Nick Burch added a comment -

        Could one of our Maven gurus (Jukka? Chris?) maybe help out Richard with getting the new release into Maven Central when it's out?

        Looking at the codebase, I can see quite a few places where there's code to do things that Apache POI also does. (RTF LZW, reading MAPI properties, MAPI constants, Little Endian etc). Longer term, it's probably worth Richard joining the POI dev list so we can see when libpst can use POI, and what POI could use from libpst. That's not needed for the initial Tika plugin though!

        Richard Johnson added a comment -

        Hi Guys,

        I'm the original author. I've cleared the license change with the other contributor, and will try to get a release out that reflects this over the next few days.

        Also, as Uwe has pointed out, there are some clean-ups that really should be made to the project. I am a little time limited; however, I will attempt to address these as they are brought to my attention.

        Thanks for considering this project for inclusion.

        Kind Regards,

        Richard

        Tran Nam Quang added a comment -

        I have zero experience with Maven, so I don't think I'm the right person to take care of the Maven upload.

        I might be able to handle the Parser, although it'll probably have to wait until the library author makes a new relicensed release available.

        Nick Burch added a comment -

        The re-license is great news! There are two steps needed then:

        • Get a version of libpst into Maven Central (so we can include it as a dependency)
        • Write a Parser which uses libpst, likely one that does all the metadata bits and delegates to other parsers for the message body + attachments

        For the former, see something like TIKA-407 for a guide. For the latter, I'd suggest cribbing off something like PackageParser and the Outlook Parser

        Tran Nam Quang added a comment - - edited

        I contacted the library author, he agreed to dual-licensing the library as LGPL/Apache. This means java-libpst can be included by default in Tika, right?

        As for the Tika parser, I won't be able to implement that before Saturday or Sunday (assuming I'm still supposed to).

        Nick Burch added a comment -

        Details on the licenses that are allowed to be used are at: http://www.apache.org/legal/resolved.html

        From looking at their homepage, writing a tika parser shouldn't be too hard - you'd likely want to crib off one of the other container based parsers to see how to have each part processed for you by the appropriate tika parsers.

        Uwe Schindler added a comment -

        From looking at the code of this library, it looks that it needs some improvements/fixes:

        • It catches all exceptions and, instead of wrapping and rethrowing them or declaring the checked exceptions on the methods, prints the stack trace to System.out. Messages are also printed to System.out.
        • The RTF compression decoder uses new String(byte[]) without a charset, which makes the result locale dependent. Other places do this, too. This is broken, as the file format should define the charset.
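
Both points can be illustrated with a small JDK-only sketch. It is not a patch against java-libpst: `BytesDecoder` is a hypothetical helper, and the charset is passed in by the caller because this thread does not say which charset the format prescribes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class BytesDecoder {
    // Decode with an explicit charset so the result does not depend on the
    // JVM's default encoding, unlike new String(byte[]).
    static String decode(byte[] raw, Charset charset) throws IOException {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Wrap and rethrow with context instead of printing the stack
            // trace, so the caller decides how to handle or log the failure.
            throw new IOException("Could not decode bytes as " + charset.name(), e);
        }
    }
}
```

Declaring `throws IOException` keeps the failure visible in the method signature, which matters for a parser framework that needs to report per-document errors itself.
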
        Tran Nam Quang added a comment - - edited

        What licenses would permit inclusion in Tika, other than the Apache License 2.0? I could ask the author to change the library's license or to switch to dual-licensing...

        The basic parser is already listed as an example on the front page of the java-libpst website, by the way.

        Nick Burch added a comment -

        If it's LGPL then we can't include it in Tika as standard.

        However, it is possible to have the parser dynamically loaded if a user chooses to download the parser + dependent files (if the license works for them).

        If you're interested in pst support, then I'd suggest you try to knock up a basic parser using libpst. If you do get it working, please list it on the wiki:
        http://wiki.apache.org/tika/3rd%20party%20parser%20plugins

        If you need help with developing the plugin, please ask on the dev list. You might also be interested in looking at the relatively small patch that was all that was required to enable JTNEF (GPL) to be used as a Tika plugin:
        https://github.com/jukka/jtnef/commit/a9a51982165101c0bdda4cb5266d7f8958c271ef


          People

          • Assignee:
            Unassigned
            Reporter:
            Tran Nam Quang
          • Votes:
            6
            Watchers:
            9
