NUTCH-21: Parser plugin for MS PowerPoint slides

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8
    • Component/s: fetcher
    • Labels: None

      Description

      1. build.xml.patch.txt (2 kB, Stephan Strittmatter)
      2. MSPowerPointParser.java (4 kB, Renat Lumpau)
      3. parse-mspowerpoint.zip (972 kB, Stephan Strittmatter)
      4. parse-mspowerpoint.zip (955 kB, Stephan Strittmatter)

        Activity

        Stephan Strittmatter added a comment -

        I am refactoring it at the moment to provide a patch that conforms to org.apache.nutch.

        Stephan Strittmatter added a comment -

        Attached you can find the complete PowerPoint parser.
        Also included in the zip are:

        • A JUnit test with one sample (the protocol-file plugin is required to run it).
          It is a very detailed test that checks the result character by character.
          I think it would be a good idea to extract this as a small test
          environment for all parsers to get the most useful parsing results.
        • The required POI jars.

        The build.xml file of the plugin directory has to be updated for this additional plugin. For this I attached build.xml.patch.txt.

        Stephan Strittmatter added a comment -

        Was someone able to use this plugin successfully?
        Shall I provide a compiled version also?

        David Spencer added a comment -

        This may be of some use:

        I needed a PPT parser in the context of Lucene, so I copied the code from here, commented out a few nutch-specific things (e.g. the logging calls), and tested it on some local PPT files. I'm using POI-2.5.1-final.

        The code is not perfect, nor is the PPT I have but it's pretty good.
        When it works it works well.
        When it fails it sometimes says there is no content, but in the doc I spot checked there seemed to be textual content. I have only spot checked a few docs, but I did run it over the PPT files on my disk:

        In a test run:
        [a] I had 195 PPT files
        [b] In 36 files it said there was no body
        [c] With one file it threw an exception
        [d] With 158 files it found content

        Wrt [b], this is not necessarily wrong, e.g. if there are only images; however, in the one file I spot checked there was apparently textual content.

        Wrt [d], I didn't spot check many files but the ones I did seemed fine.

        Personally I would advocate using this esp if someone verifies this within nutch - but I'm confident it will work as I didn't change much to use it in Lucene.

        This was the "bug" that happened in one file:
        Caused by: java.io.IOException: Cannot remove block[ 18805 ]; out of range
        at org.apache.poi.poifs.storage.BlockListImpl.remove(BlockListImpl.java:103)
        at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:92)
        at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:83)
        at com.tropo.ppt.PPT2Text.init(PPT2Text.java:92)

        – Dave

        David Spencer added a comment -

        I figured out why I was getting docs with zero body.
        Here's a stack trace, but note that I changed the package to com.tropo.ppt for my own use.

        java.lang.StringIndexOutOfBoundsException: String index out of range: 1756156169
        at java.lang.String.checkBounds(String.java:287)
        at java.lang.String.<init>(String.java:370)
        at com.tropo.ppt.ContentReaderListener.extractSlides(ContentReaderListener.java:353)
        at com.tropo.ppt.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:121)
        at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:259)
        at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:95)
        at com.tropo.ppt.PPT2Text.init(PPT2Text.java:92)
        at com.tropo.ppt.PPT2Text.getProperties(PPT2Text.java:132)
        at com.tropo.ppt.PPT2Text.main(PPT2Text.java:144)


        What's happening is in ContentReaderListener.java.

        In processPOIFSReaderEvent() there's an empty catch(Throwable) block that hides the error.

        In extractSlides() it happily goes thru some data but then, for some reason, 'size' is larger than pptdata.length.

        One hack to "fix" this is to replace this:

        final String strTempContent = new String(pptdata, (int) i + 6,
            (int) (size) + 2);

        with this:

        String strTempContent;
        try {
          strTempContent = new String(pptdata, (int) i + 6, (int) (size) + 2);
        } catch (StringIndexOutOfBoundsException ouch) {
          strTempContent = "";
        }

        When I do this I get data out of a ppt file that previously seemed to have a zero length body...
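
        As an alternative to catching the exception, the range could be checked before the String is built. The following is only a sketch, not the code that was committed: the class SafeSlice and the method decodeClamped are hypothetical names, and the ISO-8859-1 charset is an assumption made so the example behaves the same on every platform (the original call uses the platform default charset). Unlike the hack above, which yields an empty string, this version keeps whatever bytes are actually available when the declared size runs past the end of the array.

```java
import java.nio.charset.StandardCharsets;

public class SafeSlice {

    /**
     * Decodes up to `length` bytes of `data` starting at `offset`.
     * A range that runs past the end of the array is clamped instead of
     * raising StringIndexOutOfBoundsException. (Hypothetical helper.)
     */
    public static String decodeClamped(byte[] data, long offset, long length) {
        if (data == null || offset < 0 || length <= 0 || offset >= data.length) {
            return ""; // nothing usable in the requested range
        }
        int start = (int) offset;            // safe: offset < data.length
        int available = data.length - start; // bytes left after start
        int count = (int) Math.min(length, (long) available);
        // Charset is an assumption made for this sketch.
        return new String(data, start, count, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] pptdata = "xxxxxxHello".getBytes(StandardCharsets.ISO_8859_1);
        // A sane size decodes exactly the requested bytes.
        System.out.println(decodeClamped(pptdata, 6, 5));
        // A corrupt size like the one in the stack trace is clamped.
        System.out.println(decodeClamped(pptdata, 6, 1756156169L));
    }
}
```

        In extractSlides() the call would then look something like decodeClamped(pptdata, i + 6, size + 2), assuming i and size are the long values that the casts in the original line suggest.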

        Stephan Strittmatter added a comment -

        Could someone send me a ppt file that produces such errors, for debugging?
        I tested several files but was not able to reproduce them.

        BTW: The exception is not completely hidden; it is logged.

        Stephan Strittmatter added a comment -

        Updated the plugin sources to match the changed Nutch interface.

        Jerome Charron added a comment -

        Want to commit it, but unit tests failed.

        Stephan Strittmatter added a comment -

        I will verify the unit tests by next week!

        Renat Lumpau added a comment -

        I had to hack MSPowerPointParser.java to get this working with nutch-0.7. I've attached the modified file.

        Jerome Charron added a comment -

        Committed to trunk (http://svn.apache.org/viewcvs.cgi?rev=267226&view=rev)
        Thanks to Stephan Strittmatter.

        Note: Take care with the patches attached to this issue, since their unit tests are platform dependent (they only succeed on Windows). The committed code is platform independent (I hope). I tested it on Linux, so it would be a good idea for someone to test it on other platforms.


          People

          • Assignee:
            Unassigned
            Reporter:
            Stefan Groschupf
          • Votes:
            3
            Watchers:
            2
