Tika / TIKA-1130

.docx text extract leaves out some portions of text

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels:
      None
    • Environment:

      OpenJDK x86_64

      Description

      When parsing a Microsoft Word .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document), certain portions of text remain unextracted.

      I have attached a .docx file that can be tested against. The 'gray' portions of text are what are not extracted, while the darker colored text extracts fine.

      Looking at the document.xml portion of the .docx zip file shows the text is all there.
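
One way to check that claim without opening Word is to read word/document.xml straight out of the zip container and concatenate its w:t text nodes. The following is a minimal sketch using only the JDK, not Tika or POI code. The class and method names (DocxRawText, textOfDocx, textOfXml) are hypothetical, and the sketch ignores headers, footers, footnotes and paragraph breaks.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: a crude "is the text really in the file?" check.
final class DocxRawText {
    // WordprocessingML main namespace, bound to the "w" prefix in document.xml.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Finds word/document.xml inside a .docx (zip) stream and returns its raw text. */
    static String textOfDocx(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes (Java 9+) stops at the end of the current entry.
                    return textOfXml(zip.readAllBytes());
                }
            }
            throw new IllegalArgumentException("no word/document.xml entry");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Concatenates every w:t element in document order, with no separators. */
    static String textOfXml(byte[] documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml));
            NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                out.append(texts.item(i).getTextContent());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document.xml", e);
        }
    }
}
```

Comparing this raw text with the extractor's output shows which passages are present in the file but missing from the extraction.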

      1. TIKA-1130.patch
        11 kB
        Tim Allison
      2. TIKA-1130.patch
        11 kB
        Tim Allison
      3. tee internal resme.docx
        39 kB
        Daniel Gibby
      4. Resume 6.4.13.docx
        125 kB
        Daniel Gibby
      5. OwenResume.docx
        45 kB
        Daniel Gibby

        Activity

        Tim Allison added a comment -

        I've submitted a patch to POI for this (https://issues.apache.org/bugzilla/show_bug.cgi?id=54849). I haven't gotten any feedback after my initial trivial fix. The issue is that sdt/content controls can stand alone as the equivalent of a paragraph or table. POI isn't currently picking those up.
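
To illustrate the structure described here: in word/document.xml a w:sdt element can be a direct child of w:body, sitting alongside w:p and w:tbl. The sketch below is not the POI patch. It is a JDK-only illustration with hypothetical names (BlockLevelSdt, textOfBodyLevelSdts) that lists the text held by such body-level content controls, which is the text an extractor walking only paragraphs and tables would skip.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, for illustration only.
final class BlockLevelSdt {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the trimmed text of each w:sdt that is a direct child of w:body. */
    static List<String> textOfBodyLevelSdts(byte[] documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml))
                .getDocumentElement();
            Node body = root.getElementsByTagNameNS(W_NS, "body").item(0);
            List<String> found = new ArrayList<>();
            if (body == null) {
                return found;
            }
            // Only direct children count: an sdt nested inside a w:p is inline,
            // not a stand-alone block.
            for (Node child = body.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && W_NS.equals(child.getNamespaceURI())
                        && "sdt".equals(child.getLocalName())) {
                    found.add(child.getTextContent().trim());
                }
            }
            return found;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document.xml", e);
        }
    }
}
```

A non-empty result for a given document suggests it contains the stand-alone content controls discussed in this issue.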

        Daniel Gibby added a comment -

        I read somewhere yesterday that either Tika or POI currently doesn't have a person in charge of POI commits. Hopefully this gets picked up by someone on both projects to get the bug fixed.

        Tim Allison added a comment -

        I'll try to submit the Tika portion of the POI-54849 patch by early next week in case anyone wants to apply both patches "at home."

        Nick Burch added a comment -

        We'll hopefully get an updated version of Tim's patch into POI soon. Once there's a POI release (expected shortly), Tika can upgrade, and fingers crossed the text will show up!

        In the meantime, it would be good if someone could produce a JUnit test for Tika showing the current issue. That'll let us ensure it gets fixed in Tika with the upgrade, and that it stays fixed in the future.

        Ray Gauss II added a comment -

        I've created a unit test that reproduces the issue with a stripped down version of the original file.

        Shall I comment out the actual test and commit?

        Nick Burch added a comment -

        I think we've tended to prefix the method name rather than commenting it out, so it's more obvious that the test needs re-enabling later. Put a note of the Tika bug number and the POI bug number in the javadoc for the method, so someone can later easily work out why it was disabled and when it might be ready.

        That said, maybe this is our chance to move at least one test to JUnit 4, so we can use @Ignore?

        Ray Gauss II added a comment -

        Test file and method committed in r1492909.

        This was just added onto OOXMLParserTest and named with a disabled prefix rather than using @Ignore. I think we should start moving towards @Ignore for new test classes, though.
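
For context, JUnit 3 collects public, no-argument, void methods whose names start with "test", so any other prefix keeps a method out of the run. A small reflection sketch of that rule follows. The class, the method names and the "disabled_" prefix are hypothetical, since the thread does not state the exact prefix used in OOXMLParserTest.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DisabledPrefixDemo {
    /** Hypothetical test class; the method names are made up for this sketch. */
    public static class SampleTests {
        public void testPlainParagraphs() { }

        /** Disabled pending TIKA-1130 and POI bug 54849. */
        public void disabled_testStandaloneContentControl() { }
    }

    /** Names of the methods a JUnit 3 style name-based runner would pick up. */
    static List<String> discoverable(Class<?> testClass) {
        List<String> names = new ArrayList<>();
        // getMethods() returns public methods only, including inherited ones.
        for (Method m : testClass.getMethods()) {
            if (m.getName().startsWith("test")
                    && m.getParameterCount() == 0
                    && m.getReturnType() == void.class) {
                names.add(m.getName());
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

With JUnit 4 the same effect comes from annotating the method with @Ignore, which also lets the runner report the test as skipped instead of silently omitting it.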

        Daniel Gibby added a comment - edited

        Looks like the POI bug (https://issues.apache.org/bugzilla/show_bug.cgi?id=54849) was updated to "Resolved Fixed". I've downloaded svn sources of POI and Tika, but I'm not sure where the POI code gets located in Tika. What needs to be done to test the updated POI code?

        Nick Burch added a comment -

        Do an svn checkout of POI, run "ant jar" to build the jars, then replace the POI jars in your Tika classpath with the ones you've just built.

        Or wait about a week, as we'll be starting the vote for 3.10 beta 1 just as soon as Yegor gets one last bugfix in, or we decide to stop waiting for him!

        Daniel Gibby added a comment -

        I'd rather wait a week for a beta, but I need to test this on our code sooner than that. I'm getting pressure to just get only the patch for this one bug and apply it to our production code.

        I did get the jars built, but I'm not sure where in the Tika project structure the POI jars get imported from. Tika builds with Maven, which I don't know how to configure, and I'm not familiar with the Tika project structure either. Is there a place I can drop the POI jar file where Maven will recognize it, or is there an environment variable or PATH I should set somewhere?

        I've noticed in the tika-parsers/pom.xml that it mentions POI, but just as a comment and version numbers for a few properties. I also see the POI jars mentioned in target/classes/META-INF/DEPENDENCIES, but those are also just version numbers.

        I see the POIContainerExtractionTest and ooxml packages in org/apache/tika/parser/microsoft, but those are just the tests.

        Where does the POI jar go, and what needs to be configured before running mvn to build Tika?
        I suspect my ignorance of how Maven works is the problem. With ant, I'm used to putting a jar in the right spot and possibly changing a build.xml property to point at it. What am I missing?

        Tim Allison added a comment -

        Nick,
        I think I have to make modifications to Tika to execute the new SDT components. Should my patch be to Tika trunk?

        Nick Burch added a comment -

        Tim - you'll want to check out POI from SVN, run "ant jar maven-poms", then manually install the jars and poms into your local Maven repo. Next, bump up the POI version in the tika-parsers pom and rebuild Tika. Fix any issues that show up, or open a new JIRA entry and ask for help with them. Once it builds cleanly, patch the parser to use the new functionality (if needed) and add a unit test. Finally, attach the patch to this issue!

        Tim Allison added a comment -

        The Maven proxy setting in my settings.xml file works for grabbing dependencies, but the proxy info isn't being passed to url.openStream() in MimeDetectionTest's testUrlOnly. The proxy properties appear correctly in the surefire report for MimeDetectionTest, but both values print as null when I insert this into testUrlOnly:

        System.out.println("HOST: " + System.getProperty("http.proxyHost"));
        System.out.println("PORT: " + System.getProperty("http.proxyPort"));

        Will likely find the answer as soon as I post this...
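
If the properties do turn out to be unset in the test JVM, one option is to build the proxy explicitly and pass it to URL.openConnection(Proxy) instead of relying on JVM-wide defaults. This is a sketch only, with hypothetical class and method names. It is not what MimeDetectionTest does, and it assumes an HTTP proxy whose port falls back to 80 when http.proxyPort is absent.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper: makes the "are the proxy properties set?" question explicit.
final class ProxyFromProperties {
    /** Builds a proxy from the standard http.proxyHost / http.proxyPort properties. */
    static Proxy fromSystemProperties() {
        return from(System.getProperty("http.proxyHost"),
                    System.getProperty("http.proxyPort"));
    }

    /** Returns Proxy.NO_PROXY when no host is given; otherwise an HTTP proxy. */
    static Proxy from(String host, String port) {
        if (host == null || host.isEmpty()) {
            return Proxy.NO_PROXY;
        }
        // Assumption: fall back to port 80 when no port is configured.
        int portNumber = (port == null || port.isEmpty()) ? 80 : Integer.parseInt(port);
        // createUnresolved avoids a DNS lookup at construction time.
        return new Proxy(Proxy.Type.HTTP,
                         InetSocketAddress.createUnresolved(host, portNumber));
    }
}
```

A test could then open its stream with url.openConnection(ProxyFromProperties.fromSystemProperties()).getInputStream(), and a result of Proxy.NO_PROXY would confirm that the properties never reached the test JVM.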

        Tim Allison added a comment -

        Many thanks to Ray for the unit test and to Nick for his guidance on the POI patch and this Tika patch.

        This is the first round patch for Tika to make use of the new SDT processing in POI.

        Ray's test case brought to light a formatting issue in the POI 54849 patch: we don't want to insert a "\n" between two runs within an SDT. I'll submit a POI patch for this.

        Let me know how this looks.

        Tim Allison added a comment -

        Ray's initial test restored after POI-55142 was committed. Thank you, Nick!

        Nick Burch added a comment -

        The POI 3.10 beta 1 release vote has just started, and that release includes this fix. It'd be great if people could review it and vote as appropriate. Once we have a new POI release, we can bump the dependency version in Tika, then apply this patch. (Tika only ever depends on released versions of software available on Maven Central.)

        Nick Burch added a comment -

        Thanks for the patch Tim, applied in r1498968.

        Tim Allison added a comment -

        That was fast. Thank you!

        Daniel Gibby added a comment - edited

        Is the attached file another example of this same problem, or is this a separate bug that needs to be addressed?

        Various parts of the resume do get extracted, while others don't.

        Daniel Gibby added a comment -

        Here's another file that isn't converting everything. Let me know if I should open another ticket. Also, I can update or open a ticket in POI.

        Daniel Gibby added a comment -

        I found some files that still exhibit the problem of not all text being extracted. If the cause is still the same underlying POI issue, perhaps these should all be handled in this ticket? Or should a new ticket be opened?

        Nick Burch added a comment -

        Daniel - the simplest way to check would be for you to do an svn checkout of Tika, build a snapshot of the Tika App, and try with that. If your problem goes away, you know it was this bug and is fixed. If it remains, you'll probably want to open a fresh bug and also try to identify what kind of text Tika is ignoring.

        Daniel Gibby added a comment -

        I have tested the attached files with the newest Tika. The original file I attached is converted correctly. I have since found two others that don't extract all of the text.

        Tim Allison added a comment -

        Haven't had a chance to build from trunk today, but the latest attachment seems to work on my local build of Tika. Which portions are missing?

        Tim Allison added a comment -

        Tested with freshly built trunk, and the text looks good to me. Let me know if you don't find the same.

        There is a case that this bug fix didn't cover: if a content control takes up an entire table cell/is the equivalent of a table cell, then that content is not currently being pulled by POI.

        That is on my todo list.

        Daniel Gibby added a comment -

        It does appear that the newest build I have is working on the newer attachments. Something in my code is somehow removing sections of text. Thanks again.

        Daniel Gibby added a comment -

        I discovered the problem. I have two apps running, one was using the newest version, the other wasn't. Now I know I'm not crazy.


          People

          • Assignee: Unassigned
          • Reporter: Daniel Gibby
          • Votes: 0
          • Watchers: 4
