Tika / TIKA-739

For certain DWG files, the Tika content parser outputs garbage

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: None
    • Labels:
      None

      Description

      I'm using Solr version 3.4. After I index the attached file, Solr displays a malformed-XML error message whenever the file is included in search results. When I extract the file using Solr's extractOnly option, I get results back that look corrupted to me (see attached).

      I observed the same behavior with Solr version 3.3.

      The exact URL that I used to extract the content is (before I URL encode it): http://localhost:8983/solr/update/extract?extractOnly=true&literal.type=file&literal.id=9a7ab433616746aaab526d77564b916f&literal.name=3D Dacor Modern Kitchen.dwg&resource.name=3D Dacor Modern Kitchen.dwg&literal.createddate=2010-08-19T17:32:48.277Z&literal.modifieddate=2010-08-19T17:32:49.996Z&literal.size=452832&literal.versionnumber=0&literal.ownerid=92a7271bfa3c4639993c4652ef7e922b&literal.creatorid=201008051854838&literal.viewerids=201008051854838&literal.viewerids=201006231721543&literal.viewerids=201011041924210
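For anyone reproducing the request, here is a minimal sketch of how the parameters above could be URL-encoded before being sent. The `ExtractUrlBuilder` class is a hypothetical helper written for this illustration, not part of Solr or Tika, and it assumes Java 10 or later for the `URLEncoder.encode(String, Charset)` overload. Only a few of the parameters are shown.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Solr or Tika): collects query parameters
// and form-encodes each name and value.
final class ExtractUrlBuilder {
    private final String base;
    private final List<String> pairs = new ArrayList<>();

    ExtractUrlBuilder(String base) {
        this.base = base;
    }

    // URLEncoder applies form encoding: spaces become '+', ':' becomes %3A.
    ExtractUrlBuilder add(String name, String value) {
        pairs.add(URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8));
        return this;
    }

    // Repeated parameters (such as literal.viewerids) are simply added twice.
    String build() {
        return base + "?" + String.join("&", pairs);
    }

    public static void main(String[] args) {
        String url = new ExtractUrlBuilder("http://localhost:8983/solr/update/extract")
                .add("extractOnly", "true")
                .add("literal.name", "3D Dacor Modern Kitchen.dwg")
                .add("resource.name", "3D Dacor Modern Kitchen.dwg")
                .add("literal.createddate", "2010-08-19T17:32:48.277Z")
                .build();
        System.out.println(url);
    }
}
```

The file name with spaces and the timestamps with colons are the values that actually change under encoding; the remaining parameters in the URL above pass through unchanged.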

      1. 3D Dacor Modern Kitchen.dwg
        442 kB
        John Bartak
      2. ASF.LICENSE.NOT.GRANTED--screenshot-1.jpg
        203 kB
        John Bartak
      3. extractedContent.xml
        1.04 MB
        John Bartak
      4. SolrErrorMsg.png
        93 kB
        John Bartak

        Activity

        Jukka Zitting added a comment -

        I fixed this in revision 1179225 by adding a check for an unexpected property offset value. The result is that Tika is unable to extract anything from this file, but that's already better than returning garbage.
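The general idea of such a guard can be sketched as follows. This is not the code from revision 1179225: the `OffsetGuard` class, the little-endian 32-bit layout, and the choice of "plausible" range are all assumptions made for illustration. The point is only that an offset read from the file is validated before the parser follows it, and that an implausible value results in no output rather than garbage.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Optional;

// Hypothetical illustration of an offset sanity check; the field layout is
// assumed and does not reflect the real DWG format or Tika's parser.
final class OffsetGuard {

    // Reads a little-endian 32-bit offset stored at headerPos and returns it
    // only if it points inside the data; otherwise returns empty.
    static Optional<Integer> readPropertyOffset(byte[] data, int headerPos) {
        // The four bytes holding the offset must themselves be in range.
        if (headerPos < 0 || headerPos > data.length - 4) {
            return Optional.empty();
        }
        int offset = ByteBuffer.wrap(data)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt(headerPos);
        // Reject negative values and anything pointing past the end of the data.
        if (offset < 0 || offset >= data.length) {
            return Optional.empty();
        }
        return Optional.of(offset);
    }
}
```

A caller that receives an empty result would skip property extraction for that file, which matches the behavior described above: nothing extracted instead of corrupted text.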

        Michael McCandless added a comment -

        I opened SOLR-2807 to upgrade Solr to Tika 0.10....

        John Bartak added a comment - edited

        Just downloaded 0.10 and tried extracting the file with it. It takes a really long time to run and consumes over 400 MB of memory. Eventually I get some data back, but the Progress dialog stays up, displaying "Parsing stream". The output shown behind the Progress dialog is garbage similar to what I get from 0.8 inside Solr. I've attached a screenshot of the Tika screen.

        It probably shouldn't matter, but I'm running on Windows Server 2008 R2.

        I know this DWG is valid because it opens successfully inside AutocadWS (https://www.autocadws.com/). It is a 3D DWG, so perhaps that is causing problems.

        John Bartak added a comment -

        It's 0.8. Not sure how easy it will be to switch Solr to use 0.10. Maybe I'll try installing Tika by itself and see if it handles the file properly.

        Nick Burch added a comment -

        Someone may chime in with the exact answer; in the meantime you could try looking inside Solr to see which Tika jar it uses. With any luck that'll have the version in its name (e.g. tika-core-0.9.jar).
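If there are many jars to look through, the version can be pulled out of the file name with a small pattern match. The sketch below assumes the jar follows the `tika-core-<version>.jar` naming shown in the example; the `TikaJarVersion` class is a hypothetical helper, and a renamed or repackaged jar would not match.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the version from a jar file name of the
// assumed form tika-core-<digits.digits...>.jar.
final class TikaJarVersion {
    private static final Pattern JAR =
            Pattern.compile("tika-core-(\\d+(?:\\.\\d+)*)\\.jar");

    // Returns the version string, or empty if the name does not match.
    static Optional<String> fromFileName(String fileName) {
        Matcher m = JAR.matcher(fileName);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Pass jar file names as arguments, e.g. the contents of Solr's lib directory.
        for (String name : args) {
            fromFileName(name).ifPresent(v -> System.out.println(name + " -> Tika " + v));
        }
    }
}
```

The version is kept as a string on purpose: comparing "0.10" and "0.8" numerically as decimals would order them incorrectly.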

        John Bartak added a comment -

        Not entirely sure what version I'm using. I'm using the default content extraction system built into the latest version of Solr. I'm assuming that's Tika – but don't know which version.

        Nick Burch added a comment -

        What version of Tika are you using? And if it isn't 0.10, does switching to 0.10 fix your issue?

        John Bartak added a comment -

        This is the error message displayed when this file comes back in search results in the Solr admin UI.

        John Bartak added a comment - edited

        File that seems to be causing Tika problems


          People

          • Assignee:
            Jukka Zitting
            Reporter:
            John Bartak
          • Votes:
            0
            Watchers:
            0