TIKA-739

For certain DWG files, the Tika content parser outputs garbage

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: None
    • Labels: None

      Description

      I'm using Solr version 3.4. After I index the attached file, Solr displays a malformed-XML error message whenever that file is included in the search results. When I extract the file using Solr's extractOnly option, the results I get back look corrupted to me (see attached).

      I observed the same behavior with Solr version 3.3.

      The exact URL that I used to extract the content is (before I URL encode it): http://localhost:8983/solr/update/extract?extractOnly=true&literal.type=file&literal.id=9a7ab433616746aaab526d77564b916f&literal.name=3D Dacor Modern Kitchen.dwg&resource.name=3D Dacor Modern Kitchen.dwg&literal.createddate=2010-08-19T17:32:48.277Z&literal.modifieddate=2010-08-19T17:32:49.996Z&literal.size=452832&literal.versionnumber=0&literal.ownerid=92a7271bfa3c4639993c4652ef7e922b&literal.creatorid=201008051854838&literal.viewerids=201008051854838&literal.viewerids=201006231721543&literal.viewerids=201011041924210
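      For readers reproducing the request, the following is a minimal sketch of how the query string above could be URL-encoded before sending. The helper name `build_extract_url` is hypothetical, and only a subset of the reported parameters is shown; it assumes Solr accepts `+` for spaces in query strings, which is what `urlencode` produces by default.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, params):
    """Return base_url with params appended as a URL-encoded query string.

    params is a list of (name, value) pairs so that a name such as
    literal.viewerids can appear more than once.
    """
    return base_url + "?" + urlencode(params)


# Subset of the parameters from the report; values are copied from the URL above.
params = [
    ("extractOnly", "true"),
    ("literal.id", "9a7ab433616746aaab526d77564b916f"),
    ("literal.name", "3D Dacor Modern Kitchen.dwg"),
    ("resource.name", "3D Dacor Modern Kitchen.dwg"),
    ("literal.createddate", "2010-08-19T17:32:48.277Z"),
    ("literal.viewerids", "201008051854838"),
    ("literal.viewerids", "201006231721543"),
]

url = build_extract_url("http://localhost:8983/solr/update/extract", params)
print(url)
```

      Passing a list of pairs rather than a dict keeps the repeated `literal.viewerids` entries, which a dict would collapse into one.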

      1. ASF.LICENSE.NOT.GRANTED--screenshot-1.jpg
        203 kB
        John Bartak
      2. SolrErrorMsg.png
        93 kB
        John Bartak
      3. extractedContent.xml
        1.04 MB
        John Bartak
      4. 3D Dacor Modern Kitchen.dwg
        442 kB
        John Bartak

        Activity

        jukkaz Jukka Zitting added a comment -

        I fixed this in revision 1179225 by adding a check for an unexpected property offset value. The result is that Tika is unable to extract anything from this file, but that's already better than returning garbage.
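        The kind of check described here can be illustrated with a generic sketch. This is not the code from revision 1179225: the field position `OFFSET_FIELD_POS`, the 4-byte little-endian layout, and the function name are all assumptions made for illustration. The idea is only that an offset read from a file header is validated before the parser seeks to it, and that the parser gives up rather than reading from an implausible location.

```python
import struct

# Hypothetical position of a 4-byte little-endian offset field in the header.
OFFSET_FIELD_POS = 0x20


def read_section_offset(data, offset_pos=OFFSET_FIELD_POS):
    """Return the section offset stored at offset_pos, or None if it is implausible.

    An offset is rejected when the header is too short to contain the field,
    when the offset is zero, or when it points past the end of the data.
    """
    if len(data) < offset_pos + 4:
        return None
    (offset,) = struct.unpack_from("<I", data, offset_pos)
    if offset == 0 or offset >= len(data):
        return None
    return offset


# A well-formed buffer: the offset field points inside the data.
good = bytes(0x20) + struct.pack("<I", 0x30) + bytes(0x40)
# A malformed buffer: the offset field points far beyond the end of the data.
bad = bytes(0x20) + struct.pack("<I", 0xFFFFFFF0) + bytes(0x40)

print(read_section_offset(good))
print(read_section_offset(bad))
```

        Returning `None` mirrors the trade-off in the comment above: the caller extracts nothing from the file instead of emitting bytes read from the wrong place.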

        mikemccand Michael McCandless added a comment -

        I opened SOLR-2807 to upgrade Solr to Tika 0.10....

        johnbartak John Bartak added a comment - - edited

        Just downloaded 0.10 and tried extracting the file with it. It takes a really long time to run and consumes over 400 MB of memory. Eventually I get some data back, but the Progress dialog is still up displaying "Parsing stream". The output displayed behind the Progress dialog is garbage similar to what I get when I run 0.8 inside Solr. I've attached a screenshot of the Tika screen.

        It probably shouldn't matter, but I'm running on Windows Server 2008 R2.

        I know this DWG is valid because it opens successfully inside AutocadWS (https://www.autocadws.com/). It is a 3D DWG, so perhaps that is causing problems.

        johnbartak John Bartak added a comment -

        It's 0.8. Not sure how easy it will be to switch Solr to use 0.10. Maybe I'll try installing Tika by itself and see if it handles the file properly.

        gagravarr Nick Burch added a comment -

        Someone may chime in with the exact answer. In the meantime, you could try looking inside Solr to see which Tika jar it uses; with any luck that'll have the version in its name (e.g. tika-core-0.9.jar).
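        A small sketch of that jar-name check, assuming the bundled jars follow the `tika-<module>-<version>.jar` naming pattern seen in the example above. The function name and the sample file names are illustrative, not taken from an actual Solr distribution.

```python
import re

# Matches names such as tika-core-0.9.jar and captures the version part.
_TIKA_JAR = re.compile(r"^tika-[a-z]+-(\d+(?:\.\d+)*)\.jar$")


def tika_versions(jar_names):
    """Return the sorted, de-duplicated version strings found in tika-*.jar names."""
    versions = set()
    for name in jar_names:
        match = _TIKA_JAR.match(name)
        if match:
            versions.add(match.group(1))
    return sorted(versions)


# Illustrative file names; in practice they could be collected with, for example,
# [p.name for p in pathlib.Path(solr_dir).rglob("*.jar")] for a local solr_dir.
names = ["tika-core-0.8.jar", "tika-parsers-0.8.jar", "lucene-core-3.4.0.jar"]
print(tika_versions(names))
```

        If more than one version comes back, that would suggest mixed Tika jars on the classpath, which is worth ruling out before testing a newer release.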

        johnbartak John Bartak added a comment -

        Not entirely sure what version I'm using. I'm using the default content extraction system built into the latest version of Solr. I'm assuming that's Tika – but don't know which version.

        gagravarr Nick Burch added a comment -

        What version of Tika are you using? And if it isn't 0.10, does switching to 0.10 fix your issue?

        johnbartak John Bartak added a comment -

        The error message displayed when getting this file back in search results using the Solr admin UI.

        johnbartak John Bartak added a comment - - edited

        File that seems to be causing Tika problems


          People

          • Assignee: jukkaz Jukka Zitting
          • Reporter: johnbartak John Bartak
