Tika / TIKA-739

For certain DWG files, the Tika content parser outputs garbage

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: None
    • Labels: None

    Description

      I'm using Solr version 3.4. After I index the attached file, Solr displays a malformed-XML error message whenever that file appears in the search results. When I extract the file with Solr's extractOnly option, the results I get back look corrupted (see attached).

      I observed the same behavior with Solr version 3.3.
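One way to confirm the "malformed XML" symptom independently of Solr's error page is to run the saved extraction output through a strict XML parser. The following is a minimal sketch, assuming the attached extractedContent.xml is meant to be a well-formed XML document; the helper name `check_well_formed` is hypothetical and is not part of Solr or Tika.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_data):
    """Return (True, None) if xml_data parses as XML, else (False, message).

    xml_data may be str or bytes. Only well-formedness is checked;
    nothing is validated against a schema.
    """
    try:
        ET.fromstring(xml_data)
    except ET.ParseError as exc:
        # The parser's message includes the line and column of the first error.
        return False, str(exc)
    return True, None


# Small illustrative inputs (not taken from the attached file):
print(check_well_formed("<doc><str>ok</str></doc>"))
print(check_well_formed("<doc>\x01</doc>"))
```

To check the attachment itself, read it in binary mode and pass the bytes to `check_well_formed`; the reported line and column should point at the first place where the parser gives up.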

      The exact URL that I used to extract the content, before URL encoding, is:

      http://localhost:8983/solr/update/extract?extractOnly=true&literal.type=file&literal.id=9a7ab433616746aaab526d77564b916f&literal.name=3D Dacor Modern Kitchen.dwg&resource.name=3D Dacor Modern Kitchen.dwg&literal.createddate=2010-08-19T17:32:48.277Z&literal.modifieddate=2010-08-19T17:32:49.996Z&literal.size=452832&literal.versionnumber=0&literal.ownerid=92a7271bfa3c4639993c4652ef7e922b&literal.creatorid=201008051854838&literal.viewerids=201008051854838&literal.viewerids=201006231721543&literal.viewerids=201011041924210
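Because the URL above is given before encoding, anyone reproducing the request has to encode the spaces and colons in the parameter values. The sketch below shows one way to do that with Python's standard library. The parameter names and values are copied from the report; the function name `build_extract_url` is hypothetical, and the exact encoding the reporter used is not stated, so the form-style encoding here (spaces as `+`) is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/update/extract"

# A list of pairs (not a dict) so the repeated literal.viewerids key
# is sent once per value, as in the reported URL.
PARAMS = [
    ("extractOnly", "true"),
    ("literal.type", "file"),
    ("literal.id", "9a7ab433616746aaab526d77564b916f"),
    ("literal.name", "3D Dacor Modern Kitchen.dwg"),
    ("resource.name", "3D Dacor Modern Kitchen.dwg"),
    ("literal.createddate", "2010-08-19T17:32:48.277Z"),
    ("literal.modifieddate", "2010-08-19T17:32:49.996Z"),
    ("literal.size", "452832"),
    ("literal.versionnumber", "0"),
    ("literal.ownerid", "92a7271bfa3c4639993c4652ef7e922b"),
    ("literal.creatorid", "201008051854838"),
    ("literal.viewerids", "201008051854838"),
    ("literal.viewerids", "201006231721543"),
    ("literal.viewerids", "201011041924210"),
]


def build_extract_url(base_url=BASE_URL, params=PARAMS):
    """Return base_url with params appended as an encoded query string."""
    return f"{base_url}?{urlencode(params)}"


print(build_extract_url())
```

The DWG file itself is not part of the URL; it still has to be sent as the body of the request.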

      Attachments

        1. ASF.LICENSE.NOT.GRANTED--screenshot-1.jpg
          203 kB
          John Bartak
        2. SolrErrorMsg.png
          93 kB
          John Bartak
        3. extractedContent.xml
          1.04 MB
          John Bartak
        4. 3D Dacor Modern Kitchen.dwg
          442 kB
          John Bartak

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: John Bartak (johnbartak)