TIKA-739: For certain DWG files, the Tika content parser outputs garbage

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: None
    • Labels: None

    Description

      I'm using Solr version 3.4. After I index the attached file, Solr displays a malformed-XML error message whenever that file appears in the search results. When I extract the file using Solr's extractOnly option, the results I get back look corrupted (see attached).

      I observed the same behavior with Solr version 3.3.

      The exact URL that I used to extract the content (shown here before URL encoding) is: http://localhost:8983/solr/update/extract?extractOnly=true&literal.type=file&literal.id=9a7ab433616746aaab526d77564b916f&literal.name=3D Dacor Modern Kitchen.dwg&resource.name=3D Dacor Modern Kitchen.dwg&literal.createddate=2010-08-19T17:32:48.277Z&literal.modifieddate=2010-08-19T17:32:49.996Z&literal.size=452832&literal.versionnumber=0&literal.ownerid=92a7271bfa3c4639993c4652ef7e922b&literal.creatorid=201008051854838&literal.viewerids=201008051854838&literal.viewerids=201006231721543&literal.viewerids=201011041924210
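
      For anyone trying to reproduce this, the URL above has to be encoded before it is sent, because the file name contains spaces and the dates contain colons. The following is only a rough sketch of one way to do that with the standard Java library; the class name ExtractUrlBuilder and the method buildExtractUrl are hypothetical names made up for this illustration, not part of Solr or Tika. Parameters are kept as a list of pairs rather than a map because literal.viewerids is repeated. Only a subset of the parameters from the report is shown.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper for illustration only; not part of Solr or Tika.
public class ExtractUrlBuilder {

    // Appends each name/value pair to baseUrl as an encoded query parameter.
    // A list of pairs is used so that repeated names (literal.viewerids) survive.
    public static String buildExtractUrl(String baseUrl, List<String[]> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (String[] pair : params) {
            url.append(separator)
               .append(URLEncoder.encode(pair[0], StandardCharsets.UTF_8))
               .append('=')
               .append(URLEncoder.encode(pair[1], StandardCharsets.UTF_8));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Values copied from the report; the remaining literal.* fields follow the same pattern.
        List<String[]> params = List.of(
            new String[] {"extractOnly", "true"},
            new String[] {"literal.type", "file"},
            new String[] {"literal.name", "3D Dacor Modern Kitchen.dwg"},
            new String[] {"resource.name", "3D Dacor Modern Kitchen.dwg"},
            new String[] {"literal.createddate", "2010-08-19T17:32:48.277Z"},
            new String[] {"literal.viewerids", "201008051854838"},
            new String[] {"literal.viewerids", "201006231721543"}
        );
        System.out.println(
            buildExtractUrl("http://localhost:8983/solr/update/extract", params));
    }
}
```

      URLEncoder produces form-style encoding, so spaces become "+" and colons become "%3A". The sketch needs Java 10 or later for the Charset overload of URLEncoder.encode.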

      Attachments

        1. SolrErrorMsg.png
          93 kB
          John Bartak
        2. extractedContent.xml
          1.04 MB
          John Bartak
        3. ASF.LICENSE.NOT.GRANTED--screenshot-1.jpg
          203 kB
          John Bartak
        4. 3D Dacor Modern Kitchen.dwg
          442 kB
          John Bartak

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: John Bartak (johnbartak)
