Tika / TIKA-739

For certain DWG files, the Tika content parser outputs garbage

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: None
    • Labels: None

      Description

      I'm using Solr version 3.4. After I index the attached file, Solr displays a malformed-XML error message whenever that file is included in the search results. When I extract the file using Solr's extractOnly option, the results I get back look corrupted to me (see attached).

      I observed the same behavior with Solr version 3.3.

      The exact URL that I used to extract the content is (before I URL encode it): http://localhost:8983/solr/update/extract?extractOnly=true&literal.type=file&literal.id=9a7ab433616746aaab526d77564b916f&literal.name=3D Dacor Modern Kitchen.dwg&resource.name=3D Dacor Modern Kitchen.dwg&literal.createddate=2010-08-19T17:32:48.277Z&literal.modifieddate=2010-08-19T17:32:49.996Z&literal.size=452832&literal.versionnumber=0&literal.ownerid=92a7271bfa3c4639993c4652ef7e922b&literal.creatorid=201008051854838&literal.viewerids=201008051854838&literal.viewerids=201006231721543&literal.viewerids=201011041924210
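
      For anyone trying to reproduce this, the sketch below shows one way the URL above could be percent-encoded before being sent. It is only an illustration: the helper name build_extract_url is hypothetical, the parameter values are copied from the URL above, and the choice to encode spaces as %20 (rather than +) is an assumption about what the original request looked like.

```python
from urllib.parse import urlencode, quote

# Endpoint copied from the report; assumes a local Solr instance.
BASE_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(params):
    """Hypothetical helper: append percent-encoded params to BASE_URL.

    params is a list of (name, value) pairs rather than a dict so that
    repeated keys such as literal.viewerids are kept, in order.
    """
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


# Parameter values taken verbatim from the unencoded URL in the report.
params = [
    ("extractOnly", "true"),
    ("literal.type", "file"),
    ("literal.id", "9a7ab433616746aaab526d77564b916f"),
    ("literal.name", "3D Dacor Modern Kitchen.dwg"),
    ("resource.name", "3D Dacor Modern Kitchen.dwg"),
    ("literal.createddate", "2010-08-19T17:32:48.277Z"),
    ("literal.modifieddate", "2010-08-19T17:32:49.996Z"),
    ("literal.size", "452832"),
    ("literal.versionnumber", "0"),
    ("literal.ownerid", "92a7271bfa3c4639993c4652ef7e922b"),
    ("literal.creatorid", "201008051854838"),
    ("literal.viewerids", "201008051854838"),
    ("literal.viewerids", "201006231721543"),
    ("literal.viewerids", "201011041924210"),
]

url = build_extract_url(params)
print(url)
```

      The file itself would still have to be posted as the request body; this only covers the query string.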

      1. SolrErrorMsg.png
        93 kB
        John Bartak
      2. extractedContent.xml
        1.04 MB
        John Bartak
      3. ASF.LICENSE.NOT.GRANTED--screenshot-1.jpg
        203 kB
        John Bartak
      4. 3D Dacor Modern Kitchen.dwg
        442 kB
        John Bartak

          People

          • Assignee: Jukka Zitting
          • Reporter: John Bartak