Solr / SOLR-2332

TikaEntityProcessor retrieves only File Names from Zip extraction

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Extracting a Zip file with TikaEntityProcessor returns only the names of the files it contains.
      The contents of the files inside the Zip are not extracted.

      1. solr-word.zip
        20 kB
        Jayendra Patil
      2. SOLR-2332.patch
        3 kB
        Jayendra Patil

          Activity

          Jayendra Patil added a comment -

          Attached are a patch with the fix and a test case.
          The test zip file is attached as well.

          Hoss Man added a comment -

          I can't find any docs suggesting how exactly TikaEntityProcessor should be expected to deal with zip files, particularly what to expect if a zip file contains multiple documents.

          FWIW: TikaEntityProcessor did not exist in Solr 1.4.1, so the behavior currently seen in the 3x branch (and the 3.1rc1 artifacts) is not a regression.

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Lance Norskog added a comment -

          Unpacking a zip file is a very narrow, focused operation. This could also be done with a separate UpdateRequestHandler that does nothing but unpack zip files. It would use the basic JDK zip file code, not Tika. You configure the Tika handler beneath it.

          Another use case is a zip file full of Solr update XML files, which Tika does not know about. To do this, you want an UpdateRequestHandler stack like this: zip unpacker -> XmlUpdateRequestHandler
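
The unpacking step described above could be sketched roughly as follows, using only the JDK's java.util.zip classes. This is a minimal illustration, not the attached patch and not an existing Solr handler: the class name ZipUnpacker and its unpack method are hypothetical, and handing each entry's bytes to a downstream handler (Tika or the XML update handler) is left to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: reads a zip stream with plain JDK code, no Tika involved.
final class ZipUnpacker {

    private ZipUnpacker() {
    }

    /**
     * Returns a map of entry name to raw entry bytes, in archive order.
     * Directory entries are skipped. Everything is held in memory, so this
     * only suits small archives.
     */
    static Map<String, byte[]> unpack(InputStream zipped) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zin = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // nothing to extract for a directory
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zin.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }
}
```

A caller would then pass each returned byte array to whichever handler sits beneath the unpacker, so every file in the archive is processed on its own rather than only its name being recorded.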

          Hoss Man added a comment -

          Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

          Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

          Hoss Man added a comment -

          Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Hoss Man added a comment -

          removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue. (this can certainly be revisited if volunteers step forward)


            People

            • Assignee:
              Unassigned
              Reporter:
              Jayendra Patil
            • Votes:
              1
              Watchers:
              4
