Solr / SOLR-15039

Error in Solr Cell extract when using multipart upload with some documents



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.6.4, 8.4, 8.6.3, 8.7
    • Fix Version/s: None


      (Note: I asked about this in the IRC channel as prompted, but didn't get a response.)

      When uploading particular documents to /update/extract, you get different (and wrong) results using multipart file upload compared to the basic raw-body upload, even though both methods are shown on the documentation page (https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html).

      The first example in the documentation page uses a multipart POST with a field called 'myfile' set to the file content. Some later examples use a standard POST with the raw data provided.

      Here are the two approaches, as the commands I used with my example file (I have replaced the URL, username, password, and collection name, since my Solr instance is not publicly available):

      curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" --data-binary '@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > nonmultipart-result.txt
      curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" -F 'myfile=@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > multipart-result.txt
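Since the two commands differ only in how the request body is framed, it may help to see what the multipart framing actually looks like on the wire. This is an illustrative sketch only: the boundary string and filename are invented here, and real curl generates a random boundary.

```shell
# Illustration (not from the report): the shape of the multipart/form-data
# body that curl's -F option builds. If the server were ever to treat this
# whole body as the uploaded file (an assumption about the failure mode),
# the first bytes it parses would be boundary dashes, not the "PK" magic
# bytes that start a .pptx ZIP container.
printf -- '--XBOUNDARY\r\nContent-Disposition: form-data; name="myfile"; filename="example.pptx"\r\nContent-Type: application/octet-stream\r\n\r\n<PPTX BYTES>\r\n--XBOUNDARY--\r\n'
```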

      The example file is a ~10MB PowerPoint with a few sentences of English text in it (and some pictures).

      The nonmultipart-result.txt file is 9,871 bytes long and JSON-encoded; it includes an XHTML version of the text content of the PowerPoint, and some metadata.

      The multipart-result.txt file is 7,352,348 bytes long and consists mainly of a large sequence of Chinese characters, or at least, random data being interpreted as Chinese characters.
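As an aside on why mis-parsed binary data tends to "look Chinese": many two-byte sequences, when decoded as UTF-16, fall in the CJK blocks. This is only a plausible mechanism I am assuming, not a verified explanation of this particular output. For example, the two ZIP magic bytes "PK" decode as a single CJK character:

```shell
# Assumption/illustration only: the ZIP magic bytes "PK" (0x50 0x4B), read
# as a single UTF-16LE code unit, give U+4B50, a CJK ideograph. Arbitrary
# binary data decoded this way produces long runs of similar characters.
printf 'PK' | iconv -f UTF-16LE -t UTF-8
```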

      This example was run against Solr 8.4 on a Linux server from our cloud Solr supplier; I got the same results with various other Solr versions on another Linux (Ubuntu 18) server that I set up myself. Running against localhost, a Windows 10 machine with Solr 8.5, I get slightly different results: the non-multipart request works correctly, but multipart-result.txt in that case contains a slightly more helpful error 500 message:

      <?xml version="1.0" encoding="UTF-8"?>
      <lst name="responseHeader">
        <int name="status">500</int>
        <int name="QTime">138</int>
      </lst>
      <lst name="error">
        <lst name="metadata">
          <str name="error-class">org.apache.solr.common.SolrException</str>
          <str name="root-error-class">java.util.zip.ZipException</str>
        </lst>
        <str name="msg">org.apache.tika.exception.TikaException: Error creating OOXML extractor</str>
        <str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Error creating OOXML extractor
      Caused by: java.util.zip.ZipException: Unexpected record signature: 0X2D2D2D2D
              at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
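One detail in the trace seems worth noting (my observation, and only circumstantial): the unexpected record signature 0X2D2D2D2D is four 0x2D bytes, and 0x2D is the ASCII hyphen that multipart boundary lines begin with. That is consistent with, though not proof of, the raw multipart envelope reaching the ZIP parser unparsed:

```shell
# 0x2D (octal 055) is ASCII '-'; four of them match the "----" that starts
# a multipart boundary line, rather than the "PK" signature a ZIP parser
# expects at the start of a .pptx file.
printf '\055\055\055\055\n'
# prints ----
```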

      My conclusion is that even though both forms of this command (-F myfile=@file and --data-binary @file) are shown in the documentation, they clearly do not behave equivalently.

      Note: Although I've reproduced this using command-line curl to simplify this report, it is actually the result of a highly tortuous debugging process in which I eventually tracked down why a search index was using up too much disk space. The index was generated by an open source learning system, Moodle, which currently uses the multipart POST approach (although I might have to change that).

      I'm going to try to remove private data from the offending file and attach it here.


        1. b364b24b-public (9.85 MB), attached by sam marshall



            • Assignee: sam marshall (sam_marshall)
            • Votes: 0
            • Watchers: 1


              • Created: