Solr / SOLR-15039

Error in Solr Cell extract when using multipart upload with some documents



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.6.4, 8.4, 8.6.3, 8.7
    • Fix Version/s: None


      (Note: I asked about this in the IRC channel as prompted, but didn't get a response.)

      When uploading particular documents to /update/extract, you get different (and wrong) results using multipart file upload compared to the basic raw-body upload, even though both methods are shown on the documentation page (https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html).

      The first example in the documentation page uses a multipart POST with a field called 'myfile' set to the file content. Some later examples use a standard POST with the raw data provided.

      Here are the two approaches, as the commands I used with my example file (I have replaced the URL, username, password, and collection name, since my Solr instance is not publicly available):

      curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" --data-binary '@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > nonmultipart-result.txt
      curl --user myuser:mypassword "https://example.org/solr/mycollection/update/extract?&extractOnly=true" -F 'myfile=@c:/temp/b364b24b728b350eac18d6379ede3437fd220829' -H 'Content-type:text/html' > multipart-result.txt
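Since the two commands differ only in how the request body is framed, it may help to see what the multipart framing actually looks like on the wire. This is an illustrative sketch only: the boundary string and filename are invented here, and real curl generates a random boundary.

```shell
# Illustration (not from the report): the shape of the multipart/form-data
# body that curl's -F option builds. If the server were ever to treat this
# whole body as the uploaded file (an assumption about the failure mode),
# the first bytes it parses would be boundary dashes, not the "PK" magic
# bytes that start a .pptx ZIP container.
printf -- '--XBOUNDARY\r\nContent-Disposition: form-data; name="myfile"; filename="example.pptx"\r\nContent-Type: application/octet-stream\r\n\r\n<PPTX BYTES>\r\n--XBOUNDARY--\r\n'
```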

      The example file is a ~10MB PowerPoint with a few sentences of English text in it (and some pictures).

      The nonmultipart-result.txt file is 9,871 bytes long and JSON-encoded; it includes an XHTML version of the text content of the PowerPoint, and some metadata.

      The multipart-result.txt file is 7,352,348 bytes long and consists mainly of a large sequence of Chinese characters, or at least, random data being interpreted as Chinese characters.
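As an aside on why mis-parsed binary data tends to "look Chinese": many two-byte sequences, when decoded as UTF-16, fall in the CJK blocks. This is only a plausible mechanism I am assuming, not a verified explanation of this particular output. For example, the two ZIP magic bytes "PK" decode as a single CJK character:

```shell
# Assumption/illustration only: the ZIP magic bytes "PK" (0x50 0x4B), read
# as a single UTF-16LE code unit, give U+4B50, a CJK ideograph. Arbitrary
# binary data decoded this way produces long runs of similar characters.
printf 'PK' | iconv -f UTF-16LE -t UTF-8
```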

      This example was run against Solr 8.4 on a Linux server from our cloud Solr supplier; I got the same results with various other Solr versions on another Linux (Ubuntu 18) server that I set up myself. Running against localhost, a Windows 10 machine with Solr 8.5, I get slightly different results: the non-multipart request works correctly, but multipart-result.txt in that case contains a slightly more helpful error 500 message:

      <?xml version="1.0" encoding="UTF-8"?>
      <lst name="responseHeader">
        <int name="status">500</int>
        <int name="QTime">138</int>
      </lst>
      <lst name="error">
        <lst name="metadata">
          <str name="error-class">org.apache.solr.common.SolrException</str>
          <str name="root-error-class">java.util.zip.ZipException</str>
        </lst>
        <str name="msg">org.apache.tika.exception.TikaException: Error creating OOXML extractor</str>
        <str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Error creating OOXML extractor
      Caused by: java.util.zip.ZipException: Unexpected record signature: 0X2D2D2D2D
              at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
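One detail in the trace seems worth noting (my observation, and only circumstantial): the unexpected record signature 0X2D2D2D2D is four 0x2D bytes, and 0x2D is the ASCII hyphen that multipart boundary lines begin with. That is consistent with, though not proof of, the raw multipart envelope reaching the ZIP parser unparsed:

```shell
# 0x2D (octal 055) is ASCII '-'; four of them match the "----" that starts
# a multipart boundary line, rather than the "PK" signature a ZIP parser
# expects at the start of a .pptx file.
printf '\055\055\055\055\n'
# prints ----
```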

      My conclusion is that even though both forms of this command (-F myfile=@file and --data-binary @file) are shown in the documentation, they clearly do not behave equivalently.

      Note: Although I've reproduced this using command-line curl to simplify this report, it is actually the result of a highly tortuous debugging process in which I eventually tracked down why a search index was using up too much disk space. The index was generated by an open source learning system, Moodle, which currently uses the multipart POST approach (although I might have to change that).

      I'm going to try to remove private data from the offending file and attach it here.


        1. b364b24b-public (9.85 MB), attached by sam marshall



            • Assignee: sam marshall (sam_marshall)
            • Votes: 0
            • Watchers: 1


              • Created: