[SOLR-2480] Text extraction of password protected files - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.4.1, 3.1
Fix Version/s: 3.2, 4.0-ALPHA
Component/s: contrib - Solr Cell (Tika extraction)
Labels: None

Description

Proposal:
Some files are password-protected, for example PDFs and Office documents in 2007 or 97 format.
When such files are posted through SolrCell and the read password is unknown, their text cannot be extracted.
My requirement is that these files be processed normally, without text extraction and without an exception being thrown.

Background:
Currently, when a password-protected file is posted, Solr returns a 500 server error.
Solr catches the error in ExtractingDocumentLoader and throws a TikaException.
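
The requested behavior can be illustrated with a minimal Java sketch. It is not the code from the attached patches: `ParseFailure`, `TextExtractor`, `buildDocument`, and the `ignoreParseFailure` flag are hypothetical names standing in for Tika's exception, the extraction step, and whatever switch an implementation might expose. With the flag set, a failed extraction leaves out the body text but keeps the metadata, so the request can succeed instead of ending in a 500.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins; none of these names come from Solr or Tika.
class LenientExtractionSketch {

    /** Stand-in for a parser failure such as Tika's TikaException. */
    static class ParseFailure extends Exception {
        ParseFailure(String message) { super(message); }
    }

    /** Stand-in for the text extractor; throws when the file cannot be read. */
    interface TextExtractor {
        String extract(byte[] content) throws ParseFailure;
    }

    /**
     * Builds the fields to index. When ignoreParseFailure is true, a failed
     * extraction leaves the "text" field out and keeps the supplied metadata;
     * otherwise the failure propagates to the caller.
     */
    static Map<String, String> buildDocument(byte[] content,
                                             Map<String, String> metadata,
                                             TextExtractor extractor,
                                             boolean ignoreParseFailure) throws ParseFailure {
        Map<String, String> doc = new LinkedHashMap<>(metadata);
        try {
            doc.put("text", extractor.extract(content));
        } catch (ParseFailure e) {
            if (!ignoreParseFailure) {
                throw e;
            }
            // Skip the body text but still index the metadata.
        }
        return doc;
    }
}
```

With `ignoreParseFailure` set to `false`, the sketch keeps the current behavior and the failure reaches the caller.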

I use ManifoldCF.
When the Solr server responds with a 500, ManifoldCF decides that "this document should be retried because I have absolutely no idea what happened".
It then retries the post many times, although it never has the password.
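
The retry problem can be sketched as follows. This is an illustration only, not ManifoldCF's actual code: `RetryPolicySketch`, `classify`, `attempts`, and the `maxRetries` limit are hypothetical. It assumes a client that treats any 5xx response as "cause unknown, try again", which is why a file that can never be parsed keeps being re-posted until the limit is reached.

```java
// Hypothetical client-side policy; not taken from ManifoldCF.
class RetryPolicySketch {

    enum Action { ACCEPT, RETRY, REJECT }

    /** Classifies an indexing response by its HTTP status code. */
    static Action classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Action.ACCEPT;
        }
        if (httpStatus >= 500) {
            return Action.RETRY;   // cause unknown, try again later
        }
        return Action.REJECT;      // anything else is not retried
    }

    /**
     * Counts the posts made for one document when the server always
     * answers with the same status: one initial post plus retries.
     */
    static int attempts(int httpStatus, int maxRetries) {
        int attempts = 1;
        while (classify(httpStatus) == Action.RETRY && attempts <= maxRetries) {
            attempts++;
        }
        return attempts;
    }
}
```

Under these assumptions, a server that always returns 500 for a password-protected file uses up every retry, while a success response for the same file ends after a single post.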

In another case, my customer posts files with embedded images.
Sometimes Solr throws a TikaException of unknown cause for these files.
He wants to post just the metadata without extracting text, but the exception stops the post.

Attachments

- SOLR-2480.patch, 14/May/11 14:50, 25 kB, Koji Sekiguchi
- password-is-solrcell.docx, 14/May/11 04:20, 33 kB, Koji Sekiguchi
- SOLR-2480.patch, 14/May/11 04:20, 9 kB, Koji Sekiguchi
- SOLR-2480.patch, 14/May/11 03:51, 5 kB, Koji Sekiguchi
- SOLR-2480-idea1.patch, 02/May/11 08:06, 1 kB, Shinichiro Abe

People

Assignee: Koji Sekiguchi
Reporter: Shinichiro Abe
Votes: 0
Watchers: 1

Dates

Created: 28/Apr/11 06:01
Updated: 03/Jun/11 16:44
Resolved: 14/May/11 15:09