Solr / SOLR-2930

Allow controlling an important Tika PDF processing parameter that affects how words in the text are split, supported as of Tika 1.0.

    Details

      Description

      Tika 1.0 has fixed a major issue with processing and parsing of PDF files that was splitting the words incorrectly: https://issues.apache.org/jira/browse/TIKA-724

      This causes text to be indexed incorrectly in Solr, and the problem becomes especially visible when using features such as spellcheck.

      The Tika developers have added a parameter, set via setEnableAutoSpace, that fixes the problem, but there is currently no way to set it when using Solr. As discussed in the thread on the issue above, it would be nice if we could control this parameter (and, in future, others) via Solr configuration.
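
One possible shape for such a configuration hook, sketched in Java: property names read from configuration are mapped onto matching setters by reflection, so that a new parameter would need no new Solr code. This is only an illustration of the idea, not how Solr or Tika implement it. The class StandInPdfParser is a hypothetical stand-in (its default value is arbitrary), as are applyBooleanProperties and the property key "enableAutoSpace"; only the setter name setEnableAutoSpace comes from the description above.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParserPropertySketch {

    // Hypothetical stand-in for a PDF parser object. Only the setter name
    // setEnableAutoSpace is taken from the issue text; the default is arbitrary.
    public static class StandInPdfParser {
        private boolean enableAutoSpace = true;

        public void setEnableAutoSpace(boolean value) {
            this.enableAutoSpace = value;
        }

        public boolean getEnableAutoSpace() {
            return enableAutoSpace;
        }
    }

    // Maps each property name to a boolean setter on the target,
    // e.g. "enableAutoSpace" -> setEnableAutoSpace(boolean), and invokes it.
    // Throws if the target has no matching public setter.
    public static void applyBooleanProperties(Object target, Map<String, String> props)
            throws ReflectiveOperationException {
        for (Map.Entry<String, String> entry : props.entrySet()) {
            String name = entry.getKey();
            String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            Method method = target.getClass().getMethod(setter, boolean.class);
            method.invoke(target, Boolean.parseBoolean(entry.getValue()));
        }
    }

    public static void main(String[] args) throws Exception {
        // In a real setup these pairs would come from a configuration file.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("enableAutoSpace", "false");

        StandInPdfParser parser = new StandInPdfParser();
        applyBooleanProperties(parser, props);
        System.out.println("enableAutoSpace = " + parser.getEnableAutoSpace());
    }
}
```

Resolving setters by name keeps the configuration open-ended, which matches the wish to control other parameters in future, at the cost of failing only at runtime when a property name has no matching setter.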

              People

              • Assignee: Unassigned
              • Reporter: Ravish Bhagdev (ravish)