  1. Solr
  2. SOLR-2460

Some European characters cannot be parsed correctly for some PDFs


Details

    Description

      The Norwegian characters (æ, ø and å) in the following PDF document are not displayed correctly after Solr has indexed it using Solr Cell:
      http://ridder.uio.no/dokument.pdf

      If I manually change the version of PDFBox (one of Tika's dependencies) to 1.4.0, the document is parsed correctly.

      I suggest that the next release of Solr ship with Tika 0.9, which has also updated its PDFBox dependency to 1.4.0.
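
A minimal sketch of how one might check extracted text for the symptom described above. It is not part of Solr, Tika, or PDFBox: the class name `ExtractedTextCheck`, its methods, and the sample strings are hypothetical. It assumes the extracted text is already available as a Java `String`, and that damaged characters show up either as missing letters or as the Unicode replacement character U+FFFD, which may not match how this particular PDF fails.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractedTextCheck {
    // U+FFFD is the Unicode replacement character, often emitted when a
    // character cannot be decoded or mapped.
    private static final char REPLACEMENT = '\uFFFD';

    /** Returns the characters from 'expected' that never occur in 'extracted'. */
    public static List<Character> missingCharacters(String extracted, String expected) {
        List<Character> missing = new ArrayList<>();
        for (char c : expected.toCharArray()) {
            if (extracted.indexOf(c) < 0) {
                missing.add(c);
            }
        }
        return missing;
    }

    /** Returns true if the text contains at least one U+FFFD character. */
    public static boolean hasReplacementCharacter(String extracted) {
        return extracted.indexOf(REPLACEMENT) >= 0;
    }

    public static void main(String[] args) {
        // ae-ligature, o-slash, a-ring, written as escapes so the source
        // file encoding does not matter.
        String norwegian = "\u00E6\u00F8\u00E5";

        // Hypothetical sample strings, not taken from the PDF in this issue.
        String intact = "Bl\u00E5b\u00E6rsyltet\u00F8y";
        String damaged = "Bl\uFFFDb\uFFFDrsyltet\uFFFDy";

        System.out.println("intact, missing: " + missingCharacters(intact, norwegian));
        System.out.println("damaged, missing: " + missingCharacters(damaged, norwegian));
        System.out.println("damaged, has U+FFFD: " + hasReplacementCharacter(damaged));
    }
}
```

A check like this only confirms that the expected letters appear somewhere in the text, so it is best run against a document known to contain all three characters, before and after swapping the PDFBox version.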


          People

            sarowe Steven Rowe
            erlendfg Erlend Garåsen
            Votes: 0
            Watchers: 2
