Details

      Description

      Tika/PDFBox supports extracting Arabic text from PDF files, but we do not ship the optional dependency (ICU4J) that it needs.
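
      One way to see whether the optional dependency is present at runtime is to probe the classpath. The following is only a minimal sketch, not code from the patch: it assumes the dependency is the ICU4J jar attached to this issue and that `com.ibm.icu.text.Bidi` is a representative ICU4J class to look for. The class name `OptionalDependencyCheck` and the helper `isClassAvailable` are hypothetical and are not part of Solr, Tika, or PDFBox.

```java
public class OptionalDependencyCheck {

    // Hypothetical helper: returns true if the named class can be located
    // on the classpath, without running its static initializers.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false,
                    OptionalDependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class (or something it links against) is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: com.ibm.icu.text.Bidi stands in for "ICU4J is available".
        boolean icuPresent = isClassAvailable("com.ibm.icu.text.Bidi");
        System.out.println("ICU4J on classpath: " + icuPresent);
    }
}
```

      If the check prints false, the icu4j jar is probably missing from contrib/extraction/lib.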

      Attachments

      1. SOLR-1813.patch (2 kB, Robert Muir)
      2. icu4j-4_2_1.jar (6.05 MB, Robert Muir)
      3. arabic.pdf (12 kB, Robert Muir)

        Activity

        Robert Muir added a comment -

        Attached is a patch with a test case.

        I can shrink the icu4j jar file if needed.

        I will attach the test PDF separately.

        Robert Muir added a comment -

        Attached is the PDF file for contrib/extraction/src/test/resources/arabic.pdf

        Robert Muir added a comment -

        Attached is the icu4j jar file that goes in contrib/extraction/lib


          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Robert Muir
