Tika / TIKA-3864

Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.1
    • Fix Version/s: 2.6.0
    • Component/s: tika-pipes, tika-server
    • Labels: None
    • Environment: debian:bullseye docker container running tika-server-standard-2.4.1.jar

    Description

      When using FileSystemFetcher, if the fetchKey contains non-ASCII characters, Tika Server throws an exception because the file name it resolves is incorrect. Here is an example:

      curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: 中文.txt" 

      I get java.nio.file.NoSuchFileException:

      Caused by: java.nio.file.NoSuchFileException: /restricted/ä¸æ–‡.txt
          at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
          at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
          at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
          at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
          at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
          at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
          at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
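
The garbled path in this exception is consistent with the UTF-8 bytes of the header value being decoded with a single-byte charset such as ISO-8859-1, which HTTP implementations commonly apply to header values. The sketch below reproduces that effect in plain Java. It illustrates the suspected cause under that assumption and is not Tika's code; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class FetchKeyMojibakeSketch {

    // Simulates a receiver that reads UTF-8 header bytes as ISO-8859-1 (assumed cause).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mis-decoding. This is lossless because ISO-8859-1 maps
    // every byte value to exactly one char.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "\u4e2d\u6587.txt"; // the fetchKey from the curl example
        String garbled = misdecode(key);
        System.out.println(garbled.length());            // prints 10 (one char per UTF-8 byte)
        System.out.println(repair(garbled).equals(key)); // prints true
    }
}
```

The two CJK characters occupy three bytes each in UTF-8, so the mis-decoded name has ten characters instead of six, and no file with that name exists under /restricted.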

       

      When I try to percent-encode the characters:

      curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: %E4%B8%AD%E6%96%87.txt" 

      I still get a java.nio.file.NoSuchFileException:

      Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
          at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
          at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
          at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
          at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
          at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
          at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
          at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
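
This second trace shows the percent-encoded key reaching the file system literally, so nothing on the server side decodes it. For reference, the sketch below shows what decoding such a value as UTF-8 would look like with the standard `java.net.URLDecoder` (the `Charset` overload requires Java 10 or later). It is only an illustration: the class and method names are hypothetical, and whether Tika should decode the header this way is not established by this report.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class FetchKeyPercentDecodeSketch {

    // Decodes %XX escapes as UTF-8. Note that URLDecoder also turns '+' into a
    // space, so keys containing a literal '+' would need it sent as %2B.
    static String decodeFetchKey(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeFetchKey("%E4%B8%AD%E6%96%87.txt");
        System.out.println(decoded.equals("\u4e2d\u6587.txt")); // prints true
    }
}
```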

      BTW, locale is set to C.UTF-8 on Tika Server:

      # locale
      LANG=C.UTF-8
      LANGUAGE=
      LC_CTYPE="C.UTF-8"
      LC_NUMERIC="C.UTF-8"
      LC_TIME="C.UTF-8"
      LC_COLLATE="C.UTF-8"
      LC_MONETARY="C.UTF-8"
      LC_MESSAGES="C.UTF-8"
      LC_PAPER="C.UTF-8"
      LC_NAME="C.UTF-8"
      LC_ADDRESS="C.UTF-8"
      LC_TELEPHONE="C.UTF-8"
      LC_MEASUREMENT="C.UTF-8"
      LC_IDENTIFICATION="C.UTF-8"
      LC_ALL= 

       

          People

            Assignee: Unassigned
            Reporter: Tong Wang (tongwang70)