Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.1
None
Environment: debian:bullseye Docker container running tika-server-standard-2.4.1.jar
Description
When using FileSystemFetcher, if the fetchKey contains non-ASCII characters, Tika Server throws an exception because the file name it resolves is incorrect. Here is an example:
curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: 中文.txt"
I get java.nio.file.NoSuchFileException:
Caused by: java.nio.file.NoSuchFileException: /restricted/ä¸æ.txt
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
    at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
    at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
    at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
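The garbled name in the exception is consistent with the UTF-8 bytes of the header value being decoded as ISO-8859-1 somewhere between curl and the fetcher. I have not confirmed where that happens in Tika Server, so treat the following as a minimal sketch of the effect only. The class and method names (HeaderMojibake, misdecode, repair) are hypothetical and are not part of Tika.

```java
import java.nio.charset.StandardCharsets;

public class HeaderMojibake {
    // Simulates a receiver that reads UTF-8 encoded bytes as ISO-8859-1.
    // Assumption: this mirrors what happens to the fetchKey header value.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "\u4e2d\u6587.txt"; // the file name from the curl example
        String garbled = misdecode(key);
        // Each of the two CJK characters is 3 bytes in UTF-8, so the garbled
        // name has 6 + 4 = 10 characters instead of 6.
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(key));
    }
}
```

Some of the six garbled characters are non-printing (U+00AD, U+0096, U+0087), which would explain why only "ä¸æ" is visible in the exception message.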
When I percent-encode the characters instead:
curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: %E4%B8%AD%E6%96%87.txt"
I still get a java.nio.file.NoSuchFileException:
Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
    at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
    at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
    at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
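The path in this second exception still contains the literal %E4... sequence, so the fetchKey was evidently used as-is, without percent-decoding. For comparison, here is a rough sketch of the decoding step that would turn the encoded key back into the real file name. It uses the standard java.net.URLDecoder (the Charset overload needs Java 10 or later); FetchKeyDecoding and decodeKey are hypothetical names and this is not how Tika Server handles the header.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class FetchKeyDecoding {
    // Percent-decodes a key, interpreting the decoded bytes as UTF-8.
    // Note: URLDecoder follows form-encoding rules, so a literal '+'
    // in the key would be turned into a space.
    static String decodeKey(String rawKey) {
        return URLDecoder.decode(rawKey, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = "%E4%B8%AD%E6%96%87.txt"; // value sent in the curl example
        String decoded = decodeKey(encoded);
        // The decoded key should match the intended file name.
        System.out.println(decoded.equals("\u4e2d\u6587.txt"));
    }
}
```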
BTW, locale is set to C.UTF-8 on Tika Server:
# locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=