Tika / TIKA-1078

TikaCLI: invalid characters in embedded document name cause FileNotFoundException when trying to save

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Component/s: cli, parser
    • Labels: None

      Description

      Attached document hits this on Windows:

      C:\>java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
      Extracting 'file0.png' (image/png) to .\file0.png
      Extracting 'file1.emf' (application/x-emf) to .\file1.emf
      Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
      Extracting 'file3.emf' (application/x-emf) to .\file3.emf
      Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
      Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
      Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
              at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
              at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
              at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
              at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
      Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
              at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
              at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
              at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
              at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
              at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
              ... 5 more
      

      TikaCLI manages to create the sub-directory, but opening the output file then fails because the embedded file name contains characters that are invalid on Windows.

      On Linux it runs fine.

      I think somehow ... we have to sanitize the embedded file name ...

      1. tika-1078-2.patch
        9 kB
        Stefano Fornari
      2. tika-1078.patch
        8 kB
        Ken Krugler
      3. T-DS_Excel2003-PPT2003_1.xls
        1.37 MB
        Michael McCandless

        Activity

        Chris A. Mattmann added a comment -
        • push to 1.5, get ready for 1.4 RC #1.
        Stefano Fornari added a comment -

        I'd like to fix this one as a way to get familiar with Tika.
        I have a couple of questions:

        1. As far as I understand it (and based on the tests I have done), the problem here is with characters that the different file systems do not allow in file names, not with special (i.e. non-ASCII or non-UTF-8) characters. Can anyone confirm?
        2. Is there any general policy in Tika development I should follow with respect to the Java version? Should I stick to a particular version of Java, or can I go with Java 7?

        Michael McCandless added a comment -

        > I'd like to fix this one as a way to get familiar with Tika.

        Wonderful!

        > 1. As far as I understand it (and based on the tests I have done), the problem here is with characters that the different file systems do not allow in file names, not with special (i.e. non-ASCII or non-UTF-8) characters. Can anyone confirm?

        Yes, I think so. I.e., each OS/filesystem imposes its own restrictions on what characters are allowed in a filename.

        > 2. Is there any general policy in Tika development I should follow with respect to the Java version? Should I stick to a particular version of Java, or can I go with Java 7?

        Tika must work with Java 6 ... so you can use Java 7 for development, but before committing we need to make sure it works on Java 6 as well.

        Stefano Fornari added a comment -

        I have the patch ready. I cannot find a way to attach it here, so I am posting it to the dev list. I followed a more conservative approach: characters that may be reserved by an operating system or file system are turned into a hex code. This is transparent to all platforms, so the behaviour will be the same everywhere.
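
The hex-code approach described in this comment can be sketched roughly as follows. This is not the code from the attached patch: the class name `HexEscapeSketch`, the contents of `RESERVED`, and the `%XX` output format are all assumptions made for illustration. The reserved list here is the set of characters Windows rejects in file names, plus `%` so that escaped names stay unambiguous.

```java
// Illustrative sketch only; names and the reserved list are assumptions,
// not taken from tika-1078.patch.
final class HexEscapeSketch {

    // Hypothetical reserved characters: those Windows rejects in file names,
    // plus '%' because it is used as the escape marker below.
    private static final String RESERVED = "<>:\"/\\|?*%";

    // Replaces control characters and reserved characters with %XX,
    // where XX is the character's code in upper-case hex.
    static String escape(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < ' ' || RESERVED.indexOf(c) >= 0) {
                out.append('%').append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because the same characters are escaped on every platform, a given embedded name maps to the same output file name on Windows and Linux alike. Note that this sketch also escapes `/` and `\`, so any sub-directory structure in the embedded name is flattened.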

        Ken Krugler added a comment -

        Attaching for Stefano

        Michael McCandless added a comment -

        Thanks Stefano!

        Can you fix the license header on the two new files to match the current sources? Thanks.

        Also, we don't normally include @author tags.

        Maybe use a HashSet instead of an array for RESERVED, so it's not an O(N) lookup per character? Also, since you check for < ' ', you shouldn't need any entries < 0x20?

        Sometimes (rarely?), attachment filenames have their own sub-directories, and the code today will happily .mkdirs those subdirectories, but it looks like with this patch we now replace / and \ with their hex equivalents, instead? I think that's OK...
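
The `HashSet` suggestion in this review could look something like the sketch below. The class name `ReservedLookupSketch`, the method `needsEscaping`, and the set contents are hypothetical. The point is only that a set gives constant-time membership checks, and that the `c < ' '` test already covers every control character, so none of them need an entry in the set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; not the committed Tika code.
final class ReservedLookupSketch {

    // Hypothetical reserved set. No entries below 0x20 are listed because
    // the range check in needsEscaping() handles all control characters.
    // Written without the diamond operator so it compiles on Java 6.
    private static final Set<Character> RESERVED = new HashSet<Character>(
            Arrays.asList('<', '>', ':', '"', '/', '\\', '|', '?', '*', '%'));

    // Constant-time check per character instead of scanning an array.
    static boolean needsEscaping(char c) {
        return c < ' ' || RESERVED.contains(c);
    }
}
```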

        Stefano Fornari added a comment -

        Hi Michael,
        thanks for the review. I took all your comments into account. About the directory structure, I reverted my change now that I better understand the original behaviour. I think the original behaviour is cleaner and nicer.

        Attaching the new patch.
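
One way to keep the sub-directory behaviour while still sanitizing names is to split the embedded name on `/` and `\` and escape each segment separately. The sketch below illustrates that idea under stated assumptions; it is not taken from tika-1078-2.patch, and the names `SegmentSanitizerSketch`, `sanitizePath` and `escapeSegment` are hypothetical.

```java
// Illustrative sketch only; not the code from tika-1078-2.patch.
final class SegmentSanitizerSketch {

    // Hypothetical reserved characters. The separators are absent because
    // they are consumed by the split in sanitizePath().
    private static final String RESERVED = "<>:\"|?*%";

    // Escapes each path segment on its own and rejoins the segments with '/',
    // which java.io.File accepts as a separator on Windows as well.
    static String sanitizePath(String embeddedName) {
        String[] segments = embeddedName.split("[/\\\\]");
        StringBuilder out = new StringBuilder();
        for (String segment : segments) {
            if (segment.isEmpty()) {
                continue; // skip empty segments from leading or doubled separators
            }
            if (out.length() > 0) {
                out.append('/');
            }
            out.append(escapeSegment(segment));
        }
        return out.toString();
    }

    // Replaces control and reserved characters with %XX (upper-case hex).
    static String escapeSegment(String segment) {
        StringBuilder out = new StringBuilder(segment.length());
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (c < ' ' || RESERVED.indexOf(c) >= 0) {
                out.append('%').append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this shape, a name such as `MBD0016BDE4/a?b.bin` keeps its `MBD0016BDE4` directory, and only the file name segment is rewritten.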

        Michael McCandless added a comment -

        Thanks Stefano, I made one small change (added generics: HashSet<Character>) and committed.


          People

          • Assignee: Unassigned
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 4
