Tika / TIKA-877

Embedded document not extracted (regression)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.2
    • Component/s: parser
    • Labels:

      Description

      Testing the 1.1 rc, I believe I found a regression, hence the priority.

      dbonniot-t520 /tmp/1.0 java -jar ../tika-app-1.0.jar -z ../coffee.xls 
      Extracting 'file0.wmf' (application/x-msmetafile)
      Extracting 'file1.wmf' (application/x-msmetafile)
      Extracting 'file2.wmf' (application/x-msmetafile)
      Extracting 'file3.wmf' (application/x-msmetafile)
      Extracting 'file4.png' (image/png)
      Extracting 'MBD002B040A.wps' (application/vnd.ms-works)
      Extracting 'file5.bin' (application/octet-stream)
      Extracting 'MBD00262FE3.unknown' (application/x-tika-msoffice)
      
      dbonniot-t520 /tmp/1.0 cd ../1.1
      dbonniot-t520 /tmp/1.1 java -jar ../tika-app-1.1.jar -z ../coffee.xls 
      Extracting 'file0.emf' (application/x-emf)
      Extracting 'file1.emf' (application/x-emf)
      Extracting 'file2.emf' (application/x-emf)
      Extracting 'file3.emf' (application/x-emf)
      Extracting 'file4.png' (image/png)
      Extracting 'MBD002B040A.wps' (application/vnd.ms-works)
      Extracting 'file5' (application/x-tika-msoffice-embedded)
      Extracting 'MBD00262FE3.unknown' (application/x-tika-msoffice)
      
      dbonniot-t520 /tmp/1.1 ls -l ../1.0/file5.bin ../1.1/file5 
      -rw-r--r-- 1 dbonniot dbonniot 2519 2012-03-18 21:51 ../1.0/file5.bin
      -rw-r--r-- 1 dbonniot dbonniot    0 2012-03-18 21:51 ../1.1/file5
      

      Notice how 1.0 could extract the data for file5, but 1.1 creates an empty file instead.
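      The `ls -l` comparison above can also be automated. The following is a minimal sketch using only the Java standard library; the class name `EmptyExtractCheck` and the method `emptyFiles` are hypothetical names chosen for illustration and are not part of Tika. It lists the zero-byte files in the directory where `-z` wrote its output.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EmptyExtractCheck {
    /** Returns the sorted names of regular files in dir whose size is zero. */
    public static List<String> emptyFiles(Path dir) throws IOException {
        List<String> empty = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry) && Files.size(entry) == 0) {
                    empty.add(entry.getFileName().toString());
                }
            }
        }
        Collections.sort(empty);
        return empty;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: directory holding the extracted files (defaults to the current directory)
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (String name : emptyFiles(dir)) {
            System.out.println("empty: " + name);
        }
    }
}
```

      Run against the 1.1 output directory from the session above, this should report `file5`, since that is the only zero-byte file shown there.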

      By the way, I do see improvements in 1.1 as well, congrats for that!

      1. coffee.xls
        113 kB
        Daniel Bonniot de Ruisselet

        Activity

        Daniel Bonniot de Ruisselet added a comment - edited

        The regression appears with this commit:

        r1221112 | nick | 2011-12-20 07:15:29 +0100 (Tue, 20 Dec 2011) | 1 line

        TIKA-757 Tidy the OLE10Native extractor code now that POI has been upgraded

        http://svn.apache.org/viewvc?view=revision&revision=1221112

        Nick Burch added a comment -

        Hmm, that commit wasn't supposed to break anything; it was just removing some code that is now provided by POI utility methods. At first glance I can't see what's wrong, but perhaps one of the other POI experts can spot it?

        Daniel Bonniot de Ruisselet added a comment -

        It's definitely not equivalent code. That commit can be reversed cleanly on HEAD, and file5.bin (instead of "file5") is again extracted and non-empty.

        Michael McCandless added a comment -

        I'm also surprised this change broke embedded OLE extraction!

        We should add this document as a test case.

        Maybe, until we can understand what's going on (need help from POI experts!), we should go back to the full copy/serialize?

        Maxim Valyanskiy added a comment -

        Hm, I ran into this problem in my tika-server yesterday and found a solution. Extraction has been broken since this commit:

        https://github.com/apache/tika/commit/29473bf5d81a23e59d5d9ff08c611fbbc7ed79c3#L3L139

        (TIKA-753).

        I'll fix that in tika-app too.

        Daniel Bonniot de Ruisselet added a comment -

        Maxim, sounds good, but are you sure it's the same issue (you point to a different commit)? Do you have example documents to demonstrate the regression you are seeing?

        Maxim Valyanskiy added a comment -

        I'm not sure about 'file5', but the zero-sized 'MBD00262FE3.unknown' and 'MBD002B040A.wps' in your coffee.xls are definitely the problem that I'm going to solve.

        Daniel Bonniot de Ruisselet added a comment -

        Then it seems like that is another case, and that opening a separate task would be clearer.

        Maxim Valyanskiy added a comment -

        It became the same problem after the commit that you pointed to.

        Maxim Valyanskiy added a comment -
        [maxcom@pc-elrond t]$ java -jar ../tika-app/target/tika-app-1.2-SNAPSHOT.jar -z ~/download-tmp/coffee.xls 
        Extracting 'file0.emf' (application/x-emf)
        Extracting 'file1.emf' (application/x-emf)
        Extracting 'file2.emf' (application/x-emf)
        Extracting 'file3.emf' (application/x-emf)
        Extracting 'file4.png' (image/png)
        Extracting 'MBD002B040A.wps' (application/vnd.ms-works)
        Extracting 'file5' (application/x-tika-msoffice-embedded)
        Extracting 'MBD00262FE3.unknown' (application/x-tika-msoffice)
        [maxcom@pc-elrond t]$ ls -l
        total 156
        -rw-r--r-- 1 maxcom consult 10988 Mar 21 15:07 file0.emf
        -rw-r--r-- 1 maxcom consult 16836 Mar 21 15:07 file1.emf
        -rw-r--r-- 1 maxcom consult 13816 Mar 21 15:07 file2.emf
        -rw-r--r-- 1 maxcom consult  9296 Mar 21 15:07 file3.emf
        -rw-r--r-- 1 maxcom consult  4984 Mar 21 15:07 file4.png
        -rw-r--r-- 1 maxcom consult 16896 Mar 21 15:07 file5
        -rw-r--r-- 1 maxcom consult 31232 Mar 21 15:07 MBD00262FE3.unknown
        -rw-r--r-- 1 maxcom consult 35840 Mar 21 15:07 MBD002B040A.wps
        
        
        Maxim Valyanskiy added a comment -

        Hm, no empty files, but the size of file5 is different...

        Maxim Valyanskiy added a comment -

        I think it is not a real problem, because "file5" is an invalid Ole10Native attachment.

        Tika 1.0 saves the internal data stream of that entry, prepended by some headers that it could not parse. The current (trunk) version saves the complete Ole10Native stream when the entry is not valid.
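        The fallback described here (keep the whole stream when the entry cannot be parsed, rather than writing an empty file) can be sketched roughly as follows. This is not Tika's or POI's code: the class `NativeEntrySketch`, the method `payloadOrWhole`, and the simplified layout (a 4-byte little-endian length followed by exactly that many payload bytes) are assumptions made purely for illustration. The real Ole10Native header carries more fields than this.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class NativeEntrySketch {
    /**
     * Returns the payload if the stream matches the simplified (hypothetical)
     * layout; otherwise returns a copy of the complete stream unchanged.
     */
    public static byte[] payloadOrWhole(byte[] stream) {
        if (stream.length >= 4) {
            int declared = ByteBuffer.wrap(stream, 0, 4)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .getInt();
            // Treat the entry as valid only if the declared length
            // exactly covers the bytes that follow the length field.
            if (declared >= 0 && declared == stream.length - 4) {
                return Arrays.copyOfRange(stream, 4, stream.length);
            }
        }
        // Entry is not valid: keep everything instead of producing an empty file.
        return stream.clone();
    }
}
```

        The point of the design is that an unparseable entry still yields non-empty output, which would match the 16896-byte file5 in the trunk listing above, although the real validity check is more involved.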


          People

          • Assignee: Maxim Valyanskiy
          • Reporter: Daniel Bonniot de Ruisselet