TIKA-2146

Unable to extract contents from protected MS Word doc: java.lang.ArrayIndexOutOfBoundsException

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.11
    • Fix Version/s: None
    • Component/s: core, parser
    • Labels:
      None
    • Environment:

      Windows 7

      Description

      When I try to parse a protected MS Word document, I am unable to extract the content; instead, I get the exception below:

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.Tika.parseToString(Tika.java:537)
      at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
      at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
      at java.security.AccessController.doPrivileged(Native Method)
      at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
      at org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
      at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
      at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
      at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
      at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
      at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
      at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
      at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
      at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
      at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
      at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
      at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
      at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
      at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
      at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
      at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
      at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
      at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
      at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
      at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
      at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.ArrayIndexOutOfBoundsException
      at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
      at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
      at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
      at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
      at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Are you able to share the document?

        Do you have the password for the document?

        mnsk07 Sharath Kumar added a comment - - edited

        Sure. I have uploaded the doc. The file is not password protected.
        I also see errors like the one below for this type of document (protected Word docs):

        java.security.PrivilegedActionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
        at java.security.AccessController.doPrivileged(Native Method)

        t3knoid Frank Refol added a comment - - edited

        I just ran into this issue as well. I am testing unprotecting MS Word docs from the command line using Tika app 1.13. I ran into the problem when trying to open a Word 97-2003 document:

        java -jar tika-app-1.13.jar -t --password=password "This is password protected.doc"

        I am attaching the sample doc that I am using for testing. The password is simply "password".

        BTW, there is no problem parsing a non-password protected document. Also, FYI, the test file was created using MS Office 2010 by using the Save As Word 97-2003 document option.

        gagravarr Nick Burch added a comment -

        As per https://poi.apache.org/encryption.html, there's no support in Apache POI for reading password-protected .doc files, only .docx ones. Sadly that means, unless someone volunteers to add the support to POI, that having the password won't actually help...

        mnsk07 Sharath Kumar added a comment -

        Tim Allison

        I ran the same document that I attached using Tika 1.13, and I get the issue below even in 1.13. I have one more protected MS Word 97 document (which I can't share because it contains sensitive data) that also returns an error. Below are the error logs. I have a question: does Tika support extracting the contents of a protected MS Word document? The document in question is not password protected, though.

        Output 1:
        C:\Users\sk\Downloads>java -jar tika-app-1.13.jar Testbug.doc
        Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.Offic
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
        Caused by: java.lang.IllegalStateException: Told we're for characters 8236 -> 10293, but actually covers 2055 characters!
        at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:73)
        at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:112)
        at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
        at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:72)
        at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:602)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:146)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 5 more

        Output 2:

        Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6f27a732
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
        Caused by: java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:342)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 5 more

        mnsk07 Sharath Kumar added a comment -

        Does Tika support extracting the contents of a protected MS Word document? The document is not password protected, though.

        tallison@mitre.org Tim Allison added a comment -

        I wonder if these errors are caused by what I found with old "protected" Excel files. Even though they weren't password protected, they were still "protected", and the inner objects were encrypted to the point that even the record lengths were unreadable, leading to ArrayIndexOutOfBoundsException (AIOOBE) and other similar problems.
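
To illustrate the failure mode described above, here is a minimal sketch of a bounds check performed before copying a record out of a stream. It is not POI's actual code: `SafeSlice` and `slice` are hypothetical names, and the assumption is simply that an offset or length read from encrypted bytes can be arbitrary garbage, which is what would make a bare `System.arraycopy` fail with an ArrayIndexOutOfBoundsException.

```java
// Hypothetical helper, not part of POI or Tika.
final class SafeSlice {

    /**
     * Copies length bytes starting at offset, validating both values first
     * so that garbage read from an encrypted or corrupt file produces a
     * descriptive error instead of a bare ArrayIndexOutOfBoundsException.
     */
    static byte[] slice(byte[] stream, int offset, int length) {
        // Written as a subtraction so the comparison cannot overflow.
        if (offset < 0 || length < 0 || offset > stream.length - length) {
            throw new IllegalStateException(
                "Record claims " + length + " bytes at offset " + offset
                + " but the stream has only " + stream.length
                + " bytes; the file may be encrypted or corrupt");
        }
        byte[] out = new byte[length];
        System.arraycopy(stream, offset, out, 0, length);
        return out;
    }
}
```

A check like this would not make the content readable, but it would turn the opaque exception into a message that hints at the likely cause.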

        t3knoid Frank Refol added a comment -

        Thanks for clarifying and providing that link. That is very helpful in giving insight into what is available in Tika for decrypting MS Office docs.

        mnsk07 Sharath Kumar added a comment -

        What would be the action plan for this? Is this going to be supported in Tika or not?

        tallison@mitre.org Tim Allison added a comment -

        Sadly that means, unless someone volunteers to add the support to POI, that having the password won't actually help...

        I think Nick summed it up. If you can open an issue on POI and find someone to add the support, then it will happen. I regret that we can't fix this at the Tika level. I'd look into it at the POI level, but this is an area that is beyond my comfort zone.

        gagravarr Nick Burch added a comment -

        My guess is it's about 2-3 weeks of work at the POI level to add support for this. Unless you've got a handy intern or some budget, it looks unlikely it'll be fixed soon...

        However, it's probably only 2-3 hours of work reading through the published .DOC file format specs from Microsoft to find out how encrypted Word documents are marked as such in the file. You probably want https://msdn.microsoft.com/en-us/library/office/gg615596(v=office.14).aspx then https://msdn.microsoft.com/en-us/library/office/cc313153(v=office.12).aspx . Once someone has found that out, it's only a few minutes' work to add the check and throw a more helpful exception.
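
A rough sketch of what such a check might look like follows. It assumes, from my reading of the [MS-DOC] FibBase layout, that the WordDocument stream starts with the identifier 0xA5EC and that the fEncrypted flag is bit 0x0100 of the little-endian 16-bit flag word at offset 0x0A; verify both against the spec before relying on them. `DocEncryptionCheck` and `EncryptedDocException` are hypothetical names, not POI or Tika classes, and extracting the WordDocument stream from the OLE2 container is assumed to happen elsewhere.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of POI or Tika.
final class DocEncryptionCheck {
    // Assumed values from the [MS-DOC] FibBase layout.
    private static final int WORD_IDENT = 0xA5EC;   // wIdent at offset 0
    private static final int FLAGS_OFFSET = 0x0A;   // 16-bit flag word
    private static final int F_ENCRYPTED = 0x0100;  // fEncrypted bit

    /** Hypothetical exception for a clearer error message. */
    static final class EncryptedDocException extends Exception {
        EncryptedDocException(String message) {
            super(message);
        }
    }

    /** Returns true if the FIB at the start of the stream has fEncrypted set. */
    static boolean isEncrypted(byte[] wordDocumentStream) {
        if (wordDocumentStream.length < FLAGS_OFFSET + 2) {
            throw new IllegalArgumentException("Stream too short to contain a FIB");
        }
        ByteBuffer buf = ByteBuffer.wrap(wordDocumentStream)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        int ident = buf.getShort(0) & 0xFFFF;
        if (ident != WORD_IDENT) {
            // Other identifiers (e.g. older Word formats) are out of scope here.
            throw new IllegalArgumentException("Unexpected FIB identifier");
        }
        int flags = buf.getShort(FLAGS_OFFSET) & 0xFFFF;
        return (flags & F_ENCRYPTED) != 0;
    }

    /** Fails fast with a descriptive exception instead of a later AIOOBE. */
    static void requireNotEncrypted(byte[] wordDocumentStream)
            throws EncryptedDocException {
        if (isEncrypted(wordDocumentStream)) {
            throw new EncryptedDocException(
                "Document is encrypted/protected; .doc decryption is not supported");
        }
    }
}
```

Calling `requireNotEncrypted` before the rest of the document is parsed would surface the real problem up front, rather than letting the parser run into unreadable record lengths.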


          People

          • Assignee:
            Unassigned
            Reporter:
            mnsk07 Sharath Kumar