  1. Tika
  2. TIKA-2519

Issue parsing multiple CHM files concurrently

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: None
    • Labels: None

      Description

      Should I expect to be able to parse multiple CHM files concurrently in multiple threads?
      When I try to parse two different CHM files in different threads, I notice the following:

      • ChmExtractor.extractChmEntry() gets a ChmBlockInfo as follows:
                        ChmBlockInfo bb = ChmBlockInfo.getChmBlockInfoInstance(
                                directoryListingEntry, (int) getChmLzxcResetTable()
                                        .getBlockLen(), getChmLzxcControlData());
        
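      • ChmBlockInfo.getChmBlockInfoInstance() is a static method that appears to keep only one current ChmBlockInfo in static state: each call replaces it via setChmBlockInfo() and then populates it through repeated getChmBlockInfo() calls, so two threads can end up writing to and returning each other's instance.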
            public static ChmBlockInfo getChmBlockInfoInstance(
                    DirectoryListingEntry dle, int bytesPerBlock,
                    ChmLzxcControlData clcd) {
                setChmBlockInfo(new ChmBlockInfo());
                getChmBlockInfo().setStartBlock(dle.getOffset() / bytesPerBlock);
                getChmBlockInfo().setEndBlock(
                        (dle.getOffset() + dle.getLength()) / bytesPerBlock);
                getChmBlockInfo().setStartOffset(dle.getOffset() % bytesPerBlock);
                getChmBlockInfo().setEndOffset(
                        (dle.getOffset() + dle.getLength()) % bytesPerBlock);
                // potential problem with casting long to int
                getChmBlockInfo().setIniBlock(
                        getChmBlockInfo().startBlock - getChmBlockInfo().startBlock
                                % (int) clcd.getResetInterval());
                // (getChmBlockInfo().startBlock - getChmBlockInfo().startBlock)
                //         % (int) clcd.getResetInterval());
                return getChmBlockInfo();
            }
        

      Is there a good reason why there should only ever be one instance of ChmBlockInfo?
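
      To make the concern concrete, here is a minimal sketch of a factory that builds its result in local variables only, so concurrent callers cannot overwrite each other's values. BlockInfoSketch and BlockInfoDemo are hypothetical stand-ins written for this report, not Tika classes, and the plain int parameters replace the DirectoryListingEntry and ChmLzxcControlData arguments of the real method. The arithmetic mirrors the snippet above; this is not necessarily how the fix was implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for ChmBlockInfo; immutable, with no static state. */
final class BlockInfoSketch {
    final int startBlock;
    final int endBlock;
    final int startOffset;
    final int endOffset;
    final int iniBlock;

    private BlockInfoSketch(int startBlock, int endBlock, int startOffset,
                            int endOffset, int iniBlock) {
        this.startBlock = startBlock;
        this.endBlock = endBlock;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.iniBlock = iniBlock;
    }

    /** Computes every field in locals, so nothing is shared between callers. */
    static BlockInfoSketch of(int offset, int length, int bytesPerBlock, int resetInterval) {
        int startBlock = offset / bytesPerBlock;
        int endBlock = (offset + length) / bytesPerBlock;
        int startOffset = offset % bytesPerBlock;
        int endOffset = (offset + length) % bytesPerBlock;
        // Same expression as in the report: startBlock rounded down to a reset-interval boundary.
        int iniBlock = startBlock - startBlock % resetInterval;
        return new BlockInfoSketch(startBlock, endBlock, startOffset, endOffset, iniBlock);
    }
}

final class BlockInfoDemo {
    /** Builds block info for several offsets in parallel and checks each result against its own input. */
    static boolean allConsistent(int tasks) {
        final int bytesPerBlock = 32768;
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> checks = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int offset = i * 10_000;
                checks.add(pool.submit(() -> {
                    BlockInfoSketch info = BlockInfoSketch.of(offset, 5_000, bytesPerBlock, 2);
                    return info.startBlock == offset / bytesPerBlock
                            && info.startOffset == offset % bytesPerBlock;
                }));
            }
            for (Future<Boolean> check : checks) {
                if (!check.get()) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("all results consistent: " + allConsistent(100));
    }
}
```

      Because each call returns its own object and touches no static field, the result a thread gets back depends only on the arguments that thread passed in.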

      Should we forget about attempting to process CHM files in parallel and instead queue them up to be processed sequentially?
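
      If sequential processing turns out to be the only safe option on an affected version (this ticket lists 1.16 as affected and 1.17 as the fix version), one possible workaround is to funnel every CHM parse through a single worker thread. The sketch below uses only java.util.concurrent; SerialChmQueue, SerialChmQueueDemo, and fakeParse are hypothetical names, and fakeParse is a placeholder for whatever parse call the application actually makes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs submitted jobs one at a time, in submission order. */
final class SerialChmQueue implements AutoCloseable {
    // A single worker thread guarantees that no two jobs ever overlap.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    <T> Future<T> submit(Callable<T> parseJob) {
        return worker.submit(parseJob);
    }

    @Override
    public void close() {
        // Stops accepting new jobs; jobs already queued still run to completion.
        worker.shutdown();
    }
}

final class SerialChmQueueDemo {
    /** Placeholder for a real CHM parse call; it only echoes the file name. */
    static String fakeParse(String fileName) {
        return "parsed:" + fileName;
    }

    /** Queues one job per file and collects the results in submission order. */
    static List<String> parseAll(List<String> fileNames) {
        List<String> results = new ArrayList<>();
        try (SerialChmQueue queue = new SerialChmQueue()) {
            List<Future<String>> pending = new ArrayList<>();
            for (String name : fileNames) {
                pending.add(queue.submit(() -> fakeParse(name)));
            }
            for (Future<String> job : pending) {
                results.add(job.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(parseAll(Arrays.asList("a.chm", "b.chm")));
    }
}
```

      Callers on any thread can submit work, but only CHM parsing is serialized; other document types could still be parsed in parallel on a separate pool.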

            People

            • Assignee: Unassigned
            • Reporter: esaunders (Eamonn Saunders)