Tika / TIKA-2519

Issue parsing multiple CHM files concurrently

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: None
    • Labels: None

    Description

      Should I expect to be able to parse multiple CHM files concurrently, each in its own thread?
      Here is what I notice when parsing two different CHM files in different threads:

      • ChmExtractor.extractChmEntry() gets a ChmBlockInfo as follows:
            ChmBlockInfo bb = ChmBlockInfo.getChmBlockInfoInstance(
                    directoryListingEntry,
                    (int) getChmLzxcResetTable().getBlockLen(),
                    getChmLzxcControlData());

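      • ChmBlockInfo.getChmBlockInfoInstance() is a static method that stores each new instance through the static setChmBlockInfo()/getChmBlockInfo() accessors, so it appears that only one ChmBlockInfo can exist at a time and a second thread can replace the instance the first thread is still filling in: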
            public static ChmBlockInfo getChmBlockInfoInstance(
                    DirectoryListingEntry dle, int bytesPerBlock,
                    ChmLzxcControlData clcd) {
                setChmBlockInfo(new ChmBlockInfo());
                getChmBlockInfo().setStartBlock(dle.getOffset() / bytesPerBlock);
                getChmBlockInfo().setEndBlock(
                        (dle.getOffset() + dle.getLength()) / bytesPerBlock);
                getChmBlockInfo().setStartOffset(dle.getOffset() % bytesPerBlock);
                getChmBlockInfo().setEndOffset(
                        (dle.getOffset() + dle.getLength()) % bytesPerBlock);
                // potential problem with casting long to int
                getChmBlockInfo().setIniBlock(
                        getChmBlockInfo().startBlock - getChmBlockInfo().startBlock
                                % (int) clcd.getResetInterval());
                // (getChmBlockInfo().startBlock - getChmBlockInfo().startBlock)
                //         % (int) clcd.getResetInterval());
                return getChmBlockInfo();
            }
        

      Is there a good reason why there should only ever be one instance of ChmBlockInfo?
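
For illustration, here is a minimal sketch of a factory that keeps no shared state. It is not the Tika source: BlockInfoSketch and its of() method are hypothetical names, and the plain int parameters stand in for the values read from DirectoryListingEntry, the reset table, and ChmLzxcControlData. The arithmetic mirrors the method quoted above, but every value is computed into local variables and returned in a fresh immutable object, so two threads cannot overwrite each other's result.

```java
// Hypothetical, simplified stand-in for ChmBlockInfo; not the Tika implementation.
final class BlockInfoSketch {
    final int startBlock;
    final int endBlock;
    final int startOffset;
    final int endOffset;
    final int iniBlock;

    private BlockInfoSketch(int startBlock, int endBlock,
                            int startOffset, int endOffset, int iniBlock) {
        this.startBlock = startBlock;
        this.endBlock = endBlock;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.iniBlock = iniBlock;
    }

    // Builds a new instance on every call. No static field is read or written,
    // so concurrent callers each get their own independent object.
    // Parameter names are assumptions that mirror the quoted method's inputs.
    static BlockInfoSketch of(int offset, int length,
                              int bytesPerBlock, int resetInterval) {
        int startBlock = offset / bytesPerBlock;
        int endBlock = (offset + length) / bytesPerBlock;
        int startOffset = offset % bytesPerBlock;
        int endOffset = (offset + length) % bytesPerBlock;
        // Same expression as the quoted code: round startBlock down to a
        // multiple of the reset interval.
        int iniBlock = startBlock - startBlock % resetInterval;
        return new BlockInfoSketch(startBlock, endBlock,
                                   startOffset, endOffset, iniBlock);
    }
}
```

With a factory shaped like this, each call to extract an entry would hold its own block info in a local variable, and nothing would need to be synchronized.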

      Should we forget about attempting to process CHM files in parallel and instead queue them up to be processed sequentially?
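
If sequential processing turns out to be the safer route, a rough sketch of the queueing approach could look like the following. SequentialParseQueue and parseAll are hypothetical names, and parseOne is a placeholder for whatever call actually parses one CHM file; only the java.util.concurrent classes are real APIs. A single-thread executor runs the submitted tasks one at a time in submission order, so the shared ChmBlockInfo is never touched by two parses at once, provided nothing else parses CHM files outside this queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; not part of Tika.
final class SequentialParseQueue {

    // Applies parseOne to each input on one worker thread, one task at a time,
    // and returns the results in the same order as the inputs.
    static <I, O> List<O> parseAll(List<I> inputs, Function<I, O> parseOne) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            List<Future<O>> pending = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> parseOne.apply(input);
                pending.add(worker.submit(task));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : pending) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a parse", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A parse task failed", e.getCause());
        } finally {
            worker.shutdown();
        }
    }
}
```

The trade-off is throughput: callers may submit from any thread, but CHM files are parsed strictly one after another.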

          People

            Assignee: Unassigned
            Reporter: Eamonn Saunders (esaunders)