Bug 51524 - PapBinTable constructor is slow
Summary: PapBinTable constructor is slow
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF (show other bugs)
Version: 3.8-dev
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-07-18 14:52 UTC by Antoni Mylka
Modified: 2011-07-19 12:25 UTC (History)
0 users



Attachments
Hack which disables the paragraphs structure rebuild (6.05 KB, patch)
2011-07-18 16:46 UTC, Antoni Mylka
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Antoni Mylka 2011-07-18 14:52:43 UTC
The current (r1147828) constructor of the PapBinTable class does something like:

List<PAPX> newPapxs = new LinkedList<PAPX>();
foreach character in docText
   do something
   List<PAPX> papxs = new LinkedList<PAPX>();
   foreach paragraph in paragraphs
      do something with papxs
   do something with papxs and newPapxs
set this.paragraphs to newPapxs

The problem is that the overall complexity rises quadratically with the document size. For instance I have a document which has 341742 paragraphs and docText at this point is 653186 characters log. I didn't even have the patience to wait until it finishes. 

In 3.7, this constructor was much simpler, this.paragraphs was not transformed. This introduced a performance regression. We had an experiment where we processed and indexed the content of some doc files. The time rose from 9 to 54 minutes between 3.7 and 3.8.beta3. 

The document I talked about comes from the govdocs dataset. It's public.

http://domex.nps.edu/corp/files/govdocs1/007/007488.doc

There is probably a good reason for this, but the performance regression is significant and the previous version seems to have worked well enough. Maybe this transformation could be disabled with some switch or a system property.
Comment 1 Antoni Mylka 2011-07-18 16:46:33 UTC
Created attachment 27297 [details]
Hack which disables the paragraphs structure rebuild

The attached file is an ugly hack I whipped up for myself. It allows me to bypass the paragraph rebuilding procedure. The HWPFDocument constructor takes 5 seconds now and I can use the content. Without it the constructor would take more than an hour on my machine. I didn't wait to the end.

Obviously this doesn't solve the problem and there are definitely consequences I'm not aware of. So please, either apply the patch (or something along the lines, allowing the user to bypass that method) and document the consequences, or accelerate that code to do what it's supposed to do faster.

The link I supplied gives a good test case.
Comment 2 Sergey Vladimirov 2011-07-18 18:53:57 UTC
Antoni,

Rebuilding PAPX table is central point in paragraph processing. It shall not be disabled in a way like "get me old poi", because old poi based on incorrect algorithm.

Previous version looked for all PAPX and called them paragraphs. Current version construct the text, split it into paragraphs and applies PAPX. I.e. in previous version you missed some text from document in case there were not PAPX for it. For example, in document by the link you provide 151 parts of text were missing.

So disabling this feature can be usefull only in one case - you want to work with PAPX directly. I have no idea why someone will want to do it outside of POI.

Anyway, personally i'm sad it is done in PapBinTable. It shall be done on different level, in completely different place. There should be at least two different APIs: one for low-level files processing, including parsing PAPX structures, complex file tables, etc., and second - to work with document, with tables, lists, paragraphs, etc. Rebuilding paragraphs list shall be on the second level.

But making such change will break API for sure. So we need to wait until 4.0, where "old" way to work with PAPX will be provided and preserved in "first level API", as well as DocumentParser with different options on the "second level API".

Anyway, again, i rewrite this code partially and now it tooks 4 seconds. Please, let me know if it's acceptable. If possible, please check if data parsed are correct - errors could be.
Comment 3 Antoni Mylka 2011-07-19 12:25:41 UTC
It seems that your fix works. Thanks very much!