Bug 51585 - [PATCH] WorkbookFactory.create() hangs when creating a workbook
Summary: [PATCH] WorkbookFactory.create() hangs when creating a workbook
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 3.9-FINAL
Hardware: All All
: P2 normal with 12 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords: PatchAvailable
: 53250 (view as bug list)
Depends on:
Blocks:
 
Reported: 2011-07-29 15:59 UTC by jxz164
Modified: 2014-01-26 17:00 UTC (History)
6 users (show)



Attachments
Example xlsx file to reproduce the hang (256.08 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2011-07-29 15:59 UTC, jxz164
Details
Altered ColumnHelper, that uses the CTCol and CTCols proxies (12.45 KB, patch)
2012-10-08 18:46 UTC, val235
Details | Diff
proxy for CTCols that maintains max and min values (1.27 KB, patch)
2012-10-08 18:49 UTC, val235
Details | Diff
- proxy for CTCol that maintain a reference to the CTCols its contained in and updates min and max (1.16 KB, patch)
2012-10-08 18:52 UTC, val235
Details | Diff
Modified ColumnHelper with efficient algorithm (15.33 KB, text/plain)
2013-09-20 07:47 UTC, Piotr Wilkin
Details
Modified test for column grouping (46.53 KB, text/plain)
2013-09-20 07:51 UTC, Piotr Wilkin
Details

Note You need to log in before you can comment on or make changes to this bug.
Description jxz164 2011-07-29 15:59:09 UTC
Created attachment 27331 [details]
Example xlsx file to reproduce the hang

For a particular xlsx file, the WorkbookFactory.create() call hangs. I have waited an hour and the call still did not return. Please see the code below and attached xlsx file. I am not sure where the hang comes from or why it is so slow. 

package org.apache.poi.ss.examples;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;

import java.io.FileInputStream;
import java.io.IOException;


public class TestWorkbookFactoryCreate {

    public static void main(String[] args) throws IOException, Exception {
        FileInputStream fileIn = null;

        try
            {
                fileIn = new FileInputStream("C:/workspace/TestHang.xlsx");
                Workbook wb = WorkbookFactory.create(fileIn);
                System.out.println("Workbook created");                
            } finally {
                if (fileIn != null)
                    fileIn.close();
            }
    }
    
}
Comment 1 manish.jindal 2011-11-03 19:39:28 UTC
getting the same error for similar code...
Comment 2 Yegor Kozlov 2011-11-04 09:37:33 UTC
The problem is not in POI. It is JDK that can't read the file. The hang can be reproduced with the following code:

  ZipFile zip =  new ZipFile("TestHang.xlsx");

So, something is wrong with the underlying zip archive.  We rely on JDK to open OOXML files and I don't see a problem on the  POI side.

Do you know any details how the problematic file was produced? Version of Excel, OS, etc. Was the file unzipped / modified and zipped back ? 

PS. I'm testing on Windows 7 and JDK 1.6.26. 

Yegor
Comment 3 val235 2012-10-03 04:51:46 UTC
I have the same problem as described. WorkbookFactory.create() hangs with a file I have localy and the sample file provided in this bug report.

I am able to parse the underlying zip files and read their contents without any issue, so the problem is NOT related to ZIP corruption.

try {
	ZipFile zip =  new ZipFile("C:\\TestHang.xlsx");
	List<ZipEntry> entries = (List<ZipEntry>) Collections.list(zip.entries());
			  
	for ( ZipEntry ze:entries){
		System.out.println("-------> "+ze.getName());
		System.out.println("\t -------> "+ze.getComment());
		System.out.println("\t -------> "+ze.getCompressedSize());
		System.out.println("\t -------> "+ze.getCrc());
		System.out.println("\t -------> "+ze.getMethod());
		System.out.println("\t -------> "+ze.getSize());
		System.out.println("\t -------> "+ze.getTime());
				
		InputStream is = zip.getInputStream(ze);
		
		StringWriter writer = new StringWriter(); 
		IOUtils.copy(is, writer);  

		System.out.println(writer.toString());
				
	}
		
} catch (IOException e) {
	e.printStackTrace();
}
Comment 4 val235 2012-10-03 18:49:51 UTC
Ok after some debugging, I can see that the code hangs on some very inffecient looping code centered at ColumnHelper.cleanColumns() and ColumnHelper.addCleanColIntoCols(CTCols, CTCol)

The attached example xlsx for example has some 10000 column definitons in its xml structure. The ridiculous looping code, starts to loop over 10000 records, and then calls some inner loop(s) which have the potential to loop 10000 times themselves, and as well as for every iteration of the loop, it also tries to sort that same list of columns which can have upto 10000 entries. 

There is no infinite loops or anything of the sort. Just some poorly designed iteration code. The call would eventually finish, but who knows how long it will take. The complexity and size of the list being iterated grows exponentially, and after about the first 1100 columns it slow to a crawl, Im guessing the complexity of this looping process is something like n^3

I have worked out some changes to ColumnHelper coupled with Proxy classes for CTCol and CTCols and was able to complete workbook creation in seconds. Will post sample code later. Probably needs a patch in the codebase.
Comment 5 val235 2012-10-08 18:46:49 UTC
Created attachment 29462 [details]
Altered ColumnHelper, that uses the CTCol and CTCols proxies

-slight modification to the ColumnHelper class, so that it uses proxies for classes CTCol and CTCols (also attached) significantly speeding up the looping code that tries to handle column overlaps
Comment 6 val235 2012-10-08 18:49:13 UTC
Created attachment 29463 [details]
proxy for CTCols that maintains max and min values

-original CTCols just maintains a list of CTCol objects, so if u everytime u need to tell the min col and max col ranges, you needed to iterate over all the columns, this is what was so time consuming. This CTColsProxy maintains min and max which is expected to be updated everytime a setMax or setMin is called on the container columns
Comment 7 val235 2012-10-08 18:52:13 UTC
Created attachment 29464 [details]
- proxy for CTCol that maintain a reference to the CTCols its contained in and updates min and max

Original CTCol had no knowledge of the CTCols object its mainted in, ie one directional relationship. This proxy adds a CTColsProxy property which is expected to be the parent CTCols object. Whenever a setMin or setMax is called on the CTCol[Proxy] the CTCols[Proxy] min/max value is also maintained
Comment 8 val235 2012-10-08 18:54:06 UTC
Attached 3 classes, that significantly sped up my processing, example xlsx as well my problematics files now all parse in about a second.

ColumnHelper is a modified version of org.apache.poi.xssf.usermodel.helpers.ColumnHelper while the other two classes, CTColProxy and CTColsProxy are brand new.
Comment 9 Yegor Kozlov 2012-10-11 11:31:41 UTC
Thanks for the patch. I added it to my punch list.

Yegor

(In reply to comment #8)
> Attached 3 classes, that significantly sped up my processing, example xlsx
> as well my problematics files now all parse in about a second.
> 
> ColumnHelper is a modified version of
> org.apache.poi.xssf.usermodel.helpers.ColumnHelper while the other two
> classes, CTColProxy and CTColsProxy are brand new.
Comment 10 Yegor Kozlov 2012-10-12 10:40:24 UTC
The patch needs some work.

3 tests failed:

org.apache.poi.xssf.usermodel.TestXSSFSheet.testGroupUngroupColumn
org.apache.poi.xssf.usermodel.TestXSSFSheet.testSetColumnGroupCollapsed
org.apache.poi.xssf.usermodel.helpers.TestColumnHelper.testAddCleanColIntoCols

Please re-visit your code and ensure that all tests pass or give sensible explanation why the new logic produces different results.

Yegor


(In reply to comment #8)
> Attached 3 classes, that significantly sped up my processing, example xlsx
> as well my problematics files now all parse in about a second.
> 
> ColumnHelper is a modified version of
> org.apache.poi.xssf.usermodel.helpers.ColumnHelper while the other two
> classes, CTColProxy and CTColsProxy are brand new.
Comment 11 Yegor Kozlov 2012-10-12 13:15:39 UTC
*** Bug 53250 has been marked as a duplicate of this bug. ***
Comment 12 Piotr Wilkin 2013-09-20 07:47:52 UTC
Created attachment 30866 [details]
Modified ColumnHelper with efficient algorithm

Modified the algorithm that adds new columns and that cleans the columns. The existing code is absolutely atrocious in terms of algorithmic design, has at least n^4 algorithmic complexity where n is the number of columns, which means that for about 1000 columns the code will execute for over an hour. Changed the algorithm to one that uses sweeping, should drop it to about n log n.
Comment 13 Piotr Wilkin 2013-09-20 07:51:50 UTC
Created attachment 30867 [details]
Modified test for column grouping

Modified the tests for column grouping that relied on old faulty behavior (line 491: collapsing column 10 should not automagically collapse column 9 which is not grouped with it).
Comment 14 sam fercios 2013-11-19 09:15:20 UTC
Hello,

I'm lost... I want to know exactly what I have to do to solve this problem. Should I add the two attachments (attachment 30866 [details] and attachment 30867 [details]) and rebuild a new jar? This problem is not resolved in the latest version of POI?

Please help!

Thank you in advance.
Comment 15 Piotr Wilkin 2013-11-19 10:22:14 UTC
(In reply to sam fercios from comment #14)
> Should I add the two attachments (attachment 30866 [details] and attachment
> 30867 [details]) and rebuild a new jar? 

Yes, exactly. You have to replace the respective files in the POI source. 

> This problem is not resolved in the latest version of POI?

No, apparently fixing a gross inefficiency that happens during every load of an XLS file is not high on the list of priorities.
Comment 16 sam fercios 2013-11-19 16:14:36 UTC
Thank you very much Wilkin. I made amendments and it works now. So I have to change the status of the bug to "resolved".

Hope I can help you one day on any subject :)
Comment 17 Nick Burch 2013-11-19 16:27:32 UTC
Please don't close bugs where the code hasn't yet been committed to SVN
Comment 18 Nick Burch 2013-11-19 16:29:53 UTC
A patch was originally proposed for this bug, but it had issues and couldn't be applied. Since then, another patch has been proposed, but without explicitly answering some of the queries raised against the original bug, and with a status which continued to be "NeedInfo"

It would be great if someone could confirm if the new patch does fix all the problems flagged against the original, and also if someone could try applying the patch(es) and confirm if all the unit tests still pass, behaviour is correct etc. Those two would greatly speed up the review/accept cycle for getting this applied!
Comment 19 sam fercios 2013-11-19 18:29:59 UTC
ouups,so sorry... I was wrong.
Comment 20 sam fercios 2013-11-19 18:38:20 UTC
The problem must be fixed urgently and attached to new POI versions, otherwise if I will migrate from version 3.9 to 3.10, I will have certainly some troubles that can be very dangerous for my future releases.
Comment 21 Piotr Wilkin 2013-11-20 09:22:11 UTC
(In reply to Nick Burch from comment #18)
> It would be great if someone could confirm if the new patch does fix all the
> problems flagged against the original, and also if someone could try
> applying the patch(es) and confirm if all the unit tests still pass,
> behaviour is correct etc. Those two would greatly speed up the review/accept
> cycle for getting this applied!

Let me clarify: 

The original bug report dealt with the WorkbookFactory.create() call hanging on files with many columns. This was later determined to be simply a complexity issue with slow code in the ColumnHelper.

I made a fix that redoes the ColumnHelper with an efficient algorithm (based on sweeping, see http://en.wikipedia.org/wiki/Sweep_line_algorithm) while still preserving the functionality of the original. I also made sure that the new algorithm passes all unit tests. In the one case where the original tests were not passed, I debugged it on a case-by-case basis to make sure the new behavior was correct. Seeing as the relevant testing code has a comment by the lines of "I'm not really sure how this is supposed to work", I do believe that the new behavior is actually the correct one. 

I don't think I have the necessary access level needed to push this onto SVN, which is I left the patch on the NEEDINFO status. Hopefully, if some developer verifies this and pushes this to SVN, it can be changed to RESOLVED.
Comment 22 Nick Burch 2013-11-20 11:25:35 UTC
Sounds like the patch is ready for review, and hopefully applying. I'm tagging as such, so that all POI committers can see it's ready, and hopefully someone will have a chance to look at it shortly!
Comment 23 Cédric Walter 2013-11-22 18:34:58 UTC
i am reviewing the patch now, checking code and that all tests passes
Comment 24 Andreas Beeker 2014-01-26 17:00:31 UTC
Thank you for the patch.
Applied with r1561435
... and because of problems with JDK 1.5, it needed another fix - r1561511