Bug 45260

Summary: Excel file unreadable due to applyFont()
Product: POI Reporter: Werner Schindling <w.schindling>
Component: HSSFAssignee: POI Developers List <dev>
Status: REOPENED ---    
Severity: normal CC: skyfree
Priority: P3    
Version: 3.0-dev   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Attachments: My code for initialising fonts and styles
Different files generated with beta3 and beta5

Description Werner Schindling 2008-06-23 08:36:49 UTC
Hi!

We're having a rather big problem at my company with a moderately large Excel report (6 sheets, 15 columns, maximum 102 rows) using POI.

We do a lot of formatting using about 28 different font styles (arial, 10pt, 12pt, colored, bold, italic ... - half of the fonts are strikeout as we use it to mark old entries). We have up to 6 different fonts in one cell.

Now, usually the POI-generated Excel-file is readable without any problem, but SOMETIMES MS Excel can't open it saying that there's "unreadable content". Our customer needs this 100% working all of the time and I can't figure out the exact reason. It looks like it has something to do with the applyFont() method. Because of this I've built in a switch to leave out text formatting. While the formatted file sometimes can be read, the unformatted ALWAYS can be.

One bugfix that I hoped was to be the solution was to add several blank spaces at the end of each cell String (I thought that corrupt font indices were the reason for the error). This worked for some time but now turned out to not always work.

 I've also tried to make BIFF-dumps of the files but can't open the resulting files with the viewers I've tried out (BIFF-Workbench etc.) ... ?

I hope someone can help me with this, I've been goofing around for ages with this and have become pretty frustrated.

I don't want to add the generated files as an attachment (this is pretty sensitive data) but could mail it to someone who can help me. URGENT HELP is needed here!! :)

Regards, Werner

(btw text formatting on an assembled HSSFRichTextString is quite annoying to do as you can't append to it ... therefore you have to remember the indices for applying the fonts and do that after the result string has been pieced together)

PS opening the file with Open Office always works and saving it with OO makes the file readable with MS Office (I wish I could convince our customer to use OO ...)
Comment 1 Nick Burch 2008-06-23 08:45:25 UTC
(In reply to comment #0)
> PS opening the file with Open Office always works and saving it with OO makes
> the file readable with MS Office (I wish I could convince our customer to use
> OO ...)

One thing to try then is run a BiffViewer on the poi output file, and the oo output file. See if you can spot what open office has done differently (and presumably correctly) with the formatting stuff

Also, can you generate a problem file with dummy data in? Without a test file to work against for unit tests etc, I can't see anyone being able to help 

Comment 2 Werner Schindling 2008-06-24 01:31:51 UTC
(In reply to comment #1)

Hi!

Thanks for the quick reply. :)

> One thing to try then is run a BiffViewer on the poi output file, and the oo
> output file. See if you can spot what open office has done differently (and
> presumably correctly) with the formatting stuff

The POI-generated file has only 20 font records in it (some created in the code seem to be missing) while the open office file has 32.

Also, the POI file has 2 format records while the open office one has 8.


> Also, can you generate a problem file with dummy data in? Without a test file
> to work against for unit tests etc, I can't see anyone being able to help 
> 

I'll look into this in a moment.
Comment 3 Werner Schindling 2008-06-24 02:03:23 UTC
I'm adding two BIFF-Dumps, one of the corrupted POI-generated file and one of the same file resaved with Open Office 2.2.

corrupt POI-generated file dump:
http://oradb02.local.netconomy.net/doris_andritz_mirror/temp/POIGeneratedFile.xls.BD5

working open office file dump:
http://oradb02.local.netconomy.net/doris_andritz_mirror/temp/OpenOfficeSavedFile.xls.BD5

I made both files using BIFF-Workbench.
Comment 4 Werner Schindling 2008-06-24 02:30:45 UTC
Ok, here comes a corrupt excel file (though not the one from the dump I've posted before) with masked data:

http://oradb02.local.netconomy.net/doris_andritz_mirror/temp/corruptPOIfile.xls
Comment 5 Werner Schindling 2008-06-25 00:40:48 UTC
Created attachment 22170 [details]
My code for initialising fonts and styles
Comment 6 Werner Schindling 2008-06-25 00:42:18 UTC
Nick, or someone else, do you have an idea why some of the fonts are missing in the corrupted file?

Maybe there's something wrong in how i create the fonts and styles (see attachment above).
Comment 7 Nick Burch 2008-06-26 08:34:47 UTC
The font creation code looks fine, though being only a code snippet I can't say anything else

With such large files, finding the cause probably won't be quick

I take it that if you apply the same fonts to a smaller file, it all works fine? Have you noticed any particular file complexity point where the problem manifests itself? Can you produce a file that works just fine, but adding another few rows or another sheet causes it to break?

(The more information you can supply on this, especially things like two very similar files where one works and one doesn't the better. It's much easier, and hence more likely, to fix a problem of the type "fonts not correctly written once 10 ???? Records present" than "fonts sometimes fail on my really large file")
Comment 8 Ron Thompson 2008-09-27 23:42:05 UTC
I ran into the same problem, and determined that it was only happening on files larger than 5MB, but sometimes large files of over 7MB worked fine and I couldn't narrow it down further.  Depending on the data set, it either happened or did not, but once it did it was 100% reproducible.  

Reducing the number of applyFont calls would cure the problem in the offending sheets.  

It only seems to happen when you mix fonts within a single RichText object. 

Lastly, mixing fonts within a single RTS seems to bloat the XLS file incredibly.  If I don't apply the fonts to the single column that has the mixed fonts, the XLS reduces from 7.8MB to 1.5MB.
Comment 9 Yegor Kozlov 2009-02-26 00:43:44 UTC
Please try the latest 3.5-beta5. 

There was a bug in how applyFont handled overlapping regions and regions with formatting applied twice. 

The fix was submitted in Bug 40520 and applied in 0

Yegor
Comment 10 Werner Schindling 2009-02-26 02:12:26 UTC
Great!

However, with Beta3.5 there's only 4 or 5 fonts im my Excel output file now instead of 30. Is the formatting run cancelled as soon as there's a problem with overlapping regions or something like that?

I need to switch back to Beta3.3.
Comment 11 Yegor Kozlov 2009-02-26 04:20:52 UTC
(In reply to comment #10)
> Great!
> 
> However, with Beta3.5 there's only 4 or 5 fonts im my Excel output file now
> instead of 30. Is the formatting run cancelled as soon as there's a problem
> with overlapping regions or something like that?
> 
> I need to switch back to Beta3.3.

Sounds like a bug. 

If I had it right, 3.5-beta3 and earlier versions produce files that can't  be opened by Excel. After read-write using OpenOffice the file becomes readable. 

With 3.5-beta5 the output can be read by Excel, but some fonts are missing. 

Can you upload a file with a dummy data generated by 3.5-beta5? Ideally, it would be the same file as you posted previously. 

Yegor
Comment 12 Werner Schindling 2009-02-26 04:31:18 UTC
Created attachment 23312 [details]
Different files generated with beta3 and beta5

Of course I can.
Since I found out the generated report is open to public anyway, here are two files, generated with beta 3 (has all fonts) and beta 5 (has only some of the fonts).
Comment 13 Yegor Kozlov 2009-02-27 06:25:13 UTC
Is there anything special in the way you create / set rich text strings? Can you post the code you used to generated the attached files? 

Compare two code snippets:

        HSSFRichTextString str1 = new HSSFRichTextString("Andritz AG, Graz Short Name: AAG");
        str1.applyFont(0, 16, FONT_CELL_HEADER_BLUE.getIndex());
        str1.applyFont(17, 29, FONT_ITALIC.getIndex());
        str1.applyFont(29, 32, FONT.getIndex());
        cell.setCellValue(str1);


        HSSFRichTextString str2 = new HSSFRichTextString("Andritz AG, Graz Short Name: AAG");
        cell2.setCellValue(str2);
        str2.applyFont(0, 16, FONT_CELL_HEADER_BLUE.getIndex());
        str2.applyFont(17, 29, FONT_ITALIC.getIndex());
        str2.applyFont(29, 32, FONT.getIndex());

They are not equivalent. The second one is incorrect and can produce wrong results. Once a HSSFRichTextString  is assigned, it is supposed to be immutable. I would say in the latter case the code should throw IllegalStateException. We don't do that only because of backwards compatibility. 

One of the problems with beta5fileMissingFonts.xls is that the internal cache of strings is incorrect. There are duplicate entries and as result, beta5fileMissingFonts.xls is 504 KB while beta3fileAllFonts.xls is just 331 KB. 
Internally, Excel stores all strings in a shared string table which is shared across workbook. Text cells just contain indexes into the string table as the value of a cell, instead of the full string. If the string cache is broken, it is very likely to be the reason why some fonts are missing. 

Regards,
Yegor
Comment 14 Werner Schindling 2009-03-02 01:03:59 UTC
(In reply to comment #13)
> Is there anything special in the way you create / set rich text strings? Can
> you post the code you used to generated the attached files? 
> 
> Compare two code snippets:
> 
>         HSSFRichTextString str1 = new HSSFRichTextString("Andritz AG, Graz
> Short Name: AAG");
>         str1.applyFont(0, 16, FONT_CELL_HEADER_BLUE.getIndex());
>         str1.applyFont(17, 29, FONT_ITALIC.getIndex());
>         str1.applyFont(29, 32, FONT.getIndex());
>         cell.setCellValue(str1);
> 
> 
>         HSSFRichTextString str2 = new HSSFRichTextString("Andritz AG, Graz
> Short Name: AAG");
>         cell2.setCellValue(str2);
>         str2.applyFont(0, 16, FONT_CELL_HEADER_BLUE.getIndex());
>         str2.applyFont(17, 29, FONT_ITALIC.getIndex());
>         str2.applyFont(29, 32, FONT.getIndex());
> 
> They are not equivalent. The second one is incorrect and can produce wrong
> results. Once a HSSFRichTextString  is assigned, it is supposed to be
> immutable. I would say in the latter case the code should throw
> IllegalStateException. We don't do that only because of backwards
> compatibility. 
> 
> One of the problems with beta5fileMissingFonts.xls is that the internal cache
> of strings is incorrect. There are duplicate entries and as result,
> beta5fileMissingFonts.xls is 504 KB while beta3fileAllFonts.xls is just 331 KB. 
> Internally, Excel stores all strings in a shared string table which is shared
> across workbook. Text cells just contain indexes into the string table as the
> value of a cell, instead of the full string. If the string cache is broken, it
> is very likely to be the reason why some fonts are missing. 
> 
> Regards,
> Yegor

Hi Yegor,

thanks for taking the time on this!
In my code setCellValue(text) is called after all the fonts have been applied. The biggest problem for me is that you just can't append to a HSSFRichTextString. I have to run 2 identical loops, one for piecing the String together and then another one to apply the fonts to the finished HSSFRichTextString.
Here's the code ;) ... (I've tried to remove unimportant lines)


StringBuffer buffer = new StringBuffer();
		
		for (POICellContent content : this.content){
			
			
			for (int i = 0; i < content.indent; i++){
				// insert leading whitespaces
				buffer.append(" ");
			}
			// insert content
			buffer.append(content.toString());
			
			if (this.companyReport.showHistory && content.getInterval() != null){
				String dateString = companyReport.createDateString(content.getInterval());
				if (!"".equals(dateString)){
					int dateLength = dateString.length();
					int textLength = 0;
					if (buffer.indexOf("\n") != - 1){
						textLength = (buffer.length() - 1) - buffer.lastIndexOf("\n");
					} else textLength = buffer.length();
					
					// factor: variable to determine, whether the date string fits in the same line
					// or a new line is needed
					double fontHeight = content.getFont().getFontHeightInPoints();
					if (fontHeight == 9) fontHeight = SMALL_FONT_MAGNIFICATION;
					double widthFactor = CompanyReportXlsGenerator.LINE_BREAK_FACTOR * fontHeight;
					double approxColumnWidth = column.getWidth() / widthFactor;
					boolean makeNewLine = content.dateWrap && textLength < approxColumnWidth 
							&& textLength + dateLength + 1 > approxColumnWidth;


					if (makeNewLine){
						buffer.append(CompanyReportXlsGenerator.NEW_LINE);
						if (content.dateIndent > 0){
							for (int i = 0; i < content.dateIndent; i++){
								dateString = " " + dateString;
							}
						} else if (content.indent > 0){
							for (int i = 0; i < content.indent; i++){
								dateString = " " + dateString;
							}
						}
						buffer.append(dateString);
					} else {
						buffer.append(" " + dateString);
					}	
				}
			}
		}

		HSSFRichTextString resultString = new HSSFRichTextString(buffer.toString());
		// build the cell String again to apply fonts
		buffer = new StringBuffer();
		
		int end = 0;
		int start = 0;
		for (POICellContent content : this.content){
			
			start = end;
			end = start + content.toString().length();
			for (int i = 0; i < content.indent; i++) {
				// insert leading whitespaces
				buffer.append(" ");
			}
			end += content.indent;
			
			// insert content
			buffer.append(content.toString());
			HSSFFont font = content.getFont();
			if (!font.getStrikeout() && row.getEntry().isStrikeOut() && content.strikeOutEnabled){
				// use strikeOut Font
				font = companyReport.getStrikeOutFont(content.getFont());
			}
			HSSFFont indentFont = companyReport.getNoStrikeOutFont(font);
			if ((companyReport.useFormatting || row.getEntry().isStrikeOut()) && content.indent > 0){			
				resultString.applyFont(start, start + content.indent, indentFont);
				resultString.applyFont(start + content.indent, end, font);
			} else if (companyReport.useFormatting || row.getEntry().isStrikeOut()){
				resultString.applyFont(start, end, font);
			}
			if (this.companyReport.showHistory && content.getInterval() != null){
				String dateString = companyReport.createDateString(content.getInterval());
				if (!"".equals(dateString)){
					int dateLength = dateString.length();								
					if (dateLength > 0){
						start = end;
						end = end + dateLength;
						// + 1 because of whitespace/new line
						end++;
	
						int textLength =  buffer.length();
						if (buffer.indexOf("\n") != - 1){
							textLength = (buffer.length() - 1) - buffer.lastIndexOf("\n");
						}
						
						
							if (content.dateIndent > 0){
								for (int i = 0; i < content.dateIndent; i++){
									dateString = " " + dateString;
									end++;
								}
							} else if (content.indent > 0){
								for (int i = 0; i < content.indent; i++){
									dateString = " " + dateString;
									end++;
								}
							}
							buffer.append(dateString);
						
						HSSFFont dateFont = this.companyReport.FONT_DATE;
						if (font.getStrikeout() || row.getEntry().isStrikeOut()) {
							dateFont = this.companyReport.FONT_DATE_STRIKE;
						}
						if (companyReport.useFormatting || row.getEntry().isStrikeOut()){
							try {
								if (makeNewLine && content.dateIndent > 0){
									resultString.applyFont(start, start + content.dateIndent, 
											companyReport.FONT_DATE);
									resultString.applyFont(start + content.dateIndent, end, dateFont);	
								} else if (makeNewLine && content.indent > 0){
									resultString.applyFont(start, start + content.indent, 
											companyReport.FONT_DATE);
									resultString.applyFont(start + content.indent, end, dateFont);
								} else {
									resultString.applyFont(start, end, dateFont);
								}
							} catch (IllegalArgumentException e){
								logger.error("Company Report error: font indices messed up: " + e.getMessage());
							}
						}
					}
				}
			}
		}
		
		return resultString;