50779 – RecordFormatException Not enough data (1) to read requested (2) bytes

Summary: RecordFormatException Not enough data (1) to read requested (2) bytes

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	HSSF
Version:	3.7-FINAL
Hardware:	PC Windows XP

Importance:	P2 normal with 1 vote
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-02-15 02:24 UTC by apptaro
Modified:	2011-03-13 23:00 UTC
CC List:	1 user

Attachments
UnicodeStringFailCase1 (23.00 KB, application/vnd.ms-excel) 2011-02-15 02:59 UTC, apptaro
UnicodeStringFailCase2 (23.00 KB, application/vnd.ms-excel) 2011-02-15 03:00 UTC, apptaro
junit test to demonsrate the bug (71.52 KB, application/octet-stream) 2011-03-07 09:20 UTC, Yegor Kozlov

Description apptaro 2011-02-15 02:24:09 UTC

The following error occurs when reading certain Excel files saved with Excel 2003:

Exception in thread "main"
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
       at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
       at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:300)
       at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:270)
       at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:236)
       at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:442)
       at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:263)
       at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:188)
       at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:305)
       at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:286)
       at aflat4.apps.adr.POITest.main(POITest.java:18)
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (1) to read requested (2) bytes
       at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
       at org.apache.poi.hssf.record.RecordInputStream.readUShort(RecordInputStream.java:267)
       at org.apache.poi.util.StringUtil.readUnicodeLE(StringUtil.java:277)
       at org.apache.poi.hssf.record.common.UnicodeString$ExtRst.<init>(UnicodeString.java:172)
       at org.apache.poi.hssf.record.common.UnicodeString.<init>(UnicodeString.java:438)
       at org.apache.poi.hssf.record.SSTDeserializer.manufactureStrings(SSTDeserializer.java:55)
       at org.apache.poi.hssf.record.SSTRecord.<init>(SSTRecord.java:250)
       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
       at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
       at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
       at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
       ... 9 more

Comment 1 apptaro 2011-02-15 02:31:23 UTC

This error occurs with some Excel files that contain many Unicode strings with phonetic data. Details are described here:
http://thread.gmane.org/gmane.comp.jakarta.poi.user/16008/focus=16077

I have an Excel file that causes the error, but I can't post it here because it is confidential. I'm trying to create a test file that reproduces the issue.

Comment 2 apptaro 2011-02-15 02:59:39 UTC

Created attachment 26658
UnicodeStringFailCase1

Comment 3 apptaro 2011-02-15 03:00:40 UTC

Created attachment 26659
UnicodeStringFailCase2

Comment 4 apptaro 2011-02-15 03:09:13 UTC

Two test files are attached. Both were created in Japanese Excel 2003.

UnicodeStringFailCase1.xls produces the original error. This is the case where a CONTINUE record appears in ExtRst and splits the two bytes of a Unicode character.

UnicodeStringFailCase2.xls produces the slightly different error below. This is the case where a CONTINUE record appears in PhRun and splits the two bytes of an unsigned short value.

Exception in thread "main" org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
       at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
       at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:300)
       at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:270)
       at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:236)
       at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:442)
       at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:263)
       at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:188)
       at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:305)
       at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:286)
       at aflat4.apps.adr.POITest.main(POITest.java:18)

Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (1) to read requested (2) bytes
       at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
       at org.apache.poi.hssf.record.RecordInputStream.readUShort(RecordInputStream.java:267)
       at org.apache.poi.hssf.record.common.UnicodeString$PhRun.<init>(UnicodeString.java:309)
       at org.apache.poi.hssf.record.common.UnicodeString$PhRun.<init>(UnicodeString.java:297)
       at org.apache.poi.hssf.record.common.UnicodeString$ExtRst.<init>(UnicodeString.java:178)
       at org.apache.poi.hssf.record.common.UnicodeString.<init>(UnicodeString.java:438)
       at org.apache.poi.hssf.record.SSTDeserializer.manufactureStrings(SSTDeserializer.java:55)
       at org.apache.poi.hssf.record.SSTRecord.<init>(SSTRecord.java:250)
       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
       at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
       at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
       at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
       ... 9 more

Comment 5 Yegor Kozlov 2011-03-07 09:18:34 UTC

Interesting. So far we have assumed that for primitive types (short, int, long, etc.) a CONTINUE record break always occurs at a type boundary. Your attachments clearly demonstrate that this is not always so: a CONTINUE break can fall in the middle of a primitive type.

I know how to fix it, but I'm not sure whether this behavior should be the default or applied only to this particular case.

Initialization of BIFF records sits on top of the RecordInputStream class, which greedily reads the primitive types. To handle CONTINUE properly it needs to read byte by byte and then make sense of the data it has read. Something like this:
        
        // Current version. Fails if a CONTINUE break falls between the two bytes.
        public int readUShort() {
            checkRecordPosition(LittleEndian.SHORT_SIZE);
            _currentDataOffset += LittleEndian.SHORT_SIZE;
            return _dataInput.readUShort();
        }

        // Corrected. readUByte() rolls over to the CONTINUE record if necessary
        // and returns an unsigned value, so neither byte is sign-extended
        // (readByte() returns a signed byte and would corrupt values >= 0x80).
        public int readUShort() {
            int ch1 = readUByte();
            int ch2 = readUByte();
            return (ch2 << 8) + (ch1 << 0);
        }


Note that there is at least one case where readShort() must be greedy: for double-byte characters a CONTINUE record break MUST occur at the double-byte character boundary.
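
For illustration, here is a minimal, self-contained sketch of the byte-by-byte idea. It is not POI code: ChunkedInput is a hypothetical stand-in for RecordInputStream, and each byte array models the body of one record (the first record followed by its CONTINUE records). It only shows that a little-endian unsigned short assembled from two single-byte reads survives a break between its two bytes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a record stream (not a POI class): each byte[]
// models the body of one record, i.e. the first record and its CONTINUEs.
final class ChunkedInput {
    private final List<byte[]> chunks;
    private int chunk = 0;   // index of the current chunk
    private int offset = 0;  // read position inside the current chunk

    ChunkedInput(List<byte[]> chunks) {
        this.chunks = chunks;
    }

    // Reads one unsigned byte, rolling over to the next chunk when the
    // current one is exhausted.
    int readUByte() {
        while (chunk < chunks.size() && offset >= chunks.get(chunk).length) {
            chunk++;
            offset = 0;
        }
        if (chunk >= chunks.size()) {
            throw new IllegalStateException("No more data");
        }
        return chunks.get(chunk)[offset++] & 0xFF;
    }

    // Little-endian unsigned short built from two single-byte reads, so a
    // chunk boundary between the two bytes is harmless.
    int readUShort() {
        int lo = readUByte();
        int hi = readUByte();
        return (hi << 8) | lo;
    }
}

public class ContinueSplitDemo {
    public static void main(String[] args) {
        // 0x3042 stored little-endian, with the break between its two bytes.
        ChunkedInput in = new ChunkedInput(
                Arrays.asList(new byte[] {0x42}, new byte[] {0x30}));
        System.out.println(Integer.toHexString(in.readUShort())); // prints 3042
    }
}
```

The mask with 0xFF matters: without it a byte such as 0x80 would be sign-extended and the combined value would be wrong.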

Yegor

Comment 6 Yegor Kozlov 2011-03-07 09:20:36 UTC

Created attachment 26740
junit test to demonsrate the bug

to be included in the poi test collection...

Comment 7 Yegor Kozlov 2011-03-11 05:12:54 UTC

Fixed in r1080496, junit added

My previous comment was not quite correct; I should have read the poi-user thread more thoroughly.

The fix applies only to the phonetic data, which does seem to be special: it can contain a CONTINUE break between the two bytes of a Unicode character or of a 'short' value.

The trick is to pass a decorated LittleEndianInput to the ExtRst constructor; the decorated instance properly handles CONTINUE breaks in the middle of primitive data types.
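
A rough sketch of the decorator idea follows. It is not the code committed in r1080496. LeInput, GreedyInput and SplitTolerantInput are hypothetical names, and LeInput is a simplified stand-in for LittleEndianInput with only two methods. The greedy reader fails when a short straddles two chunks, while the decorator rebuilds the short from single-byte reads of the same underlying input.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for LittleEndianInput (not the POI interface).
interface LeInput {
    int readUByte();
    int readUShort();
}

// Models the greedy reader: a short must fit entirely in the current chunk.
final class GreedyInput implements LeInput {
    private final List<byte[]> chunks;
    private int chunk = 0;
    private int offset = 0;

    GreedyInput(List<byte[]> chunks) {
        this.chunks = chunks;
    }

    // Moves to the next chunk once the current one is fully consumed.
    private void rollOver() {
        while (chunk < chunks.size() && offset >= chunks.get(chunk).length) {
            chunk++;
            offset = 0;
        }
        if (chunk >= chunks.size()) {
            throw new IllegalStateException("No more data");
        }
    }

    public int readUByte() {
        rollOver();
        return chunks.get(chunk)[offset++] & 0xFF;
    }

    public int readUShort() {
        rollOver();
        int remaining = chunks.get(chunk).length - offset;
        if (remaining < 2) {
            throw new IllegalStateException(
                    "Not enough data (" + remaining + ") to read requested (2) bytes");
        }
        int lo = chunks.get(chunk)[offset++] & 0xFF;
        int hi = chunks.get(chunk)[offset++] & 0xFF;
        return (hi << 8) | lo;
    }
}

// Decorator: delegates single-byte reads and composes shorts from them,
// so a break in the middle of a short no longer matters.
final class SplitTolerantInput implements LeInput {
    private final LeInput delegate;

    SplitTolerantInput(LeInput delegate) {
        this.delegate = delegate;
    }

    public int readUByte() {
        return delegate.readUByte();
    }

    public int readUShort() {
        int lo = delegate.readUByte();
        int hi = delegate.readUByte();
        return (hi << 8) | lo;
    }
}

public class DecoratorDemo {
    // 0x0201 stored little-endian, split between two chunks.
    static List<byte[]> sample() {
        return Arrays.asList(new byte[] {0x01}, new byte[] {0x02});
    }

    public static void main(String[] args) {
        try {
            new GreedyInput(sample()).readUShort();
        } catch (IllegalStateException e) {
            System.out.println("greedy: " + e.getMessage());
        }
        LeInput decorated = new SplitTolerantInput(new GreedyInput(sample()));
        System.out.println("decorated: " + decorated.readUShort()); // decorated: 513
    }
}
```

Wrapping only the input handed to the phonetic-data parser keeps the greedy behavior everywhere else, which matches the intent of limiting the fix to this case.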

Yegor

Comment 8 apptaro 2011-03-13 23:00:03 UTC

As the reporter, I built r1080496, tested it, and confirmed that the bug is resolved. Thank you for fixing it!