52211 – OpenXML4JRuntimeException when opening xlsx files on mainframe

Summary: OpenXML4JRuntimeException when opening xlsx files on mainframe

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	XSSF
Version:	3.8-dev
Hardware:	PC other

Importance:	P2 normal
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-11-18 20:57 UTC by jxz164
Modified:	2012-10-04 11:53 UTC
CC List:	0 users

Attachments
Test xlsx file (8.14 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) 2011-11-18 21:13 UTC, jxz164

Description jxz164 2011-11-18 20:57:49 UTC

I am using POI 3.8 beta 5 (my own build from 10/06) on a mainframe to read Excel files. Reading and writing xls files works, but reading xlsx files fails with the following stack trace.

Exception in thread "main" org.apache.poi.openxml4j.exceptions.OpenXML4JRuntimeException: Package.init() : this exception should never happen, if you read this message please send a mail to the developers team. : The specified content type 'application/vnd.openxmlformats-package.core-properties+xml' is not compliant with RFC 2616: malformed content type.
	at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:166)
	at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:228)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:67)
	at TestWorkbookFactoryCreate.main(TestWorkbookFactoryCreate.java:16)

Here is the output of "java -version".

java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build pmz31dev-20090707 (SR10 ))
IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 z/OS s390-31 j9vmmz3123-20090707 (JIT enabled)
J9VM - 20090706_38445_bHdSMr
JIT  - 20090623_1334_r8
GC   - 200906_09)
JCL  - 20090705

Output of "uname -a"
OS/390 ABIZOS08 21.00 03 2818

Test code

import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;

import java.io.FileInputStream;
import java.io.IOException;


public class TestWorkbookFactoryCreate {

  public static void main(String[] args) throws IOException, Exception {
    FileInputStream fileIn = null;

    try
      {
        fileIn = new FileInputStream("utf8.xlsx");
        XSSFWorkbook wb = (XSSFWorkbook) WorkbookFactory.create(fileIn);
        System.out.println("Workbook created");
      } finally {
        if (fileIn != null)
          fileIn.close();
      }
  }
    
}

Comment 1 Nick Burch 2011-11-18 20:59:30 UTC

Could you please attach the problematic file too?

Also, do you know how the file was generated?

Comment 2 jxz164 2011-11-18 21:13:03 UTC

Created attachment 27970
Test xlsx file

Comment 3 jxz164 2011-11-18 21:13:58 UTC

Any xlsx file created by Excel 2007 has this problem. I have attached a sample file.

Comment 4 jxz164 2011-11-18 21:20:55 UTC

I did more testing on the mainframe and found that it works if I pass the -Dfile.encoding=UTF-8 option.

$ java -Dfile.encoding=UTF-8 TestWorkbookFactoryCreate
Workbook created

$ java TestWorkbookFactoryCreate
Exception in thread "main" org.apache.poi.openxml4j.exceptions.OpenXML4JRuntimeException: Package.init() : this exception should never happen, if you read this message please send a mail to the developers team. : The specified content type 'application/vnd.openxmlformats-package.core-properties+xml' is not compliant with RFC 2616: malformed content type.
	at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:166)
	at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:228)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:67)
	at TestWorkbookFactoryCreate.main(TestWorkbookFactoryCreate.java:16)

So passing -Dfile.encoding=UTF-8 solves my problem. The default encoding on the mainframe is EBCDIC, so I have to override it with UTF-8. I filed this as a POI bug because the error message asked me to.
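
To check which default encoding a JVM has actually picked up, and whether that encoding maps ASCII text to ASCII bytes, something like the following can be run with and without the -Dfile.encoding option. This is only a diagnostic sketch: the class name DefaultCharsetCheck and the helper isAsciiCompatible are made up for illustration and are not part of POI, and the code assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetCheck {

    // Hypothetical helper: true if the charset encodes a plain ASCII sample
    // to exactly the bytes that US-ASCII would produce.
    static boolean isAsciiCompatible(Charset cs) {
        String sample = "application/xml";
        return Arrays.equals(sample.getBytes(cs),
                             sample.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();
        System.out.println("Default charset: " + def.name());
        System.out.println("ASCII-compatible: " + isAsciiCompatible(def));
    }
}
```

On a JVM whose default encoding is an EBCDIC code page, the second line should report false, which is the situation in which the failure above appears.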

Comment 5 Nick Burch 2011-11-18 21:28:11 UTC

Hmm, we must have an encoding assumption somewhere in the OPC code then.

The odd thing is that the error message comes from the ContentType class, which hard-codes the encoding to US-ASCII, so I'm not sure where the issue is.

Comment 6 jxz164 2011-11-21 19:33:06 UTC

I hope to get this working without passing the -Dfile.encoding=UTF-8 option when calling java.

Comment 7 Nick Burch 2011-11-21 21:16:56 UTC

If you're able to, start your JVM with remote debugging enabled and attach a remote debugger (e.g. Eclipse) to it. Then step through the problem code and see if you can work out which incorrectly encoded value is breaking things.

(Nothing springs to mind as wrong from looking at the source code, so it's likely something subtle)

Comment 8 Constantin 2012-09-28 08:44:33 UTC

Hello,

We are using the POI API (stable 3.8) on a system whose default encoding is ibm500.
We got the same error when trying to create a Workbook using WorkbookFactory.create(ByteArrayInputStream bais).

We found that the problem lies in the method
org.apache.poi.openxml4j.opc.internal.ContentType.ContentType(String contentType)

At line 139, the following code is called:
contentTypeASCII = new String(contentType.getBytes(), "US-ASCII");

String.getBytes() with no argument returns the bytes in the default system encoding (here ibm500). Those bytes are then decoded as US-ASCII. This cannot work when the default encoding is not ASCII-compatible.
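
The round trip described above can be reproduced in isolation. The sketch below is not POI code: the class EncodingRoundTrip and the method roundTrip are hypothetical, and the platform default is passed in explicitly so the effect shows up on any machine. It assumes Java 7 or later for StandardCharsets. IBM500 is not guaranteed to be present in every JDK build, so UTF-16BE is used as a stand-in for "not ASCII-compatible" when it is missing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    // Mimics new String(contentType.getBytes(), "US-ASCII"), with the
    // platform default encoding made an explicit parameter.
    static String roundTrip(String contentType, Charset platformDefault) {
        byte[] bytes = contentType.getBytes(platformDefault);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String ct = "application/vnd.openxmlformats-package.core-properties+xml";

        // Assumption: fall back to UTF-16BE if this JDK has no IBM500 charset.
        Charset nonAscii = Charset.isSupported("IBM500")
                ? Charset.forName("IBM500")
                : StandardCharsets.UTF_16BE;

        // Same string comes back when the "default" is ASCII-compatible.
        System.out.println(ct.equals(roundTrip(ct, StandardCharsets.UTF_8)));
        // A different string comes back when it is not.
        System.out.println(ct.equals(roundTrip(ct, nonAscii)));
    }
}
```

With an ASCII-compatible default the content type survives unchanged; with an EBCDIC-style default the result no longer equals the input, so any later validation sees a mangled string.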

So we wonder why this conversion is done at all.

We deleted the line and just put following code:
contentTypeASCII = contentType;

Afterwards it worked fine.

Regards
Constantin

Comment 9 Yegor Kozlov 2012-10-01 13:20:52 UTC

It is very likely that your hypothesis is correct and that this line of code causes the problem.

The problematic piece of code has existed since POI-3.5, when OpenXml4j was contributed to Apache POI.
I guess the intention was to ensure that the string being parsed and validated is in the ASCII encoding.
This "worked" for years, but the conversion does not make sense: if the input argument contains characters above ASCII, they are converted to U+FFFD (the Unicode replacement character) and the subsequent validation against the patternMediaType regex fails anyway.

Consider the following examples:

(a) new ContentType("text/\u007E") 
(b) new ContentType("text/\u0080") 

The first case (a) works because all characters in the input string are ASCII and the conversion does not change the input string.
The second case (b) fails whether or not the input argument is re-converted to US-ASCII. If you apply your fix (contentTypeASCII=contentType), the regex check at line 146 fails. The current code first converts the input string to "text/\uFFFD", and then the regex fails.
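
The two cases can be checked with a small stand-alone sketch. Note the assumptions: the pattern below is a simplified type/subtype check built from the RFC 2616 token characters, not POI's actual patternMediaType, and the class MediaTypeCheck and its methods are invented for illustration. ISO-8859-1 stands in for an ASCII-compatible platform default so the result does not depend on the machine.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class MediaTypeCheck {

    // Assumption: simplified RFC 2616 token, NOT POI's patternMediaType.
    private static final String TOKEN = "[!#$%&'*+\\-.^_`|~0-9A-Za-z]+";
    private static final Pattern MEDIA_TYPE =
            Pattern.compile("^(" + TOKEN + ")/(" + TOKEN + ")$");

    static boolean isValid(String contentType) {
        return MEDIA_TYPE.matcher(contentType).matches();
    }

    // The old conversion, with ISO-8859-1 standing in for the platform default.
    static String toAscii(String contentType) {
        byte[] bytes = contentType.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String a = "text/\u007E";
        String b = "text/\u0080";

        // (a): valid with or without the conversion.
        System.out.println(isValid(a) + " " + isValid(toAscii(a)));
        // (b): invalid with or without the conversion.
        System.out.println(isValid(b) + " " + isValid(toAscii(b)));
    }
}
```

Under these assumptions the conversion never changes the outcome of the validation on an ASCII-compatible platform, which is consistent with the reasoning above that it can be dropped.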

So I agree that this conversion is redundant and can be removed. The fix is coming soon.

Regards,
Yegor


Comment 10 Yegor Kozlov 2012-10-04 11:53:19 UTC

Should be fixed in r1394001. 

Yegor