Bug 12345

Summary: HSSF api reads extra column and extra rows from an excel file
Product: POI Reporter: Elvira Gurevich <elvira_gurevich>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED INVALID    
Severity: critical    
Priority: P1    
Version: 2.0-dev   
Target Milestone: ---   
Hardware: Other   
OS: All   
Attachments: excel file
test case for excel file previously attached. picks up cells at rows 9,10,11

Description Elvira Gurevich 2002-09-05 20:46:21 UTC
using last night's build 
reading an xls file, row.getLastCellNum() returns one over.

For example, if there is only one cell, the first will be 0, and the last will 
be 1, which makes 2 alltogether. If it is by design, it should be documented. 

with this particular xls file (attached), it also reads non-existent rows.
Code follows:

	 public static final void main(String[] args)
	 {
		  if(args==null || args.length==0)
		  {
				System.out.println("enter name of excel file");
				return;
		  }

		  String filename = args[0];
		  try
		  {
				HSSFWorkbook w = new HSSFWorkbook(new 
FileInputStream(filename));
				for (int sheet = 0; sheet < w.getNumberOfSheets
(); sheet++)
				{
					 String sheetName;
					 HSSFSheet s = w.getSheetAt(sheet); // 
get the sheet from the workbook
					 sheetName = w.getSheetName(sheet);	
	// get the name of the sheet
					 int firstRowNumber= s.getFirstRowNum();
					 int lastRowNumber= s.getLastRowNum();
					 int rowCount = 
s.getPhysicalNumberOfRows();
					 System.out.println
("sheet "+sheetName+" first row "+ firstRowNumber+", last row "+ 
lastRowNumber); 
					 System.out.println
("sheet "+sheetName+" has "+ rowCount+" rows"); 
					 if (rowCount==0)
					 {
						  continue;
					 }

					 for (int row=firstRowNumber; 
row<=lastRowNumber; row++)
					 {
						  HSSFRow r = s.getRow(row);
						  if (r==null)	  // if there 
is nothing in a row, it comes back as null
						  {
								continue;
						  }

						  HSSFCell c = null;
						  int firstCellNumber= 
r.getFirstCellNum();
						  int lastCellNumber= 
r.getLastCellNum();
						  int colCount= lastCellNumber -
 firstCellNumber+1;
						  System.out.println
("row "+row+": first cell number: "+firstCellNumber+
										
			", lastCellNumber: "+lastCellNumber +
										
			", column count: "+colCount);

						  String namesArray[] = new 
String[colCount];
						  for (int 
cell=firstCellNumber; cell<=lastCellNumber; cell++)
						  {
							
	//System.out.println("row  "+row+", cell "+cell); 
								String 
cellValue = null;
								int nameIndex = 
lastCellNumber-cell;
								namesArray
[nameIndex] = makeName(cell);
								String cTitle = 
namesArray[nameIndex];
								c = r.getCell
((short)cell);
								if (c!=null)
	 // if there is nothing in a cell, it comes back as null
								{
									 //trace
(XDState.DEBUG,"Excel Preparser: row  "+row+", 
cell "+cell+": "+c+",type "+c.getCellType()); 
									 switch 
(c.getCellType())
									 {
									 case
(HSSFCell.CELL_TYPE_NUMERIC):
									
	  {
										
		double d = c.getNumericCellValue();
										
		DecimalFormat formatter = new DecimalFormat();  // default 
rounds to 3 decimal digits
										
		cellValue = formatter.format(d);
										
		break;
									
	  }
									 case
(HSSFCell.CELL_TYPE_BOOLEAN):
									
	  {
										
		cellValue = new Boolean(c.getBooleanCellValue()).toString();
										
		break;
									
	  }
									 case
(HSSFCell.CELL_TYPE_STRING):
									
	  {
										
		cellValue = c.getStringCellValue();
										
		break;
									
	  }
									 case
(HSSFCell.CELL_TYPE_FORMULA):
									
	  {
										
		double d = c.getNumericCellValue();
										
		DecimalFormat formatter = new DecimalFormat();  // default 
rounds to 3 decimal digits
										
		cellValue = formatter.format(d);
										
		break;
									
	  }
									 case
(HSSFCell.CELL_TYPE_ERROR):
									
	  {
										
		cellValue = Byte.toString(c.getErrorCellValue());
										
		break;
									
	  }
									 case
(HSSFCell.CELL_TYPE_BLANK):
									
	  {
										
		cellValue = c.getStringCellValue();
										
		break;
									
	  }
									 
default:
									
	  {
										
		cellValue = "Unknown cell type:"+c.getCellType();
									
	  }
									 }

									 
System.out.println("cell at row "+row+", col "+cell+", value "+cellValue);
								}
						  }
					 }
				}
		  }
		  catch(Exception e)
		  {
				System.out.println(e+"; "+e.getMessage());
				e.printStackTrace();
		  }
	 }

Thanks,
Comment 1 Elvira Gurevich 2002-09-05 20:47:36 UTC
Created attachment 2945 [details]
excel file
Comment 2 Andy Oliver 2002-09-06 01:04:37 UTC
cool can you also attach the code?  
Comment 3 Andy Oliver 2002-09-06 01:20:07 UTC
Wow this was a good documentation catch.  I looked through and was baffled at
first.  As it turns out this is the intended behavior, but we should consider
changing it.  I had to look it up in my well worn copy of the "Microsoft Excel
Developer's Kit" on page 303 with some scribbling of mine in the margins.

Why is it like this?  Well originally these values were provided from the
DimensionsRecord which has two fields that use "microsoft math" (off by one like
their buffer checking in the C# JIT ;-)) meaning they are one greater than what
they should be.

As it stands now it looks like we are calculating this field.  It would be good
if you could raise this on the dev list (I'll provide the reference and history
if they miss this).  

Basically: should we just document this or is it worth breaking some backward
compatibility (in a fairly non obvious way) to make this more intuitive.  I'm
not sure.  I'm open to the thoughts of the developers and users.  This is a
fairly minor issue but probably a pretty irritating usability one.
Comment 4 Elvira Gurevich 2002-09-06 14:05:31 UTC
Created attachment 2952 [details]
test case for excel file previously attached. picks up cells at rows 9,10,11
Comment 5 Elvira Gurevich 2002-09-19 20:00:46 UTC
IS ANYBODY WORKING ON THIS?
I understand that the doc has to change for row.getLastCellNum() to indicate 
that the number returned is the index of the NEXT UNUSED cell.

The second part of this problem report, however, seems to have escaped 
everybody's attention, and that is: for the attached excel file, extra (empty, 
unused) rows are detected.

Thank you.
 
Comment 6 Andy Oliver 2002-09-20 01:07:01 UTC
Patience!  This is a project of volunteers and consultants like me who give
paying customers first crack (mortgage to pay and mouths to feed).

Anyhow this part isn't a bug:

recordid = 0x203, size =14
[NUMBER]
    .row            = 8
    .col            = 3
    .xfindex        = 19
    .value          = 0.6666666666666666
[/NUMBER]

============================================
Offset 0xef6 (3830)
recordid = 0x201, size =6
[BLANK]
row       = 9
col       = 2
xf        = 18
[/BLANK]

============================================
Offset 0xf00 (3840)
recordid = 0x201, size =6
[BLANK]
row       = a
col       = 2
xf        = 18
[/BLANK]

============================================
Offset 0xf0a (3850)
recordid = 0x201, size =6
[BLANK]
row       = b
col       = 2
xf        = 18
[/BLANK]


See those?  They're cells.  You perceve them to be blank in the GUI but "BLANK"
is a *real* cell with a *real* row.  BLANKS sometimes have styles attached to
them that look different than the surrounding style (for instance a cell with a
border but no value).

To see what I mean do java org.apache.poi.hssf.dev.BiffViewer myfile.xls

It will dump at text representation of the binary structure to sysout.  You
might want to redirect the output to a file.

hope that helps.
Comment 7 Elvira Gurevich 2002-09-20 20:20:05 UTC
thanks,
i understand about blank cells, i get the same results when i use hssf to read 
these cells, however,
when i open the file in excel, these cells don't look any different to me than 
other ones which are really empty(unused). i cannot detect any styles attached 
to them or anything. in fact, when the excel file was created, there was 
nothing done to those cells. could it be related to the ungodly format of the 
corresponding cells in the preceeding row? unfortunately, i am not familiar 
with the record format, what is xf in your dump? 
my problem (or my qa's problem) is that as far as common knowledge, there is 
NOTHING in these cells, but when read with HSSF, they show blank data.
Comment 8 Andy Oliver 2002-09-21 03:01:40 UTC
Right so you're saying the correct behavior is to ignore the ones you can't see
visibly in Excel?  Why excel created these?  Who knows.  However they are there
and HSSF shows you what is really in the file NOT what you see in Excel.  My
suggestion is to create a utility function that ignores rows with only blank
cells and wrap HSSF.  Regardless its not a bug in HSSF to report these rows with
blank cells.  That is the correct behavior.    

HSSF goal is to abstract you from the nasty details of the file format as much
as possible while giving you access to all the data.  However its also intent to
match the ACTUAL structure and not the perceived structure of the file. 
Sometimes the two differ.