  1. Tika
  2. TIKA-1020

Excel 2010 parser missing cell values are not reported resulting in missing columns values


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2
    • Fix Version/s: None
    • Component/s: parser
    • Environment: java 1.6 & 1.7

    Description

      When parsing an Excel 2010 table, if a worksheet row has a missing cell value, that cell is not reported to the SAX handler. As a result, the values that follow it in the row can no longer be matched to the correct columns.

      For example given the table:

      Bar.java
      A B C
      1 2 3
      4   6
      7 8 9
      

      the returned sax handler reports elements

      Bar.java
      <tr><td>A</td><td>B</td><td>C</td></tr>
      <tr><td>1</td><td>2</td><td>3</td></tr>
      <tr><td>4</td><td>6</td></tr>
      <tr><td>7</td><td>8</td><td>9</td></tr>
      

      As a result, the handler can detect that the third row has fewer cell values than the others, but it is ambiguous which column is missing its data.
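To make the ambiguity concrete, here is a minimal sketch that counts the `<td>` elements in each `<tr>` using only the JDK's SAX parser. `RowWidthCounter` and `widthsOf` are hypothetical names and not part of Tika. The sketch can show that a row is short, but nothing in the events says which column was skipped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records how many <td> cells each <tr> has.
class RowWidthCounter extends DefaultHandler {
    private final List<Integer> widths = new ArrayList<>();
    private int current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("tr".equals(qName)) {
            current = 0;            // a new row starts, reset the cell count
        } else if ("td".equals(qName)) {
            current++;              // one more reported cell in this row
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("tr".equals(qName)) {
            widths.add(current);    // row finished, remember its width
        }
    }

    // Parses a well-formed XHTML fragment and returns the cell count per row.
    static List<Integer> widthsOf(String xhtml) {
        RowWidthCounter counter = new RowWidthCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), counter);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return counter.widths;
    }
}
```

For the table above, the third row would report a width of 2 whether the empty cell was in column A, B, or C.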

      As a possible fix, the Excel 2010 XML data contains the cell reference for each cell, which could be passed to the SAX handler as an attribute.
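A cell reference such as `C3` encodes the column in its leading letters, so a consumer could recover the column position from it. The following is a minimal sketch under that assumption. `CellRefs.columnIndex` is a hypothetical helper, and it assumes plain A1-style references without `$` markers or sheet names. Apache POI's `CellReference` class should offer an equivalent conversion.

```java
// Hypothetical helper: converts an A1-style cell reference to a zero-based column index.
final class CellRefs {
    private CellRefs() {}

    static int columnIndex(String cellRef) {
        int col = 0;
        int i = 0;
        // Column letters form a bijective base-26 number: A=1 ... Z=26, AA=27, ...
        while (i < cellRef.length()) {
            char c = Character.toUpperCase(cellRef.charAt(i));
            if (c < 'A' || c > 'Z') {
                break;              // reached the row digits
            }
            col = col * 26 + (c - 'A' + 1);
            i++;
        }
        if (i == 0) {
            throw new IllegalArgumentException("No column letters in: " + cellRef);
        }
        return col - 1;             // shift to zero-based
    }

    public static void main(String[] args) {
        System.out.println(columnIndex("A3"));
        System.out.println(columnIndex("C3"));
    }
}
```

With this mapping, the two cells of the third row (`A3` and `C3`) resolve to columns 0 and 2, which identifies column B as the missing one.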

      Bar.java
      *** XSSFExcelExtractorDecorator.java    2012-11-08 10:51:55.881207100 +0000
      --- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +0000
      ***************
      *** 200,206 ****
        
               public void cell(String cellRef, String formattedValue) {
                  try {
      !              xhtml.startElement("td");
        
                     // Main cell contents
                     xhtml.characters(formattedValue);
      --- 200,208 ----
        
               public void cell(String cellRef, String formattedValue) {
                  try {
      !              AttributesImpl attributes = new AttributesImpl();
      !              attributes.addAttribute("", "cellRef", "cellRef", "CDATA", cellRef);
      !              xhtml.startElement("td", attributes);
        
                     // Main cell contents
                     xhtml.characters(formattedValue);
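If the parser emitted the cell reference as proposed, a downstream handler could use it to re-align rows. The sketch below assumes that each `<td>` carries a `cellRef` attribute as in the patch above. `PaddedRowCollector` and its methods are hypothetical names, and only JDK SAX classes are used. It inserts an empty string for every column skipped before a reported cell.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical consumer: rebuilds rows, padding skipped columns with "".
class PaddedRowCollector extends DefaultHandler {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> row;
    private StringBuilder text;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("tr".equals(qName)) {
            row = new ArrayList<>();
        } else if ("td".equals(qName) && row != null) {
            String ref = atts.getValue("cellRef");   // assumed attribute from the proposed patch
            if (ref != null) {
                int target = columnIndex(ref);
                while (row.size() < target) {
                    row.add("");                     // placeholder for a missing cell
                }
            }
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("td".equals(qName) && text != null) {
            row.add(text.toString());
            text = null;
        } else if ("tr".equals(qName) && row != null) {
            rows.add(row);
            row = null;
        }
    }

    // Zero-based column index from the leading letters of an A1-style reference.
    private static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = Character.toUpperCase(cellRef.charAt(i));
            if (c < 'A' || c > 'Z') {
                break;
            }
            col = col * 26 + (c - 'A' + 1);
        }
        return col - 1;
    }

    // Parses a well-formed XHTML fragment and returns the padded rows.
    static List<List<String>> collect(String xhtml) {
        PaddedRowCollector collector = new PaddedRowCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return collector.rows;
    }
}
```

Note that this only fills gaps before a reported cell. A missing value at the end of a row would still leave the row short, so a consumer that needs fixed-width rows would also have to pad to the known column count.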
      
      
      

            People

              Assignee: Unassigned
              Reporter: Neil Blue (neilblue)
              Votes: 2
              Watchers: 7
