  1. Tika
  2. TIKA-1020

Excel 2010 parser: missing cell values are not reported, resulting in missing column values


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
    • Environment:

      java 1.6 & 1.7

      Description

      When parsing an Excel 2010 table, if a worksheet has a missing value, it is not reported to the SAX handler. As a result, the remaining values in that row can end up in the wrong columns.

      For example given the table:

      A B C
      1 2 3
      4   6
      7 8 9
      

      the SAX handler receives the following elements:

      <tr><td>A</td><td>B</td><td>C</td></tr>
      <tr><td>1</td><td>2</td><td>3</td></tr>
      <tr><td>4</td><td>6</td></tr>
      <tr><td>7</td><td>8</td><td>9</td></tr>
      

      As a result, the handler can detect that the third row has too few cell values, but it is ambiguous which column is missing data.

      As a possible fix, the Excel 2010 XML data contains each cell's reference (for example B3), which could be passed to the SAX handler as an attribute.
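
To illustrate why the cell reference removes the ambiguity: the letters of an A1-style reference encode the column, so a consumer can recover the column position of every reported cell. The following is a minimal sketch of that conversion. The class name `CellRefs` and the method `columnIndex` are hypothetical and are not part of Tika or POI.

```java
final class CellRefs {
    private CellRefs() {}

    /**
     * Returns the zero-based column index of an A1-style cell reference.
     * Assumes upper-case column letters followed by a row number, e.g. "B3".
     */
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (c < 'A' || c > 'Z') {
                break; // reached the row-number part
            }
            // Column letters form a bijective base-26 number: A=1 ... Z=26, AA=27
            col = col * 26 + (c - 'A' + 1);
        }
        if (col == 0) {
            throw new IllegalArgumentException("No column letters in: " + cellRef);
        }
        return col - 1;
    }
}
```

With this, the two cells reported for the third data row (references A3 and C3) map to columns 0 and 2, which shows that column 1 is the empty one. Apache POI's `org.apache.poi.ss.util.CellReference` provides comparable parsing and may be preferable where POI is already on the classpath.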

      *** XSSFExcelExtractorDecorator.java    2012-11-08 10:51:55.881207100 +0000
      --- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +0000
      ***************
      *** 200,206 ****
        
               public void cell(String cellRef, String formattedValue) {
                  try {
      !              xhtml.startElement("td");
        
                     // Main cell contents
                     xhtml.characters(formattedValue);
      --- 200,208 ----
        
               public void cell(String cellRef, String formattedValue) {
                  try {
      !              AttributesImpl attributes = new AttributesImpl();
      !              attributes.addAttribute(null, "cellRef", "cellRef", null, cellRef) ;
      !              xhtml.startElement("td",attributes);
        
                     // Main cell contents
                     xhtml.characters(formattedValue);
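
On the consuming side, a handler could read the proposed attribute to see which cells each row actually contains. The sketch below assumes the patch above is applied, so that every `td` element carries a `cellRef` attribute. The class `RowCollector` is hypothetical and only illustrates the idea. The attribute is looked up by qualified name because the patch registers it with a null namespace URI.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Collects each table row as a map from cell reference (e.g. "A3") to cell text. */
final class RowCollector extends DefaultHandler {
    final List<Map<String, String>> rows = new ArrayList<>();

    private Map<String, String> currentRow; // non-null while inside a <tr>
    private String currentRef;              // cellRef of the <td> being read
    private StringBuilder text;             // non-null while inside a <td>

    // Fall back to the qualified name when the parser is not namespace-aware.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        if ("tr".equals(n)) {
            currentRow = new LinkedHashMap<>();
        } else if ("td".equals(n) && currentRow != null) {
            currentRef = atts.getValue("cellRef"); // attribute added by the proposed patch
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if ("td".equals(n) && text != null) {
            if (currentRef != null) {
                currentRow.put(currentRef, text.toString());
            }
            text = null;
        } else if ("tr".equals(n) && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }
}
```

For the example table, the third data row would be collected with the keys A3 and C3 only, so the absence of B3 identifies the empty column.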
      
      
      

              People

              • Assignee:
                Unassigned
                Reporter:
                neilblue Neil Blue
              • Votes:
                2
                Watchers:
                7
