Bug 51519 - XSSFEventBasedExcelExtractor's Japanese xlsx file processing shouldn't extract t elements within rPh elements.
Status: NEW
Product: POI
Classification: Unclassified
Component: XSSF
Version: 3.9
Hardware: PC All
Importance: P2 normal with 3 votes
Target Milestone: ---
Assigned To: POI Developers List
Reported: 2011-07-17 05:29 UTC by Mamoru Asagami
Modified: 2014-02-07 21:27 UTC (History)
CC List: 1 user



Attachments
Example files used to reproduce InvocationTargetException (6.30 KB, application/octet-stream)
2011-12-20 16:51 UTC, Michael L.
Example file to reproduce issue (9.97 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2013-12-30 22:19 UTC, Christopher
Patch to ReadOnlySharedStringsTable to address this issue (2.61 KB, patch)
2014-02-07 21:27 UTC, Shaun Kalley

Description Mamoru Asagami 2011-07-17 05:29:03 UTC
The rPh element holds the pronunciation (phonetic reading) of the cell text.
From the user's point of view it is hidden data, and it may not be accurate.
So it is confusing that the extracted text includes the content of rPh elements.

This is a sample of sharedStrings.xml.

<si>
   <t>役割</t>  <!-- "role" in English -->
   <rPh sb="0" eb="2">
      <t>ヤクワリ</t> <!-- phonetic reading of "role", written in Katakana (a Japanese phonetic script) -->
   </rPh>
   <phoneticPr fontId="1" /> 
</si>
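
To make the requested behaviour concrete, here is a minimal sketch that reads one si entry with the JDK's SAX parser and keeps only the t text that is not nested inside rPh. It is not POI code: the class name VisibleTextSketch and the method visibleText are hypothetical, and the parser is assumed to run without namespace awareness so that element names arrive as plain "t" and "rPh".

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI: collects <t> text from one <si>
// entry while ignoring any <t> nested inside a phonetic run (<rPh>).
class VisibleTextSketch {

    static String visibleText(String siXml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int rPhDepth = 0;     // greater than 0 while inside <rPh>
            private boolean inT = false;  // true while inside any <t>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("rPh".equals(qName)) {
                    rPhDepth++;
                } else if ("t".equals(qName)) {
                    inT = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("rPh".equals(qName)) {
                    rPhDepth--;
                } else if ("t".equals(qName)) {
                    inT = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Keep text only when it belongs to a <t> outside of <rPh>.
                if (inT && rPhDepth == 0) {
                    out.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(siXml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse shared string XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String si = "<si><t>役割</t><rPh sb=\"0\" eb=\"2\"><t>ヤクワリ</t></rPh>"
                + "<phoneticPr fontId=\"1\"/></si>";
        System.out.println(visibleText(si));
    }
}
```

For the sample above, only the base text 役割 should come out; the Katakana reading inside rPh is dropped.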
Comment 1 Nick Burch 2011-07-17 16:08:23 UTC
Maybe we should make it an option? Some people may want that data for their indexing?
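
One way such an option could look is sketched below with the JDK's StAX reader rather than POI's own classes. Everything here is an assumption for illustration: the class PhoneticOptionSketch, the flag includePhonetic, and the choice to append the phonetic text in parentheses are all hypothetical and do not describe an existing POI API.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not POI code: the caller decides whether the
// phonetic (rPh) text is returned along with the base text.
class PhoneticOptionSketch {

    static String text(String siXml, boolean includePhonetic) {
        StringBuilder base = new StringBuilder();
        StringBuilder phonetic = new StringBuilder();
        int rPhDepth = 0;      // greater than 0 while inside <rPh>
        boolean inT = false;   // true while inside any <t>
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(siXml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if ("rPh".equals(reader.getLocalName())) {
                            rPhDepth++;
                        } else if ("t".equals(reader.getLocalName())) {
                            inT = true;
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if ("rPh".equals(reader.getLocalName())) {
                            rPhDepth--;
                        } else if ("t".equals(reader.getLocalName())) {
                            inT = false;
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (inT) {
                            // Route text to the phonetic or base buffer.
                            StringBuilder target = rPhDepth > 0 ? phonetic : base;
                            target.append(reader.getText());
                        }
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse shared string XML", e);
        }
        // Assumed output format: base text, then the reading in parentheses.
        if (includePhonetic && phonetic.length() > 0) {
            base.append(" (").append(phonetic).append(")");
        }
        return base.toString();
    }
}
```

With the flag off, callers get only what Excel displays; with it on, indexers that want the reading can still receive it, clearly separated from the visible text.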
Comment 2 Michael L. 2011-12-20 16:51:32 UTC
Created attachment 28092
Example files used to reproduce InvocationTargetException
Comment 3 Michael L. 2011-12-20 16:53:31 UTC
Sorry, my attachment was for Bug #51158.
Comment 4 Christopher 2013-12-30 22:19:11 UTC
Created attachment 31165
Example file to reproduce issue

This file can be used to reproduce the issue. If you open the file using Excel and then load the file using Apache POI (streaming event model), you can see that extra characters are loaded that are not visible when opened in Excel. 

A good description of the purpose of these extra characters can be found at:

http://www.localizingjapan.com/blog/2011/02/13/sorting-in-japanese-%E2%80%94-an-unsolved-problem/
Comment 5 Shaun Kalley 2014-02-07 21:24:48 UTC
I'm also running into this issue.  If we set a goal of having XSSFEventBasedExcelExtractor produce output that is at parity with XSSFExcelExtractor, then the phonetic text should not be included in the output.  I'm attaching a patch that achieves that goal.
Comment 6 Shaun Kalley 2014-02-07 21:27:08 UTC
Created attachment 31295
Patch to ReadOnlySharedStringsTable to address this issue