FOP-1874

Greek Extended character throwing ArrayIndexOutOfBoundsException.

    Details

    • Type: Bug
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: 0.95
    • Fix Version/s: None
    • Component/s: renderer/pdf
    • Labels:
      None
    • Environment:
      Operating System: Linux
      Platform: PC
    • External issue ID:
      50471

      Description

      We want to create a PDF using FOP, transforming an XML file with an XSL stylesheet. The XML file contains a Greek Extended character: decimal code 8062, hex code 1F7E, HTML representation ὾.
      As soon as this character is encountered in the string, the transformer.transform method throws a TransformerException, which is caused by an ArrayIndexOutOfBoundsException.
      The exact exception stack trace is given below.
      We tried reading through the FOP code, but we could not understand the array lineBreakProperties defined in LineBreakUtils.

      Please help us find a way around this exception.

      Base Exception in PDFGenerator.buildPdf() Error in Creating PDF
      at PDFTest.buildPdf(PDFTest.java:140)
      at PDFTest.main(PDFTest.java:50)
      Caused by: javax.xml.transform.TransformerException: java.lang.ArrayIndexOutOfBoundsException: -1
      at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
      at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
      at PDFTest.buildPdf(PDFTest.java:118)
      ... 1 more
      Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
      at org.apache.fop.text.linebreak.LineBreakUtils.getLineBreakPairProperty(LineBreakUtils.java:668)
      at org.apache.fop.text.linebreak.LineBreakStatus.nextChar(LineBreakStatus.java:117)
      at org.apache.fop.layoutmgr.inline.TextLayoutManager.getNextKnuthElements(TextLayoutManager.java:543)
      at org.apache.fop.layoutmgr.inline.LineLayoutManager.collectInlineKnuthElements(LineLayoutManager.java:658)
      at org.apache.fop.layoutmgr.inline.LineLayoutManager.getNextKnuthElements(LineLayoutManager.java:594)
      at org.apache.fop.layoutmgr.BlockStackingLayoutManager.getNextKnuthElements(BlockStackingLayoutManager.java:294)
      at org.apache.fop.layoutmgr.BlockLayoutManager.getNextKnuthElements(BlockLayoutManager.java:116)
      at org.apache.fop.layoutmgr.table.TableCellLayoutManager.getNextKnuthElements(TableCellLayoutManager.java:170)
      at org.apache.fop.layoutmgr.table.RowGroupLayoutManager.createElementsForRowGroup(RowGroupLayoutManager.java:120)
      at org.apache.fop.layoutmgr.table.RowGroupLayoutManager.getNextKnuthElements(RowGroupLayoutManager.java:60)
      at org.apache.fop.layoutmgr.table.TableContentLayoutManager.getKnuthElementsForRowIterator(TableContentLayoutManager.java:228)
      at org.apache.fop.layoutmgr.table.TableContentLayoutManager.getNextKnuthElements(TableContentLayoutManager.java:172)
      at org.apache.fop.layoutmgr.table.TableLayoutManager.getNextKnuthElements(TableLayoutManager.java:247)
      at org.apache.fop.layoutmgr.BlockStackingLayoutManager.getNextKnuthElements(BlockStackingLayoutManager.java:294)
      at org.apache.fop.layoutmgr.BlockLayoutManager.getNextKnuthElements(BlockLayoutManager.java:116)
      at org.apache.fop.layoutmgr.BlockStackingLayoutManager.getNextKnuthElements(BlockStackingLayoutManager.java:294)
      at org.apache.fop.layoutmgr.BlockLayoutManager.getNextKnuthElements(BlockLayoutManager.java:116)
      at org.apache.fop.layoutmgr.FlowLayoutManager.getNextKnuthElements(FlowLayoutManager.java:107)
      at org.apache.fop.layoutmgr.PageBreaker.getNextKnuthElements(PageBreaker.java:145)
      at org.apache.fop.layoutmgr.AbstractBreaker.getNextBlockList(AbstractBreaker.java:552)
      at org.apache.fop.layoutmgr.PageBreaker.getNextBlockList(PageBreaker.java:137)
      at org.apache.fop.layoutmgr.AbstractBreaker.doLayout(AbstractBreaker.java:302)
      at org.apache.fop.layoutmgr.AbstractBreaker.doLayout(AbstractBreaker.java:264)
      at org.apache.fop.layoutmgr.PageSequenceLayoutManager.activateLayout(PageSequenceLayoutManager.java:106)
      at org.apache.fop.area.AreaTreeHandler.endPageSequence(AreaTreeHandler.java:234)
      at org.apache.fop.fo.pagination.PageSequence.endOfNode(PageSequence.java:123)
      at org.apache.fop.fo.FOTreeBuilder$MainFOHandler.endElement(FOTreeBuilder.java:340)
      at org.apache.fop.fo.FOTreeBuilder.endElement(FOTreeBuilder.java:169)
      at com.sun.org.apache.xml.internal.serializer.ToXMLSAXHandler.endElement(Unknown Source)
      at com.sun.org.apache.xml.internal.serializer.ToXMLSAXHandler.endElement(Unknown Source)
      at GregorSamsa.template$dot$0()
      at GregorSamsa.applyTemplates()
      at GregorSamsa.transform()
      at com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(Unknown Source)

          Activity

          Andreas L. Delmelle added a comment -

          Thanks for reporting, and apologies for the late reply...

          At first glance, this seems like a minor oversight in the implementation of Unicode linebreaking in FOP. This does not take into account the possibility that a given codepoint is not assigned a 'class' in linebreaking context. (= U+1F7E does not appear in the file http://www.unicode.org/Public/UNIDATA/LineBreak.txt, which is used as a basis to generate those arrays in LineBreakUtils.java)
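
To check whether a code point is assigned before sending text to FOP, the JDK's own Unicode tables can be consulted. This is only a minimal sketch: the class name CodePointCheck and the method isAssigned are made up for illustration and are not part of FOP, and the answer depends on the Unicode version shipped with the running JVM rather than on the LineBreak.txt file mentioned above.

```java
final class CodePointCheck {

    /**
     * Returns true if the running JVM's Unicode data assigns a character
     * to this code point (hypothetical helper, not a FOP API).
     */
    static boolean isAssigned(int codePoint) {
        return Character.isDefined(codePoint);
    }

    public static void main(String[] args) {
        int reported = 0x1F7E; // decimal 8062, the code point from the report

        // The block a code point falls in is known even when the code point itself is unassigned.
        System.out.println("Block:    " + Character.UnicodeBlock.of(reported));
        System.out.println("Assigned: " + isAssigned(reported));

        // Neighbouring code point U+1F7D, for comparison.
        System.out.println("U+1F7D assigned: " + isAssigned(0x1F7D));
    }
}
```

A check like this only shows that the input contains an unassigned code point. It says nothing about how FOP classifies that code point for line breaking.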

          On the other hand, one could obviously raise the question why you so desperately need to have an unassigned codepoint in your output. Are you absolutely sure you need this? If yes, then can you elaborate on the exact reason? (i.e. What exactly is this unassigned codepoint used for?)

          The most straightforward 'fix' seems to be roughly as follows:

          Index: src/java/org/apache/fop/text/linebreak/LineBreakStatus.java
          ===================================================================
          --- src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (revision 1054383)
          +++ src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (working copy)
          @@ -87,6 +87,7 @@

          /* Initial conversions */
          switch (currentClass) {
          + case 0: // Unassigned codepoint: consider as AL?
          case LineBreakUtils.LINE_BREAK_PROPERTY_AI:
          case LineBreakUtils.LINE_BREAK_PROPERTY_SG:
          case LineBreakUtils.LINE_BREAK_PROPERTY_XX:

          This assigns the class 'AL' ('Alphabetic') to any codepoint that Unicode has not assigned a class, so it will be treated as a regular letter.
          The reason I am asking whether you are sure you know what you're doing is that this may turn out to be undesirable. Perhaps the character in question needs to be treated as a space rather than a letter. Unicode does not define U+1F7E other than as a 'reserved' character, so it makes sense that Unicode cannot say what should happen with this character in the context of linebreaking...

          That said, it is also wrong of FOP to crash in this case, so the bug is definitely genuine.
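
The failure mode and the effect of the proposed change can be illustrated outside FOP. The sketch below is not the FOP source: the class numbers, the tiny two-class pair table and all names (PairLookupSketch, BREAK_ALLOWED, normalize and so on) are assumptions made for illustration. It assumes only that class 0 means "no class assigned" and that the pair table is indexed by class minus one, which would be consistent with the index -1 in the reported stack trace.

```java
final class PairLookupSketch {

    // Hypothetical class numbers: 0 = no class assigned, real classes start at 1.
    static final int UNASSIGNED = 0;
    static final int AL = 1; // alphabetic
    static final int SP = 2; // space

    // Simplified, made-up pair table indexed by [before - 1][after - 1]:
    // true means a break is allowed between the two classes.
    static final boolean[][] BREAK_ALLOWED = {
        /* before AL */ { false, false },
        /* before SP */ { true,  false },
    };

    /** Lookup without a guard: class 0 turns into index -1 and throws. */
    static boolean unguardedLookup(int before, int after) {
        return BREAK_ALLOWED[before - 1][after - 1];
    }

    /** Mirrors the idea of the patch: treat an unassigned class as AL. */
    static int normalize(int lineBreakClass) {
        return lineBreakClass == UNASSIGNED ? AL : lineBreakClass;
    }

    /** Lookup that normalizes both classes before indexing the table. */
    static boolean guardedLookup(int before, int after) {
        return BREAK_ALLOWED[normalize(before) - 1][normalize(after) - 1];
    }

    public static void main(String[] args) {
        try {
            unguardedLookup(AL, UNASSIGNED);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Unguarded lookup failed: " + e.getMessage());
        }
        System.out.println("Guarded lookup, SP then unassigned: "
                + guardedLookup(SP, UNASSIGNED));
    }
}
```

Mapping the unassigned class to AL is a policy choice, not a Unicode requirement, which is why the question about the intended role of the character still matters.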

          Chris Bowditch added a comment -

          Indeed, you raise a very good point, Andreas. Even if you make the code change, I would expect # to appear in the output, because no font is likely to have a glyph for a reserved code point. So I am also interested to hear the business reason for using such a code point.

          tvsudhir added a comment -

          Andreas,

          Thanks a lot for your response.

          Actually, we came across some special characters that are not supposed to be in our database. We can figure out the reasons for this corruption and correct it, but I still expect FOP to display whatever content is available. Whatever the character may be, as long as it is a valid code point (even if it is reserved and has no representation of its own), I do not expect FOP to crash because of it.

          At the very least, there should be some configuration available to the end user to tell FOP to use some default line break behavior in such special cases, since the right choice is specific to the customer using FOP. The entire PDF generation should not be put at stake just because of one special character, should it? Giving the customer a set of options to choose from to get out of this situation is better than crashing.

          Frankly speaking, we had lost hope of getting a response on this issue from Apache. We searched for this problem on Google and saw many others complaining about a similar issue (i.e. getting an ArrayIndexOutOfBoundsException). I believe they might also have some reserved character in their text. We have at least nailed down the cause of the problem. A proper resolution to this issue would be of great help, not only to me but to many others.

          Thanks again for looking into it and discussing it in the forum.
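
For anyone who has to produce PDFs with a FOP version that still crashes, one possible stopgap is to clean the text before it reaches the transformer. The following is only a sketch under stated assumptions: the class UnassignedSanitizer and its method are hypothetical, the choice of U+FFFD as the substitute is arbitrary, and "unassigned" means unassigned according to the running JVM's Unicode data, which may differ from the tables FOP was generated from.

```java
final class UnassignedSanitizer {

    /** Substitute used for unassigned code points (an arbitrary choice). */
    static final int REPLACEMENT = 0xFFFD; // U+FFFD REPLACEMENT CHARACTER

    /**
     * Returns a copy of the text in which every code point that the JVM
     * considers unassigned is replaced with U+FFFD.
     */
    static String replaceUnassigned(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Iterate by code point, not by char, so supplementary characters stay intact.
        text.codePoints().forEach(cp ->
                out.appendCodePoint(Character.isDefined(cp) ? cp : REPLACEMENT));
        return out.toString();
    }

    public static void main(String[] args) {
        String corrupted = "abc" + new String(Character.toChars(0x1F7E)) + "def";
        String cleaned = replaceUnassigned(corrupted);
        System.out.println(cleaned.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(java.util.stream.Collectors.joining(" ")));
    }
}
```

This hides the corrupted data rather than repairing it, so it is best paired with logging of the replaced positions.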

          Andreas L. Delmelle added a comment -

          (In reply to comment #3)
          > At the very least, there should be some configuration available to the end
          > user to tell FOP to use some default line break behavior in such special
          > cases, since the right choice is specific to the customer using FOP. The
          > entire PDF generation should not be put at stake just because of one special
          > character, should it? Giving the customer a set of options to choose from to
          > get out of this situation is better than crashing.

          Very right indeed.
          So, if no one objects, I will apply the patch as proposed. FOP will no longer crash, but simply show a '#' for such unassigned codepoints in the output. Treating them as regular alphabetic characters seems to be safe enough for the time being.
          Customization of and/or more refined configuration possibilities for the Unicode line-breaking algorithm is something that is still on the wish-list for the longer term.

          Andreas L. Delmelle added a comment -

          Fixed in Trunk. See: http://svn.apache.org/viewvc?rev=1056518&view=rev

          Glenn Adams added a comment -

          batch transition to closed; if someone wishes to restore one of these to resolved in order to perform a verification step, then feel free to do so


            People

            • Assignee:
              fop-dev
              Reporter:
              tvsudhir