FOP-1969

Surrogate pairs not treated as single unicode codepoint for display purposes

    Details

    • Type: Improvement
    • Status: Open
    • Resolution: Unresolved
    • Affects Version/s: trunk
    • Fix Version/s: None
    • Component/s: unqualified
    • Labels:
      None
    • Environment:
      Operating System: All
      Platform: All
    • External issue ID:
      51843

      Description

      Unicode code points outside the BMP (Basic Multilingual Plane), i.e., those whose scalar value is greater than 0xFFFF (65535), are encoded as UTF-16 surrogate pairs in Java strings. Each such pair should be treated as a single code point when mapping to a glyph in a font (one that supports extra-BMP mappings).

      At present, FOP does not handle this case correctly in the simple (non-complex-script) rendering paths.

      Furthermore, although some support has been added for this in the complex script rendering path, it has not yet been tested, so it is not necessarily working there either.
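
A minimal Java sketch of the behaviour the description asks for: walking a string by code point rather than by `char`, so that a surrogate pair is seen as one unit. The class and method names are illustrative and are not part of FOP; the sample character U+20000 is an arbitrary non-BMP choice.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {

    /** Returns the code points of s, combining each valid surrogate pair into one value. */
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // reads one or two UTF-16 code units
            out.add(cp);
            i += Character.charCount(cp);     // advance by 2 for a supplementary code point
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "A\uD840\uDC00";        // 'A' followed by U+20000
        System.out.println(text.length());             // prints 3 (UTF-16 code units)
        System.out.println(codePoints(text).size());   // prints 2 (code points)
    }
}
```

A glyph lookup keyed on the values returned by `codePoints` would see U+20000 as one entry, whereas a loop over `charAt` would see two unrelated surrogate code units.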

      1. pcltest.zip
        0.7 kB
        simon steiner
      2. single-byte.zip
        2 kB
        simon steiner
      3. testing.fo
        0.6 kB
        ngkit
      4. testing.fo
        0.8 kB
        ngkit
      5. testing.pdf
        5 kB
        ngkit
      6. testing.pdf
        5 kB
        ngkit
      7. testing.xml
        0.0 kB
        ngkit
      8. testing.xsl
        0.9 kB
        ngkit
      9. tiffttc.zip
        0.7 kB
        simon steiner
      10. Urdu.zip
        2 kB
        simon steiner

          Activity

          Glenn Adams added a comment -

          resetting P2 open bugs to P3 pending further review

          Thomas T. added a comment -

          request to fix this to support surrogate pairs characters.

          Glenn Adams added a comment -

          (In reply to comment #2)
          > request to fix this to support surrogate pairs characters.

          thanks for your request; could you provide additional information:

          1. what specific non-BPM characters you would like to use?
          2. what specific fonts will you use for these characters?

          Glenn Adams added a comment -

          (In reply to comment #3)
          > (In reply to comment #2)
          > > request to fix this to support surrogate pairs characters.
          >
          > thanks for your request; could you provide additional information:
          >
          > 1. what specific non-BPM characters you would like to use?
          > 2. what specific fonts will you use for these characters?

          s/BPM/BMP/

          Saašha Metsärantala added a comment -

          Hello!

          Today, the majority of Unicode's characters are outside the BMP. This involves many alphabets and other character sets. Here are links to two of these non-BMP planes:
          http://www.unicode.org/roadmaps/smp/ and
          http://www.unicode.org/roadmaps/sip/
          More are to come, because dozens of alphabets are not yet part of Unicode.

          Today's FOP supports no more than a minority of Unicode's characters, and this minority will become proportionally smaller in the future. I consider that there is a need to solve this problem in the long run.

          Trying to "solve" the problem for some specific non-BMP characters will lead to this problem coming back again and again ...

          I will use FOP much more often as soon as it supports non-BMP characters.

          Regards!

          Saašha,

          Glenn Adams added a comment -

          (In reply to comment #5)
          > Hello!
          >
          > Today, the majority of Unicode's characters are outside the BMP. This
          > involves many alphabets and other character sets. Here are links to two of
          > these non-BMP planes:
          > http://www.unicode.org/roadmaps/smp/ and
          > http://www.unicode.org/roadmaps/sip/ and more are to come because dozens of
          > alphabets are not part of Unicode, yet.
          >
          > Today's FOP supports no more than a minority of Unicode's characters and
          > this minority will become proportionally less and less in the future. I
          > consider that there is a need to solve this problem in the long run.
          >
          > Trying to "solve" the problem for some specific non-BMP characters will lead
          > to this problem coming back again and again ...
          >
          > I will use FOP much more often as soon as it supports non-BMP characters.
          >
          > Regards!
          >
          > Saašha,

          I'm sorry Saašha but I do not accept the rationale of your argument. First,
          FOP supports the representation of all BMP characters which is the vast
          majority of modern usage, >99.994%.

          If you cannot demonstrate to me a real, current need to use non-BMP characters
          or cannot demonstrate a font that actually supports these character mappings
          that you need to use, then I will leave this bug prioritized low (P5).

          If you wish to contribute a patch that adds non-BMP support, then the FOP
          team would be happy to apply it. In the mean time, you shall have to wait
          until this enhancement gets higher in the priority queue, and that will have
          to await many other enhancements in my opinion, such as finishing support
          for complex scripts, adding full CJK support, etc.

          Saašha Metsärantala added a comment -

          Hello!

          Thanks for your reply! Here are a few clarifications!

          > the vast majority of modern usage,
          A great deal of software does not support non-BMP characters; FOP is not the only one. The fact that non-BMP characters are poorly supported is among the main reasons why they are seldom encoded as such. Instead, workarounds are used. For example, non-BMP characters are often mapped into parts of the so-called "Private Use Area" (U+E000 to U+F8FF) before being processed. Sometimes "font tricks" are used, where the glyphs of one alphabet are simply copied to a BMP alphabet's place, reminiscent of the early nineties, when Greek and Cyrillic glyphs (among others) often lived in "ASCII" fonts. Sometimes they are replaced by PNGs. All these workarounds cause confusion, contribute to the "non-visibility" of these alphabets, and make text written in these character sets very hard to find.

          In other words, the poor support for non-BMP characters is indeed one of the main reasons for their "non-visibility". It is important to avoid misinterpretations here: these characters are both used and useful.

          > demonstrate to me a real, current need to use non-BMP characters
          To be accepted into Unicode, an alphabet or other character set (such as mathematical symbols) needs to be supported by a VERY active community over a long period. Otherwise, the Unicode Consortium does not include it. The very fact that Unicode includes non-BMP alphabets and other character sets is proof that an active community needs those characters.

          On the other hand, the fact that dozens of alphabets are still absent from Unicode should not be misinterpreted as meaning that these alphabets are not in use.

          > adding full CJK support,
          Thousands of CJK characters live outside the BMP. Full CJK support requires support for non-BMP characters.

          > If you wish to contribute a patch that adds non-BMP support,
          I plan to try to write some kind of fix this summer.

          Regards!

          Saašha,

          Glenn Adams added a comment -

          (In reply to comment #7)
          > Hello!
          >
          > Thanks for your reply! Here are a few clarifications!
          >
          > > the vast majority of modern usage,
          > Many, many software do not support non-BMP characters. I would like to
          > clarify that FOP is not the only one. The fact that non-BMP characters are
          > poorly supported is among the main reasons why non-BMP characters are seldom
          > encoded as such. Instead, work-arounds are used. For example, non-BMP
          > characters are often converted to parts of the so called "private use area"
          > (U+E000 to U+F8FF) before being processed. Sometimes, "font-tricks" are
          > used, where the glyphs of one alphabet are just copied to a BMP-alphabet's
          > place – reminding of the (early) nineties, where greek and cyrillic glyphs
          > (among others) were often living in "ASCII"-fonts. Sometimes, they are
          > replaced by PNG's. All these work-arounds contribute to many confusions and
          > also contribute to the "non-visibility" of these alphabets and to great
          > difficulties to find text written with these character sets.
          >
          > In other words, the poor support for non-BMP characters is indeed one of the
          > main reasons for their "non-visibility". It is important to avoid
          > misinterpretations here: these characters are both used and useful.
          >
          > > demonstrate to me a real, current need to use non-BMP characters
          > To be accepted as part of Unicode, an alphabet or other character set (such
          > as mathematical symbols, etc.) needs to be supported by a VERY active
          > community during a long time. Otherwise, the Unicode consortium does not
          > include this alphabet. The very fact that Unicode includes non-BMP alphabets
          > and other character sets is a proof that an active community needs those
          > characters.
          >
          > On the other hand, the fact that dozens of alphabets are still absent from
          > Unicode shall not be misinterpreted as a non-usage of these alphabets.
          >
          > > adding full CJK support,
          > Thousands of CJK characters live outside the BMP. A full CJK support
          > requires support for non-BMP characters.
          >
          > > If you wish to contribute a patch that adds non-BMP support,
          > I plan to try to write some kind of fix this summer.
          >
          > Regards!
          >
          > Saašha,

          Again, you are giving me general reasons, but not specific ones that drive your immediate needs. I am extremely familiar with Unicode, having been a co-author of Unicode 2.0, a technical director of the Unicode Consortium from 93-98, and Unicode's representative to the ISO SC2/WG2 IRG (Ideographic Rapporteur Group), which created the CJK encodings in Unicode.

          I want to know specifically what non-BMP characters you want to use and what specific fonts you will use to print them. If you can demonstrate a good, real need (as opposed to generalities), then perhaps I will be inclined to give non-BMP support a greater priority; if not, I will continue to assign higher priority to other features that better support non-Roman scripts that use the BMP. Regarding CJK and non-BMP, I agree that it is useful to support those characters; however, I'd like to see fonts that are available for these characters first.

          ngkit added a comment -

          Hello,
          I have used the FOP library to generate PDF files for several years, and it has been a great library for the task. Recently, however, I found some "?" characters in the PDF files. Looking for the root cause, I found that the code of the problem character is not the same as the one I used previously. According to a Microsoft document (here is the link: http://www.microsoft.com/en-us/download/details.aspx?id=12080), some characters can be represented by either a PUA code or a Unicode 4.1 code. PUA is only a backward-compatible solution, and it seems PUA support will fade out in the near future. So is it possible to give this enhancement a higher priority?
          Kit

          Glenn Adams added a comment -

          (In reply to comment #9)
          > Hello,
          > I have used FOP library to generate PDF files for several years. It was a
          > great library to perform the task. However, I found some "?" exist in PDF
          > files recently. I have tried to find the root cause, the problem character
          > byte code is not same with my previous using one. According to Microsoft
          > document(here is the link
          > http://www.microsoft.com/en-us/download/details.aspx?id=12080), some of the
          > characters can be represented by both PUA or Unicode 4.1 byte code. And PUA
          > is just a backward compatible solution. And it seems PUA support is going
          > to fade out in coming future. So is it possible to put this enhancement to
          > higher priority?
          > Kit

          I don't understand your comment. You need to provide more details so that we can tell whether you have a problem and, if you do, whether it relates to this bug. If you have a problem with a specific input FO file, then attach that file along with the PDF file you obtain when running FOP. Also attach any console output. Once you do these things, I can evaluate whether your problem is legitimate and whether it is related.

          ngkit added a comment -

          Hello,
          Thanks for your comment, and sorry for my misleading message and poor English.
          Here is my problem:
          When an XML data file contains a Chinese character whose code point is not in the PUA, "?" is displayed.
          Here is the font library information:
          http://www.microsoft.com/en-us/download/details.aspx?id=12080
          And here is the character I failed to generate:
          Unicode code point (hex): 2070E

          According to the above URL, old PUA characters have been moved to non-PUA code point assignments. It seems that Chinese characters in the PUA will not get any enhancement or support in the future. So is it possible to give this enhancement (support for surrogate-pair characters) a higher priority?

          Cheers,
          Kit
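
For reference, the reported character U+2070E lies outside the BMP, so a Java string holds it as a surrogate pair. The sketch below works through the UTF-16 arithmetic for that code point; the class and method names are illustrative only and do not come from FOP.

```java
final class SurrogateMath {

    /** Splits a supplementary code point into {high, low} surrogates using the UTF-16 formula. */
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x2070E);
        // prints "D841 DF0E"
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1]));
        // The two code units recombine to the original code point; prints "2070E"
        System.out.println(Integer.toHexString(Character.toCodePoint(pair[0], pair[1])).toUpperCase());
    }
}
```

Code that handles each of the two `char` values separately looks up U+D841 and U+DF0E, neither of which maps to a glyph, which would be consistent with a "?" appearing in the output.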

          Glenn Adams added a comment -

          (In reply to comment #11)
          > Hello,
          > Thanks for your comment and sorry for my misleading message and poor
          > English.
          > Here is my problem:
          > When XML data files contains Chinese character with byte code does not exist
          > in PUA, "?" will be displayed.
          > And here is the fonts library information
          > http://www.microsoft.com/en-us/download/details.aspx?id=12080
          > And here is the character I failed to generated
          > Unicode code (Hex):2070E
          >
          > According to the above URL, old PUA characters have been moved to non PUA
          > code point assignment. It seems that Chinese characters in PUA will not have
          > any enhancement or support in coming future. So is it possible to put this
          > enhancement (support surrogate pairs characters) to higher priority?
          >
          > Cheers,
          > Kit

          Irrelevant. Characters encoded using PUA are not interchangeable. Private means Private. In any case, I'll ignore your comment unless and until you provide a sample FO/PDF pair demonstrating a problem.

          May I remind you that work on FOP (or any other Apache project) is done on a volunteer or sponsorship basis. If you want the priority placed higher, then either volunteer to do the work or sponsor someone to do the work. I welcome all improvements to FOP and will do my utmost to apply patches quickly, but your request to prioritize a particular feature has no weight unless you do something concrete to assist.

          Just as an FYI, my personal priority is to improve support for BMP encoded scripts, and then move on to non-BMP features.

          Respectfully, Glenn

          Glenn Adams added a comment -

          (In reply to comment #11)
          > Hello,
          > Thanks for your comment and sorry for my misleading message and poor
          > English.
          > Here is my problem:
          > When XML data files contains Chinese character with byte code does not exist
          > in PUA, "?" will be displayed.
          > And here is the fonts library information
          > http://www.microsoft.com/en-us/download/details.aspx?id=12080
          > And here is the character I failed to generated
          > Unicode code (Hex):2070E
          >
          > According to the above URL, old PUA characters have been moved to non PUA
          > code point assignment. It seems that Chinese characters in PUA will not have
          > any enhancement or support in coming future. So is it possible to put this
          > enhancement (support surrogate pairs characters) to higher priority?
          >
          > Cheers,
          > Kit

          I've asked once, and I'll ask again: please provide a minimal input FO file and an output PDF file demonstrating a problem; if you can't or won't do this, I cannot do anything to help.

          ngkit added a comment -

          Attachment testing.xsl has been added with description: Sample XSL file to generate Chinese characters. It uses the "Mingliu" Chinese font.

          ngkit added a comment -

          The XML data file contains characters from both PUA and non-PUA code points.

          ngkit added a comment -

          Attachment testing.pdf has been added with description: Result PDF file

          ngkit added a comment -

          Attachment testing.xml has been added with description: Sample XML file containing both PUA and non-PUA Chinese characters

          ngkit added a comment -

          Hello,
          I have uploaded the XML data file, XSL template file and result PDF file. Is any other information required?
          Cheers,
          Kit

          Pascal Sancho added a comment -

          (In reply to comment #17)
          > Hello,
          > I have uploaded XML data file, XSL template file and result PDF file. Any
          > other information require?
          > Cheers,
          > Kit

          Hi,
          as Glenn said, you should attach the XSL-FO resulting from the XML+XSLT transformation; this will be very helpful to reproduce (or not) the issue and identify what causes it.

          See bug reporting guidelines at [1] for further info.

          [1] http://xmlgraphics.apache.org/fop/bugs.html#issues_new

          ngkit added a comment -

          Attachment testing.fo has been added with description: Sample FO file to generate PDF file

          ngkit added a comment -

          Attachment testing.fo has been added with description: Sample FO file to generate PDF

          ngkit added a comment -

          Attachment testing.pdf has been added with description: Result PDF file

          ngkit added a comment -

          Hello,
          Sorry, I uploaded the wrong files before. I have now uploaded the XSL-FO result file and the result PDF file. Is any other information required?
          Cheers,
          Kit

          Jacky added a comment -

          Hi all,
          Glad to see the thread is active again, as I have similar concerns about using non-BMP characters. Support for non-BMP characters is very important because some street names contain characters that cannot be substituted by any others.

          If FOP can support surrogate pairs, I'm sure many more developers will benefit, since the generated PDF embeds the font by default, which solves many physical printing problems caused by printer-loaded fonts.

          Jacky

          Rick added a comment -

          Regarding the above problem, we encountered the same issue in our applications.
          It looks like a common issue for applications that handle Chinese characters. Hoping that a fix can be provided soon. Many thanks.

          Rick

          TY@Taiwan added a comment -

          Great to finally find some information about the non-BMP character support issue in FOP. I would also like to know whether the problem is due to FOP itself; it would be quite annoying if my application finally goes ahead and deploys with FOP in production.

          Joining the thread to hear good news.
          TY

          Glenn Adams added a comment -

          (In reply to comment #25)
          > Great that finally searched some related information about support non-BMP
          > characters issue with FOP, & also wanna to know if it is due to FOP, & that
          > problem quite annoying if my APPL should finally go ahead for deploy with
          > FOP @production.
          >
          > Join thread to hear gd news.
          > TY

          don't jump to the conclusion that anything has changed in FOP: it hasn't!

          also, keep in mind that adding support for non-BMP characters in FOP is only a part of the solution; the larger part of the solution is outside of the scope of FOP, namely, the availability of OpenType or TrueType fonts that contain a 'cmap' table that satisfies one of the following:

          • platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later),
            format 10.0 (trimmed array)
          • platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later),
            format 12.0 (segmented coverage)
          • platform ID 3 (windows), encoding ID 10 (ucs-4),
            format 12.0 (segmented coverage)

          so far, nobody has provided me a link to or a copy of such a font, and, until i have such a font in hand, i'm not going to take any action with respect to this bug
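
          The three acceptable 'cmap' subtable combinations listed above can be written down as a small predicate. This is only a sketch: the class and method names are hypothetical and are not part of FOP, and the check mirrors exactly the combinations named in this comment rather than everything the OpenType specification allows.

```java
/** Sketch: does a cmap subtable header match one of the combinations listed in the comment? */
final class CmapSubtableCheck {

    /**
     * Hypothetical helper (not a FOP API). Returns true for the three
     * platform/encoding/format combinations named in the thread.
     */
    static boolean canMapNonBmp(int platformId, int encodingId, int format) {
        // platform 0 (unicode), encoding 3, format 10 (trimmed array) or 12 (segmented coverage)
        boolean unicodePlatform = platformId == 0 && encodingId == 3
                && (format == 10 || format == 12);
        // platform 3 (windows), encoding 10 (ucs-4), format 12 (segmented coverage)
        boolean windowsUcs4 = platformId == 3 && encodingId == 10 && format == 12;
        return unicodePlatform || windowsUcs4;
    }

    public static void main(String[] args) {
        System.out.println(canMapNonBmp(3, 10, 12)); // windows / ucs-4 / format 12
        System.out.println(canMapNonBmp(3, 1, 4));   // windows / BMP-only / format 4
    }
}
```

          A font loader could run such a check over every encoding record in the 'cmap' table and report whether any subtable qualifies.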

          C.C added a comment -

          Same problem here! Can you provide any workaround until the bug is fixed? You know, it takes time to find a suitable font. Anyway, I will keep an eye on the thread.

          Cusson

          Thomas T. added a comment -

          Hi Glenn,

          Sorry, I do not clearly understand which fonts you are requesting. Is there any software or tool
          to check whether a font supports the 'cmap' you mentioned?
          I tried the Microsoft Font Properties extension tool
          http://www.microsoft.com/typography/TrueTypeProperty21.mspx
          to check whether I have suitable fonts, but it does not show the cmap properties.
          Thanks.

          Thomas T.

          (In reply to comment #26)
          > (In reply to comment #25)
          > > Great that finally searched some related information about support non-BMP
          > > characters issue with FOP, & also wanna to know if it is due to FOP, & that
          > > problem quite annoying if my APPL should finally go ahead for deploy with
          > > FOP @production.
          > >
          > > Join thread to hear gd news.
          > > TY
          >
          > don't jump to the conclusion that anything has changed in FOP: it hasn't!
          >
          > also, keep in mind that adding support for non-BMP characters in FOP is only a
          > part of the solution; the larger part of the solution is outside of the scope
          > of FOP, namely, the availability of OpenType or TrueType fonts that contain a
          > 'cmap' table that satisfies one of the following:
          >
          > • platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later),
          >   format 10.0 (trimmed array)
          > • platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later),
          >   format 12.0 (segmented coverage)
          > • platform ID 3 (windows), encoding ID 10 (ucs-4),
          >   format 12.0 (segmented coverage)
          >
          > so far, nobody has provide me a link to or a copy of such a font, and, until
          > i have such a font in hand, i'm not going to take any action with respect to
          > this bug
          Glenn Adams added a comment -

          (In reply to comment #28)
          > Sorry not understand your requested fonts clearly. Is there any
          > software/tools
          > to check the fonts supported the 'cmap' you mentioned?
          > I tried Microsoft Font Properties extension tools
          > http://www.microsoft.com/typography/TrueTypeProperty21.mspx
          > to check if i got fonts that suit, but it didn't involve the cmap properties.

          One option is the 'ttx' tool in the Adobe Font Development Kit for OpenType (AFDKO)

          Thomas T. added a comment -

          Hi Glenn,
          Using the tool you suggested, I found 4 fonts bundled with Windows 7 that have the following cmap support. Are these what you are looking for?
          ebrima.ttf
          <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="7012" language="0" nGroups="583">

          ebrimabd.ttf
          <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="7012" language="0" nGroups="583">

          seguisym.ttf
          <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="1900" language="0" nGroups="157">

          simsunb.ttf
          <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="40" language="0" nGroups="2">
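
          For readers unfamiliar with the ttx output above: a format 12 subtable is a sorted list of groups (the nGroups attribute), each holding a start code point, an end code point and the glyph ID of the first code point. A lookup over such groups might be sketched as follows. The class names and the sample group values are hypothetical and were not read from any of the fonts listed here.

```java
/** Sketch of a cmap format 12 (segmented coverage) lookup over sorted groups. */
final class Format12Lookup {

    /** One sequential map group: code points start..end map to consecutive glyph IDs. */
    static final class Group {
        final int startCharCode;
        final int endCharCode;
        final int startGlyphId;

        Group(int startCharCode, int endCharCode, int startGlyphId) {
            this.startCharCode = startCharCode;
            this.endCharCode = endCharCode;
            this.startGlyphId = startGlyphId;
        }
    }

    /** Binary search over groups sorted by startCharCode; returns 0 (.notdef) if unmapped. */
    static int glyphFor(Group[] groups, int codePoint) {
        int lo = 0;
        int hi = groups.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Group g = groups[mid];
            if (codePoint < g.startCharCode) {
                hi = mid - 1;
            } else if (codePoint > g.endCharCode) {
                lo = mid + 1;
            } else {
                return g.startGlyphId + (codePoint - g.startCharCode);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Hypothetical data: one group covering part of CJK Extension B, starting at glyph 1.
        Group[] groups = { new Group(0x20000, 0x2A6D6, 1) };
        System.out.println(glyphFor(groups, 0x2070E)); // the character reported earlier in the thread
        System.out.println(glyphFor(groups, 0x41));    // not covered by this sample data
    }
}
```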

          Thomas T.

          (In reply to comment #29)
          > (In reply to comment #28)
          > > Sorry not understand your requested fonts clearly. Is there any
          > > software/tools
          > > to check the fonts supported the 'cmap' you mentioned?
          > > I tried Microsoft Font Properties extension tools
          > > http://www.microsoft.com/typography/TrueTypeProperty21.mspx
          > > to check if i got fonts that suit, but it didn't involve the cmap properties.
          >
          > One option is the 'ttx' tool in the Adobe Font Development Kit for Opentype (AFDKO)

          Thomas T. added a comment -

          Hi,
          Do the fonts I suggested help? Or do I need to find others?

          Thomas T.

          Glenn Adams added a comment -

          (In reply to comment #31)
          > Is my suggested fonts help? Or i need to find another??

          Yes, it will be helpful when I am ready to start working on this bug. I do not have a schedule for when I will start. Thanks for checking which Windows fonts support non-BMP encodings.

          Shepard Lee added a comment -

          Hi All,

          I encountered the same issue on my applications using fop 1.0.
          Glad to see the issue is going to be fixed in the coming version.
          May I know if this bug will be fixed in version 1.1 only or it will be patched in version 1.0, too?

          Shepard

          Glenn Adams added a comment -

          (In reply to comment #33)
          > Hi All,
          >
          > I encountered the same issue on my applications using fop 1.0.
          > Glad to see the issue is going to be fixed in the coming version.
          > May I know if this bug will be fixed in version 1.1 only or it will be
          > patched in version 1.0, too?

          No, this is NOT going to be fixed in the upcoming version. I have made NO statements about when this will be addressed in FOP.

          In particular, it will NOT be patched in 1.0 and will NOT be addressed in 1.1. This is a POSSIBLE 1.2 (or later) fix.

          Jacky added a comment -

          (In reply to comment #34)
          > (In reply to comment #33)
          > > Hi All,
          > >
          > > I encountered the same issue on my applications using fop 1.0.
          > > Glad to see the issue is going to be fixed in the coming version.
          > > May I know if this bug will be fixed in version 1.1 only or it will be
          > > patched in version 1.0, too?
          >
          > No, this is NOT going to be fixed in the upcoming version. I have made NO
          > statements about when this will be addressed in FOP.
          >
          > In particular, it will NOT be patched in 1.0 and will NOT be addressed in 1.1.
          > This is a POSSIBLE 1.2 (or later) fix.

          Hi Adams,
          Nice to see that you have considered this thread. As far as I know, even mainframes have similar issues supporting surrogate pairs. Is there any workaround if we stick to the latest FOP version, or any news on a tentative rollout of v1.2?

          Jacky

          Glenn Adams added a comment -

          (In reply to comment #35)
          > Nice to see you'd considered this thread. As I knew, even mainframe has
          > similiar issues in using supporting surrogate pairs. Is that any workaround
          > if stick to latest FOP version, or any news on tenative rollout of v1.2?

          FOP 1.1rc1 was just released, and 1.1 will perhaps be released about one month later. After that, I intend to put this work item on my list for possible 1.2 features. There is no schedule for 1.2, but I'd like to do it by the end of this year.

          Sameh Ayoub added a comment -

          (In reply to comment #36)
          > (In reply to comment #35)
          > > Nice to see you'd considered this thread. As I knew, even mainframe has
          > > similiar issues in using supporting surrogate pairs. Is that any workaround
          > > if stick to latest FOP version, or any news on tenative rollout of v1.2?
          >
          > FOP 1.1rc1 was just release, and perhaps one month later 1.1 will be
          > released. After that, I intend to put this work item on my list for possible
          > 1.2 features. There is no schedule for 1.2, but I'd like to do it by the end
          > of this year.

          Hi Glenn,
          Thanks for considering adding this feature in FOP 1.2 by the end of this year.

          We are using FOP 1.1, and we would like to have this feature as soon as it is added.

          So, we are wondering:

          • Do you think this will be done in the near future?
          • Can the solution be patched into FOP 1.1 easily?

          Thanks for your coordination.

          Gaja Sutra added a comment -

          (In reply to comment #3)

          > > request to fix this to support surrogate pairs characters.
          I'm interested in this feature too.

          > thanks for your request; could you provide additional information:
          > 1. what specific non-BMP characters you would like to use?
          Alchemical Symbols, for example: 1F701 -> 1F704

          > 2. what specific fonts will you use for these characters?
          The Symbola font, for example.
          http://users.teilar.gr/~g1951d/

          Thanks,
          Gaja.

          Justus Piater added a comment -

          > 1. what specific non-BMP characters you would like to use?
          Mathematical Alphanumeric Symbols (1D400–1D7FF)

          > 2. what specific fonts will you use for these characters?
          E.g. STIX, GNU FreeFont

          Vinesh Kumar added a comment -

          Hi Glenn,

          We are using FOP 2.0 and are looking for non-BMP character support (for the full CJK Unicode ranges). When can we expect support for non-BMP characters in FOP?

          Regards,
          Vinesh Kumar. D

          Simone Rondelli added a comment -

          Hi FOP Users,

          I am working on a project that uses Apache FOP and, as part of that project, need to fix FOP-1969 [1], which has to do with supplementary character support (surrogate pairs). I have obtained approval to contribute these changes back to the community. I want to run my design past the list (and especially Glenn Adams) and ask a few questions before proceeding:

          1. Read the CMAP in OpenFont.readCMAP(), implementing the case cmapPID == 3 && cmapEID == 10 && cmapFormat == 12. This way I could fill the unicodeMappings list correctly.
          2. Fix the class GlyphMapping to support non-BMP code points (there are already some TODOs in the class for non-BMP code point support).
          3. The class GlyphMapping uses org.apache.fop.fonts.Font class methods such as Font.hasChar(char c), Font.getCharWidth(char c), Font.mapChar(char c), etc. Since they accept a single char and a surrogate pair is composed of two chars, I will need to modify the Font class as well. I think that I should either:
            1. Add overloaded methods that accept int so that we can pass code points. An alternative is to create a different set of methods with a Codepoint suffix: Font.hasCodepoint(int cp), Font.getCodePointWidth(int cp), Font.mapCodepoint(int cp), etc.
            2. Change the method signatures to accept/return int.
          4. The class Font uses the interface Typeface, which has the same problem: methods that accept char. We should either change this interface or one of its subclasses such as MultiByteFont or CIDFont (which denote fonts with a large set of code points).
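
          As an illustration of the overloading option in item 3, an int-based method could be added next to the existing char-based one, with a default that keeps char-only implementations working. This is a sketch only: GlyphSource and MapBackedFont are hypothetical stand-ins and do not reproduce FOP's Typeface or Font classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a char-based font interface (not FOP's Typeface). */
interface GlyphSource {

    /** Existing style of method: can only see one UTF-16 code unit. */
    boolean hasChar(char c);

    /**
     * Added code point variant. By default BMP code points delegate to the
     * char method and non-BMP code points are reported as unsupported, so
     * existing implementations keep compiling and behaving as before.
     */
    default boolean hasCodePoint(int codePoint) {
        return Character.isBmpCodePoint(codePoint) && hasChar((char) codePoint);
    }
}

/** Hypothetical implementation that overrides the code point method to cover non-BMP. */
final class MapBackedFont implements GlyphSource {
    private final Map<Integer, Integer> glyphByCodePoint = new HashMap<>();

    MapBackedFont add(int codePoint, int glyphIndex) {
        glyphByCodePoint.put(codePoint, glyphIndex);
        return this;
    }

    @Override
    public boolean hasChar(char c) {
        return glyphByCodePoint.containsKey((int) c);
    }

    @Override
    public boolean hasCodePoint(int codePoint) {
        return glyphByCodePoint.containsKey(codePoint);
    }
}
```

          With this shape, only font classes that can actually carry supplementary characters need to override the int variant, which is the trade-off weighed between the two options above.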

          My research has reached this point, and before proceeding I would like some feedback on whether I'm heading in a good direction and whether I'm missing something.

          (I sent the same message to the mailing list, I'm posting here to make clear that somebody is willing to work on it)

          Regards
          Simone Rondelli

          Saašha Metsärantala added a comment -

          Hello Simone!

          THANKS a lot for that very welcome news!

          Unfortunately, I am not a Java expert (otherwise, I would have fixed this bug a long time ago!), but your description seems good as far as I can tell.

          Good luck!

          Regards!

          Saašha,

          Glenn Adams added a comment -

          Yes, this is the basic approach one would take. You will need to track down all of the methods/fields where char is used and change it to int or add a new method with an int signature (for methods). Then you will need to find all call sites for these use sites and change them to extract Unicode code points (integers in the range [0,1114111]) to pass to the changed/new methods.
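
          At a call site, extracting code points from a Java string instead of iterating over chars might look like the sketch below. It uses only the standard java.lang.Character methods; the class and method names are illustrative and are not FOP code.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: walk a UTF-16 sequence by code point rather than by char. */
final class CodePointWalk {

    static List<Integer> codePoints(CharSequence text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            // Combines a high/low surrogate pair into one code point when present.
            int codePoint = Character.codePointAt(text, i);
            result.add(codePoint);
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // "A" followed by U+2070E, which Java stores as a surrogate pair.
        String text = "A" + new String(Character.toChars(0x2070E));
        for (int codePoint : codePoints(text)) {
            System.out.println("U+" + Integer.toHexString(codePoint).toUpperCase());
        }
    }
}
```

          Each resulting int can then be passed to an int-based font method, whereas a char loop would hand the two surrogate halves to the font separately.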

          You will need to do this while not breaking the current tests and you will need to add new tests to cover non-BMP use cases.
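
As a rough illustration of what "extracting Unicode code points" at a call site can look like, here is a minimal Java sketch. The class and method names (CodePointWalk, codePointsOf) are hypothetical and are not part of FOP; only Character.codePointAt and Character.charCount are standard JDK calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not FOP code: walks a UTF-16 sequence by code point.
final class CodePointWalk {
    /** Returns the code points of s, combining each valid surrogate pair into one int. */
    static List<Integer> codePointsOf(CharSequence s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i); // reads one or two chars
            out.add(cp);
            i += Character.charCount(cp);         // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        // "A" followed by U+1F600, which UTF-16 encodes as the pair D83D DE00
        for (int cp : codePointsOf("A\uD83D\uDE00")) {
            System.out.println(Integer.toHexString(cp)); // prints 41, then 1f600
        }
    }
}
```

A loop of this shape replaces the usual `for (char c : ...)` iteration, so that each value passed on to an int-based method is a whole code point rather than half of a surrogate pair.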

          You will probably want to create a fork of the FOP repository in github and do your work on a branch of that fork.

          Good Luck,
          Glenn

          Simone Rondelli added a comment -

          Hi Glenn,

          Thanks for replying. Let me ask for a couple more clarifications regarding these two options:

          1. If I change the method signatures to use int instead of char (e.g. starting from the Typeface interface), then a lot of code will need to be changed. I can see more than 20 direct/indirect subclasses, which are used quite broadly in the project. I'm fine with changing them all, but I want to be sure that this is what you (or the Apache FOP maintainers) want.
          2. Otherwise, I can add overloaded methods at some point deeper in the hierarchy in order to reduce the number of changed classes (most of them would not even need to support code points, since they are not meant to show such a big range of characters). I think the right place to put the overloaded methods would be either MultiByteFont or CIDFont, and then change the clients of the chosen class accordingly.

          Could you tell me which approach you think is best for the project?

          As soon as I have a working version I'll create a fork and branch the project.

          Thanks
          Simone

          Glenn Adams added a comment -

          There are pros and cons to both approaches.

          The biggest problem to changing signatures rather than adding new signatures is that it will require a new major version update because it will be changing public APIs or APIs that have been effectively treated as public even though they might be argued to be internal. Doing this will cause more difficulty for existing programmatic uses of FOP since it will require more changes than the alternative.

          However, the alternative, to add new signatures, means that users of existing signatures will not benefit from the change, and may end up viewing these cases as bugs. In such case, some code paths will function correctly with non-BMP content but others will not, and determining which is which and addressing those cases will create a long "tail" for this change, i.e., one that requires many follow-on changes.

          Having said that, I would probably take the conservative approach and do the latter (add signatures) rather than the former (change signatures). That should allow you to get some key code paths working sooner than others, but it will create an obligation for further downstream work.
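
To make the "add signatures" option concrete, here is a minimal Java sketch. All type names (GlyphSource, BmpOnlyFont, WideFont) are hypothetical stand-ins and do not reflect FOP's actual Typeface hierarchy; the use of a default interface method also assumes Java 8 or later, which may not match the project's baseline.

```java
/** Hypothetical stand-in for a font type; not FOP's actual Typeface interface. */
interface GlyphSource {
    boolean hasChar(char c); // existing char-based signature, left untouched

    /** Added signature: by default only BMP code points are delegated to hasChar. */
    default boolean hasCodePoint(int cp) {
        return Character.isBmpCodePoint(cp) && hasChar((char) cp);
    }
}

/** Legacy-style implementation: compiles unchanged and inherits the default. */
final class BmpOnlyFont implements GlyphSource {
    public boolean hasChar(char c) {
        return c == 'A';
    }
}

/** Implementation that opts in to supplementary code points. */
final class WideFont implements GlyphSource {
    public boolean hasChar(char c) {
        return hasCodePoint(c); // char widens to int, so this reaches the override
    }

    @Override
    public boolean hasCodePoint(int cp) {
        return cp == 'A' || cp == 0x1F600;
    }
}
```

Existing callers of hasChar keep working in this arrangement, but they still cannot see supplementary characters until they are migrated to hasCodePoint, which is the "long tail" described above.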

          Addison Phillips added a comment -

          Adding signatures would be consistent with Java's own approach to the problem some years ago, and thus has the benefit of being familiar to at least some Java developers.

          Simone Rondelli added a comment -

          Hi Glenn,

          I have a first proof of concept that renders emoji. The first problem is that some emoji are composed of more than one code point, like flags (🇮🇹) and families (👨‍👨‍👦). I found that the information on how to merge several code points (or glyphs) into a single glyph is described in the ligature table in the font.

          The ligature table is associated with a script, and in the font that I'm using (EmojiOne) this table is associated with the latn script. The problem is in GlyphMapping.processWordMapping, where the script of the text is retrieved using String script = text.getScript();. The value returned is zyyy (SCRIPT_UNDEFINED) for text composed only of emoji, and auto for mixed text (latn/cjk + emoji).

          1. Is this a bug in the font or in Apache FOP?
          2. What would be a good approach to fix it?

          I thought that I could modify the logic inside GlyphTable.matchLookups to select * when the script is zyyy or auto. But I have the feeling that this could break something. Am I right?

          Glenn Adams added a comment -

          Firstly, note that this is not related to this issue (FOP-1969), so it is better to create a new issue to document this problem. Secondly, this is somewhat related to FOP-2094 [1].

          [1] https://issues.apache.org/jira/browse/FOP-2094

          Modifying GlyphTable.matchLookups as you suggest would not be the correct solution.

          You might try adding a script="dflt" or script="zyyy" value as a short-term fix.

          Simone Rondelli added a comment -

          I think it's exactly the same issue, and probably I could recycle part of http://svn.apache.org/viewvc?view=revision&revision=r1623885 to fix it.

          Glenn Adams added a comment -

          No, the issue of script tag handling is definitely not related to surrogate pairs. That the surrogate pairs you are testing happen to have issues regarding script mapping is a coincidence.

          Simone Rondelli added a comment -

          I meant the same as FOP-2094.

          Glenn Adams added a comment -

          I agree they are related, but not identical. It is still appropriate to create a new issue, in which you can reference 2094.

          Simone Rondelli added a comment -

          Created: https://issues.apache.org/jira/browse/FOP-2638

          Glenn Adams added a comment -

          thanks

          Simone Rondelli added a comment - - edited

          Would it be a problem if I changed the variable names? I found it pretty hard to follow the flow with the abbreviations, and clearer names might help future developers a bit.

          E.g.:

          CharSequence  ncs = normalize(cs, associations); //normalizedCharSeq
          GlyphSequence igs = mapCharsToGlyphs(ncs, associations); //glyphSeq
          GlyphSequence ogs = gsub.substitute(igs, script, language); //substitutedGlyphSeq
          
          Glenn Adams added a comment -

          Yes, it would; please don't change them, as it will make merging a possible future patch more difficult. You should restrict the changes in a possible patch to the minimum required to support non-BMP; the larger the patch, the more difficult it is to merge.

          Simone Rondelli added a comment -

          OK, got it.
          One more thing: I found that control characters are elided in MultiByteFont.performSubstitution:

          if (!retainControls) {
              ogs = elideControls(ogs);
          }
          

          This prevents some emoji from being shown correctly, e.g. 👨‍👩‍👦 (formed by \u1f468\u200d\u1f469\u200d\u1f466). In this case \u200d is elided, making it impossible to show the emoji correctly. The value of retainControls is statically set to false in TextLayoutManager.

          • Is there any reason why this is always false?
          • What would be the logic to set it to true?
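
For reference, the \u1f468 notation used above is informal: in Java source, a \u escape takes exactly four hex digits, so supplementary code points have to be written as surrogate pairs or built from ints. The following sketch (the class name ZwjSequence and method fromCodePoints are hypothetical, not FOP code) builds the same family sequence and compares its UTF-16 length with its code point count.

```java
// Hypothetical helper, not FOP code: builds an emoji ZWJ sequence from code points.
final class ZwjSequence {
    static final int ZWJ = 0x200D; // ZERO WIDTH JOINER

    /** Builds a string from code points; appendCodePoint emits surrogate pairs as needed. */
    static String fromCodePoints(int... cps) {
        StringBuilder sb = new StringBuilder();
        for (int cp : cps) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String family = fromCodePoints(0x1F468, ZWJ, 0x1F469, ZWJ, 0x1F466);
        System.out.println(family.length());                            // 8 UTF-16 code units
        System.out.println(family.codePointCount(0, family.length()));  // 5 code points
    }
}
```

The sequence is five code points but eight chars, so any index or count taken from it needs to state which of the two units it is measured in.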
          Glenn Adams added a comment -

          I'm not sure what you mean by "this prevents from correctly show some emoji". The semantics of the elided controls, e.g., ZWJ, ZWNJ, need to be processed by GSUB/GPOS during the process of creating the output glyph sequence. No further processing based on ZWJ/ZWNJ should occur after that process. The elision of controls operates on the output glyph sequence.

          The value of retainControls is presently constant, but it is my plan to introduce a new fox:... property that allows an author to determine whether control characters are themselves displayed (as control characters). So it's a placeholder whose value is to be determined by the to-be-implemented fox:... property. Note that whether controls can be displayed by a font is a per-font dependency.

          Simone Rondelli added a comment - - edited

          I see what you mean. So maybe there is a bug in elideControls, because if I comment it out the emoji is printed; otherwise it is not.

          I have as an input a GlyphSequence with the following values:

          • characters: \u1f468\u200d\u1f469\u200d\u1f466
          • glyphs: 1643
          • association: [0, 5]

          After calling elideControls on it, I get as output a GlyphSequence with the following values:

          • characters: \u1f468\u200d\u1f469\u200d\u1f466
          • glyphs: 0
          • association: []

          As you can see, the characters are still there, while the glyph and the association were elided. As a result, subsequent operations such as mapGlyphsToChars(ogs) return an empty array, and therefore nothing is printed in the PDF.

          If I understand correctly the semantic of elide controls you want something like this:

          • characters: \u1f468\u1f469\u1f466
          • glyphs: 1643
          • association: [0, 3]

          Am I correct, or am I still missing something?

          Thanks

          Glenn Adams added a comment -
          If I understand correctly the semantic of elide controls you want something like this:
          characters: \u1f468\u1f469\u1f466
          glyphs: 1643
          association: [0, 3]

          elideControls() does not change the original characters array, it only removes (elides) glyphs associated with elidable control characters in the original characters array;

          you need to step through the code in elideControls in an IDE in order to find why it isn't inserting the non-elided glyphs

          ASF GitHub Bot added a comment -

          GitHub user monejava opened a pull request:

          https://github.com/apache/fop/pull/3

          FOP-1969: Surrogate pairs not treated as single unicode codepoint for…

          Implemented correct handling of surrogate pairs in Apache FOP. The supported renderers are PDF, PS and PNG. Tests were implemented where possible.

          Here is a brief explanation of the design choices that I have made in modifying the public API:

          `mapChar(char)`/`hasChar(char)`: these are defined in `Typeface`, which means that they have more than 20 implementations. Modifying this interface would require a lot of work and might introduce a lot of bugs. That's why Glenn Adams (our contact in the Apache FOP project) asked us to create new methods rather than change the existing ones. In some of these implementations, such as `SingleByteFont`, it is semantically correct to have a character represented by a single UTF-16 char. In other implementations, such as `CIDFont` (http://www.adobe.com/products/postscript/pdfs/cid.pdf), it is not, since they are meant to cover a wider range than 2^16 characters.

          `mapCodePoint(int)`/`hasCodePoint(int)`: I have added these two methods, which use int (code points) instead of char, to the `CIDFont` class so that we can cover the full Unicode range. As you can see from the `Typeface` hierarchy, this change affects only 2 classes.

          `getUnicode()`: this is defined in `CIDSet` (it is not a member of the `Typeface` class or one of its subclasses). I changed the signature of this method to handle int instead of char because it is semantically incorrect to represent a Unicode code point with a single UTF-16 char. As you can see from the `CIDSet` hierarchy, the change affects only 3 classes.

          `getUnicodeFromGID()`: this method is defined in `CustomFont` and `CIDSet`. It never gets called from the `MultiByteFont` path, probably because getUnicode is used instead. That is why I'm downcasting the return value from int to char in `CIDFull` and `CIDSubset`. Probably the best thing to do would be to get rid of this method or make it handle int, but again the change would affect more classes than the ones in our scope.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/monejava/fop surrogate_pairs

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/fop/pull/3.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3


          commit 111d6a6fa58c313293e9b79e245c8521778de2c8
          Author: Rondelli <rondelli@amazon.com>
          Date: 2016-09-19T15:13:09Z

          FOP-1969: Surrogate pairs not treated as single unicode codepoint for display purposes


          simon steiner added a comment - - edited

          checkstyle and findbugs are failing

          and I get the following from the few examples I tried:

          java.lang.NullPointerException
          at org.apache.fop.fonts.CIDSubset.getGIDFromChar(CIDSubset.java:133)

          java.lang.NullPointerException
          at org.apache.fop.svg.font.FOPGVTGlyphVector.buildBoundingBoxes(FOPGVTGlyphVector.java:429)

          java.nio.BufferOverflowException
          at java.nio.Buffer.nextPutIndex(Buffer.java:521)
          at java.nio.HeapCharBuffer.put(HeapCharBuffer.java:169)
          at org.apache.fop.fonts.MultiByteFont.mapGlyphsToChars(MultiByteFont.java:716)

          Simone Rondelli added a comment -

          Hi Simon,

          Thanks for taking a look at this. Could you please share the examples you tried?

          Glenn Adams added a comment -

          Simon,

          Don't merge this PR until it gets considerable testing. Also, the submitter
          needs to do a thorough job of writing new tests.

          Thanks,
          Glenn


          simon steiner added a comment - - edited

          single-byte.zip needs DejaVuSans.ttf

          simon steiner added a comment -

          pcltest.zip needs FRE3OF9X.TTF

          simon steiner added a comment -

          Urdu.zip needs ScheherazadeRegOT.ttf

          simon steiner added a comment - - edited

          visual issue tiffttc.zip needs cambria.ttc

          Simone Rondelli added a comment - - edited

          I see the problem.

          MultiByteFont.java
          private CharSequence mapGlyphsToChars(GlyphSequence gs) {
              int ng = gs.getGlyphCount();
              CharBuffer cb = CharBuffer.allocate(gs.getUTF16CharacterCount());  // <-- Here
              int ccMissing = Typeface.NOT_FOUND;
              for (int i = 0, n = ng; i < n; i++) {
                  int gi = gs.getGlyph(i);
                  int cc = findCharacterFromGlyphIndex(gi); // <-- Problem
                  if ((cc == 0) || (cc > 0x10FFFF)) {
                      cc = ccMissing;
                      log.warn("Unable to map glyph index " + gi
                               + " to Unicode scalar in font '"
                               + getFullName() + "', substituting missing character '"
                               + (char) cc + "'");
                  }
                  if (cc > 0x00FFFF) {
                      int sh;
                      int sl;
                      cc -= 0x10000;
                      sh = ((cc >> 10) & 0x3FF) + 0xD800;
                      sl = ((cc >>  0) & 0x3FF) + 0xDC00;
                      cb.put((char) sh);
                      cb.put((char) sl);
                  } else {
                      cb.put((char) cc);
                  }
              }
              cb.flip();
              return cb;
          }
          

          In Urdu, one character can be mapped to multiple glyphs. This sequence is enough to make the program crash: اآخری. Before my modification the CharBuffer was initialized this way: CharBuffer.allocate(gs.getGlyphCount());. This causes a BufferOverflowException when dealing with surrogate pairs, because one glyph then corresponds to multiple chars. This is why I changed it to CharBuffer.allocate(gs.getUTF16CharacterCount());, which does not work in this case, where a single character is mapped to multiple glyphs.

          Now the question is: what is the correct way to count the characters in the GlyphSequence?

          1. I could use the GlyphSequence.association list and the content of GlyphSequence.characters to count the real number of characters that correspond to the given glyph sequence. The problem I can see is that findCharacterFromGlyphIndex(gi); might return different chars (with different sizes) from the ones in GlyphSequence.characters.
          2. Resize the CharBuffer when it gets full.
          3. Put the chars into a List and then into a CharBuffer.
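
Options 2 and 3 above both amount to using a growable buffer. One way that could be sketched in Java is with a StringBuilder, which grows on demand and can append whole code points. The class name GlyphsToChars, the method toUtf16, and the int[] input are hypothetical stand-ins for the values that findCharacterFromGlyphIndex would produce; this is not the change that was made in FOP.

```java
// Hypothetical sketch, not FOP code: UTF-16 output without a pre-sized CharBuffer.
final class GlyphsToChars {
    /**
     * Converts code points to UTF-16. The 'missing' char is substituted for
     * values that are zero, negative, or above U+10FFFF. (The quoted method
     * checks only for 0 and > 0x10FFFF; negatives are rejected here as well
     * because appendCodePoint throws on them.)
     */
    static CharSequence toUtf16(int[] codePoints, char missing) {
        StringBuilder sb = new StringBuilder(codePoints.length); // initial capacity only
        for (int cc : codePoints) {
            if (cc <= 0 || cc > Character.MAX_CODE_POINT) {
                cc = missing;
            }
            sb.appendCodePoint(cc); // appends one char, or a surrogate pair above U+FFFF
        }
        return sb;
    }
}
```

For example, `GlyphsToChars.toUtf16(new int[] {0x41, 0x1F468, 0}, '#')` yields the four chars `A`, `\uD83D`, `\uDC68`, `#`. Since StringBuilder implements CharSequence, a method with the same return type as mapGlyphsToChars could return it directly, and the manual high/low surrogate arithmetic is no longer needed.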

          Any thoughts?

          PS: Why is the character retrieved using findCharacterFromGlyphIndex(gi); instead of using the characters inside the GlyphSequence?

          Show
          Simone Rondelli added a comment - - edited I see the problem. MultiByteFont.java private CharSequence mapGlyphsToChars(GlyphSequence gs) { int ng = gs.getGlyphCount(); CharBuffer cb = CharBuffer.allocate(gs.getUTF16CharacterCount()); \\ <-- Here int ccMissing = Typeface.NOT_FOUND; for ( int i = 0, n = ng; i < n; i++) { int gi = gs.getGlyph(i); int cc = findCharacterFromGlyphIndex(gi); \\ <--Problem if ((cc == 0) || (cc > 0x10FFFF)) { cc = ccMissing; log.warn( "Unable to map glyph index " + gi + " to Unicode scalar in font '" + getFullName() + "', substituting missing character '" + ( char ) cc + "'" ); } if (cc > 0x00FFFF) { int sh; int sl; cc -= 0x10000; sh = ((cc >> 10) & 0x3FF) + 0xD800; sl = ((cc >> 0) & 0x3FF) + 0xDC00; cb.put(( char ) sh); cb.put(( char ) sl); } else { cb.put(( char ) cc); } } cb.flip(); return cb; } In Urdu language one character is mapped to multiple glyphs. This sequence is enough to make the program crash اآخری. Before my modification the CharBuffer was initialized in this way: CharBuffer.allocate(gs.getGlyphCount(); . This cause again a BufferOverflow error when you deal with Surrogate Pairs because you have one glyph corresponding to multiple characters. This is why I have changed it to CharBuffer.allocate(gs.getUTF16CharacterCount(); . Which is not working in this case were a single character is mapped to multiple glyphs. Now the question is: what is the correct way to count the characters into the GlyphSequence? I could use the GlyphSequence.association list and the content of GlyphSequence.characters to count the real number of characters that corresponds to the given glyph sequence. The problem that I can see is that the findCharacterFromGlyphIndex(gi); might return a different chars (with different sizes) from the ones into GlyphSequence.characters. Resize the CharBuffer when it gets full Put the chars into a List and then into a CharBuffer Any thoughts? 
PS: Why the character is retrieved using findCharacterFromGlyphIndex(gi); instead of using the characters inside the GlyphSequence?
          Glenn Adams added a comment -

          It will be necessary to change all GlyphSequence to use IntBuffer instead
          of CharBuffer, then you will have to convert from UTF-16 to UTF-32 to fill
          the IntBuffer

          On Tue, Sep 20, 2016 at 8:47 AM, Simone Rondelli (JIRA) <jira@apache.org>
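
Editorial sketch of the UTF-16 to UTF-32 conversion mentioned above, using only standard java.lang.Character and java.nio.IntBuffer calls. The class and method names are hypothetical and do not correspond to anything in FOP.

```java
import java.nio.IntBuffer;

// Hypothetical sketch, not part of FOP.
class Utf16ToUtf32Sketch {

    // Decodes UTF-16 chars into one int per Unicode code point.
    static IntBuffer toCodePoints(CharSequence text) {
        int count = Character.codePointCount(text, 0, text.length());
        IntBuffer ib = IntBuffer.allocate(count);
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            ib.put(cp);
            // Advance by 2 chars for a surrogate pair, by 1 otherwise.
            i += Character.charCount(cp);
        }
        ib.flip();
        return ib;
    }

    public static void main(String[] args) {
        IntBuffer ib = toCodePoints("A\uD83D\uDC68");
        System.out.println(ib.remaining()); // prints 2
    }
}
```

An unpaired surrogate is passed through as its own code point by both codePointCount and codePointAt, so the allocated size and the number of values written stay consistent even for malformed input.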

          Simone Rondelli added a comment -

          GlyphSequence is already using an IntBuffer to store the characters. Did you mean MultiByteFont? This would mean changing a lot of interfaces such as Substitutable and Positionable and their implementations (Font, MultiByteFont, CustomFontMetricMapper, LazyFont). Is this what you want?

          Regardless of the answer to the previous question, it is still not clear to me why findCharacterFromGlyphIndex(gi) is used instead of the characters inside GlyphSequence.

          Glenn Adams added a comment -

          It appears I introduced this code in:

          r1293736 | gadams | 2012-02-26 02:29:01 +0000 (Sun, 26 Feb 2012) | 1 line

          http://svn.apache.org/viewvc/xmlgraphics/fop/trunk/src/java/org/apache/fop/fonts/MultiByteFont.java?limit_changes=0&r1=1293736&r2=1293735&pathrev=1293736

          I don't have a direct recollection of the rationale for using findCharacterFromGlyphIndex instead of using the GlyphSequence, but I would speculate that it is because the chars in the GS correspond to the original input characters, while the font's reverse mapping from glyph indices to characters includes dynamically generated character codes (assigned to the PUA) when a glyph index is not associated with a standard Unicode character in the CMAP.

          For each font instance, new character codes from the PUA are dynamically assigned when a reverse mapping can't be found in the CMAP.

          However, I would have to run some tests through a debugger to verify this case. My guess is that if you change this code to use the GS input chars, then it will break things in such a scenario.
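
Editorial sketch of the behaviour speculated about above: a reverse glyph-to-character map that hands out a fresh Private Use Area code point the first time it sees a glyph with no CMAP entry. Everything here (class name, method names, the choice of the BMP PUA range 0xE000-0xF8FF, returning 0 on exhaustion) is an assumption for illustration and is not taken from the FOP sources.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of FOP.
class PuaReverseMappingSketch {
    private static final int PUA_FIRST = 0xE000; // start of the BMP Private Use Area
    private static final int PUA_LAST = 0xF8FF;  // end of the BMP Private Use Area

    private final Map<Integer, Integer> glyphToChar = new HashMap<Integer, Integer>();
    private int nextPua = PUA_FIRST;

    // Registers a reverse mapping that would normally come from the CMAP.
    void addCmapEntry(int glyphIndex, int codePoint) {
        glyphToChar.put(glyphIndex, codePoint);
    }

    // Returns the mapped character, assigning a new PUA code point if needed.
    int charForGlyph(int glyphIndex) {
        Integer cc = glyphToChar.get(glyphIndex);
        if (cc == null) {
            if (nextPua > PUA_LAST) {
                return 0; // PUA exhausted; the caller treats 0 as "not found"
            }
            cc = nextPua++;
            glyphToChar.put(glyphIndex, cc); // remember it for this font instance
        }
        return cc;
    }

    public static void main(String[] args) {
        PuaReverseMappingSketch font = new PuaReverseMappingSketch();
        font.addCmapEntry(5, 'A');
        System.out.println(font.charForGlyph(5));   // prints 65
        System.out.println(font.charForGlyph(900)); // prints 57344
    }
}
```

Under this model the characters returned for a ligature glyph can differ from the input characters stored in the GlyphSequence, which would be consistent with the concern that switching to the GS input chars might break such cases.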

          Simone Rondelli added a comment -

          I see, let's keep it like this for now. But what about the usage of IntBuffer?

          BTW, here I would first fill a list and then fill the buffer (solution 3), since it looks like the only one that is 100% safe.

          Simone Rondelli added a comment -

          I think I see where this is the case. For ligatures, the glyph is assigned a single character created with MultiByteFont.createPrivateUsageMapping(int gi).

          E.g., in the emoji case the text \uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC66 (👨‍👩‍👦) is assigned the character 57344 (0xE000, the first code point of the Private Use Area, which lies directly after the low-surrogate range 0xDC00-0xDFFF).
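
Editorial check of the numbers in this example, using only standard java.lang.Character and String calls. The class name is hypothetical. It shows how many UTF-16 chars and how many code points the emoji sequence contains, and how Java classifies the value 57344.

```java
// Hypothetical sketch, not part of FOP.
class EmojiSequenceSketch {
    // MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, BOY
    static final String FAMILY =
            "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC66";

    static int utf16Length() {
        return FAMILY.length();
    }

    static int codePointLength() {
        return FAMILY.codePointCount(0, FAMILY.length());
    }

    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        System.out.println(utf16Length());                          // prints 8
        System.out.println(codePointLength());                      // prints 5
        System.out.println(Integer.toHexString(57344));             // prints e000
        System.out.println(isPrivateUse(57344));                    // prints true
        System.out.println(Character.isLowSurrogate((char) 57344)); // prints false
    }
}
```

So eight input chars (five code points) collapse into one private-use character after ligature substitution, which is why neither the glyph count nor the input char count is a reliable size for the output buffer.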

          Simone Rondelli added a comment - - edited

          Hi Simon,

          I have updated the pull request:

          • Checkstyle now succeeds
          • FindBugs fixed: I don't see any issues, only a lot of warnings that were already there. I had to use version 3.0.4 of the maven plugin to make it work with Java8 (https://github.com/DavidWhitlock/PortlandStateJava/issues/44). Maybe it is worth keeping this version.
          • Fixed NullPointerException
          • Fixed BufferOverflowException: The fix is not definitive yet; I'm still discussing with Glenn what the best way to go about it is.

          As for the visual issue in the TIFF, I think you referred to the text printed with Cambria Math. I checked the CAMBRIA.TTC font and it looks like the sub font Cambria Math contains only one glyph. This is probably what is causing the issue. I tried with a different font file (cambria.ttf) and it worked fine.

          It should now be possible to run some more tests and start reviewing my code. Let me know if I should add more tests than the ones I have currently implemented.

          simon steiner added a comment -

          I run findbugs 2.0.3 under java7, since it is not compatible with java8 and findbugs 3 doesn't support java6

          Simone Rondelli added a comment -

          It looks like Findbugs 3.0.4 works fine with Java8 as runtime and Java6 as target platform. BTW, I have run findbugs 2.5.5 (the one defined in the pom.xml) again with Java7 as runtime and the results are the same: I don't see new issues in the code I've added/modified.

          simon steiner added a comment -

          Seems mvn config is not running checkstyle on test code, could you try ant checkstyle in fop dir

          simon steiner added a comment -

          Before your changes this was working fine:
          <fo:block font-family="CambriaM">test</fo:block>

          simon steiner added a comment -

          Were you able to fix the single-byte.zip issue?

          Simone Rondelli added a comment -

          Yep! You can check the renderings here https://www.dropbox.com/sh/tlbxihfr912b03p/AADv-4RsQBSBFk90-NXVxLH0a?dl=0

          simon steiner added a comment -

          For me postscript is failing for single-byte.zip

          Simone Rondelli added a comment -

          Do you mean that the text is messed up or do you get an error?

          simon steiner added a comment -

          I get an error

          Simone Rondelli added a comment -

          I cannot reproduce it. Could you please share your exact configuration and run parameters?

          Simone Rondelli added a comment - - edited

          Done, I ran checkstyle with ant, fixed all the issues and updated the pull request. It is pretty simple to do the same with maven using <includeTestSourceDirectory>true</includeTestSourceDirectory>. Do you want me to do that?

          simon steiner added a comment -

          Updated zip with fop.xconf

          Simone Rondelli added a comment -

          I can reproduce it and am working on it now. As for checkstyle with mvn, I have enabled it on the test sources as well and fixed the few additional errors (some problems with statements like this one: try { is.close(); } catch (Exception e) { /* NOP */ }). Do you want me to merge these changes as well in the next commit?

          Simone Rondelli added a comment -

          Updated the pull request. It now handles single-byte fonts correctly.

          simon steiner added a comment -

          Patch has:
          + System.out.println("Added width "
          + + mtxTab[ansiGlyphIdx].getWx()
          + + " uni: " + j
          + + " ansi: " + aIdx);

          Simone Rondelli added a comment -

          Updated.

          Simone Rondelli added a comment -

          Updated again.

          Simone Rondelli added a comment -

          updated again with a fix in Java2DRenderer

          Glenn Adams added a comment -

          Please document your design and implementation approach at http://wiki.apache.org/xmlgraphics-fop/DeveloperPages.

          Simone Rondelli added a comment -

          I don't have the rights to create/edit pages. Could you give me the permissions?

          Simone Rondelli added a comment -

          Since I cannot modify the wiki, I have put the updated design document in the GitHub pull request here: https://github.com/apache/fop/pull/3 As soon as I receive the permissions I'll put it on the wiki as well.

          Luis Bernardo added a comment -

          You should have write permissions now.

          Simone Rondelli added a comment -

          Wiki page created: https://wiki.apache.org/xmlgraphics-fop/SurrogatePairs

          Let me know if you have any comments.

          simon steiner added a comment -

          Should we be depending on pdfbox in fop? Will it affect using ant junit? Shouldn't you use PDDocument.load in extractTextFromPDF?

          Simone Rondelli added a comment -

          The project is already using org.apache.pdfbox:fontbox, so I thought that using org.apache.pdfbox:pdfbox as a test dependency would not be a big deal.

          And yes, I did not consider the Ant build, but I can easily add the jar.

          PDDocument does not contain a load() method. The current implementation of extractTextFromPDF() works as expected.

          simon steiner added a comment -

          See https://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/pdmodel/PDDocument.html#load%28byte[]%29

          Simone Rondelli added a comment -

          Ok, now I see it. I updated the pull request to use PDDocument.load() and added the pdfbox jar to the build directory for Ant, even though I don't know how to make Ant run all the tests (I tried several targets: tests, junit, junit-all, etc.).

          Simone Rondelli added a comment -

          Hi Simon/Glenn,

          As Glenn mentioned in an offline thread, he wants this code to be reviewed by at least 2 to 3 PMC reviewers before proceeding with merging it.

          What is the procedure to have this done?

          Simone Rondelli added a comment -

          Any update on this issue?

          Simone Rondelli added a comment -

          Hello, is there any update on when somebody will review the code for this fix?


            People

            • Assignee:
              Unassigned
              Reporter:
              Glenn Adams
            • Votes:
              5
              Watchers:
              14
