Bug 27901 - [PATCH] TextCharIterator.remove() does not work properly
Summary: [PATCH] TextCharIterator.remove() does not work properly
Status: CLOSED FIXED
Alias: None
Product: Fop - Now in Jira
Classification: Unclassified
Component: page-master/layout (show other bugs)
Version: trunk
Hardware: PC Windows XP
: P3 normal
Target Milestone: ---
Assignee: fop-dev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-03-24 11:50 UTC by Luca Furini
Modified: 2012-04-01 07:09 UTC (History)
0 users



Attachments
Sample fo file to see the problem (819 bytes, text/plain)
2004-03-24 11:51 UTC, Luca Furini
Details
proposed patch (4.54 KB, patch)
2004-03-26 11:32 UTC, Luca Furini
Details | Diff
proposed patch (4.54 KB, patch)
2004-03-26 11:32 UTC, Luca Furini
Details | Diff
another possible solution (2.87 KB, text/plain)
2004-03-26 11:43 UTC, Christian Z.
Details
my patch (2.38 KB, patch)
2004-04-03 20:03 UTC, Simon Pepping
Details | Diff
An fo file that shows the results of the patch (1.56 KB, text/xml)
2004-04-03 20:04 UTC, Simon Pepping
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Luca Furini 2004-03-24 11:50:37 UTC
The method remove() of the TextCharIterator class, defined private in the file 
fo/FOText.java, should remove the current character, pointed to by curIndex, 
but its effect is to remove the one pointed to by the start index.

The fo file I am going to attach can be used to see the problem; there are two 
block with the same text, but in the second one there are some extra spaces 
that should be removed as white-space-collapse default value is true; instead 
of the extra spaces, the first character are removed.

Now, the code is:

  public void remove() {
      if (start < ca.length) {
          start++;
      }
  }

A simple solution could be shifting the whole array of character every time 
remove is called, but that would not be very efficient, especially with big 
arrays:

  public void remove() {
      if (start < ca.length) {
          for (int i = curIndex - 1;
               i > start;
               ca[i] = ca[--i]) {
              // nothing
          }
          start++;
      }
  }

Maybe a better idea would be to store the indexes of the character to remove 
(a kind of "black list"), and to actually remove them when creating the 
TextLayouyManager.
Instead of calling a single System.arraycopy, the TextLayoutManager 
constructor should call a method of the FOText class, which would copy sub-
array of consecutive "good" indexes between  two indexes in the black list (or 
between the array start and the first black index, or between the last black 
index and the array end).
Comment 1 Luca Furini 2004-03-24 11:51:29 UTC
Created attachment 10947 [details]
Sample fo file to see the problem
Comment 2 Glen Mazza 2004-03-24 22:49:56 UTC
Hi Luca,

Thanks, you're completely correct.  I've spent a lot of time trying to get this 
right, but obviously I'm not done yet.

Just FYI, for a long time, we had a problem of a leading space remaining at the 
beginning of the first FO-Text instance of the fo:block (During processing, the 
fo:block is chopped up into FOText instances.)  

The first thing to remember about the FOText instance is not all of the string 
needs to be processed--we have leading spaces to remove for each instance:

{ leading space }  the text is here

Old process (which had the leading space bug), did this:
1.) remove a space
2.) arraycopy everything over one to the left (see [1] at the bottom)
3.) repeat 1&2 until leading space gone (lots of arraycopy's's)

What we were left with is as follows for the TextLayoutManager to use:

Valid text<--- invalid trailing chars --->
^        ^                               ^
a        b                               c

a - 1st position of string to process (always at position 0) by Text LM
b - "length" param to stop processing for Text LM
c - (irrelevant) end of FOText char array (always remains the same after 
allocation).

So the TextLayoutManager would render from position 0 to a "length" value which 
would point to the last valid value.

My new algorithm was also an attempt to get rid of all the arraycopy's.  So 
instead of removing characters and shifting everything to the left, I just 
wanted to increment a start pointer.  So what TextLayoutManager renders is from 
(start character, natural end of char buffer of the FOText object.)

{ leading space }  the text is here
                   ^              ^
                   a              b

a - start value
b - natural end of char array  (i.e., array.length)

But my simple solution failed to take into account the need to remove spaces 
*within* the fragment, not just the leading spaces.  So, as you mention, "the   
text is here" would be rendered as "e   text is here."

I may need to bring back the old code, but I'd also like a hybrid approach.  
Extra spaces between words are relatively rare compared to initial leading 
spaces (at least, that what my testing showed.)  I'd like to continue 
incrementing the start value until the first word occurs, *then* start with the 
arraycopy's for the extra spaces between words.

This would mean:

{ leading space }  the text is here  
                   ^              ^  ^
                   a              b  c

a - start value
b - end value (couple of decrements from (c) below due to extra spaces between 
words
c - natural end of char array  (i.e., array.length)

That would involve re-creating the "length" (or "end") value in FOText, and 
using it within TextLayoutManager.

I'll look into this soon (but patches of course are welcome in case you're 
faster than me! ;)

Thanks again (Grazie!), 
Glen

[1] (line 159) http://cvs.apache.org/viewcvs.cgi/xml-
fop/src/java/org/apache/fop/fo/FOText.java?r1=1.16&r2=1.17&diff_format=h
Comment 3 Christian Z. 2004-03-25 15:46:29 UTC
Hi,
if I got everything right, I think the most oo way do this would be to use
java.util.LinkedList:

new LinkedList(Arrays.asList(ca));

Then you could also use the ListIterator provided by LinkedList and
delegate all calls to it.

Hope this helps,
Christian
Comment 4 Glen Mazza 2004-03-25 22:49:22 UTC
For this, I think we need to focus more on the fastest/most optimized solution 
(perhaps making a slight allowance for readability) rather than oo, because 
this is a very critically-called portion of the code.  *All* text gets run 
through here, and this process occurs for approximately every line of the 
output document.  (Perhaps not much of a bad thing, because with optimization, 
what you lose with oo elegance you sometimes make up with mathematical 
elegance!)

Glen
Comment 5 Christian Z. 2004-03-26 07:55:36 UTC
The advance if the LinkedList solution would be that remove/replace operations
would be very fast. It's only a replacement of two references, no arraycopy.

E.g.

a -> b -> c

is represented in the sun SDK implementation (only looked into 1.4.2)

a.previous = null
a.next = b
b.previous = a
b.next = c
c.previous = b
c.next = null

Removing b would be

a.next = c
c.previous = a

b is thrown away.

The drawback: Besides the LinkedList and its iterator which need to be
generated, further 2*chrarray.length objects (length Characters from
Arrays.asList + length internally from the LinkedList which generates
LinkedList.Entry instances) are needed.

If you prefer little construction overhead this is the wrong thing, if you focus
on fast removing/replacing I think this is the best solution.

Christian Z.
Comment 6 Luca Furini 2004-03-26 11:31:11 UTC
Hi again, this is my try:

- added an ArrayList called "blackList" to the FOText class; when remove is 
called, the index of the character to be removed is stored in this list (and 
nothing else is done); the first element in this blackList is always -1, and 
the last ca.length
- added an int called "length" to the FOText class, to store the computed 
length of the text, which is < ca.length if some characters are removed
- created a new method for the FOText class, called "compactText"; it is 
called in the TextLayoutManager constructor, before modifying textArray; it 
modifies the start and length variables of the FOText node
- textArray is given a dimension of "length" (not ca.length) and then 
arraycopy is called with the start and length computed before

The compactText method:
1) starts examining the black list from the beginning
2) founds two black indexes (position of char to be removed) which are not 
consecutive, so that they define a range of "good" characters
3) if this range is the first one found, nothing is done except updating start 
and length, else the range is shifted to the left, thus removing the "bad" 
characters
and then 2) and 3) until the end of the blacklist is found

 |   Text is   here and  here.|  <- this is the ca array
 xxxx        xx         x     x  <- these are the bad indexes
     --------  --------- -----   <- these are the ranges found beween
                                    two not-consecutive bad indexes
 |   Text is here and here.   |  <- this is the compacted array
     ^start                ^start+length

In this way every good character is shifted at most one time; two times, 
actually, as arraycopy is called with src = dest, and a temporary array is 
created; maybe the TextLayoutManager should use the blacklist to perform 
arraycopy directly between ca and textArray, without compacting the FOText 
array?.

I am going to attach the patch with these changes.

Bye
    Luca
Comment 7 Luca Furini 2004-03-26 11:32:16 UTC
Created attachment 11006 [details]
proposed patch
Comment 8 Luca Furini 2004-03-26 11:32:50 UTC
Created attachment 11007 [details]
proposed patch
Comment 9 Christian Z. 2004-03-26 11:42:05 UTC
OK,
seems I'm too late. :-)
Anyway, yet another quite fast and simple approach. No extra objects -- only one
arraycopy.

Although not really a patch I tried to make it as copy & paste friendly as possible.

Every nextChar call only shifts one char to the left. This method has one little
drawback): You have to ensure that the whole array was processed once
completely, which is the normal case. If not (e.g. if there is a break statement
in the while-loop) the array contains corrupted data, a bug which would be hard
to find. Therefore I think the ca array should be encapsulated in the FOText
class -- what involves some refactoring (the same issue applies to the current
approach). The getText method ensures that the array is processed once.

Hope this is useful
Christian Z.
Comment 10 Christian Z. 2004-03-26 11:43:53 UTC
Created attachment 11008 [details]
another possible solution
Comment 11 Glen Mazza 2004-03-26 12:44:37 UTC
Vielen Dank, Christian Z & Luca--this will be a fun weekend for me--I am
fortunate to have TWO possible implementations to choose from.  I will study
them up. 

Glen
Comment 12 Glen Mazza 2004-04-02 22:05:17 UTC
I hope to get back to this over the weekend--I was working the logging issue 
last weekend.
Comment 13 Simon Pepping 2004-04-03 20:01:49 UTC
I like the idea of the black list. But it is not necessary to make it
explicit. It is possible to remove multiple white space characters in
a single loop. In the attached patch I have implemented this idea in
the constructor of FOText, so that FOText contains a text without
multiple spaces.

The attached test fo file demonstrates that it works well, except when
the block is split up in several FOs. In that case text disappears.  I
have checked that that does not happen in the FOText constructor and
that TextLayoutManager.textArray is also correct. In view of the fact
that the results are different whether the inlines have color or not,
it may also be an error in the renderer.

The intermediate array caLong could be skipped if it were not
necessary to truncate ca to its exact length. This could be done if
every method and client would use node.length instead of
node.ca.length.
Comment 14 Simon Pepping 2004-04-03 20:03:01 UTC
Created attachment 11114 [details]
my patch
Comment 15 Simon Pepping 2004-04-03 20:04:18 UTC
Created attachment 11115 [details]
An fo file that shows the results of the patch
Comment 16 Glen Mazza 2004-04-04 06:41:40 UTC
The BlackList method is not my favorite, because I suspect only a few percent of
extra spaces found are between words--the extra complexity appears unwarranted.
 My debugging has indicated that 95%+ of extra spaces all appear at the
beginning of the FOText object, so it is best to just increment a start index
for those.  For the few percent of errant extra spaces between words, then an
arraycopy can be called, removing the spaces one-by-one.  (Keep in mind, one
month ago we were doing an arraycopy for *every* space, even those at the
beginning of the string, so a 95% reduction in arraycopies is already a huge
improvement.)  Please advise if you've come to different conclusions--but the
complexity doesn't appear necessary to me.  HOWEVER/Note:  I've only been
working with fo:blocks, not with fo:inlines, so my above reasoning may not be
valid when those are taken into consideration.

Anyway, please check the latest solution applied.  Here I brought in the old
solution to solve the extra spaces problem, while keeping the new solution for
the more commonly occurring spaces at the front of fo:text.

The (commented-out) System.out.println's in the remove() method are useful for
visualizing the space removal, and indicate the following:

RemoveA:  Newest method, that just increments the start index until a
word is found.  No array copies needed here, very speedy.

RemoveB and RemoveC:  The previous method, ideally just activated for the
relatively rare occurrences of extra spaces between words.  Unfortunately,
RemoveB and RemoveC are called much more often (50% or so) than they should be 
(As stated above, I suspect < 10%), so more optimization may be needed here. 
(Perhaps an error in my logic of when RemoveA vs. RemoveB should be called.)

Comments most welcome.

Glen
Comment 17 Andreas L. Delmelle 2008-04-28 15:02:34 UTC
Bug seems to be fixed in current trunk. 
Both supplied fo-files seem to render as expected, so I'm closing this one.
Comment 18 Glenn Adams 2012-04-01 07:09:42 UTC
batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed