Issue 15666

Summary: Search and Replace - can't substitute regular expression subexpression in replace
Product: Writer Reporter: cdunham <openoffice>
Component: editing Assignee: stefan.baltzer
Status: CLOSED FIXED QA Contact: issues <issues.openoffice.org>
Severity: trivial    
Priority: P3 CC: basileia, ch.ey, dan.mellem, ecastro, gerhard.schaber, gudmundpublic, gurubert-ooo, info, issues, jes, ooo, openoffice, sashimanek, stx123
Version: OOo 1.1 Beta2 Keywords: ms_interoperability, oooqa, rfe_eval_ok, usability
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: PATCH Latest Confirmation on: ---
Developer Difficulty: ---
Attachments:
Description                                          Flags
Sample expected behavior                             none
proposed patch                                       none
another try                                          none
version 3 of my proposal, now with real & support    none
Test Case                                            none

Description cdunham 2003-06-16 08:03:43 UTC
Expected to be able to search for "si(ll)y" with RE option checked, and replace 
occurrences using subexpression replacement: "\1ogical" -> "llogical". 
 
Also, '&' in replacement string seems to get substituted with search string...
Comment 1 ingenstans 2003-07-16 11:57:01 UTC
& in replacement string behaves as expected and specified in the help 
file. I know it's not what 'normal' REs do.

The inability to use matched substrings in replacement expressions is 
confirmed. It's not stated in the help that it should be possible: 
again, coming from experience of 'real' REs, that looks silly. I'm 
not sure, though, whether it counts as a defect or an enhancement. 
Comment 2 h.ilter 2003-07-16 15:54:00 UTC
Reassigned to SBA
Comment 3 stefan.baltzer 2003-07-21 19:18:52 UTC
It's true that a "&" in the replace box is designed to replace the
found string with itself. Have a look at the help to see the RegEx
options (as they were designed for this product)

I don't get the first problem you report. Please be more precise with
the strings you enter, and the strings you hope to replace, and with
WHAT. I think that a small sample (i.e. in an attached document that
has the respective strings "ready for copy") could be helpful here.

Please comment. Thank you.
Comment 4 cdunham 2003-07-21 20:04:47 UTC
Created attachment 7904 [details]
Sample expected behavior
Comment 5 cdunham 2003-07-21 20:05:29 UTC
Also in attachment: 
 
Here is a list of people: 
 
LN=Blister FN=Mister 
LN=Blow FN=Joe 
LN=Sally FN=Silly 
 
But I really wanted a friendlier format, so I turn on regexp find and 
replace, search for: 
 
^\s*LN=(.*)\s+FN=(.*)\s*$ 
 
with a replace string of: 
 
Hello, \2. How are you and the rest of the \1's? 
 
But alas, it does not work that way in OpenOffice.org. I can only 
substitute the whole string (where is that in the docs, anyway?), and 
\s doesn't seem to match [:space:]... 
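To make the expectation concrete, here is a small Python sketch of the same substitution. It uses Python's `re` module, not OpenOffice.org's engine, so it only illustrates the behaviour being asked for; the `lines` list simply repeats the sample data above.

```python
import re

# Sample data from the attachment, one entry per paragraph.
lines = [
    "LN=Blister FN=Mister",
    "LN=Blow FN=Joe",
    "LN=Sally FN=Silly",
]

# Same search and replace strings as above; \1 and \2 refer to the
# first and second pair of parentheses in the search pattern.
pattern = re.compile(r"^\s*LN=(.*)\s+FN=(.*)\s*$")
replacement = r"Hello, \2. How are you and the rest of the \1's?"

results = [pattern.sub(replacement, line) for line in lines]
for line in results:
    print(line)
```

The first line comes out as "Hello, Mister. How are you and the rest of the Blister's?", which is the kind of result I was hoping to get from Writer.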
 
Hope this helps... 
Comment 6 akrioukov 2003-08-12 14:28:45 UTC
I also think that the ability to substitute matched subexpressions in the
replace string is important. For example, I often have to search for
digit ranges delimited by hyphens (like 121-122), and replace them
with the same ranges, but delimited with en dashes. Currently I use
regexps to find matching strings, but I have to edit them manually :(. 
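
As an illustration of that use case only: a minimal Python sketch (Python's `re` module, not OOo's engine) of the hyphen to en dash replacement, which needs both numbers as subexpressions in the replace string. The function name is made up for this example.

```python
import re

def hyphen_ranges_to_en_dash(text: str) -> str:
    """Replace the hyphen in digit ranges such as 121-122 with an en dash."""
    # \1 and \2 put the two captured numbers back; \u2013 is the en dash.
    return re.sub(r"(\d+)-(\d+)", "\\1\u2013\\2", text)

print(hyphen_ranges_to_en_dash("see pages 121-122"))
```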

And there are other requests for the same functionality: see 11037 and
15515, which probably should be marked as duplicates of this issue (or
this issue as a duplicate of one of those two).
Comment 7 lohmaier 2003-08-13 18:05:40 UTC
\# should be replaced by the contents of the #th pair of round brackets ()

This is stated in the OnlineHelp:
( )
Defines the characters inside the brackets as a reference. You can
then refer to the first reference in the current expression with "\1",
to the second reference with "\2", and so on.
For example, if your text contains the number 13487889 and you search
using the regular expression (8)7\1\1, "8788" is found.
####
It works in the search field (example: "(d)i\1" finds "did") but you
cannot use this in the replace-field (where it would be much more useful).
As mentioned before OnlineHelp doesn't state that it is possible to
use this in the replace-field (so documentation is not wrong) - but it
would be nice if it worked.

(I also disagree that & is unusual to be replaced by the matched
string - sed, for example, uses the same thing)

> "and \s doesn't seem to match [:space:]"
       
\s is not listed as a short form of [:space:] why do you expect this
to work?
The index-entry is: "regular expressions; list of"
Comment 8 lohmaier 2003-08-13 18:08:05 UTC
*** Issue 11037 has been marked as a duplicate of this issue. ***
Comment 9 lohmaier 2003-08-13 18:13:28 UTC
*** Issue 15515 has been marked as a duplicate of this issue. ***
Comment 10 lohmaier 2003-08-13 18:19:43 UTC
Real-life example from issue 15515:
"The date is 2003-06-11." should be changed to
"The date is 11-06-2003." using the regex-search:

Search for: ([:digit:]{4})-([:digit:]{2})-([:digit:]{2})
            #  1st pair  # <  2nd pair  > ~  3rd pair  ~  ..of brackets
replace with: \3-\2-\1
This is expected to do the job, but instead 2003-06-11 gets replaced by
"\3-\2-\1" 
Comment 11 Unknown 2003-09-04 12:29:21 UTC
I'm new here so if I make a mistake, please excuse me.

I don't know if what I'm going to describe should be a different 
issue...

When you find a regular expression that includes [:cntrl:] and 
replace with &, it should leave the text unchanged, but instead the 
control characters disappear.
For example, I have: text[illustration]
Search for text[:cntrl:]
Replace with &

When I hit replace only text remains, the illustration field is gone.

Thanks for your help
Comment 12 ingenstans 2003-12-26 21:19:39 UTC
I have changed this to an enhancement, though a very important one. It's not, 
strictly speaking, a defect that you can't replace portions of a regex: the help 
doesn't say that it should be possible. 

But it is a useful and powerful technique, found in almost every other 
implementation of regexes, and we ought to have it. 

So it's an enhancement
Comment 13 lohmaier 2004-02-16 18:45:36 UTC
*** Issue 22592 has been marked as a duplicate of this issue. ***
Comment 14 vpe 2004-03-08 23:48:27 UTC
I also think backreferences in the replace string is an important feature and 
should be implemented (even msword has them).

A small suggestion about improving Help on Regular Expression:

An example of using backreferences in the find string is pointless (explanation 
of "()" in the List of Regular Expressions). Who needs "(8)7\1\1" to find 
"8788"?  Users who never used regular expressions before might not appreciate 
the value of backreferences. A more informative example would be something like 
finding palindrome words 3 chars long (did, dad, bob): "\<(.).\1\>"
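
For anyone who wants to try the palindrome pattern outside of Writer, a rough Python equivalent follows. Python has no `\<` and `\>`, so `\b` stands in for both word boundaries, and `\w` replaces `.` so that spaces cannot take part in the match; both substitutions are assumptions of this sketch, not OOo syntax.

```python
import re

text = "did you see dad and bob today"

# \1 inside the search pattern must repeat whatever the first group matched,
# so only three-letter words with identical first and last letters are found.
pattern = re.compile(r"\b(\w)\w\1\b")

found = [m.group(0) for m in pattern.finditer(text)]
print(found)
```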
Comment 15 stefan.baltzer 2004-07-16 16:45:22 UTC
SBA: Reassigned to AMA.
Comment 16 andreas.martens 2004-07-19 17:01:38 UTC
We should rework our support of RE, but we will not be able to do this in OOo2.0 :-(
Comment 17 lohmaier 2004-08-28 22:54:19 UTC
keywords & component set according to new RFE-eval process... OS, Platform=ALL
Comment 18 lohmaier 2005-02-23 18:45:51 UTC
*** Issue 43397 has been marked as a duplicate of this issue. ***
Comment 19 glebovitz 2005-04-04 22:47:31 UTC
I can't believe OpenOffice made it to version 2 and still doesn't have a proper
regular expression replacement feature. No offence intended mr andrewb, but this
IS a DEFECT not an ENHANCEMENT. Who came up with the pointless backreferences
"FEATURE"? What were they thinking?
Comment 20 erpel 2005-04-05 03:57:03 UTC
I don't think this is a Writer specific issue. It's a general OOo issue.
So please change the component accordingly (IMHO to framework). Thanks.

The current behaviour is really suboptimal. RegExes would be MUCH more useful,
if one could address matched subexpressions in the replacement string.
When this feature is implemented, issue 46015 (Support for less greedy RegExes)
would make more sense, too.
Comment 21 lohmaier 2005-05-30 14:17:58 UTC
*** Issue 50043 has been marked as a duplicate of this issue. ***
Comment 22 bigserpent 2005-05-30 15:19:05 UTC
Maybe it's time to change the issue type to "bug" from "enhancement" ? And set a
target milestone. There are two reasons:
1. the common, standard feature is not available.
2. this may lead developers not to forget it :-).
Comment 23 ingenstans 2005-07-05 11:34:51 UTC
The developers are perfectly capable of postponing issues marked as bugs, so I 
don't think that changing this designation will help.

I agree that it's an urgently needed improvement. I agree that the specification 
is pretty much useless. But as a QA volunteer I know I can do nothing to 
persuade developers to take up issues just because I think they are urgent. 
Classifying things correctly by the rules does help a little. If MS Office does 
this the right and sensible way, I will add a keyword for office 
interoperability, which does seem to get management attention. 
Comment 24 jojo4u 2005-07-05 22:08:41 UTC
The following excerpt from
http://office.microsoft.com/en-us/assistance/HA010873051033.aspx shows that
Office XP supports backreferences:

5. Click the Replace tab, and then enter the following characters in the Find 
what box. Make sure you include the space between the two sets of parentheses: 
(<*>) (<*>)
6. In the Replace with box, enter the following characters. Make sure you 
include the space between the comma and the second slash: \2, \1
Comment 25 noise_e_piranha 2005-08-21 07:34:28 UTC
I agree that this is a serious defect (if not quite a bug), and I'm quite 
disheartened that this has been an issue for over 2 years.
Comment 26 lohmaier 2005-08-25 17:47:42 UTC
*** Issue 53775 has been marked as a duplicate of this issue. ***
Comment 27 gurubert 2006-01-05 14:11:31 UTC
Hi!

This issue has been open for more than two and a half years now.

Does anybody care to solve it?
Comment 28 glebovitz 2006-01-05 15:53:21 UTC
I just sent a letter to the governing council explaining that this is very basic
functionality that is missing from the product suite and it's embarrassing that
they allow teams to explore lofty new projects without getting the basics
completed first. I don't know about you, but this continues to be a deployment
showstopper for me.
Comment 29 cdunham 2006-01-06 04:22:44 UTC
>> "and \s doesn't seem to match [:space:]"
      
>\s is not listed as a short form of [:space:] why do you expect this
to work?

It's pretty standard: http://www.regular-expressions.info/charclass.html#shorthand
Comment 30 ecastro 2006-01-15 20:19:07 UTC
*** Issue 60029 has been marked as a duplicate of this issue. ***
Comment 31 rnhainsworth 2006-01-16 08:03:29 UTC
MS advertises that its products are used by knowledge professionals. As a
financial analyst, I fall into that group. I use spreadsheets extensively to
handle data imported from a variety of sources. Often the data needs to be
massaged into a form that can then be manipulated by standard spreadsheet functions.

The lack of a RegEx replace functionality is a critical defect. If I were using
Windows, I would have to revert to Excel for this single functional absence.

If OO meets its specification as given by Help and does not have a RegEx
replace, then the DEFECT is in the Help specification.

This issue is filed under Writer, but it affects the whole suite.
Comment 32 cheyrich 2006-02-11 13:56:11 UTC
Created attachment 34070 [details]
proposed patch
Comment 33 cheyrich 2006-02-11 13:57:05 UTC
I've attached my proposal to this problem (only for writer for now). Since this
is my first patch, I don't know that much about hacking OOo. So while the code
does what I want (using \n in replace string inserts the content of the nth
bracket), it might not please everyone.
And at some places I'm not even sure how to handle one situation or another
(How should '\x' in the replace string be treated, since it doesn't have any
special meaning? Or how should '\4' in the replace string be handled if there's
no capture group 4?).

Besides adding the ability to use \1, \2 a.s.o. in the replace string, I also
removed the special meaning of & (in text replace, not in attribute or format
replace) since this is non standard and unexpected (although documented
behaviour). Instead \0 now does the same trick and is more straightforward.
Also something like '\\\t' now works as I'd expect: '\'0x09. Formerly it
resulted in the string '\\t'.

I didn't try implementing \s as placeholder for whitespace. This would have
required hacking the regex lib (OOo uses a modified version of the GNU regular
expression library 0.12 which also misses some advanced features like non greedy
quantifiers) quite deeply. IMHO this would be better handled in a separate bug.
Comment 34 rnhainsworth 2006-02-14 07:23:53 UTC
At last someone is looking at this issue!!!! Well done cheyrich.

Whilst hacking the S&R code is needed, I wondered if a macro could be written in
Python (which has reasonable regex behaviour), and then attached to a tool bar.

Since I am a Perl person (and no time to learn python now), and there is not yet
an OO/perl interface, I have not investigated this possibility.
Comment 35 glebovitz 2006-02-14 13:10:13 UTC
cheyrich,

This is a good first step for the RE functionality. It gets us past the
current limitations. I am not sure what you are saying about the & substitution
versus using \0. & in the replace field should return the entire matched string
from the search field.

If OO is using a broken version of GNU Regular expressions, then should this
also be fixed? I haven't hacked OO either, but I am willing to give some of this
a go.

Gregg
Comment 36 cheyrich 2006-02-14 17:36:40 UTC
> This is a good first step for the RE functionality. It gets us past the
> current limitations.

Thanks. I just realized that in this code, registers usable in the replace string
are limited to 1 digit, so 9 submatches. This could be extended to
2-or-more-digit registers, but would complicate the code. I guess 9 is plenty.

> I am not sure what you are saying about the & substitution versus using \0.
> & in the replace field should return the entire matched string from the
> search field.

Of course it's possible to reimplement the & as special character. Users that
are used to it this way wouldn't have to change over. Still, I don't see a reason
for having & match the whole string. \0 for all and \1-\9 is more straightforward
(to code and to learn).

> If OO is using a broken version of GNU Regular expressions

Er no, I didn't write broken - it's modified (mostly to make it use classes).
What I meant with "misses some advanced features" is the lib misses them - also
in the FSF's original version.

If some more PCRE functionality is required it will be harder to implement since
registers have already been supported and used internally. More than that, in my
patch actually only *one* line of code was added to the regex code, the rest is
in OOo code.
Comment 37 danm 2006-02-14 19:42:25 UTC
I thought '&' means the entire match. For example,

Text: This is a search in the Open Office suite.

Search: Op(.*)ice

Doing a replace with '\1' (or '\0') would return "en Off" (which would become
"...in the en Off suite."), while '&' would return "Open Office" (which would then
become "...in the Open Office suite."). Without '&' the search would need to be
"(Op)(.*)(ice)" instead. Say you wanted to bold all instances of "Open Office"
in HTML; you could just search for "open office" (case insensitive) and replace
with "<b>&</b>" and not change the case everywhere.
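
The distinction can be tried out in Python, as a sketch only: Python's `re` module writes the whole match as `\g<0>` in the replace string rather than `&`, so this shows the concept, not OOo's syntax.

```python
import re

text = "This is a search in the Open Office suite."

m = re.search(r"Op(.*)ice", text)
whole = m.group(0)        # the entire match, what '&' stands for
first_group = m.group(1)  # only the part inside the parentheses, i.e. \1

# Wrap every occurrence in <b>...</b> without touching its original case.
bolded = re.sub(r"open office", r"<b>\g<0></b>", text, flags=re.IGNORECASE)

print(whole, "|", first_group)
print(bolded)
```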
Comment 38 glebovitz 2006-02-14 21:53:50 UTC
If it is possible, I would keep the '&' as the whole pattern substitution since
it is standard for regular expressions. Even microsoft uses '&' for whole
pattern substitution.

I would be happy with 9 sub-expression registers.

I agree that for the time being basic RE functionality similar to the basic
substitution capability of 'sed' would be fine. In the future, it might be
worthwhile to update the RE code to include the newer GNU extensions.

Does the current RE library in OO support posix extensions such as [:space:]
etc? I think that would be more important than '\s'.
Comment 39 glebovitz 2006-02-14 21:54:27 UTC
If it is possible, I would keep the '&' as the whole pattern substitution since
it is standard for regular expressions. Even microsoft uses '&' for whole
pattern substitution.

I would be happy with 9 sub-expression registers.

I agree that for the time being basic RE functionality similar to the basic
substitution capability of 'sed' would be fine. In the future, it might be
worthwhile to update the RE code to include the newer GNU extensions.

Does the current RE library in OO support posix extensions such as [:space:]
etc? I think that would be more important than '\s'.
Comment 40 lohmaier 2006-02-15 00:19:22 UTC
Some comments from me...

> Besides adding the ability to use \1, \2 a.s.o. in the replace string, I also
> removed the special meaning of & (in text replace, not in attribute or format
> replace) since this is non standard and unexpected 

No, this is a standard. e.g. sed and a couple of other utilities use it.
Furthermore: Having it work in one situation and not in another is a nightmare
both documentation-wise and with regard to usability.

I'd suggest to keep it since it has been there for ages (has been there long
before OOo was born)

As mentioned in another comment \0 could be "match all groups"

> (although documented behaviour). Instead \0 now does the same trick and is 
> more straightforward.

I'd call that more exotic and far from being straightforward.

> Also something like '\\\t' now works as I'd expect:

Insert a backslash followed by a tabulator?

> '\'0x09. Formerly it resulted in the string '\\t'.

I guess you just misplaced the quote.

Is it possible to add the other escape-sequences as well? (like for newline (as
opposed to paragraph-break that unfortunately already is \n in the replace-box))

glebovitz wrote:
> Does the current RE library in OO support posix extensions such as [:space:]
> etc? I think that would be more important than '\s'.

Sure. Just have a look at the help for regular-expression search.
Comment 41 cheyrich 2006-02-15 14:31:00 UTC
Created attachment 34181 [details]
another try
Comment 42 cheyrich 2006-02-15 14:31:36 UTC
> I thought '&' means the entire match. For example,

It does and \0 also does.
But I must admit that I was wrong in thinking Perl knows \0, it's only available
in PHP's preg_* and ereg_*.

& is new to me as a regex operator, but I mainly know regex from Perl
and PHP, not from sed a.s.o.

>> Also something like '\\\t' now works as I'd expect:
>
> Insert a backslash followed by a tabulator?

Yes. But the current code only removes *one* backslash, regardless of how many
exist in the sequence.

> Is it possible to add the other escape-sequences as well? (like for
> newline (as opposed to paragraph-break that unfortunately already is
> \n in the replace-box))

Should be possible, if I know what sequences to use. Currently I can't see
what difference Return vs. Shift+Return makes in practice.


Frankly, I don't like & (as well as $1...$9) because it complicates (and
slows down) the replace code. If you've looked at the patch, you know it loops
over the replace string, searching for a backslash. If it encounters one, it
looks what character is next. If the code has to be able to handle \ as well as
& (and maybe $) as first char of a special sequence, I currently think I'll need
to loop over the string several times in several loops.


For now I've modified ActualStrReplace() and prefaced the main Search loop with
another in which unescaped & are replaced by \0. So & and \0 are synonyms now.


I'd be happy if someone would say anything about the actual code. Maybe there's
some even simpler way to do the replace. And there might exist some better
method to use for the main loop than Search (like strcspn would be for plain
C-strings), but I'm not that familiar with all the string handling in OOo.
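
For readers without the patch at hand, here is a rough Python sketch of the approach just described: a first pass turns each unescaped & into \0, then the main loop searches for backslashes and looks at the character that follows. This is not the code from the attachment. The function names are invented, and the handling of a missing group (expands to nothing) and of an unknown escape such as \x (yields x) are assumptions, since those questions are still open in this issue.

```python
import re

def ampersand_to_backslash_zero(repl: str) -> str:
    """Pre-pass: rewrite every unescaped & as \\0, leave escaped pairs alone."""
    out, i = [], 0
    while i < len(repl):
        ch = repl[i]
        if ch == "\\" and i + 1 < len(repl):
            out.append(repl[i:i + 2])  # keep pairs such as "\&" or "\1" untouched
            i += 2
        elif ch == "&":
            out.append("\\0")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

def expand(repl: str, match: re.Match) -> str:
    """Main loop: find each backslash and act on the character after it."""
    repl = ampersand_to_backslash_zero(repl)
    out, pos = [], 0
    while True:
        bs = repl.find("\\", pos)
        if bs == -1 or bs + 1 >= len(repl):
            out.append(repl[pos:])  # no further escape sequence: copy the rest
            break
        out.append(repl[pos:bs])
        nxt = repl[bs + 1]
        if nxt in "0123456789":
            n = int(nxt)
            # Assumption: a reference to a missing group expands to nothing.
            out.append((match.group(n) or "") if n <= match.re.groups else "")
        else:
            out.append(nxt)  # assumption: "\x" simply yields "x"
        pos = bs + 2
    return "".join(out)

m = re.search(r"Op(.*)ice", "in the Open Office suite")
print(expand("<b>&</b>", m))
```

The two loops are exactly the cost complained about above: the replace string is walked twice.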
Comment 43 glebovitz 2006-02-15 21:00:31 UTC
I looked at the code and it seems reasonable. It looks like you aren't handling
\n in the replacement string. You could add functionality where \n inserts a
paragraph mark and \r (return) inserts a line break.

Handling the & looks a little complicated. It seems like you need a search
function that can look for multiple strings at once. That way you wouldn't need
to go through the contortion of replacing all the unescaped & with \0.

Gregg
Comment 44 prisonerofpain 2006-02-15 21:22:25 UTC
> It seems like you need a search
> function that can look for multiple strings at once.
The Boyer-Moore string search algorithm* seems to handle those cases pretty well.



* http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm
Comment 45 olo 2006-02-16 17:15:48 UTC
I'd like \r and \n , too (like glebovitz has suggested)!

It's currently a bit silly that OpenOffice lets you substitute line breaks with
paragraph breaks, but not vice versa. And it's silly that one does that by
substituting \n with \n.

The current behaviour is counter-intuitive, one would expect that substituting
\n with \n will result in no change.

Cheyrich, it would be nice to have \r and \n, so we can change line breaks to
paragraph breaks (\r -> \n) and paragraph breaks to line breaks (\n -> \r).

What do You think about it?
Comment 46 cheyrich 2006-02-17 11:23:56 UTC
> I'd like \r and \n , too (like glebovitz has suggested)!

While that looks like a reasonable request and I'd try to solve that issue, I
guess it's better to file a separate bug on this. I had some closer looks at the
code and the return, paragraph, newline handling involves different code, mainly
because it doesn't manipulate only a plain string but also copes with creating,
separating and joining nodes.

I also think \n in search and \n in replace should mean the same, but I fear
this will also raise resistance since it breaks with the current design and thus
will confuse long time users.

So I request you to please file another bug on this and tell us/me the No.
Comment 47 cheyrich 2006-02-17 13:06:11 UTC
Created attachment 34241 [details]
version 3 of my proposal, now with real & support
Comment 48 cheyrich 2006-02-17 13:14:40 UTC
@glebovitz
> Handling the & looks a little complicated. It seems like you need a search
> function that can look for multiple strings at once. That way you wouldn't need
> to go through the contortion of replacing all the unescaped & with \0.

It not only looks a little complicated, it really is.
Indeed I needed a search function that can look for multiple strings, or rather
multiple characters in this case. That's what I meant with "strcspn" in my comment.

"Needed" because, as you might have already noticed from my latest attachment,
with SearchChar() I found it. So now the &-handling only takes a few lines of
additional code in the main loop.
So I'm quite happy now, hopefully any real OOo hacker will also be.
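
A sketch of that idea in Python, with the caveat that it is not the patch: `search_any` is a made-up stand-in for what SearchChar() is described as doing here (finding the first of several characters, like strcspn), and `expand_single_pass` with its `whole` and `groups` parameters is equally hypothetical. Missing groups expand to nothing by assumption.

```python
def search_any(text: str, chars: str, start: int = 0) -> int:
    """Index of the first character at or after `start` that is in `chars`, else -1."""
    for i in range(start, len(text)):
        if text[i] in chars:
            return i
    return -1

def expand_single_pass(repl: str, whole: str, groups: list) -> str:
    """Expand & and \\0..\\9 in one pass over the replace string."""
    out, pos = [], 0
    while True:
        hit = search_any(repl, "\\&", pos)
        if hit == -1:
            out.append(repl[pos:])
            break
        out.append(repl[pos:hit])
        if repl[hit] == "&":
            out.append(whole)
            pos = hit + 1
        elif hit + 1 < len(repl):
            nxt = repl[hit + 1]
            if nxt in "0123456789":
                n = int(nxt)
                if n == 0:
                    out.append(whole)
                elif n <= len(groups):
                    out.append(groups[n - 1])
                # else: missing group, nothing is inserted (assumption)
            else:
                out.append(nxt)  # "\&" gives a literal &, "\\" a backslash
            pos = hit + 2
        else:
            out.append("\\")  # lone backslash at the very end is kept
            pos = hit + 1
    return "".join(out)

print(expand_single_pass(r"\1 & \&", "Open Office", ["en Off"]))
```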
Comment 49 glebovitz 2006-02-17 23:58:58 UTC
cheyrich,

I looked at all the code and I think I can take the search method from the
String class and write a function that will take an array of chars and search
simultaneously for all of them.

Would you like me to take a crack at this?

Gregg
Comment 50 cheyrich 2006-02-18 13:17:24 UTC
Gregg,

> I looked at all the code and I think I can take the search method from the
> String class and write a function that will take an array of chars and search
> simultaneously for all of them.

Sorry, I must have missed something. As I mentioned, I already found
SearchChar() which does exactly this. Should I not use this method because of
the warning in string.hxx:
"THIS CODE IS DEPRECATED.  DO NOT USE IT IN ANY NEW CODE.
Use the string classes in rtl/ustring.hxx and rtl/ustrbuf.hxx (and
rtl/string.hxx and rtl/strbuf.hxx for byte-sized strings) instead."

I just discovered this while looking around because of your comment.

So if this SearchChar() method should be reimplemented in the String class, you
can of course do this. I don't get the difference between ByteString/UniString
and String at the moment (besides that String looks like it misses many
functions and isn't Unicode-capable).

Christian
Comment 51 glebovitz 2006-02-18 19:35:46 UTC
christian,

I missed your comments about the SearchChar function. If you found a function
that does what you need then by all means use that. I spent some time looking
at the various string classes and it looks to me like byte_string and unistring
are part of the old String class. I think byte_string and unistring are
currently '#defined' as String. The new replacement, I think, are OUstring and
Ustring in the rtl libraries.

By the way, I looked at prisonerofpain's suggestion for the boyer-moore search
function and it is much too complex for what you (we) need 'cuz we are only
searching for single characters ('\' and '&').

Gregg
Comment 52 cheyrich 2006-02-18 21:32:51 UTC
Gregg,

> I missed your comments about the SearchChar function. If you found a function
> that does what you need then by all means use that.

Ah, ok - that's an explanation.
So I'll stick with that.

> The new replacement, I think, are OUstring and Ustring in the rtl libraries.

From the comment I quoted I'd say yes. At that time I hadn't looked at these
classes. And as it looks to me now, they're not equivalent, since they are
read-only strings with no usable search method.

> I looked at prisonerofpain's suggestion for the boyer-moore search

Yep, BM is overkill for short search strings and it's also new to me that you
can search for multiple independent chars with it.
Comment 53 olo 2006-08-11 22:30:25 UTC
So, will this patch get accepted to the OO code tree?
Comment 54 ooo 2006-08-14 11:43:40 UTC
Andreas, Christian,

I took a short glance at this patch, it looks viable in general (apart from
German comments in new code, hey, we should stop that ;-)

However, as regexp search&replace is also used by Calc, we should offer the
replacement functionality of ActualStrReplace() at a more common place, i.e. the
utl::TextSearch wrapper. The method (name? RegexReplace?) should take parameters
util::SearchResult, original string and replacement string. It should return a
value indicating whether all replacements were done or errors occurred, e.g. more
backreferences than search groups, or result string overflows. Maybe a sequence
as long as the number of backreferences, to be able to point to the place of the
error. Just a quick thought.
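
To make the suggestion more tangible, a Python sketch of such an interface follows. Everything in it is hypothetical: `regex_replace` and `ReplaceOutcome` are invented names, a Python match object stands in for util::SearchResult, and an unusable backreference is simply dropped and its offset in the replacement string recorded. Result string overflow is not modelled.

```python
import re
from dataclasses import dataclass, field

@dataclass
class ReplaceOutcome:
    text: str                                     # original string with the match replaced
    bad_refs: list = field(default_factory=list)  # offsets of unusable \N in the replacement

    @property
    def ok(self) -> bool:
        return not self.bad_refs

def regex_replace(match: re.Match, original: str, replacement: str) -> ReplaceOutcome:
    out, bad, i = [], [], 0
    while i < len(replacement):
        ch = replacement[i]
        if ch == "\\" and i + 1 < len(replacement) and replacement[i + 1] in "0123456789":
            n = int(replacement[i + 1])
            if n <= match.re.groups:
                out.append(match.group(n) or "")
            else:
                bad.append(i)  # more backreferences than search groups
            i += 2
        else:
            out.append(ch)
            i += 1
    text = original[:match.start()] + "".join(out) + original[match.end():]
    return ReplaceOutcome(text, bad)

s = "The date is 2003-06-11."
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", s)
good = regex_replace(m, s, r"\3-\2-\1")
broken = regex_replace(m, s, r"\3-\4")
print(good.text, good.ok)
print(broken.text, broken.bad_refs)
```

Returning the offsets rather than a bare flag would let the dialog point at the offending backreference.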

  Eike
Comment 55 rnhainsworth 2006-08-14 11:50:52 UTC
It would be good to have a summary of the RegEx behaviour, viz., what is
replaced by what.
Comment 56 cheyrich 2006-08-14 13:10:52 UTC
Eike

> I took a short glance at this patch, it looks viable in general (apart from
> German comments in new code, hey, we should stop that ;-)

First it's easier for me to write in German and second I used it because
comments in that function already were in this language. But yes, though most
of my comments aren't really necessary (I always prefer commenting too much
rather than too little) I can change that if you want.

> However, as regexp search&replace is also used by Calc, we should offer the
> replacement functionality of ActualStrReplace() at a more common place, i.e.
> the utl::TextSearch wrapper. [...]

I'll try addressing this ASAP. But because I'm still an OOo-coding-newbie I can't
guarantee *if*, and because I'm in a transition to another OS on my computer I
can't tell *when*.
Comment 57 rnhainsworth 2006-08-14 14:14:18 UTC
Sorry, my previous request (below) was too short to be meaningful.

Since there are a number of RegEx behaviours, it would be useful to have a
summary of the way the current Search and Replace patch is supposed to work.
(viz., the & replacement).

>It would be good to have a summary of the RegEx behaviour, viz., what is
replaced by what. 
Comment 58 lohmaier 2006-08-21 21:42:11 UTC
@cheyrich:
heavy documentation in the code is not bad! er just wanted to say that new
comments should be written in English, not German (so that all devs can
understand them)

@all:

IMHO the following is how it *should* be:

..in replacement box    does this..
&                       inserts complete string that matched (as it does now)
\1                      inserts group number one - the match enclosed in the 
                        first pair of matching (round) parentheses
\2 \#                   inserts the second and #th matching group
\0                      inserts all matching groups

\n                      inserts a newline (linebreak - as inserted with
                        <shift>+<enter> in normal text)
\r                      inserts a paragraph break
\t                      inserts a tabulator
\xFFFF                  inserts the character matching the hex-code FFFF
\c                      where c is not one of [0-9nrtx] inserts the character c (*)
\\                      inserts a backslash

(*) the list may not be exclusive, depending on what other escape-sequences are
added - maybe to insert a non-breaking hyphen/space
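
One way to pin this table down is a small tokenizer for the replace string. The Python sketch below is only that, a sketch: the token names are invented, \xFFFF is assumed to take exactly four hex digits, and \0 is emitted as group 0 without deciding what it should expand to.

```python
import string

def tokenize_replacement(repl: str) -> list:
    """Split a replace string into tokens following the table above."""
    simple = {"n": ("linebreak",), "r": ("parabreak",), "t": ("text", "\t")}
    tokens, i = [], 0
    while i < len(repl):
        ch = repl[i]
        if ch == "&":
            tokens.append(("whole",))            # complete matched string
            i += 1
        elif ch == "\\" and i + 1 < len(repl):
            nxt = repl[i + 1]
            hexpart = repl[i + 2:i + 6]
            if nxt in "0123456789":
                tokens.append(("group", int(nxt)))
                i += 2
            elif nxt in simple:
                tokens.append(simple[nxt])
                i += 2
            elif nxt == "x" and len(hexpart) == 4 and all(c in string.hexdigits for c in hexpart):
                tokens.append(("text", chr(int(hexpart, 16))))
                i += 6
            else:
                tokens.append(("text", nxt))     # \c inserts c, \\ inserts a backslash
                i += 2
        else:
            tokens.append(("text", ch))          # ordinary character (or a trailing backslash)
            i += 1
    return tokens

print(tokenize_replacement(r"\2\t&\n\x0041\\\q"))
```

Keeping line break and paragraph break as separate tokens instead of characters reflects that a paragraph break splits a node and is not part of the plain string.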
Comment 59 cheyrich 2006-08-21 23:57:28 UTC
> er just wanted to say that new comments should be written in english,

When continuing work, I'll translate all comments.

> \0                      inserts all matching groups

I guess there are different opinions on how it *should* work. As I wrote
earlier, I know \0 to work like &, i.e. contain the complete string. That's how
PHP's ereg* and preg* functions as well as - and that's the main reason it works
as it does - FSF's GNU regular expression library, which OO uses, work.
Comment 60 glebovitz 2006-08-22 02:34:05 UTC
Where is the functionality for \0 => all match groups specified? POSIX? Gnu?
If there is a conflict over expected function, then we should probably follow
the standards.

Gregg
Comment 61 ooo 2006-08-22 14:12:32 UTC
AFAIK there is no standardized specification for \0. However, quite a few
implementations use that extension, e.g. GNU sed and, as I've read, the .NET
framework. Btw, IEEE Std 1003.1, 2004 Edition, also does not define the '&'
ampersand for backreference. See
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
Comment 62 aexl 2006-08-22 15:41:05 UTC
I would be happy to get rid of the "&".
I'm not used to treating it as a special character.

For many characters (like backslash and all kinds of brackets eg) i have the
intuitive feeling of "oh, this might be a special character, so i better escape
it...", but OOo is the only place i know where "&" has a special meaning so i
would never have the idea to escape it...
Comment 63 noise_e_piranha 2006-08-22 16:33:27 UTC
I've been using "vi" for 21 years and AFAICR "&" has had a special meaning.  I
use it all the time in regular expressions and would feel the functionality is
missing if it were removed.
Comment 64 glebovitz 2006-08-22 16:57:49 UTC
I may be missing something, but the IEEE document that er posted specifies the
regular expression for regexec and regcomp only, and does not specify the syntax
for replace. The discussion of back references in this document appears to cover
references in the match pattern only.

IMHO, We've already had this discussion and decided that implementing a replace
without an '&' does not follow convention. Since there is no dispute over the
definition of '&' in the replace string, I believe we should not be trying to
change this behavior.

Here is the behavior of Perl, PHP, and sed using the '&' and '\0' substitution
patterns for the string "123 ABC DEF GHI":

perl -> 's/123 (AB). (DE). (GH)./$&'       -> '123 ABC DEF GHI'
perl -> 's/123 (AB). (DE). (GH)./$0'       -> '_'

sed  -> 's/123 \(AB\). \(DE\). \(GH\)./&'  -> '123 ABC DEF GHI'
sed  -> 's/123 \(AB\). \(DE\). \(GH\)./\0' -> '123 ABC DEF GHI'

php  -> 's/123 (AB). (DE). (GH)./&/'       -> '&'
php  -> 's/123 (AB). (DE). (GH)./\0/'      -> '123 ABC DEF GHI'

Note: while php and sed support the '\0' behavior, perl does not support '$0'.

Only Microsoft replaces '\0' with all the matched subgroups.

This creates a dilemma since Microsoft users expect one behavior and the open
source world expects another. 

The best of all worlds would support two syntaxes and allow the user to select
between them. I can see having a replace check box option labeled "use Microsoft
replace" that would change the behavior to use the complete Microsoft syntax.

For the time being, I suggest that we try to stick with GNU sed, Perl, and PHP
and support both '&' and '\0' for whole string replace and revisit Microsoft
compatibility at a later date.
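
To make the two candidate readings of '\0' concrete, here is a minimal sketch
in Python. Python's re module only stands in for the OOo regex library here
(an assumption), and the variable names are illustrative. It uses the same
pattern and test string as the table above:

```python
import re

# Same pattern and input as in the comparison above.
m = re.search(r"123 (AB). (DE). (GH).", "123 ABC DEF GHI")

# Reading 1: '\0' (like '&') is the whole matched string.
whole_match = m.group(0)

# Reading 2: '\0' is shorthand for all subgroups, i.e. \1\2\3.
all_groups = "".join(m.groups())

print(repr(whole_match))  # '123 ABC DEF GHI'
print(repr(all_groups))   # 'ABDEGH'
```

The first reading is what sed and PHP return for '\0'; the second is the
"all matched subgroups" behaviour.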
Comment 65 lohmaier 2006-08-29 22:39:51 UTC
Well - \0 for all matching groups just seemed to be more logical to me. (again
remember that this is for the replace-box, not for the search). We already have
a common character to match the whole string (&) - so having one for all groups
seemed logical (simply \0 instead of writing \1\2\3\4)

Also note that many of the other regex implementations cannot use \0 since that
is often used to specify a character by its octal code - but OOo uses hex values
instead (\x) - so this wouldn't collide.

All in all I don't have a strong opinion about it. If you decide to make \0
behave like &, I won't complain....

But keep the &. Even if it were not common in regex (it is very common), it
still would be an expression that used to be available in OOo (and its
predecessor) for years. Removing it would be a regression.
Comment 66 glebovitz 2006-08-30 21:42:15 UTC
@cloph

Good points.

An important issue in changing the behavior of \0 to all matching groups is that
Christian implemented & by converting it to a \0 on the input buffer and then
expanding it to the entire match string in the output buffer. Therefore
providing for different behaviors between & and \0 is a large coding change.

I suggest we keep the behavior as proposed by Christian.
Comment 67 cheyrich 2006-08-31 00:04:57 UTC
> An important issue in changing the behavior of \0 to all matching groups is that
> Christian implemented & by converting it to a \0 on the input buffer

Er, did I? Oh, yes, but that was in version 2, in version 3 it's different since
I had found SearchChar() in the meantime.
But I nevertheless prefer keeping the behaviour of version 3 (& == \0). Mainly
because the regex lib delivers \0 that way.
Comment 68 glebovitz 2006-08-31 14:31:06 UTC
Christian,

It helps to click on the correct link before making comments *blush*. I WAS
looking at the version 2 document.

Looks really good.

Gregg
Comment 69 cheyrich 2006-09-21 18:45:08 UTC
I'm currently in the phase of reorienting myself in the code. I hope I can
implement the proposed changes soon.

> we should offer the replacement functionality of
> ActualStrReplace() at a more common place, i.e. the
> utl::TextSearch wrapper.

Making it globally available is ok for me. But I don't know if a class named
TextSearch should contain a method that replaces. Not that I want to create
another class TextReplace or so, but maybe there's a more fitting one already in
existence. However, if someone who's deeper in the system than me says I should
put it in TextSearch, I'll do that.
Comment 70 glebovitz 2006-09-21 20:08:19 UTC
Christian,

Shouldn't the functionality of textsearch (and textreplace) be integrated into
the base string class? Seems to me that strings, in an editor, should be self
searching and self replacing? The QString class in Qt 4 supports search and
replace. This makes the class very useful.

Gregg
Comment 71 cheyrich 2006-09-22 12:52:09 UTC
> Shouldn't the functionality of textsearch (and textreplace) be integrated into
> the base string class? Seems to me that strings

Having methods for search and various manipulations in string classes seems
reasonable to me. OOo's UniString and ByteString already look quite richly equipped
with methods. And maybe having a regex version of their SearchAndReplace()
method would be good. But that's neither what I want to do nor what I can do.

Searching through a whole document and replacing matches is different from
just searching through a string only. Either that or those who implemented the
current S&R were insane. At least for me it's just a mess in which I was
finally able to find the right point to insert my code.

That's just to inform you what you can expect from me (and what not).
Comment 72 stefan.baltzer 2006-11-09 14:18:47 UTC
*** Issue 25177 has been marked as a duplicate of this issue. ***
Comment 73 cbrunet 2006-12-08 16:57:34 UTC
What is happening with that issue? When will it be integrated? 73 votes...
Shouldn't a target be set? 2.2?
Comment 74 cheyrich 2006-12-08 19:15:34 UTC
Getting this in would be great.
I understand the requests for availability of the functionality to all parts of OOo
but that's too deep for me. So unless someone can at least point me in the right
direction, this will be open for another three years, I fear.
Comment 75 schaber 2007-03-03 11:27:17 UTC
Could you please integrate the patch into the product? You can make improvements
at any time, but meanwhile it would be great to have the functionality in the
product, even if it is only available in the Writer module.

80 votes so far!

Best regards,

   Gerhard
Comment 76 andreas.martens 2007-03-05 12:19:22 UTC
I will set up a CWS for this patch and we will see, how far we'll get. Stay tuned!
Comment 77 schaber 2007-03-21 11:16:19 UTC
Hi!

Any news?
Comment 78 lohmaier 2007-04-11 12:32:30 UTC
*** Issue 76188 has been marked as a duplicate of this issue. ***
Comment 79 floris_v 2007-07-17 16:47:27 UTC
I programmed a replace function in Delphi (I'm sorry to say that I'm absolutely
no good at C++) with a different set of regular expressions or wildcards but
with the possibility to use \1 - \9 in the replace by string to access variable
text in the search string. My very simple system worked like this:
The search and replace methods call a match function that returns true if a
match is found. The match function has a lot of parameters, like the start and
end position of the found text, the search and replace string, and of course a
reference to the string that holds the text you're trying to find the search
string in.
The match function assembles text matching expressions in () in an array of
strings; when a match is found the switches  in the replace string are simply
replaced by the corresponding strings in the array. I didn't include & and \0 -
not having & was an oversight but I'm not sure that \0 is used a lot in word
processors (it's definitely not done in MS Word) and I feel that comparing OO.o
Writer with a (in my humble view) low-level editor like Sed isn't quite correct. 

I hope the support of switches in the replace by string will be implemented soon. 
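
A rough sketch of that scheme, in Python rather than Delphi. The function
name replace_all and its parameters are made up for illustration, and
Python's re module is an assumption standing in for my own matcher. The text
matched by each () is collected in an array, and each \1 - \9 switch in the
replace string is swapped for the corresponding entry:

```python
import re

def replace_all(text: str, pattern: str, template: str) -> str:
    """Replace every match, expanding \\1..\\9 from an array of group strings."""
    regex = re.compile(pattern)

    def expand(match: re.Match) -> str:
        # Array of the texts matched by each (...); unmatched groups become "".
        groups = [g or "" for g in match.groups()]
        out = []
        i = 0
        while i < len(template):
            ch = template[i]
            nxt = template[i + 1] if i + 1 < len(template) else ""
            if ch == "\\" and nxt != "" and nxt in "123456789":
                index = int(nxt) - 1
                # A switch without a corresponding group expands to nothing.
                out.append(groups[index] if index < len(groups) else "")
                i += 2
            else:
                out.append(ch)
                i += 1
        return "".join(out)

    # A callable replacement is inserted literally, without escape processing.
    return regex.sub(expand, text)

# The example from the original description of this issue:
print(replace_all("a silly idea", "si(ll)y", r"\1ogical"))  # a llogical idea
```

As in my Delphi version, & and \0 are deliberately left out of this sketch.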
Comment 80 stx123 2007-07-23 11:27:48 UTC
It looks like the target should be reconsidered; setting type to patch.
Comment 81 andreas.martens 2007-07-23 11:37:40 UTC
Yes, I already created a CWS regexp01 for this, but did not find the time for
OOo2.3. My plan is to improve our regular expression support in one of the
next versions, hopefully 2.4.
Comment 82 ooo 2007-07-23 12:12:20 UTC
Andreas, please consider my comments in #desc55 from Mon Aug 14 10:43:40 +0000
2006 and rework the patch accordingly.

Thanks
  Eike
Comment 83 andreas.martens 2007-11-01 13:26:02 UTC
I will integrate some improvements for regular expressions into OOo2.4.
CWS regexp02 is on its way.
Comment 84 andreas.martens 2007-11-09 08:01:29 UTC
Any volunteers for doing the specification?
We have a first draft at 

http://specs.openoffice.org/appwide/find_and_replace/Regular_Expressions.odt

In CWS regexp02 the backwards references are already implemented (with $0 - $9)
for Writer and Calc
Comment 85 andreas.martens 2007-11-09 16:22:30 UTC
Fixed in CWS regexp02.

Only the specification needs a little bit of improvement ;-)
Comment 86 andreas.martens 2007-11-09 16:23:23 UTC
Ready for QA.
Comment 87 drking 2007-11-10 07:30:41 UTC
> Any volunteers for doing the specification?

It looks as if this needs someone who has done one before, and knows what is 
required.

btw note that the 3rd example on page 4 (Detail Spec) should be ([1-9]+) not 
([1-9]).

I'll volunteer to update the wiki regex HowTo, unless someone beats me to it.

But I am very puzzled why $1 - $9 has been chosen, rather than \1 - \9 as in 
the Search For box. In the HowTo this is going to look silly - along the lines 
of, well when you want a backref in the Search For you use \1 but in the 
Replace with box ....

Could someone enlighten me if there's a good reason? Or will $1 - $9 now work 
in the Search For box as well?

Not knocking the effort - it's a good step forward. Thank you.
Comment 88 ooo 2007-11-10 10:58:07 UTC
@drking: $n was chosen because later at some point we will switch to the ICU
regex engine that also knows this syntax, see
http://www.icu-project.org/userguide/regexp.html for a complete reference. The
$n is also what perl users are acquainted with. And no, $n in search is not
supported, that would conflict with $ being the end-of-text anchor.
Comment 89 gudmund 2007-11-10 11:43:46 UTC
Please pardon my ignorance as a layman. Does the "Resolved" and "FIXED" in this
issue mean that the issues in bugs
http://www.openoffice.org/issues/show_bug.cgi?id=46165 and
http://www.openoffice.org/issues/show_bug.cgi?id=70554 are also covered and fixed?

In short: will an ordinary user be able to
- search and find line breaks, any kind
- search and find paragraph breaks, any kind
- substitute any of the above, be it one or many with one or many of any
combination of the above?

There are a whole lot of translators and other users waiting for this good news,
since these bugs make it impossible for us to use Openoffice as anything much
more than a (resource-heavy) auxiliary for petty tasks.
Comment 90 glebovitz 2007-11-10 16:46:52 UTC
Question to AMA,

When Christian implemented the regex substitution code, he supported both the $n
and \n syntax. Did this change in the final version? I was looking at the regex
specification document above and it only mentions the $n syntax.

It doesn't make that much difference to me, but outside of perl, most regex
packages seem to support the \n syntax, including MS office.

Gregg
Comment 91 drking 2007-11-10 18:35:50 UTC
@er
Thank you - that will be useful when explaining the rationale. 

@gudmund
There are close to 40 issues about regex, and they're all treated separately, 
so no - I'm afraid the other issues you mention are not fixed. The good news is 
that if OOo migrates to the ICU regex engine, many of the existing issues may 
be resolved at a stroke. Although (looking at the ICU regex spec) probably not 
all of them.
Comment 92 andreas.martens 2007-11-14 09:42:18 UTC
ama->glebovitz: The current implementation will support $n, not \n. See comment
from er (Nov 10) about the reason for choosing $n.
Comment 93 sashiman 2007-11-22 15:53:28 UTC
I have read this thread several times now, and am ecstatic to see that it will
be possible to use back-references as described above.  For the moment however,
I can use back-references in the search box (the palindrome "algorithm"
described above works perfectly), but the most recent version 2.3.0 does not
seem to have the $n feature incorporated.  Is this going to be integrated in a
future release, or is it just that I have missed some crucial part of the syntax?

Thank you for adding this feature! If someday it becomes possible to add style
info to the search and replace boxes, I may be able to stop using MS Word
entirely!  
Comment 94 ooo 2007-11-22 19:02:57 UTC
@sashiman: Please see the issue's target that reads OOo2.4, so the change most
certainly is not available in OOo2.3 ...

> If someday it becomes possible to add style info to the search and replace boxes

If you used styles it was always possible to search for and replace with styles,
see the "More Options" button and "Search for Styles". If you used hard
formatting attributes instead then see "More Options" and the "Attributes..."
and "Format..." buttons.
Comment 95 sashiman 2007-11-25 10:42:59 UTC
Ok, thanks for the info on backreferences.

With regard to styles, I know I can replace one style with another, but what I
would like to be able to do is replace a character style with an XML tag, i.e.
find all that is marked with a user-created style, e.g. author, with
<person>^&</person>.  I'll doublecheck to be sure I'm not mistaken, but this
feature does not seem to be available, contrary to in MS Word.  (That said your
HTML is readable, contrary to MS Word, such that I could do the replacement with
a basic text editor (or with Oo) once I've converted to HTML.)  I don't mean to
be hijacking this thread with a separate issue, so my apologies, I just wanted
you to be sure you understood the issue that I was raising.
Comment 96 sashiman 2007-11-25 14:18:54 UTC
In fact, just to make even clearer that this is related to the REGexp issue: 
imagine that one has different heading1 level titles that one wishes to convert
to XML in a TEI type format:  in the other product I would search for 
(Chapter) ([0-9]{1;3})(*)^13 
having the attribute style="heading 1" 
and replace with 
<div type="\1" n="\2">\3</div>.
potentially having the default style (of little relevance, as in exporting to
encoded text all styles will be lost, whereas obviously the tags will not).

Working with texts that have different labels for identical level items (or
captions with different labels, for example) is certainly possible when working
with digitized books.  Using a high-level word-processor allows non-specialists
(or those who would rather not see all the tags) to work on markup, leaving the
XML conversion to a macro...

In any case, I'm glad to see that these concerns are being taken seriously, the
plans for a major overhaul mentioned elsewhere (in the CRLF discussion) are
superb, and from what I saw from the link the icu project looks like a fantastic
target. Thank you!
Comment 97 stefan.baltzer 2007-11-26 17:07:06 UTC
Created attachment 49915 [details]
Test Case
Comment 98 stefan.baltzer 2007-11-26 17:10:33 UTC
SBA: Verified in CWS regexp02.
Comment 99 Joe Smith 2007-12-12 23:44:59 UTC
Hey, this is great! Fantastic job. This will be a big enhancement for OOo's Find
& Replace--thanks for tackling this!

I just ran through a few example tasks that people have asked about. I only
found one glitch:

Capitalize words beginning with h:
s/\<h([a-z]+)/
r/H$1/
Match case = Yes

Starting text:   He heard quiet steps behind him.
Expected result: He Heard quiet steps behind Him.
Actual result:   He H$1 quiet steps behind H$1

OOo-Dev SRC680_m239 on Fedora Linux 8
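
For reference, the expected result can be reproduced outside OOo with a small
Python sketch. Python's re module is only a stand-in for the OOo engine (an
assumption), \b replaces OOo's \< because Python has no start-of-word escape
of that form, and dollar_to_python is a hypothetical helper that rewrites the
$n syntax into Python's own:

```python
import re

def dollar_to_python(template: str) -> str:
    """Translate a '$0'..'$9' replacement template to Python's '\\g<n>' form."""
    # Protect literal backslashes first, since re.sub treats them as escapes.
    escaped = template.replace("\\", "\\\\")
    return re.sub(r"\$([0-9])", r"\\g<\1>", escaped)

text = "He heard quiet steps behind him."

# \b stands in for \< (start of word); matching is case-sensitive by default,
# which corresponds to "Match case = Yes".
result = re.sub(r"\bh([a-z]+)", dollar_to_python("H$1"), text)

print(result)  # He Heard quiet steps behind Him.
```

The "h" inside "behind" is left alone because it is not at a word start, and
"He" is skipped because the search is case-sensitive.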
Comment 100 stefan.baltzer 2008-05-27 17:33:21 UTC
OK in OOo 2.4. Closed.

SBA->jes: To capitalize all words with "h", simply replace the "h" with "H" :-)
Search for: "\<h"
replace with: "H"
Match Case and RegEx checked, Click "Replace all", works.

But you are right, in this case the sub-expressions do not work correctly.
Please file another issue for that one because this issue here was about Sub
expressions GENERALLY working. "Mutating" issues can not be handled with
feasible effort.
Comment 101 gudmund 2008-05-28 09:42:05 UTC
From the 2.4 help file (node "regular expressions;list of"):

"& or $0	Adds the string that was found by the search criteria in the
Search for box to the term in the Replace with box when you make a
replacement. For example, if you enter "window" in the Search for box
and "&frame" in the Replace with box, the word "window" is replaced with
"windowframe". You can also enter an "&" in the Replace with box to
modify the Attributes or the Format of the string found by the search
criteria."

"^$	Finds an empty paragraph."

But: & or $0 in the Replace box do not insert an empty paragraph mark but
instead the characters & or $0.

Nicely enough, & works for inserting \n (with e. g. &&& inserting \n\n\n
if only one \n and one \n alone was in the Search box), which indicates
that correct handling of \n (for inserting line breaks (newline)) and
\r (for inserting paragraph breaks) might not be impossible after all.

@SBA: Should this too be viewed as a special case, with an issue of its
own? I thought issue 46165
http://www.openoffice.org/issues/show_bug.cgi?id=46165 was supposed to be some
sort of collector issue for regular expressions in general.

Or should I reopen issue 70554
http://www.openoffice.org/issues/show_bug.cgi?id=70554 as being a special
case/"mutating" issue?

I can't find a "spec template" that you've mentioned in issue 46165. Is this it?:
http://specs.openoffice.org/  
http://specs.openoffice.org/collaterals/template/2.0/OpenOffice-org-Specification-Template.ott
http://specs.openoffice.org/collaterals/OpenOffice_org_Specification_guide.sxw
http://eis.services.openoffice.org/EIS2/guide.CheckSpecification

If this is the kind of thing it takes, I guess I should file an RFE to the OOo
Bugzilla that such pointers be included in every issue page, "Write a
specification template".
Comment 102 drking 2008-05-28 10:28:44 UTC
@gudmund
>But: & or $0 in the Replace box do not insert an empty paragraph mark 

I think that's how the thing works - you found an empty *paragraph* but tried 
to insert a *paragraph mark*. The Application Help is rather sparse on this 
topic; you might like to read the Wiki:
http://wiki.services.openoffice.org/wiki/Documentation/How_Tos/Regular_Expressions_in_Writer
?
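
A small sketch of that distinction, assuming Python's re module as a stand-in
for the OOo engine and a plain '\n' as a toy paragraph separator (neither is
how Writer stores paragraphs). What '^$' finds is a zero-length string, so
the "found string" that & or $0 refers to contains no paragraph mark:

```python
import re

paragraphs = "first\n\nthird"  # '\n' separates paragraphs in this toy model

# '^$' with MULTILINE finds the empty paragraph between the two breaks.
m = re.search(r"^$", paragraphs, flags=re.MULTILINE)

print(m.span())          # (6, 6): a zero-length match
print(repr(m.group(0)))  # '': the whole match carries no paragraph mark
```

This only illustrates why the found string is empty; it does not explain why
the literal characters & or $0 ended up in the document.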