#2243 regex replaceAll problem

elyashiv Wed 26 Feb 2014

I tried out the Regex API running the following code:

fansh> re := Regex("a*")
a*
fansh> rm := re.matcher("aaaa")
fan.sys.RegexMatcher@d4dd2d
fansh> rm.replaceAll("b")
bb

I expected the result to be like the result of the following command:

~ echo "aaaa" | sed s/a*/b/g
b

Any idea?

tcolar Wed 26 Feb 2014

That seems correct: "a*" means a then anything, so it will match "aa" twice.

I think you probably wanted Regex("a+") if you want to replace every run of a with b, or maybe Regex("a.*") if you want to replace a followed by anything.

In sed s/a*/b/g you used /g, which means greedy, so the * will behave like a .*

SlimerDude Wed 26 Feb 2014

Yep, Tcolar is right:

Regex.fromStr("a*").matcher("aaaa").replaceAll("b")    // --> bb
Regex.fromStr("a+").matcher("aaaa").replaceAll("b")    // --> b

Or to force a match on the entire Str, you can use the start ^ and end $ anchors:

Regex.fromStr("^a*\$").matcher("aaaa").replaceAll("b") // --> b

elyashiv Wed 26 Feb 2014

tcolar: According to this site, the g doesn't mean greedy but global. Matching aa is illogical: if the matching is greedy I would expect the match to be aaaa, and if it is not greedy I would expect the match to be a or epsilon (the empty string).

What I think happened is that the matcher matched aaaa and then matched the empty string at the end. This behavior is incorrect.

A little testing proves me right:

fansh> re := Regex("a*")
a*
fansh> ra := re.matcher("aaaa")
fan.sys.RegexMatcher@d8d3ce
fansh> ra.matches
true
fansh> ra.group
aaaa
fansh> ra.start
0
fansh> ra.end
4
fansh> ra.find
true
fansh> ra.group

fansh> ra.start
4
fansh> ra.end
4
fansh> 
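
The same sequence of matches can be listed in plain Java, which is what Fantom's regex support sits on. This is only a sketch: the class name ZeroWidthMatches and the helper spans are made up for illustration, and only java.util.regex calls are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Fantom or the JDK): lists every match
// that repeated calls to Matcher.find() report for a pattern.
final class ZeroWidthMatches {

    // Returns one entry per match, formatted as [start,end)='text'.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("[" + m.start() + "," + m.end() + ")='" + m.group() + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        // a* first consumes all four characters, then matches the empty
        // string at offset 4, so two matches are reported.
        System.out.println(spans("a*", "aaaa")); // [[0,4)='aaaa', [4,4)='']
        // a+ needs at least one character, so there is no trailing empty match.
        System.out.println(spans("a+", "aaaa")); // [[0,4)='aaaa']
    }
}
```

Each reported match is replaced once by replaceAll, which is where the second b comes from.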

SlimerDude Wed 26 Feb 2014

Underneath, Fantom is just using Java's Matcher.replaceAll(), so other than writing a new regex implementation, I don't think a lot can be done about it.
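
For anyone who wants to check this outside Fantom, a rough Java equivalent of the calls in this thread might look like the following. The class ReplaceAllDemo and its helper are invented names for the sketch; Pattern and Matcher are the standard java.util.regex classes.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: runs Java's own Matcher.replaceAll() with the
// patterns discussed in this thread.
final class ReplaceAllDemo {

    // Thin wrapper around Pattern.compile(...).matcher(...).replaceAll(...).
    static String replaceAll(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("a*", "aaaa", "b"));   // bb
        System.out.println(replaceAll("a+", "aaaa", "b"));   // b
        System.out.println(replaceAll("^a*$", "aaaa", "b")); // b
    }
}
```

The results line up with the Fantom output above, which fits the idea that the behaviour comes from the Java layer rather than from Fantom itself.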

With regards to your example, you're right - Fantex shows the same results:

regex   := Regex<|(a*)|>
matcher := regex.matcher("aaaa") 

regex.matches("aaaa")   // --> true

matcher.find()
matcher.group(0)  // --> "aaaa"
matcher.group(1)  // --> "aaaa" 

matcher.find()
matcher.group(0)  // --> ""
matcher.group(1)  // --> "" 

I also found this question on StackOverflow, posted 2 years ago - String.replaceAll() anomaly with greedy quantifiers in regex. The answer explains why the result is valid, and why it is different in sed. (In essence, once .* has consumed the input it still matches the empty string left at the end, and that second match is replaced with b as well.)

SlimerDude Wed 26 Feb 2014

A bit more reading suggests it's all about zero-width matches. This article tells us it's not consistent, even between browsers! - Watch Out for Zero-Length Matches
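
If the sed result is what you are after, one possible workaround is to do the replacement loop by hand and skip an empty match that starts exactly where the previous match ended. The sketch below is in Java and is only an assumption about how to approximate that behaviour; it is not how sed is implemented, and SedLikeReplace and replaceAllSedStyle are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: like Matcher.replaceAll(), but ignores a zero-length
// match that sits directly after the previous match.
final class SedLikeReplace {

    static String replaceAllSedStyle(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        // Treat the replacement literally ($ and \ have no special meaning).
        String literal = Matcher.quoteReplacement(replacement);
        StringBuffer sb = new StringBuffer();
        int lastEnd = -1; // end offset of the last match that was replaced
        while (m.find()) {
            boolean empty = m.start() == m.end();
            if (empty && m.start() == lastEnd) {
                continue; // empty match glued to the previous match: skip it
            }
            m.appendReplacement(sb, literal);
            lastEnd = m.end();
        }
        m.appendTail(sb); // copy whatever follows the last replaced match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllSedStyle("a*", "aaaa", "b"));  // b
        System.out.println(replaceAllSedStyle("a*", "baaac", "x")); // xbxcx
    }
}
```

Using a+ or anchors, as suggested earlier in the thread, is simpler whenever the pattern can be changed; the loop is only worth it when the pattern has to stay as it is.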

It really does seem to be a case of:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
