Strange regex engine behavior with a lazy star token
ptrk.mj
E.g.
$regexp(a,.*?,+) = +++

while every flavor that I know (including Perl) outputs: +a+

From my understanding:
Since the dot is quantified by a lazy star (and there are no other tokens), it never matches any character (the token is skipped). The whole regex does match, however, twice: first at the position of 'a', and then at the very end of the string (regex engines try to match a pattern even after all characters of the subject string have been consumed). Both matches are zero-width.
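
The two zero-width matches described above can be listed with a small sketch. It is written in Python purely for illustration (Mp3tag's own engine is not involved), and the helper name spans_advancing_past_empty is made up for this example. It assumes one common convention for continuing after an empty match: step one character forward before searching again.

```python
import re

def spans_advancing_past_empty(pattern, text):
    """List match spans; after an empty match, step one character forward."""
    rx = re.compile(pattern)
    spans = []
    pos = 0
    # pos may equal len(text): that is the offset "behind the last character".
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        # Assumed convention: never retry at the offset of an empty match.
        pos = m.end() + 1 if m.start() == m.end() else m.end()
    return spans

print(spans_advancing_past_empty(r".*?", "a"))  # [(0, 0), (1, 1)]
```

Both spans have zero length, one at offset 0 and one at offset 1. An engine that instead retries at the same offset and demands a non-empty match there would report a third match covering the a.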

This does not happen in Mp3tag.
I don't know what the logic is here.

I know that this may sound a bit cryptic to those not familiar with regular expressions. I'm not 100% sure of it myself. It may be a bug, it may be an oddity. Anyway, an explanation would be much appreciated :)

Regards
DetlevD
QUOTE (pmj1989 @ Jun 18 2010, 22:09)
... $regexp(a,.*?,+) = +++ ...

Yes, this is rather confusing ... one single character should be replaced by one single character, and the result is a string of three characters!

$regexp('a','.','+') will give a single plus character.
$regexp('a','.+','+') will give a single plus character.
$regexp('a','.+?','+') will give a single plus character.

If you set a "start of line" anchor ...
$regexp('a','^.','+') will give a single plus character.
$regexp('a','^.?','+') will give a single plus character.
$regexp('a','^.*','+') will give a single plus character.

... but
$regexp('a','.*','+') will give two plus characters.
$regexp('a','^.*?','+') will give two plus characters.
$regexp('a','.*$','+') will give two plus characters.
$regexp('a','.*?$','+') will give two plus characters.

$regexp('a','.*?','+') will give three plus characters.


If both the "start of line" and "end of line" anchors are set, then the expression works predictably.
$regexp('a','^.*?$','+') will give a single plus character as the result.
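
For comparison, the same patterns can be run through another engine. This is a minimal sketch, assuming Python 3.7 or later, where re.sub allows a non-empty match to start right after an empty one at the same offset. It says nothing about how Mp3tag is implemented; the REPORTED table simply restates the counts listed above, and plus_count is a helper invented for this example.

```python
import re
import sys

# Patterns from the list above with the number of pluses reported for Mp3tag.
REPORTED = {
    r".": 1, r".+": 1, r".+?": 1,
    r"^.": 1, r"^.?": 1, r"^.*": 1,
    r".*": 2, r"^.*?": 2, r".*$": 2, r".*?$": 2,
    r".*?": 3,
    r"^.*?$": 1,
}

def plus_count(pattern, text="a"):
    """Number of '+' characters after replacing every match with '+'."""
    return re.sub(pattern, "+", text).count("+")

# The empty-match handling relied on here was introduced in Python 3.7.
assert sys.version_info >= (3, 7)

for pattern, reported in REPORTED.items():
    result = re.sub(pattern, "+", "a")
    print(f"{pattern:8} -> {result:4} (reported for Mp3tag: {reported})")
```

Under that rule every count comes out the same as in the list, including the three pluses for .*?: an empty match at offset 0, then a matched at offset 0, then an empty match at offset 1. As far as I can tell, Perl's perlre documentation ("Repeated Patterns Matching a Zero-length Substring") describes a comparable rule.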

Is there anybody out there ... who understands the rules of the regular expression language in their entirety?

DD.20100619.1245.CEST
ptrk.mj
I decided to bring the topic back to life.

QUOTE ("DetlevD")
Yes, this is rather confusing ... one single character should be replaced by one single character, and the result is a string of three characters!

As I explained in the first post, regex engines, apart from matching at each character offset (in our one-character example, 'offset 0'), also try to match the pattern at the offset just behind the last character, i.e. against the void at the end of the string (in our case, 'offset 1').

Put the equivalent of $regexp(a,.*?,+) (subject a, pattern .*?, replacement +) into any regex tool and you get +a+

The two zero-width matches (at the offsets described above) are each replaced with a + character. That's 100% correct! Yet Mp3tag outputs +++

1.
QUOTE ("DetlevD")
$regexp('a','.*','+') will give two plus characters.

Correct!

Match 1: a
Match 2: {void at the end of the string}

2.
QUOTE ("DetlevD")
$regexp('a','^.*?','+') will give two plus characters.

Wrong!

This regex should produce +a

Explanation: There is only one match, a zero-width one (due to the laziness of the star) at 'offset 0'. The match at 'offset 1' fails (due to the ^ anchor).

3.
QUOTE ("DetlevD")
$regexp('a','.*$','+') will give two plus characters.

Correct!

It matches the same way as in 1. The $ anchor doesn't change anything here.

4.
QUOTE ("DetlevD")
$regexp('a','.*?$','+') will give two plus characters.

Correct!

Again, the same matches as in 1. and 3.
The .*? token is forced by the $ anchor to expand its match to the letter a. After that, there's another match at the end of the string.

5.
QUOTE ("DetlevD")
$regexp('a','^.*?$','+') will give a single plus character as the result.

Correct!

Match 1: a

A single one-character match at 'offset 0'.
The $ anchor makes the .*? token expand its match to cover the whole string.
The ^ anchor ensures that the pattern cannot match after the string ('offset 1').
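
The five expected results above follow from one specific rule: after a zero-width match, the engine steps one character forward before trying again. Here is a sketch of that rule in Python. The helper sub_advancing_past_empty is hypothetical, written only for this illustration, and is not Mp3tag's code.

```python
import re

def sub_advancing_past_empty(pattern, repl, text):
    """Replace every match; after an empty match, copy one character and move on."""
    rx = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        out.append(text[pos:m.start()])  # unmatched text before this match
        out.append(repl)
        if m.start() == m.end():
            # Zero-width match: keep the next character and step past it.
            out.append(text[m.end():m.end() + 1])
            pos = m.end() + 1
        else:
            pos = m.end()
    out.append(text[pos:])  # whatever is left after the last match
    return "".join(out)

for pattern in (r".*", r"^.*?", r".*$", r".*?$", r"^.*?$", r".*?"):
    print(pattern, "->", sub_advancing_past_empty(pattern, "+", "a"))
# .*    -> ++
# ^.*?  -> +a
# .*$   -> ++
# .*?$  -> ++
# ^.*?$ -> +
# .*?   -> +a+
```

An engine that, after an empty match, first retries at the same offset for a non-empty match would differ from this only for the patterns that can match emptily at offset 0, namely ^.*? and .*?.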


What we learn from above:
- The problem occurs only when using the *? (lazy star) token.
- Using the $ anchor prevents the bug from appearing.

Also:
- The bug shows up when the lazily quantified token is able to match the character in the string (in our case, the dot matching 'a').

Other examples:
$regexp(b,b*?,+)
$regexp(9,\d*?,+)
etc.
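
These two examples can be checked against at least one other engine. A quick sketch, assuming Python 3.7 or later (where a non-empty match may follow an empty one at the same offset); nothing in it is specific to Mp3tag.

```python
import re

for pattern, text in ((r"b*?", "b"), (r"\d*?", "9")):
    # Spans of all matches found by the engine, in order.
    spans = [m.span() for m in re.finditer(pattern, text)]
    replaced = re.sub(pattern, "+", text)
    print(pattern, text, spans, replaced)
```

On Python 3.7+ both lines show three spans, (0, 0), (0, 1) and (1, 1), and the replaced string is +++. So three pluses is what this particular empty-match rule produces, rather than something unique to one tool.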