regex Back reference matching result not same as glibc and tre.

Submitted by liheng (P) on April 18, 2020, 8:44 a.m.

Details

Message ID 6D612B6AC5DCDA4580AF97B1068118AD2DC49A@DGGEML501-MBX.china.huawei.com
State New
Series "regex Back reference matching result not same as glibc and tre."
Headers show

Commit Message

liheng (P) April 18, 2020, 8:44 a.m.
Rich Felker:

Hello, I've noticed musl regex matching result is not same as glibc and tre. 
The back reference maybe not supported well in latest version.

Here is a simple test case:

#include <regex.h>
#include <stdio.h>
#include <string.h>

#define str "aba"
#define N 2
static const char *expected[N] =
{
        str, "a"
};

static const char pat[] = "(.?).?\\1";

int test_regex(void)
{
        regex_t rbuf;

        int err = regcomp(&rbuf, pat, REG_EXTENDED);
        if (err != 0) {
                char errstr[300];
                regerror(err, &rbuf, errstr, sizeof (errstr));
                puts (errstr);
                return err;
        }

        regmatch_t m[N];
        err = regexec(&rbuf, str, N, m, 0);
        if (err != 0) {
                puts ("regexec failed");
                return 1;
        }

        int result = 0;
        int i;
        for (i = 0; i < N; ++i) {
                if (m[i].rm_so == -1) {
                        printf ("m[%d] unused\n", i);
                        result = 1;
                }
                else {
                        int len = m[i].rm_eo - m[i].rm_so;
                        printf ("m[%d] = \"%.*s\"\n", i, len, str + m[i].rm_so);
                        if (strlen (expected[i]) != len
                                || memcmp (expected[i], str + m[i].rm_so, len) != 0)
                                result = 1;
                }
        }

        return result;
}

int main (void)
{
        int result = 0;

        result = test_regex();

        if (result != 0) {
                printf("test regex failed\n");
        } else {
                printf("test regex success\n");
        }

        return result;
}

musl: 
# ./test
regexec failed
test regex failed

glibc:
# ./test
m[0] = "aba"
m[1] = "a"
m[2] = ""
test regex success

tre:
# ./test
m[0] = "aba"
m[1] = "a"
m[2] = ""
test regex success


I noticed Rich Felker made change about back reference in below commit to suppress back reference processing in ERE regcomp.

commit 7c8c86f6308c7e0816b9638465a5917b12159e8f
Author: Rich Felker <dalias@aerifal.cx>
Date:   Fri Mar 20 18:25:01 2015 -0400

    suppress backref processing in ERE regcomp

    one of the features of ERE is that it's actually a regular language
    and does not admit expressions which cannot be matched in linear time.
    introduction of \n backref support into regcomp's ERE parsing was
    unintentional.


And I try to support back reference in ERE regcomp by below modify and then the musl regex matching success same as glibc and tre.

--- a/src/regex/regcomp.c
+++ b/src/regex/regcomp.c
                default:
+                       if (!ere && isdigit(*s)) {
+                       if (ere && isdigit(*s)) {
                                /* back reference */


Thank you for considering this.

Li Heng

Patch hide | download patch | download mbox

diff --git a/src/regex/regcomp.c b/src/regex/regcomp.c index bce6bc15..4d80cb1c 100644
--- a/src/regex/regcomp.c
+++ b/src/regex/regcomp.c
@@ -839,7 +839,7 @@  static reg_errcode_t parse_atom(tre_parse_ctx_t *ctx, const char *s)
                        break;
                default:
-                       if (isdigit(*s)) {
+                       if (!ere && isdigit(*s)) {
                                /* back reference */


This commit reminds me that if i want to use back reference i should not to tag REG_EXTENDED, but this test case matching still failed.

Comments

Florian Weimer April 18, 2020, 10:28 a.m.
* liheng:

> static const char pat[] = "(.?).?\\1";

> This commit reminds me that if i want to use back reference i should
> not to tag REG_EXTENDED, but this test case matching still failed.


Did you change the expression to this for the basic regular expression
test?

static const char pat[] = "\\(.?\\).?\\1";
liheng (P) April 18, 2020, 11:07 a.m.
static const char pat[] = "\\(.?\\).?\\1";
string: "aba";

I tested this pattern by my test case just now.

musl:
# ./test
regexec failed
test regex failed

glibc:
# ./test
Invalid back reference
test regex failed

tre:
# ./test
Invalid back reference
test regex failed

-----Original Message-----
From: Florian Weimer [mailto:fw@deneb.enyo.de] 
Sent: Saturday, April 18, 2020 6:29 PM
To: liheng (P) <liheng40@huawei.com>
Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.

* liheng:

> static const char pat[] = "(.?).?\\1";

> This commit reminds me that if i want to use back reference i should 
> not to tag REG_EXTENDED, but this test case matching still failed.


Did you change the expression to this for the basic regular expression test?

static const char pat[] = "\\(.?\\).?\\1";
Szabolcs Nagy April 18, 2020, 11:13 a.m.
* liheng (P) <liheng40@huawei.com> [2020-04-18 11:07:20 +0000]:
> static const char pat[] = "\\(.?\\).?\\1";
> string: "aba";

? is not special in bre

it should be \{0,1\} (i think we support \? as an extension,
but unescaped ? only matches literal ?). try one of

static const char pat[] = "\\(.\\{0,1\\}\\).\\{0,1\\}\\1";
static const char pat[] = "\\(.\\?\\).\\?\\1";

> 
> I tested this pattern by my test case just now.
> 
> musl:
> # ./test
> regexec failed
> test regex failed
> 
> glibc:
> # ./test
> Invalid back reference
> test regex failed
> 
> tre:
> # ./test
> Invalid back reference
> test regex failed
> 
> -----Original Message-----
> From: Florian Weimer [mailto:fw@deneb.enyo.de] 
> Sent: Saturday, April 18, 2020 6:29 PM
> To: liheng (P) <liheng40@huawei.com>
> Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
> 
> * liheng:
> 
> > static const char pat[] = "(.?).?\\1";
> 
> > This commit reminds me that if i want to use back reference i should 
> > not to tag REG_EXTENDED, but this test case matching still failed.
> 
> 
> Did you change the expression to this for the basic regular expression test?
> 
> static const char pat[] = "\\(.?\\).?\\1";
liheng (P) April 18, 2020, 11:37 a.m.
static const char pat[] = "\\(.?\\).?\\1";
str = "aba"

ok, I retest this pat with no tag.

regcomp(&rbuf, pat, 0);
regexec1(&rbuf, str, N, m, 0);

 glibc:
 # ./test
 regexec failed
 test regex failed

 musl:
 # ./test
 regexec failed
 test regex failed



-----Original Message-----
From: Szabolcs Nagy [mailto:nsz@port70.net] 
Sent: Saturday, April 18, 2020 7:13 PM
To: liheng (P) <liheng40@huawei.com>
Cc: Florian Weimer <fw@deneb.enyo.de>; Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.

* liheng (P) <liheng40@huawei.com> [2020-04-18 11:07:20 +0000]:
> static const char pat[] = "\\(.?\\).?\\1";
> string: "aba";

? is not special in bre

it should be \{0,1\} (i think we support \? as an extension, but unescaped ? only matches literal ?). try one of

static const char pat[] = "\\(.\\{0,1\\}\\).\\{0,1\\}\\1"; static const char pat[] = "\\(.\\?\\).\\?\\1";

> 
> I tested this pattern by my test case just now.
> 
> musl:
> # ./test
> regexec failed
> test regex failed
> 
> glibc:
> # ./test
> Invalid back reference
> test regex failed
> 
> tre:
> # ./test
> Invalid back reference
> test regex failed
> 
> -----Original Message-----
> From: Florian Weimer [mailto:fw@deneb.enyo.de]
> Sent: Saturday, April 18, 2020 6:29 PM
> To: liheng (P) <liheng40@huawei.com>
> Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui 
> (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
> 
> * liheng:
> 
> > static const char pat[] = "(.?).?\\1";
> 
> > This commit reminds me that if i want to use back reference i should 
> > not to tag REG_EXTENDED, but this test case matching still failed.
> 
> 
> Did you change the expression to this for the basic regular expression test?
> 
> static const char pat[] = "\\(.?\\).?\\1";
Szabolcs Nagy April 18, 2020, 2:07 p.m.
* liheng (P) <liheng40@huawei.com> [2020-04-18 11:37:13 +0000]:
> static const char pat[] = "\\(.?\\).?\\1";
> str = "aba"
> 
> ok, I retest this pat with no tag.

why?

? is not special in bre.
you need "\\{0,1\\}" or "\\?" instead of "?" to match "aba"

your pat would match str="a?b?a?" in a standard conform
implementation.

> 
> regcomp(&rbuf, pat, 0);
> regexec1(&rbuf, str, N, m, 0);
> 
>  glibc:
>  # ./test
>  regexec failed
>  test regex failed
> 
>  musl:
>  # ./test
>  regexec failed
>  test regex failed
> 
> 
> 
> -----Original Message-----
> From: Szabolcs Nagy [mailto:nsz@port70.net] 
> Sent: Saturday, April 18, 2020 7:13 PM
> To: liheng (P) <liheng40@huawei.com>
> Cc: Florian Weimer <fw@deneb.enyo.de>; Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
> 
> * liheng (P) <liheng40@huawei.com> [2020-04-18 11:07:20 +0000]:
> > static const char pat[] = "\\(.?\\).?\\1";
> > string: "aba";
> 
> ? is not special in bre
> 
> it should be \{0,1\} (i think we support \? as an extension, but unescaped ? only matches literal ?). try one of
> 
> static const char pat[] = "\\(.\\{0,1\\}\\).\\{0,1\\}\\1"; static const char pat[] = "\\(.\\?\\).\\?\\1";
> 
> > 
> > I tested this pattern by my test case just now.
> > 
> > musl:
> > # ./test
> > regexec failed
> > test regex failed
> > 
> > glibc:
> > # ./test
> > Invalid back reference
> > test regex failed
> > 
> > tre:
> > # ./test
> > Invalid back reference
> > test regex failed
> > 
> > -----Original Message-----
> > From: Florian Weimer [mailto:fw@deneb.enyo.de]
> > Sent: Saturday, April 18, 2020 6:29 PM
> > To: liheng (P) <liheng40@huawei.com>
> > Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui 
> > (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> > Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
> > 
> > * liheng:
> > 
> > > static const char pat[] = "(.?).?\\1";
> > 
> > > This commit reminds me that if i want to use back reference i should 
> > > not to tag REG_EXTENDED, but this test case matching still failed.
> > 
> > 
> > Did you change the expression to this for the basic regular expression test?
> > 
> > static const char pat[] = "\\(.?\\).?\\1";
liheng (P) April 19, 2020, 12:26 p.m.
Ok, you are right, I retest to match "aba" by pat[] = "\\(.\\?\\).\\?\\1" success without tags (basic regular expression mode I think).
regcomp1(&rbuf, pat, 0);

But my point is that  why pat[] = "(.?).?\\1"  to match "aba" in extended regular expression mode that success in glibc and failed in musl?   Are musl-regex and glibc-regex different? 


-----Original Message-----
From: Szabolcs Nagy [mailto:nsz@port70.net] 
Sent: Saturday, April 18, 2020 10:07 PM
To: liheng (P) <liheng40@huawei.com>
Cc: musl@lists.openwall.com; Florian Weimer <fw@deneb.enyo.de>; Rich Felker <dalias@libc.org>; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.

* liheng (P) <liheng40@huawei.com> [2020-04-18 11:37:13 +0000]:
> static const char pat[] = "\\(.?\\).?\\1"; str = "aba"
> 
> ok, I retest this pat with no tag.

why?

? is not special in bre.
you need "\\{0,1\\}" or "\\?" instead of "?" to match "aba"

your pat would match str="a?b?a?" in a standard conform implementation.

> 
> regcomp(&rbuf, pat, 0);
> regexec1(&rbuf, str, N, m, 0);
> 
>  glibc:
>  # ./test
>  regexec failed
>  test regex failed
> 
>  musl:
>  # ./test
>  regexec failed
>  test regex failed
> 
> 
> 
> -----Original Message-----
> From: Szabolcs Nagy [mailto:nsz@port70.net]
> Sent: Saturday, April 18, 2020 7:13 PM
> To: liheng (P) <liheng40@huawei.com>
> Cc: Florian Weimer <fw@deneb.enyo.de>; Rich Felker <dalias@libc.org>; 
> musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>; 
> Lizefan <lizefan@huawei.com>
> Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
> 
> * liheng (P) <liheng40@huawei.com> [2020-04-18 11:07:20 +0000]:
> > static const char pat[] = "\\(.?\\).?\\1";
> > string: "aba";
> 
> ? is not special in bre
> 
> it should be \{0,1\} (i think we support \? as an extension, but 
> unescaped ? only matches literal ?). try one of
> 
> static const char pat[] = "\\(.\\{0,1\\}\\).\\{0,1\\}\\1"; static 
> const char pat[] = "\\(.\\?\\).\\?\\1";
> 
> > 
> > I tested this pattern by my test case just now.
> > 
> > musl:
> > # ./test
> > regexec failed
> > test regex failed
> > 
> > glibc:
> > # ./test
> > Invalid back reference
> > test regex failed
> > 
> > tre:
> > # ./test
> > Invalid back reference
> > test regex failed
> > 
> > -----Original Message-----
> > From: Florian Weimer [mailto:fw@deneb.enyo.de]
> > Sent: Saturday, April 18, 2020 6:29 PM
> > To: liheng (P) <liheng40@huawei.com>
> > Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui
> > (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> > Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
> > 
> > * liheng:
> > 
> > > static const char pat[] = "(.?).?\\1";
> > 
> > > This commit reminds me that if i want to use back reference i 
> > > should not to tag REG_EXTENDED, but this test case matching still failed.
> > 
> > 
> > Did you change the expression to this for the basic regular expression test?
> > 
> > static const char pat[] = "\\(.?\\).?\\1";
Florian Weimer April 19, 2020, 1:10 p.m.
* liheng:

> But my point is that why pat[] = "(.?).?\\1" to match "aba" in
> extended regular expression mode that success in glibc and failed in
> musl?  Are musl-regex and glibc-regex different?

They are different.  Nowadays, accepting backreferences for extended
regular expressions is probably a bug: it prevents certain strategies
for implementing regular expressions because they are not, in fact,
regular.

The glibc implementation is problematic for several reasons.  I cannot
recommend to use it as a reference.
Rich Felker April 20, 2020, 1:26 a.m.
On Sun, Apr 19, 2020 at 12:26:58PM +0000, liheng (P) wrote:
> Ok, you are right, I retest to match "aba" by pat[] =
> "\\(.\\?\\).\\?\\1" success without tags (basic regular expression
> mode I think). regcomp1(&rbuf, pat, 0);
> 
> But my point is that why pat[] = "(.?).?\\1" to match "aba" in
> extended regular expression mode that success in glibc and failed in
> musl? Are musl-regex and glibc-regex different?

Having backreferences in regular expressions is inherently a bug,
because it makes them non-regular and unsafe for use where the
expression is user input and finishing in finite time is a
requirement. They're supported in BRE because it's a historical
requirement impossed by the standard, but a major *feature* of ERE is
actually being regular and not having backreferences.

Rich