2012/04/03

Maybe Perl IS A Write-Only Language: Rereading a Regular Expression

OK, I don't really believe that, but I found this in my code base during a refactor today.

        $array[ 0 ] =~ s/[^\w\d\._]+/_/gomx ;

It took me some re-reading and reminding to decipher what it is supposed to be doing.

The brackets hold a character class, which here matches everything that isn't a word character, a digit (digits are already word characters, so \d is redundant), a period or an underbar. The substitution replaces each run of such characters with an underbar.

The caret (^) is a negator, and the square brackets are a character class. [^A-Z] matches any character which isn't a capital ASCII letter, which is to say, any character that is not in the class of letters between A and Z.
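To make the negation concrete, here is a minimal sketch. The sample strings are made up for illustration and are not from my data.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only to illustrate the class.
my @samples = ( 'ABC', 'AbC', 'A-Z' );

# Record, per string, whether it contains any character outside A-Z.
my %has_other;
for my $s (@samples) {
    $has_other{$s} = ( $s =~ /[^A-Z]/ ) ? 1 : 0;
}

# The caret only negates when it is the first character in the class.
# Here it sits second, so it is a literal caret.
my $literal_caret = ( 'a^b' =~ /[a^]b/ ) ? 1 : 0;

printf "%-4s %d\n", $_, $has_other{$_} for @samples;
print "literal caret matched: $literal_caret\n";
```

Note that 'A-Z' counts as containing an outside character, because the hyphen in the string is not a capital letter.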

The flags, "/gomx", prove to me that this is my code, and it dates it to a point after I had bought and read through Perl Best Practices (where Damian Conway recommends always using /m and /x) but before a presentation in my Perl Mongers group argued that the /o flag was useless. The usage, if I recall correctly, was to "precompile" the regular expression, making repeated uses faster. The speaker benchmarked it and found it didn't affect performance, or didn't affect performance positively, or didn't affect performance much. Looking in perlre, I'm not seeing evidence that /o is still in the language.
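A note on /o: the flags for the m// and s/// operators are documented in perlop rather than perlre, and as I understand that page, /o only matters when a variable is interpolated into the pattern. The pattern is compiled once and later changes to the variable are ignored. A minimal sketch, with made-up variable names:

```perl
use strict;
use warnings;

my ( @with_o, @without_o );

for my $pat ( 'a', 'b' ) {
    # With /o the pattern is compiled on the first pass (as /a/) and
    # never rebuilt, so the change to $pat is ignored on the second pass.
    push @with_o, ( 'b' =~ /$pat/o ? 1 : 0 );

    # Without /o the pattern follows the current value of $pat.
    push @without_o, ( 'b' =~ /$pat/ ? 1 : 0 );
}

print "with /o:    @with_o\n";
print "without /o: @without_o\n";
```

The regex in this post interpolates nothing, so by that reading /o does nothing for it either way.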

/g means global. Normally, a substitution works on only the first thing it matches, but this flag means it works on everything that matches.
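A small sketch of the difference, using a hypothetical input string:

```perl
use strict;
use warnings;

my $line = 'a b c';    # hypothetical input

( my $first_only = $line ) =~ s/\s/_/;     # replaces only the first space
( my $every_one  = $line ) =~ s/\s/_/g;    # replaces every space

# s///g also returns how many substitutions it made.
my $copy  = $line;
my $count = ( $copy =~ s/\s/_/g );

print "$first_only\n$every_one\n$count\n";
```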


/x extends formatting, allowing you to space out and document your regular expression. I clearly didn't do that. If I had documented this code, I wouldn't be blogging it now. /m changes the ^ and $ anchors so that they match at the start and end of each line inside a multi-line string. This code runs on tab-delimited data, line-by-line, and uses neither anchor. Damian Conway suggests in Perl Best Practices that you always use these, even when you don't need them. Randal Schwartz told me on Google Plus that this is tantamount to cargo-cult programming. (I don't believe I have done terrible damage in my attempts to present their positions. I also don't believe I have standing to argue on one side or the other. For further details, talk to them.)
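Here is a sketch of what the two flags change when they are actually put to use. The field value and the multi-line string are invented for the example, and the comment inside the pattern is my reading of it, not documentation that existed in the code base.

```perl
use strict;
use warnings;

# /x: whitespace and comments inside the pattern are ignored,
# so the pattern can carry its own documentation.
my $field = 'foo bar!!baz.txt';    # hypothetical field value
$field =~ s{
    [^\w\d\._]+    # a run of anything that is not a word character,
                   # a digit, a period or an underbar
}{_}gx;

# /m: ^ and $ match at every line boundary, not just at the
# start and end of the whole string.
my $text = "one\ntwo\nthree";
my @without_m = $text =~ /^(\w+)/g;
my @with_m    = $text =~ /^(\w+)/gm;

print "$field\n";
print scalar(@without_m), " vs ", scalar(@with_m), "\n";
```

Without any ^ or $ in the pattern, as in the regex from my code, /m has nothing to act on.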

So, this code takes everything that isn't in the small set of characters and replaces groups of them with a single underbar. Another redundancy is that underbars are word characters. m{[\w]} matches '_' just as m{[\w_]} does. It would be a better regex if underbars were not in the protected set and got replaced along with everything else, so that a string like '%^&_%*(&' would become '_' not '___'. This would be a better regex:

        $array[ 0 ] =~ s/[^A-Za-z0-9\.]+/_/gmx ;

And it would be better still if I documented it. So, this is clearly a case where I had a problem, brought in regexes, and now have two problems. Thank you JWZ.
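As a quick check of the claim about '%^&_%*(&', this sketch runs both versions of the substitution on a copy of that string. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $sample = '%^&_%*(&';    # the example string from the text above

# Original pattern: the underbar is protected, so it survives between
# the two runs that get replaced.
( my $old = $sample ) =~ s/[^\w\d\._]+/_/gomx;

# Revised pattern: the underbar is part of the run, so the whole
# string collapses to one underbar.
( my $new = $sample ) =~ s/[^A-Za-z0-9\.]+/_/gmx;

print "old: $old\nnew: $new\n";
```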

10 comments:

  1. would this also be a "better" regex:

    s/\W+/_/gmx

    It is a wider net than your final one and might play nicer with Unicode. :-)

    Also, doesn't "." lose its specialness inside []? A quick look at perlre and search engines is inconclusive.

    However, "." should be captured by \W.

  2. I don't know that I agree with your summation - I had no problem reading the line of code.

  3. /m only affects the ^ and $ anchors, of which you use neither, so that is redundant.

    /x is also not being used by your example.

    Regexps in general often end up being rather unreadable when you come back to read them again later, which is why they ought to be commented nicely:

    # Turn clusters of non-identifier characters into underscores
    $array[0] =~ s/[^A-Za-z0-9\.]+/_/g;

    FTFY.

  5. Doesn't /o matter only if there is variable interpolation inside regexp? /o basically is promise that variable wouldn't change.

  6. This comment has been removed by the author.

  7. This comment has been removed by the author.

  8. Note that \w will be affected by locale, so if you really want an "ASCII-only" string, it is probably a better idea to use the explicit ranges.

  9. FYI, the perlre man page documents the syntax of regular expressions themselves. Whereas /o is an option to the m// and s/// operators and is therefore documented in the perlop man page.

    The /o is still part of Perl but it is seldom required and only has an effect if you have a variable in your regex (e.g: /www[.]$domain_name/).

  10. gizmo, I'm sure I tried it both ways several years ago and decided that \. is what I wanted to do.

    tempire, you are clearly a better man than I. Or I was just caught flat-footed this day. As regexes go, or even as regexes I have written go, this is very slight.

    LeoNerd, I think I explicitly say that /m and /x aren't used, that Damian Conway says I should use 'em always, anyway, and that Randal Schwartz thinks that's cargo cult programming.

    Jakub and grant, I suppose that there is a fundamental misunderstanding from over a decade ago as to what compiling a regex is supposed to do. Thanks for helping me work that out.

    Daniel, explicit ranges are what I ended up with.

    Thanks for all your responses. The refactor that this regex was part of has succeeded in making the function half the size, much more DRY, and much much more readable and maintainable.
