2012/04/03

Maybe Perl IS A Write-Only Language: Rereading a Regular Expression

OK, I don't really believe that, but I found this in my code base during a refactor today.

        $array[ 0 ] =~ s/[^\w\d\._]+/_/gomx ;

It took me some re-reading and reminding to decipher what it is supposed to be doing.

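The square brackets hold a character class. Because it starts with a caret, the class is negated: it matches any character that isn't a word character, a digit character (which is also a word character, so it is redundant), a period or an underbar. The + makes it match a run of one or more such characters, and the substitution replaces each run with a single underbar.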

The caret (^) is a negator when it is the first character inside square brackets, and the square brackets delimit a character class. [^A-Z] matches any character which isn't a capital ASCII letter, which is to say, any character that is not in the range of letters between A and Z.
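To convince myself of how the negation works, here is a minimal sketch. The helper name non_capitals is made up for this illustration and is not in my code base.

```perl
use strict;
use warnings;

# Hypothetical helper: returns only the characters that [^A-Z] matches,
# i.e. everything that is not a capital ASCII letter.
sub non_capitals {
    my ($text) = @_;
    return join '', $text =~ /[^A-Z]/g;
}

print non_capitals('ABC-def'), "\n";    # prints "-def"

# A caret that is not the first character in the class is just a literal caret.
print "literal caret\n" if '^' =~ /[A-Z^]/;
```

The capitals are skipped, and everything else, including the hyphen, is matched.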

The flags, "/gomx", prove to me that this is my code, and they date it to a point after I had bought and read through Perl Best Practices (where Damian Conway recommends always using /m and /x) but before a presentation in my Perl Mongers group argued that the /o flag was useless. The purpose, if I recall correctly, was to compile the regular expression only once, making repeated uses faster. The speaker benchmarked it and found it didn't affect performance, or didn't affect performance positively, or didn't affect performance much. Looking in perlre, I'm not seeing /o described, but it is still in the language: perlop documents it, and it only matters when the pattern interpolates variables. This pattern has none, so /o does nothing here.

/g means global. Normally, a substitution works only on the first thing it matches, but this flag means it works on everything that matches.
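A quick sketch of the difference, using a trimmed-down version of the same character class. The helper names replace_first and replace_all are made up for this illustration.

```perl
use strict;
use warnings;

# Without /g: only the first run of unwanted characters is replaced.
sub replace_first {
    my ($s) = @_;
    $s =~ s/[^\w.]+/_/;
    return $s;
}

# With /g: every run of unwanted characters is replaced.
sub replace_all {
    my ($s) = @_;
    $s =~ s/[^\w.]+/_/g;
    return $s;
}

print replace_first('a b c'), "\n";    # prints "a_b c"
print replace_all('a b c'),   "\n";    # prints "a_b_c"
```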


/x extends formatting, allowing you to separate and document your regular expression with whitespace and comments. I clearly didn't do that. If I had documented this code, I wouldn't be blogging it now. /m makes ^ and $ match at the start and end of each line within a multi-line string, instead of only at the start and end of the whole string. This code runs on tab-delimited data, line-by-line, and the pattern has no anchors (the ^ in it is a negator inside the brackets), so /m changes nothing here. Damian Conway suggests in Perl Best Practices that you always use these, even when you don't need them. Randal Schwartz told me on Google Plus that this is tantamount to cargo-cult programming. (I don't believe I have done terrible damage in my attempts to present their positions. I also don't believe I have standing to argue on one side or the other. For further details, talk to them.)
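For the record, this is roughly what using /x for its intended purpose could have looked like, along with a small check of what /m does to the ^ anchor. It is a sketch, not what is in my code base: the helper names sanitize and count_line_starts are invented here, and [^\w.] is my shorthand for the original class with the redundant parts dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: the same substitution, spread out and commented under /x.
sub sanitize {
    my ($field) = @_;
    $field =~ s{
        [^\w.]+    # a run of characters that are neither word characters nor periods
    }{_}gx;        # /g: replace every run; /x: permit the layout and comment above
    return $field;
}

# Hypothetical helper: counts how many times /^\w+/ matches, with or without /m.
sub count_line_starts {
    my ($text, $use_m) = @_;
    my @hits = $use_m ? ( $text =~ /^\w+/mg ) : ( $text =~ /^\w+/g );
    return scalar @hits;
}

print sanitize('my file (v2).txt'), "\n";          # prints "my_file_v2_.txt"
print count_line_starts("one\ntwo\n", 0), "\n";    # prints 1: ^ anchors only at the string start
print count_line_starts("one\ntwo\n", 1), "\n";    # prints 2: ^ also anchors after the newline
```

Whitespace inside the square brackets is still significant under /x, so the layout goes around the class, not inside it.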

So, this code takes everything that isn't in the small set of characters and replaces runs of them with a single underbar. Another redundancy is that underbars are word characters. m{[\w]} matches '_' just as m{[\w_]} does. It would be a better regex if underbars were not in the set of characters it leaves alone, so that a string like '%^&_%*(&' would become '_' not '___'. This would be a better regex:

        $array[ 0 ] =~ s/[^A-Za-z0-9\.]+/_/gmx ;
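To check that claim, here is a small side-by-side sketch of the two substitutions. The wrapper names old_clean and new_clean are made up for this comparison.

```perl
use strict;
use warnings;

# The original substitution: underbars are in the set that is left alone.
sub old_clean {
    my ($s) = @_;
    $s =~ s/[^\w\d\._]+/_/gmx;
    return $s;
}

# The revised substitution: only ASCII letters, digits and periods are left alone.
sub new_clean {
    my ($s) = @_;
    $s =~ s/[^A-Za-z0-9\.]+/_/gmx;
    return $s;
}

print old_clean('%^&_%*(&'), "\n";    # prints "___"
print new_clean('%^&_%*(&'), "\n";    # prints "_"
```

One thing to keep in mind: the two are not equivalent beyond the underbar. \w can match non-ASCII letters in decoded strings, while A-Za-z0-9 never will, so the revised version is stricter on that kind of input.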

And it would be better still if I documented it. So, this is clearly a case where I had a problem, brought in regexes, and now have two problems. Thank you, JWZ.