Recursive Regular Expressions

2010-03-24 2 min read bash Fedora Linux

<img src="http://blog.amit-agarwal.co.in/wp-content/uploads/2010/08/yo-dawg-regex.jpg" alt="Yo dawg, I heard you liked regular expressions, so I put a regex in your regex so you can match while you match!" align="bottom" /> The <a class="zem_slink freebase/en/regular_expression" title="Regular expression" rel="wikipedia" href="http://en.wikipedia.org/wiki/Regular_expression">regular expressions we use in our daily lives are actually not that “regular.” Most of the languages support some kind of extended regular expressions that are computationally more powerful than the “<a href="http://en.wikipedia.org/wiki/Regular_expression">regular” regular expressions as defined by the <a href="http://en.wikipedia.org/wiki/Formal_language">formal language theory.

For instance, the so often used <a href="http://perldoc.perl.org/perlre.html#Capture-buffers">capture buffers add <a class="zem_slink freebase/en/data_storage_device" title="Data storage device" rel="wikipedia" href="http://en.wikipedia.org/wiki/Data_storage_device">auxiliary storage to the regular expressions that allow them to match an arbitrary pattern repeatedly. Or <a href="http://perldoc.perl.org/perlre.html#Look-Around-Assertions">look-ahead assertions that allow the regular expression engine to peek ahead before it making a decision.

The <a class="zem_slink freebase/en/perl" title="Perl" rel="homepage" href="http://www.perl.org/">Perl <a class="zem_slink freebase/en/programming_language" title="Programming language" rel="wikipedia" href="http://en.wikipedia.org/wiki/Programming_language">programming language has an especially rich with regex engine. One of the engine’s features is the lazy regular subexpressions. The lazy regular subexpressions are expressed as (??{ <a class="zem_slink freebase/en/computer_programming" title="Computer programming" rel="wikipedia" href="http://en.wikipedia.org/wiki/Computer_programming">code }), where the “code” is arbitrary Perl code that gets executed when the moment this subexpression may match.

This allows us to construct something really interesting – we can define a regular expression that has itself in the “code” part. The result is a recursive regular expression!

One of the classical problems that a regular expression can’t match is the language n1n, i.e., a <a class="zem_slink freebase/en/string" title="String (computer science)" rel="wikipedia" href="http://en.wikipedia.org/wiki/String_%28computer_science%29">string with a number of zeroes followed by an equal number of ones. Surprisingly, using the lazy regular subexpressions this problem becomes tractable!

Here is a Perl regular expression that matches n1n:

my $regex = qr/0(??{$regex})*1/;

This regular expression matches a followed by itself zero or more times, followed by a one. If the itself part doesn’t match, then the string this regular expression matches is 01. If the itself part matches once, the string this regular expression matches is 0011. If it matches two times, the string is 000111, …, etc.

Read More: <a href="http://feedproxy.google.com/~r/catonmat/~3/eTX_ascV6Qg/">http://feedproxy.google.com/~r/catonmat/~3/eTX_ascV6Qg/<div class="zemanta-pixie"><a class="zemanta-pixie-a" title="Reblog this post [with Zemanta]" href="http://reblog.zemanta.com/zemified/25d355fb-fb6e-47f8-9b66-30215049a198/"><img class="zemanta-pixie-img" src="http://blog.amit-agarwal.co.in/wp-content/uploads/2010/08/reblog_b36.png" alt="Reblog this post [with Zemanta]" /><span class="zem-script more-related more-info pretty-attribution paragraph-reblog">

comments powered by Disqus