Nested Regular Expressions explained

This article was originally written for The Code Project and can be found at the location http://www.codeproject.com/useritems/Nested_RegEx_explained.asp. Therefore it is written in (a more or less well formed) English.

Introduction

A cool feature of the .NET RegEx-engine is the ability to match nested constructions, for example nested parenthesis. I will describe this feature somewhat in depth in this article. If you are an experienced RegEx developer, please feel free to go forward to the part: The push-down automata.

Background

Regular expressions are a very cool feature for pattern recognition in strings. Mathematically speaking regular expressions are parsed through a finite state machine. As the name implies such a machine has only a finite number of states, and it has no external memory attached. This means, that the finite state machine cannot match nested constructions such as: “(9 + (5 + 3))”. Or a more specific task:

Tell me if and where the string str has a number of correct nested parenthesis.

Though, if the finite state machine is supplied with an external stack, the mathematics state, that this would be possible – and that’s what has happened in the .NET RegEx engine. Through smart use of external stacks it is now possible to match nested constructions. So let’s see how it gets the job done.

The beginning: Captures

(Feel free to skip this part if you already have a solid knowledge of regular expressions).

In a regular expression you can always use parenthesis to capture a specific part of the recognized pattern. A quick example:

Source text
mymail@mycompany.com

Regular expression
([^@]+)@([^.]+)\.(\w{2,4})

This is a (too) simple expression which attempts to capture a mail address. Let’s run through the parenthesis:

1. ( [^@]+ )

The first parenthesis matches any number of characters which is not a @-symbol. The expression is encapsulated in a parenthesis, and therefore the RegEx engine will capture this part of the source into a group numbered 1. This group can be accessed at runtime like this:

 

String pattern = @"([^@]+)@([^.]+)\.(\w{2,4})"; 

String source = "mymail@mycompany.com"; 

Match match = Regex.Match(pattern, source); 

match.Groups[1].Value; // returns "mymail" 

match.Groups[1].Index; // returns 0; 

2. ( [^.]+ )

Likewise, after the @ in the email address the above expression will capture any text until it reaches a period. This will be captured in to group number 2.

3. ( \w{2,4} )

Finally this will capture the rest of the expression – the top level domain – into group number 3.

The fundamentals: Named captures

Even though this is not a new feature, named captures are fundamental for understanding the nested constructions in .NET regular expressions.

In the previous chapter parenthesis were used to capture a part of the source into a group. The groups were named with successive integers beginning with 1 (by convention Groups[0] captures the whole match).

But it is also possible to capture parts of the source into named groups, with the following simple syntax:

(?<user>[^@]+) @ (?<company>[^.]+) \. (?<top_level_domain>\w{2,4})

In the code the groups can now be accessed like this:

 

String pattern = @"([^@]+)@([^.]+)\.(\w{2,4})"; 

String source = "mymail@mycompany.com"; 

Match match = Regex.Match(pattern, source); 

// returns "mymail": 

match.Groups["user"].Value; 

// returns "mycompany": 

match.Groups["company"].Value 

// returns "com": 

match.Groups["top_level_domain"].Value 

This is not a new part of a regular expression engine – you would find the same to exist with almost the same syntax in languages like Python and PHP.

But the .NET regular expression engine uses the principle of named capturing to provide features for the developer to create a regular expression which is capable of matching nested constructions. Now, let’s check that out!

The push-down automata

As stated in the beginning of this article a finite state machine is not capable of matching nested constructions. This is actually the reason that Chomsky in the 1960’s argued that a finite state machine cannot recognize a natural language such as English.

A push-down automata is a finite state machine with an external memory – a stack – attached. This framework provides a basis for understanding how the .NET RegEx engine is capable of matching nested constructions. Let’s see some code.

Now, let’s say that the task is to match nested <div>‘s in HTML code.

 

String pattern = @" 

(?# line 01) <div> 

(?# line 02) (?> 

(?# line 03) <div> (?<DEPTH>) 

(?# line 04) | 

(?# line 05) </div> (?<-DEPTH>) 

(?# line 06) | 

(?# line 07) .? 

(?# line 08) )* 

(?# line 09) (?(DEPTH)(?!)) 

(?# line 10) </div> 

";source = "<div>asdf<div>asdf</div>asdffds</div>"; 

Match match = Regex.Match(source, pattern, 

  RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase); 

Console.WriteLine(match.Success.ToString()); 

First (line 1) this expression matches an opening div. This is mirrored by an ending closing tag </div> in line 10.

In line 2 the first group begins. It is an unnamed atomic group. Atomic means roughly that once something is matched, the RegEx-engine won’t give it up again. This might sound a bit strange at first, but it can be important to performance. To keep focus in this article I won’t elaborate further on it. But notice that this parenthesis is ended in line 8 with a * meaning: repeat as many times as possible.

Now the important stuff begins. Line 3 to 7 is an alternation with three possibilities. Either it should match a <div> or a </div> or a single character .?. When alternation occurs, the .NET RegEx engine will tryout the matches one at a time accepting the first match – even if this is not the longest match. (This is as far as I know opposed to a deterministic finite automaton).

When the RegEx engine matches a <div> (line 3), it is at the same time told to match something else: (?<DEPTH>). Remember the stuff about named capturing? This actually is a capture with the name DEPTH. It is empty, though! So while the group doesn’t actually capture anything from the source, it does something else, which is very important – it creates a new stack, let’s pretend it’s called DEPTH, and it puts the capture on that stack!

This means that we now have a new stack called DEPTH with one element on it containing an empty string. This is quite interesting – so let’s dig a bit deeper into it.

Check this out:

 

( 

(?<A>a) 

| 

(?<B>b) 

)*

Here we’ve also used named capturing, but we don’t just capture an empty string, we are capturing the letter a onto the stack A and the letter b onto the stack B. As seen before we can request this using the code: match.Groups["A"].Value;.

Now, let’s change the code a bit:

 

( 

(?<A>a) 

| 

(?<A>b) 

)*

Now both groups are named A. What happens this time? If we take this source: ab, the RegEx engine will first match the a. By doing this it creates a new stack called A and it pushes the a on the stack. Next it matches the b. It already has created a stack named A, so it just pushes a new element on the stack with the capture b. So now our stack looks like this:

 

Stack 'A' 

________ 

|      | 

|   b  | 

|   a  | 

|______| 

If you uses the same code to request what’s captured in group A (match.Groups["A"].Value) you would get the string “b” – the Groups object simply peeks the top element on the stack.

This stack invention is quite cool – but what’s even cooler is, that is let’s you pop elements off the stack inside the regex pattern, which is what happens in this example:

( 

(?<A>a) 

| 

(?<A>b) (?<-A>) 

)*

Again, consider the input string ab. First the RegEx engine matches the a, creates a new stack and pushes an a on it. Next, the engine matches a b and pushes it on the stack. Now the syntax changes: (?<-A>). This empty capturing parenthesis actually pops the top element off the A stack. To ensure this actually happens try the code once again: match.Groups["A"].Value. Now this code returns the string “a” even though the last character matched was “b”. The reason is, that the b was popped off the stack immediately after it was pushed on. So the stack contains only one element, i.e. the a.

Okay – back to the DIV’s in our main example.

Whenever the RegEx engine matches a <div> it pushes an empty string on the stack. On the other hand: whenever it matches a </div> it pops the stack. Doing this – the stack would end up empty if and only if the RegEx engine discovers a correct nested construction of DIV’s.

This is what the final expression is testing (line 9):

(?(DEPTH)(?!)) 

This is actually a conditonal test. A conditional test is of the well known form if-then-else in the syntax:

(? (if) then | else)

The if-part tests if the named group was matched and stored. If it was matched the then-part of the expression is applied, else the else-part is applied. Note that the else-part is optional.

What the last expression (?(DEPTH) (?!)) does then, is that it tests if the named group DEPTH was captured and stored. If it was, the then-part is applied (?!) forcing a negative lookahead with no expression, which will always fail. Hence, the whole expression fails if the number of (?<DEPTH>) doesn’t match the number of (?<-DEPTH>) applied in a correct nested order. You can test this last part your-self with an expression like this:

Given the source aba this expression will not succeed. First it matches the a and pushes it on the stack, then the b (without pushing) and pops the stack – now the stack is empty. Then it checks to see if the stack has been matched and stored – it hasn’t, so it will try to match a b where the source has an a. On the other hand, the source abb will succeed with a match.

Points of Interest

The invention of an auxiliary stack in the .NET RegEx engine has made it possible to match nested constructions and keep track of the matches bringing even more power to regular expressions. And it let’s you push, pop and to some extend peek the stack from within the RegEx engine.

Also, it is possible to create multiple stacks with different names keeping track of more complex expressions.

Morten Maate

About these ads

10 responses to “Nested Regular Expressions explained

  1. Pingback: Regex balancing group in depth « Linguistic forms

  2. Wow… this is the article I am looking for a long time. Very easy and descriptive explanation of tough topic.

    Really… bravo..

  3. Hi,
    A must read article for beginners. I tried your regular exp. in c# & it returns False.
    string source = @”(?>(?)|(?)|.?)*(?(DEPTH)(?!)) “;
    string pattern = “asdfasdfasdffds”;
    Match match = Regex.Match(source, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    Is my source variable correct?

    I am looking for a regex to extract nested if conditions like above.
    I have posted it here.

    http://stackoverflow.com/questions/7596861/regex-to-find-inner-if-conditions

    Can you pls help?

    Regards
    Uma Ilango

  4. Hi, thanks for your comment. Sorry about my late answer. I can see from the stack overflow thread, that your question has been answered. Please feel free to ask another time :-)
    Br. Morten

  5. Hi man, I understand I’m way late but I’m still hoping for an answer :P

    I’m trying this in VB.NET:
    Dim Regexp As String = “(?>a(?)|b(?))*(?(STACK)(?!))”
    ‘Create pattern

    Dim Regex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(Pattern, &H0)
    ‘Create Regex. &H0 is flags.none

    MsgBox(Regex.IsMatch(“aba”))
    ‘Compute boolean if matched.

    This should push an empty string into when it finds the letter “a” and pop it when it finds the letter “b” (Right?)

    This should then mean that the match should fail because there will be one item in when the parsing is done (Right?), but I always get True.

    Any idea why?

  6. Oh damn, it filters tags, Here’s my code:

    http://pastebin.com/EkqHBTck

    • Hi Mathias,

      Even though it’s just a’s and b’s it’s actually a quite complex example :-)

      To start with, try output your value, then it will be more apparent what is happening and easier to debug :-)

      If we take a look at the input example “aba”, then your regex will match “ab” which is as expected. This is fine since “ab” consits of only one ‘a’ and one ‘b’ and is a part of your string.

      If we take a look at another example, “q”, then the regex will match “” which is also fine. This is fine since your main parenthesis ends with * which will allow 0 captures. This is basically the same as the empty pattern: “”. So if you change it to +, then “q” will now fail.

      So how to change this behaviour will depend on what you are actually trying to accomplish? Do you want to match string with equal counts of a’s and b’s, or do they need to occur in some specific order, for example “ababab” or “aaabbb” etc. As I said in the beginning, this example can be quite tricky with regex:-)

      Br. Morten

  7. is it possible to match nested parttern e.g

    2a+(3a-4)+(2b(3a-2)))
    should output
    this
    (3a-4) ,(3a-2) and (2b(3a-2))
    or
    (3a-4) and (2b(3a-2))

    • Hi Oghenez,

      Yes, this should be possible with .NET Regular Expressions. My second blog post describes this scenario: http://retkomma.wordpress.com/2009/09/26/regex-balancing-group-in-depth/

      Please checkout the regex here: https://gist.github.com/1476622.

      Given the input “2a+(3a-4)+(2b(3a-2))” this outputs:
      3a-4
      3a-2
      2b(3a-2)

      I’ve added the beginning ^ and ending $ in order to enforce more strict parsing. E.g. “2a+(3a-4)+(2b(3a-2)))” would not succeed due to the extra parenthesis in the end. You can remove these symbols if you have less strict requirements.

      For complex parsing like this, depending on your scenario you might want to consider something else than regular expressions :-)

      Br. Morten

  8. Pingback: 02.8* Регулярни изрази | Plamen Stanev

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s