Crucial Concepts Behind Advanced Regular Expressions

Regular expressions (or regex) are a powerful way to traverse large strings in order to find information. They rely on underlying patterns in a string’s structure to work their magic. Unfortunately, simple regular expressions are unable to cope with complex patterns and symbols. To deal with this dilemma, you can use advanced regular expressions.

Below, we present an introduction to advanced regular expressions, with eight commonly used concepts and examples. Each example outlines a simple way to match patterns in complex strings. If you do not yet have experience with basic regular expressions, have a look at this article to get started. The syntax used here matches PHP’s Perl-compatible regular expressions.

1. Greediness/Laziness

By default, all regex repetition operators are greedy. They try to match as much as possible in a string. Unfortunately, this is not always the desired effect. Lazy operators solve this problem. They match as little as possible and are written by adding a ‘?’ after the respective greedy operator. Alternatively, the ‘U’ modifier inverts the default, making all repetition operators lazy unless they are followed by a ‘?’. Differentiating between greediness and laziness is key to fully understanding advanced regular expressions.

Greedy Operators

The * operator matches the previous expression 0 or more times. It is a greedy operator. Consider the following expression:

preg_match( '/<h1>.*<\/h1>/', '<h1>This is a heading.</h1><h1>This is another one.</h1>', $matches );

Recall that a . means any character except a new line. The above regular expression is looking for an h1 tag and all of its contents. It uses the . and * operators to constantly match anything inside the tag. This pattern will match:

<h1>This is a heading.</h1><h1>This is another one.</h1>

It returns the whole string. The * operator will continuously match everything — even the middle closing h1 tag — because it is greedy. Matching the whole string is the best it can do.

Lazy Operators

Let’s change the above operator by adding a ‘?’ after it. This will make it lazy:

/<h1>.*?<\/h1>/

The regex now fulfills its duty and matches only the first h1 tag. Another greedy operator with the same property is {n,}. It matches the previous expression n or more times. Without a question mark, it looks for the most repetitions possible. With one, it starts from n repetitions and adds more only if the rest of the pattern requires it:

# Set up a String
$str = 'hihi';

# Match it using the greedy {n,} operator
preg_match( '/(hi){1,}/', $str, $matches ); # matches[0] will be 'hihi'

# Match it with the lazy {n,}? operator
preg_match( '/(hi){1,}?/', $str, $matches ); # matches[0] will be 'hi'
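
To see the difference in one place, here is a minimal sketch that runs the greedy, lazy, and ‘U’-modified patterns against the same two headings. The variable names are illustrative and not part of the original examples.

```php
<?php
# Two headings on a single line, so that . can reach across both
$html = '<h1>This is a heading.</h1><h1>This is another one.</h1>';

# Greedy: a single match that swallows both headings
preg_match_all( '/<h1>.*<\/h1>/', $html, $greedy );

# Lazy: stops at the first closing tag, so each heading matches separately
preg_match_all( '/<h1>.*?<\/h1>/', $html, $lazy );

# The U modifier makes the plain * behave lazily
preg_match_all( '/<h1>.*<\/h1>/U', $html, $ungreedy );

echo count( $greedy[0] ), "\n";   # 1
echo count( $lazy[0] ), "\n";     # 2
echo count( $ungreedy[0] ), "\n"; # 2
```

The lazy form is usually the easier one to read, because the laziness is visible right next to the operator it affects.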

2. Back Referencing

What it does

Back referencing is a way to refer to previously matched patterns inside a regular expression. For example, take a look at this simple regex that matches an expression in quotes:

# Set up an array of matches
$matches = array();

# Create a String
$str = "\"This is a 'string'\"";

# Traverse it with regular expressions
preg_match( "/(\"|').*?(\"|')/", $str, $matches );

# Print the whole match
echo  $matches[0];

Unfortunately, this will not correctly match the string. Instead, it will print:

"This is a '

This regular expression matches the opening double quote but accepts a different type of quote to close it, because it was given the option of picking a double or single quote at the end. In order to fix this, you can use back referencing. The expressions \1, \2, …, \9 hold references to previously captured subpatterns. The first matched quote, in this case, will be held by \1.

How to Use It

In order to apply this concept to the aforementioned example, use \1 in place of the last quote:

preg_match( '/("|\').*?\1/', $str, $matches );

This will now correctly return:

"This is a 'string'"

Remember that back referencing may also be used by preg_replace. Note that in the replacement string, instead of \1 … \9, you should use $1 … $9 … $n (any number of these will work). For example, if you want to replace all paragraph tags with text that represents them, use:

$text = preg_replace( '/<p>(.*?)<\/p>/',
"&lt;p&gt;$1&lt;/p&gt;", $html );

The $1 back reference holds the text inside the paragraph and is being used in the replace pattern itself. This completely valid expression shows an easy way to access matched patterns even while replacing.
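
The two uses of back references can be sketched side by side. The sample strings and the date pattern below are made up for illustration; only the quote pattern comes from the text above.

```php
<?php
# Pattern side: \1 must repeat whatever quote character group 1 captured
$str = "\"This is a 'string'\" and 'another \"one\"'";
preg_match_all( '/("|\').*?\1/', $str, $matches );
print_r( $matches[0] ); # the double-quoted part, then the single-quoted part

# Replacement side: $1, $2 and $3 refer to the captured groups
$date = preg_replace( '/(\d{4})-(\d{2})-(\d{2})/', '$3/$2/$1', 'Posted on 2009-05-07' );
echo $date, "\n"; # Posted on 07/05/2009
```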

3. Named Groups

When using multiple back references, a regular expression can quickly become confusing and hard to understand. An alternative way to back reference is by using named groups. A named group is specified by using (?P<name>pattern), where name is the name of the group and pattern is the regular expression in the group itself. The group can then be referred to by (?P=name). For example, consider the following:

/(?P<quote>"|').*?(?P=quote)/

The above expression will create the same effect as the previous back reference example, but by instead using named groups. It is also significantly easier to read.

Named groups are also useful when sifting through the array of matches. The name given to a specific pattern is also the key of the corresponding matches array.

preg_match( '/(?P<quote>"|\')/', "'String'", $matches );

# This will print "'"
echo $matches[1];

# This will also print "'", as it is a named group
echo $matches['quote'];

Thus, named groups not only make code easier to read but also organize it.
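
Named groups pay off most when a pattern has several captures. The following is a small sketch with a hypothetical date pattern; the group names year, month, and day are chosen for this example only.

```php
<?php
# Each part of the date gets a name instead of a position
$pattern = '/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/';
preg_match( $pattern, 'May 7 is 2009-05-07 in ISO form', $m );

# Access by name...
echo $m['year'], ' ', $m['month'], ' ', $m['day'], "\n"; # 2009 05 07

# ...or by position, which still works
echo $m[1], "\n"; # 2009
```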

4. Word Boundaries

Word boundaries are places in a string that come between a word character and a non-word character. What makes these boundaries special is that they don’t actually match a character. Their length is zero. The \b regular expression matches any word boundary.

Unfortunately, boundaries are so often skimmed over that many do not recognize their real significance. For example, let’s say you want to match the word “import”:

/import/

Watch out! Regular expressions can be tricky. The above expression will also match:

important

You may think it is as simple as adding a space before and after import to prevent these bogus matches:

/ import /

But what about this case?

The trader voted for the import

When import is at the beginning or the end of a string, the modified regex will fail. Thus, splitting this up into cases is required:

/(^import | import | import$)/i

Looking back at our regular expression, it does not take periods or other punctuation into account. Just to match this single word, a regular expression may look like this:

/(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|!)?$)/i

That’s a lot of code to match just a single word. This is why word boundaries are so significant. To accomplish the above statement and many other variations with word boundaries, all that is necessary is:

/\bimport\b/

This will match every case above and more. \b’s flexibility comes from the fact that it matches a zero-length string. All it matches is an imaginary position between two characters. It checks whether one of the characters is a non-word character and the other is a word character. If so, it matches. If the beginning or end of a string is encountered, \b treats it as a non-word character. Because the i in import is a word character, the boundary matches and so does import.

Note that the opposite of \b is \B. This operator matches the position between two word characters or two non-word characters. Thus, if you would like to match ‘hi’ inside another word, you could use:

/\Bhi\B/
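
As a quick check of the behavior described above, this sketch counts matches with and without word boundaries. The sample sentence is invented for the example.

```php
<?php
$text = 'Import duties are important. The trader voted for the import';

# Without boundaries, the "import" inside "important" is counted too
preg_match_all( '/import/i', $text, $loose );

# With boundaries, only the standalone word matches, even at the string edges
preg_match_all( '/\bimport\b/i', $text, $strict );

echo count( $loose[0] ), "\n";  # 3
echo count( $strict[0] ), "\n"; # 2

# \B requires word characters on both sides of 'hi'
echo preg_match( '/\Bhi\B/', 'this' ), "\n"; # 1
echo preg_match( '/\Bhi\B/', 'hi' ), "\n";   # 0
```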

5. Atomic Groups

Atomic groups are special regex groups that are non-capturing. They are usually used to increase the efficiency of a regular expression, but may also be applied to eliminate certain matches. An atomic group is specified by using (?>pattern):

/(?>his|this)/

When the regex engine exits an atomic group, it discards the backtracking positions remembered by all tokens inside it. Consider the word ‘smashing’. Using the above regular expression, the regex engine will first try to match the pattern ‘his’ in ‘smashing’. It will not find a match, so it goes on to try ‘this’, which fails as well. The atomic group changes nothing here, because it only takes effect once one of its alternatives has matched. From that point on, the engine will not come back to try the remaining alternatives if a later part of the pattern fails.

The above example did not have many practical uses. We might as well have used /t?his/ instead. Look at the following:

/\b(engineer|engrave|end)\b/

If the regex engine is given the word ‘engineering’, it will correctly match ‘engineer’. The next word boundary, \b, will not match. Thus, the engine will backtrack and move on to the next alternative: engrave. It finds that ‘eng’ matches, but the rest does not. Finally, ‘end’ is attempted and also fails. If you look carefully, you will realize that once the engine matches ‘engineer’ and fails the last word boundary, it cannot possibly match ‘engrave’ or ‘end’. The three alternatives are mutually exclusive at any given position, so the regex engine should not continue with the other trials.

/\b(?>engineer|engrave|end)\b/

The above is a much better alternative that will save the regex engine time and improve the code’s efficiency.
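
Atomic groups can also change what a pattern matches, not only how fast it fails. The sketch below uses a hypothetical pair of alternatives, end and ending, to show the group refusing to backtrack once its first alternative has matched.

```php
<?php
$plain  = '/\b(end|ending)\b/';
$atomic = '/\b(?>end|ending)\b/';

# Plain group: 'end' matches, \b fails, the engine backtracks and tries 'ending'
echo preg_match( $plain, 'ending' ), "\n";  # 1

# Atomic group: 'end' matches, \b fails, and 'ending' is never tried
echo preg_match( $atomic, 'ending' ), "\n"; # 0

# The atomic version still matches the standalone word
echo preg_match( $atomic, 'the end' ), "\n"; # 1

# The pattern from the text rejects 'engineering' without retrying alternatives
echo preg_match( '/\b(?>engineer|engrave|end)\b/', 'engineering' ), "\n"; # 0
```

Because of this, the order of alternatives inside an atomic group matters: a shorter alternative listed first can hide a longer one.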

6. Recursion

Recursion in regular expressions can be used to match nested constructs, such as parentheses, (this (that)), and HTML tags, <div></div>. It requires the use of (?R), an operator that recurses into the whole pattern. Consider the regular expression that matches nested parentheses:

/\(((?>[^()]+)|(?R))*\)/

The outermost escaped parentheses in this regular expression match the opening and closing parentheses of the nested construct. Between them comes an alternation, which can either match non-parenthetical characters, (?>[^()]+), or the whole expression again in a sub-pattern, (?R). Notice that this group is repeated as many times as possible to match all nested parentheses.
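
A minimal sketch of the nested-parentheses pattern in use, with the literal parentheses escaped. The input strings are invented for illustration.

```php
<?php
# \( and \) are literal parentheses; (?R) re-enters the whole pattern
$pattern = '/\(((?>[^()]+)|(?R))*\)/';

# The match covers the outer pair together with everything nested inside it
preg_match( $pattern, 'call (this (that)) now', $m );
echo $m[0], "\n"; # (this (that))

# An unbalanced opening parenthesis is not matched at all
echo preg_match( $pattern, 'no closing (paren' ), "\n"; # 0
```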

Another example of recursion at work is the following:

/<([\w]+).*?>((?>[^<>]+)|((?R)))*<\/\1>/

The above expression combines character classes, lazy operators, back referencing, and atomic groups to match nested tags. The first parenthesized group, ([\w]+), matches the tag name for use later in the regular expression. The pattern then proceeds to match the rest of the opening tag. The next parenthesized sub-expression is very similar to the one above. It either matches non-tag characters, (?>[^<>]+), or recurses over another tag, (?R). Finally, the last part of the expression, <\/\1>, matches the closing tag.

7. Callbacks

Certain matches in a pattern may require special modifications. In order to apply multiple or complex changes, callbacks can be used. A callback computes the substitution string dynamically in the preg_replace_callback function, which takes a function as a parameter and calls it whenever a match is found. This function receives the match array as a parameter and returns a modified string that is used as the replacement.

As an example, consider a regular expression that capitalizes the first letter of every word in a given string. Unfortunately, PHP does not have a regex operator that changes a character to a different case. To accomplish this task, a callback may be used. First, the expression must match all letters that need to be capitalized:

/\b\w/

The above uses both word boundaries and character classes to work. Now that we have this expression, we can write a callback function:

function upper_case( $matches ) {
	return strtoupper( $matches[0] );
}

upper_case takes in an array of matches and returns the whole matched pattern in uppercase. $matches[0], in this case, represents the letter that needs to be capitalized. All of this can now be put together using the preg_replace_callback function:

preg_replace_callback( '/\b\w/', "upper_case", $str );

That is the power of a simple callback.
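
Put together as a runnable sketch, the example looks as follows. The sample strings are invented, and the second call with an anonymous function is an additional illustration that is not part of the original example.

```php
<?php
# Callback from the text: receives the match array, returns the replacement
function upper_case( $matches ) {
	return strtoupper( $matches[0] );
}

$str = 'regular expressions are fun';
$capitalized = preg_replace_callback( '/\b\w/', 'upper_case', $str );
echo $capitalized, "\n"; # Regular Expressions Are Fun

# An anonymous function works the same way; here every number is doubled
$doubled = preg_replace_callback(
	'/\d+/',
	function ( $m ) {
		return (string) ( (int) $m[0] * 2 );
	},
	'3 apples and 10 pears'
);
echo $doubled, "\n"; # 6 apples and 20 pears
```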

8. Commenting

Commenting is not a way to actually match strings, but it is one of the most important parts of regular expressions. As you dive deep into larger, more complex expressions, it becomes hard to decipher what is actually being matched. Using comments in the middle of regular expressions is the perfect way to minimize such confusion.

To place a comment inside a regular expression, use the (?#comment) format. Replace “comment” with the word(s) of your choice:

/(?#digit)\d/

It is especially important to comment regular expressions that you release to the public. Users of your regex will be able to easily understand and modify the pattern to meet their needs. It can even go so far as to help you decode it when revisiting a program.

Consider using the “x” or (?x) modifier for free-spacing mode with comments. This causes a regular expression to ignore white space between tokens and to treat everything from an unescaped # to the end of the line as a comment. Spaces can still be represented with [ ] or \ (a backslash and a space):

/
\d    #digit
[ ]   #space
\w+   #word
/x

The above is the same as:

/\d(?#digit)[ ](?#space)\w+(?#word)/
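
In PHP, a free-spacing pattern fits naturally into a multi-line string. The following is a small sketch; the subject string is made up for the example.

```php
<?php
# With the x modifier, line breaks, indentation and # comments are ignored
$pattern = '/
    \d    # digit
    [ ]   # space
    \w+   # word
/x';

preg_match( $pattern, 'Room 7 east wing', $m );
echo $m[0], "\n"; # 7 east
```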

Always create well-documented code.

Karthik Viswanathan is a high-school student who loves to program and create websites. You can view Karthik's work on his blog, Lateral Code, and explore the most popular articles on the Web through his online Twitter application.

  1. 1

    Sjoerd Maessen

    May 7, 2009 6:23 am

    Great tutorial!

    0
  2. 2

    Good work. The writing isn’t subjective and it’s a great baseline that beginning programmers should be able to grasp. Enough detail to give the idea, but not so much that I lose interest.

    I would love to see some of the design / photo / inspiration articles follow this direction too.

    0
  3. 3

    Wow! Love the tutorial. If I’m not wrong, a lot of people here needed a good in-depth article on regex – a critical but hated part of programming. I know I definitely needed it. Thanks!

    0
  4. 4

    [ ] #space ?
    Heard of \s ?

    1
  5. 5

    wow! I love this geek article! I didn’t know some of these advanced tricks… by the way, there’s an error on the first lazy operator example (the strong tag)

    (SM) Thank you, fixed!

    0
  6. 6

    Nice Article Karthik… cant expect this much from an high school student. Nice reading… Thanks for sharing.

    DKumar M.
    @instantshift.com

    0
  7. 7

    Using the above regular expression, the regex engine will first try to match the pattern ‘hi’ in ’smashing’. It will not find a match.

    “hi” is in smashing.

    0
  8. 8

    Javier Albinarrate

    May 7, 2009 8:19 am

    There is a very minor error in “Lazy Operators”
    The example should be

    /<h1>.*?</h1>/

    Instead of

    /<h1>.*<STRONG>?</STRONG></h1>/

    Regards!
    Javier

    (SM) Thank you, fixed!

    0
  9. 9

    Floris Fiedeldij Dop

    May 7, 2009 8:24 am

    Oh pretty sweet, finally I can up the level of regex knowledge that I have. Maybe I will finally grasp this a bit better.

    0
  10. 10

    fantastic! Thanks for the info. It would have been nice to see which concepts are supported by which programming languages. I use asp.net & most of the concepts appear to be basically the same, with a notable exception of Named Groups (though I need to do more research to be certain). Also, one of the concepts (Atomic Groups) was so foreign to me that additional samples would have been helpful. I am not certain that I understand that one at all. Will have to do some more research.

    That said, this was a great article & will be going into my reference library.
    Thanks!

    0
  11. 11

    great tut thanks

    0
  12. 12

    Nicolas Elizaga

    May 7, 2009 9:35 am

    Awesome. Grant Skinner of Flash infamy also made an AIR application to simplify creating Regular Expressions:

    http://gskinner.com/RegExr/desktop/

    And the browser version:
    http://gskinner.com/RegExr/

    1
  13. 13

    Daniel Einspanjer

    May 7, 2009 10:05 am

    Good article, but a tiny bit of editing would help.

    You tried to use a “strong” tag to bold the question mark in your first example of a lazy quantifier. Since the code was in a preformatted code block, the strong tag obfuscates your example.

    In your first example of atomic groups, you use the regex (?>hi|this) with the string smashing. The part you overlooked is that hi actually does match the word smashing. You should either use a different test word, or use something like “his” instead of “hi” for your first regex alternate so that the example works as expected.

    0
  14. 14

    Thanks for including #8. Nothing is worse than running into a 2-line regex full of gibberish with no comments when you’re maintaining code.

    0
  15. 15

    I thought I knew everything about regular expressions until I read this article!

    0
  16. 16

    Keep in mind that the patterns do not work on Unicode characters by default–at least in JavaScript’s regular expressions.
    Thus a pattern like /\b\w/ won’t match a whole word if it contains non ASCII characters. For example the German word “Grüße” is not matched.

    0
  17. 17

    Awesome article, always had a thing for regex so It’s nice that someone has finally typed it out.

    0
  18. 18

    Instead of preg_replace_callback, you can use the ‘e’ operator.

    For instance,
    $str = preg_replace('/(blah)/ie', "'_'.strtolower('\\1').'_'", 'AOEUBLAHAOEU');

    Will leave $str with the value of ‘AOEU_blah_AOEU’
    Use it wisely. Always escape your strings…

    0
  19. 19

    Thank You very Much SM.

    0
  20. 20

    Great article, loved it and could learn a lot from it!

    0
  21. 21

    very helpful and nice written tut, thank u!

    0
  22. 22

    nice examples, nice writing !

    0
  23. 23

    Sebastiaan Stok

    May 8, 2009 3:22 am

    The best way to learn Regular Expressions.
    Is reading this book: Master Regular Expressions.
    http://oreilly.com/catalog/9780596528126/

    I really love that book.

    0
  24. 24

    Nice write up Karthik. I am a hard core reg-ex user, I have used it many imaginative ways, but to see it topic wise was a revelation.

    0
  25. 25

    Many compliments for this very well done post!

    0
  26. 26

    most awesome

    0
  27. 27

    The regex from #5 can be made more efficient by moving the “en” part outside the parentheses.

    /\b(?>engineer|engrave|end)\b/

    becomes

    /\ben(?>gineer|grave|d)\b/

    Aargh, backslashes get stripped from the code…

    0
  28. 28

    Great Tutorial!

    Regards

    0
  29. 29

    With regards to #2 (Back Referencing), alternations are generally slower than character classes, and as such I would recommend character classes instead:

    preg_match('#([\'"]).*?\1#', $str, $matches);

    Admittedly, on a single string or small amounts of data, the speed difference between both versions would be infinitesimal.. more a matter of principal.. I would limit alternations to sequence of characters instead of single characters.

    For #3, there are actually 3 ways to create named groups in PCRE:
    (?<name>…) named capturing group (Perl)
    (?'name'…) named capturing group (Perl)
    (?P<name>…) named capturing group (Python)

    More can be viewed over at the PCRE Manual

    For #7 (Callbacks), while I understand it is for demonstrative callback purposes, this would be a good time to note that regex isn’t always the best solution for problems.. By making use of a myriad of php built in functions, the solution to particular problems could completely negate the need for regex; in this case by simply using ucwords instead ;)
    echo $str = ucwords($str);

    PCRE is powerful to be sure.. but there is a cost to all this robustness.. speed. For all intents and purposes, it is still extremely fast. But like anything else in programming, there is more than one way to skin a cat. Sometimes regex is simply overkill or even downright inappropriate for the task at hand.

    While there are tons of references dealing with regex, my personal favorite is pretty much hailed as the bible of Regular Expressions. A great book which goes into the details of how the regex engine *thinks*. Highly recommended.

    0
  30. 30

    Brian Temecula

    May 9, 2009 8:58 am

    No matter how many times I use regex, I will always appreciate a good article like this. Thanks!

    0
  31. 31

    Great tutorial!

    0
  32. 32

    I was always frightened of reg exp. Your post remove that fear :)
    Thanks a lot. Nice image selection for each section. Great post.
    ABCoder

    0
  33. 33

    This is the most useful article I have ever read about regular expressions. Well done.

    0
  34. 34

    Great guide, wish this would have been around when I first tried to learn regular expressions.

    0
  35. 35

    Chetankumar Akarte

    May 30, 2009 9:29 am

    Hi ,
    Nice Introduction!
    Any body interested to start Text and Data Manipulation with Regular Expressions in .NET Development ?

    Please take a look…

    0
  36. 36

    I am trying to find all capitalized ‘proper nouns’ from a body of text. so that for example i have a string “i love Chocolate Fudge factory’s Chocolate Blocks made with chocolate from Scents of Sweets” I want to pull out “Chocolate Fudge”, “Chocolate Blocks” and “Scents of Sweets”. One hint to mention is that because it is a fairly long article, there will be multiple instances of each of the phrases, so i will know which ones to pull from the body. how would i do this? ideas?

    0
  37. 37

    The problem with
    .*? is whitespace.
    The best solution is to specify everything, for example
    [\sa-zA-Záéíóú].*?
    You have to account for all the possibilities you want: ;ÁÉ&…etc

    0
  38. 38

    Thanks for the helpful and nicely thought out overview.

    One suggestion, though: It would be helpful to explain the matching pattern used in the example for Back Referencing. It’s hard to decipher if you don’t already know regexes pretty well, and, besides, that snippet is intended to illustrate back reference functionality as opposed to specific matching syntax.

    0
  39. 39

    Actually, disregard my comment. I read the article too quickly. It’s fine the way it is.

    0
  40. 40

    Didrik Nordström

    February 24, 2010 5:43 am

    Great tutorial, thanks!

    Regex kicks ass

    0
  41. 41

    Nice Post! it is very useful to me! Thank you!

    0
  42. 42

    but:
    I want “a(b(c(d(e))))” not “(a(b(c(d(e))))”…

    0
  43. 43

    Very good reading. Thank you.

    0
  44. 44

    Amazing article. I was looking for advance regex resource from a long time.

    0
  45. 45

    I don’t think this is formally a regular expression, because of the ? character. The processing of a real regular expression never requires backtracking (i.e. never requires reading the input multiple times) and hence they are very fast once you generate the deterministic finite automata (DFA) for the given regular expression. Programming languages usually provide extensions to regular expressions (like ? in this case) for convenience to the programmer. There are only 3 operators in formal regular languages, which are star ( * ), union ( | ) and concatenation. All other operators used in a regular expression must be representable by using these 3, for example you can represent 0+ as 00* and so on. The operator ? used in the above expression is something totally different and which in fact violates the rule that machines that process a regular expression can do so with finite amount of memory. Hence this expression is actually not regular and the language generated by it is not a regular language, mathematically.

    0
  46. 46

    Good Job.
    As well as presentation is well. keep it up.
    Thanks for your guide.

    0
  47. 47

    I’m not sure this is possible to do without any errors one way or the other. For example, how could a regex tell the difference between “Othello is my favorite play” and “Onions are my favorite vegetable”? Also, do you have a complete list of words that won’t be capitalized when used in titles? It’s not based on length but on word type, so you’d need a full list to get it perfect. The following regex is a start but far from perfect:

    \b[A-Z][a-z]*(\s([A-Z][a-z]*|[a-z]{1,3}))*\b

    Some known deficiencies: It’ll match the start of sentences, it cheats on uncapitalizable words, and misses words like “McDonald’s”. Some of these are easy fixes, some are not. Have fun!

    0
  48. 48

    In practical programming, nearly all regular expressions are Perl-style extended regular expressions, which are not technically (as defined in Comp Sci class) a regular expression. They still share the same name, though.

    0
