Crucial Concepts Behind Advanced Regular Expressions

Regular expressions (or regex) are a powerful way to traverse large strings in order to find information. They rely on underlying patterns in a string’s structure to work their magic. Unfortunately, simple regular expressions are unable to cope with complex patterns and symbols. To deal with this dilemma, you can use advanced regular expressions.

Below, we present an introduction to advanced regular expressions, with eight commonly used concepts and examples. Each example outlines a simple way to match patterns in complex strings. If you do not yet have experience with basic regular expressions, have a look at this article to get started. The syntax used here matches PHP’s Perl-compatible regular expressions.

1. Greediness/Laziness


By default, all regex repetition operators are greedy: they try to match as much of the string as possible. This is not always the desired effect, and lazy operators solve the problem. They match as little as possible and are written by adding a ‘?’ after the respective greedy operator. Alternatively, the ‘U’ modifier may be used to make all repetition operators lazy by default. Differentiating between greediness and laziness is key to fully understanding advanced regular expressions.

Greedy Operators

The * operator matches the previous expression 0 or more times. It is a greedy operator. Consider the following expression:

preg_match( '/<h1>.*<\/h1>/', '<h1>This is a heading.</h1><h1>This is another one.</h1>', $matches );

Recall that a . means any character except a new line. The above regular expression is looking for an h1 tag and all of its contents. It uses the . and * operators to constantly match anything inside the tag. This pattern will match:

<h1>This is a heading.</h1><h1>This is another one.</h1>

It returns the whole string. The * operator will continuously match everything — even the middle closing h1 tag — because it is greedy. Matching the whole string is the best it can do.

Lazy Operators

Let’s change the above operator by adding a ‘?’ after it. This will make it lazy:
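
A minimal sketch of the lazy version next to the greedy one, assuming the same two-heading subject string as above kept on a single line (the variable names are illustrative):

```php
<?php
// Subject string with two headings on a single line.
$html = '<h1>This is a heading.</h1><h1>This is another one.</h1>';

// Greedy: .* runs on to the last closing tag.
preg_match( '/<h1>.*<\/h1>/', $html, $greedy );

// Lazy: .*? stops at the first closing tag.
preg_match( '/<h1>.*?<\/h1>/', $html, $lazy );

echo $greedy[0], "\n"; // the whole string
echo $lazy[0], "\n";   // <h1>This is a heading.</h1>
```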


The regex now fulfills its duty and matches only the first h1 tag. Another greedy operator that shares this property is {n,}, which matches the previous expression n or more times. Without a question mark, it looks for the most repetitions possible. With one, it starts from n repetitions and adds more only if the rest of the pattern requires them:

# Set up a String
$str = 'hihi';

# Match it using the greedy {n,} operator
preg_match( '/(hi){1,}/', $str, $matches ); # $matches[0] will be 'hihi'

# Match it with the lazy {n,}? operator
preg_match( '/(hi){1,}?/', $str, $matches ); # $matches[0] will be 'hi'

2. Back Referencing

What it does

Back referencing is a way to refer to previously matched patterns inside a regular expression. For example, take a look at this simple regex that matches an expression in quotes:

# Set up an array of matches
$matches = array();

# Create a String
$str = "\"This is a 'string'\"";

# Traverse it with regular expressions
preg_match( "/(\"|').*?(\"|')/", $str, $matches );

# Print the whole match
echo  $matches[0];

Unfortunately, this will not correctly match the string. Instead, it will print:

"This is a '

This regular expression matches the opening double quote but finds a different type of quote to close it. This is because it was given the option of picking a double or single quote at the end. In order to fix this, you can use back referencing. The expressions \1, \2, …, \9 hold references to previously captured subpatterns. The first matched quote, in this case, will be held by \1.

How to Use It

In order to apply this concept to the aforementioned example, use \1 in place of the last quote:

preg_match( '/("|\').*?\1/', $str, $matches );

This will now correctly return:

"This is a 'string'"

Remember that back referencing may also be used by preg_replace. Note that in the replacement string you should use $1 … $9 … $n instead of \1 … \9 (any number of these will work). For example, if you want to replace all paragraph tags with text that represents them, use:

$text = preg_replace( '/<p>(.*?)<\/p>/',
'&lt;p&gt;$1&lt;/p&gt;', $html );

The $1 back reference holds the text inside the paragraph and is being used in the replace pattern itself. This completely valid expression shows an easy way to access matched patterns even while replacing.

3. Named Groups

When using multiple back references, a regular expression can quickly become confusing and hard to understand. An alternative way to back reference is by using named groups. A named group is specified by using (?P<name>pattern), where name is the name of the group and pattern is the regular expression in the group itself. The group can then be referred to by (?P=name). For example, consider the following:
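
A minimal sketch of a named group with a named back reference, assuming the same quoted subject string as in the back-referencing section (the group name quote is illustrative):

```php
<?php
// Same subject string as in the back-referencing example.
$str = "\"This is a 'string'\"";

// (?P<quote>...) names the group; (?P=quote) refers back to whatever it matched.
preg_match( '/(?P<quote>"|\').*?(?P=quote)/', $str, $matches );

echo $matches[0], "\n";       // "This is a 'string'"
echo $matches['quote'], "\n"; // "
```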


The above expression has the same effect as the previous back reference example, but uses a named group instead. It is also significantly easier to read.

Named groups are also useful when sifting through the array of matches. The name given to a specific pattern is also the key of the corresponding matches array.

preg_match( '/(?P<quote>"|\')/', "'String'", $matches );

# This will print "'"
echo $matches[1];

# This will also print "'", as it is a named group
echo $matches['quote'];

Thus, named groups not only make code easier to read but also organize it.

4. Word Boundaries

Word boundaries are places in a string that fall between a word character and a non-word character. What makes them special is that they don’t actually match a character: their length is zero. The \b regular expression matches any word boundary.

Unfortunately, boundaries are so often skimmed over that many do not recognize their real significance. For example, let’s say you want to match the word “import”:
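
A minimal sketch of the naive approach, assuming a bare /import/ pattern with no boundaries (the sample sentences are illustrative):

```php
<?php
// A bare pattern with no boundaries around the word.
$pattern = '/import/';

$wanted  = preg_match( $pattern, 'We import goods.' );     // 1
$bogus_a = preg_match( $pattern, 'This is important.' );   // 1, although the word is "important"
$bogus_b = preg_match( $pattern, 'Two imports arrived.' ); // 1, although the word is "imports"

var_dump( $wanted, $bogus_a, $bogus_b );
```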


Watch out! Regular expressions can be tricky. The above expression will also match longer words that merely contain “import”, such as “important” and “imports”.


You may think it is as simple as adding a space before and after import to prevent these bogus matches:

/ import /

But what about this case?

The trader voted for the import

When import is at the beginning or the end of a string, the modified regex will fail. Thus, splitting this up into cases is required:

/(^import | import | import$)/i

Looking back at our regular expression, it does not take periods or other punctuation into account. Just to match this single word, a regular expression may end up looking like this:

/(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|!)?$)/i

That’s a lot of code to match just a single word. This is why word boundaries are so significant. To accomplish the above statement and many other variations with word boundaries, all that is necessary is:
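
A minimal sketch of the word-boundary version, assuming the pattern is /\bimport\b/i (the test sentences are illustrative):

```php
<?php
// \b on both sides; the i modifier makes the match case-insensitive.
$pattern = '/\bimport\b/i';

$tests = array(
    'The trader voted for the import', // end of string
    'Import: the new tariffs',         // start of string, followed by punctuation
    'We import, then we export.',      // followed by a comma
    'This is important.',              // part of a longer word, should not match
);

$results = array();
foreach ( $tests as $test ) {
    $results[] = preg_match( $pattern, $test );
    echo end( $results ) ? 'match' : 'no match', ": $test\n";
}
```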


This will match every case above and more. The flexibility of \b comes from the fact that it matches a zero-length string. All it matches is an imaginary space between two characters. It checks whether one of the characters is a word character and the other is a non-word character. If so, it matches. If the beginning or end of a string is encountered, \b treats it as a non-word character. Because the i in import is a word character, the boundary before it matches, and so does import.

Note that the opposite of \b is \B. This operator matches the position between two word characters or two non-word characters. Thus, if you would like to match ‘hi’ inside another word, you could use:
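
A minimal sketch, assuming the pattern is /\Bhi\B/ (the test words are illustrative):

```php
<?php
// \B on both sides: 'hi' must have a word character before and after it.
$pattern = '/\Bhi\B/';

$alone  = preg_match( $pattern, 'hi' );   // 0: boundaries on both sides
$inside = preg_match( $pattern, 'this' ); // 1: 'hi' sits between 't' and 's'
$start  = preg_match( $pattern, 'hint' ); // 0: 'hi' starts the word

var_dump( $alone, $inside, $start );
```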


5. Atomic Groups

Atomic groups are special regex groups that are non-capturing. They are usually used to increase the efficiency of a regular expression, but may also be applied to eliminate certain matches. An atomic group is specified by using (?>pattern):
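
A minimal sketch, assuming the pattern under discussion is /(?>his|this)/. The a(?>bc|b)c patterns are an added illustration, not part of the original example, of how an atomic group can eliminate a match that an ordinary group would find:

```php
<?php
// Assumed form of the pattern discussed in the text.
$smashing = preg_match( '/(?>his|this)/', 'smashing' ); // 0: neither alternative occurs

// Illustrative addition: once the atomic group has matched 'bc',
// the engine cannot go back into the group and try 'b' instead.
$plain_abc   = preg_match( '/a(?:bc|b)c/', 'abc' );  // 1: ordinary group backtracks to 'b'
$atomic_abc  = preg_match( '/a(?>bc|b)c/', 'abc' );  // 0: atomic group refuses to backtrack
$atomic_abcc = preg_match( '/a(?>bc|b)c/', 'abcc' ); // 1: 'bc' then 'c' succeeds directly

var_dump( $smashing, $plain_abc, $atomic_abc, $atomic_abcc );
```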


When the regex engine exits an atomic group, it discards all backtracking positions remembered by the tokens inside it. Consider the word ‘smashing’. Using the above regular expression, the regex engine will try to match the pattern ‘his’ in ‘smashing’ and will not find a match; ‘this’ fails as well. Now consider a string in which ‘his’ does match. This is where the atomic group kicks in: once the group has matched, the engine discards its backtracking positions, so if a later part of the pattern fails, it will not return to the group and try ‘this’ at the same position. Nothing is lost, because ‘his’ and ‘this’ begin with different letters and can never both match at the same position.

The above example does not have many practical uses. We might as well have used /t?his/ instead. Look at the following:
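
A minimal sketch of the non-atomic pattern, assuming it is /\b(engineer|engrave|end)\b/ as the alternatives named in the next paragraph suggest:

```php
<?php
// Assumed form of the pattern: three alternatives between word boundaries.
$pattern = '/\b(engineer|engrave|end)\b/';

$longer = preg_match( $pattern, 'engineering' );       // 0: every alternative is tried and fails
$exact  = preg_match( $pattern, 'the engineer left' ); // 1

var_dump( $longer, $exact );
```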


If the regex engine is given the word ‘engineering’, it will first match ‘engineer’. The word boundary that follows, \b, will not match. Thus, the engine will move on to the next alternative: engrave. It finds that ‘eng’ matches, but the rest does not. Finally, ‘end’ is attempted and also fails. If you look carefully, you will realize that once the engine matches ‘engineer’ and fails at the last word boundary, it cannot possibly match ‘engrave’ or ‘end’. No two of these words can match at the same position, so the regex engine should not continue with the other trials.
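
A minimal sketch of the atomic version, assuming the same three alternatives wrapped in (?>...). The test words are illustrative; the results are the same as with an ordinary group, only the number of attempts differs:

```php
<?php
// Atomic group: after 'engineer' matches and \b fails, no other alternative is tried.
$atomic = '/\b(?>engineer|engrave|end)\b/';

$words   = array( 'engineering', 'engineer', 'engrave', 'end', 'ending' );
$results = array();

foreach ( $words as $word ) {
    $results[ $word ] = preg_match( $atomic, $word );
}

print_r( $results ); // engineering => 0, engineer => 1, engrave => 1, end => 1, ending => 0
```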


The above is a much better alternative that will save the regex engine time and improve the code’s efficiency.

6. Recursion


Recursion in regular expressions can be used to match nested constructs, such as parentheses, (this (that)), and HTML tags, <div></div>. It requires the use of (?R), an operator that recurses over the whole pattern. Consider the regular expression that matches nested parentheses:
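
A minimal sketch, assuming the pattern is built from the two pieces the next paragraph names, (?>[^()]+) and (?R), placed between escaped parentheses (the subject strings are illustrative):

```php
<?php
// \( ... \) match the literal parentheses; in between, either a run of
// non-parenthesis characters or the whole pattern again via (?R).
$pattern = '/\(((?>[^()]+)|(?R))*\)/';

preg_match( $pattern, 'see (this (that)) here', $balanced );
echo $balanced[0], "\n"; // (this (that))

// With an unclosed outer parenthesis, only the balanced inner pair matches.
preg_match( $pattern, '(this (that)', $partial );
echo $partial[0], "\n"; // (that)
```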


The escaped parentheses at either end of this regular expression match the opening and closing parentheses of the nested construct. Between them comes an optional group, which can match either non-parenthetical characters, (?>[^()]+), or the whole expression again as a sub-pattern, (?R). Notice that this group is repeated as many times as possible to match all nested parentheses.

Another example of recursion at work is the following:
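
A sketch of such a pattern, reconstructed from the description in the next paragraph; the exact expression is an assumption, and the subject string is illustrative:

```php
<?php
// Assumed pattern: capture the tag name, skip the rest of the opening tag,
// then match text or nested tags recursively, and close with the same name.
$pattern = '/<([\w]+).*?>((?>[^<>]+)|(?R))*<\/\1>/';

$html = '<div><p>text</p></div>';
preg_match( $pattern, $html, $matches );

echo $matches[0], "\n"; // <div><p>text</p></div>
echo $matches[1], "\n"; // div
```

Patterns like this are fine for simple, well-formed fragments; they are not a substitute for an HTML parser.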


The above expression combines character groups, greedy operators, back referencing, and atomic groups to match nested tags. The first parenthesized group, ([\w]+), captures the tag name for use later in the regular expression. It then proceeds to match the rest of the tag. The next parenthesized sub-expression is very similar to the one above: it either matches non-tag characters, (?>[^<>]+), or recurses over another tag, (?R). Finally, the last part of the expression matches the closing tag.

7. Callbacks


Certain matches in a pattern may require special modifications. In order to apply multiple or complex changes, callbacks can be used. A callback supplies dynamic substitution strings to the preg_replace_callback function, which takes a function as a parameter and calls it whenever a match is found. This function receives the match array as a parameter and returns a modified string that is used as the replacement.

As an example, consider a regular expression that capitalizes the first letter of every word in a given string. Unfortunately, PHP does not have a regex operator that changes a character to a different case. To accomplish this task, a callback may be used. First, the expression must match all letters that need to be capitalized:
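
A minimal sketch, assuming the pattern is /\b\w/, that is, a word boundary followed by a single word character (the sample string is illustrative):

```php
<?php
// \b\w matches the first character of every word.
preg_match_all( '/\b\w/', 'hello big world', $matches );

print_r( $matches[0] ); // h, b, w
```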


The above uses both word boundaries and character classes to work. Now that we have this expression, we can write a callback function:

function upper_case( $matches ) {
	return strtoupper( $matches[0] );
}

upper_case takes in an array of matches and returns the whole matched pattern in uppercase. $matches[0], in this case, represents the letter that needs to be capitalized. All of this can now be put together using the preg_replace_callback function:

$str = preg_replace_callback( '/\b\w/', 'upper_case', $str );

That is the power of a simple callback.
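
As a variation, PHP 5.3 and later also accept an anonymous function as the callback. A minimal sketch with an illustrative input string:

```php
<?php
$str = 'hello big world';

// The closure receives the match array and returns the replacement text.
$result = preg_replace_callback(
    '/\b\w/',
    function ( $matches ) {
        return strtoupper( $matches[0] );
    },
    $str
);

echo $result, "\n"; // Hello Big World
```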

8. Commenting


Commenting is not a way to actually match strings, but it is one of the most important parts of regular expressions. As you dive deep into larger, more complex expressions, it becomes hard to decipher what is actually being matched. Using comments in the middle of regular expressions is the perfect way to minimize such confusion.

To place a comment inside a regular expression, use the (?#comment) format. Replace “comment” with the word(s) of your choice:
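
A minimal sketch of inline comments; the time-matching pattern and the subject string are illustrative, not taken from the original example:

```php
<?php
// Each (?#...) is ignored by the regex engine and only documents the token after it.
$pattern = '/(?#hours)\d{1,2}:(?#minutes)\d{2}/';

preg_match( $pattern, 'Meet at 9:30 sharp', $matches );

echo $matches[0], "\n"; // 9:30
```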


It is especially important to comment regular expressions that you release to the public. Users of your regex will be able to easily understand and modify the pattern to meet their needs. It can even go so far as to help you decode it when revisiting a program.

Consider using the “x” or (?x) modifier for free-spacing mode with comments. This causes a regular expression to ignore white space between tokens. Literal spaces can still be represented with [ ] or with a backslash followed by a space:

/
\d    #digit
[ ]   #space
\w+   #word
/x

The above is the same as:

/\d(?#digit)[ ](?#space)\w+(?#word)/
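
A minimal sketch that runs both forms through preg_match to show that they behave the same (the subject string is illustrative):

```php
<?php
// Free-spacing form: white space is ignored and # starts a comment.
$free = '/
    \d    # digit
    [ ]   # space
    \w+   # word
/x';

// Inline-comment form of the same pattern.
$inline = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

preg_match( $free, 'Room 7 north', $a );
preg_match( $inline, 'Room 7 north', $b );

echo $a[0], "\n";             // 7 north
var_dump( $a[0] === $b[0] );  // bool(true)
```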

Always create well-documented code.

Further Resources

  • Regular-Expressions.info
    Comprehensive website on regular expressions
  • Cheat Sheet
    Informative regular expressions cheat sheet
  • Regex Generator
    JavaScript regular expressions generator


Karthik Viswanathan is a high-school student who loves to program and create websites. You can view Karthik's work on his blog, Lateral Code, and explore the most popular articles on the Web through his online Twitter application.

  1. 1

    Nice Article Karthik… cant expect this much from an high school student. Nice reading… Thanks for sharing.

    DKumar M.

  2. 2

    Sjoerd Maessen

    May 7, 2009 6:23 am

    Great tutorial!

  3. 3

    Good work. The writing isn’t subjective and it’s a great baseline that beginning programmers should be able to grasp. Enough detail to give the idea, but not so much that I lose interest.

    I would love to see some the the design / photo / inspiration articles follow this direction too.

  4. 4


    May 7, 2009 6:38 am

    Wow! Love the tutorial. If I’m not wrong, a lot of people here needed a good in-depth article on regex – a critical but hated part of programming. I know I definitely needed it. Thanks!

  5. 5

    [ ] #space ?
    Heard of \s ?

  6. 6

    wow! I love this geek article! I didn’t know some of this advanced tricks… by the way, there’s an error on the first lazy operator example (the strong tag)

    (SM) Thank you, fixed!

  7. 7

    Instead of preg_replace_callback, you can use the ‘e’ operator.

    For instance,
    $str = preg_replace(‘/(blah)/ie’, ”_’.strtolower(“1″).’_”, ‘AOEUBLAHAOEU’);

    Will leave $str with the value of ‘AOEU_blah_AOEU’
    Use it wisely. Always escape your strings…

  8. 8

    Simon White

    May 7, 2009 8:09 am

    Using the above regular expression, the regex engine will first try to match the pattern ‘hi’ in ’smashing’. It will not find a match.

    “hi” is in smashing.

  9. 9

    Javier Albinarrate

    May 7, 2009 8:19 am

    There is a very minor error in “Lazy Operators”
    The example should be


    Instead of



    (SM) Thank you, fixed!

  10. 10

    Floris Fiedeldij Dop

    May 7, 2009 8:24 am

    Oh pretty sweet, finally I can up the level of regex knowledge that I have. Maybe I will finally grasp this a bit better.

  11. 11

    Matt Lindley

    May 7, 2009 9:27 am

    fantastic! Thanks for the info. It would have been nice to see which concepts are supported by which programming languages. I use & most of the concepts appear to be basically the same, with a notable exception of Named Groups (though I need to do more research to be certain). Also, one of the concepts (Atomic Groups) was so foreign to me that additional samples would have been helpful. I am not certain that I understand that one at all. Will have to do some more research.

    That said, this was a great article & will be going into my reference library.

  12. 12

    great tut thanks

  13. 13

    Nicolas Elizaga

    May 7, 2009 9:35 am

    Awesome. Grant Skinner of Flash infamy also made an AIR application to simplify creating Regular Expressions:

    And the browser version:

  14. 14

    Daniel Einspanjer

    May 7, 2009 10:05 am

    Good article, but a tiny bit of editing would help.

    You tried to use a “strong” tag to bold the question mark in your first example of a lazy quantifier. Since the code was in a preformatted code block, the strong tag obfuscates your example.

    In your first example of atomic groups, you use the regex (?>hi|this) with the string smashing. The part you overlooked is that hi actually does match the word smashing. You should either use a different test word, or use something like “his” instead of “hi” for your first regex alternate so that the example works as expected.

  15. 15

    Thanks for including #8. Nothing is worse than running into a 2-line regex full of gibberish with no comments when you’re maintaining code.

  16. 16

    I thought I knew everything about regular expressions until I read this article!

  17. 17

    Keep in mind that the patterns do not work on Unicode characters by default–at least in JavaScript’s regular expressions.
    Thus a pattern like /bw/ won’t match a whole word if it contains non ASCII characters. For example the German word “Grüße” is not matched.

  18. 18

    Awesome article, always had a thing for regex so It’s nice that someone has finally typed it out.

  19. 19


    May 7, 2009 7:44 pm

    Thank You very Much SM.

  20. 20

    Great article, loved it and could learn alot fromt it!

