agillo.net

Regex Primer : Part 2

This article is a continuation of Regex Primer - Part 1. If you haven't read that yet, it might be prudent to do so first. You can test the queries in the regex tester as you go.

Negation [^ ]

Ok! We've been introduced to []. One important feature we haven't covered is negation. You use it when you want to say something like "any character but a".

[^a] # matches b,c,d,e,f,\n .... anything but a

This applies to all the characters within the box when the ^ is present. You can't pick and choose.

[^0123456789] # matches anything but a numeric value
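If you would rather check this from a script than in the tester, here is a minimal sketch using Python's re module. The sample strings are made up for illustration.

```python
import re

# [^a] matches any single character that is not "a", newlines included.
not_a = re.findall(r"[^a]", "banana\n")
print(not_a)  # ['b', 'n', 'n', '\n']

# [^0123456789] matches any single character that is not a digit.
non_digits = re.findall(r"[^0123456789]", "room 101")
print(non_digits)  # ['r', 'o', 'o', 'm', ' ']
```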

Aliases

We now have adequate knowledge under our belt to start with some real-life examples. One of the easiest places to use regex is string validation. For example, let's validate a telephone number in the form 555-12345678. Typically you could start splitting the string and then casting both halves to numbers, but we know regex now, so let's do it quickly

555-[0-9]{8}

Done. We are saying match 555 followed by a dash followed by exactly 8 digits.

But we can go even shorter!

After a while you'll notice that you use ranges like [0-9] and [a-z] an awful lot. They are very common and pop up all the time. As such, there exist various shorthands for these commonly used ranges. In this case there is \d, which stands for digit and is semantically identical to [0-9]

555-\d{8} # this is exactly the same as 555-[0-9]{8}
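As a rough sketch, here are both spellings in Python. The sample number 555-12345678 has eight digits after the dash, so the patterns below use {8}. One assumption worth flagging: in Python 3, \d on text strings also accepts digits from other scripts unless you pass re.ASCII, so the flag is what makes it line up with [0-9] here.

```python
import re

number = "555-12345678"

# Long form: an explicit range for the eight digits.
long_form = re.compile(r"555-[0-9]{8}")
# Short form: re.ASCII limits \d to the characters 0-9.
short_form = re.compile(r"555-\d{8}", re.ASCII)

long_ok = bool(long_form.search(number))
short_ok = bool(short_form.search(number))
too_short = bool(short_form.search("555-1234"))

print(long_ok, short_ok, too_short)  # True True False
```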

Here are the other shorthands:

note: You don't have to know these. You can achieve the exact same effect by just typing it out using the box form. They are there for convenience.

shorthand   meaning                          semantically identical to
\d          digit                            [0-9]
\w          word character                   [a-zA-Z0-9_]  (note the underscore is also included)
\s          space, tab or newline            [ \t\r\n]
\D          anything but a digit             [^\d]
\W          anything but a word character    [^\w]
\S          anything but a space             [^\s]

TIP: notice they each have an exact opposite by simply using the uppercase form. You only really need to remember the first 3.
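To see the shorthands and their uppercase opposites side by side, here is a small Python sketch. The sample text is invented for the demonstration.

```python
import re

# A made-up sample containing letters, an underscore, digits, a tab and punctuation.
text = "id_42\tok!"

digits = re.findall(r"\d", text)
word_chars = re.findall(r"\w", text)
spaces = re.findall(r"\s", text)
non_word = re.findall(r"\W", text)

print(digits)      # ['4', '2']
print(word_chars)  # ['i', 'd', '_', '4', '2', 'o', 'k']
print(spaces)      # ['\t']
print(non_word)    # ['\t', '!']

# The uppercase form is the negated box of the lowercase form.
same = re.findall(r"\D", text) == re.findall(r"[^\d]", text)
print(same)  # True
```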

The DOT

The dot is a special shorthand. I am providing it mainly because you will see it pop up when you read someone else's code. It basically says match any character except the newline. The problem is that what constitutes a newline isn't always consistent across platforms.

.*   # match it all (try it on the demo page and see its power...)

.    # same as [^\n]

The dot is misused quite a bit, so generally I would recommend being specific and using a combination of the other shorthands and the box.
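Here is a short Python sketch of the newline behaviour, using an invented two-line string. The re.DOTALL flag at the end is specific to Python (other engines have their own switch) and is shown only to make the default visible by contrast.

```python
import re

text = "first line\nsecond line"

# By default the dot stops at a newline, so .* only grabs the first line.
dot_match = re.match(r".*", text).group()
# The negated box [^\n]* behaves the same way.
box_match = re.match(r"[^\n]*", text).group()
# With re.DOTALL the dot matches newlines too, so .* takes everything.
dotall_match = re.match(r".*", text, re.DOTALL).group()

print(dot_match)             # first line
print(box_match)             # first line
print(dotall_match == text)  # True
```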

Escaping

Sometimes you want to actually find a dot, or the [ character. Since these and others have special meaning and are part of the regex syntax, you have to escape them with a backslash when you want to search specifically for these characters. For example

\.   # actually looks for just a DOT instead of everything
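A quick Python sketch of the difference an escape makes. The version strings are invented, and re.escape is a helper from Python's standard library that adds the backslashes for you when the text to find comes from somewhere else.

```python
import re

text = "v1.2 or v1x2"

# Unescaped, the dot matches any character, so "v1x2" is found as well.
unescaped = re.findall(r"v1.2", text)
# Escaped, only a literal dot will do.
escaped = re.findall(r"v1\.2", text)

print(unescaped)  # ['v1.2', 'v1x2']
print(escaped)    # ['v1.2']

# re.escape backslash-escapes the special characters in a literal string.
literal = re.escape("[draft]")
found = re.search(literal, "notes [draft] v2").group()
print(found)  # [draft]
```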

Grouping and OR

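Ok, let's get back to validation. This time an email address. We'll keep it somewhat simpler than the real specs and require the following of an email address in our system: it starts with a lowercase letter, continues with any number of word characters or dashes, then an @, then one or more lowercase letters, a dot, and one or more lowercase letters.

[a-z][\w-]*@[a-z]+\.[a-z]+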

Done? Did you remember to escape the dot? Another gotcha lies in the use of the +.  In the second part we want to match 1 or more, not zero or more.
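To see why the escaped dot matters, here is a hedged Python sketch. The addresses are invented, and fullmatch is used so the whole string has to fit the pattern.

```python
import re

# The dot between domain and suffix is escaped.
strict = re.compile(r"[a-z][\w-]*@[a-z]+\.[a-z]+")
# The same pattern with the dot left unescaped, for comparison.
loose = re.compile(r"[a-z][\w-]*@[a-z]+.[a-z]+")

good = strict.fullmatch("mona-lisa@example.com") is not None
no_dot_strict = strict.fullmatch("mona@examplecom") is not None
no_dot_loose = loose.fullmatch("mona@examplecom") is not None
digit_first = strict.fullmatch("9lives@example.com") is not None

print(good)           # True
print(no_dot_strict)  # False: a literal dot is required
print(no_dot_loose)   # True: the bare dot happily matches a letter
print(digit_first)    # False: the first character must be a-z
```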

Now let's say that we want to change our rules and match only the big valid domains. So our email address has to end in either com or net. We can't do this with the syntax we've learned up till now. The regex query will make use of 2 new concepts, the OR operator and groups.

[a-z][\w-]*@[a-z]+\.(com|net)

Let's break that down by example

com|net  #  match com or net

a|b|c    # same as [abc].

The group is needed to indicate where the alternatives begin and end. Without it, the OR splits the entire pattern into a left side and a right side. Say we wanted to match 'Brad Pitt' or 'Angelina Pitt' only

Brad|Angelina Pitt  # matches either 'Brad' or 'Angelina Pitt'
                    # Not what we wanted!

(Brad|Angelina) Pitt  # now we got it

The principle of grouping using parentheses should be relatively straightforward to a developer.  In fact we can start combining it with some of the other operators we already know.

(dog)+   #  match dog,dogdog,dogdogdog ...

java(bean)?    #  match java or javabean
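The grouping examples above can be tried out with a few lines of Python. This is just a sketch, and fullmatch is used so that each test string has to match from start to end.

```python
import re

ungrouped = re.compile(r"Brad|Angelina Pitt")
grouped = re.compile(r"(Brad|Angelina) Pitt")

# Without the group, a bare "Brad" is accepted.
brad_ungrouped = ungrouped.fullmatch("Brad") is not None
# With the group, the surname is required for both first names.
brad_grouped = grouped.fullmatch("Brad") is not None
brad_pitt = grouped.fullmatch("Brad Pitt") is not None
angelina_pitt = grouped.fullmatch("Angelina Pitt") is not None

print(brad_ungrouped, brad_grouped, brad_pitt, angelina_pitt)  # True False True True

# Quantifiers apply to the whole group.
dogs = re.fullmatch(r"(dog)+", "dogdogdog") is not None
java = re.fullmatch(r"java(bean)?", "java") is not None
javabean = re.fullmatch(r"java(bean)?", "javabean") is not None

print(dogs, java, javabean)  # True True True
```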

Greediness

Greediness is a special case, but one you are bound to run into when you start doing more advanced queries. One typical area where this comes up is parsing XML or HTML.

<p> this is a <b>paragraph</b></p>

And we want to find all the tags used, so <p>, <b>, </b> and </p>

<.*>  # Look for tags

If you try this, the result might not be what you expect. The * operator is greedy: it will try to match as much as possible. So it looks for a <, then as much text as it can find (greedy behavior), and finally a >.

The behavior we wanted is not to match as much as possible but to stop at the first > instead of the last >. This is an important distinction, and important to wrap your head around. The way we override the greedy behavior is by adding a ?

<.*?>  # Now it does what we want (not greedy)

The non-greedy modifier works the same way for the 1 or more operator, where it looks like +?
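Here is the same experiment as a small Python sketch, run on the paragraph snippet from above. The third pattern is an extra I've added for comparison: it avoids the dot altogether by using a negated box, in the spirit of being specific.

```python
import re

html = "<p> this is a <b>paragraph</b></p>"

# Greedy: from the first < all the way to the last >.
greedy = re.findall(r"<.*>", html)
# Non-greedy: stop at the first > after each <.
lazy = re.findall(r"<.*?>", html)
# Alternative: anything that is not a > between the brackets.
boxed = re.findall(r"<[^>]*>", html)

print(greedy)  # ['<p> this is a <b>paragraph</b></p>']
print(lazy)    # ['<p>', '<b>', '</b>', '</p>']
print(boxed)   # ['<p>', '<b>', '</b>', '</p>']
```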

Boundaries

Earlier, we offered a simple telephone number validation

555-\d{8}   # 555- followed by 8 numbers

Although this is technically correct, if we were to test a couple of strings against it, the following examples would also be valid:

someTextUpFront 555-12345678

555-12345678 my phonenumber

Both pass because each contains a portion that satisfies 555-\d{8}. What we wanted was to match the number and nothing else.

^555-\d{8}$

Boundaries are special because they don't take up a space (I'll clear this up later). Instead, they are placeholders that mark positions in the text. In this case we are marking where the text begins and ends. So the first thing after the beginning should be our phone number, and the last thing before the end should be our phone number.

boundary   meaning
^          The beginning of the text. (It's unfortunate the makers of regex chose the same character as the negation character, but you just need to learn it. When not within [] the ^ character means the beginning of the text.)
$          The end of the text.

Some examples:

a   #  matches any a

^a  #  matches only a text that starts with an a

a$  #  matches only a text that ends with an a
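A minimal Python sketch of the phone number check with and without the start and end boundaries, using the three test strings from above. The search call looks for the pattern anywhere in the string, which is how the tester behaves.

```python
import re

unanchored = re.compile(r"555-\d{8}")
anchored = re.compile(r"^555-\d{8}$")

candidates = [
    "555-12345678",
    "someTextUpFront 555-12345678",
    "555-12345678 my phonenumber",
]

# Without boundaries every candidate contains a matching portion.
unanchored_hits = [bool(unanchored.search(c)) for c in candidates]
# With ^ and $ only the bare number is accepted.
anchored_hits = [bool(anchored.search(c)) for c in candidates]

print(unanchored_hits)  # [True, True, True]
print(anchored_hits)    # [True, False, False]
```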

There is one more boundary, the word boundary (\b). Let's use an example. I want to find the words for and she.

(for|she)  # match for or she

This matches all the instances of the words for and she. But in the example text of the regex tester it also matches the for in before. I didn't want that. I could try to look for instances that are preceded and followed by a space.

[ ](for|she)[ ]  # match for or she, each surrounded by spaces

Which now doesn't match before - great! But if you look closely we have created a new problem. In the text there is the sentence "for she had plenty of time". And our new regex doesn't match the she in that sentence. This makes perfect sense, as soon as you see why.

for she had plenty of time

The space that precedes "she" is also the space that follows "for". It was used up by the match on "for", so it can't be matched again. The simplest way to achieve our goal is by using the word boundary

\b(for|she)\b  # match for or she

The word boundary here defines where the word begins and ends, much like the previous boundaries defined the start and end of the text. I said earlier that 'Boundaries are special because they don't take up a space'. With the example above we match exactly for or she. We don't actually match the \b. No text is consumed like with the [ ] alternative. This is an important distinction, because with all other regex operators something out of your text is used up, and can't be found again.
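The three attempts can be compared in a short Python sketch. The sentence is a shortened stand-in for the tester's example text, so treat it as an assumption. findall returns what the group captured for each match.

```python
import re

text = "before long, for she had plenty of time"

# Plain alternation also finds the "for" hidden inside "before".
plain = re.findall(r"(for|she)", text)
# Requiring spaces consumes the space after "for", so "she" is missed.
spaced = re.findall(r"[ ](for|she)[ ]", text)
# Word boundaries consume nothing, so both words are found.
bounded = re.findall(r"\b(for|she)\b", text)

print(plain)    # ['for', 'for', 'she']
print(spaced)   # ['for']
print(bounded)  # ['for', 'she']
```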

The End - Find And Replace

If you got here, congratulations. I have kept the best for last, because the one area where I personally save the most time is find and replace.

This can't be tested in the regex tester, so you will need an editor or IDE with regex search and replace (Eclipse or Notepad++, for example).

Let's say we have a file that contains hundreds of lines like this:

31-01-10_backup32
24-01-10_backup1
24-02-10_backup_mona
11-03-09_backup_lisa

And we need to restructure the lines so that the day and month swap places and the year is written with all 4 digits.

\d{2}-\d{2}-\d{2}_backup.*  # would match the lines

But we want to be able to take out specific parts. So we need to use the grouping operator () we have already come across

(\d{2})-(\d{2})-(\d{2})_backup(.*)  # matches the same lines, but now captures 4 groups


Then all we need to do is replace the lines with

{Group2}-{Group1}-20{Group3}_backup{Group4}

Which in the replace field looks like this (some editors write the group references as $1, $2 and so on instead)

\2-\1-20\3_backup\4

And it's as easy as that.
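If you prefer a script over an editor, the same find and replace can be sketched in Python with re.sub, which accepts the same \1 style group references. The list below holds the four sample lines from above, and the variable names are mine.

```python
import re

lines = [
    "31-01-10_backup32",
    "24-01-10_backup1",
    "24-02-10_backup_mona",
    "11-03-09_backup_lisa",
]

# Four groups: first date field, second date field, 2 digit year, rest of the name.
pattern = re.compile(r"(\d{2})-(\d{2})-(\d{2})_backup(.*)")
# Swap the first two fields and prefix the year with 20.
replacement = r"\2-\1-20\3_backup\4"

converted = [pattern.sub(replacement, line) for line in lines]

for line in converted:
    print(line)
# 01-31-2010_backup32
# 01-24-2010_backup1
# 02-24-2010_backup_mona
# 03-11-2009_backup_lisa
```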

Best of luck.
