This is the last of a 3 part series introducing the common functions of regex, arming you with enough of the basics to get through most of the common text searching problems. Part 1 is here.
A regex tester is provided as well, and it is highly recommended to try some of the operators out as you go.
Greediness is a special case, but one you are bound to run into when you start doing more advanced queries. One typical area where this comes up is parsing XML or HTML. Say we have the following line:
<p> this is a <b>paragraph</b></p>
And we want to find all the tags used, so <p>, <b>, </b> and </p>. A first attempt:
<.*> # Look for tags
If you try this, the result might not be what you expect: the whole line comes back as a single match. The * operator is greedy; it will try to match as much as possible. So it looks for a <, then as much text as it can find (greedy behavior), and finally a >, which ends up being the last > on the line.
The behavior we wanted is not to match as much as possible, but to stop as soon as the first > is found, instead of the last one. This is an important distinction, and important to wrap your head around. The way we override the greedy behavior is by adding a ? after the *:
<.*?> # Now it does what we want (not greedy)
The same works for the 1 or more operator, where the non-greedy version looks like +?
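If you prefer to see the difference in code, here is a minimal sketch in Python (my choice of language, the article itself is tool-agnostic). It runs both patterns against the example line using the standard `re` module; the variable names are just for illustration.

```python
import re

# The example line from above.
html = "<p> this is a <b>paragraph</b></p>"

# Greedy: .* runs on to the last ">" in the text, so we get one big match.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first ">" it can, so we get one match per tag.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<p> this is a <b>paragraph</b></p>']
print(lazy)    # ['<p>', '<b>', '</b>', '</p>']
```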
In the previous tutorial, we offered a simple telephone number validation:
555-\d{8} # 555- followed by 8 digits
Although this is technically correct, if we were to test a couple of strings against it, the following examples would also be valid:
someTextUpFront 555-12345678
555-12345678 my phonenumber
Both are accepted because the 555-\d{8} portion is found somewhere in the text. What we wanted was to match the number and nothing else:
^555-\d{8}$
Boundaries are special because they don't take up a space (I'll clear this up later). Instead, they are placeholders that mark positions in the text. In this case we are marking where the text begins and ends. So the first thing after the beginning should be our phone number, and the last thing before the end should be our phone number.
| boundary | meaning |
|---|---|
| ^ | The beginning of the text. (It's unfortunate the makers of regex chose the same character as the negation character, but you just need to learn it. When not within [] the ^ character means the beginning of the text.) |
| $ | End of the text |
Some examples:
a # matches any a
^a # matches only a text that starts with an a
a$ # matches only a text that ends with an a
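To make the effect of ^ and $ concrete, here is a small Python sketch that checks the phone number pattern with and without boundaries against the sample strings from above. The variable names are mine, and Python is simply the language I picked for the illustration.

```python
import re

# Without boundaries the number may appear anywhere in the text.
pattern_loose = re.compile(r"555-\d{8}")
# With boundaries the number has to be the whole text.
pattern_strict = re.compile(r"^555-\d{8}$")

samples = [
    "555-12345678",
    "someTextUpFront 555-12345678",
    "555-12345678 my phonenumber",
]

loose = [bool(pattern_loose.search(s)) for s in samples]
strict = [bool(pattern_strict.search(s)) for s in samples]

print(loose)   # [True, True, True]
print(strict)  # [True, False, False]
```

One Python-specific detail: $ also matches just before a newline at the very end of the text. If that matters for your input, `re.fullmatch` is the stricter option.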
There is one more boundary, the word boundary (\b). Let's use an example. I want to find the words for and she.
(for|she) # match for or she
This matches all the instances of the words for and she. But in the example text of the regex tester it also matches the for in before. I didn't want that. I could try to look for instances that are preceded and followed by a space.
[ ](for|she)[ ] # match for or she, surrounded by spaces
This no longer matches before - great! But if you look closely we have created a new problem. In the text there is the sentence "for she had plenty of time", and our new regex doesn't match the she in that sentence. This makes perfect sense as soon as you see why.
for she had plenty of time
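When for was matched, the space after it was matched too. That same space is the one that precedes "she", and since it has already been used up, she can no longer be matched. The only way we can achieve our goal is by using the word boundary: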
\b(for|she)\b # match for or she
The word boundary here defines where the word begins and ends, much like the previous boundaries defined the start and end of the text. I said earlier that 'Boundaries are special because they don't take up a space'. With the example above we match exactly for or she. We don't actually match the \b. No text is consumed, unlike with the [ ] alternative. This is an important distinction, because with all the other regex operators we have seen, something out of your text is used up and can't be matched again.
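Here is a short Python sketch comparing the three attempts side by side. The sample sentence is one I made up around the "for she had plenty of time" phrase, so that it also contains the word before; it is not the text from the regex tester.

```python
import re

# Made-up sample text (an assumption for this sketch).
text = "and for she had plenty of time before"

# No boundaries: also finds the "for" hidden inside "before".
plain = re.findall(r"(for|she)", text)

# Spaces on both sides: the space after "for" is consumed by the first
# match, so it is no longer available as the space in front of "she".
spaced = re.findall(r"[ ](for|she)[ ]", text)

# Word boundaries: nothing is consumed around the words.
bounded = re.findall(r"\b(for|she)\b", text)

print(plain)    # ['for', 'she', 'for']
print(spaced)   # ['for']
print(bounded)  # ['for', 'she']
```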
If you got here, congratulations. I have kept the best for last, because the one area where I personally save the most time is find and replace.
This can't be tested in the regex tester, so you will need an editor or IDE with regex find and replace (Eclipse or Notepad++, for example).
Let's say we have a file that contains 100s of lines like this
31-01-10_backup32
24-01-10_backup1
24-02-10_backup_mona
11-03-09_backup_lisa
And we need to restructure the lines so that the month comes before the day, with a full 4 digit year.
\d{2}-\d{2}-\d{2}_backup.* // would match the lines
But we want to be able to take out specific parts. So we need to use the grouping operator () we have already come across
(\d{2})-(\d{2})-(\d{2})_backup(.*) // would match the lines
Then all we need to do is replace the lines with
{Group2}-{Group1}-20{Group3}_backup{Group4}
Which in the replace field looks like this (some editors, Eclipse for one, write the groups as $1, $2 and so on instead of \1, \2):
\2-\1-20\3_backup\4
And it's as easy as that.
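If you would rather script the replacement than do it in an editor, the same idea can be sketched in Python with `re.sub`, which happens to accept the same \1, \2 group references in its replacement string. The list of lines and the variable names are just for illustration.

```python
import re

# The sample lines from above.
lines = [
    "31-01-10_backup32",
    "24-01-10_backup1",
    "24-02-10_backup_mona",
    "11-03-09_backup_lisa",
]

# Same pattern as above: three 2-digit groups plus whatever follows "_backup".
pattern = re.compile(r"(\d{2})-(\d{2})-(\d{2})_backup(.*)")

# Swap the first two groups and put "20" in front of the third.
converted = [pattern.sub(r"\2-\1-20\3_backup\4", line) for line in lines]

for line in converted:
    print(line)
# 01-31-2010_backup32
# 01-24-2010_backup1
# 02-24-2010_backup_mona
# 03-11-2009_backup_lisa
```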
Best of luck.