Expressions of Delight

REALbasic’s Regex class

By Matt Neuburg

Matt Neuburg first learned about regular expressions while using the Nisus word processor in 1990, and hasn’t had a good night’s sleep since. This article was originally published in REALbasic Developer Issue 1.1 (Aug/Sept 2002).

The Regex class, along with the RegexMatch, RegexOptions, and RegexException classes, is the gateway to REALbasic’s implementation of regular expressions. Regular expressions are a way of expressing a textual find or find-and-replace that’s too complicated, or too vague, for a function like InStr or Replace. To see what I mean, let’s take an example.

Suppose you’ve got a string representing some HTML, and you want to remove all the HTML markup from it. An HTML tag starts with a left angle-bracket and ends with a right angle-bracket; the tag consists of both angle-brackets, and everything in between, like this: “<TAG>”. The trouble is, of course, that you don’t know in advance what “everything in between” consists of.

So how would you find and remove an HTML tag using just InStr? You’d have to take a piecemeal approach. First you’d have to find a left angle-bracket, and remember where it is. Then you’d look for a right angle-bracket that comes after the left angle-bracket, and remember where it is. Then you’d have to break up the string into three pieces – what precedes the tag, the tag itself, and what follows the tag – and reassemble it without the middle piece, thus deleting the tag. Here’s some actual code.

Manual removal of HTML tag:

  dim s, leftPart, rightPart as string
  dim starting, ending as integer
  s = // whatever
  starting = instr(s, "<")
  if starting > 0 then
    ending = instr(starting, s, ">")
    if ending > 0 then
      leftPart = mid(s,1,starting-1)
      rightPart = mid(s,ending+1)
      s = leftPart + rightPart
    end
  end

That’s not horrible – I purposely chose a simple example to start with – but it’s not very pleasant either. The code is fairly illegible, and it feels like we’re working much too hard. The notion “a left angle-bracket, the following right angle-bracket, and everything in between” seems simple enough. Yet we can’t express it with InStr, so we have to implement it as a succession of finds plus a sort of brute-force replacement, all of which is ugly, error-prone, tedious, and not very general. And imagine having to extend this into a loop that removes every HTML tag! Not much fun. Using regular expression syntax, however, we can express the notion very simply, like this: <.*>

Regular expression syntax uses some symbolism that may at first be strange to you. But you can probably guess what’s going on in this particular expression. The angle-brackets mean angle-brackets, and the dot-asterisk means “everything in between.” That’s all there is to it. To demonstrate this expression in action, here’s some actual code for removing an HTML tag from a string using the Regex class.

Example 1: Regex removal of HTML tag

  dim s as string, r as regex
  r = new regex
  s = // whatever
  r.options.greedy = false // disable greediness
  r.searchPattern = "<.*>"
  r.replacementPattern = ""
  s = r.replace(s)

I hope the cleanliness and elegance of that example feel sufficiently compelling that you’ll want to learn more – because, make no mistake, there is a learning curve to using regular expressions. The good news, though, is that regular expression syntax has achieved near universality in the computer world; and REALbasic’s implementation of regular expressions is based on a widely used freeware code library called PCRE (Perl-compatible regular expressions). Once you’ve learned how to use regular expressions in REALbasic, you’ve also learned how to use them in BBEdit, JavaScript, Perl, Python, and PHP; regular expressions are also used (with a slightly different syntax) in Nisus Writer and Microsoft Word. So they are certainly worth learning about.

This article can’t teach you all about regular expressions; the subject is huge. I strongly recommend Jeffrey Friedl’s book, Mastering Regular Expressions, from O’Reilly & Associates; and REALbasic’s online help for Regex provides a complete guide to the syntactical details. What I’ll do here is introduce you to regular expressions, and explain how you use them through REALbasic’s Regex-related classes.

Search Expression

There are two kinds of regular expression: the search expression, describing the text you want to look for, and the replace expression, describing the text you want to replace the found text with (if any). The search expression is far more powerful and, since you’ll always need one, more important.

Before we start, I have to tell you the basic rule of how regular expression syntax works. In a regular expression, certain characters, such as the dot and the asterisk in Example 1, have special meaning. The basic rule of regular expressions is that if a character has no special meaning, it just represents itself in the normal way, like the angle brackets in Example 1. If a character does have special meaning and you want to use it to represent itself, without that special meaning, you “escape” it by putting a backslash (\) in front of it; for example, \* means an ordinary asterisk. Also, if a character does not have special meaning, you can sometimes give it special meaning by putting a backslash in front of it. For example, r is just a normal “r”, but \r means a return character. I know this sounds confusing, but it will be clearer when we look at some more examples, and if I don’t tell you about it up front we can’t get started at all.
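
To make the escaping rule concrete, here’s a minimal sketch that borrows the replace technique from Example 1. The sample string is made up for illustration. Because the asterisk is escaped, the search expression means an ordinary asterisk and nothing more.

Sketch: escaping a metacharacter

  dim s as string, r as regex
  r = new regex
  s = "2*3" // made-up sample string
  r.searchPattern = "\*" // backslash-asterisk: an ordinary asterisk
  r.replacementPattern = " times "
  s = r.replace(s)
  // expected result: "2 times 3"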

Now then. I think the best way to approach an understanding of regular expressions is to consider that when we use InStr, every character in the search expression represents an exact match, whereas the power of a regular search expression lies largely in its ability to be deliberately vague about what we’re looking for. There are two main kinds of thing you get to be vague about: what individual characters to look for, and how many characters to look for. We’ll take these in turn.

We begin with vague individual characters. What we want here is to make a single character in the search expression stand for more than one possible character to look for. To do so, we list the acceptable possibilities inside square brackets. This is called a character set. For example, [aeiou] means a single character that might be a or e or i or o or u. This notation would get tedious if there were lots of acceptable possibilities, so you can invert it by making the first character inside the square brackets a caret; now you’ve got a list of all the unacceptable possibilities. So, [^aeiou] means any single character that isn’t a or e or i or o or u.

Also, a range of characters, using ASCII order, can be specified by putting a hyphen between the first and last characters of the range; so, [0-9] means any numeric digit, and [0-9A-F] means any character that might be used as a hexadecimal digit. This is still rather tedious when a character set is very frequently used, so the syntax defines some character sets for you in advance. For example, instead of [0-9] you can just say \d and instead of [0-9a-zA-Z_] you can say \w. And a dot (.) means any character at all (though by default, as in Perl, it stops short of a line ending).

Now let’s turn to vague quantities of characters. To specify that a character can occur a vague number of times, you put a metacharacter after the character that is to be repeated. For example, a plus sign (+) means that the preceding character must appear at least once but can occur more times than that, in succession. Note that this doesn’t mean that the very same character has to appear several times in succession, because the character to be repeated might be a vague character. For example, [aeiou]+ means any stretch of any vowels, and will find the “eau” in “beautiful”. A question mark (?) means that the preceding character may appear once or perhaps not at all, and an asterisk (*) means that the preceding character may appear once, not at all, or any number of times in succession. Now you can understand how we found an HTML tag with the expression <.*> earlier; it means a left angle-bracket, a right angle-bracket, and any characters in any quantity between them.
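
Here’s a small sketch that puts a predefined character set and a quantity together, again using only the replace technique from Example 1. The sample string is made up, and the sketch assumes the Regex instance’s options are left at their defaults.

Sketch: a vague character repeated

  dim s as string, r as regex
  r = new regex
  s = "Room 101" // made-up sample string
  r.searchPattern = "\d+" // one or more digits in succession
  r.replacementPattern = "#"
  s = r.replace(s)
  // expected result: "Room #"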

When specifying vague quantities, you must be concerned about “greediness”. A greedy search is one that matches the largest stretch it can find. Recall that in Example 1, we disabled greediness before starting the search. To see why, imagine searching a string with two HTML tags in it. A greedy search finds everything from the start of the first HTML tag to the end of the second HTML tag. A non-greedy search finds just the first HTML tag, as desired.
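
The difference is easy to see in a sketch. This one runs the search expression from Example 1 twice over a made-up string containing two tags, changing only the Greedy option. The variable names are just for illustration.

Sketch: greedy versus non-greedy

  dim r as regex
  dim s, greedyResult, nonGreedyResult as string
  r = new regex
  s = "<B>bold</B>" // made-up sample string
  r.searchPattern = "<.*>"
  r.replacementPattern = ""
  r.options.greedy = true
  greedyResult = r.replace(s)
  // the match runs from the first "<" to the last ">",
  // so the expected result is the empty string
  r.options.greedy = false
  nonGreedyResult = r.replace(s)
  // only "<B>" is matched; expected result is "bold</B>"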

Performing the Search

To perform a search, you need three things: a string to look inside, a regular search expression, and an instance of the Regex class. You hand the Regex instance the search expression as its SearchPattern property, and then send the Regex instance the Search message. If you provide just one parameter for the Search message, that parameter is the string to look inside, and the search will start at the first character of the string. If you provide two parameters, the second parameter is the index of the character where the search should start. The character indexing is zero-based! This contrasts unfortunately with InStr, where character indexing is one-based.

The value returned when you send a Regex instance the Search message is an instance of the RegexMatch class. You consult this instance to learn about the results of the find. If the find failed, the RegexMatch instance will be nil. If the find succeeded, the RegexMatch’s SubexpressionString(0) property will contain the matched substring, and its SubexpressionStart(0) property will contain the index (zero-based again) of the first character of the matched substring within the original string. You’re probably wondering what the “(0)” means here, but don’t worry about it for now; just trust me.
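
As a minimal sketch of a single search, here is the “eau” in “beautiful” again, this time with a check for nil before we consult the RegexMatch instance.

Sketch: a single search

  dim r as regex, m as regexMatch
  r = new regex
  r.searchPattern = "[aeiou]+"
  m = r.search("beautiful")
  if m <> nil then
    // the matched substring and its zero-based position
    msgbox m.subexpressionString(0) + " at " + str(m.subexpressionStart(0))
    // expected: "eau at 1"
  else
    msgbox "no match"
  end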

To illustrate, here’s a code snippet that counts the number of runs of vowels in a string. We do this by finding vowel runs in successive iterations of a loop until the find fails. Each time through the loop, we use the start position and length of the previously matched substring to determine where to start the next search.

Example 2: Successive searches

  dim s as string, r as regex, m as regexmatch
  dim i, count as integer
  r = new regex
  s = "beautification"
  // has 5 vowel runs: "eau", "i", "i", "a", "io"
  r.searchPattern = "[aeiou]+"
  do
    m = r.search(s,i)
    if m <> nil then
      count = count + 1
      i = m.subexpressionstart(0)
      i = i + len(m.subexpressionstring(0))
    end
  loop until m = nil
  msgbox str(count)
  // result is 5, the right answer

It seems wasteful to have to supply the search string as a parameter to the Search message every time through the loop, when that string isn’t changing. To save us from having to do this, the Regex class permits a different way of using the Search message. Having performed the search once, so that the Regex instance knows what the search string is, we set the starting position for the next search using the Regex instance’s SearchStartPosition property and send it the Search message with no parameters at all. We can rewrite Example 2 to use this syntax.

Example 3: Successive searches, alternate syntax

  dim s as string, r as regex, m as regexmatch
  dim i, count as integer
  r = new regex
  s = "beautification"
  r.searchPattern = "[aeiou]+"
  m = r.search(s)
  while m <> nil
    count = count + 1
    i = m.subexpressionstart(0)
    i = i + len(m.subexpressionstring(0))
    r.searchStartPosition = i
    m = r.search()
  wend
  msgbox str(count)

Parentheses and Subexpressions

In a regular search expression, parentheses have two functions. One is simply to group things. For example, you might want to use a plus-sign to indicate a repetition, not of a single character, but of a more extended regular expression. To do so, you’d group what precedes the plus-sign in parentheses; the plus-sign would then apply to the group as a whole. Thus the expression (p+e)+r would match “pepper”: first we match the single “p” and the “e” that follows it; then we try to do it again, and we succeed, matching the double “p” and the “e” that follows it; then we try to do it again, and we fail; so we look to see if an “r” follows, and it does, so we stop with a successful match.

The other use of parentheses in a search expression is to demarcate a substring of whatever the search expression finds. This is useful, for instance, when you have to make a search expression where what interests you about the result is not the entire found string but only a certain part of it. For example, suppose you want to find a word starting with “anti”, but you don’t care about the whole word; you’re interested in what’s being opposed, so what you really want to know is what follows the “anti”. The regular search expression anti(\w*) will perform the search; but what do the parentheses do here? They allow us to refer to the relevant substring of whatever is found. The stuff found by what’s in the parentheses of a search expression is called a subexpression of the result. In this case, it is subexpression 1, and you can extract it from the resulting RegexMatch instance with the SubexpressionString(1) property and get its position with the SubexpressionStart(1) property.

Example 4: Subexpression

  dim r as regex, m as regexMatch
  r = new regex
  r.searchPattern = "anti(\w*)"
  m = r.search("The antithesis of synthesis.")
  msgbox m.subexpressionString(1)
  // result is "thesis"

You can now understand what the “(0)” is for in Examples 2 and 3. In a RegexMatch instance, subexpression 0 is the entire match result. Any other subexpressions are parts of the result demarcated by parentheses in the search expression. The rule for how they are numbered is simple: just count left-parentheses from left to right. So, for example, if the search expression were ((anti)(\w*)), then the material following the “anti” would be subexpression 3.
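
To check the numbering rule for yourself, here’s a sketch that runs that expression against the string from Example 4 and displays each subexpression in turn.

Sketch: subexpression numbering

  dim r as regex, m as regexMatch
  r = new regex
  r.searchPattern = "((anti)(\w*))"
  m = r.search("The antithesis of synthesis.")
  if m <> nil then
    msgbox m.subexpressionString(0) // "antithesis", the entire match
    msgbox m.subexpressionString(1) // "antithesis", the outer parentheses
    msgbox m.subexpressionString(2) // "anti"
    msgbox m.subexpressionString(3) // "thesis"
  end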

Another interesting use of subexpressions is to refer to them within the search expression. This is done by number, preceded by a backslash; so, \3 means subexpression 3. You use this to search for material containing repetition at a distance. For example, the expression \w*(\w)\w*\1\w* looks for a word containing the same character twice; it matches “metre” which contains two e’s, but it doesn’t match “metric” where no two letters are the same.
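
Here’s that expression as a short sketch, tried against both words. Since a failed search yields nil, testing for nil is all we need.

Sketch: subexpression reference within the search

  dim r as regex, m as regexMatch
  r = new regex
  r.searchPattern = "\w*(\w)\w*\1\w*"
  m = r.search("metre")
  if m <> nil then
    msgbox "metre repeats a letter" // expected to appear
  end
  m = r.search("metric")
  if m = nil then
    msgbox "metric does not" // expected to appear
  end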

Search and Replace

To perform a search-and-replace, you assign a regular replace expression to the Regex instance’s ReplacementPattern property and send it the Replace message. Regular replace expressions can refer to subexpressions, using the same \3 notation we just talked about; otherwise they are pretty much just ordinary strings. The syntax of the Replace message is just like that of the Search message. The result is a string where the replacement has been inserted into the original in place of the found match.

Example 1 has already provided an illustration of search-and-replace. Here’s another, using a subexpression reference. Suppose we’ve got a string from an email message where the author indicated emphasis by surrounding words with asterisks. We want to turn this to HTML; we will find a stretch surrounded by asterisks and replace the asterisks with “<B>” and “</B>”.

Example 5: Search-and-replace

  dim r as regex, s as string
  r = new regex
  s = "This is a *very* important message."
  r.searchPattern = "\*(.*)\*"
  r.replacementPattern = "<B>\1</B>"
  r.options.greedy = false
  s = r.replace(s) 
  // result is "This is a <B>very</B> important message."

Example 5 is very typical of a regular expression search-and-replace. We look for a stretch of text consisting of two asterisks and everything in between, using syntax we’re now very familiar with. (Notice the use of a backslash to show we mean an actual asterisk.) But we’re only interested in what’s between the asterisks; we want to throw away the asterisks themselves. So in the search expression we demarcate what comes between the asterisks as a subexpression. That way we can refer to it in the replace expression, which consists of “<B>” and “</B>” surrounding whatever turned out to be between the asterisks. The result, in this particular case, is the replacement string “<B>very</B>”. That whole replacement string then replaces the whole found string in the original string; thus, we end up with just what we started with except that the asterisks are gone and the HTML markup is inserted.

We can take this even further, finding and replacing with HTML all asterisked expressions in a string that contains several of them. It’s simply a matter of inserting one line before performing the replace, specifying that the replace should be global:

Example 5a: Search-and-replace, global

  …
  s = "This *is* a *very* important message."
  …
  r.options.replaceAllMatches = true
  …
  // result is "This <B>is</B> a <B>very</B> important message."
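
The same option takes care of the loop we dreaded at the start, the one that removes every HTML tag. Here’s a sketch: it is Example 1 with a made-up sample string and the global option added.

Sketch: Regex removal of all HTML tags

  dim s as string, r as regex
  r = new regex
  s = "<P>This is <B>very</B> important.</P>" // made-up sample string
  r.options.greedy = false // each match stops at the nearest ">"
  r.options.replaceAllMatches = true // replace every match, not just the first
  r.searchPattern = "<.*>"
  r.replacementPattern = ""
  s = r.replace(s)
  // expected result: "This is very important."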

Another way to do a replace is to perform a search and then send the Replace message to the resulting RegexMatch instance. You can use a regular replace expression as parameter, or omit the parameter if you already set the Regex instance’s ReplacementPattern property. The result is the replacement string alone, not the replacement string inserted into the original. In this example, we look for a seven-digit phone number in any of several forms – “555-1234”, “5551234”, or “555 1234” – and render it into canonical form using a hyphen.

Example 6: Search-and-replace, extracted

  dim r as regex, m as regexMatch
  dim s, canonical as string
  s = "The phone number is 5437890."
  r = new regex
  r.searchPattern = "(\d\d\d)([- ]?)(\d\d\d\d)"
  m = r.search(s)
  canonical = m.replace("\1-\3")
  // result is "543-7890"

Since the search result remains sitting in the RegexMatch instance, you can now proceed to perform a different replacement without performing the entire search over again.

Example 6a: Search-and-replace, extracted, repeated

  dim r as regex, m as regexMatch
  dim s, canonical, noncanonical as string
  s = "The phone number is 5437890."
  r = new regex
  r.searchPattern = "(\d\d\d)([- ]?)(\d\d\d\d)"
  m = r.search(s)
  canonical = m.replace("\1-\3")
  // result is "543-7890"
  noncanonical = m.replace("\1 \3")
  // result is "543 7890"

You can also take advantage of the Perl operators $` and $' as metacharacters in a regular replace expression; they stand for the part of the original string respectively preceding and following the found string.

Example 7: Prematch operator

  dim r as regex, m as regexMatch
  dim s, prematch as string
  s = "The phone number is 5437890."
  r = new regex
  r.searchPattern = "(\d\d\d)([- ]?)(\d\d\d\d)"
  m = r.search(s)
  prematch = m.replace("$`")
  // result is "The phone number is "

Options

We had occasion in the preceding examples to refer to the Options property of a Regex instance. This is an instance of the RegexOptions class, which has various properties whose values determine the behavior of subsequent searches. You can examine the online help for RegexOptions to study these properties. Besides greediness and whether a search-and-replace should be global, there’s a setting for case sensitivity. There are also a number of settings having to do with the treatment of line endings. The reason for these is partly that different platforms use different line-ending characters, and partly that regular expressions were developed in a context of being applied to just one line (or paragraph) of text at a time, which might or might not be the behavior you want.
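
As a sketch of one more option, here is the case sensitivity setting at work. I’m taking the property name to be CaseSensitive, as listed in the online help for RegexOptions, and setting it explicitly both ways so as not to rely on its default. The sample string is made up.

Sketch: case sensitivity

  dim r as regex, m as regexMatch
  r = new regex
  r.searchPattern = "anti(\w*)"
  r.options.caseSensitive = false
  m = r.search("ANTIDOTE")
  // m should not be nil; subexpression 1 should be "DOTE"
  r.options.caseSensitive = true
  m = r.search("ANTIDOTE")
  // m should now be nil, since "anti" no longer matches "ANTI"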

Exceptions

If you supply a search expression that can’t be parsed as a valid regular expression, REALbasic will throw an exception when you send the Search or Replace message. This will be an instance of the RegexException class, and you can learn more about the details of the problem through its Message property. Of course, you can also get a NilObjectException by trying to extract property values from a nil RegexMatch instance generated by an unsuccessful search; and trying to extract a SubexpressionString or a SubexpressionStart using an index too high for the number of subexpressions in the search expression will generate an OutOfBoundsException.
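
Here’s a sketch of guarding against both problems. It assumes the code is the entire body of a method, because an exception block has to come at the end. The unbalanced parenthesis in the search expression is deliberate, to provoke the RegexException.

Sketch: catching a RegexException

  dim r as regex, m as regexMatch
  r = new regex
  r.searchPattern = "(unclosed" // deliberately invalid
  m = r.search("some text")
  if m <> nil then // guard against a NilObjectException
    msgbox m.subexpressionString(0)
  end
  exception err as regexException
    msgbox "Bad search expression: " + err.message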

Expression Yourself

This discussion has not enumerated every aspect of regular expressions, but it has introduced the high points, and you now know enough to get started with REALbasic’s implementation of regular expressions. You shouldn’t have much difficulty understanding the list of metacharacters in the online help for Regex, and you’re well prepared, with a little study, a little thought, and a little experimentation, to do much of what can be done with regular expressions.

Here are a couple of warnings. First, regular expressions take practice. The subject is a deep one, and adepts pride themselves on their ability to construct long, powerful, incomprehensible search expressions. Second, don’t imagine that any single regular expression can solve every search problem. There’s nothing wrong with breaking down a problem into several searches, and often that’s the best solution. Neither of these matters has anything to do with REALbasic itself! Remember, REALbasic simply implements a standard form of regular expression syntax.

Now get out there and search some text!