In this article, Darren explains a simple way to use regular expressions in .NET. If you are confused about how regular expressions work, this article is for you.
Although I was familiar enough with the basic concepts of regular expressions to use them in VBScript and JScript, I noticed that I was struggling to understand many regular expressions I found in examples and documentation. Some of the new features such as lookaround and named capturing left me feeling more than a little overwhelmed. In addition to this, the documentation for regular expressions was scant and quite often with little or no sample code. Because of this, I initially steered away from using regular expressions in my .NET projects altogether.
In this article I hope to highlight some of these new areas and demystify them so that you won't find yourself in the position that I did.
Matching: Groups and Named Captures
From previous regular expression authoring you will likely be familiar with the concept of referencing parenthesized captures via the $1...$N notation - these are referred to as backreferences. To demonstrate this, consider the following VB.NET sample:
Dim userName As String = "Neimke, Darren"
Dim re As New RegEx( "(\w+),\s(\w+)" )
userName = re.Replace( userName, "$2 $1" )
Response.Write( userName )
The above pattern matches two words separated by a comma and a space, captures the surname and the firstname of a user and formats them in firstname, surname order. The result is that the value "Darren Neimke" would be displayed in the browser.
In the Replace statement the $N notation refers to the Nth group of parentheses (captures). An important point to note is that, in .NET, the zeroth element ($0) refers to the entire matched text - "Neimke, Darren" in the case of the above example.
The Regex class now offers some convenient shared (static) members that allow simple statements to be in-lined, reducing the need for unnecessarily bulky code structures such as the one shown above. The useful shared members are IsMatch, Match, Matches, Replace and Split. Using this syntax allows the previous code to be reduced to:
Dim userName As String = "Neimke, Darren"
userName = Regex.Replace( userName, "(\w+),\s(\w+)", "$2 $1" )
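The other shared members follow the same shape. Here is a minimal sketch of Split and Matches, assuming a console program rather than a page; the module name, function names and sample strings are made up for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StaticMembersSketch
    ' Splits a comma-separated list, tolerating spaces around the commas.
    Function SplitList(ByVal input As String) As String()
        Return Regex.Split(input, "\s*,\s*")
    End Function

    ' Counts the runs of word characters in the input.
    Function CountWords(ByVal input As String) As Integer
        Return Regex.Matches(input, "\w+").Count
    End Function

    Sub Main()
        For Each part As String In SplitList("red , green,blue")
            Console.WriteLine(part)                     ' red, green, blue (one per line)
        Next
        Console.WriteLine(CountWords("Neimke, Darren")) ' 2
    End Sub
End Module
```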
The benefit of the shorter form shows again in the next example, which uses IsMatch() to check that a string is a decimal number before executing some code:
If Regex.IsMatch( userInputString, "^\d+(\.\d+)?$" ) Then
    ' perform some conversion and math operations here
End If
Prior to .NET, a regular expression Match object contained a collection of SubMatches. This has remained the same in .NET, although they are now referred to as Groups. Groups is a collection property of a Match object, and each captured group can be accessed via its index (remembering that index 0 refers to the entire match), like so:
Dim userName As String = "Neimke, Darren"
Response.Write( Regex.Match( userName, "(\w+),\s(\w+)" ).Groups(2).ToString() )
This would display the text "Darren" as it is the captured Group at index 2.
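To see how the indexes line up, the groups of a match can simply be enumerated. This is a minimal sketch, assuming a console program; the module and function names are hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupsSketch
    ' Returns every group of the first match, index 0 (the whole match) included.
    Function GroupValues(ByVal input As String, ByVal pattern As String) As String()
        Dim m As Match = Regex.Match(input, pattern)
        Dim values(m.Groups.Count - 1) As String
        For i As Integer = 0 To m.Groups.Count - 1
            values(i) = m.Groups(i).Value
        Next
        Return values
    End Function

    Sub Main()
        Dim values As String() = GroupValues("Neimke, Darren", "(\w+),\s(\w+)")
        For i As Integer = 0 To values.Length - 1
            Console.WriteLine("{0}: {1}", i, values(i))
        Next
        ' 0: Neimke, Darren
        ' 1: Neimke
        ' 2: Darren
    End Sub
End Module
```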
Named Captures
Additionally Groups can be assigned names via the new (?<nameOfGroup>...) or (?'nameOfGroup'...) syntax. For consistency with other flavors of regular expressions - such as Perl - I prefer the first syntax and it is the one that is most commonly used. Assigning names to groups helps to make your code more self-describing and can lead to improved maintainability. Here's an example of naming the two captures:
Dim userName As String = "Neimke, Darren"
Dim pattern As String = "(?<surname>\w+),\s(?<firstname>\w+)"
Response.Write( Regex.Match( userName, pattern ).Groups("firstname").ToString() )
Displays "Darren".
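Named groups can also be referenced from a replacement string with the ${name} notation, the named counterpart of $1...$N. A minimal sketch, assuming a console program; the module and function names are hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedReplaceSketch
    ' Reorders "Surname, Firstname" using named groups in the replacement string.
    Function ToFirstLast(ByVal name As String) As String
        Dim pattern As String = "(?<surname>\w+),\s(?<firstname>\w+)"
        Return Regex.Replace(name, pattern, "${firstname} ${surname}")
    End Function

    Sub Main()
        Console.WriteLine(ToFirstLast("Neimke, Darren")) ' Darren Neimke
    End Sub
End Module
```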
Non-Capturing
While captures provide a lot of power, they come at a cost in performance. With regular expressions in VBScript and JScript, capturing occurred whenever you used parentheses in a regular expression pattern. Sometimes, though, you need to use parentheses but you don't need capturing. For example, if you wanted to match either "Let's go this way" or "Let's go that way" you could use the following regular expression:
Let\'s go th(is|at) way
The parentheses with the pipe indicate an option. The pattern matches either "is" or "at" after the "th". Unfortunately, this regular expression incurs an unneeded performance hit because the captured text (either "is" or "at") is remembered via a backreference.
Fortunately, .NET regular expressions provide the (?:...) syntax, which allows for grouping to be done without incurring the performance hit of captured text being "remembered" as a backreference. Using this syntax, the above regular expression could be changed to:
Let\'s go th(?:is|at) way
That pattern would match either:
"Let's go this way"
"Let's go that way"
The resulting Match, however, would contain only one group: the entire match, referenced as Groups(0). The savings can add up, especially when complex patterns are applied to even moderately large bodies of text.
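The difference is easy to confirm by counting the groups each pattern produces. A minimal sketch, assuming a console program; the module and function names are hypothetical, and it shows only the group count, not the performance difference:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NonCapturingSketch
    ' Returns how many groups (including group 0) a match of the pattern holds.
    Function GroupCount(ByVal pattern As String) As Integer
        Return Regex.Match("Let's go this way", pattern).Groups.Count
    End Function

    Sub Main()
        Console.WriteLine(GroupCount("Let's go th(is|at) way"))   ' 2
        Console.WriteLine(GroupCount("Let's go th(?:is|at) way")) ' 1
    End Sub
End Module
```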
Lookaround
Lookaround is a feature that is partially implemented in JScript but not in VBScript. There are two directions of lookaround - lookahead and lookbehind - and two flavors of each direction - positive assertion and negative assertion. The syntax for each is:
(?=...) - Positive lookAHEAD
(?!...) - Negative lookAHEAD
(?<=...) - Positive lookBEHIND
(?<!...) - Negative lookBEHIND
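As a quick illustration of the four forms, the sketch below runs one pattern of each kind over the same sample string. It assumes a console program; the module name, the FindAll helper and the sample text are invented for this example:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LookaroundSketch
    ' Joins every match of the pattern in the input with commas.
    Function FindAll(ByVal input As String, ByVal pattern As String) As String
        Dim found As New List(Of String)()
        For Each m As Match In Regex.Matches(input, pattern)
            found.Add(m.Value)
        Next
        Return String.Join(",", found.ToArray())
    End Function

    Sub Main()
        Dim sample As String = "cost: $42, 17 items, 99%"
        Console.WriteLine(FindAll(sample, "\d+(?=%)"))     ' positive lookahead:  99
        Console.WriteLine(FindAll(sample, "\b\d+\b(?!%)")) ' negative lookahead:  42,17
        Console.WriteLine(FindAll(sample, "(?<=\$)\d+"))   ' positive lookbehind: 42
        Console.WriteLine(FindAll(sample, "(?<!\$)\b\d+")) ' negative lookbehind: 17,99
    End Sub
End Module
```

In every case only the digits end up in the match; the "$" and "%" are inspected but never consumed.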
Understanding look(ahead|behind) requires an understanding of the difference between matching text and matching position. To help with this understanding I should state first that lookaround assertions are non-consuming. To see what I mean, let's look at the following simple example.
pattern = "test"
text = "testing"
When the above pattern is applied to the text the "context" of the parser sits at a position in the text between the "t" and the "i" in the word testing. This is because the regular expression parser bumps along the string as it gets a match, like so:
Start - ^testing
Match "t" - t^esting
Match "e" - te^sting
Match "s" - tes^ting
Match "t" - test^ing
Once the parser has consumed text as part of a match, that text is no longer available to the rest of the pattern. To understand where this causes difficulty, consider this: what if you needed to match the word "test" but only when it was contained in the word "tested" and not in any other combination such as "tester"? With lookahead you can simply assert that condition like so: (?=tested\b)test
This works because, with lookaround, the parser is not bumped along the string. This can be especially useful for finding a position in a document by combining a lookahead assertion with a lookbehind assertion. To demonstrate, let's say that we need to match the string "test" when it is contained within the string "protested" but not "detested". To do this you can make a negative lookbehind assertion on "de" and a positive lookahead assertion on "tested", like this: (?<!de)(?=tested\b)test
In other words, you are matching a position at which to start matching text. The above pattern would set the parser at the following position in the string "protested":
Start - pro^tested
Match "t" - prot^ested
Match "e" - prote^sted
Match "s" - protes^ted
Match "t" - protest^ed
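The position at which the match starts can be checked through Match.Index. A minimal sketch, assuming a console program; the module and function names are hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaroundPositionSketch
    ' Returns the index at which "test" matches, or -1 when the assertions fail.
    Function TestIndex(ByVal word As String) As Integer
        Dim m As Match = Regex.Match(word, "(?<!de)(?=tested\b)test")
        If m.Success Then Return m.Index
        Return -1
    End Function

    Sub Main()
        Console.WriteLine(TestIndex("protested")) ' 3
        Console.WriteLine(TestIndex("detested"))  ' -1
        Console.WriteLine(TestIndex("protester")) ' -1
    End Sub
End Module
```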
Another good example of using lookaround would be to validate "special" password conditions such as: "Password must be between 8 and 20 characters, must contain at least 2 letter characters and at least 2 digit characters. It can only contain either letter or digit characters."
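For such a password constraint, the following expression would probably do quite nicely: ^(?=.*?[0-9].*?[0-9])(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z0-9]{8,20}$ The explicit [a-zA-Z] and [0-9] classes matter here: \w also matches digits and the underscore, so it cannot enforce the "letter" rules.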
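Wrapped in a function, the rule might be checked as sketched below. This assumes a console program and assumes "letter" means an ASCII letter; the module, constant and function names are hypothetical. The two lookaheads each scan from the start of the string without consuming anything, and the final character class does the actual matching:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PasswordSketch
    ' 8 to 20 ASCII letters or digits, with at least two digits and at least two letters.
    ' Explicit classes are used because \w would also accept digits and the underscore.
    Private Const PasswordPattern As String = _
        "^(?=.*?[0-9].*?[0-9])(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z0-9]{8,20}$"

    Function IsValidPassword(ByVal candidate As String) As Boolean
        Return Regex.IsMatch(candidate, PasswordPattern)
    End Function

    Sub Main()
        Console.WriteLine(IsValidPassword("abc12345"))  ' True
        Console.WriteLine(IsValidPassword("abcdefg1"))  ' False (only one digit)
        Console.WriteLine(IsValidPassword("abcd_1234")) ' False (underscore not allowed)
        Console.WriteLine(IsValidPassword("ab12"))      ' False (too short)
    End Sub
End Module
```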
Readability and Maintainability
One of my personal favorite new features is the ability to have embedded comments in regular expressions. Most of us will have, at one time or another come across a regular expression that looks somewhat like this:
Dim re As New Regex( "(?<=(#|@))(?=\w+)\w+\b", RegexOptions.Multiline )
If you are lucky you might find a comment that alludes to the purpose of the regular expression. When the time comes to maintain the expression, though, you are left with a sense of anxiety and, more often than not, a complete rewrite is undertaken rather than some minor maintenance operation. .NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace option and the (?#...) syntax embedded within each line of the pattern string.
This allows pseudo-code-like comments to be embedded in each line and has the following effect on readability:
' The # is escaped because IgnorePatternWhitespace treats an unescaped # as the start of a comment
Dim re As New Regex( _
    "(?<=       (?# Start a positive lookBEHIND assertion ) " & _
    "(\#|@)     (?# Find a # or a @ symbol ) " & _
    ")          (?# End the lookBEHIND assertion ) " & _
    "(?=        (?# Start a positive lookAHEAD assertion ) " & _
    "   \w+     (?# Find at least one word character ) " & _
    ")          (?# End the lookAHEAD assertion ) " & _
    "\w+\b      (?# Match multiple word characters leading up to a word boundary)", _
    RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace _
)
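A cut-down, runnable variant of a commented pattern is sketched below, assuming a console program; the module name, function name and sample text are invented. It puts the # inside a character class, where it is taken literally even when IgnorePatternWhitespace is on:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CommentedPatternSketch
    ' Finds the first word that directly follows a # or an @ in the input.
    Function FirstTag(ByVal input As String) As String
        Dim pattern As String = _
            "(?<=[#@])   (?# preceded by a hash or an at sign ) " & _
            "\w+         (?# one or more word characters ) " & _
            "\b          (?# up to a word boundary ) "
        Return Regex.Match(input, pattern, RegexOptions.IgnorePatternWhitespace).Value
    End Function

    Sub Main()
        Console.WriteLine(FirstTag("ping @darren about #regex")) ' darren
    End Sub
End Module
```

When nothing matches, Match.Value is an empty string, so the function returns "" rather than failing.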
Delegates
Finally, a really useful addition to the .NET Framework is that the Regex.Replace() method allows the use of a delegate as the "replacement" argument. To understand what I'm talking about, consider the following snippet:
Dim myString As String = RegEx.Replace( "a true taste of the temperature", "t.*?e\b", "a" )
After the replace operation has occurred, the value of myString will be "a a a of a a" and it's fairly obvious what happened: every time the regular expression parser found a match within the string, it replaced it with the letter "a". That's all nice and easy if all you need to do is a straight replace, but what if you need to build some sort of business logic into the replacement, or you need to "touch" the sub-matches in some way and rebuild the replaced string?
A good enough example is converting all words within a body of text to proper case (i.e. first letter capitalized). To do this your first instincts might be to create a pattern like so: \b(\w)(\w+)?\b. You could then enumerate the matches, convert the first sub-match to its uppercase version, join the sub-matches and re-append them to a StringBuilder instance, like so:
Dim re As New Regex( "\b(\w)(\w+)?\b" )
Dim sb As New System.Text.StringBuilder()
Dim mc As MatchCollection = re.Matches( bodyOfText )
Dim m As Match
For Each m In mc
    sb.AppendFormat("{0}{1}", m.Groups(1).Value.ToUpper(), m.Groups(2).Value)
Next
That would work fine if your string contained only word characters, but what if it looked like this: ~~~ This %%% is ### a chunk of text. After rebuilding the string from the matches you would end up with ThisIsAChunkOfText, because every character that didn't participate in a match (spaces included) was dropped. There are ways around it, mostly by building bigger, more complex patterns and doing more string building inside the match collection iteration.
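The loss is easy to reproduce. The sketch below rebuilds the string from the matches alone; it assumes a console program, the module and function names are hypothetical, and ToUpperInvariant is used in place of ToUpper so the result does not depend on the current culture:

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module RebuildSketch
    ' Rebuilds the text from the matches only, so anything between matches is lost.
    Function ProperCaseFromMatches(ByVal bodyOfText As String) As String
        Dim sb As New StringBuilder()
        For Each m As Match In Regex.Matches(bodyOfText, "\b(\w)(\w+)?\b")
            sb.Append(m.Groups(1).Value.ToUpperInvariant()).Append(m.Groups(2).Value)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(ProperCaseFromMatches("~~~ This %%% is ### a chunk of text."))
        ' Prints: ThisIsAChunkOfText
    End Sub
End Module
```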
A more elegant solution is to wire up a MatchEvaluator delegate. You can think of a MatchEvaluator as an event handler that fires when an "OnMatch event" occurs. You provide the MatchEvaluator with a pointer (reference) to a handler function, and that function will be called each time a match is encountered. The function must take a Match as its single argument and must return a String to the regular expression Replace method that invoked it. This method of replacement gives you the flexibility to do all sorts of operations transparently to the Replace method itself and, because it is all handled within the Replace method call, you are not left having to rebuild a string as in the previous example.
A demonstration is in order - let's re-write our previous failed attempt at converting a string to proper case using delegates:
Sub Page_Load(sender As Object, e As EventArgs)
    Dim myDelegate As New MatchEvaluator( AddressOf MatchHandler )
    Dim bodyOfText As String = _
        "~~~ This %%% is ### a chunk of text."
    Dim pattern As String = "\b(\w)(\w+)?\b"
    Dim re As New Regex( _
        pattern, RegexOptions.Multiline Or _
        RegexOptions.IgnoreCase _
    )
    Dim newString As String = re.Replace(bodyOfText, myDelegate)
    Response.Write( bodyOfText & "<hr>" & newString )
End Sub
Private Function MatchHandler( ByVal m As Match ) As String
    ' Upper-case the first letter (group 1) and append the rest of the word (group 2)
    Return m.Groups(1).Value.ToUpper() & m.Groups(2).Value
End Function
As you can see, the separation is much cleaner and having the replacement logic handled in a separate handler method allows you to implement very complicated operations without affecting readability, maintainability or - and most importantly - data integrity as a result of missing data in a string re-building operation.
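The same mechanism covers the "business logic" case, where the replacement has to be computed from the matched text. A minimal sketch, assuming a console program; the module name, function names and sample text are made up, and the doubling rule simply stands in for whatever logic is needed:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EvaluatorSketch
    ' Doubles every whole number found in the input, leaving all other text untouched.
    Function DoubleNumbers(ByVal input As String) As String
        Dim evaluator As New MatchEvaluator(AddressOf DoubleMatch)
        Return Regex.Replace(input, "\d+", evaluator)
    End Function

    ' Called once per match; the returned string replaces the matched text.
    Private Function DoubleMatch(ByVal m As Match) As String
        Return (Integer.Parse(m.Value) * 2).ToString()
    End Function

    Sub Main()
        Console.WriteLine(DoubleNumbers("3 apples and 12 pears")) ' 6 apples and 24 pears
    End Sub
End Module
```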
Original Post from www.devarticles.com
By: Darren Neimke