JScript and VBScript regular expressions page 2/2

To match a string that contains a file name in which the period (.) is a literal part of the input, precede the period in the regular expression with a backslash. For example, the following JScript regular expression matches 'filename.ext':

/filename\.ext/
For VBScript, the equivalent expression is as follows:

"filename\.ext"
These expressions are still quite limited: so far you can match only a fixed character or any single character. In many cases it is more useful to match one character from a specified list. For example, if the input text contains chapter titles such as Chapter 1, Chapter 2, and so on, you may want to find those chapter titles.

Bracket expression
You can put one or more single characters inside square brackets ([ and ]) to create a list to match. A list enclosed in brackets is called a bracket expression. Inside brackets, as anywhere else, ordinary characters represent themselves; that is, they match one occurrence of themselves in the input text. Most special characters lose their meaning inside a bracket expression. Here are the exceptions:

The ']' character ends the list if it is not the first item. To match the ']' character in a list, place it first, immediately after the opening '['. Escaping it as '\]' also works, and is the safer choice in engines that follow ECMAScript, where '[]' is read as an empty list.

The '\' character remains the escape character. To match the '\' character, use '\\'.

The characters in a bracket expression match only a single character, at the position where the bracket expression appears in the regular expression. The following JScript regular expression matches 'Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', and 'Chapter 5':

/Chapter [12345]/
To match the same chapter title in VBScript, use the following expression:

"Chapter [12345]"
Note that the position of the word 'Chapter', the space that follows it, and the bracket expression is fixed. The bracket expression therefore specifies only the set of characters allowed in the single character position immediately after the word 'Chapter' and a space. In this case, that is the ninth character position.

To specify the characters to match as a range rather than listing them individually, separate the start and end characters of the range with a hyphen. The character value of each character determines its relative order within the range. The following JScript regular expression contains a range expression that is equivalent to the bracketed list shown above.

/Chapter [1-5]/
The equivalent expression in VBScript is:

"Chapter [1-5]"
When a range is specified in this way, both the start and end values are included in the range. Note that the start value must come before the end value in Unicode order.
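As a quick check of the list and range forms, here is a minimal sketch in modern JavaScript (an assumption; the article targets JScript, whose literal syntax is the same for these patterns). The sample titles and variable names are illustrative only.

```javascript
// Illustrative sample titles; not taken from a real document.
const sampleTitles = ["Chapter 1", "Chapter 5", "Chapter 6"];

const listForm = /Chapter [12345]/;  // explicit list of characters
const rangeForm = /Chapter [1-5]/;   // equivalent range, both ends inclusive

for (const title of sampleTitles) {
  // Both forms give the same answer for every title.
  console.log(title, listForm.test(title), rangeForm.test(title));
}

// The bracket expression occupies the ninth character position,
// which is index 8 when counting from zero.
const digitIndex = "Chapter 3".search(/[1-5]/);
console.log(digitIndex); // 8
```

The end value 5 is inside the range, so 'Chapter 5' matches while 'Chapter 6' does not.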

To include a hyphen in a bracket expression, use one of the following methods:

Escape it with a backslash:
[\-]
Place the hyphen at the beginning or the end of the bracketed list. The following expressions match all lowercase letters and the hyphen:
[-a-z]
[a-z-]
Create a range in which the start character has a lower value than the hyphen and the end character has a value equal to or greater than the hyphen. Both of the following regular expressions meet this requirement:

[!--]
[!-~]
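The three methods can be compared side by side. The following sketch assumes a modern JavaScript engine and uses made-up variable names; it checks that each form accepts a hyphen and shows how the two range forms differ.

```javascript
// Every pattern from the list above, in the same order.
const hyphenPatterns = [/[\-]/, /[-a-z]/, /[a-z-]/, /[!--]/, /[!-~]/];

// Each of them should accept a lone hyphen.
const allMatchHyphen = hyphenPatterns.every((re) => re.test("-"));
console.log(allMatchHyphen); // true

// [!--] covers '!' through '-', so letters fall outside it;
// [!-~] covers '!' through '~', which includes the letters.
const narrowRangeMatchesLetter = /[!--]/.test("a");
const wideRangeMatchesLetter = /[!-~]/.test("a");
console.log(narrowRangeMatchesLetter); // false
console.log(wideRangeMatchesLetter);   // true
```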

Similarly, placing a caret (^) at the beginning of the list matches any character that is not in the list or range. If the caret appears anywhere else in the list, it matches itself and has no special meaning. The following JScript regular expression can be used to match chapter titles whose single-digit chapter number is greater than 5:

/Chapter [^12345]/
For VBScript, use:

"Chapter [^12345]"
In the examples above, the expression matches any character in the ninth position except 1, 2, 3, 4, or 5. Therefore 'Chapter 7' is a match, as is 'Chapter 9'. Because the list is not restricted to digits, a title such as 'Chapter 0' or 'Chapter x' matches as well.
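To see the negated list in action, consider this small sketch. It assumes a modern JavaScript engine, and the test strings are invented for illustration.

```javascript
const notOneToFive = /Chapter [^12345]/;

console.log(notOneToFive.test("Chapter 7")); // true
console.log(notOneToFive.test("Chapter 3")); // false
console.log(notOneToFive.test("Chapter x")); // true: 'x' is not in the list either

// A caret that is not first in the list is an ordinary character.
const caretLiteral = /[1^5]/;
console.log(caretLiteral.test("^")); // true
console.log(caretLiteral.test("3")); // false
```

If only digits should be accepted, a positive range such as [6-9] is the more precise choice.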

The expressions above can also be written as a range, using a hyphen (-). For JScript, it is:

/Chapter [^1-5]/
Or, for VBScript, it is:

"Chapter [^1-5]"
A typical use of bracket expressions is to match any uppercase or lowercase letter or any digit. The following JScript expression specifies such a match:

/[A-Za-z0-9]/
The equivalent VBScript expression is:

"[A-Za-z0-9]"
Qualifier

Sometimes you do not know how many characters to match. To handle this uncertainty, regular expressions support qualifiers (more commonly called quantifiers). A qualifier specifies how many times a given component of a regular expression must occur for the match to succeed.

The following table gives an explanation of the various qualifiers and their meanings:

Character Description
* Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". '*' is equivalent to '{0,}'.

+ Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". '+' is equivalent to '{1,}'.

? Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" and "does". '?' is equivalent to '{0,1}'.

{n} n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".

{n,} n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "fooooood". 'o{1,}' is equivalent to 'o+', and 'o{0,}' is equivalent to 'o*'.

{n,m} m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
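The rows of the table can be tried out directly. The sketch below assumes a modern JavaScript engine; firstMatch is a hypothetical helper written for this example, and the long word is built with repeat so that the number of o's is explicit.

```javascript
// Hypothetical helper: returns the first matched text, or null when nothing matches.
function firstMatch(re, text) {
  const m = text.match(re);
  return m ? m[0] : null;
}

console.log(firstMatch(/zo*/, "z"));        // "z"
console.log(firstMatch(/zo*/, "zoo"));      // "zoo"
console.log(firstMatch(/zo+/, "z"));        // null
console.log(firstMatch(/do(es)?/, "does")); // "does"
console.log(firstMatch(/o{2}/, "Bob"));     // null
console.log(firstMatch(/o{2}/, "food"));    // "oo"

// An 'f', five o's, and a 'd'.
const manyOs = "f" + "o".repeat(5) + "d";
console.log(firstMatch(/o{2,}/, manyOs));   // "ooooo"
console.log(firstMatch(/o{1,3}/, manyOs));  // "ooo"
```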

In a large input document, the number of chapters can easily exceed nine, so you need a way to handle two-digit or three-digit chapter numbers. Qualifiers provide this capability. The following JScript regular expression matches chapter titles with any number of digits:

/Chapter [1-9][0-9]*/
The following VBScript regular expression performs the same match:

"Chapter [1-9][0-9]*"
Note that the qualifier appears after the range expression. It therefore applies to the entire range expression that precedes it, which in this case specifies only the digits 0 through 9.

The '+' qualifier is not used here because a digit is not necessarily required in the second or subsequent positions. The '?' character is not used either, because it would limit chapter numbers to two digits. At least one digit must be matched after 'Chapter' and the space character.
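The reasoning behind choosing '*' can be checked against the alternatives. This is a small sketch that assumes a modern JavaScript engine; the variable names and sample titles are made up for the comparison.

```javascript
const anyDigits = /Chapter [1-9][0-9]*/;  // one digit, then zero or more digits
const twoOrMore = /Chapter [1-9][0-9]+/;  // would demand a second digit
const atMostTwo = /Chapter [1-9][0-9]?/;  // stops after two digits

console.log("Chapter 7".match(anyDigits)[0]);   // "Chapter 7"
console.log("Chapter 123".match(anyDigits)[0]); // "Chapter 123"
console.log(twoOrMore.test("Chapter 7"));       // false
console.log("Chapter 123".match(atMostTwo)[0]); // "Chapter 12"
console.log(anyDigits.test("Chapter 0"));       // false: the first digit must be 1-9
```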

If you know that chapter numbers are limited to 99, you can use the following JScript expression to specify at least one digit but no more than two.

/Chapter [0-9]{1,2}/
For VBScript, the following regular expression can be used:

"Chapter [0-9]{1,2}"
The disadvantage of the expression above is that a chapter number greater than 99 still matches, but only its first two digits. Another drawback is that someone could create a Chapter 0 and it would still match. A better JScript expression for matching a one- or two-digit chapter number is:

/Chapter [1-9][0-9]?/
or

/Chapter [1-9][0-9]{0,1}/
For VBScript, the following expression is equivalent to the above:

"Chapter [1-9][0-9]?"
or

"Chapter [1-9][0-9]{0,1}"
The '*', '+', and '?' qualifiers are all greedy; that is, they match as much text as possible. Sometimes that is not what you want at all. Sometimes you want only the smallest possible match.

For example, you might want to search an HTML document for a chapter title enclosed in an H1 tag. In the document, the text may have the following form:

<H1>Chapter 1 – Introduction to Regular Expressions</H1>
The following expression matches everything from the opening less-than sign (<) to the greater-than sign (>) that ends the closing H1 tag.

/<.*>/
The VBScript regular expression is:

"<.*>"
If you want to match only the opening H1 tag, the following non-greedy expression matches only <H1>.

/<.*?>/
or

"<.*?>"
Placing a '?' after a '*', '+', or '?' qualifier changes the expression from a greedy match to a non-greedy, or minimal, match.
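The difference between the two forms is easy to observe. The following sketch assumes a modern JavaScript engine and reuses the sample heading line from above; the variable names are illustrative.

```javascript
const heading = "<H1>Chapter 1 – Introduction to Regular Expressions</H1>";

// Greedy: '.*' runs to the last '>' in the input.
const greedyMatch = heading.match(/<.*>/)[0];
console.log(greedyMatch === heading); // true

// Non-greedy: '.*?' stops at the first '>' it can reach.
const minimalMatch = heading.match(/<.*?>/)[0];
console.log(minimalMatch); // "<H1>"

// With the global flag, the non-greedy form finds each tag separately.
const allTags = heading.match(/<.*?>/g);
console.log(allTags); // [ "<H1>", "</H1>" ]
```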

Locator

Until now, the examples have only looked for chapter titles wherever they occur. Any occurrence of the string 'Chapter' followed by a space and a number could be a real chapter title, or it could be a cross-reference to another chapter. Because real chapter titles always appear at the beginning of a line, you need a way to find only the titles and not the cross-references.

Locators (more commonly called anchors) provide this capability. A locator fixes a regular expression to the beginning or end of a line. Locators also let you create regular expressions that match only within a word, or only at the beginning or end of a word. The following table lists the locators and their meanings:

Character Description
^ Matches the position at the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$ Matches the position at the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
\b Matches a word boundary, that is, the position between a word and a space.
\B Matches a non-word boundary.

Qualifiers cannot be used with locators. Because there cannot be more than one position immediately before or after a newline or word boundary, expressions such as '^*' are not allowed.

To match text at the beginning of a line, use the '^' character at the beginning of the regular expression. Do not confuse this use of '^' with its use inside a bracket expression. The two meanings are completely different.

To match text at the end of a line, use the '$' character at the end of the regular expression.
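The effect of '^' and '$' depends on whether multiline matching is switched on. The sketch below assumes a modern JavaScript engine, where the 'm' flag plays the role of the Multiline property; the sample text is invented, and the pattern is the two-digit chapter pattern built up earlier with a locator added at each end.

```javascript
// Four lines: two real titles, one cross-reference, one title with trailing text.
const sampleText = [
  "Chapter 1",
  "See Chapter 2 for details",
  "Chapter 12",
  "Chapter 3 continued",
].join("\n");

// '^' alone: the match must start at the beginning of a line.
const startOnly = sampleText.match(/^Chapter [1-9][0-9]{0,1}/gm);
console.log(startOnly); // [ "Chapter 1", "Chapter 12", "Chapter 3" ]

// '^' and '$' together: the title must be the whole line.
const wholeLine = sampleText.match(/^Chapter [1-9][0-9]{0,1}$/gm);
console.log(wholeLine); // [ "Chapter 1", "Chapter 12" ]

// Without the 'm' flag, '^' and '$' refer to the whole input string.
const withoutMultiline = /^Chapter [1-9][0-9]{0,1}$/.test(sampleText);
console.log(withoutMultiline); // false
```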

With a locator, the search for chapter titles becomes more precise. The following JScript regular expression matches a chapter title with up to two digits only when it occurs at the beginning of a line:

/^Chapter [1-9][0-9]{0,1}/
The equivalent regular expression in VBScript is:

"^Chapter [1-9][0-9]{0,1}"
A real chapter title not only appears at the beginning of a line, it is also the only content on that line, so it must also end at the end of the line. The following expression ensures that the match finds only chapter titles and not cross-references. It does so by creating a regular expression that matches only when the text spans from the beginning to the end of a line.

/^Chapter [1-9][0-9]{0,1}$/
For VBScript, use:

"^Chapter [1-9][0-9]{0,1}$"
Matching word boundaries is a little different, but it adds a very important capability to regular expressions. A word boundary is the position between a word and a space. A non-word boundary is any other position. The following JScript expression matches the first three characters of the word 'Chapter' because they appear immediately after a word boundary:

/\bCha/
For VBScript, it is:

"\bCha"
The position of the '\b' operator is critical here. If it is at the beginning of the string to be matched, the expression looks for a match at the beginning of a word; if it is at the end of the string, the expression looks for a match at the end of a word. For example, the following expressions match 'ter' in the word 'Chapter' because it appears immediately before a word boundary:

/ter\b/
and

"ter\b"
The following expressions match 'apt' as it occurs in 'Chapter', but not as it occurs in 'aptitude':

/\Bapt/
and

"\Bapt"
This is because 'apt' occurs at a non-word boundary in the word 'Chapter' but at a word boundary in the word 'aptitude'. For the non-word boundary operator, position is not important, because the match is not tied to the beginning or end of a word.
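The word boundary examples can be verified with a few calls. This sketch assumes a modern JavaScript engine; the extra test word 'terminal' is an illustrative addition.

```javascript
const startsWord = /\bCha/;  // 'Cha' right after a word boundary
const endsWord = /ter\b/;    // 'ter' right before a word boundary
const insideWord = /\Bapt/;  // 'apt' not preceded by a word boundary

console.log(startsWord.test("Chapter"));  // true
console.log(endsWord.test("Chapter"));    // true
console.log(endsWord.test("terminal"));   // false: 'ter' is followed by more letters
console.log(insideWord.test("Chapter"));  // true
console.log(insideWord.test("aptitude")); // false: 'apt' starts the word
```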

Alternation and grouping
Alternation uses the '|' character to allow a choice between two or more alternatives. By extending the chapter title regular expression, you can make it apply to more than just chapter titles. However, this is not as straightforward as you might expect. With alternation, the largest possible expression on each side of the '|' character is matched. You might think that the following JScript and VBScript expressions match either 'Chapter' or 'Section' followed by one or two digits, on a line by itself:

/^Chapter|Section [1-9][0-9]{0,1}$/
"^Chapter|Section [1-9][0-9]{0,1}$"
Unfortunately, the regular expressions shown above match either the word 'Chapter' at the beginning of a line, or the word 'Section' followed by one or two digits at the end of a line. If the input string is 'Chapter 22', the expression matches only the word 'Chapter'. If the input string is 'Section 22', the expression matches 'Section 22'. That is not the intent here, so there must be a way to make the regular expression do what it is meant to do, and there is.

You can use parentheses to limit the scope of the alternation, that is, to make it clear that the choice applies only to the two words 'Chapter' and 'Section'. However, parentheses are tricky as well, because they are also used to create subexpressions, which are covered later in the discussion of subexpressions. By taking the regular expression shown above and adding parentheses in the appropriate places, you can make it match either 'Chapter 1' or 'Section 3'.

The following regular expression uses parentheses to group 'Chapter' and 'Section', so the expression works correctly. For JScript, it is:

/^(Chapter|Section) [1-9][0-9]{0,1}$/
For VBScript, it is:

"^(Chapter|Section) [1-9][0-9]{0,1}$"
These expressions work correctly, but they produce an interesting by-product. Placing parentheses around 'Chapter|Section' establishes the proper grouping, but it also causes whichever of the two words is matched to be captured for later use. Because the expression above contains only one set of parentheses, there can be only one captured submatch. You can refer to this submatch through the Submatches collection in VBScript or the $1-$9 properties of the RegExp object in JScript.

Sometimes capturing a submatch is desirable, and sometimes it is not. In the examples above, all you really want the parentheses to do is group the choice between the words 'Chapter' and 'Section'. You do not need to refer to the match later. In fact, do not use capturing parentheses unless you really need to capture submatches. A regular expression that does not have to spend time and memory storing submatches is more efficient.
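The difference between the ungrouped and grouped patterns, and the captured submatch that grouping produces, can be seen in this sketch. It assumes a modern JavaScript engine, where the submatch appears at index 1 of the match array; the variable names and the cross-reference string are illustrative.

```javascript
const loose = /^Chapter|Section [1-9][0-9]{0,1}$/;
const grouped = /^(Chapter|Section) [1-9][0-9]{0,1}$/;

// Without grouping, '|' splits the whole pattern in two.
console.log("Chapter 22".match(loose)[0]); // "Chapter"
console.log("Section 22".match(loose)[0]); // "Section 22"
console.log(loose.test("See Section 3"));  // true: only the right-hand side is anchored with '$'

// With grouping, the locators and the number apply to both words.
const groupedMatch = "Chapter 22".match(grouped);
console.log(groupedMatch[0]); // "Chapter 22"
console.log(groupedMatch[1]); // "Chapter" (the captured submatch)
console.log(grouped.test("See Section 3")); // false
```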

You can place '?:' immediately after the opening parenthesis of a group to prevent the match from being stored for later use. The following modification of the regular expressions shown above provides the same functionality without storing the submatch. For JScript:

/^(?:Chapter|Section) [1-9][0-9]{0,1}$/
For VBScript:

"^(?:Chapter|Section) [1-9][0-9]{0,1}$"
In addition to the '?:' metacharacter, there are two other non-capturing metacharacters, used for what are called lookahead matches (pre-checks). A positive lookahead, specified with '?=', matches the search string at any position where the regular expression pattern inside the parentheses begins to match. A negative lookahead, specified with '?!', matches the search string at any position where the pattern inside the parentheses does not match.

For example, suppose you have a document that contains references to Windows 3.1, Windows 95, Windows 98, and Windows NT. Suppose further that you need to update the document by finding all references to Windows 95, Windows 98, and Windows NT and changing them to Windows 2000. You can use the following JScript regular expression, which uses a positive lookahead, to match Windows 95, Windows 98, and Windows NT:

/Windows(?= 95 | 98 | NT )/
To make the same match in VBScript, you can use the following expression:

"Windows(?= 95 | 98 | NT )"
After a match is found, the search for the next match begins immediately after the matched text, not including the characters examined by the lookahead. For example, if the expression shown above matches in 'Windows 98', the search resumes after 'Windows', not after '98'.
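The way a lookahead leaves the checked text unconsumed can be observed through the lastIndex property. This is a sketch under stated assumptions: it runs in a modern JavaScript engine, the sample sentence is invented, and the pattern places the space between 'Windows' and the version inside the lookahead so that text such as 'Windows 95 ' matches.

```javascript
// Invented sample text with one unsupported and two supported versions.
const sampleDoc = "Windows 3.1, Windows 95 and Windows NT are mentioned here.";
const supported = /Windows(?= 95 | 98 | NT )/g;

const firstHit = supported.exec(sampleDoc);
const resumeAt = supported.lastIndex;
console.log(firstHit[0]);    // "Windows": the lookahead text is not part of the match
console.log(firstHit.index); // 13
console.log(resumeAt);       // 20, immediately after "Windows"
console.log(sampleDoc.slice(resumeAt, resumeAt + 3)); // " 95" is still unconsumed

// The negative form finds 'Windows' where none of the three versions follows.
const otherVersions = /Windows(?! 95 | 98 | NT )/;
const otherIndex = sampleDoc.search(otherVersions);
console.log(otherIndex); // 0, the 'Windows' of 'Windows 3.1'
```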

Backreferences
One of the most important features of regular expressions is the ability to store part of a successfully matched pattern for later reuse. Recall that placing parentheses around a regular expression pattern, or part of a pattern, causes that part of the expression to be stored in a temporary buffer. You can use the non-capturing metacharacters '?:', '?=', or '?!' to prevent that part of the regular expression from being stored.

Each captured submatch is stored in the order in which it is encountered, from left to right, in the regular expression pattern. The buffers that store submatches are numbered starting at 1 and continuing up to a maximum of 99 subexpressions. Each buffer can be accessed with '\n', where n is a one- or two-digit decimal number that identifies a particular buffer.

One of the simplest and most useful applications of backreferences is finding places where the same word occurs twice in a row in a text. Consider the following sentence:

Is is the cost of of gasoline going up up?
As written, the sentence clearly has a problem with repeated words. It would be nice to have a way to fix the sentence without having to search for the duplicate of every single word. The following JScript regular expression does this using a single subexpression.

/\b([a-z]+) \1\b/gi
The equivalent VBScript expression is:

"\b([a-z]+) \1\b"
In this example, the subexpression is everything between the parentheses. The captured expression consists of one or more alphabetic characters, as specified by '[a-z]+'. The second part of the regular expression is a reference to the previously captured submatch, that is, the second occurrence of the word just matched by the parenthesized expression. '\1' specifies the first submatch. The word boundary metacharacters ensure that only whole words are detected. Without them, phrases such as "is issued" or "this is" would be incorrectly identified by the expression.

In the JScript expression, the global flag ('g') after the regular expression indicates that the expression is applied to as many matches as it can find in the input string. Case insensitivity is specified by the ignore-case flag ('i') at the end of the expression. The multiline flag ('m') specifies that potential matches may occur on either side of a newline character. In VBScript, flags cannot be set in the expression itself; they must be set explicitly through properties of the RegExp object.

Using the regular expression shown above, the following JScript code uses the submatch information to replace each occurrence of two consecutive identical words in a string with a single occurrence of that word:

var ss = "Is is the cost of of gasoline going up up?.\n";
var re = /\b([a-z]+) \1\b/gim; // Create the regular expression pattern.
var rv = ss.replace(re,"$1"); // Replace each pair of repeated words with a single word.
The closest equivalent VBScript code is as follows:

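Dim ss, re, rv
ss = "Is is the cost of of gasoline going up up?." & vbNewLine
Set re = New RegExp
re.Pattern = "\b([a-z]+) \1\b"
re.Global = True      ' Find all matches, like the 'g' flag.
re.IgnoreCase = True  ' Ignore case, like the 'i' flag.
re.Multiline = True   ' Multiline matching, like the 'm' flag.
rv = re.Replace(ss,"$1")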
Note that in the VBScript code, the global, ignore-case, and multiline flags are set through the corresponding properties of the RegExp object.

The '$1' in the replace method refers to the first saved submatch. If there were more than one submatch, you would refer to them consecutively with $2, $3, and so on.
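To confirm what the replacement produces, and why the word boundaries matter, here is a short sketch. It assumes a modern JavaScript engine; the second pattern, with the boundaries removed, and the phrase 'This is issued' are illustrative additions.

```javascript
const withBoundaries = /\b([a-z]+) \1\b/gi;
const withoutBoundaries = /([a-z]+) \1/gi; // illustrative variant, not recommended

// Each doubled word collapses to a single occurrence.
const fixedSentence = "Is is the cost of of gasoline going up up?".replace(withBoundaries, "$1");
console.log(fixedSentence); // "Is the cost of gasoline going up?"

// With the boundaries in place, a phrase without doubled words is left alone.
const untouched = "This is issued".replace(withBoundaries, "$1");
console.log(untouched); // "This is issued"

// Without them, the 'is' inside 'This' pairs up with the following 'is'.
const damaged = "This is issued".replace(withoutBoundaries, "$1");
console.log(damaged); // "This issued"
```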

Another use of backreferences is to break a Uniform Resource Identifier (URI) down into its component parts. Suppose you want to decompose the following URI into the protocol (ftp, http, and so on), the domain address, and the page/path:

:80/scripting/
The following regular expressions can provide this function. For JScript, it is:

/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/
For VBScript, it is:

"(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)"
The first parenthesized subexpression captures the protocol part of the web address. It matches any word that precedes a colon and two forward slashes. The second parenthesized subexpression captures the domain address part of the address. It matches any sequence of one or more characters that does not include the '/' or ':' characters. The third parenthesized subexpression captures the port number, if one is specified. It matches a colon followed by zero or more digits. Finally, the fourth parenthesized subexpression captures the path and/or page information specified by the web address. It matches zero or more characters other than '#' or a space.

After this regular expression is applied to the URI shown above, the submatches contain the following:

RegExp.$1 contains "http"
RegExp.$2 contains “"
RegExp.$3 contains ":80"
RegExp.$4 contains "/scripting/"
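The same decomposition can be tried on another address. In this sketch, which assumes a modern JavaScript engine, both URIs are hypothetical examples and the variable names are made up; positions 1 through 4 of the match array correspond to RegExp.$1 through RegExp.$4.

```javascript
const uriPattern = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;

// Hypothetical URI, used only for illustration.
const sampleUri = "http://www.example.com:80/scripting/index.html";
const [, uriProtocol, uriHost, uriPort, uriPath] = sampleUri.match(uriPattern);
console.log(uriProtocol); // "http"
console.log(uriHost);     // "www.example.com"
console.log(uriPort);     // ":80"
console.log(uriPath);     // "/scripting/index.html"

// When no port is given, the optional third group does not participate.
const noPortMatch = "ftp://ftp.example.com/pub/".match(uriPattern);
console.log(noPortMatch[3]); // undefined
console.log(noPortMatch[4]); // "/pub/"
```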
