Four applications of regular expressions in web page processing

Regular expressions provide an efficient and convenient way to match string patterns. Almost all high-level languages either support regular expressions directly or offer ready-made libraries that can be called. Using common processing tasks in the ASP environment as examples, this article introduces practical techniques for applying regular expressions.

1. Check the format of passwords and email addresses

Our first example demonstrates a basic capability of regular expressions: describing arbitrarily complex strings in the abstract. Regular expressions give programmers a formal way to describe strings, so that any string pattern the application encounters can be described with very little code. For example, to people who do not do technical work, the password format requirements could be described as follows: the first character of the password must be a letter, the password must have at least 4 and at most 15 characters, and the password must not contain characters other than letters, digits and underscores.

As programmers, we must convert this natural-language description of the password format into a form that the ASP page can understand and apply in order to reject invalid passwords. The regular expression describing this password format is: ^[a-zA-Z]\w{3,14}$. In an ASP application, we can write the password check as a reusable function, as shown below:

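Function TestPassword(strPassword)
Dim re
Set re = new RegExp
re.IgnoreCase = false
re.Global = false
re.Pattern = "^[a-zA-Z]\w{3,14}$"
' Test returns True only if the whole password matches the pattern
TestPassword = re.Test(strPassword)
End Function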

Let's compare the regular expression for the password format with the natural-language description:
The first character of the password must be a letter: the regular expression is "^[a-zA-Z]", where "^" indicates the beginning of the string and the hyphen tells RegExp to match all characters in the specified range.
The password has at least 4 and at most 15 characters: the regular expression is "{3,14}".
The password cannot contain characters other than letters, digits and underscores: the regular expression is "\w".

A few points need explanation. {3,14} means that the preceding pattern matches at least 3 but no more than 14 characters (together with the first character, this gives 4 to 15 characters). Note that the syntax inside the curly braces is extremely strict: no spaces are allowed on either side of the comma. Adding spaces changes the meaning of the regular expression and causes errors in password format checking. Also note the "$" character at the end of the regular expression above. The $ character makes the regular expression match all the way to the end of the string, ensuring that no other characters follow an otherwise valid password.
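
As a cross-check outside ASP, the same three rules can be sketched with Python's re module. This is only an illustration and not part of the original ASP code: the function name is_valid_password is hypothetical, and re.ASCII is an assumption made here so that \w stays limited to letters, digits and underscores.

```python
import re

# Same pattern as in the article. re.ASCII (an assumption of this sketch)
# keeps \w restricted to [a-zA-Z0-9_] instead of all Unicode word characters.
PASSWORD_RE = re.compile(r"^[a-zA-Z]\w{3,14}$", re.ASCII)


def is_valid_password(password: str) -> bool:
    """Return True if the password starts with a letter, is 4 to 15
    characters long and contains only letters, digits and underscores."""
    # fullmatch requires the entire string to match, mirroring the ^...$ anchors.
    return PASSWORD_RE.fullmatch(password) is not None


if __name__ == "__main__":
    for candidate in ["abcd", "abc", "1abcd", "ab_c9", "ab-cd"]:
        print(candidate, is_valid_password(candidate))
```

Here "abcd" and "ab_c9" pass, while "abc" (too short), "1abcd" (starts with a digit) and "ab-cd" (contains a hyphen) are rejected.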

Like password format checking, validating email addresses is a very common problem. A simple email address check using regular expressions can be implemented as follows:

<%
Dim re
Set re = new RegExp
re.Pattern = "^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$"
' Write True or False depending on whether the address matches
Response.Write re.Test("aabb@")
%>
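
To see what this simple pattern accepts and rejects, here is a minimal Python sketch using the same expression. The function name and the example.com addresses are hypothetical test data, not taken from the article.

```python
import re

# The article's simple email pattern, unchanged. re.ASCII is an assumption
# of this sketch so that \w means [a-zA-Z0-9_] only.
EMAIL_RE = re.compile(r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$", re.ASCII)


def is_simple_email(address: str) -> bool:
    """Return True if the address matches the article's simple pattern."""
    return EMAIL_RE.fullmatch(address) is not None


if __name__ == "__main__":
    samples = [
        "aabb@example.com",        # accepted
        "aabb@",                   # rejected: no domain
        "first.last@example.com",  # rejected: "." is not matched by \w
        "aabb@mail.example.com",   # rejected: only one dot allowed after "@"
        "aabb@example.info",       # rejected: suffix longer than 3 letters
    ]
    for sample in samples:
        print(sample, is_simple_email(sample))
```

The rejected cases show why the check is described as simple: many valid real-world addresses fall outside this pattern, so a stricter or more permissive expression may be needed in practice.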

2. Extract specific parts of an HTML page

The main problem in extracting content from HTML pages is finding a way to identify exactly the part we want. For example, here is a snippet of HTML code that displays a news headline:

<table border="0" width="11%" class="Somestory">
<tr>
<td width="100%">
<p align="center">Other content...</td>
</tr>
</table>
<table border="0" width="11%" class="Headline">
<tr>
<td width="100%">
<p align="center">Iraq War!</td>
</tr>
</table>
<table border="0" width="11%" class="Someotherstory">
<tr>
<td width="100%">
<p align="center">Other content...</td>
</tr>
</table>

Looking at the code above, it is easy to see that the news headline is displayed by the middle table, whose class attribute is set to Headline. If the HTML page is very complex, an add-on feature available for Microsoft IE since version 5.0 lets you view only the HTML code of the selected part of the page; to learn more, please visit /Windows/ie/WebAccess/. For this example, we assume that this is the only table whose class attribute is set to Headline. We now want to create a regular expression that finds the Headline table so that we can include the table in our own page. First, write the code that sets up the regular expression object:

<%
Dim re, strHTML
Set re = new RegExp ' Create a regular expression object
re.IgnoreCase = true
re.Global = false ' End search after the first match
%>

Now consider the region we want to extract: the entire <table> structure, including the closing tag and the text of the news headline. The search should therefore begin at the <table> start tag: re.Pattern = "<table.*(?=Headline)".

This regular expression matches the table's start tag and returns everything from the start tag up to "Headline" (it does not cross line breaks, because "." does not match them). The matching HTML code can be retrieved as follows:

' Put all matching HTML code into the Matches collection
Set Matches = re.Execute(strHTML)
' Show all matching HTML code
For Each Item in Matches
Response.Write Item.Value
Next
' Show one of them
Response.Write Matches.Item(0).Value

Run this code against the HTML fragment shown earlier. The regular expression returns the following match: <table border="0" width="11%" class=". The "(?=Headline)" in the regular expression is a lookahead that does not consume characters, so the value of the table's class attribute does not appear in the result.
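
The effect of the lookahead can be reproduced in a few lines of Python. This is a sketch for illustration only: the HTML is a condensed version of the snippet above, and the function name find_headline_start is hypothetical.

```python
import re

# Condensed version of the article's HTML snippet (rows shortened for brevity).
HTML = """<table border="0" width="11%" class="Somestory">
<tr><td>Other content...</td></tr>
</table>
<table border="0" width="11%" class="Headline">
<tr><td>Iraq War!</td></tr>
</table>
"""


def find_headline_start(html: str) -> str:
    """Return the text matched by the article's first pattern, or ''."""
    # "." does not match newlines, so the match stays on one line;
    # (?=Headline) checks for "Headline" without including it in the match.
    match = re.search(r"<table.*(?=Headline)", html, re.IGNORECASE)
    return match.group(0) if match else ""


if __name__ == "__main__":
    print(find_headline_start(HTML))
```

The first table is skipped because "Headline" does not appear on its opening line, and the match on the second table stops just before the class value.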

The code to get the rest of the table is also quite simple: re.Pattern = "<table.*(?=Headline)(.|\n)*?</table>". Here, the "*" after "(.|\n)" matches zero or more arbitrary characters (including line breaks), and the "?" minimizes the range matched by "*", that is, it matches as few characters as possible before finding the next part of the expression. </table> is the table's end tag.

The "?" qualifier is very important: it prevents the expression from returning the code of other tables. For example, for the HTML snippet given earlier, if you delete this "?", the returned content will be:

<table border="0" width="11%" class="Headline">
<tr>
<td width="100%">
<p align="center">Iraq War!</td>
</tr>
</table>
<table border="0" width="11%" class="Someotherstory">
<tr>
<td width="100%">
<p align="center">Other content...</td>
</tr>
</table>

The returned content contains not only the Headline table but also the Someotherstory table. This shows that the "?" here is essential.
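
The difference between the two forms can be checked with a short Python sketch. The HTML is a condensed version of the article's snippet, and the helper name extract is hypothetical.

```python
import re

# Condensed version of the article's three-table snippet.
HTML = """<table border="0" width="11%" class="Somestory">
<tr><td>Other content...</td></tr>
</table>
<table border="0" width="11%" class="Headline">
<tr><td>Iraq War!</td></tr>
</table>
<table border="0" width="11%" class="Someotherstory">
<tr><td>Other content...</td></tr>
</table>
"""

# With "?" the quantifier is lazy and stops at the first </table>;
# without it the quantifier is greedy and runs to the last </table>.
LAZY = r"<table.*(?=Headline)(.|\n)*?</table>"
GREEDY = r"<table.*(?=Headline)(.|\n)*</table>"


def extract(pattern: str, html: str) -> str:
    """Return the first match of pattern in html, or '' if there is none."""
    match = re.search(pattern, html, re.IGNORECASE)
    return match.group(0) if match else ""


if __name__ == "__main__":
    print(extract(LAZY, HTML))
    print("-----")
    print(extract(GREEDY, HTML))
```

The lazy pattern returns only the Headline table, while the greedy one also swallows the Someotherstory table that follows it.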

This example assumes rather ideal conditions. Real applications are often much more complicated, and writing the ASP code is particularly difficult when you have no control over how the source HTML is written. The most effective approach is to spend more time analyzing the HTML around the content to be extracted, and to test frequently to make sure that the extracted content is exactly what you need.

In addition, you should anticipate and handle the case where the regular expression matches nothing in the source HTML page. Content can change very quickly. Don't let your page show silly, avoidable errors just because someone else changed the format of their content.
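
One simple way to handle the no-match case is to return a fallback value instead of assuming a match exists. The following Python sketch illustrates the idea; the function name extract_headline and the fallback text are hypothetical.

```python
import re

HEADLINE_RE = re.compile(r"<table.*(?=Headline)(.|\n)*?</table>", re.IGNORECASE)


def extract_headline(html: str, fallback: str = "") -> str:
    """Return the Headline table, or the fallback if the page layout changed."""
    match = HEADLINE_RE.search(html)
    if match is None:
        # The source page no longer contains a Headline table:
        # degrade gracefully instead of raising an error.
        return fallback
    return match.group(0)


if __name__ == "__main__":
    changed_page = "<div class='news'>Layout changed, no tables here</div>"
    print(extract_headline(changed_page, fallback="Headline unavailable"))
```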

3. Analyze text data files
Data files come in many formats and types: XML documents, structured text and even unstructured text often serve as data sources for ASP applications. The example below is a structured text file that uses qualifiers. A qualifier (such as a quote character) indicates that the text it encloses must be treated as a single unit, even if that text contains the delimiter used to separate a record into fields. Here is a simple structured text file:

Last name,First name, Phone, Description
Sun,Wukong, 312 555 5656, ASP is great
Zhu,Bajie, 847 555 5656, I am a film producer

This file is very simple. Its first line is the header, and the following two lines are records that use commas as delimiters. Parsing it is also very simple: just split the file into lines (at the line breaks) and then split each record into fields. However, suppose the content of one field contains a comma:

Last name,First name, Phone, Description
Sun,Wukong, 312 555 5656, I like ASP, and VB and SQL
Zhu,Bajie, 847 555 5656, I am a film producer

Parsing the first record now runs into a problem: to a parser that only recognizes comma separators, its last field looks like two fields. To avoid such problems, fields that contain the delimiter must be surrounded by qualifiers. Single quotes are a commonly used qualifier. After single-quote qualifiers are added to the text file above, its contents are as follows:

Last name,First name, Phone, Description
Sun,Wukong, 312 555 5656, 'I like ASP, and VB and SQL'
Zhu,Bajie, 847 555 5656, 'I am a film producer'

Now we can tell which commas are delimiters and which are part of a field: commas that appear inside quotes are simply treated as field content. The next step is to implement a regular expression parser that decides when to split fields at a comma and when to treat a comma as field content.
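
To make the problem concrete, the following Python sketch splits the quoted record on every comma, which is what a parser that knows nothing about qualifiers would do. The function name naive_split is hypothetical.

```python
def naive_split(line: str) -> list[str]:
    """Split on every comma, ignoring quote qualifiers entirely."""
    return line.split(",")


if __name__ == "__main__":
    record = "Sun,Wukong, 312 555 5656, 'I like ASP, and VB and SQL'"
    fields = naive_split(record)
    print(len(fields))
    print(fields)
```

The record has four fields, but the naive split produces five, because the comma inside the quoted description is treated as a delimiter.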

This problem differs slightly from the ones regular expressions usually deal with. Normally we look at a small piece of text to see whether it matches the regular expression. Here, however, we can only reliably determine what is inside quotes after considering the entire line of text.

Here is an example that illustrates the problem. Suppose half a line is taken at random from some text file: 1, beach, black, 21, ', dog, cat, duck, ', . Because there is other data to the left of the "1", it is extremely difficult to parse this fragment correctly. We do not know how many single quotes precede the fragment, so we cannot tell which characters are inside quotes (and text inside quotes must not be split during parsing). If an even number of single quotes (or none) precedes the fragment, then "', dog, cat, duck, '" is a quoted string and must not be split. If the number of preceding quotes is odd, then "1, beach, black, 21, '" is the tail of a quoted string and must not be split.
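
The parity argument can be expressed directly in code. The Python sketch below counts the single quotes before a given comma; the function name comma_is_separator and the prefix used in the demonstration are hypothetical. Note that it counts quotes before the comma, as in the paragraph above, which gives the same answer as counting quotes after the comma only when the quotes on the line are balanced.

```python
def comma_is_separator(line: str, index: int) -> bool:
    """Return True if the comma at line[index] lies outside quoted text.

    An even number of single quotes before the comma means every quote
    opened so far has been closed, so the comma is a field separator.
    """
    quotes_before = line.count("'", 0, index)
    return quotes_before % 2 == 0


FRAGMENT = "1, beach, black, 21, ', dog, cat, duck, ', "
COMMA = FRAGMENT.index(", dog")  # position of the comma just before "dog"

if __name__ == "__main__":
    # No quotes precede the fragment: the comma before "dog" is inside quotes.
    print(comma_is_separator(FRAGMENT, COMMA))
    # One quote precedes the fragment: the same comma is now a separator.
    prefix = "'x"
    print(comma_is_separator(prefix + FRAGMENT, len(prefix) + COMMA))
```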

Therefore, the regular expression must analyze the entire line of text and take into account how many quotes appear in order to determine whether a character is inside or outside a quote pair: ,(?=([^']*'[^']*')*(?![^']*')). This regular expression first finds a comma, and then looks ahead to make sure that the number of single quotes after the comma is either even or zero. The reasoning is as follows: if the number of single quotes after the comma is even, the comma lies outside any quoted string. The following table gives a more detailed description:

,	Find a comma
(?=	Look ahead to match the following pattern:
(	Start a new group
[^']*'	Zero or more non-quote characters, followed by a quote
[^']*'	Zero or more non-quote characters, followed by a quote; combined with the previous part, this matches a quote pair
)*	End the group and match the whole group (a quote pair) zero or more times
(?!	Look ahead to exclude the following pattern:
[^']*'	Zero or more non-quote characters, followed by a quote
)	End the excluded pattern
)	End the lookahead

Here is a VBScript function that accepts a string parameter, divides the string according to the comma separator and single quote qualifier in the string, and returns the result array:

Function SplitAdv(strInput)
Dim objRE
Set objRE = new RegExp
' Set up the RegExp object
objRE.IgnoreCase = true
objRE.Global = true
objRE.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"
' The Replace method replaces each delimiter comma with Chr(8), the
' backspace character (\b), which is extremely unlikely to appear in a string.
' Then we split the string at Chr(8) and save the result in an array
SplitAdv = Split(objRE.Replace(strInput, Chr(8)), Chr(8))
End Function
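
For comparison, the same splitting rule can be sketched in Python. This is an illustration rather than a translation of the VBScript function: the name split_adv is hypothetical, and the inner group is written as a non-capturing group (?:...) because Python's re.split would otherwise insert the captured text into the result list.

```python
import re

# Same logic as the article's pattern: a comma followed by an even number
# (or zero) of single quotes up to the end of the line.
# (?:...) is used instead of (...) so re.split does not return the group.
DELIMITER_RE = re.compile(r",(?=(?:[^']*'[^']*')*(?![^']*'))")


def split_adv(line: str) -> list[str]:
    """Split a record at commas that lie outside single-quoted fields."""
    return DELIMITER_RE.split(line)


if __name__ == "__main__":
    record = "Sun,Wukong, 312 555 5656, 'I like ASP, and VB and SQL'"
    for field in split_adv(record):
        print(repr(field))
```

The record is split into four fields, and the comma inside the quoted description is left untouched. Leading spaces and the quote characters themselves are kept, so a caller may still want to strip them.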

In short, parsing text data files with regular expressions is efficient and shortens development time; it can save much of the time otherwise spent analyzing files and extracting useful data under complex conditions. Even in a rapidly changing environment, a great deal of legacy data remains in use, so knowing how to build efficient data analysis routines is a valuable skill.

4. String replacement

In the last example we look at the replacement feature of VBScript regular expressions. ASP is often used to dynamically format text obtained from various data sources. With the power of VBScript regular expressions, ASP can dynamically change text that matches complex patterns. Highlighting certain words by adding HTML tags is a common application, for example highlighting search keywords in search results.
To illustrate the approach, let's look at an example that highlights every ".NET" in a string. The string can come from anywhere, such as a database or another web site.

<%
Set regEx = New RegExp
regEx.Global = true
regEx.IgnoreCase = True
' Regular expression pattern:
' find any word or URL ending in ".NET".
regEx.Pattern = "(\b[a-zA-Z\._]+?\.NET\b)"
' String used to test the replacement
strText = "Microsoft has built a new website."
' Call the regular expression's Replace method
' $1 inserts the matched text at the current position
Response.Write regEx.Replace(strText, _
"<b style='color: #000099; font-size: 18pt'>$1</b>")
%>

There are several important points to note in this example. The entire regular expression is placed inside a pair of parentheses, which captures the matched content for later use; it is referenced by $1 in the replacement text. Up to 9 such captures can be used per replacement, referenced by $1 to $9 respectively. The Replace method of a regular expression differs from VBScript's own Replace function: it takes only two parameters, the text to be searched and the replacement text.
In this example, to highlight the ".NET" strings that are found, we wrap them in bold tags with additional style attributes. With this search-and-replace technique, we can easily add keyword highlighting to a site search program, or automatically turn keywords that appear in a page into links to other pages.
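
The same capture-and-replace idea can be sketched in Python, where the first captured group is written as \1 in the replacement text instead of $1. The function name highlight and the sample sentence containing www.ASP.NET are hypothetical test data chosen for this illustration.

```python
import re

# The article's pattern; the outer parentheses capture the whole match
# so that it can be reused in the replacement text.
DOTNET_RE = re.compile(r"(\b[a-zA-Z\._]+?\.NET\b)", re.IGNORECASE)

# In Python the first captured group is referenced as \1 (VBScript uses $1).
REPLACEMENT = r"<b style='color: #000099; font-size: 18pt'>\1</b>"


def highlight(text: str) -> str:
    """Wrap every word or URL ending in ".NET" in a styled bold tag."""
    return DOTNET_RE.sub(REPLACEMENT, text)


if __name__ == "__main__":
    print(highlight("Visit www.ASP.NET today"))
```

Text that contains no word ending in ".NET" is returned unchanged, since sub only rewrites the parts that match.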

Conclusion

I hope that the regular expression techniques introduced in this article give you some ideas about when and how to apply regular expressions. Although the examples in this article are written in VBScript, regular expressions are just as useful in ASP.NET: they are one of the main mechanisms for form validation in server-side controls and are exposed to the entire .NET Framework through the System.Text.RegularExpressions namespace.