Regular expressions in PHP (I)
Hunte April 14, 2000
PHP follows the *NIX tradition and fully supports regular expressions. Regular expressions offer a powerful but unintuitive way to match and manipulate strings. Anyone who has used Perl regular expressions knows that they are very powerful but not easy to learn.
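For example:
^.+@.+\\..+$
This pattern is effective but hard to read, and it is enough to give some programmers (myself included) a headache or make them give up on regular expressions altogether. (It is written as it would appear inside a PHP string literal, where \\ stands for a single backslash, so the expression itself is ^.+@.+\..+$.) By the end of this tutorial you should be able to understand what it means.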
Basic pattern matching
Let's start with the basics. A pattern is the most basic element of a regular expression: a set of characters that describes the features of a string. A pattern can be very simple, made up of ordinary characters, or very complex, using special characters to represent ranges of characters, repetition, or context. For example:
^once
This pattern contains the special character ^, which means that it matches only strings that start with "once". For example, it matches the string "once upon a time" but not "There once was a man from New York". Just as ^ marks the beginning, the $ symbol matches strings that end with a given pattern.
bucket$
This pattern matches "Who kept all of his cash in a bucket" but not "buckets". When ^ and $ are used together, they require an exact match (the string is identical to the pattern). For example:
^bucket$
This matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern
once
and the string
There once was a man from New York
Who kept all of his cash in a bucket.
match.
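The anchoring rules above can be tried out with a short script. This is a minimal sketch that assumes the PCRE function preg_match(); the article does not name a matching function, and the POSIX ereg() family available when it was written was removed in PHP 7.0. The helper matches() is hypothetical and exists only for this illustration.

```php
<?php
// Hypothetical helper (not from the article): true when $pattern is found in $subject.
// preg_match() requires delimiters around the pattern, so "/" is added here.
function matches(string $pattern, string $subject): bool
{
    return preg_match('/' . $pattern . '/', $subject) === 1;
}

$line = 'There once was a man from New York';

var_dump(matches('^once', 'once upon a time'));    // bool(true): starts with "once"
var_dump(matches('^once', $line));                 // bool(false): "once" is not at the start
var_dump(matches('bucket$', 'Who kept all of his cash in a bucket')); // bool(true)
var_dump(matches('bucket$', 'buckets'));           // bool(false): ends with "s"
var_dump(matches('^bucket$', 'bucket'));           // bool(true): exact match
var_dump(matches('^bucket$', 'a bucket'));         // bool(false)
var_dump(matches('once', $line));                  // bool(true): unanchored, found anywhere
```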
The letters in this pattern (o-n-c-e) are literal characters, that is, each one stands for itself; digits work the same way. Slightly more complicated characters, such as some punctuation marks and whitespace (spaces, tabs and so on), are written as escape sequences. Every escape sequence begins with a backslash (\). The escape sequence for a tab is \t, so to test whether a string starts with a tab we can use this pattern:
^\t
Similarly, \n means "new line" and \r means carriage return. Other special characters can be matched by putting a backslash in front of them: the backslash itself is written \\, a period is written \., and so on.
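A small sketch of these escape sequences, again assuming preg_match() (an assumption, since the article does not name a function). The patterns are written in single-quoted PHP strings so that \t and \. reach the regex engine unchanged; a literal backslash still has to be doubled twice, once for PHP and once for the regex.

```php
<?php
$startsWithTab = '/^\t/';  // \t is interpreted by the regex engine as a tab
$literalDot    = '/\./';   // \. matches a period, not "any character"

var_dump(preg_match($startsWithTab, "\tindented line")); // int(1)
var_dump(preg_match($startsWithTab, 'plain line'));      // int(0)
var_dump(preg_match($literalDot, 'example.com'));        // int(1)
var_dump(preg_match($literalDot, 'example com'));        // int(0)

// '/\\\\/' is the PHP spelling of the regex \\ (one literal backslash).
var_dump(preg_match('/\\\\/', 'C:\\temp'));              // int(1)
var_dump(preg_match('/\\\\/', 'no backslash here'));     // int(0)
```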
Character cluster
In Internet programs, regular expressions are usually used to validate user input. After a user submits a form, you need to check whether the phone number, address, email address, credit card number and so on are valid, and ordinary literal characters are not enough for that.
So we need a more flexible way to describe the pattern we want: the character cluster (more commonly called a character class). To create a cluster that represents all vowels, put the vowels inside square brackets:
[AaEeIiOoUu]
This pattern matches any vowel, but it stands for only one character. A hyphen denotes a range of characters, for example:
[a-z] //Matches any lowercase letter
[A-Z] //Matches any capital letter
[a-zA-Z] //Matches any letter
[0-9] //Matches any digit
[0-9\.\-] //Matches any digit, period or minus sign
[ \f\r\t\n] //Matches any whitespace character
Each of these also stands for only one character, which is an important point. If you want to match a string made up of one lowercase letter followed by one digit, such as "z2", "t6" or "g7", but not "ab2", "r2d3" or "b52", use this pattern:
^[a-z][0-9]$
Although [a-z] covers a range of 26 letters, here it matches only a single character: the first character of the string must be a lowercase letter.
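The point that a cluster stands for exactly one character can be checked with the strings from the text. This sketch assumes preg_match(), which the article itself does not name; the extra test words ("bucket", "rhythm") are made up for the illustration.

```php
<?php
$pattern  = '/^[a-z][0-9]$/';   // one lowercase letter, then one digit, nothing else
$accepted = ['z2', 't6', 'g7'];
$rejected = ['ab2', 'r2d3', 'b52'];

foreach (array_merge($accepted, $rejected) as $s) {
    // Prints the string and whether the whole string fits the pattern.
    printf("%-5s %s\n", $s, preg_match($pattern, $s) === 1 ? 'match' : 'no match');
}

// A cluster used without anchors finds its one character anywhere in the string.
var_dump(preg_match('/[AaEeIiOoUu]/', 'bucket')); // int(1): contains "u" and "e"
var_dump(preg_match('/[AaEeIiOoUu]/', 'rhythm')); // int(0): no vowel
```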
As mentioned earlier, ^ marks the beginning of a string, but it has a second meaning. Inside a pair of square brackets it means "not" or "exclude", and it is often used to rule out certain characters. Continuing the previous example, suppose the first character must not be a digit:
^[^0-9][0-9]$
This pattern matches "&5", "g7" and "-2", but does not match "12" and "66". Here are a few examples of excluding specific characters:
[^a-z] //Any character except a lowercase letter
[^\\\/\^] //Any character except (\)(/)(^)
[^\"\'] //Any character except a double quote (") or a single quote (')
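The exclusion examples can be exercised in the same way. This is a sketch assuming preg_match(); the quote-checking pattern is a hypothetical variation on the last cluster above, anchored with ^ and $ and repeated with + so that it tests a whole string.

```php
<?php
$pattern = '/^[^0-9][0-9]$/';  // first character is not a digit, second is a digit

foreach (['&5', 'g7', '-2', '12', '66'] as $s) {
    printf("%-3s %s\n", $s, preg_match($pattern, $s) === 1 ? 'match' : 'no match');
}

// Hypothetical use of an excluding cluster: accept only strings without quotes.
// Quotes need no backslash for the regex engine; \' is only PHP string escaping.
$noQuotes = '/^[^"\']+$/';
var_dump(preg_match($noQuotes, 'hello world')); // int(1)
var_dump(preg_match($noQuotes, 'say "hi"'));    // int(0)
```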
The special character "." (dot, period) stands for any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and starts with any character other than a newline. Used on its own, the pattern "." matches any string that contains at least one non-newline character, so it fails only on the empty string and on strings made up entirely of newlines.
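The behaviour of the dot can be confirmed with a few calls. This sketch assumes preg_match() with its default options, under which "." does not match a newline; other engines or modifiers may behave differently.

```php
<?php
var_dump(preg_match('/^.5$/', 'a5'));  // int(1): any character, then 5
var_dump(preg_match('/^.5$/', '55'));  // int(1): the first character may be a digit too
var_dump(preg_match('/^.5$/', "\n5")); // int(0): "." does not match a newline
var_dump(preg_match('/^.5$/', 'ab5')); // int(0): three characters, not two

var_dump(preg_match('/./', 'x'));      // int(1)
var_dump(preg_match('/./', ''));       // int(0): nothing to match
var_dump(preg_match('/./', "\n"));     // int(0): only a newline
```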
PHP regular expressions have some built-in general character clusters, and the list is as follows:
Character cluster   Meaning
[[:alpha:]]         any letter
[[:digit:]]         any digit
[[:alnum:]]         any letter or digit
[[:space:]]         any whitespace character
[[:upper:]]         any capital letter
[[:lower:]]         any lowercase letter
[[:punct:]]         any punctuation mark
[[:xdigit:]]        any hexadecimal digit, equivalent to [0-9a-fA-F]
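These built-in clusters can be combined with the anchors introduced earlier. The sketch below assumes preg_match(), whose PCRE engine also accepts the [[:name:]] notation; the sample strings are made up for the illustration.

```php
<?php
var_dump(preg_match('/^[[:alpha:]]+$/', 'Hunte'));        // int(1): letters only
var_dump(preg_match('/^[[:digit:]]+$/', '2000'));         // int(1): digits only
var_dump(preg_match('/^[[:alnum:]]+$/', 'April14'));      // int(1): letters and digits
var_dump(preg_match('/^[[:alnum:]]+$/', 'April 14'));     // int(0): the space is neither
var_dump(preg_match('/^[[:xdigit:]]+$/', 'ff00A9'));      // int(1): hexadecimal digits
var_dump(preg_match('/^[[:xdigit:]]+$/', 'ff00G9'));      // int(0): "G" is not hexadecimal
var_dump(preg_match('/[[:space:]]/', "tab\there"));       // int(1): contains a tab
var_dump(preg_match('/[[:punct:]]/', 'no punctuation'));  // int(0)
```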