SoFunction
Updated on 2025-03-02

A Thorough Study of Java Regular Expressions

package testreg;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* <p>Title: A Study of Regular Expressions</p>
* <p>Description:
* Regular expression questions keep coming up in my work lately. Searching the Internet mostly turns up
* fragments, and asking other people is no better than working it out myself. So I used the free time
* between two phases of our project to study J2SE's regular expression support thoroughly. It cost the
* department nearly twelve sheets of scratch paper. Enough chat; on to the subject.
* Principle:
*    A regular expression is implemented as a finite state automaton. The automaton has a finite number
*    of internal states, including an initial state and an end state, and it decides what to do next
*    from the input and its current state. I learned this long ago and no longer remember the details,
*    so I only mention it in passing.
* Regular expressions in Java:
*    The java.util.regex package has three classes. Pattern represents the pattern, that is, the
*    compiled regular expression itself. Matcher is the automaton that runs a pattern against an input,
*    although most of the matching work is actually done inside Pattern; Matcher often simply calls
*    into Pattern (I do not know which design pattern this is). Both classes are classic pieces of code
*    with many algorithms that reward careful study. The third class, PatternSyntaxException, is a
*    runtime exception thrown when a regular expression has a syntax error.
* Several difficulties:
*  1. Line terminators
*     A line terminator is a sequence of one or two characters that marks the end of a line. Java
*     recognizes the following line terminators:
*       A newline (line feed) character ('\n')                                  LF (0A)
*       A carriage-return character followed immediately by a newline ("\r\n")   CR + LF (0D0A)
*       A standalone carriage-return character ('\r')                           CR (0D)
*       A next-line character ('\u0085')                                        NEL, a control character
*       A line-separator character ('\u2028')
*       A paragraph-separator character ('\u2029')
*     If UNIX_LINES mode is activated, the only line terminator recognized is the newline character
*     '\n'. The mode is enabled as follows:
*       Pattern p = Pattern.compile("regular expression", Pattern.UNIX_LINES);
*     or
*       Pattern p = Pattern.compile("(?d)regular expression");
*     "." matches any character except a line terminator (unless DOTALL is specified).
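* A minimal sketch of the line-terminator rules above, assuming a hypothetical helper class
* LineTerminatorDemo that is not part of the original listing. It checks which single characters
* "." accepts by default, with UNIX_LINES, and with DOTALL.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the class and method names are illustrative only.
final class LineTerminatorDemo {

    // True if a single "." compiled with the given flags matches the whole input.
    static boolean dotMatches(String input, int flags) {
        return Pattern.compile(".", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] names = {"LF", "CR", "U+0085", "U+2028", "U+2029"};
        String[] chars = {"\n", "\r", "\u0085", "\u2028", "\u2029"};
        for (int i = 0; i < chars.length; i++) {
            System.out.println(names[i]
                    + "  default=" + dotMatches(chars[i], 0)
                    + "  UNIX_LINES=" + dotMatches(chars[i], Pattern.UNIX_LINES)
                    + "  DOTALL=" + dotMatches(chars[i], Pattern.DOTALL));
        }
    }
}
```

* By default every character in the table is rejected; with UNIX_LINES only LF is rejected; with
* DOTALL all of them are accepted.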
*
*  2. Greedy, Reluctant and Possessive quantifiers
*     These terms are hard to translate well; the original documentation calls them Greedy Quantifiers,
*     Reluctant Quantifiers and Possessive Quantifiers. Fortunately their meaning is clear, and I
*     explain them in detail later.
*  3. Understanding forms such as [a-zA-Z], [a-d[h-i]], [^a-f], [b-f&&[a-z]], [b-f&&[^cd]]
*     The documentation describes these as range, union, negation, intersection and subtraction:
*       range         [a-z]           lowercase letters from a to z
*       negation      [^a-f]          every character except a-f (the universe is all characters)
*       union         [a-d[h-i]]      a-d or h-i
*       subtraction   [b-f&&[^cd]]    everything in b-f except c and d
*       intersection  [b-f&&[a-z]]    the characters common to b-f and a-z
*     In summary, square brackets denote a set. Its elements can be enumerated, as in [abcd], but when
*     there are too many to list (nobody wants to write out every letter from a to z), a range such as
*     a-z is used inside the brackets. Intersection is written with &&, union is written by simple
*     juxtaposition, and subtraction (set difference) is expressed as intersection with a negation.
*     The forms above can therefore also be written as:
*       [[a-z][A-Z]], [[a-d][h-i]], [^a-f], [[b-f]&&[a-z]], [[b-f]&&[^cd]]
*     which is closer to ordinary set notation.
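* The set operations above can be checked with a small sketch. CharClassDemo and its accepted()
* helper are hypothetical names introduced only for illustration; the helper lists which lowercase
* letters a character class accepts.

```java
// Hypothetical demo class; not part of the original listing.
final class CharClassDemo {

    // Returns the lowercase letters a..z that the given character class matches.
    static String accepted(String charClass) {
        StringBuilder sb = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            if (String.valueOf(c).matches(charClass)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("union        [a-d[h-i]]   -> " + accepted("[a-d[h-i]]"));   // abcdhi
        System.out.println("intersection [b-f&&[a-z]] -> " + accepted("[b-f&&[a-z]]")); // bcdef
        System.out.println("subtraction  [b-f&&[^cd]] -> " + accepted("[b-f&&[^cd]]")); // bef
        System.out.println("negation     [^a-f]       -> " + accepted("[^a-f]"));       // g through z
    }
}
```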
*  4. The meaning of each flag
*     When compiling a pattern, one or more flags can be given to control how matching is done:
*       Pattern p = Pattern.compile(".*a?", Pattern.UNIX_LINES);
*     Several flags are combined with the "|" operator:
*       Pattern p = Pattern.compile(".*a?", Pattern.UNIX_LINES | Pattern.DOTALL);
*     Flags can also be embedded in the expression itself:
*       Pattern p = Pattern.compile("(?d).*a?");
*       Pattern p = Pattern.compile("(?d)(?s).*a?");
*     These two definitions are equivalent to the previous two, respectively.
*     All the flags are as follows:
*       Constant                    Embedded flag   Effect
*       Pattern.CANON_EQ            none            Enables canonical equivalence
*       Pattern.CASE_INSENSITIVE    (?i)            Enables case-insensitive matching
*       Pattern.COMMENTS            (?x)            Permits whitespace and comments in the pattern
*       Pattern.MULTILINE           (?m)            Enables multiline mode
*       Pattern.DOTALL              (?s)            Enables dotall mode
*       Pattern.UNICODE_CASE        (?u)            Enables Unicode-aware case folding
*       Pattern.UNIX_LINES          (?d)            Enables Unix lines mode

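*     CANON_EQ enables canonical equivalence: two characters are considered to match if and only if
*     their full canonical decompositions match. This concerns composed and decomposed Unicode forms
*     of the same character, not different encodings of a plain letter such as 'a'. The JDK
*     documentation's example is that "a\u030A" matches the string "\u00E5" when this flag is set.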
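* A minimal sketch of the JDK documentation's CANON_EQ example. CanonEqDemo is a hypothetical
* class name used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original listing.
final class CanonEqDemo {

    // True if the regex, compiled with the given flags, matches the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String decomposed = "a\u030A"; // 'a' followed by a combining ring above
        String composed = "\u00E5";    // the single precomposed character
        System.out.println("without CANON_EQ: " + matches(decomposed, 0, composed));
        System.out.println("with CANON_EQ:    " + matches(decomposed, Pattern.CANON_EQ, composed));
    }
}
```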

*     CASE_INSENSITIVE enables case-insensitive matching. This is easy to understand, but note that by
*     default the flag only applies to US-ASCII characters. For Unicode-aware case-insensitive matching,
*     UNICODE_CASE must be specified as well, that is, CASE_INSENSITIVE | UNICODE_CASE or (?i)(?u).
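* The ASCII-only default can be seen in a short sketch. CaseFoldingDemo is a hypothetical class
* name; the test characters (lowercase and uppercase e-acute) are chosen only as an example of
* non-ASCII letters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original listing.
final class CaseFoldingDemo {

    // True if the regex, compiled with the given flags, matches the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00E9"; // lowercase e-acute
        String upper = "\u00C9"; // uppercase e-acute
        int both = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        System.out.println(matches("a", Pattern.CASE_INSENSITIVE, "A"));     // true: ASCII letter
        System.out.println(matches(lower, Pattern.CASE_INSENSITIVE, upper)); // false: non-ASCII letter
        System.out.println(matches(lower, both, upper));                     // true
        System.out.println(matches("(?i)(?u)" + lower, 0, upper));           // true: embedded flags
    }
}
```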

*     COMMENTS permits comments and ignores whitespace in the pattern, so ".*a" is equivalent to
*     ".  *a #this is comments". I expect this is most useful when a very large regular expression is
*     kept in a file; I doubt it is used much otherwise.
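* The equivalence claimed above can be sketched as follows. CommentsFlagDemo is a hypothetical
* class name, and the input "bbba" is an arbitrary example string.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original listing.
final class CommentsFlagDemo {

    // True if the regex, compiled with the given flags, matches the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String commented = ".  *a #this is comments";
        System.out.println(matches(".*a", 0, "bbba"));                    // true
        System.out.println(matches(commented, Pattern.COMMENTS, "bbba")); // true: blanks and comment ignored
        System.out.println(matches(commented, 0, "bbba"));                // false: blanks and '#' are literal
    }
}
```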
*     
*     MULTILINE: in multiline mode the expressions ^ and $ match just after or just before,
*     respectively, a line terminator or the end of the input sequence. By default these expressions
*     only match at the beginning and the end of the entire input sequence.
*     In other words, by default ^ and $ match only the beginning and the end of the whole input. In
*     multiline mode they additionally match around line terminators inside the input: ^ matches just
*     after a line terminator and $ matches just before one (not the other way round).
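* A minimal sketch of the difference, reusing the two-line input from the MULTILINE test further
* down. MultilineDemo and findAll() are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original listing.
final class MultilineDemo {

    // Collects every substring that find() reports for the given regex and flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "hello,Worldoo\nhello,loveroo";
        // Default mode: $ cannot match before the inner newline, so nothing is found.
        System.out.println(findAll("^hell.*oo$", 0, input));                 // []
        // Multiline mode: each line is matched separately.
        System.out.println(findAll("^hell.*oo$", Pattern.MULTILINE, input)); // [hello,Worldoo, hello,loveroo]
    }
}
```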
*     
*     DOTALL: when this flag is specified, "." matches any character, including line terminators.
*
*     UNIX_LINES: when this flag is specified, only \n is treated as a line terminator; \r and \r\n
*     are not.
*
*  I cannot recall the remaining points right now; they are covered where they come up in the examples.
* </p>
*/
public class TestReg2
{

   public static void main(String[] args)
   {
       String str1 = "";
       Object str = "";
       // Note: regex escapes such as \s, \w, \d and \b must be written as \\s, \\w, \\d and \\b in Java
       // string literals. "\w" and "\d" do not compile, and "\b" silently becomes a backspace character.
       // \s matches whitespace, including the space character, \t, \n and \r
       System.out.println("\\s matches space, \\t, \\n and \\r: " + " \t\n\r".matches("\\s{4}"));
       // \S is the inverse of \s
       System.out.println("\\S is the inverse of \\s: " + "/".matches("\\S"));
       // . does not match \r or \n
       System.out.println(". does not match \\r or \\n: " + "\r".matches("."));
       System.out.println("\n".matches("."));

       // \w matches letters, digits and underscores
       System.out.println("\\w matches letters, digits and underscores: " + "a8_".matches("\\w\\w\\w"));
       // \W is the inverse of \w
       System.out.println("\\W is the inverse of \\w: " + "&_".matches("\\W\\w"));
       // \d matches a digit
       System.out.println("\\d matches a digit: " + "8".matches("\\d"));
       // \D is the inverse of \d
       System.out.println("\\D is the inverse of \\d: " + "%".matches("\\D"));
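       // Both forms match a newline, but they are different patterns
       System.out.println("======================");
       System.out.println("a literal newline in the pattern matches a newline: " + "\n".matches("\n"));
       System.out.println("the regex escape \\n matches a newline: " + "\n".matches("\\n"));
       System.out.println("======================");
       // The same holds for a carriage return
       System.out.println("\r".matches("\r"));
       System.out.println("\r".matches("\\r"));
       System.out.println("======================");
       // ^ matches the beginning of the input
       System.out.println("^ matches the beginning: " + "hell".matches("^hell"));
       System.out.println("abc\nhell".matches("^hell"));
       // $ matches the end of the input
       System.out.println("$ matches the end: " + "my car\nabc".matches(".*ar$"));
       System.out.println("my car".matches(".*ar$"));
       // \b matches a word boundary
       System.out.println("\\b matches a word boundary: " + "bomb".matches("\\bbom."));
       System.out.println("bomb".matches(".*mb\\b"));
       // \B is the inverse of \b
       System.out.println("\\B is the inverse of \\b: " + "abc".matches("\\Babc"));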

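       // [a-z] matches the lowercase letters a to z
       System.out.println("[a-z] matches lowercase letters: " + "s".matches("[a-z]"));
       System.out.println("S".matches("[A-Z]"));
       System.out.println("9".matches("[0-9]"));

       // Negation
       System.out.println("negation: " + "s".matches("[^a-z]"));
       System.out.println("S".matches("[^A-Z]"));
       System.out.println("9".matches("[^0-9]"));

       // Parentheses group a subexpression so that a quantifier can apply to it
       System.out.println("grouping with parentheses: " + "aB9".matches("[a-z][A-Z][0-9]"));
       System.out.println("aB9bC6".matches("([a-z][A-Z][0-9])+"));
       // Alternation (or)
       System.out.println("alternation: " + "two".matches("two|to|2"));
       System.out.println("to".matches("two|to|2"));
       System.out.println("2".matches("two|to|2"));

       // [a-zA-Z] == [a-z]|[A-Z]
       System.out.println("[a-zA-Z]==[a-z]|[A-Z]: " + "a".matches("[a-zA-Z]"));
       System.out.println("A".matches("[a-zA-Z]"));
       System.out.println("a".matches("[a-z]|[A-Z]"));
       System.out.println("A".matches("[a-z]|[A-Z]"));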

       // Compare the following four: inside a character class ")" and "_" are ordinary characters,
       // and "-" is literal when it comes last
       System.out.println("Compare the following four\n==========================");
       System.out.println(")".matches("[a-zA-Z)]"));
       System.out.println(")".matches("[a-zA-Z)_-]"));
       System.out.println("_".matches("[a-zA-Z)_-]"));
       System.out.println("-".matches("[a-zA-Z)_-]"));
       System.out.println("==========================");
       System.out.println("b".matches("[abc]"));
       // [a-d[f-h]] == [a-df-h]
       System.out.println("[a-d[f-h]]==[a-df-h]: " + "h".matches("[a-d[f-h]]"));
       System.out.println("a".matches("[a-z&&[def]]"));
       // Intersection
       System.out.println("intersection: " + "a".matches("[a-z&&[def]]"));
       System.out.println("b".matches("[[a-z]&&[e]]"));
       // Union
       System.out.println("union: " + "9".matches("[[a-c][0-9]]"));
       // [a-z&&[^bc]] == [ad-z]
       System.out.println("[a-z&&[^bc]]==[ad-z]: " + "b".matches("[a-z&&[^bc]]"));
       System.out.println("d".matches("[a-z&&[^bc]]"));
       // [a-z&&[^m-p]] == [a-lq-z]
       System.out.println("[a-z&&[^m-p]]==[a-lq-z]: " + "d".matches("[a-z&&[^m-p]]"));
       System.out.println("a".matches("\\p{Lower}"));
       // Note the usage of \b below. Written directly in a string literal, \b means backspace,
       // so the regex word boundary must be written as \\b.
       System.out.println("*********************************");
       System.out.println("aawordaa".matches(".*\\bword\\b.*"));
       System.out.println("a word a".matches(".*\\bword\\b.*"));
       System.out.println("aawordaa".matches(".*\\Bword\\B.*"));
       System.out.println("a word a".matches(".*\\Bword\\B.*"));
       System.out.println("a word a".matches(".*word.*"));
       System.out.println("aawordaa".matches(".*word.*"));
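       // Using groups
       // Groups are numbered by counting opening parentheses "(" from left to right: the first "(" opens
       // group 1, the second opens group 2, and so on. Group 0 stands for the entire expression.
       System.out.println("**************test group**************");
       Pattern p = Pattern.compile("(([abc]+)([123]+))([-_%]+)");
       Matcher m = p.matcher("aac212-%%");
       System.out.println(m.matches());
       m = p.matcher("cccc2223%_%_-");
       System.out.println(m.matches());
       System.out.println("======test group======");
       System.out.println(m.group());
       System.out.println(m.group(0));
       System.out.println(m.group(1));
       System.out.println(m.group(2));
       System.out.println(m.group(3));
       System.out.println(m.group(4));
       System.out.println(m.groupCount());
       System.out.println("========test end()=========");
       System.out.println(m.end());
       System.out.println(m.end(2));
       System.out.println("==========test start()==========");
       System.out.println(m.start());
       System.out.println(m.start(2));
       // Test backreferences
       Pattern pp1 = Pattern.compile("(\\d)\\1"); // the same digit must appear twice in a row
       // \1 refers to the first group and \n to the nth group. It must be written \\1 in a string
       // literal, because \1 on its own is an octal character escape.
       Matcher mm1 = pp1.matcher("3345"); // "33" matches, "45" does not
       System.out.println("test backreference");
       System.out.println(mm1.find());
       System.out.println(mm1.find());

       // Note the differences below
       System.out.println("==============test find()=========");
       System.out.println(m.find());
       System.out.println(m.find(2));

       System.out.println("Groups of the match found when searching from the third character (index=2)");
       System.out.println(m.group());
       System.out.println(m.group(0));
       System.out.println(m.group(1));
       System.out.println(m.group(2));
       System.out.println(m.group(3));
       m.reset();
       System.out.println(m.find());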
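       // One pattern can match the same string several times
       System.out.println("Test that one pattern can match a string several times");
       Pattern p1 = Pattern.compile("a{2}");
       Matcher m1 = p1.matcher("aaaaaa");
       // matches() tests the pattern against the whole string
       System.out.println(m1.matches());
       System.out.println(m1.find());
       System.out.println(m1.find());
       System.out.println(m1.find());
       System.out.println(m1.find());
       // Test matches() again
       System.out.println("Retest matches()");
       Pattern p2 = Pattern.compile("(a{2})*");
       Matcher m2 = p2.matcher("aaaa");
       System.out.println(m2.matches());
       System.out.println(m2.matches());
       System.out.println(m2.matches());
       // So find() looks for the pattern somewhere in the string, while matches() requires the whole string to match
       // Test lookingAt()
       System.out.println("test lookingAt()");
       Pattern p3 = Pattern.compile("a{2}");
       Matcher m3 = p3.matcher("aaaa");
       System.out.println(m3.lookingAt());
       System.out.println(m3.lookingAt());
       System.out.println(m3.lookingAt());
       System.out.println(m3.lookingAt());
       // Summary: matches() must match the entire input and always starts from the beginning. find() matches
       // part of the input and continues from the end of the previous match. lookingAt() also always starts
       // from the beginning, but only needs to match a prefix.
       System.out.println("======= test blank line ====================");
       System.out.println("         \n".matches("^[ \\t]*$\\n"));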

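       // Demonstrate appendReplacement() and appendTail()
       System.out.println("=================test append====================");
       Pattern p4 = Pattern.compile("cat");
       Matcher m4 = p4.matcher("one cat two cats in the yard");
       StringBuffer sb = new StringBuffer();
       boolean result = m4.find();
       int i = 0;
       System.out.println("one cat two cats in the yard");
       while (result)
       {
           m4.appendReplacement(sb, "dog");
           System.out.println(m4.group());
           System.out.println("Pass " + (i++) + ": " + sb.toString());
           result = m4.find();
       }
       System.out.println(sb.toString());
       m4.appendTail(sb);
       System.out.println(sb.toString());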

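       // test UNIX_LINES set through the flag constant
       System.out.println("test UNIX_LINES");
       Pattern p5 = Pattern.compile(".", Pattern.UNIX_LINES);
       Matcher m5 = p5.matcher("\n\r");
       System.out.println(m5.find());
       System.out.println(m5.find());

       // test UNIX_LINES set through the embedded flag (?d)
       System.out.println("test UNIX_LINES");
       Pattern p6 = Pattern.compile("(?d).");
       Matcher m6 = p6.matcher("\n\r");
       System.out.println(m6.find());
       System.out.println(m6.find());

       // for comparison, the same pattern without UNIX_LINES
       System.out.println("test UNIX_LINES");
       Pattern p7 = Pattern.compile(".");
       Matcher m7 = p7.matcher("\n\r");
       System.out.println(m7.find());
       System.out.println(m7.find());

       // test CASE_INSENSITIVE: flag constant, embedded flag (?i), then no flag for comparison
       System.out.println("test CASE_INSENSITIVE");
       Pattern p8 = Pattern.compile("a", Pattern.CASE_INSENSITIVE);
       Matcher m8 = p8.matcher("aA");
       System.out.println(m8.find());
       System.out.println(m8.find());
       System.out.println("test CASE_INSENSITIVE");
       Pattern p9 = Pattern.compile("(?i)a");
       Matcher m9 = p9.matcher("aA");
       System.out.println(m9.find());
       System.out.println(m9.find());
       System.out.println("test CASE_INSENSITIVE");
       Pattern p10 = Pattern.compile("a");
       Matcher m10 = p10.matcher("aA");
       System.out.println(m10.find());
       System.out.println(m10.find());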

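       // test COMMENTS: flag constant, then the embedded flag (?x)
       System.out.println("test COMMENTS");
       Pattern p11 = Pattern.compile(" a a #ccc", Pattern.COMMENTS);
       Matcher m11 = p11.matcher("aa a a #ccc");
       System.out.println(m11.find());
       System.out.println(m11.find());
       System.out.println("test COMMENTS");
       Pattern p12 = Pattern.compile("(?x) a a #ccc");
       Matcher m12 = p12.matcher("aa a a #ccc");
       System.out.println(m12.find());
       System.out.println(m12.find());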

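       // test MULTILINE; see the explanation of multiline mode in the class comment above
       System.out.println("test MULTILINE");
       Pattern p13 = Pattern.compile("^.?", Pattern.MULTILINE | Pattern.DOTALL);
       Matcher m13 = p13.matcher("helloohelloo,loveroo");
       System.out.println(m13.find());
       System.out.println("start:" + m13.start() + "end:" + m13.end());
       System.out.println(m13.find());
       // Left commented out: start() and end() throw IllegalStateException after a failed find()
       //System.out.println("start:" + m13.start() + "end:" + m13.end());
       System.out.println("test MULTILINE");
       Pattern p14 = Pattern.compile("(?m)^hell.*oo$", Pattern.DOTALL);
       Matcher m14 = p14.matcher("hello,Worldoo\nhello,loveroo");
       System.out.println(m14.find());
       System.out.println("start:" + m14.start() + "end:" + m14.end());
       System.out.println(m14.find());
       //System.out.println("start:" + m14.start() + "end:" + m14.end());
       System.out.println("test MULTILINE");
       Pattern p15 = Pattern.compile("^hell(.|[^.])*oo$");
       Matcher m15 = p15.matcher("hello,Worldoo\nhello,loveroo");
       System.out.println(m15.find());
       System.out.println("start:" + m15.start() + "end:" + m15.end());
       System.out.println(m15.find());
       //System.out.println("start:" + m15.start() + "end:" + m15.end());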

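       // test DOTALL: flag constant, no flag for comparison, then the embedded flag (?s)
       System.out.println("test DOTALL");
       Pattern p16 = Pattern.compile(".", Pattern.DOTALL);
       Matcher m16 = p16.matcher("\n\r");
       System.out.println(m16.find());
       System.out.println(m16.find());

       System.out.println("test DOTALL");
       Pattern p17 = Pattern.compile(".");
       Matcher m17 = p17.matcher("\n\r");
       System.out.println(m17.find());
       System.out.println(m17.find());

       System.out.println("test DOTALL");
       Pattern p18 = Pattern.compile("(?s).");
       Matcher m18 = p18.matcher("\n\r");
       System.out.println(m18.find());
       System.out.println(m18.find());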

       // test CANON_EQ. This is the example from the JDK documentation: with CANON_EQ,
       // the pattern "a\u030A" matches the string "\u00E5".
       System.out.println("test CANON_EQ");
       Pattern p19 = Pattern.compile("a\u030A", Pattern.CANON_EQ);
       (('\u030A'));
       ("is"+('\u030A'));
       ("is"+('\u030A'));
       (('\u00E5'));
       ("is"+('\u00E5'));
       Matcher m19 = p19.matcher("\u00E5");
       System.out.println(m19.matches());
       (('\u0085'));
       ("is"+('\u0085'));

       // The next three examples show the differences between greedy, reluctant and possessive quantifiers
       Pattern ppp = Pattern.compile(".*foo");
       Matcher mmm = ppp.matcher("xfooxxxxxxfoo");
       /**
        * Greedy quantifiers
        *   X?      X, once or not at all
        *   X*      X, zero or more times
        *   X+      X, one or more times
        *   X{n}    X, exactly n times
        *   X{n,}   X, at least n times
        *   X{n,m}  X, at least n but not more than m times
        * Greedy quantifiers are the most commonly used. A greedy quantifier first matches as many
        * characters as possible; if that makes the rest of the expression fail, it gives back one
        * character and tries again, and so on. For example, when .*foo is matched against
        * xfooxxxxxxfoo, .* first consumes the entire input. The overall match then fails because
        * nothing is left for foo, so .* gives back one character at a time until the trailing foo
        * can match. Because it starts by consuming as much as it can, it is called greedy.
        */
       boolean isEnd = false;
       int k = 0;
       System.out.println("==========");
       System.out.println("xfooxxxxxxfoo");
       while (isEnd == false)
       try {
           System.out.println("the:" + k++);
           System.out.println(mmm.find());
           System.out.println(mmm.end());
       } catch (Exception e) {
           isEnd = true;
       }
       isEnd=false;
       Pattern ppp1 = Pattern.compile(".*?foo");
       Matcher mmm1 = ppp1.matcher("xfooxxxxxxfoo");
       /**
        * Reluctant quantifiers
        *   X??       X, once or not at all
        *   X*?       X, zero or more times
        *   X+?       X, one or more times
        *   X{n}?     X, exactly n times
        *   X{n,}?    X, at least n times
        *   X{n,m}?   X, at least n but not more than m times
        * A reluctant quantifier works the opposite way: it starts with the smallest possible match
        * and, if that makes the overall match fail, consumes one more character and tries again.
        * For example, when .*?foo is matched against xfooxxxxxxfoo, .*? first matches the empty
        * string. The overall match fails, so it consumes one x, and the first match, xfoo, succeeds.
        * For the second match .*? again starts from the empty string and consumes one x at a time;
        * only after it has taken all the x's in the middle does the match succeed. It behaves a bit
        * like a hired hand who does no more work than necessary.
        */
       k = 0;
       System.out.println("?????????????????????");
       System.out.println("xfooxxxxxxfoo");
       while (isEnd == false)
       try {
           System.out.println("the:" + k++);
           System.out.println(mmm1.find());
           System.out.println(mmm1.end());
       } catch (Exception e) {
           isEnd = true;
       }
       isEnd=false;
       Pattern pp2 = Pattern.compile(".*+foo");
       Matcher mm2 = pp2.matcher("xfooxxxxxxfoo");
       /**
        * Possessive quantifiers
        *   X?+        X, once or not at all
        *   X*+        X, zero or more times
        *   X++        X, one or more times
        *   X{n}+      X, exactly n times
        *   X{n,}+     X, at least n times
        *   X{n,m}+    X, at least n but not more than m times
        * Possessive quantifiers work like greedy ones, except that they never back off: once a
        * possessive quantifier has swallowed every character it can in one bite, it will not give
        * any back, even if the overall match then fails. It behaves like a landlord who is greedy
        * but stubborn, hence the name possessive.
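        * The three behaviours can be compared side by side in a short sketch. QuantifierDemo and
        * findAll() are hypothetical names introduced for illustration; the input is the one used
        * in the tests in this method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original listing.
final class QuantifierDemo {

    // Collects every substring that successive find() calls report.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println("greedy     " + findAll(".*foo", input));  // [xfooxxxxxxfoo]
        System.out.println("reluctant  " + findAll(".*?foo", input)); // [xfoo, xxxxxxfoo]
        System.out.println("possessive " + findAll(".*+foo", input)); // []
    }
}
```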
        */

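       int ii = 0;
       System.out.println("+++++++++++++++++++++++++++");
       System.out.println("xfooxxxxxxfoo");
       while (isEnd == false)
       try {
           System.out.println("the:" + ii++);
           System.out.println(mm2.find());
           System.out.println(mm2.end());
       } catch (Exception e) {
           isEnd = true;
       }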
   } 
}
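
The listing's summary of matches(), find() and lookingAt() can be condensed into a small standalone sketch. MatchModesDemo and its three helpers are hypothetical names, and the five-character input is chosen only so that the whole-input match fails while prefix and partial matches succeed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original listing.
final class MatchModesDemo {

    // matches(): the pattern must cover the entire input.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // lookingAt(): the pattern must match a prefix, always starting at the beginning.
    static boolean prefixMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        // Calling it twice shows that it does not continue from the previous match.
        return m.lookingAt() && m.lookingAt();
    }

    // find(): each call continues after the end of the previous match.
    static int countFinds(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("matches:   " + wholeMatch("a{2}", "aaaaa"));  // false
        System.out.println("lookingAt: " + prefixMatch("a{2}", "aaaaa")); // true
        System.out.println("find x" + countFinds("a{2}", "aaaaa"));       // find x2
    }
}
```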