Playing with regular expressions in JS: syntax highlighting

After a few days of studying regular expressions, it is about time to summarize and organize what I learned. I had wanted to write a syntax highlighter with regex matching before, but my skills were not up to it and I could not follow the examples.

So let's analyze the syntax highlighting implementations of two masters, cobalt carbonate and Barret Lee.

Let's start with Barret Lee's article, 《A few small examples teach you how to implement regular expression highlight》.

When I first read it, it felt like magic, especially the example that matches step by step, which is even more impressive. However, the author also says that the steps are separated only for the sake of demonstration, so that you can see very intuitively what each step matches. If everything were matched in one pass, the processing would be over before you knew what had happened.
Take a look at his rules:


(/^\s+|\s+$/) // Match leading and trailing whitespace
(/(["'])(?:\\.|[^\\\n])*?\1/) // Match strings
(/\/(?!\*|span).+\/(?!span)[gim]*/) // Match regex literals. He added the span checks in an earlier step; I don't think they belong here.
(/(\/\/.*|\/\*[\S\s]+?\*\/)/) // Match comments
(/(\*\s*)(@\w+)(?=\s*)/) // Match the @ tags inside comments
(/\b(break|continue|do|for|in|function|if|else|return|switch|throw|try|catch|finally|var|while|with|case|new|typeof|instanceof|delete|void|Object|Array|String|Number|Boolean|Function|RegExp|Date|Math|window|document|navigator|location|true|false|null|undefined|NaN)\b/) // Match keywords
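To make the step-by-step idea concrete, here is a minimal sketch that applies the string rule as its own pass and then looks at what a later regex-literal pass would see. The sample line and the variable names are made up for illustration, and the explanation of the span lookaheads is my reading of the rule, not something the article states.

```javascript
// Rules copied from the list above; everything else is illustrative.
var stringRule = /(["'])(?:\\.|[^\\\n])*?\1/g;
var naiveRegexRule = /\/(?!\*).+\/[gim]*/;               // same rule without the span lookaheads
var guardedRegexRule = /\/(?!\*|span).+\/(?!span)[gim]*/; // rule as listed

// The sample text is: var s = "a\"b"; // note
var sample = 'var s = "a\\"b"; // note';

// Pass 1: wrap every string in a span ($& is the matched text).
var step1 = sample.replace(stringRule, '<span class="str">$&</span>');

// Pass 2 now runs on text that already contains "</span>".
// Without the lookahead, the slash of the closing tag starts a bogus regex literal.
var naiveHit = naiveRegexRule.exec(step1);
var guardedHit = guardedRegexRule.exec(step1);

console.log(step1);
console.log(naiveHit && naiveHit[0]);
console.log(guardedHit);
```

Because each pass works on the output of the previous one, every later rule has to avoid the markup that earlier rules inserted. That is presumably why the span checks appear in the regex rule, and it is one reason a single combined pass is easier to keep correct.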

Brother Beard probably did not want to reinvent the wheel; he just wanted to work out how such a wheel is made, so he stopped once it worked, without in-depth or detailed handling, and the result is relatively rough.
Of course, I am not criticizing him; this is just a brief comment. After all, there are many excellent syntax highlighting plug-ins, so there is no need to write another one yourself. Learning the principles is enough.

Next, let's analyze cobalt carbonate's article, 《How to implement regular expression JavaScript code highlighting》.
In fact, that article already analyzes everything in detail, so I will only explain it briefly.
Cobalt carbonate has always been a rigorous thinker. I once spent more than an hour on this article and only got a rough idea of it. This time I analyzed it again and implemented it myself, which took half a day.
But it was well worth it; I really learned a lot.

Let’s take a look at the general logic first.


(\/\/.*|\/\*[\S\s]+?\*\/) // Match comments
((["'])(?:\\.|[^\\\n])*?\3) // Match strings
\b(break|continue|do|for|in|function|if|else|return|switch|this|throw|try|catch|finally|var|while|with|case|new|typeof|instanceof|delete|void)\b // Match keywords
\b(Object|Array|String|Number|Boolean|Function|RegExp|Date|Math|window|document|navigator|location)\b // Match built-in objects
\b(true|false)\b // Match boolean values
\b(null|undefined|NaN)\b // Match the various empty values. I think these fit better in the same group as the boolean values.
(?:[^\W\d]|\$)[\$\w]* // Match ordinary variable names
(0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?(?:[eE]\d+)?) // Match numbers (nothing is asserted in front of the match, so there will be problems here)
(?:[^\)\]\}]|^)(\/(?!\*)(?:\\.|[^\\\/\n])+?\/[gim]*) // Match regex literals
[\S\s] // Match any other character

The original explanation of the final [\S\s]: every character must be matched, because every one of them has to be HTML-escaped once.
The article then gives the detailed code.
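The role of the trailing [\S\s] is easier to see with a cut-down pattern. The sketch below keeps only the comment branch and compares a pattern with the fallback against one without it. The helper names escapeHtml and highlightComments are hypothetical and are not taken from either article.

```javascript
// Hypothetical helper: escapes only &, < and >.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Group 1 is a comment; the bare [\S\s] is the catch-all branch.
var withFallback = /(\/\/.*|\/\*[\S\s]+?\*\/)|[\S\s]/g;
var withoutFallback = /(\/\/.*|\/\*[\S\s]+?\*\/)/g;

function highlightComments(source, pattern) {
  return source.replace(pattern, function (match, comment) {
    if (comment) {
      return '<span class="com">' + escapeHtml(comment) + "</span>";
    }
    // Catch-all branch: a single ordinary character, escaped and passed through.
    return escapeHtml(match);
  });
}

var source = "a<b // x>y";
var full = highlightComments(source, withFallback);
var partial = highlightComments(source, withoutFallback);

console.log(full);    // the "<" before the comment is escaped
console.log(partial); // the "<" before the comment is left raw
```

With the fallback, the callback sees every character exactly once, so escaping and highlighting can happen in the same pass. Without it, text between tokens never reaches the callback and would need a separate escaping step.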

It is a very good article. I have read it at least 10 times and only more or less understood it two days ago.

However, the code has some minor flaws: for example, it cannot match strings continued onto the next line with a trailing backslash, and the string matching can be optimized.
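The line-continuation problem can be checked directly. This sketch runs the quote-capturing string rule and the revised string rule that appears later in this article against a string that continues onto the next line; the variable names are made up for illustration.

```javascript
// Quote-capturing rule (standalone form, so the backreference is \1).
var oldStringRule = /(["'])(?:\\.|[^\\\n])*?\1/;
// Revised rule from later in this article: \\[\s\S] lets a backslash escape a newline too.
var newStringRule = /("(?:[^"\\]|\\[\s\S])*"|'(?:[^'\\]|\\[\s\S])*')/;

// Source text being highlighted:
//   var str3 = "123\
//   456";
var continued = 'var str3 = "123\\\n456";';

var oldMatch = oldStringRule.exec(continued); // "." cannot match the newline after the backslash
var newMatch = newStringRule.exec(continued);

console.log(oldMatch);
console.log(newMatch && newMatch[0]);
```

One trade-off to be aware of: [^"\\] in the revised rule also accepts a bare newline, so an unterminated string will run on until the next quote instead of stopping at the end of the line.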

In addition, the number matching is not comprehensive enough: it only handles forms such as 0xff, 12.34 and 1e3, while forms such as .123 and 12.3e+3 are not matched.
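A quick way to compare the two number rules is to anchor each one and test it against the formats mentioned above. The helper matchesWhole is hypothetical; the second pattern is the revised number rule given later in this article.

```javascript
var oldNumber = /(0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?(?:[eE]\d+)?)/;
var newNumber = /(0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|\.\d+(?:[eE][+-]?\d+)?)/;

// Hypothetical helper: true only if the rule matches the entire text.
function matchesWhole(rule, text) {
  return new RegExp("^(?:" + rule.source + ")$").test(text);
}

var samples = ["0xff", "12.34", "1e3", ".123", "12.3e+3"];

var oldResults = samples.map(function (s) { return matchesWhole(oldNumber, s); });
var newResults = samples.map(function (s) { return matchesWhole(newNumber, s); });

console.log(oldResults);
console.log(newResults);
```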
I think the keyword order can also be optimized slightly.
A traditional NFA engine tries the branches of an alternation from left to right and stops trying the remaining branches as soon as one of them matches.
Therefore, putting the most common keywords first can improve performance a little.
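The left-to-right behavior is easy to observe with two overlapping keywords. This is only a sketch of the matching semantics; whether reordering gives a measurable speed-up depends on the engine and the input, and I have not measured it.

```javascript
// Without word boundaries, the first branch that matches wins.
var unanchored = /in|instanceof/;
var reordered = /instanceof|in/;
// With \b on both sides, "in" is tried first, fails at the trailing \b,
// and the engine backtracks into the later branch.
var bounded = /\b(in|instanceof)\b/;

var first = unanchored.exec("instanceof")[0];
var second = reordered.exec("instanceof")[0];
var third = bounded.exec("instanceof")[0];

console.log(first, second, third);
```

So with the \b(...)\b wrapper used in the rules, the order of the keywords does not change what is matched, only how many branches are tried before the right one is found.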
Finally, it is best to use new RegExp, which improves performance when there is a large amount of code to process.
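When the pattern is built with new RegExp from a string, every backslash has to be doubled, which gets hard to read for a long pattern. One possible alternative, sketched here as an assumption rather than as the method used in either article, is to keep each rule as a regex literal and join the source strings once, outside any function, so the combined object is created a single time and reused.

```javascript
// A string pattern needs doubled backslashes to describe the same regex.
var fromString = new RegExp("\\d+");
var fromLiteral = /\d+/;
var samePattern = fromString.source === fromLiteral.source;

// Illustrative subset of the rules: comments, strings, catch-all.
var parts = [
  /(\/\/.*|\/\*[\s\S]*?\*\/)/,                        // group 1: comments
  /("(?:[^"\\]|\\[\s\S])*"|'(?:[^'\\]|\\[\s\S])*')/,  // group 2: strings
  /[\s\S]/                                            // any other character
];

// Built once and reused for every call.
var combined = new RegExp(
  parts.map(function (part) { return part.source; }).join("|"),
  "g"
);

var tokens = 'x "a" // c'.match(combined);
console.log(samePattern, tokens);
```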

Below are my regular expressions and a simple demo. (In fact, it is just an optimization of cobalt carbonate's source code.)
Let's look at the regex part first:


(\/\/.*|\/\*[\s\S]*?\*\/) // Match comments (unchanged)
("(?:[^"\\]|\\[\s\S])*"|'(?:[^'\\]|\\[\s\S])*') // Match strings (optimized)
\b(true|false|null|undefined|NaN)\b // Match boolean and empty values. These are more common, so the group is moved forward.
\b(var|for|if|else|return|this|while|new|function|switch|case|typeof|do|in|throw|try|catch|finally|with|instanceof|delete|void|break|continue)\b // Match keywords (order changed)
\b(document|Date|Math|window|Object|location|navigator|Array|String|Number|Boolean|Function|RegExp)\b // Match built-in objects (order changed)
(?:[^\W\d]|\$)[\$\w]* // Match ordinary variable names (unchanged)
(0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|\.\d+(?:[eE][+-]?\d+)?) // Match numbers (fixed)
(?:^|[^\)\]\}])(\/(?!\*)(?:\\.|[^\\\/\n])+?\/[gim]*) // Match regex literals. This is the most complicated rule, with many cases; I don't have the energy to rework it for now.
[\s\S] // Match anything else

I merged the boolean and empty values into one group and optimized the string grouping, so my regex has 2 fewer groups than his.
In his version, groups 2 and 3 both belong to the string rule, because (["']) captures the opening quote; my regex does not do that.
If you don't like having (true|false|null|undefined|NaN) in one group, it is fine to split it.
Whether they share a group only matters for telling them apart when coloring.
In Sublime Text, true|false|null|undefined|NaN all get the same color, while Notepad++ only colors true|false. I just want to say hehe.
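Since each capture group exists only to pick a color, the link between group numbers and CSS classes can be written down as a table. The following is a minimal sketch under my own assumptions, not the article's prettify function: the names rules, escapeHtml and highlight are hypothetical, the class names are borrowed from the test page's stylesheet, and instanceof is spelled out in full.

```javascript
// Each rule is paired with a class name; null marks a branch without a capture group.
var rules = [
  ["com", /(\/\/.*|\/\*[\s\S]*?\*\/)/],
  ["str", /("(?:[^"\\]|\\[\s\S])*"|'(?:[^'\\]|\\[\s\S])*')/],
  ["val", /\b(true|false|null|undefined|NaN)\b/],
  ["kwd", /\b(var|for|if|else|return|this|while|new|function|switch|case|typeof|do|in|throw|try|catch|finally|with|instanceof|delete|void|break|continue)\b/],
  ["obj", /\b(document|Date|Math|window|Object|location|navigator|Array|String|Number|Boolean|Function|RegExp)\b/],
  [null, /(?:[^\W\d]|\$)[\$\w]*/],
  ["num", /(0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|\.\d+(?:[eE][+-]?\d+)?)/],
  ["reg", /(?:^|[^\)\]\}])(\/(?!\*)(?:\\.|[^\\\/\n])+?\/[gim]*)/],
  [null, /[\s\S]/]
];

// classes[i - 1] is the class for capture group i (7 groups in total).
var classes = [];
var sources = [];
rules.forEach(function (rule) {
  if (rule[0]) { classes.push(rule[0]); }
  sources.push(rule[1].source);
});
var pattern = new RegExp(sources.join("|"), "g");

// Hypothetical helper: escapes only &, < and >.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function highlight(source) {
  return source.replace(pattern, function (match) {
    for (var i = 1; i <= classes.length; i++) {
      var token = arguments[i];
      if (token) {
        // The regex-literal branch also consumes the character before the
        // slash; that character is emitted outside the span.
        var prefix = match.slice(0, match.length - token.length);
        return escapeHtml(prefix) + '<span class="' + classes[i - 1] + '">' +
          escapeHtml(token) + "</span>";
      }
    }
    // Identifiers and the catch-all branch have no group: plain escaped text.
    return escapeHtml(match);
  });
}

console.log(highlight("var a = 1; // x"));
console.log(highlight("x = /a\\/b/g;"));
```

Merging or splitting a group then only means adding or removing one row of the table, which is why the choice is a matter of coloring rather than of matching.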

OK, it is about time for an example.
I believe many people closed the page before getting this far, or just dragged the scrollbar down and left.
But I wrote this for the friends who read carefully. As long as one person reads it, I don't think it was in vain.
Example:


// Single line comment
/**
* Multi-line comments
* @date 2014-05-12 22:24:37
* @name Test it
*/
var str1 = "123\"456";
var str2 = '123\'456';
var str3 = "123\
456";

var num = 123;
var arr = [12, 12.34, .12, 1e3, 1e+3, 1e-3, 12.34e3, 12.34e+3, 12.34e-3, .1234e3];
var arr = ["12", "12.34", '.12, 1e3', '1e+3, 1e-3', '12.34e3, 12.34e+3, 12.34e-3', ".1234e3"];
var arr = [/12", "12.34/, /"12\/34"/];

for (var i=0; i<1e3; i++) {
var node = ("a"+i);
(node);
}

function test () {
return true;
}
test();

(function(window, undefined) {
var _re_js = new RegExp('(\\/\\/.*|\\/\\*[\\s\\S]*?\\*\\/)|("(?:[^"\\\\]|\\\\[\\s\\S])*"|\'(?:[^\'\\\\]|\\\\[\\s\\S])*\')|\\b(true|false|null|undefined|NaN)\\b|\\b(var|for|if|else|return|this|while|new|function|switch|case|typeof|do|in|throw|try|catch|finally|with|instanceof|delete|void|break|continue)\\b|\\b(document|Date|Math|window|Object|location|navigator|Array|String|Number|Boolean|Function|RegExp)\\b|(?:[^\\W\\d]|\\$)[\\$\\w]*|(0[xX][0-9a-fA-F]+|\\d+(?:\\.\\d+)?(?:[eE][+-]?\\d+)?|\\.\\d+(?:[eE][+-]?\\d+)?)|(?:^|[^\\)\\]\\}])(\\/(?!\\*)(?:\\\\.|[^\\\\\\/\\n])+?\\/[gim]*)|[\\s\\S]', 'g');

function htmlEncode(str) {
 var i, s = {
 //"&amp;": /&/g,
 "&quot;": /"/g,
 "&#39;": /'/g,
 "&lt;": /</g,
 "&gt;": />/g,
 "<br>": /\n/g,
 "&nbsp;": / /g,
 "&nbsp;&nbsp;&nbsp;&nbsp;": /\t/g
 };
 for (i in s) {
 str = str.replace(s[i], i);
 }
 return str;
 }

window.prettify = prettify;
})(window);

You can use the following code to test it.

Code:


<!doctype html>
<html lang="en">
<head>
 <meta charset="UTF-8">
 <title>test</title>
 <style>
/* Highlight style */
 *{font-size:12px;}
 code{word-break:break-all;}

.com {color:#008000;} /* Comments */
.comkey {color:#FFA500;} /* Comment tag */
.str {color:#808080;} /* String */
.val {color:#000080;} /* true|false|null|undefined|NaN */
.kwd {color:#000080;font:bold 12px 'comic sans ms', sans-serif;} /* Keywords */
.obj {color:#000080;} /* Built-in object */
.num {color:#FF0000;} /* Number */
.reg {color:#8000FF;} /* Regular */
.func {color:#A355B9;} /* Function */
</style>
</head>
<body>

<code id="regdemon">
// Single line comment
/**
* Multi-line comments
* @date 2014-05-12 22:24:37
* @name Test it
*/
var str1 = "123\"456";
var str2 = '123\'456';
var str3 = "123\
456";

for (var i=0; i<1e3; i++) {
var node = ("a"+i);
(node);
}

function test () {
return true;
}
test();
</code>

<script>
(function(window, undefined) {
// ... the highlighter script from the example above goes here ...
window.prettify = prettify;
})(window);

var code = document.getElementById("regdemon");
code.innerHTML = prettify(code);
</script>
</body>
</html>

Having more or less combined the ideas of Brother Beard and cobalt carbonate, the result is now relatively complete.
I have not tested compatibility and the like, and there is no need to. I don't plan to write syntax highlighters for other languages myself; it is too tiring.