CodeMirror JS code highlighting: a summary

CodeMirror is a JavaScript-based code editor. It supports syntax highlighting for a large number of languages, including CSS, HTML, and JS. It also supports code completion, search and replace, HTML preview, line numbers, highlighting of selections and search results, visible tabs, automatic code formatting, and more.

The CodeMirror source code is on GitHub: /marijnh/CodeMirror/. For the past few days, outside of class, I have been grinding through its source code. I found almost no relevant material online, so it has been hard going. This summary only covers the general principles; I don't yet understand the specific details. I can follow a lot of the code piece by piece, but I run into big problems when connecting the pieces. There are no comments, so for most variables I have to guess the meaning, and for most functions I can only roughly tell what they implement.

CodeMirror can support highlighting for so many languages because it defines the parsing method for each language in its mode package and exposes a unified interface to the rest of the editor. In the source code this part also forms its own layer. Below I mainly study the JS and CSS highlighting scripts that ship with the CodeMirror library as examples.

github:/marijnh/CodeMirror/blob/master/mode/javascript/

This is the JS parsing mode that it defines; below I will simply call it the JS file.
The JS file mainly contains two definitions:

CodeMirror.defineMode("javascript", function(config, parserConfig) {...});
CodeMirror.defineMIME("text/javascript", "javascript");

These two definitions mainly serve to register the mode with the main CodeMirror object.

The interface it exposes mainly consists of:

return {
  startState: function(basecolumn) {...},
  token: function(stream, state) {...},
  indent: function(state, textAfter) {...}
};

Let's go through these three functions:

(1) startState: It sets up the context and the starting state for parsing. Without it, the parsing process would effectively have no notion of semantics.
Although the startState key is not required, it is very important: highlighting often depends on context, that is, the surroundings in which the highlighted token appears usually affect the choice of semantics and color. A startState is therefore needed to initialize a state object. What this state object contains is entirely up to the specific mode; CodeMirror imposes no hard rules.
(2) token: This is the most important parsing function. It calls state.tokenize(stream, state), which runs function jsTokenBase(stream, state) {...}. I go through the main content of that function below.
(3) indent: This is optional. It works out how far a line should be indented.
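
To make the three functions concrete, here is a minimal sketch of a mode object for a toy language, runnable without CodeMirror. SimpleStream is a hypothetical, much simplified stand-in for the stream object that CodeMirror passes to token, and toyMode and tokenizeLine are illustrative names that do not appear in the CodeMirror source.

```javascript
// Hypothetical, simplified stand-in for the stream object that CodeMirror
// passes to token(); only the few methods used below are implemented.
class SimpleStream {
  constructor(text) { this.text = text; this.pos = 0; this.start = 0; }
  eol() { return this.pos >= this.text.length; }
  next() { return this.text.charAt(this.pos++); }
  // Consume characters for as long as they match the regular expression.
  eatWhile(re) {
    const begin = this.pos;
    while (!this.eol() && re.test(this.text.charAt(this.pos))) this.pos++;
    return this.pos > begin;
  }
  current() { return this.text.slice(this.start, this.pos); }
}

// A toy mode with the same three-function shape as the interface above.
const toyMode = {
  // Create the state object; what it contains is up to the mode.
  startState: function () { return { depth: 0 }; },
  // Consume one token from the stream and return its style (or null).
  token: function (stream, state) {
    const ch = stream.next();
    if (ch === "{") { state.depth++; return "bracket"; }
    if (ch === "}") { state.depth--; return "bracket"; }
    if (/\d/.test(ch)) { stream.eatWhile(/\d/); return "number"; }
    if (/[a-z_]/i.test(ch)) { stream.eatWhile(/\w/); return "variable"; }
    return null;
  },
  // Suggest an indentation (in spaces) from the current nesting depth.
  indent: function (state, textAfter) {
    const closing = textAfter.charAt(0) === "}";
    return 2 * Math.max(0, state.depth - (closing ? 1 : 0));
  }
};

// Drive the mode over one line, collecting [text, style] pairs.
function tokenizeLine(mode, line, state) {
  const stream = new SimpleStream(line);
  const out = [];
  while (!stream.eol()) {
    stream.start = stream.pos;
    const style = mode.token(stream, state);
    out.push([stream.current(), style]);
  }
  return out;
}

const state = toyMode.startState();
const tokens = tokenizeLine(toyMode, "foo { 42", state);
console.log(tokens, state.depth, toyMode.indent(state, "x"));
```

The state object survives across calls to token, which is what lets the open brace on this line influence the indentation suggested for the next one.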

Now for the jsTokenBase function. It reads the next character with stream.next() and then classifies it, mainly by regular-expression matching, and returns the result.

function jsTokenBase(stream, state) {
  var ch = stream.next();
  if (ch == '"' || ch == "'")
    return ...;                       // scan ahead to the closing quote; yields ["string", "string"]
  else if (/[\[\]{}\(\),;\:\.]/.test(ch))
    return ...;                       // matches [ ] { } ( ) and similar punctuation; returns ch
  else if (ch == "0" && stream.eat(/x/i)) {
    stream.eatWhile(/[\da-f]/i);      // 0x..., a hexadecimal number
    return ret("number", "number");   // ret(tp, style, cont) is a helper that packages the result
  }
  else if (/\d/.test(ch) || ch == "-" && stream.eat(/\d/))
    return ret("number", "number");   // a number
  else if (ch == "/") {               // comment, or an operator starting with "/"
    if (stream.eat("*")) return ["comment", "comment"];        // "/*"
    else if (stream.eat("/")) return ["comment", "comment"];   // "//"
    else if (state.lastType == "operator" || state.lastType == "keyword c" || /^[\[{}\(,;:]$/.test(state.lastType)) {}  // ?? (presumably the positions where a regexp literal can start)
    else { stream.eatWhile(isOperatorChar); return ret("operator"); }  // operator after "/"
  }
  else if (ch == "#") return ["error", "error"];                // an error token
  else if (isOperatorChar.test(ch)) return ret("operator");     // an operator
  else { stream.eatWhile(/[\w\$_]/); return ...; }              // consume and return the matched word
}

The code above only determines what kind of token each ch = stream.next() starts, that is, whether the current characters are punctuation, a string, a number, a comment, or something else.
Next comes the more important part: the stack handling that follows. You can see the shadow of a stack throughout the code. Just as in the syntax and semantic analysis of a compilers course, you scan every character of the input and decide whether to push it onto the stack or to apply a rule. I did not study compilers very seriously this semester, so I will have to review the subject. Working through the code above, I came to feel that highlighting JS or CSS code requires context, and that hierarchical structures such as the braces and colons of JS or CSS amount to pushing onto a stack as the code is read from top to bottom and left to right.

For example:
  function pushcontext() {...}
  function popcontext() {...}
  function pushlex(type, info) {...}
  function poplex() {...}

These are then called from functions such as statement(type) {...}.
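
The stack idea can be sketched on its own, without any CodeMirror code. The helpers below are simplified, hypothetical versions written for illustration (they are not the pushcontext and popcontext of the JS mode): each opening bracket pushes a context, each matching closing bracket pops one, and the depth of the stack is the context that a highlighter could consult.

```javascript
// Simplified illustration of stack-based context tracking; these helpers
// are hypothetical and much smaller than the ones in the JS mode.
function pushContext(state, type) {
  state.context = { type: type, prev: state.context };
}
function popContext(state) {
  if (state.context) state.context = state.context.prev;
}

// Return the nesting depth in effect after each character of the text.
function nestingDepths(text) {
  const state = { context: null };
  const closers = { "}": "{", ")": "(", "]": "[" };
  const depths = [];
  for (const ch of text) {
    if (ch === "{" || ch === "(" || ch === "[") {
      pushContext(state, ch);
    } else if (closers[ch] && state.context && state.context.type === closers[ch]) {
      popContext(state); // only pop when the closer matches the top of the stack
    }
    let depth = 0;
    for (let c = state.context; c; c = c.prev) depth++;
    depths.push(depth);
  }
  return depths;
}

const depths = nestingDepths("a{b(c)d}e");
console.log(depths.join(","));
```

Storing the stack as a linked list of contexts means that a push or pop never modifies the older entries, so earlier snapshots of the state remain valid.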

Another point worth making: why does the code above need to track so much state? Because highlighting is not done once and for all. After entering code, the user may move the cursor anywhere and modify it. Does the entire document then have to be re-parsed? Yes and no. Yes, because the code after the modification point must be re-highlighted: the user may have typed a brace, which changes the nesting level of everything that follows (a brace is pushed onto the stack, the stack environment of the subsequent code changes, and the highlighting depends on the contents of the stack). No, because the code before the modification point can safely be left alone, so there is no need to re-highlight the whole document. With thousands of lines of code, re-highlighting everything on every keystroke would be very inefficient.

So for each highlighting pass, the program should start from the modification point, and this is in fact what CodeMirror does. The state objects exist so that highlighting can quickly restart from a given point. CodeMirror actually "backs up" these state objects for you (the copyState function). I still don't understand the implementation details of copyState in the source code...
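
The idea of restarting from a saved state can be sketched as follows. This is a toy under stated assumptions, not CodeMirror's implementation: the state only records whether we are inside a /* ... */ comment, a copy of the state is saved at the start of every line, and all names (createDoc, highlightFrom, and this small copyState) are hypothetical.

```javascript
// Toy line-based highlighter: the state only records whether we are
// inside a /* ... */ comment. All names here are hypothetical.
function startState() { return { inComment: false }; }
function copyState(state) { return { inComment: state.inComment }; }

// Classify a whole line and update the state for the next line.
function highlightLine(line, state) {
  let sawComment = state.inComment;
  for (let i = 0; i < line.length; i++) {
    if (!state.inComment && line.startsWith("/*", i)) {
      state.inComment = true; sawComment = true; i++;
    } else if (state.inComment && line.startsWith("*/", i)) {
      state.inComment = false; i++;
    }
  }
  return sawComment ? "comment" : "code";
}

function createDoc(lines) {
  return { lines: lines.slice(), styles: [], savedStates: [] };
}

// Re-highlight from line `from` to the end, resuming from the saved state.
// Returns the number of lines that were processed.
function highlightFrom(doc, from) {
  const state = from === 0 ? startState() : copyState(doc.savedStates[from]);
  for (let n = from; n < doc.lines.length; n++) {
    doc.savedStates[n] = copyState(state); // state at the START of line n
    doc.styles[n] = highlightLine(doc.lines[n], state);
  }
  return doc.lines.length - from;
}

const doc = createDoc(["var a;", "var b;", "var c; */", "var d;"]);
highlightFrom(doc, 0);            // initial full pass
doc.lines[1] = "/* var b;";       // the user edits line 1
const processed = highlightFrom(doc, 1);
console.log(processed, doc.styles);
```

Line 0 is never looked at again after the edit, while the lines after the edit change style because the state flowing into them has changed. The copies matter because highlightLine mutates the state it is given; without them, the saved entry for a line would silently change as later lines were processed.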

Compared with the JS mode file, the CSS one feels simpler and easier to understand. The principle is similar, so I won't repeat it. Broadly, it defines a lot of keywords, then classifies each significant symbol, and it also uses a stack.
GitHub source code: /marijnh/CodeMirror/blob/master/mode/css/

Now on to the main CodeMirror code. From the HTML page it is called as follows:

var editor = CodeMirror.fromTextArea(document.getElementById("code"), {
  mode: "application/xml",
  styleActiveLine: true, // highlight the line the cursor is on
  lineNumbers: true,     // show line numbers
  lineWrapping: true     // wrap long lines automatically
});

More custom options can be passed in this call, but that is not the focus here. In short, the custom properties are merged into CodeMirror's defaultConfig.
Inside CodeMirror, function highlightLine(cm, line, state) {} calls function runMode(cm, text, mode, state, f), which in turn calls the exposed token interface via mode.token(stream, state).
Before highlightLine() runs, a lot of configuration is defined, branches are taken, and the corresponding strings are formatted. I read the rest two days ago, but now I really need to read it again to get the ideas straight. Thousands of lines of code can only be read in the most stupid way...

If you only need to highlight simple words and don't need to take complex semantics into account, regular expressions can do the job easily. For example:

var kw1 = new RegExp("(if|while|with|else|do|try|finally|return|break|continue|new|delete|throw|var|function|catch|for|switch|case|default|typeof|instanceof|true|false|null|undefined|NaN)"), // match keywords
    kw2 = new RegExp("(\\/\\/[^\n<]*(?:\n|$))(?!<\\/)"); // match comments
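
As a rough sketch of how such an expression might be applied, the snippet below wraps each keyword in a span. The function name highlightKeywords, the class name "kw", and the shortened keyword list are all assumptions made for illustration, and the word boundaries (\b) are an addition that the expressions above do not have.

```javascript
// Shortened keyword list; \b keeps "format" from matching "for".
const keywordPattern = /\b(if|else|while|for|return|var|function)\b/g;

// Wrap each keyword in a span; assumes the input contains no HTML markup.
function highlightKeywords(code) {
  return code.replace(keywordPattern, '<span class="kw">$1</span>');
}

const plain = highlightKeywords("var format = 1;");
const inString = highlightKeywords('var s = "if";');
console.log(plain);
console.log(inString);
```

In the second call the "if" inside the string literal is wrapped as well, because the expression has no idea that it is inside a string. This is the kind of case where a purely regex-based highlighter needs context.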

However, regular expressions often run into problems. Once semantics is involved, writing them becomes very troublesome. The general way to do highlighting is the syntax analysis plus semantic analysis from compiler theory, which is somewhat difficult. I originally wanted to improve a few things on top of CodeMirror, but found that hard; it would be better to write a simple highlighter myself, and I will try when I have time in a few days. Final exams have just started and there are a lot of comprehensive lab assignments, so I have had less time for reading. Calm down!! Stay focused...

Attached:
For the application of CodeMirror, please refer to:/