SoFunction
Updated on 2025-02-28

Explaining the empty strings produced when splitting strings

Problem description

When a string is split with JavaScript's split method, empty strings ("") sometimes appear in the result, especially when a regular expression is used as the separator.

Related questions

Why do JavaScript regular expressions produce empty strings when splitting a string?

In the question above, the asker split a string with a regular expression and got several empty strings "" in the result:


'Zhang sdf four methods asdf Wengf aa33net s'.split(/([\u4e00-\u9fa5]{1})/gi);
//Output["", "Zhang", "sdf", "four", "up", "", "law", "asdf", "weng", "", "", "fen", "aa33", "net", "s"]

So, what is the reason for these empty strings?

Problem analysis

A Google search turned up few relevant results, and the ones I found gave no detailed explanation: they sketched the cause briefly and then linked to the ECMAScript specification. It seems that the only way to learn the real reason is to grit one's teeth and read the specification.

Related standards

So, as is customary, let's start with the relevant section of the ECMAScript standard.


(separator, limit)

This section of the specification describes the execution steps of the split method in detail. If you are interested, you can read it step by step; here I will only explain the steps related to producing empty strings. If anything is wrong, corrections are welcome.

Related steps

Excerpt of the steps:

The most important part of the whole process is the loop in step 13. It mainly does the following:
• Work with the positions p and q. They are equal when the loop is first entered (the initial assignment is made outside the loop) and are made equal again after every successful split;
• Call SplitMatch(S, q, R) to try to match the separator at position q;
• Take a different branch depending on the result. The important one is branch iii;
• Branch iii consists of 8 sub-steps that fill the results into the predefined array A;
• Sub-step 1 takes a substring of the original string from position p (inclusive) to position q (exclusive). Note: this is where empty strings are produced. For ease of reference below, I call it the "intercept the string" step;
• Sub-step 2 appends that substring to array A;
• The remaining sub-steps update the relevant variables and continue with the next iteration. (Sub-step 7 stores the capture groups of the regular expression in array A and has nothing to do with the empty strings discussed here.)
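The loop described above can be sketched in JavaScript roughly as follows. This is a simplified illustration of the algorithm as summarized here, not an engine's actual code: the names specSplit, splitMatch and toSticky are hypothetical, limit handling is reduced to the basics, and a sticky copy of the regular expression is assumed to be an acceptable stand-in for the internal [[Match]] call.

```javascript
// Hypothetical helper: a sticky copy only matches exactly at lastIndex,
// which approximates the specification's [[Match]] at position q.
function toSticky(re) {
  return new RegExp(re.source, re.flags.replace(/[gy]/g, "") + "y");
}

// Hypothetical sketch of SplitMatch: returns null for failure,
// otherwise an object shaped like a MatchResult.
function splitMatch(S, q, R) {
  if (R instanceof RegExp) {
    R.lastIndex = q;
    const m = R.exec(S);
    if (m === null) return null;
    return { endIndex: m.index + m[0].length, captures: m.slice(1) };
  }
  if (!S.startsWith(R, q)) return null;
  return { endIndex: q + R.length, captures: [] };
}

// Hypothetical sketch of the split steps discussed in the text.
function specSplit(S, separator, limit) {
  const A = [];
  const lim = limit === undefined ? 2 ** 32 - 1 : limit >>> 0;
  const s = S.length;
  let p = 0;
  if (lim === 0) return A;
  if (separator === undefined) return [S];
  const R = separator instanceof RegExp ? toSticky(separator) : String(separator);
  if (s === 0) return splitMatch(S, 0, R) === null ? [S] : A;
  let q = p;
  while (q < s) {                          // step 13
    const z = splitMatch(S, q, R);
    if (z === null || z.endIndex === p) {  // failure, or branch ii
      q += 1;
      continue;
    }
    A.push(S.substring(p, q));             // branch iii: "" when p === q
    if (A.length === lim) return A;
    p = z.endIndex;
    for (const c of z.captures) {          // sub-step 7: capture groups
      A.push(c);
      if (A.length === lim) return A;
    }
    q = p;                                 // sub-step 8
  }
  A.push(S.substring(p, s));               // step 14: "" when p === s
  return A;
}

console.log(specSplit("abbbc", "b")); // [ 'a', '', '', 'c' ]
console.log(specSplit("abc", /bc/));  // [ 'a', '' ]
```

Tracing the first call by hand shows the empty strings appearing exactly when the substring is taken with p equal to q.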
 
SplitMatch(S, q, R)

Next, we need to understand what SplitMatch(S, q, R) does. It is defined right after the split steps in the specification, and what it does depends on the type of the separator:
• If the separator is a RegExp, call the RegExp's internal [[Match]] method to match the string at position q. If the match fails, return failure; otherwise return a result of type MatchResult.
• If the separator is a string, compare it with the characters of S starting at position q. If they differ, return failure; otherwise return a result of type MatchResult.
 
MatchResult

The steps above introduce a value of type MatchResult. According to the specification, a successful result has two fields, endIndex and captures. endIndex is the index of the last matched character plus 1. captures can be thought of as an array: when the separator is a regular expression, its elements are the values captured by the groups; when the separator is a string, it is an empty array.
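To make the two fields concrete, here is a small sketch that builds objects shaped like a MatchResult by hand. The object layout and the variable names are assumptions made for illustration; the real MatchResult is internal to the engine and cannot be observed directly.

```javascript
// Regular expression separator: a sticky regex matches only at lastIndex,
// which is used here as a stand-in for the internal [[Match]] at q = 1.
const re = /(b)(x)?/y;
re.lastIndex = 1;
const m = re.exec("abc");
const regexResult = {
  endIndex: m.index + m[0].length, // last matched index (1) plus 1
  captures: m.slice(1),            // one entry per capture group
};
console.log(regexResult); // { endIndex: 2, captures: [ 'b', undefined ] }

// String separator "bc" tested at q = 1: captures is always empty.
const S = "abc";
const R = "bc";
const q = 1;
const stringResult = S.startsWith(R, q)
  ? { endIndex: q + R.length, captures: [] }
  : null; // null stands for "failure"
console.log(stringResult); // { endIndex: 3, captures: [] }
```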

Next

From the steps above we can see that the pieces of the split result are produced in the "intercept the string" step (capture groups of regular expressions aside). That step takes the substring between a start position (inclusive) and an end position (exclusive), so when does it return ""? The special case is when the start position and the end position are equal. This is only my reading, because the specification does not spell out how the substring is taken.

Having come this far, why not go one step further?

So I searched the V8 source code for a concrete implementation, and I did find the relevant code (source code link).

Here is part of it:


function StringSplitJS(separator, limit) {
  ...
  ...
  // The separator is a string
  if (!IS_REGEXP(separator)) {
    var separator_string = TO_STRING_INLINE(separator);

    if (limit === 0) return [];

    // ECMA-262 says that if separator is undefined, the result should
    // be an array of size 1 containing the entire string.
    if (IS_UNDEFINED(separator)) return [subject];

    var separator_length = separator_string.length;

    // The separator is an empty string: return the array of characters directly
    if (separator_length === 0) return %StringToArray(subject, limit);

    var result = %StringSplit(subject, separator_string, limit);

    return result;
  }

  if (limit === 0) return [];

  // When the separator is a regular expression, call StringSplitOnRegExp
  return StringSplitOnRegExp(subject, separator, limit, length);
}

// Some code is omitted here

In the code I found that %_SubString is called to take the substring when the array is filled. Unfortunately, I could not find its definition; if anyone has found it, please let me know. However, I did find that StringSubstring, which implements JavaScript's substring method, calls %_SubString and returns its result. So if 'abc'.substring(1,1) returns "", then %_SubString returns "" when the start position and the end position are the same. Try it and you will see the result.
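The check suggested above is easy to run. The snippet below only uses the standard substring method; the variable names are chosen for illustration.

```javascript
const text = "abc";

// Start and end positions are equal: nothing lies between them.
const equalPositions = text.substring(1, 1);
console.log(JSON.stringify(equalPositions)); // ""

// For comparison, positions that differ by one yield one character.
const oneCharacter = text.substring(1, 2);
console.log(JSON.stringify(oneCharacter)); // "b"
```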

So when does the start position equal the end position (i.e. q === p)? Following the steps above one by one, I found these cases:
• The string S matches the separator, and the position immediately after that match matches the separator again. For example: 'abbbc'.split('b'), 'abbbc'.split(/(b){1}/)
• One or more characters at the beginning of the string match the separator. For example: 'abc'.split('a'), 'abc'.split(/ab/)
• One or more characters at the end of the string match the separator; the relevant step here is step 14. For example: 'abc'.split('c'), 'abc'.split(/bc/)
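Running the examples from the three cases shows where each empty string lands. The property names in the object below are labels made up for this illustration.

```javascript
const results = {
  adjacent: "abbbc".split("b"),           // ["a", "", "", "c"]
  adjacentRegex: "abbbc".split(/(b){1}/), // ["a", "b", "", "b", "", "b", "c"]
  leading: "abc".split("a"),              // ["", "bc"]
  leadingRegex: "abc".split(/ab/),        // ["", "c"]
  trailing: "abc".split("c"),             // ["ab", ""]
  trailingRegex: "abc".split(/bc/),       // ["a", ""]
};

for (const [label, parts] of Object.entries(results)) {
  console.log(label, JSON.stringify(parts));
}
```

In the capture-group variant the captured "b" values are added by a separate sub-step, so the empty strings between them still come from the substring taken with q === p.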

In addition, when a regular expression with capture groups is used as the separator, undefined may appear in the returned result.
For example: 'abc'.split(/(d)*/)
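A quick run makes the undefined entries visible. The pattern matches the empty string between characters, and because the group (d) takes part in no match, its capture is undefined each time. The variable name is an assumption for illustration.

```javascript
const parts = "abc".split(/(d)*/);

console.log(parts); // [ 'a', undefined, 'b', undefined, 'c' ]

// The entries are real undefined values, not the string "undefined".
console.log(parts[1] === undefined); // true
```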

Now look back at the example at the beginning. Does it fall under the cases above?

Off topic

This is the first time I have read the ECMAScript specification so carefully. Reading it was painful, but understanding it in the end was very satisfying. Thanks to the asker for this question and for the follow-up question.
By the way, when a regular expression is used as a separator, the global modifier g is ignored, which was an extra thing I learned along the way.
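The remark about the g modifier can be checked with a short comparison. The input string and variable names are assumptions chosen for illustration.

```javascript
const withoutG = "a1b2c3".split(/\d/);
const withG = "a1b2c3".split(/\d/g);

console.log(withoutG); // [ 'a', 'b', 'c', '' ]
console.log(withG);    // [ 'a', 'b', 'c', '' ]

// Both calls split on every digit; the trailing "" comes from the
// separator matching at the very end of the string.
console.log(JSON.stringify(withoutG) === JSON.stringify(withG)); // true
```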