Overview
Lua is fully 8-bit encoding, and characters in the Lua string can have any numerical encoding, including the numerical value 0. That is, any binary data can be stored in a string. Lua's strings are immutable values. If modified, it is essentially creating a new string. According to the above "Source Code Implementation of Data Types in Lua", in Lua, strings are objects managed by the automatic memory management mechanism, and the consortium TString implements the storage of string values. Below, we will look at the implementation of strings through the source code of Lua 5.2.1 and summarize the precautions for using strings in Lua.
Source code implementation
First, let’s look at the data structure TString corresponding to the string, and its source code is as follows ():
410 /* 411 ** Header for string value; string bytes follow the end of this structure 412 */ 413 typedef union TString { 414 L_Umaxalign dummy; /* ensures maximum alignment for strings */ 415 struct { 416 CommonHeader; 417 lu_byte extra; /* reserved words for short strings; "has hash" for longs */ 418 unsigned int hash; 419 size_t len; /* number of characters in string */ 420 } tsv; 421 } TString;
There are several points worth explaining about the definition of this union:
I. The member L_Umaxalign dummy in the union TString is used to ensure alignment with the C type of maximum length. It is defined as follows ():
48 /* type to ensure maximum alignment */ 49 #if !defined(LUAI_USER_ALIGNMENT_T) 50 #define LUAI_USER_ALIGNMENT_T union { double u; void *s; long l; } 51 #endif 52 53 typedef LUAI_USER_ALIGNMENT_T L_Umaxalign;
In other implementations of recyclable objects (such as tables), this union member can also be seen. The purpose of this is to speed up the CPU's access to memory through memory alignment.
II. Members of the union are truly used to implement strings. The member CommonHeader is used for GC, which will be defined in all recyclable objects in the form of a macro. The code is as follows ():
74 /* 75 ** Common Header for all collectable objects (in macro form, to be 76 ** included in other objects) 77 */ 78 #define CommonHeader GCObject *next; lu_byte tt; lu_byte marked
The corresponding structure of this macro is as follows ():
81 /* 82 ** Common header in struct form 83 */ 84 typedef struct GCheader { 85 CommonHeader; 86 } GCheader;
The structure GCheader is useful in the definition of the universal recyclable object union GCObject.
III, lu_byte extra For short strings, it is used to record whether this string is a reserved word. For long strings, it can be used to lazy find the Hash value; the unsigned int hash member is the Hash value corresponding to the string (there will be specific to how Lua calculates the Hash value of the string); size_t len is used to represent the length of the string.
IV. The above structure only describes the structure of a string, and the real string data is saved immediately after the structure.
Before Lua5.2.1, no strings were distinguished from long and short strings. All strings were stored in a global Hash table. For Lua virtual machines, the same string had only one copy of data. Starting from Lua5.2.1, short string strings (the current definition is less than or equal to 40) were placed in the global Hash table, while long strings were generated independently. At the same time, when calculating the Hash value, a random seed was introduced. The purpose of this is to prevent Hash Dos - the attacker from constructing many different strings with the same Hash value, thereby reducing the efficiency of Lua's strings being pushed into the external strings from the global Hash table. Here are the steps to generate a new string in Lua5.2.1, and the corresponding codes are all in:
(1) If the length of the string is greater than LUAI_MAXSHORTLEN (the default value is 40), it is a long string. Directly call createstrobj, the function that creates the string interface (of course, the length of the string needs to be saved in the member size_t len, otherwise it will be returned directly). The code is as follows ():
95 /* 96 ** creates a new string object 97 */ 98 static TString *createstrobj (lua_State *L, const char *str, size_t l, 99 int tag, unsigned int h, GCObject **list) { 100 TString *ts; 101 size_t totalsize; /* total size of TString object */ 102 totalsize = sizeof(TString) + ((l + 1) * sizeof(char)); 103 ts = &luaC_newobj(L, tag, totalsize, list, 0)->ts; 104 ts-> = l; 105 ts-> = h; 106 ts-> = 0; 107 memcpy(ts+1, str, l*sizeof(char)); 108 ((char *)(ts+1))[l] = '\0'; /* ending 0 */ 109 return ts; 110 }
You can see that the passed string is stored in the specific memory immediately after the structure TString memory, and pay attention to 108 lines, the string ends with "\0" and is compatible with C strings.
(2) If the string is a short string, first calculate the Hash value of the string, find the corresponding linked list (the global Hash table of the short string, which uses the link method, that is, put all conflicting elements in the same linked list), and find out whether the currently created string is already in the Hash table. If it already exists, return this string directly. Otherwise, the function newshrstr will be called, and the function newshrstr will call the above createstrobj function to create a new string and put the newly created string into the Hash table. The code is as follows ()):
130 /* 131 ** checks whether short string exists and reuses it or creates a new one 132 */ 133 static TString *internshrstr (lua_State *L, const char *str, size_t l) { 134 GCObject *o; 135 global_State *g = G(L); 136 unsigned int h = luaS_hash(str, l, g->seed); 137 for (o = g->[lmod(h, g->)]; 138 o != NULL; 139 o = gch(o)->next) { 140 TString *ts = rawgco2ts(o); 141 if (h == ts-> && 142 ts-> == l && 143 (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) { 144 if (isdead(G(L), o)) /* string is dead (but was not collected yet)? */ 145 changewhite(o); /* resurrect it */ 146 return ts; 147 } 148 } 149 return newshrstr(L, str, l, h); /* not found; create a new string */ 150 }
The global string Hash table is saved in the virtual machine's global state member strt ():
119 stringtable strt; /* hash table for strings */
The type stringtable is a structure, which is defined as follows ():
59 typedef struct stringtable { 60 GCObject **hash; 61 lu_int32 nuse; /* number of elements */ 62 int size; 63 } stringtable;
The member GCObject **hash is a pointer array, and each member in the array actually points to TString (note that TString includes the macro CommonHeader, the next member in the macro can construct a split linked list in the Hash table); nuse is the number of elements that have been used in the array hash; size is the size of the current array hash.
Before inserting a new string in the function newshrstr, it will determine whether the nuse value is greater than size. If it is greater than, it means that the Hash table size is not enough and needs to be expanded, then the Hash table size is expanded to twice the original. The corresponding code is as follows ():
121 if (tb->nuse >= cast(lu_int32, tb->size) && tb->size <= MAX_INT/2) 122 luaS_resize(L, tb->size*2); /* too crowded */
When gc, it will determine whether nuse is smaller than size/2 (in Lua 5.1, nuse is compared with size/4), and if so, resize the size of this stringtable to half of the original size. The corresponding code is as follows ():
783 int hs = g-> / 2; /* half the size of the string table */ 784 if (g-> < cast(lu_int32, hs)) /* using less than that half? */ 785 luaS_resize(L, hs); /* halve its size */
For string comparison, first compare the types. If it is a string of different types, it will definitely be different. Then distinguish between short strings and long strings. For short strings, if the pointer values of the two are equal, it will be the same, otherwise it will be different; for long strings, first compare the pointer values. If it is different, compare the length value and the content character-by-character comparison.
Summarize
(1) The reserved characters and meta-method names in Lua are both used as short strings. They are placed in the global short string Hash table when the virtual machine is started and are not recycled.
(2) Finding characters is relatively efficient, but modifying or inserting strings is relatively inefficient. In addition to calculations, at least the external strings must be copied to the virtual machine.
(3) The corresponding Hash value of a long string, Lua will not examine each character, so it can avoid quickly calculating the hash value of a long string. The corresponding code is as follows ():
51 unsigned int luaS_hash (const char *str, size_t l, unsigned int seed) { 52 unsigned int h = seed ^ l; 53 size_t l1; 54 size_t step = (l >> LUAI_HASHLIMIT) + 1; 55 for (l1 = l; l1 >= step; l1 -= step) 56 h = h ^ ((h<<5) + (h>>2) + cast_byte(str[l1 - 1])); 57 return h; 58 } 21 /* 22 ** Lua will use at most ~(2^LUAI_HASHLIMIT) bytes from a string to 23 ** compute its hash 24 */ 25 #if !defined(LUAI_HASHLIMIT) 26 #define LUAI_HASHLIMIT 5 27 #endif
(4) When there are multiple strings connected, the string concatenation operator "..." should not be directly used to connect the string, but to use operations or to speed up the string concatenation operation.
(5) The string Hash algorithm in Lua uses JSHash. Regarding various hash functions of strings, an article on the Internet summarizes it:/blog/string-hash-compare
Who is mentioned above is the entire content of this article, and I hope it will be helpful to everyone to learn Lua.