Methods of establishing user portrait system using PostgreSQL database

Speaking of applications in big data, many students may immediately think of user portraits. User portrait, called User Profile in English, can perfectly abstract the full picture of a user's information through user portrait data, and can accurately analyze various behavioral habits of users, such as consumption habits, hobbies, abilities and other important user information. Usually, user portraits are implemented by creating a tag system for users. This article introduces how to use PostgreSQL's json data type to create a user portrait database.

Speaking of applications in big data, many students may immediately think of user portraits.
User portrait, called User Profile in English, can perfectly abstract the full picture of a user's information through user portrait data, and can accurately analyze various behavioral habits of users, such as consumption habits, hobbies, abilities and other important user information. Usually, user portraits are implemented by creating a tag system for users. This article introduces how to use PostgreSQL's json data type to create a user portrait database.

Tag model

In order to explain the specific methods, we build a simple two-level labeling system:

Occupation: Farmer, worker, IT engineer, barber, doctor, teacher, artist, lawyer, civil servant, official
Hobbies: swimming, table tennis, badminton, tennis, mountain climbing, golf, skiing, mountain climbing, travel
Education: No education, elementary school, junior high school, high school, secondary school, junior college, undergraduate, master, doctoral degree
Character: Extrovert, introvert, cautious, steady, careful, careless, impetuous, confident

Of course, you can also establish a complex label system of three or more levels based on actual conditions.

Create tables and create data

The following is a specific example to illustrate how to use PostgreSQL's json data type to create a user tag database.
Create user tag table:

CREATE TABLE user_tag(uid serial primary key, tag jsonb);

This table has only two fields, the uid represents the user ID, and the tag field contains all the user's tags, and the tag field type is jsonb.
In order to illustrate how to use this scheme, you need to create some tag data in the table. First create a function, and this function will be called in the SQL of the INSERT below to generate some random tags:

CREATE OR REPLACE FUNCTION f_random_attr(attr text[], max_attr int)
RETURNS text[] AS $$
DECLARE
     i integer := 0; 
     r integer := 0;
     res  text[];
     v text;
     l integer;
     num integer;
 BEGIN
     num := (random()*max_attr)::int;
     IF num < 1 THEN
         num := 1;
     END IF;
     l := array_length(attr, 1);
     WHILE i < num LOOP
          r := round(random()*l)::int + 1;
          v := attr[r];
          IF res @> array[v] THEN
             continue;
          ELSE 
             res := array_append(res, v);
             i := i + 1;
          END IF;
     END LOOP;
     return res;
 END;
$$ LANGUAGE plpgsql;

We can create some random data through the following INSERT statement:

INSERT INTO user_tag(uid, tag) 
SELECT seq, 
       json_build_object(
         'Profession',
         f_random_attr(array['Farmer','Worker','IT Engineer','barber','doctor','teacher','Artist','lawyer','Civil Servant','Official'], 1),
         'Hobby',
         f_random_attr(array['swim','table tennis','badminton','tennis','climb mountains','golf','ski','climb mountains','travel'], 5),
         'Education',
         f_random_attr(array['No education','primary school','junior high school','high school','Technical Technical College','Quality','Bachelor','master','PhD'], 1),
         'character',
         f_random_attr(array['Extraverted','Introvert','cautious','steady','careful','careless','impetuous','confidence'], 3))::jsonb
  FROM generate_series(1, 10000) as t(seq);

Indexing and querying

To speed up the query, we create a GIN index on the tag column. GIN index is a special index in PostgreSQL that can implement functions similar to full-text search:

CREATE INDEX idx_user_tag_tag on user_tag using gin(tag);

At this time, if we want to query teachers with an extroverted and careful personality, then SQL statement:

osdba=# select * from user_tag where tag @> '{"personality":["extrovert","careful"]}' and tag @> '{"profession":["teacher"]}'; uid  |                                                         tag
------+----------------------------------------------------------------------------------------------------------------------
  881 | {"Education": ["Secondary Technical College"], "character": ["careful", "Introvert", "Extrovert"], "Hobby": ["golf"], "Profession": ["teacher"]}
 1031 | {"Education": [null], "character": ["Extrovert", "careful"], "Hobby": ["climb mountains"], "Profession": ["teacher"]}
 3313 | {"Education": [null], "character": ["Extrovert", "careful"], "Hobby": ["table tennis"], "Profession": ["teacher"]}
 4053 | {"Education": ["Undergraduate"], "character": ["careful", "Extrovert"], "Hobby": ["climb mountains", "ski", "swim"], "Profession": ["teacher"]}
 4085 | {"Education": ["junior high school"], "character": ["Extrovert", "careful"], "Hobby": ["tennis"], "Profession": ["teacher"]}
 4332 | {"Education": ["master"], "character": ["Extrovert", "careful"], "Hobby": ["tennis", "badminton"], "Profession": ["teacher"]}
 4997 | {"Education": ["primary school"], "character": ["Extrovert", "careful", "impetuous"], "Hobby": ["table tennis"], "Profession": ["teacher"]}
 5231 | {"Education": ["Undergraduate"], "character": ["Extrovert", "careful"], "Hobby": ["golf"], "Profession": ["teacher"]}
 5360 | {"Education": ["No education"], "character": ["impetuous", "Extrovert", "careful"], "Hobby": ["climb mountains", "tennis", "travel", "badminton"], "Profession": ["teacher"]}
 6281 | {"Education": ["Specialty"], "character": ["careful", "Extrovert", "confidence"], "Hobby": ["ski", null], "Profession": ["teacher"]}
 7681 | {"Education": ["primary school"], "character": ["careless", "Extrovert", "careful"], "Hobby": ["table tennis"], "Profession": ["teacher"]}
 8246 | {"Education": ["master"], "character": ["Extrovert", "careful"], "Hobby": ["ski", "tennis", "golf"], "Profession": ["teacher"]}
 8531 | {"Education": ["master"], "character": ["careful", "Extrovert", "careless"], "Hobby": ["ski", "climb mountains"], "Profession": ["teacher"]}
 8618 | {"Education": ["primary school"], "character": ["careful", "Extrovert", "impetuous"], "Hobby": ["table tennis"], "Profession": ["teacher"]}
 9508 | {"Education": ["primary school"], "character": ["impetuous", "Extrovert", "careful"], "Hobby": ["climb mountains", "travel", "golf"], "Profession": ["teacher"]}
(15 rows)

Time: 1.495 ms

The "@>" in the where condition in the SQL statement above is a special operator of jsonb type in PostgreSQL, meaning "inclusive".
If we want to query doctors who are extroverted and attentive in character and who like skiing and swimming, then SQL statement:

osdba=# select * from user_tag where tag @> '{"character":["extrovert","careful"]}' and tag @> '{"career":["doctor"]}' and tag @>'{"hobby":["skiing", "swimming"]}'; uid  |                                               tag
------+--------------------------------------------------------------------------------------------------
 4469 | {"Education": ["primary school"], "character": ["Extrovert", "careful", "steady"], "Hobby": ["ski", "swim"], "Profession": ["doctor"]}
(1 row)

Time: 2.139 ms

As can be seen from the above, it only takes 1~2ms to query the result. The reason is that we use the GIN index, and we can check the execution plan:

osdba=# explain select * from user_tag where tag @> '{"personality":["extrovert","careful"]}' and tag @> '{"profession":["doctor"]}' and tag @>'{"hobby":["skiing", "swimming"]}';                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on user_tag  (cost=48.00..52.02 rows=1 width=153)
   Recheck Cond: ((tag @&gt; '{"character": ["Extrovert", "careful"]}'::jsonb) AND (tag @&gt; '{"Profession": ["doctor"]}'::jsonb) AND (tag @&gt; '{"Hobby": ["ski", "swim"]}'::jsonb))
   -&gt;  Bitmap Index Scan on idx_user_tag_tag  (cost=0.00..48.00 rows=1 width=0)
         Index Cond: ((tag @&gt; '{"character": ["Extrovert", "careful"]}'::jsonb) AND (tag @&gt; '{"Profession": ["doctor"]}'::jsonb) AND (tag @&gt; '{"Hobby": ["ski", "swim"]}'::jsonb))
(4 rows)

Time: 1.736 ms

GIN index is a nuclear weapon in PostgreSQL. You can see from the above that the power of GIN index is. In addition to being used on json data types, GIN cables can also be used in array types.
jsonb provides two types of index operators for GIN indexes (you can think of them as two types of indexes):

jsonb_ops, this is the default, the index we created above is of this type.
jsonb_path_ops

jsonb_ops can support "@>", "?", "?&", "?|" and other operators for querying. Please see the meaning of these operators:Official Documentation
jsonb_path_ops only supports the "@>" operator. Of course, the jsonb_path_ops index is much smaller than the json_ops index. For details, please see:

osdba=# select pg_relation_size('idx_user_tag_tag');
 pg_relation_size
------------------
           245760
(1 row)

Time: 0.522 ms
osdba=# CREATE INDEX idx_user_tag_tag2 on user_tag using gin(tag jsonb_path_ops);
CREATE INDEX
Time: 46.947 ms
osdba=# select pg_relation_size('idx_user_tag_tag2');
 pg_relation_size
------------------
           147456
(1 row)

The method used by jsonb_path_ops index is basically the same as that of jsonb_ops:

osdba=# drop index idx_user_tag_tag;
DROP INDEX
Time: 2.833 ms
osdba=# select * from user_tag where tag @> '{"character":["extrovert","careful"]}' and tag @> '{"career":["doctor"]}' and tag @>'{"hobby":["skiing", "swimming"]}'; uid  |                                               tag
------+--------------------------------------------------------------------------------------------------
 4469 | {"Education": ["primary school"], "character": ["Extrovert", "careful", "steady"], "Hobby": ["ski", "swim"], "Profession": ["doctor"]}
(1 row)

Time: 1.401 ms
osdba=# explain select * from user_tag where tag @> '{"personality":["extrovert","careful"]}' and tag @> '{"profession":["doctor"]}' and tag @>'{"hobby":["skiing", "swimming"]}';                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on user_tag  (cost=24.00..28.02 rows=1 width=153)
   Recheck Cond: ((tag @&gt; '{"character": ["Extrovert", "careful"]}'::jsonb) AND (tag @&gt; '{"Profession": ["doctor"]}'::jsonb) AND (tag @&gt; '{"Hobby": ["ski", "swim"]}'::jsonb))
   -&gt;  Bitmap Index Scan on idx_user_tag_tag2  (cost=0.00..24.00 rows=1 width=0)
         Index Cond: ((tag @&gt; '{"character": ["Extrovert", "careful"]}'::jsonb) AND (tag @&gt; '{"Profession": ["doctor"]}'::jsonb) AND (tag @&gt; '{"Hobby": ["ski", "swim"]}'::jsonb))
(4 rows)

Time: 0.634 ms

You can write more SQL to analyze user portraits, so I won’t go into details here.
Of course, you can also use PostgreSQL's array type to build a tag system. The array type also runs GIN index, but in general, it is not intuitive to use json type.

This is the end of this article about using PostgreSQL database to establish a user portrait system. For more related content of PostgreSQL user portrait system, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!