ETL is the abbreviation of Extract-Transform-Load, describing the process of extracting data from a source, transforming it, and loading it into a destination.
The term ETL is most commonly used in the context of data warehouses, but its targets are not limited to data warehouses.
Its main function is to integrate data across the enterprise that is scattered, incompletely structured, and inconsistent in standards into a unified, enterprise-level data warehouse, providing a quality-assured data source for analysis and decision-making.
Process
The process mainly includes three major stages: data extraction, data conversion, and data loading.
Process-Data extraction
3.1 Relational Database
Relational database data is highly structured, the data volume is not especially large, updates are relatively frequent, data quality is high, and the data is relatively convenient to obtain. Extraction is divided into offline and real-time.
Offline extraction: the application scenario is insensitive to short-term updates and only cares about the final state after a period of time; extraction is generally performed at fixed, scheduled intervals.
The extraction method can be a full-table snapshot, or incremental extraction based on a time field; for the latter you need a good understanding of how and when that field is updated.
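For illustration, here is a minimal Python sketch of time-field-based incremental extraction, assuming a source table named orders with an updated_at column and a watermark persisted between runs; the table, column, and file names are hypothetical.

```python
import sqlite3

def extract_incremental(conn, last_watermark):
    """Pull only the rows modified since the last successful run (time-field based)."""
    cur = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # Advance the watermark to the newest updated_at we just saw (or keep the old one).
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark

if __name__ == "__main__":
    conn = sqlite3.connect("source.db")        # hypothetical source database
    watermark = "2024-01-01 00:00:00"          # normally persisted between runs
    rows, watermark = extract_incremental(conn, watermark)
    print(f"extracted {len(rows)} changed rows, new watermark = {watermark}")
```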
Real-time extraction: the usual application scenario is very sensitive to the latest state of the data; when new data is inserted or old data is updated, it must be extracted promptly.
There are many extraction methods; a commonly used one is consuming the database's binlog, for example with Alibaba's open-source Canal.
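Canal itself is a Java service; as a rough illustration of the same idea in Python, the sketch below uses the python-mysql-replication library to consume a MySQL binlog, assuming row-based binlog is enabled; the connection settings are placeholders.

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent, UpdateRowsEvent

MYSQL = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL,
    server_id=100,                       # must be unique among replication clients
    only_events=[WriteRowsEvent, UpdateRowsEvent],
    blocking=True,                       # keep listening for new binlog events
    resume_stream=True,
)

for event in stream:                     # runs until interrupted
    for row in event.rows:
        if isinstance(event, WriteRowsEvent):
            record = row["values"]           # newly inserted row
        else:
            record = row["after_values"]     # row state after the update
        print(event.schema, event.table, record)   # hand off to the ETL pipeline here
```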
3.2 Server program log file
Various kinds of log data are mixed together (you need to filter out what you need), the formats are relatively diverse (and may not be unified), and the data volume is relatively large.
For this type of data, the usual approach is to filter while extracting. Extraction can use Flume to monitor a file directory, or Filebeat to monitor changes to file contents in real time.
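The sketch below shows the core idea in plain Python: tail a log file as it grows and keep only the lines of interest, much as Filebeat does; the file path and filter keyword are hypothetical.

```python
import time

def follow(path, keyword):
    """Tail a log file and yield only the lines containing the keyword."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        f.seek(0, 2)                      # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)           # wait for new content to be appended
                continue
            if keyword in line:
                yield line.rstrip("\n")

if __name__ == "__main__":
    for line in follow("/var/log/app/access.log", keyword="ORDER_CREATED"):
        print(line)                       # in practice, forward to Kafka/Flume, etc.
```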
3.3 Client user behavior data
The data volume is very large, the data sources are very scattered, the data quality is uneven, and the data is only weakly structured.
For this kind of data, the usual approach is to design a dedicated data collection system to handle the collection problem.
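As a rough sketch of such a collection system, the example below exposes a single HTTP endpoint that receives client events and appends them to a local file; it assumes Flask is installed, and the endpoint name, fields, and output file are illustrative only.

```python
import json
import time
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/collect", methods=["POST"])
def collect():
    event = request.get_json(force=True, silent=True) or {}
    event["server_ts"] = int(time.time() * 1000)   # add a server-side receive timestamp
    # Append to a local file; a real system would buffer and ship to Kafka/object storage.
    with open("events.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")
    return jsonify({"ok": True})

if __name__ == "__main__":
    app.run(port=8080)
```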
Process-Data conversion
Data conversion is the core stage of ETL and also the most complex one.
Its main goal is to clean the extracted data, convert formats, fill missing values, remove duplicates, and so on, ultimately producing data that is uniform in format, highly structured, of high quality, and broadly compatible, providing reliable data support for subsequent analysis and decision-making.
4.1 Data cleaning
Clean out dirty data so that it does not mix with normal data, lower data quality, and lead to biased or wrong decisions.
Common practices include MD5 verification, key-field checks, format/type consistency checks, validity checks, and so on; a combined sketch follows the list below.
(1) MD5 verification is usually used for data transmitted over the public network. To prevent tampering, the MD5 value of the payload is compared to ensure data integrity. Data that fails MD5 verification is treated as dirty data and filtered out.
(2) The key-field check targets the most critical fields in a record: if a value is illegal or of the wrong type, the record is considered invalid. For example, if the value of a field identifying a user's action is not within the specified range, the specific operation cannot be identified and the record is cleaned out as illegal data.
(3) The format/type consistency check verifies whether the overall format or type of a record's important fields complies with the specification; if not, the record cannot be processed and can only be treated as dirty data, for example an illegal date format.
(4) The validity check is usually a relatively strict filter whose main purpose is to ensure high reliability of the data. A credible range is agreed on in advance; data outside that range is regarded as untrusted, invalid data.
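The sketch below combines the four checks in one filtering function; the field names, allowed action values, and credible amount range are assumptions for illustration.

```python
import hashlib
import json
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
ALLOWED_ACTIONS = {"click", "view", "purchase"}       # assumed value range for a key field

def is_clean(record: dict, payload: bytes, expected_md5: str) -> bool:
    """Apply the four cleaning checks; return False for dirty records."""
    # (1) MD5 verification: reject data that was tampered with in transit.
    if hashlib.md5(payload).hexdigest() != expected_md5:
        return False
    # (2) Key-field check: the action field must take a known value.
    if record.get("action") not in ALLOWED_ACTIONS:
        return False
    # (3) Format/type consistency check: the date field must look like YYYY-MM-DD.
    if not DATE_RE.match(str(record.get("date", ""))):
        return False
    # (4) Validity check: the amount must fall inside a pre-agreed credible range.
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or not (0 <= amount <= 1_000_000):
        return False
    return True

payload = b'{"action": "click", "date": "2024-06-01", "amount": 19.9}'
record = json.loads(payload)
print(is_clean(record, payload, hashlib.md5(payload).hexdigest()))   # True
```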
4.2 Format conversion
Formats are unified and standardized so that subsequent analysis can make better use of the data.
The implementation generally differs depending on the source data format and the target format, and usually includes record-format conversion and field-format conversion.
Record-format conversion decides what format a record is ultimately converted into, for example JSON or CSV.
Field-format conversion is the unified formatting of field values, such as converting all long-integer timestamps into human-readable dates.
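A small sketch of both kinds of conversion, assuming records arrive as Python dicts with a millisecond timestamp field named ts; the field names are illustrative.

```python
import csv
import io
import json
from datetime import datetime, timezone

def convert_record(raw: dict) -> dict:
    """Field-format conversion: turn a long-integer timestamp into a readable datetime."""
    out = dict(raw)
    out["event_time"] = datetime.fromtimestamp(
        raw["ts"] / 1000, tz=timezone.utc
    ).strftime("%Y-%m-%d %H:%M:%S")
    return out

def to_json_line(record: dict) -> str:
    """Record-format conversion: emit one JSON object per line."""
    return json.dumps(record, ensure_ascii=False)

def to_csv_line(record: dict, fields: list) -> str:
    """Record-format conversion: emit a CSV row with a fixed column order."""
    buf = io.StringIO()
    csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore").writerow(record)
    return buf.getvalue().rstrip("\r\n")

raw = {"user_id": 42, "ts": 1700000000000, "action": "click"}
rec = convert_record(raw)
print(to_json_line(rec))
print(to_csv_line(rec, ["user_id", "event_time", "action"]))
```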
4.3 Error data processing
This type of error arises because the business system is not robust enough: input is written directly into the backend database without validation, for example numeric data entered as full-width characters, a trailing carriage return after string data, incorrect date formats, or out-of-range dates. Such data also needs to be classified. Problems like full-width characters and invisible characters before and after the data can only be found by writing SQL statements; they are then extracted again after the business system has been corrected.
Errors such as an incorrect date format or an out-of-range date will cause the ETL run to fail. This type of error needs to be selected with SQL from the business system database, handed over to the responsible business unit for correction within a time limit, and extracted again after the correction.
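As an illustration, the sketch below scans rows for full-width digits, stray whitespace, and invalid or out-of-range dates so they can be reported back for correction; the column names and credible year range are assumptions.

```python
import re
from datetime import datetime

FULLWIDTH_DIGITS = re.compile(r"[\uFF10-\uFF19]")     # full-width 0-9

def find_error_rows(rows):
    """Flag rows with full-width digits, stray whitespace, or bad/out-of-range dates."""
    errors = []
    for row in rows:
        qty, name, date_str = row["qty"], row["name"], row["order_date"]
        if FULLWIDTH_DIGITS.search(str(qty)):
            errors.append((row, "full-width digits in numeric field"))
        if name != name.strip():
            errors.append((row, "leading/trailing whitespace or control characters"))
        try:
            d = datetime.strptime(date_str, "%Y-%m-%d")
            if not (2000 <= d.year <= 2100):
                errors.append((row, "date out of credible range"))
        except ValueError:
            errors.append((row, "invalid date format"))
    return errors

rows = [{"qty": "１２", "name": "widget \r", "order_date": "2024-13-01"}]  # hypothetical
for row, reason in find_error_rows(rows):
    print(reason, row)    # report back to the business system for correction
```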
4.4 Missing value filling
Fill in missing values in the data to ensure type consistency and completeness.
For example, for an integer quantity field, a missing value can be padded with 0.
Note that missing-value filling must be done cautiously, because the process effectively modifies the original data; make sure the filled values will not mislead subsequent analysis.
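A minimal filling sketch using pandas (assuming it is installed); the column names and default values are illustrative, and the comments flag where a default may or may not be safe.

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "qty": [5, None, 2], "city": ["SH", None, "BJ"]})

# Pad the numeric quantity field with 0, as in the example above; only fill when the
# default is genuinely neutral for downstream analysis.
df["qty"] = df["qty"].fillna(0).astype(int)

# For fields with no neutral default, an explicit marker is safer than guessing a value.
df["city"] = df["city"].fillna("UNKNOWN")

print(df)
```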
4.5 Exclude duplicate data
Removing duplicate data prevents data that was uploaded repeatedly from causing misjudgments; it is essential and a prerequisite for ensuring data quality.
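A simple in-memory deduplication sketch keyed on an assumed business key; in practice the key fields depend on the table, and large datasets would be deduplicated in the warehouse instead.

```python
def deduplicate(records, key_fields=("user_id", "event_id", "ts")):
    """Drop records whose business key has already been seen (e.g. re-uploaded batches)."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique

batch = [
    {"user_id": 1, "event_id": "a1", "ts": 1700000000, "action": "click"},
    {"user_id": 1, "event_id": "a1", "ts": 1700000000, "action": "click"},  # duplicate upload
]
print(deduplicate(batch))   # only the first copy is kept
```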
Data cleaning is an iterative process that cannot be finished in a few days; problems have to be discovered and resolved continuously.
Whether data should be filtered out or corrected generally requires confirmation. Filtered data can be written to Excel files or to dedicated data tables; in the early stage of ETL development, the filtered data can be emailed to business units daily, prompting them to correct errors as soon as possible, and it can also serve as a basis for verifying data later.
When cleaning data, take care not to filter out useful data; verify every filtering rule carefully and have users confirm it.
4.6 Add necessary information
Add tracking information to the data or generate some necessary identification fields. Some of these fields are strongly tied to the specific business, while others are common to most pipelines.
For example, if the source data only contains long-integer timestamp fields, more readable date and time fields can be added, and a data-source field can be added to make it easier to trace data lineage.
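A small enrichment sketch that adds a readable date field, a datetime field, and a data-source field for lineage tracking; the field names and source label are assumptions.

```python
from datetime import datetime, timezone

def enrich(record: dict, source: str) -> dict:
    """Add readable date/time fields and a data-source field for lineage tracking."""
    out = dict(record)
    dt = datetime.fromtimestamp(record["ts"] / 1000, tz=timezone.utc)
    out["dt"] = dt.strftime("%Y-%m-%d")            # partition-friendly date field
    out["event_time"] = dt.strftime("%Y-%m-%d %H:%M:%S")
    out["source"] = source                         # where this record came from
    return out

print(enrich({"user_id": 7, "ts": 1700000000000}, source="app_click_log"))
```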
Process-Data loading
Load data to a destination, such as a data warehouse.
The usual practice is to write the processed data into files of a specific format (such as Parquet or CSV) and then mount the files to the specified table partition.
Some tables have very small data volumes and do not use partitioned tables; for these, the final data table is generated directly.
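The sketch below writes a cleaned DataFrame to a Parquet file under a date-partitioned directory and prints the Hive-style DDL that would mount it as a table partition; it assumes pandas plus pyarrow, and the warehouse path and table name are hypothetical.

```python
import os
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "action": ["click", "view"]})   # cleaned data

dt = "2024-06-01"
part_dir = f"warehouse/ods/user_events/dt={dt}"       # hypothetical warehouse layout
os.makedirs(part_dir, exist_ok=True)
df.to_parquet(os.path.join(part_dir, "part-0000.parquet"), index=False)  # needs pyarrow

# Mount the directory as a partition of the target table, e.g. with Hive-style DDL:
ddl = (f"ALTER TABLE ods.user_events ADD IF NOT EXISTS "
       f"PARTITION (dt='{dt}') LOCATION '{part_dir}'")
print(ddl)   # submitted through beeline / Spark SQL in a real pipeline
```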
Log processing
(1) Execution process log. This log records each step carried out during an ETL run: the start time of each step, how many rows of data it affected, and so on, in a running-account style.
(2) Error log. When a module fails, an error log is written recording the time of the error, the module in which it occurred, and the error message.
(3) Overall log, which only records the ETL start time, end time, and whether the run succeeded. If you use ETL tools, they automatically generate some logs, which can also serve as part of the ETL log. A minimal logging sketch follows this list.
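Here is a minimal sketch of the three kinds of logs using Python's standard logging module; the step name and row count are placeholders.

```python
import logging
import time

logging.basicConfig(
    filename="etl.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
step_log = logging.getLogger("etl.step")       # (1) execution-process log
error_log = logging.getLogger("etl.error")     # (2) error log
run_log = logging.getLogger("etl.run")         # (3) overall log

def run_step(name, func):
    start = time.time()
    try:
        rows = func()
        step_log.info("step=%s rows=%d elapsed=%.1fs", name, rows, time.time() - start)
    except Exception:
        error_log.exception("step=%s failed", name)
        raise

run_log.info("ETL started")
try:
    run_step("extract_orders", lambda: 1234)   # placeholder step returning a row count
    run_log.info("ETL finished successfully")
except Exception:
    run_log.info("ETL failed")
```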
This concludes this article on the basic concepts of ETL in data analysis. For more on the basic concepts of ETL, please search for my previous articles or continue browsing the related articles below. I hope you will support me in the future!