Common statements for importing data into Hive with Sqoop
Import directly into a Hive table
sqoop import --connect jdbc:postgresql://ip/db_name --username user_name --table table_name --hive-import -m 5
Internally this runs in three steps: 1. import the data into HDFS (the intermediate directory can be found on HDFS); 2. create a Hive table with the same name as the source table; 3. load the data from HDFS into the Hive table.
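You can check the output of step 1 directly. A quick sanity check; by default Sqoop writes the intermediate files under the current user's HDFS home directory, so the exact path here is an assumption:
hadoop fs -ls /user/your_user/table_name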
Create a Hive table from a PostgreSQL table with Sqoop
sqoop create-hive-table --connect jdbc:postgresql://ip/db_name --username user_name --table table_name --hive-table hive_table_name (add --hive-partition-key partition_name if you need a partition)
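Once the table is created, the generated schema can be inspected from the Hive CLI (assuming the hive command is on your PATH):
hive -e "describe hive_table_name"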
Import into a Hive table that has already been created
sqoop import --connect jdbc:postgresql://ip/db_name --username user_name --table table_name --hive-import -m 5 --hive-table hive_table_name (optionally --hive-partition-key partition_name --hive-partition-value partition_value)
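For example, a daily import into a dt partition might look like this (the partition name and value here are purely illustrative):
sqoop import --connect jdbc:postgresql://ip/db_name --username user_name --table table_name --hive-import -m 5 --hive-table hive_table_name --hive-partition-key dt --hive-partition-value '2012-08-28'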
Import into a Hive table using a query
sqoop import --connect jdbc:postgresql://ip/db_name --username user_name --query "select * from retail_tb_order where \$CONDITIONS" --hive-import -m 5 --hive-table hive_table_name (optionally --hive-partition-key partition_name --hive-partition-value partition_value)
Note: the $CONDITIONS token is required in the where clause. If the query is wrapped in double quotes, $CONDITIONS must be escaped as \$CONDITIONS so the shell does not expand it; with single quotes no escaping is needed.
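A minimal illustration of the two quoting styles:
# double quotes: the shell would expand $CONDITIONS, so it must be escaped
--query "select * from retail_tb_order where \$CONDITIONS"
# single quotes: the shell does not expand variables, no escaping needed
--query 'select * from retail_tb_order where $CONDITIONS'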
Problems encountered
How do you add a column of data that does not exist in the source relational database while importing into the Hive table?
The first idea is to add a partition, which is an easy way to add "a column" of data: a partition behaves much like an ordinary column, and common SQL runs against it without problems, as in the sketch below.
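For example, a partition column can be filtered exactly like an ordinary column (a sketch; the table and partition names are illustrative):
select * from hive_table_name where partition_name = 'some_value';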
The second idea is to add a constant or a variable to the query's SQL, for example: "select 'hello',* from retail_tb_order where \$CONDITIONS". Running this reports an exception:
12/08/28 14:41:31 INFO : Beginning code generation
12/08/28 14:41:31 INFO : Executing SQL statement: select 'hello',* from retail_tb_order where (1 = 0)
12/08/28 14:41:32 INFO : Executing SQL statement: select 'hello',* from retail_tb_order where (1 = 0)
12/08/28 14:41:32 ERROR : Cannot resolve SQL type 1111
12/08/28 14:41:32 ERROR : Cannot resolve SQL type 1111
12/08/28 14:41:32 ERROR : No Java type for SQL type 1111 for column _column_
12/08/28 14:41:32 ERROR : No Java type for SQL type 1111 for column _column_
12/08/28 14:41:32 ERROR : No Java type for SQL type 1111 for column _column_
12/08/28 14:41:32 ERROR : No Java type for SQL type 1111 for column _column_
12/08/28 14:41:32 ERROR : No Java type for SQL type 1111 for column _column_
12/08/28 14:41:32 ERROR : No Java type for SQL type 1111 for column _column_
12/08/28 14:41:32 ERROR : Got exception running Sqoop:
    at (:900)
    at (:925)
    at (:999)
    at (:1314)
    at (:1138)
    at (:82)
    at (:367)
    at (:453)
    at (:145)
    at (:65)
    at (:181)
    at (:220)
    at (:229)
    at (:238)
    at (:57)
The cause of this problem: Sqoop's ClassWriter parses every column that the SQL statement returns from the PostgreSQL table, and when it reaches the constant 'hello', the database has no column metadata for it, so no data type can be resolved (JDBC type 1111 is java.sql.Types.OTHER, which has no Java mapping).
To solve this problem properly, you would need to modify the ClassWriter source code.
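If rebuilding Sqoop is not an option, a possible lighter workaround (my assumption, not part of the original text) is to cast the constant in the query so that the PostgreSQL JDBC driver reports a concrete SQL type instead of type 1111 (OTHER):
# extra_col and the target directory are illustrative names
sqoop import --connect jdbc:postgresql://ip/db_name --username user_name --query "select cast('hello' as varchar) as extra_col, t.* from retail_tb_order t where \$CONDITIONS" --target-dir /user/hadoop/retail_tb_order --hive-import -m 1 --hive-table hive_table_name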
Supplement: with Sqoop, the data imported into Hive is inconsistent with the data in the source database
Sqoop is an open source tool mainly used to transfer data between Hadoop (Hive) and traditional relational databases (MySQL, PostgreSQL, Oracle, etc.). It can import data from a relational database into Hadoop's HDFS, and it can also export HDFS data back into a relational database, as sketched below.
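For reference, the reverse direction (HDFS back into a relational database) is a plain sqoop export; the connection string, table, and directory here are illustrative:
sqoop export --connect jdbc:mysql://ip/db_name --username user_name --table table_name --export-dir /user/hadoop/table_name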
1. Problem background
We used Sqoop to import a table from an Oracle database, call it student, into HDFS, then created a Hive external table whose location points at the HDFS directory where the data was saved. When counting rows under specific conditions, the Hive result turned out to be inconsistent with the result in Oracle.
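The external table creation looked roughly like this (a sketch; the column list is an assumption, since only name and class_id are known from the query below):
create external table student (
  id string,
  name string,
  class_id string
)
row format delimited fields terminated by ','
location '/user/hadoop/student';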
1.1 Import the data into HDFS under /user/hadoop/student
sqoop import --connect "jdbc:oracle:thin:@//localhost:1521/student" --password "***" --username "***" --query "select * from student where name='zhangsan' and class_id='003' and \$CONDITIONS" --target-dir "/user/hadoop/student" --verbose -m 1
At this point the table data imported from Oracle is saved under /user/hadoop/student on HDFS.
How exactly is the table data stored on HDFS? Pay attention to this detail: it is what ultimately causes the inconsistency.
Let's look at how the data is stored on HDFS. Running hadoop fs -cat /user/hadoop/student/part-m-00000 shows that fields are separated by ',', which is Sqoop's default. If a field value itself contains ',', the row will be split at the wrong places when the data is read into Hive, because the Hive table is also delimited by ','.
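For example, suppose one row's name value is literally 'zhang,san' (an invented value). The stored line would be:
1,zhang,san,003
Hive then sees four fields where the table defines three, so class_id picks up the wrong value for that row.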
2. Analyze the problem
Comparing select count(*) from student in Hive with the same query in Oracle, the totals match, so no rows were lost during the load. But running select count(*) from student where class_id='003' on both sides, Hive returns fewer rows than Oracle. Why are the results inconsistent under this condition?
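Concretely, these were the two queries compared on both sides:
select count(*) from student;
select count(*) from student where class_id='003';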
The root cause: Hive splits rows on commas, and several field values contain commas, so fields and values no longer line up and conditional queries cannot return the correct result.
The recommendation is to have Sqoop split fields with '\001' (Ctrl-A, Hive's default field delimiter) during import, via the --fields-terminated-by <char> parameter.
Refer to the Sqoop documentation: /docs/1.4.2/#_large_objects
The final, corrected Sqoop statement is:
sqoop import --connect "jdbc:oracle:thin:@//localhost:1521/student" --password "***" --username "***" --query "select * from student where name='zhangsan' and class_id='003' and \$CONDITIONS" --target-dir "/user/hadoop/student" --fields-terminated-by "\001" --verbose -m 1
The above is from personal experience; I hope it gives you a useful reference. If there are mistakes or anything I have not fully considered, I welcome your corrections.