1. Problem
The environment is JDK 1.8 and Spark 3.2.1. Reading a file stored in Hadoop (HDFS) in GB18030 encoding produces garbled output (mojibake).
2. A painful journey
I tried many approaches to solve this problem, but none of them worked.
1. textFile + Configuration: still garbled output
String filePath = "hdfs:///user/";
String encoding = "GB18030";
// Create the SparkSession and JavaSparkContext
SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("Spark Example")
        .getOrCreate();
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
Configuration entries = sc.hadoopConfiguration();
entries.set("textinputformat.record.delimiter", "\n");
entries.set("mapreduce.input.fileinputformat.inputdir", filePath);
// I also tried pointing a configuration property at "GB18030", to no effect
JavaRDD<String> rdd = sc.textFile(filePath);
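As far as I can tell, `sc.textFile` hands each line through Hadoop's `Text`, which always interprets the underlying bytes as UTF-8 regardless of any Configuration setting. A minimal plain-Java sketch of what happens when GB18030 bytes are forced through a UTF-8 decode (string content here is just an example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8DecodeDemo {
    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        String original = "中文内容";                    // what the file really contains
        byte[] fileBytes = original.getBytes(gb18030);   // bytes as stored on HDFS

        // Treating the bytes as UTF-8 (what Text does) turns every invalid
        // sequence into the U+FFFD replacement character
        String decoded = new String(fileBytes, StandardCharsets.UTF_8);

        System.out.println(decoded);                     // mojibake
        System.out.println(decoded.contains("\uFFFD"));  // true
    }
}
```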
2. spark.read().option("encoding", ...): still garbled output
Dataset<Row> load = spark.read().format("text")
        .option("encoding", "GB18030")
        .load(filePath);
load.foreach(row -> {
    System.out.println(row.getString(0));
    System.out.println(new String(row.getString(0).getBytes(encoding), "UTF-8"));
    System.out.println(new String(row.getString(0).getBytes(encoding), "GBK"));
});
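The `getBytes`/`new String` round-trip in the loop above cannot work even in principle: by the time Spark hands you a `String`, the file bytes were already decoded as UTF-8 and every invalid sequence has collapsed into U+FFFD, so the original bytes are unrecoverable. A plain-Java illustration (example string assumed):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTripDemo {
    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        byte[] fileBytes = "中文".getBytes(gb18030);     // raw GB18030 bytes

        // Step 1: the framework decodes them as UTF-8 (lossy)
        String mojibake = new String(fileBytes, StandardCharsets.UTF_8);

        // Step 2: re-encoding cannot restore the original bytes --
        // each U+FFFD re-encodes to 0xEF 0xBF 0xBD, not the GB18030 bytes
        byte[] roundTrip = mojibake.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(fileBytes, roundTrip));            // false

        // So decoding the round-tripped bytes as GB18030 still fails
        System.out.println(new String(roundTrip, gb18030).equals("中文"));  // false
    }
}
```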
3. newAPIHadoopFile + Configuration: still garbled output
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD =
        sc.newAPIHadoopFile(filePath, TextInputFormat.class,
                LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
4. newAPIHadoopFile + custom input format class: still garbled output
// GB18030TextInputFormat: the built-in TextInputFormat copied with its
// internal UTF-8 references changed to GB18030 (class name is illustrative)
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD =
        sc.newAPIHadoopFile(filePath, GB18030TextInputFormat.class,
                LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
For this attempt, the built-in input format source was copied and every internal reference to UTF-8 was changed to GB18030.
5. newAPIHadoopRDD + custom input format class: still garbled output
// GB18030TextInputFormat: a copy of TextInputFormat with UTF-8 swapped
// for GB18030, as in the previous attempt (class name is illustrative)
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 =
        sc.newAPIHadoopRDD(entries, GB18030TextInputFormat.class,
                LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    System.out.println(k._2());
});
3. Final solution
In all of the approaches above, the specified character encoding never seemed to take effect, and I don't know why. If you understand the reason, please help clear up my confusion. Thank you.
The final working solution is as follows:
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD =
        sc.newAPIHadoopFile(filePath, TextInputFormat.class,
                LongWritable.class, Text.class, new Configuration());
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    // decode the raw line bytes with the file's real charset
    System.out.println(new String(k._2.copyBytes(), encoding));
});

// the same trick works with newAPIHadoopRDD; here `entries` must carry the
// input path via "mapreduce.input.fileinputformat.inputdir"
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 =
        sc.newAPIHadoopRDD(entries, TextInputFormat.class,
                LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    System.out.println(new String(k._2().copyBytes(), encoding));  // k._2 and k._2() are equivalent
});
The key is new String(k._2().copyBytes(), encoding): decode the raw bytes yourself instead of relying on the framework's UTF-8 decoding.
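This works because `Text` holds the raw bytes of each line, and `copyBytes()` returns them untouched, so an explicit `new String(bytes, "GB18030")` can decode them with the right charset. The same idea in plain Java, without the Hadoop `Text` wrapper (example string assumed):

```java
import java.nio.charset.Charset;

public class Gb18030DecodeDemo {
    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        String original = "中文内容";

        // Raw bytes exactly as a GB18030 file stores them --
        // this is what Text.copyBytes() returns for each line
        byte[] raw = original.getBytes(gb18030);

        // Decode with the file's real charset instead of UTF-8
        String decoded = new String(raw, gb18030);

        System.out.println(decoded);                   // 中文内容
        System.out.println(decoded.equals(original));  // true
    }
}
```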