SoFunction
Updated on 2025-03-03

Fixing garbled text when reading files with Java Spark

1. Problem

The environment is JDK 1.8 and Spark 3.2.1. Reading a GB18030-encoded file from Hadoop (HDFS) produces garbled text.
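The symptom can be reproduced without Spark or Hadoop at all; it is purely a charset mismatch. A minimal plain-Java sketch, where a temp file stands in for the GB18030 file in HDFS:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class GarbledSymptom {
    public static void main(String[] args) throws Exception {
        // A stand-in for the GB18030-encoded file in HDFS
        Path p = Files.createTempFile("gb18030", ".txt");
        Files.write(p, "中文测试".getBytes(Charset.forName("GB18030")));

        // A reader that assumes UTF-8 hits invalid byte sequences and
        // substitutes U+FFFD replacement characters, i.e. the garbled output
        String garbled = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        System.out.println(garbled);
        System.out.println(garbled.contains("\uFFFD")); // true
    }
}
```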

2. Sad journey

Many approaches were tried to solve this problem; none of them worked.

1. textFile + Configuration — garbled

        String filePath = "hdfs:///user/";
        String encoding = "GB18030";

        // Create the SparkSession and JavaSparkContext
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("Spark Example")
                .getOrCreate();

        JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        Configuration entries = sc.hadoopConfiguration();
        entries.set("textinputformat.record.delimiter", "\n");
        // (two more entries.set(...) calls followed here, passing filePath and
        // "GB18030"; their property keys did not survive in the original listing)

        JavaRDD<String> rdd = sc.textFile(filePath);

2. read().option method — garbled

        Dataset<Row> load = spark.read().format("text").option("encoding", "GB18030").load(filePath);

        load.foreach((ForeachFunction<Row>) row -> {
            System.out.println(row.getString(0));
            System.out.println(new String(row.getString(0).getBytes(encoding), "UTF-8"));
            System.out.println(new String(row.getString(0).getBytes(encoding), "GBK"));
        });
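Conversions like `new String(s.getBytes(encoding), "UTF-8")` on an already-decoded String cannot repair anything: if the reader decoded the GB18030 bytes with the wrong charset, every invalid sequence was already replaced by U+FFFD, and the original bytes are gone. A small sketch of why:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecode {
    public static void main(String[] args) {
        byte[] original = "编码".getBytes(Charset.forName("GB18030"));

        // Wrong decode: invalid sequences collapse to U+FFFD
        String garbled = new String(original, StandardCharsets.UTF_8);

        // Re-encoding the garbled String cannot restore the original bytes;
        // the information was destroyed at the first decode
        byte[] roundTrip = garbled.getBytes(Charset.forName("GB18030"));
        System.out.println(Arrays.equals(original, roundTrip)); // false
    }
}
```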

3. newAPIHadoopFile + Configuration — garbled

        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD =
                sc.newAPIHadoopFile(filePath, TextInputFormat.class, LongWritable.class, Text.class, entries);

        System.out.println("longWritableTextJavaPairRDD  count =" + longWritableTextJavaPairRDD.count());
        longWritableTextJavaPairRDD.foreach(k -> {
            System.out.println(k._2);
        });

4. newAPIHadoopFile + custom class — garbled

        // Gb18030TextInputFormat is the custom input format
        // (class name assumed; it did not survive in the original listing)
        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD =
                sc.newAPIHadoopFile(filePath, Gb18030TextInputFormat.class, LongWritable.class, Text.class, entries);

        System.out.println("longWritableTextJavaPairRDD  count =" + longWritableTextJavaPairRDD.count());
        longWritableTextJavaPairRDD.foreach(k -> {
            System.out.println(k._2);
        });

The custom class here is a copy of Hadoop's TextInputFormat (and its record reader), with the internal UTF-8 references changed to GB18030.

5. newAPIHadoopRDD + custom class — garbled

        // same custom input format as above (class name assumed)
        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 =
                sc.newAPIHadoopRDD(entries, Gb18030TextInputFormat.class, LongWritable.class, Text.class);
        System.out.println("longWritableTextJavaPairRDD  count =" + longWritableTextJavaPairRDD1.count());
        longWritableTextJavaPairRDD1.foreach(k -> {
            System.out.println(k._2());
        });

3. Final solution

None of the methods above made the specified character encoding take effect, and I don't know why. If you understand the reason, please help clear up my confusion. Thank you.

The final working solution is as follows:

        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD =
                sc.newAPIHadoopFile(filePath, TextInputFormat.class, LongWritable.class, Text.class, new Configuration());

        System.out.println("longWritableTextJavaPairRDD  count =" + longWritableTextJavaPairRDD.count());
        longWritableTextJavaPairRDD.foreach(k -> {
            System.out.println(new String(k._2.copyBytes(), encoding));
        });

        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 =
                sc.newAPIHadoopRDD(entries, TextInputFormat.class, LongWritable.class, Text.class);

        System.out.println("longWritableTextJavaPairRDD  count =" + longWritableTextJavaPairRDD1.count());
        longWritableTextJavaPairRDD1.foreach(k -> {
            // k._2() and k._2 refer to the same Text value
            System.out.println(new String(k._2().copyBytes(), encoding));
        });

The key is decoding the record's raw bytes yourself: new String(k._2().copyBytes(), encoding). Hadoop's Text stores a line's bytes exactly as they appear in the file, but Text.toString() always decodes them as UTF-8, which is where the garbling came from; copyBytes() returns the untouched bytes so they can be decoded with the correct charset.
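The mechanism can be simulated without Hadoop. In the sketch below, a plain byte[] plays the role of the Text payload: the UTF-8 decode mimics what Text.toString() does, and the GB18030 decode mimics new String(text.copyBytes(), encoding):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CopyBytesFix {
    public static void main(String[] args) {
        // The raw bytes of one record, as Hadoop's Text would store them
        byte[] payload = "中文测试数据".getBytes(Charset.forName("GB18030"));

        // Equivalent of Text.toString(): a hardwired UTF-8 decode, garbled
        String viaToString = new String(payload, StandardCharsets.UTF_8);
        System.out.println(viaToString.contains("\uFFFD")); // true

        // Equivalent of new String(text.copyBytes(), "GB18030"): correct
        String viaCopyBytes = new String(payload, Charset.forName("GB18030"));
        System.out.println(viaCopyBytes); // 中文测试数据
    }
}
```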

This concludes the article on fixing garbled text when reading files with Java Spark. For more on this topic, please search my previous articles or browse the related articles below, and I hope you will continue to support me.