Java sample code for reading and writing Parquet data

This article shows how to read and write Parquet data in Java. The complete example follows:

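import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Logger; // assumed logging library; the original import line was lost
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.GroupFactory;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetReader.Builder;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;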

public class ReadParquet {
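  // Assumed to be a log4j logger; it is not used by the methods below.
  static Logger logger = Logger.getLogger(ReadParquet.class);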
  public static void main(String[] args) throws Exception {
    
//    parquetWriter("test\\parquet-out2","");
    parquetReaderV2("test\\parquet-out2");
  }
  
  
  static void parquetReaderV2(String inPath) throws Exception{
    GroupReadSupport readSupport = new GroupReadSupport();
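    Builder<Group> reader = ParquetReader.builder(readSupport, new Path(inPath));
    ParquetReader<Group> build = reader.build();
    Group line = null;
    // read() returns null once the end of the file is reached
    while ((line = build.read()) != null) {
      Group time = line.getGroup("time", 0);
      // Values can be fetched by field index as well as by field name
      /*System.out.println(line.getString(0, 0) + "\t" +
          line.getString(1, 0) + "\t" +
          time.getInteger(0, 0) + "\t" +
          time.getString(1, 0) + "\t");*/
      System.out.println(line.getString("city", 0) + "\t" +
          line.getString("ip", 0) + "\t" +
          time.getInteger("ttl", 0) + "\t" +
          time.getString("ttl2", 0) + "\t");
      //System.out.println(line.toString());
    }
    System.out.println("End of reading.");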
  } 
  // In newer versions all of the new ParquetReader(...) constructors appear to be deprecated; use the builder shown above to construct the reader instead
  static void parquetReader(String inPath) throws Exception{
    GroupReadSupport readSupport = new GroupReadSupport();
    ParquetReader<Group> reader = new ParquetReader<Group>(new Path(inPath),readSupport);
    Group line=null;
    while ((line = reader.read()) != null) {
      System.out.println(line.toString());
    }
    System.out.println("End of reading.");
    
  }
  /**
   *
   * @param outPath output Parquet format
   * @param inPath input normal text file
   * @throws IOException
   */
  static void parquetWriter(String outPath,String inPath) throws IOException{
    MessageType schema = MessageTypeParser.parseMessageType("message Pair {\n" +
        " required binary city (UTF8);\n" +
        " required binary ip (UTF8);\n" +
        " repeated group time {\n"+
        " required int32 ttl;\n"+
         " required binary ttl2;\n"+
        "}\n"+
       "}");
    GroupFactory factory = new SimpleGroupFactory(schema);
    Path path = new Path(outPath);
    Configuration configuration = new Configuration();
    GroupWriteSupport writeSupport = new GroupWriteSupport();
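    // setSchema is a static method: it stores the schema in the Configuration
    GroupWriteSupport.setSchema(schema, configuration);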
    ParquetWriter<Group> writer = new ParquetWriter<Group>(path,configuration,writeSupport);
// Read in local files to generate parquet files.
    BufferedReader br =new BufferedReader(new FileReader(new File(inPath)));
    String line="";
    Random r=new Random();
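    while ((line = br.readLine()) != null) {
      String[] strs = line.split("\\s+");
      // Only lines with exactly two whitespace-separated fields (city, ip) are written
      if (strs.length == 2) {
        Group group = factory.newGroup()
            .append("city", strs[0])
            .append("ip", strs[1]);
        // Add one entry to the repeated group "time" and fill it with random values
        Group tmpG = group.addGroup("time");
        tmpG.append("ttl", r.nextInt(9) + 1);
        tmpG.append("ttl2", r.nextInt(9) + "_a");
        writer.write(group);
      }
    }
    System.out.println("write end");
    br.close();
    writer.close();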
  }
}
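
parquetWriter expects a plain text file in which every line holds a city and an IP address separated by whitespace; lines that do not split into exactly two fields are skipped. The sketch below uses only the JDK and isolates that filter, so an input file can be checked before any Parquet file is written. The class name SampleInput, the helper acceptedRows, and the sample city and IP values are made up for illustration and are not part of the original code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the article's ReadParquet class.
class SampleInput {
    /** Returns the lines that split into exactly two whitespace-separated fields. */
    static List<String[]> acceptedRows(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\\s+");
            if (fields.length == 2) {
                rows.add(fields);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data: two well-formed lines and one that will be skipped.
        List<String> lines = Arrays.asList(
                "beijing 10.0.0.1",
                "shanghai\t10.0.0.2",
                "malformed line with extra fields");
        Path file = Files.createTempFile("parquet-input", ".txt");
        Files.write(file, lines, StandardCharsets.UTF_8);

        // Read the file back and show which rows the writer would keep.
        for (String[] row : acceptedRows(Files.readAllLines(file, StandardCharsets.UTF_8))) {
            System.out.println("city=" + row[0] + " ip=" + row[1]);
        }
    }
}
```

A line with leading whitespace splits into an extra empty first field and is therefore rejected as well, so input lines should start directly with the city.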

Schema (a schema is required to write Parquet data; when reading, it is recognized automatically from the file)

/*
 * Each field has three attributes: a repetition (number of occurrences), a data type, and a field name.
 * The repetition is one of the following three:
 *   required (occurs exactly 1 time)
 *   repeated (occurs 0 or more times)
 *   optional (occurs 0 or 1 times)
 * The data type of a field is one of two kinds:
 *   group (complex type)
 *   primitive (basic type)
 * The primitive types are:
 *   INT64, INT32, BOOLEAN, BINARY, FLOAT, DOUBLE, INT96, FIXED_LEN_BYTE_ARRAY
 */
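
The three repetition keywords can be read as bounds on how many times a field may occur in one record. The following is a minimal sketch of that rule in plain Java. The enum and its allows method are written here for illustration only; Parquet ships its own enum, org.apache.parquet.schema.Type.Repetition, which this sketch does not use.

```java
/** Illustrative model of the three repetition keywords; not the Parquet API. */
enum Repetition {
    REQUIRED(1, 1),                 // occurs exactly once
    OPTIONAL(0, 1),                 // occurs 0 or 1 times
    REPEATED(0, Integer.MAX_VALUE); // occurs 0 or more times

    private final int min;
    private final int max;

    Repetition(int min, int max) {
        this.min = min;
        this.max = max;
    }

    /** True if a field with this repetition may occur {@code count} times in one record. */
    boolean allows(int count) {
        return count >= min && count <= max;
    }
}

class RepetitionDemo {
    public static void main(String[] args) {
        for (Repetition r : Repetition.values()) {
            System.out.println(r + ": 0 occurrences allowed? " + r.allows(0)
                    + ", 1? " + r.allows(1)
                    + ", 2? " + r.allows(2));
        }
    }
}
```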

The difference between repeated and required is not only the number of occurrences: the type of the data produced after serialization also differs. For example, with repeated, ttl2 prints as WrappedArray([7,7_a]), whereas with required it prints as [7,7_a].

In addition to parsing a schema string with MessageTypeParser.parseMessageType, you can also build the MessageType with the following method:

(Note a pitfall here, one that shows up in Spark: as(OriginalType.UTF8) on ttl2 plays the same role as the (UTF8) in "required binary city (UTF8)". With the UTF8 annotation the field can be converted to a StringType when it is read; without it, reading fails with an error that begins "[B cannot be cast to".)

/*MessageType schema = MessageTypeParser.parseMessageType("message Pair {\n" +
        " required binary city (UTF8);\n" +
        " required binary ip (UTF8);\n" +
        "repeated group time {\n"+
        "required int32 ttl;\n"+
        "required binary ttl2;\n"+
        "}\n"+
        "}");*/
    
//import org.apache.parquet.schema.Types;
MessageType schema = Types.buildMessage()
      .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("city")
      .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ip")
      .repeatedGroup().required(PrimitiveTypeName.INT32).named("ttl")
              .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ttl2")
              .named("time")
     .named("Pair");
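
The "[B" in the error message mentioned in the note is the JVM's internal name for byte[]: a binary column without the UTF8 annotation comes back as raw bytes, and treating those bytes as if they were already a String fails. The sketch below reproduces that situation with the JDK alone, without Parquet or Spark. The class BinaryVsString and its two helpers are hypothetical names used only for this illustration, and the exact wording of the exception message depends on the Java version.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use Parquet or Spark.
class BinaryVsString {
    /** Mimics a reader that assumes the column value is already a String. */
    static String castToString(Object value) {
        return (String) value; // throws ClassCastException when value is a byte[]
    }

    /** Decodes raw bytes as UTF-8 text instead of casting them. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for a binary column value that carries no UTF8 annotation.
        Object value = "7_a".getBytes(StandardCharsets.UTF_8);
        try {
            castToString(value);
        } catch (ClassCastException e) {
            System.out.println("Cast failed: " + e.getMessage());
        }
        System.out.println("Decoded: " + decodeUtf8((byte[]) value));
    }
}
```

The UTF8 annotation in the schema is what tells a reader that the second path, decoding the bytes as text, is the intended one.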

To resolve the "[B cannot be cast to" exception:

1. Either add the UTF8 annotation when generating the Parquet file,
2. or supply the same schema when reading, so that the type of the field is specified explicitly.

Maven dependency (I am using version 1.7.0)

<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.7.0</version>
</dependency>
