
Detailed explanation of how SparkSQL outputs data

1. Ordinary file output method

Method 1: specify the output format and the target path explicitly

("json").save(path)
("csv").save(path)
("parquet").save(path)

Method 2: call the writer method named after the data source type directly

df.write.json(path)
df.write.csv(path)
df.write.parquet(path)
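
For example, the two styles are interchangeable (a minimal sketch; df and the output directories are assumed names):

# Method 1: generic writer with an explicit format
df.write.format("csv").save("../../datas/demo_csv")
# Method 2: format-specific shortcut for the same kind of write
df.write.csv("../../datas/demo_csv2")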

The save mode controls what happens when data already exists at the target path:

append: append mode; new data is added alongside the existing data
overwrite: overwrite mode; existing data is replaced by the current, latest data
error/errorifexists: raise an error if the target already exists (the default mode)
ignore: ignore mode; do nothing when data already exists

Code writing template:

(saveMode="append").format("csv").save(path)

Code demonstration of ordinary file output format:

import os
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # Configure the Java environment
    os.environ['JAVA_HOME'] = 'C:/Program Files/Java/jdk1.8.0_241'
    # The Hadoop path is the directory it was unpacked to earlier
    os.environ['HADOOP_HOME'] = 'D:/hadoop-3.3.1'
    # Configure the Python interpreter of the base environment for both executor and driver
    os.environ['PYSPARK_PYTHON'] = 'C:/ProgramData/Miniconda3/'
    os.environ['PYSPARK_DRIVER_PYTHON'] = 'C:/ProgramData/Miniconda3/'

    spark = SparkSession.builder.master("local[2]").appName("").config(
        "spark.sql.shuffle.partitions", 2).getOrCreate()

    df = spark.read.json("../../datas/")

    # Get the name of the oldest person
    df.createOrReplaceTempView("persons")
    rsDf = spark.sql("""
       select name,age from persons where age = (select max(age) from persons)
    """)

    # Print the result to the console
    # rsDf.write.format("console").save()
    # rsDf.write.json("../../datas/result", mode="overwrite")
    # rsDf.write.mode(saveMode='overwrite').format("json").save("../../datas/result")
    # rsDf.write.mode(saveMode='overwrite').format("csv").save("../../datas/result1")
    # rsDf.write.mode(saveMode='overwrite').format("parquet").save("../../datas/result2")
    # rsDf.write.mode(saveMode='append').format("csv").save("../../datas/result1")
    # text: saving to the hdfs path below reported an error; text output is not supported here
    # (the text format only accepts a single string column)
    # rsDf.write.mode(saveMode='overwrite').text("hdfs://bigdata01:9820/result")
    # rsDf.write.csv("hdfs://bigdata01:9820/result", mode="overwrite")
    rsDf.write.json("hdfs://bigdata01:9820/result", mode="overwrite")
    spark.stop()
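
Each write above produces a directory of part files, one per partition. To end up with a single output file, the result can be collapsed to one partition before writing (a minimal sketch, reusing rsDf with an assumed output path):

# coalesce(1) merges everything into one partition, so only one part file is written
rsDf.coalesce(1).write.json("../../datas/result_single", mode="overwrite")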

2. Save to a database

Code demo:

import os
# Import the pyspark modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # Configure the Java environment
    os.environ['JAVA_HOME'] = 'D:\\Download\\Java\\JDK'
    # The Hadoop path is the directory it was unpacked to earlier
    os.environ['HADOOP_HOME'] = 'D:\\bigdata\\hadoop-3.3.1\\hadoop-3.3.1'
    # Configure the Python interpreter of the base environment for both executor and driver
    os.environ['PYSPARK_PYTHON'] = 'C:/ProgramData/Miniconda3/'
    os.environ['PYSPARK_DRIVER_PYTHON'] = 'C:/ProgramData/Miniconda3/'

    spark = SparkSession.builder.master('local[*]').appName('').config(
        "spark.sql.shuffle.partitions", 2).getOrCreate()

    df5 = spark.read.format("csv").option("sep", "\t").load("../../datas/zuoye/") \
        .toDF('eid', 'ename', 'salary', 'sal', 'dept_id')
    df5.createOrReplaceTempView('emp')
    rsDf = spark.sql("select * from emp")

    rsDf.write.format("jdbc") \
        .option("driver", "com.mysql.cj.jdbc.Driver") \
        .option("url", "jdbc:mysql://bigdata01:3306/mysql") \
        .option("user", "root") \
        .option("password", "123456") \
        .option("dbtable", "emp1") \
        .save(mode="overwrite")

    # Remember to close the session when you are done
    spark.stop()
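
The jdbc write only succeeds if the MySQL connector JAR is on Spark's classpath; otherwise the driver class cannot be found. One way to supply it (a sketch; the JAR path and version are assumptions) is via spark.jars when building the session, or by copying the JAR into pyspark's jars directory:

# The connector path below is an assumption; point it at wherever your JAR lives
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.jars", "D:/bigdata/mysql-connector-java-8.0.28.jar") \
    .getOrCreate()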

3. Save to Hive

Code demonstration:

import os
# Import the pyspark modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # Configure the Java environment
    os.environ['JAVA_HOME'] = 'D:\\Download\\Java\\JDK'
    # The Hadoop path is the directory it was unpacked to earlier
    os.environ['HADOOP_HOME'] = 'D:\\bigdata\\hadoop-3.3.1\\hadoop-3.3.1'
    # Configure the Python interpreter of the base environment for both executor and driver
    os.environ['PYSPARK_PYTHON'] = 'C:/ProgramData/Miniconda3/'
    os.environ['PYSPARK_DRIVER_PYTHON'] = 'C:/ProgramData/Miniconda3/'
    os.environ['HADOOP_USER_NAME'] = 'root'

    spark = SparkSession \
        .builder \
        .appName("HiveAPP") \
        .master("local[2]") \
        .config("spark.sql.warehouse.dir", 'hdfs://bigdata01:9820/user/hive/warehouse') \
        .config('hive.metastore.uris', 'thrift://bigdata01:9083') \
        .config("spark.sql.shuffle.partitions", 2) \
        .enableHiveSupport() \
        .getOrCreate()

    df5 = spark.read.format("csv").option("sep", "\t").load("../../datas/zuoye/") \
        .toDF('eid', 'ename', 'salary', 'sal', 'dept_id')
    df5.createOrReplaceTempView('emp')
    rsDf = spark.sql("select * from emp")

    # Save the result as a Hive table; the original table name was lost, "emp1" is a placeholder
    rsDf.write.saveAsTable("emp1")

    # Remember to close the session when you are done
    spark.stop()
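
To confirm that the table landed in the warehouse, it can be read back through the metastore before calling spark.stop() (a minimal sketch; "emp1" is the placeholder table name used above):

spark.sql("show tables").show()  # the new table should appear in the list
spark.table("emp1").show()       # read the Hive table back as a DataFrame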

This is the end of this article about how SparkSQL outputs data. For more related content on SparkSQL data output, please search my previous articles or continue browsing the related articles below. I hope you will continue to support me!