PySpark exposes the DataFrameWriter class through a DataFrame's write property, which saves DataFrames in various supported file formats such as CSV, JSON, and Parquet. Just like reading, writing a CSV file can be done with either the csv() method or the format().save() combination.
However, one limitation of this approach is that it doesn't provide direct control over the names of the output files. Every file is written with a "part-" prefix, and the number of output files generated depends on the number of partitions in the DataFrame and the cluster's configuration.
Let us see this with examples, using a sample DataFrame.
Using CSV Method:
Using format().save() Method:
The mode() Method:
When attempting to write data to a location that already contains files, Spark raises an error saying the path already exists. This occurs because the default save mode is 'error' (also known as 'errorifexists'). For a more practical understanding, please refer to the examples below.
The DataFrameWriter's write methods support the mode() setting, which specifies how the cluster should handle existing files or directories at the target location. Here are the options it supports:
- append: Adds the new data as additional files without overwriting what is already there.
- overwrite: Deletes the existing data first and replaces it with the new data.
- ignore: Skips the write entirely if the location already contains data, disregarding the operation without raising an error.
- error / errorifexists (default): Throws a 'path already exists' error if the target location is not empty.
data = [(1, "India"), (2, "US"), (3, "UK"), (4, "China"), (5, "AUS")]
schema = ["ID", "CountryName"]
df = spark.createDataFrame(data, schema)
display(df)
df.write.csv("dbfs:/FileStore/WriteCSVFile/")
df.write.format("csv").save("dbfs:/FileStore/WriteCSVFileWithformat/")
df.write.mode("append").csv("dbfs:/FileStore/WriteCSVFile/")
help(df.write.csv)
df.write.csv(path="dbfs:/FileStore/WriteCSVFileWithOptions/", header=True, sep="\t")
# OR
# df.write.format("csv").option("header", "True").option("sep", "\t").save("dbfs:/FileStore/WriteCSVFileWithformat/")
# OR
# df.write.mode("overwrite").options(header=True, sep="\t").csv("dbfs:/FileStore/WriteCSVFileWithOptions/")