PySpark: Write DataFrame Data to CSV Files

PySpark's DataFrameWriter class, exposed through a DataFrame's write property, saves DataFrame data in the supported file formats such as CSV, JSON, and Parquet. Just like the read API, the write API lets you save CSV files either with the csv() method or with the format().save() combination.

However, one limitation of this API is that it doesn't give you direct control over the names of the output files. All files are saved with a "part-" prefix, and the number of output files matches the number of partitions in the DataFrame, which in turn depends on the cluster configuration and the amount of data being written.

Let us see this with examples, using a sample DataFrame.
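Here is the sample DataFrame used throughout this post (the same one that appears in the complete code at the end); the data is just a small list of IDs and country names:

# Build a small sample DataFrame with an ID and a country name
data = [(1, "India"), (2, "US"), (3, "UK"), (4, "China"), (5, "AUS")]
schema = ["ID", "CountryName"]
df = spark.createDataFrame(data, schema)
display(df)  # display() renders the DataFrame in a Databricks notebook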

Using the csv() Method:
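A minimal write with the csv() method, using the DBFS path from the complete code at the end of this post:

# Spark creates the folder and writes part-* files (plus a _SUCCESS marker) inside it
df.write.csv("dbfs:/FileStore/WriteCSVFile/")

Listing the folder afterwards (for example with dbutils.fs.ls) should show one part-* file per partition of the DataFrame.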

Using the format().save() Method:
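The equivalent write expressed with format().save(), again using the path from the complete code below:

# Same write, expressed through the generic format()/save() API
df.write.format("csv").save("dbfs:/FileStore/WriteCSVFileWithformat/")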

The mode() Method:

When attempting to write data to a location that already contains files, Spark will raise an error saying the path already exists. This occurs because the default write mode is 'error' (also called 'errorifexists'). For a more practical understanding, please refer to the example below.
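A quick sketch of the failure, assuming the folder dbfs:/FileStore/WriteCSVFile/ was already written in the earlier example:

# Running the same write a second time fails with the default mode ("error" / "errorifexists");
# Spark raises an AnalysisException along the lines of "path ... already exists"
# (the exact wording varies by Spark version).
df.write.csv("dbfs:/FileStore/WriteCSVFile/")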

The DataFrameWriter's write methods support the mode() setting, which lets you specify how Spark should handle a target location that already exists. These are the supported modes:

  • append: Adds the new data to the existing location without deleting what is already there.
  • overwrite: Spark first deletes the existing data and then writes the new data in its place.
  • ignore: Spark silently skips the write if the location already contains data, without raising an error.
  • error / errorifexists (default): Spark throws a 'path already exists' error if the target location is not empty.
Here I will show the append option; the rest is left to you to practice.
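Appending to the folder written earlier (taken from the complete code below); the new part files sit alongside the existing ones:

# Add new part files to the existing location instead of raising an error
df.write.mode("append").csv("dbfs:/FileStore/WriteCSVFile/")

Since the same five records were written in both passes, reading this folder back should now return each row twice.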

Supported Options for CSV Files:

When writing to CSV files, PySpark provides several options that you can use to customize the output, for example header, sep (the field delimiter), quote, escape, nullValue, dateFormat, and compression. The full list is documented in the Databricks and Apache Spark CSV data source documentation.

Here, I'll introduce you to a few essential properties. You can further explore additional properties as you gain experience through practice.
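Below are three equivalent ways, all taken from the complete code at the end of this post, to write the DataFrame with a header row and a tab delimiter; running help(df.write.csv) in a notebook cell lists every supported parameter if you want to dig further:

# 1. Pass options directly as csv() keyword arguments
df.write.csv(path="dbfs:/FileStore/WriteCSVFileWithOptions/", header=True, sep="\t")

# 2. Chain individual option() calls with the format()/save() API
# df.write.format("csv").option("header", "True").option("sep", "\t").save("dbfs:/FileStore/WriteCSVFileWithformat/")

# 3. Pass several options at once with options(), combined with an explicit write mode
# df.write.mode("overwrite").options(header=True, sep="\t").csv("dbfs:/FileStore/WriteCSVFileWithOptions/")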

Conclusion: 

In this blog post, we have explored various methods for writing DataFrame data to CSV files. It's important to note that if we omit the 'header' option while writing, the output files won't include column names. Consequently, when such files are read back with the 'header' option enabled, Spark treats the first line of each part file as column names, so rows silently disappear; if each part file holds only a single row, the read can even show a 'query returned no results' message. Therefore, it's crucial to carefully choose the options that align with your specific project requirements.
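As a quick illustration of that point, here is a sketch of reading back a folder that was written without a header row, assuming the dbfs:/FileStore/WriteCSVFile/ path from the earlier examples:

# The files under WriteCSVFile/ were written without a header row,
# so header=True makes Spark treat the first line of each part file as column names
df_back = spark.read.csv("dbfs:/FileStore/WriteCSVFile/", header=True)
display(df_back)  # rows consumed as headers disappear from the result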

Please find below the complete code used in this blog.
  
# Create a sample DataFrame
data = [(1, "India"), (2, "US"), (3, "UK"), (4, "China"), (5, "AUS")]
schema = ["ID", "CountryName"]
df = spark.createDataFrame(data, schema)
display(df)

# Write using the csv() method
df.write.csv("dbfs:/FileStore/WriteCSVFile/")

# Write using the format().save() method
df.write.format("csv").save("dbfs:/FileStore/WriteCSVFileWithformat/")

# Append to an existing location instead of failing
df.write.mode("append").csv("dbfs:/FileStore/WriteCSVFile/")

# List all parameters supported by the csv() writer
help(df.write.csv)

# Write with options: include the header row and use a tab delimiter
df.write.csv(path="dbfs:/FileStore/WriteCSVFileWithOptions/", header=True, sep="\t")
### OR
# df.write.format("csv").option("header", "True").option("sep", "\t").save("dbfs:/FileStore/WriteCSVFileWithformat/")
### OR
# df.write.mode("overwrite").options(header=True, sep="\t").csv("dbfs:/FileStore/WriteCSVFileWithOptions/")


