PySpark: Read CSV file into a DataFrame

CSV files are widely favored as source files in data engineering, primarily because they support multiple delimiters and, most importantly, are easily readable by humans.

PySpark offers two convenient methods, csv() and format().load(), both available on the DataFrameReader class, for reading CSV files. In this article, we will explore how to read single and multiple CSV files, and delve into various options that enhance and simplify the CSV reading process.

Using csv() method: 

df = spark.read.csv("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

Using format().load() method:

df = spark.read.format("csv").load("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

In the methods above, we used csv() and format().load() without specifying any of the options supported by the CSV data source. Consequently, these methods read all the rows in the file as data, including the row of column names, and assign default names like '_c0', '_c1', '_c2', and so on to the columns. By default, the data type of all these columns is also set to STRING.
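
You can confirm this default behavior with the DataFrame's printSchema() method. The output shown in the comments mirrors the behavior just described; the exact number of columns depends on your file:

df = spark.read.csv("dbfs:/FileStore/Olympic_File_1.csv")
df.printSchema()
# root
#  |-- _c0: string (nullable = true)
#  |-- _c1: string (nullable = true)
#  ... and so on for the remaining columns, all typed as string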

To overcome this behavior, these methods offer options tailored to the data source. Let's explore a few of the most commonly used options for CSV files.

  • header: By default, this property is set to False, which means the first row is treated as data and not as column names. To treat the first row as a header, set this property to True.
  • sep: This is the delimiter value by which each column is separated.
  • inferSchema: By default, this is set to False, which means the data type of every column is set to STRING. To let PySpark automatically determine the data types by scanning the dataset, set this property to True. However, inferSchema is not recommended for large datasets, because PySpark has to read the data twice: once to infer the data types of all the columns and again to load the DataFrame. For larger datasets, consider the schema option below.
  • schema: Using this option, you can specify the schema, including column names and their data types, for the dataset you are about to read, as sketched right after this list.
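
Here is a minimal sketch of the schema option. The column names and types below are hypothetical, since the actual columns of the Olympic file are not shown in this post; replace them with your file's real structure:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema -- adjust the names and types to match your CSV file
olympic_schema = StructType([
    StructField("athlete", StringType(), True),
    StructField("country", StringType(), True),
    StructField("medals", IntegerType(), True),
])

df = spark.read.csv("dbfs:/FileStore/Olympic_File_1.csv", header=True, schema=olympic_schema)
display(df)

Because the schema is supplied up front, PySpark skips the extra pass over the data that inferSchema would otherwise require.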
The full list of options supported by the CSV data source is available in the Databricks documentation.

Let's see how these options work with a few examples.

Using csv() method with Options:

df = spark.read.csv(path="dbfs:/FileStore/Olympic_File_1.csv", header=True, sep=',', inferSchema=True)
display(df)

Using format().load() method with Options:

df = spark.read.format("csv").load(path="dbfs:/FileStore/Olympic_File_1.csv", header=True, sep=',', inferSchema=True)
display(df)

By now, you should have a good understanding of how to read CSV files. Besides passing options as method arguments, as shown above, there are two more ways to specify them: the option() and options() methods. Let's see these with examples.

Using the option() method:
This method sets one property at a time, but you can set multiple properties by chaining several option() calls together with periods ('.').

df = spark.read.option("header","true")\
               .option("inferSchema","true")\
               .option("sep",",")\
               .csv("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

df = spark.read.format("csv").option("header","true")\
                             .option("inferSchema","true")\
                             .option("sep",",")\
                             .load("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

Using options():
This method lets you set multiple properties at once, passed as comma-separated keyword arguments.

df = spark.read.options(header=True, sep=",", inferSchema=True).csv("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

df = spark.read.format("csv").options(header=True, sep=",", inferSchema=True).load("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

Reading Multiple CSV Files:

The csv() and format().load() methods support reading multiple CSV files at once. To read multiple files, simply pass the full path of each file as a comma-separated list, as shown below.

df = spark.read.options(header=True, sep=",", inferSchema=True).csv(["dbfs:/FileStore/Olympic_File_1.csv", "dbfs:/FileStore/Olympic_File_2.csv"])
display(df)

To read all the CSV files in a directory:

df = spark.read.format("csv").options(header=True, sep=",", inferSchema=True).load("dbfs:/FileStore/*.csv")
display(df)

In this case, if you only specify a directory (or a pattern like *.csv), the DataFrame reader reads only the files within that specific folder, not those in its subfolders. To read files in subfolders as well, you can use additional asterisk (*) wildcards in the path, one per directory level, as sketched below.
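
As a minimal sketch, assuming the CSV files sit in immediate subfolders of dbfs:/FileStore/ (this folder layout is hypothetical), one asterisk per directory level does the trick:

# */*.csv matches CSV files one level down, inside the immediate subfolders
df = spark.read.options(header=True, sep=",", inferSchema=True).csv("dbfs:/FileStore/*/*.csv")
display(df)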

Conclusion: 

In this blog, you have learned how to read CSV files using various methods. It's important to note that you can choose any of the methods mentioned above; they all function in the same way, with no discernible differences in performance or execution.

Note: When using the format().load() method, you can specify data sources using their fully qualified names (e.g., org.apache.spark.sql.csv) or their built-in short names (e.g., csv). Both options are available for your convenience.

Please find below the complete code used in this blog.

df = spark.read.csv("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

df = spark.read.format("csv").load("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

df = spark.read.csv(path="dbfs:/FileStore/Olympic_File_1.csv", header=True, sep=',', inferSchema=True)
display(df)

df = spark.read.format("csv").load(path="dbfs:/FileStore/Olympic_File_1.csv", header=True, sep=',', inferSchema=True)
display(df)

df = spark.read.option("header","true")\
               .option("inferSchema","true")\
               .option("sep",",")\
               .csv("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

df = spark.read.format("csv").option("header","true")\
                             .option("inferSchema","true")\
                             .option("sep",",")\
                             .load("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

df = spark.read.options(header=True, sep=",", inferSchema=True).csv("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

df = spark.read.format("csv").options(header=True, sep=",", inferSchema=True).load("dbfs:/FileStore/Olympic_File_1.csv")
display(df)

df = spark.read.options(header=True, sep=",", inferSchema=True).csv(["dbfs:/FileStore/Olympic_File_1.csv", "dbfs:/FileStore/Olympic_File_2.csv"])
display(df)

df = spark.read.format("csv").options(header=True, sep=",", inferSchema=True).load("dbfs:/FileStore/*.csv")
display(df)

