PySpark is a powerful tool for large-scale data processing. However, column names that contain spaces can cause syntax errors and other complications if they are not handled correctly when the DataFrame is created. Removing these spaces up front avoids those errors and complications.
To demonstrate the process, we use an Olympic public data source.
There are several approaches for removing these spaces; the most commonly used methods are discussed here.
Using withColumnRenamed:
- DataFrames are immutable, so we can't rename a column in place. Instead, we create a new DataFrame with the renamed columns.
- This is the most commonly used function for renaming columns; it takes two arguments: the existing column name and the new column name (see the sketch below).
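For example, a minimal sketch, assuming a small hypothetical DataFrame df whose second column name contains a space (the full Olympic code appears at the end of this post):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample rows; only the column names matter for this example
df = spark.createDataFrame([("A", 1), ("B", 2)], ["Athlete", "Gold Medals"])

# withColumnRenamed returns a new DataFrame; df itself is left unchanged
df_renamed = df.withColumnRenamed("Gold Medals", "Gold_Medals")
df_renamed.printSchema()  # Gold_Medals now has no space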
Using SELECT:
- The select function can also rename columns: each column that contains a space is wrapped in col() and given a space-free name with alias(), and the result is a new DataFrame (see the sketch below).
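A minimal sketch of the same idea, assuming a DataFrame df that has columns "Athlete" and "Gold Medals":
from pyspark.sql.functions import col

# Unaffected columns are selected as-is; the column with a space is aliased
df_selected = df.select(col("Athlete"), col("Gold Medals").alias("Gold_Medals"))
df_selected.printSchema()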
Using toDF:
- The toDF function is primarily used to create a DataFrame from an RDD, but it can also rename all of the columns of an existing Spark DataFrame in one call by passing a complete list of new names (see the sketch below).
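A minimal sketch, assuming a hypothetical two-column DataFrame df with columns "Athlete" and "Gold Medals":
# toDF takes one new name per existing column, in the original column order
df_todf = df.toDF("Athlete", "Gold_Medals")
df_todf.printSchema()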
Using List Comprehension or a Dynamic Method to Replace Spaces:
- Instead of listing every column by hand, a list comprehension can build a col(...).alias(...) expression for every column in df.columns, replacing each space with an underscore; this scales to any number of columns (see the sketch below).
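A minimal sketch; because the names come from df.columns, the same two lines work for any DataFrame:
from pyspark.sql.functions import col

# Replace every space with an underscore in every column name
df_clean = df.select([col(c).alias(c.replace(" ", "_")) for c in df.columns])
df_clean.printSchema()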
Please check the complete code used in this blog:
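# Read the tab-delimited Olympic dataset from DBFS (header row, schema inferred)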
df = spark.read.format("csv")\
.option("header","true")\
.option("delimiter","\t")\
.option("inferSchema","true")\
.load("dbfs:/FileStore/Olympic.txt")
display(df)
df.printSchema()
df=df.withColumnRenamed("Closing Ceremony Date","Closing_Ceremony_Date")
df.printSchema()
df=df.withColumnRenamed("Gold Medals","Gold_Medals")\
.withColumnRenamed("Silver Medals","Silver_Medals")\
.withColumnRenamed("Bronze Medals","Bronze_Medals")\
.withColumnRenamed("Total Medals","Total_Medals")
df.printSchema()
# Method 2: select with alias - start again from the original df, which still has spaces
from pyspark.sql.functions import col
df2 = df.select(col("Athlete")\
,col("Age")\
,col("Country")\
,col("Year")\
,col("Closing Ceremony Date").alias("Closing_Ceremony_Date")\
,col("Sport")\
,col("Gold Medals").alias("Gold_Medals")\
,col("Silver Medals").alias("Silver_Medals")\
,col("Bronze Medals").alias("Bronze_Medals")\
,col("Total Medals").alias("Total_Medals"))
df2.printSchema()
# Method 3: toDF - pass one new name for every column, in the original order
NewColumnNames = ["Athlete","Age","Country","Year","Closing_Ceremony_Date","Sport","Gold_Medals","Silver_Medals","Bronze_Medals","Total_Medals"]
df3 = df.toDF(*NewColumnNames)
df3.printSchema()
# Method 4: dynamic rename - replace spaces in every column name with a list comprehension
from pyspark.sql.functions import col
df4 = df.select([col(c).alias(c.replace(" ","_")) for c in df.columns])
df4.printSchema()