The select() function in PySpark is a powerful method of the pyspark.sql.DataFrame class. It is used to extract specific columns from a DataFrame and create a new DataFrame with just those columns. Whether you need one column, multiple columns, or even all columns, select() has you covered. Additionally, you can use this function to create new columns with calculated or derived values.
A key characteristic of select() is that it is a transformation. This means it doesn't run immediately but waits until an action, such as show() or write(), is called. This lazy execution helps optimize performance, especially when working with large datasets.
In summary, select() is a simple yet versatile method that helps refine your data for analysis or further processing.
Now, let’s explore various ways to use the select() function to display or create a new DataFrame with specific columns.
To demonstrate this, we will first create a simple DataFrame as our starting point.
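The snippet below builds a small two-column DataFrame of Ids and Skills, assuming an active SparkSession named spark (as in a Databricks or PySpark notebook):
df = spark.createDataFrame([(1, "DataBricks"), (2, "Azure Synapse"), (3, "PySpark")], ("Id", "Skills"))
df.show()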
Selecting All Columns:
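Both of the following return every column from the sample DataFrame, either via the "*" wildcard or by passing the full column list:
df.select("*").show()
df.select([i for i in df.columns]).show()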
Selecting a Single Column:
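A single column can be referenced in several equivalent ways, using the column name, the col() function, or attribute and bracket notation:
from pyspark.sql.functions import col
df.select("Skills").show()
df.select(col("Skills")).show()
df.select(df.Skills).show()
df.select(df["Skills"]).show()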
Selecting Multiple Columns:
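The same styles work for multiple columns; just pass each column as a separate argument:
df.select("Id", "Skills").show()
df.select(col("Id"), col("Skills")).show()
df.select(df.Id, df.Skills).show()
df.select(df["Id"], df["Skills"]).show()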
Selecting with Column Index (indexing starts at 0):
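Since df.columns is a plain Python list, you can slice it by position. For example, this selects only the first column:
df.select(df.columns[0:1]).show()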
Selecting with a Column List:
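You can also collect the column names into a list and pass the list directly to select():
columnNames = ["Id", "Skills"]
df.select(columnNames).show()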
Selecting Based on a Pattern:
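A list comprehension over df.columns lets you keep only the names that match a pattern, such as columns ending with "s" or starting with "I":
df.select([i for i in df.columns if i.endswith("s")]).show()
df.select([i for i in df.columns if i.startswith("I")]).show()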
Selecting with SQL Expressions:
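The expr() function accepts SQL fragments, which is handy for deriving new columns inside select():
from pyspark.sql.functions import expr
df.select("Id", expr("case when Id > 1 then 11 else Id end as `Derived Column`"), "Skills").show()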
Selecting in Nested DataFrames:
DataFrames are not always limited to simple rows and columns. In some cases, they may contain nested structures, where columns hold complex data such as arrays, maps, or structs. It's essential to understand how to navigate and explore these nested DataFrames effectively.
Let's dive into some practical examples to see how to handle and work with nested DataFrames. For this, let's first create a nested DataFrame, as shown below.
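The snippet below defines a schema with a nested Name struct and builds the DataFrame from it, reusing the same SparkSession as before:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data = [(("abc", None, "Alex"), 32, "US"), (("def", "Lenon", None), 49, "UK")]
schema = StructType([
    StructField("Name", StructType([
        StructField("FName", StringType()),
        StructField("MiddleName", StringType()),
        StructField("LName", StringType())
    ])),
    StructField("Age", IntegerType()),
    StructField("Country", StringType())
])
df = spark.createDataFrame(data, schema)
df.printSchema()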
Now let's see how to retrieve and display data from nested DataFrames.
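You can expand the nested struct with the .* wildcard, or address each nested field with dot notation:
df.select("Name.*", "Age", "Country").show()
df.select("Name.FName", "Name.MiddleName", "Name.LName", "Age", "Country").show()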
Complete Code Example
For your convenience, here's the complete code used in this blog.
# Create the sample DataFrame
df = spark.createDataFrame([(1, "DataBricks"), (2, "Azure Synapse"), (3, "PySpark")], ("Id", "Skills"))
df.show()
#### Selecting All Columns ####
# Using the "*" wildcard
df.select("*").show()
# Using a list comprehension over df.columns
df.select([i for i in df.columns]).show()
#### Selecting a Single Column ####
from pyspark.sql.functions import col
# Using the column name
df.select("Skills").show()
# Using the col() function
df.select(col("Skills")).show()
# Using attribute and bracket notation
df.select(df.Skills).show()
df.select(df["Skills"]).show()
#### Selecting Multiple Columns ####
# Using column names
df.select("Id", "Skills").show()
# Using the col() function
df.select(col("Id"), col("Skills")).show()
# Using attribute and bracket notation
df.select(df.Id, df.Skills).show()
df.select(df["Id"], df["Skills"]).show()
# Selecting columns by index (slicing df.columns; indexing starts at 0)
df.select(df.columns[0:1]).show()
# Passing the column names as a list
columnNames = ["Id", "Skills"]
df.select(columnNames).show()
# Selecting columns whose names match a pattern
df.select([i for i in df.columns if i.endswith("s")]).show()
df.select([i for i in df.columns if i.startswith("I")]).show()
# Selecting using SQL expressions
from pyspark.sql.functions import expr
df.select("Id", expr("case when Id > 1 then 11 else Id end as `Derived Column`"), "Skills").show()
#### Selecting Nested Columns ####
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data = [(("abc", None, "Alex"), 32, "US"), (("def", "Lenon", None), 49, "UK")]
schema = StructType([
    StructField("Name", StructType([
        StructField("FName", StringType()),
        StructField("MiddleName", StringType()),
        StructField("LName", StringType())
    ])),
    StructField("Age", IntegerType()),
    StructField("Country", StringType())
])
df = spark.createDataFrame(data, schema)
display(df)  # display() is available in Databricks notebooks; use df.show() elsewhere
df.printSchema()
# Expanding the struct with the .* wildcard
display(df.select("Name.*", "Age", "Country"))
# OR addressing each nested field individually
display(df.select("Name.FName", "Name.MiddleName", "Name.LName", "Age", "Country"))