The select() function in PySpark is a powerful method of the pyspark.sql.DataFrame class. It is used to extract specific columns from a DataFrame and create a new DataFrame with just those columns. Whether you need one column, multiple columns, or even all columns, select() has you covered. Additionally, you can use this function to create new columns with calculated or derived values.
A key characteristic of select() is that it is a transformation. This means it doesn't run immediately but waits until an action, such as show() or write(), is called. This lazy execution helps optimize performance, especially when working with large datasets.
In summary, select() is a simple yet versatile method that helps refine your data for analysis or further processing.
Now, let’s explore various ways to use the select() function to display or create a new DataFrame with specific columns.
To demonstrate this, we will first create a simple DataFrame as our starting point.
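The snippet below builds a small two-column DataFrame of Ids and Skills, assuming an active SparkSession named spark (as in a Databricks or PySpark notebook):
df = spark.createDataFrame([(1, "DataBricks"), (2, "Azure Synapse"), (3, "PySpark")], ("Id", "Skills"))
df.show()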
Selecting All Columns:
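Both of the following return every column from the sample DataFrame, either via the "*" wildcard or by passing the full column list:
df.select("*").show()
df.select([i for i in df.columns]).show()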
Selecting a Single Column:
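A single column can be referenced in several equivalent ways, using the column name, the col() function, or attribute and bracket notation:
from pyspark.sql.functions import col
df.select("Skills").show()
df.select(col("Skills")).show()
df.select(df.Skills).show()
df.select(df["Skills"]).show()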
Selecting Multiple Columns:
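The same styles work for multiple columns; just pass each column as a separate argument:
df.select("Id", "Skills").show()
df.select(col("Id"), col("Skills")).show()
df.select(df.Id, df.Skills).show()
df.select(df["Id"], df["Skills"]).show()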
Selecting with Column Index (indexing starts at 0):
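Since df.columns is a plain Python list, you can slice it by position. For example, this selects only the first column:
df.select(df.columns[0:1]).show()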
Selecting with a Column List:
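You can also collect the column names into a list and pass the list directly to select():
columnNames = ["Id", "Skills"]
df.select(columnNames).show()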
Selecting Based on a Pattern:
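A list comprehension over df.columns lets you keep only the names that match a pattern, such as columns ending with "s" or starting with "I":
df.select([i for i in df.columns if i.endswith("s")]).show()
df.select([i for i in df.columns if i.startswith("I")]).show()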
Selecting with SQL Expressions:
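The expr() function accepts SQL fragments, which is handy for deriving new columns inside select():
from pyspark.sql.functions import expr
df.select("Id", expr("case when Id > 1 then 11 else Id end as `Derived Column`"), "Skills").show()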
Selecting in Nested DataFrames:
DataFrames are not always limited to simple rows and columns. In some cases, they may contain nested structures, where columns hold complex data such as arrays, maps, or structs. It's essential to understand how to navigate and explore these nested DataFrames effectively.
Let's dive into some practical examples to see how to handle and work with nested DataFrames. For this, let's first create a nested DataFrame, as shown below.
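The snippet below defines a schema with a nested Name struct and builds the DataFrame from it, reusing the same SparkSession as before:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data = [(("abc", None, "Alex"), 32, "US"), (("def", "Lenon", None), 49, "UK")]
schema = StructType([
    StructField("Name", StructType([
        StructField("FName", StringType()),
        StructField("MiddleName", StringType()),
        StructField("LName", StringType())
    ])),
    StructField("Age", IntegerType()),
    StructField("Country", StringType())
])
df = spark.createDataFrame(data, schema)
df.printSchema()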
Now let's see how to retrieve and display data from nested DataFrames.
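You can expand the nested struct with the .* wildcard, or address each nested field with dot notation:
df.select("Name.*", "Age", "Country").show()
df.select("Name.FName", "Name.MiddleName", "Name.LName", "Age", "Country").show()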
Complete Code Example
For your convenience, here's the complete code used in this blog.
# Create the sample DataFrame
df = spark.createDataFrame([(1, "DataBricks"), (2, "Azure Synapse"), (3, "PySpark")], ("Id", "Skills"))
df.show()
#### Selecting All Columns ####
# Using the "*" wildcard
df.select("*").show()
# Using a list comprehension over df.columns
df.select([i for i in df.columns]).show()
#### Selecting a Single Column ####
from pyspark.sql.functions import col
# Using the column name
df.select("Skills").show()
# Using the col() function
df.select(col("Skills")).show()
# Using attribute and bracket notation
df.select(df.Skills).show()
df.select(df["Skills"]).show()
#### Selecting Multiple Columns ####
# Using column names
df.select("Id", "Skills").show()
# Using the col() function
df.select(col("Id"), col("Skills")).show()
# Using attribute and bracket notation
df.select(df.Id, df.Skills).show()
df.select(df["Id"], df["Skills"]).show()
# Selecting columns by index (slicing df.columns; indexing starts at 0)
df.select(df.columns[0:1]).show()
# Passing the column names as a list
columnNames = ["Id", "Skills"]
df.select(columnNames).show()
# Selecting columns whose names match a pattern
df.select([i for i in df.columns if i.endswith("s")]).show()
df.select([i for i in df.columns if i.startswith("I")]).show()
# Selecting using SQL expressions
from pyspark.sql.functions import expr
df.select("Id", expr("case when Id > 1 then 11 else Id end as `Derived Column`"), "Skills").show()
#### Selecting Nested Columns ####
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data = [(("abc", None, "Alex"), 32, "US"), (("def", "Lenon", None), 49, "UK")]
schema = StructType([
    StructField("Name", StructType([
        StructField("FName", StringType()),
        StructField("MiddleName", StringType()),
        StructField("LName", StringType())
    ])),
    StructField("Age", IntegerType()),
    StructField("Country", StringType())
])
df = spark.createDataFrame(data, schema)
display(df)  # display() is available in Databricks notebooks; use df.show() elsewhere
df.printSchema()
# Expanding the struct with the .* wildcard
display(df.select("Name.*", "Age", "Country"))
# OR addressing each nested field individually
display(df.select("Name.FName", "Name.MiddleName", "Name.LName", "Age", "Country"))