PySpark: Filter & Where

filter and where are PySpark DataFrame methods used to filter rows based on specified conditions. Both map to transformation operations in PySpark. Importantly, where is an alias for filter, so there is no difference between the two functions.

PySpark provides the where function to cater to users who are more familiar with SQL, where the WHERE clause is the standard way to filter non-aggregated rows from a table.

Check the images below to gain a better understanding of these functions.
Since where is an alias for filter, the notebook does not highlight it as a distinct method the way filter is highlighted.

Please find below the complete code used in this article.

 
# DataFrame creation
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(1001, "India", "IN"),
        (1002, "United States", "USA"),
        (1003, "Canada", "CAN"),
        (1004, "Britain", "UK")]
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Country Name", StringType(), True),
                     StructField("Country Code", StringType(), True)])

CountryDF = spark.createDataFrame(data=data, schema=schema)
display(CountryDF)  # display() and spark are available in Databricks notebooks

# Databricks documentation help
help(CountryDF.filter)

# Display data using the filter method
display(CountryDF.select("*").filter(col("Country Code") == "IN"))

# Display data using the where method
display(CountryDF.select("*").where(col("Country Code") == "IN"))

 



