PySpark: Filtering Array Columns


2024-01-11

Gabriel Montpetit

I had been looking online for a way to filter PySpark array columns based on other columns, but I could only find examples where the filter used a static condition.

Here is how to filter array columns while including other columns in the predicate.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# start a local Spark session with two worker threads
spark = (
    SparkSession
    .builder
    .appName("filtering_numbers")
    .master("local[2]")
    .getOrCreate()
)

Create the dataframe

# create initial dataframe
df = spark.createDataFrame([
    {"name": "Haskell", "preferred_number": 5, "available_numbers": [1, 6, 100, 33, 44, 99]},
    {"name": "Odersky", "preferred_number": 2, "available_numbers": [1, 6, 100, 33, 44, 99]},
])
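
For reference, calling df.show(truncate=False) on this dataframe should print something like the table below; createDataFrame infers the schema from the dicts, so the columns come out in alphabetical order.

+-----------------------+-------+----------------+
|available_numbers      |name   |preferred_number|
+-----------------------+-------+----------------+
|[1, 6, 100, 33, 44, 99]|Haskell|5               |
|[1, 6, 100, 33, 44, 99]|Odersky|2               |
+-----------------------+-------+----------------+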

Apply the filtering

In this example, we want to filter available_numbers with a predicate that checks whether each element is evenly divisible by the row's preferred_number, i.e. available_number % preferred_number == 0.

# filter the available numbers based on the person's preferred number
(
    df
    .withColumn(
        "modulo_filter",
        F.filter("available_numbers", lambda x: x % F.col("preferred_number") == 0)
    )
    .show(truncate=False)
)
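
A quick note on versions: pyspark.sql.functions.filter was only added in Spark 3.1. On older releases (Spark 2.4 and up), the same dynamic predicate can be expressed with the SQL higher-order function syntax through F.expr; here is a minimal sketch of that variant:

# same filter expressed through the SQL higher-order function syntax,
# which has been available since Spark 2.4
(
    df
    .withColumn(
        "modulo_filter",
        F.expr("filter(available_numbers, x -> x % preferred_number = 0)")
    )
    .show(truncate=False)
)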

Results
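
Haskell's preferred number is 5, so only 100 survives the filter, while Odersky's preferred number of 2 keeps every even value. Either snippet above should print something like:

+-----------------------+-------+----------------+-------------+
|available_numbers      |name   |preferred_number|modulo_filter|
+-----------------------+-------+----------------+-------------+
|[1, 6, 100, 33, 44, 99]|Haskell|5               |[100]        |
|[1, 6, 100, 33, 44, 99]|Odersky|2               |[6, 100, 44] |
+-----------------------+-------+----------------+-------------+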