Beginning Apache Spark 3 Pdf -
from pyspark.sql.functions import udf def squared(x): return x * x
Example:
from pyspark.sql import SparkSession spark = SparkSession.builder .appName("MyApp") .config("spark.sql.adaptive.enabled", "true") .getOrCreate() 3.1 RDD – The Original Foundation RDDs (Resilient Distributed Datasets) are low‑level, immutable, partitioned collections. They provide fault tolerance via lineage. However, they are not recommended for new projects because they lack optimization. beginning apache spark 3 pdf
squared_udf = udf(squared, IntegerType()) df.withColumn("squared_val", squared_udf(df.value)) from pyspark
General rule: 2–3 tasks per CPU core.