Apache Spark Scala Interview Questions- Shyam Mallesh -

val rdd = sc.parallelize(1 to 4) rdd.map(x => x * 2) // 2,4,6,8 rdd.flatMap(x => 1 to x) // 1,1,2,1,2,3,1,2,3,4 rdd.mapPartitions(iter => iter.map(_ * 2)) // same as map but per partition Spark uses lineage (RDD dependency graph). Each RDD remembers how it was built from other datasets. If a partition is lost, Spark recomputes it using the lineage, not replication. However, you can also cache/persist with replication (e.g., StorageLevel.MEMORY_AND_DISK_2 ).

HereвЂ™s a curated set of , structured in the style of Shyam Mallesh (known for clear, practical, and depth-driven technical content). These range from beginner to advanced, covering RDDs, DataFrames, Spark SQL, optimizations, and internals. рџљЂ Apache Spark Scala Interview Questions вЂ“ By Shyam Mallesh вњ… 1. What are the differences between map , flatMap , and mapPartitions in Spark? | Transformation | Description | |----------------|-------------| | map | Applies a function to each element of an RDD/DataFrame and returns a new collection of same size. | | flatMap | Applies a function that returns a sequence (or Option) and flattens the result. Useful for one-to-many transformations. | | mapPartitions | Applies a function to each partition as an iterator. Avoids per-element function call overhead. Good for initialization (e.g., DB connections). | Apache Spark Scala Interview Questions- Shyam Mallesh

вљ пёЏ coalesce(1) avoids shuffle but may cause data skew. Only safe if current partitions are small. With schema inference (slow but automatic): val rdd = sc

val rdd = sc.textFile("data.txt") // nothing read yet val words = rdd.flatMap(_.split(" ")) // transformation val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // transformation counts.saveAsTextFile("output") // рџ”Ґ Action triggers job | Operation | Shuffle Behavior | Performance | |----------------|------------------|--------------| | groupByKey | Sends all values for a key across the network в†’ high shuffle I/O | Slower, risks OOM | | reduceByKey | Combines values locally (map-side reduce) before shuffle в†’ reduces data transfer | Faster, memory efficient | However, you can also cache/persist with replication (e