Spark is a lightning-fast cluster computing technology. Its in-memory cluster computing model increases the processing speed of an application. The Resilient Distributed Dataset (RDD) is Spark's fundamental data structure, and an RDD can be created in two ways: by referencing a dataset in external storage, or by applying transformations to an existing RDD. Actions are performed when we want to work with the actual dataset. Remember that transformations are lazy: nothing is executed until an action is triggered.
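Since running Spark itself requires a JVM-backed cluster, the lazy-evaluation idea can be sketched with plain Python generators. This is an analogy, not Spark code: a chain of generator expressions describes work (like transformations) but executes nothing until a terminal operation consumes it (like an action). The `traced` helper below is illustrative, used only to record when elements are actually processed.

```python
# Plain-Python analogy for Spark's lazy evaluation: generator
# expressions describe a pipeline but execute nothing by themselves.
data = [1, 2, 3, 4, 5]

log = []
def traced(x):
    log.append(x)          # record when an element is actually processed
    return x * 2

doubled = (traced(x) for x in data)    # "transformation": nothing runs yet
evens = (x for x in doubled if x > 4)  # still nothing has run

assert log == []                       # the pipeline is only a description

result = list(evens)                   # "action": triggers the whole chain
```

Only when `list(evens)` runs does `traced` see any element, mirroring how an RDD lineage is evaluated only when an action such as `collect` is called.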
RDD Transformations

| Transformation | Description |
| --- | --- |
| map | Return a new dataset by applying a function to each element. |
| filter | Keep only the elements that satisfy a predicate. |
| flatMap | Like map, but the function returns a sequence of elements instead of a single value. |
| sample | Sample a fraction of the data. |
| union | Union of two datasets. |
| intersection | Intersection of two datasets. |
| distinct | Distinct elements of the dataset. |
| groupByKey | Group the values for each key in a dataset of (K, V) pairs. |
| reduceByKey | Aggregate the values for each key in a dataset of (K, V) pairs. |
| sortByKey | Sort a dataset of (K, V) pairs by key. |
| join | Join two datasets of (K, V) and (K, W) pairs on their keys. |
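A Spark cluster is needed to run these transformations for real, but their per-element semantics can be sketched in plain Python. The word-count example below is an analogy (lists and a dict stand in for RDDs; the names `lines`, `pairs`, and `counts` are illustrative); in PySpark the same pipeline would be roughly `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`.

```python
# Plain-Python sketch of flatMap + map + reduceByKey semantics (word count).
lines = ["to be or", "not to be"]  # stands in for an RDD of lines

# flatMap: each line yields several words, flattened into one sequence
words = [w for line in lines for w in line.split()]

# map: turn each word into a (key, value) pair
pairs = [(w, 1) for w in words]

# reduceByKey: combine the values for each key with a function (here, +)
counts = {}
for key, value in pairs:
    counts[key] = counts.get(key, 0) + value
```

The `reduceByKey` step combines values per key with an associative function, which is why Spark can apply it in parallel across partitions before a final merge.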
RDD Actions
| Action | Description |
| --- | --- |
| reduce | Aggregate the elements of the dataset with a function. |
| collect | Return all elements of the dataset as an array at the driver. |
| count | Return the number of elements in the dataset. |
| countByKey | Return a count of the values for each key (for datasets of (K, V) pairs). |
| take | Return the first n elements of the dataset. |
| saveAsTextFile | Write the elements of the dataset to a text file. |
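The actions above are what actually produce values at the driver. Their behavior can also be sketched in plain Python (again an analogy, with a list standing in for an RDD; the variable names are illustrative):

```python
from functools import reduce

# Plain-Python sketch of the reduce, count, and take actions.
numbers = [1, 2, 3, 4, 5]  # stands in for an RDD of numbers

total = reduce(lambda a, b: a + b, numbers)  # reduce: fold with a function
n = len(numbers)                             # count: number of elements
first_three = numbers[:3]                    # take(3): first n elements
```

As with `reduceByKey`, the function passed to `reduce` should be associative (and commutative) so Spark can compute partial results per partition and combine them.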