
Common Spark Actions

Last edited: August 8, 2025
  • collect(): return every element of the RDD to the driver
  • count(): get a count of the elements in the RDD
  • countByValue(): count how many times each value appears
  • reduce(func): the reduce part of MapReduce; combine all elements into one value using func
  • first(), take(n): return the first element, or the first n elements
  • top(n): return the highest n values in the RDD
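
As a rough illustration of what each action returns, here is a minimal plain-Python sketch. It does not use Spark: the list `data` is a hypothetical stand-in for the contents of an RDD, and each line mirrors the result of the action named in its comment rather than calling the real API.

```python
from collections import Counter
from functools import reduce
import heapq

# Hypothetical stand-in for the contents of an RDD (not a real RDD).
data = [3, 1, 4, 1, 5, 9, 2, 6]

collected = list(data)                     # collect(): every element
count = len(data)                          # count(): number of elements
by_value = dict(Counter(data))             # countByValue(): value -> occurrences
total = reduce(lambda a, b: a + b, data)   # reduce(func): fold elements pairwise
first = data[0]                            # first(): the first element
taken = data[:3]                           # take(3): the first three elements
top3 = heapq.nlargest(3, data)             # top(3): three largest, descending
```

In Spark itself these results arrive at the driver program, which is why collect() on a very large RDD is usually avoided.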

Common Spark Transformations

Last edited: August 8, 2025
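  • map(func): apply a function to every element
  • filter(func): keep only the elements for which the function returns true
  • flatMap(func): apply a function that returns a list per element, then flatten the returned lists into one giant list
  • union(rdd): create a union of two RDDs
  • subtract(rdd): remove the elements that also appear in the other RDD
  • cartesian(rdd): Cartesian product with another RDD
  • parallelize(list): make an RDD from a list (called on the SparkContext rather than on an RDD)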
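
The transformations above can be sketched in plain Python to show what each one produces. This is an analogue under stated assumptions, not Spark code: the lists `a` and `b` are hypothetical stand-ins for two RDDs (the role parallelize plays in Spark), and the results here are computed eagerly, whereas Spark transformations are lazy.

```python
from itertools import product

# Hypothetical stand-ins for two RDDs.
a = [1, 2, 3]
b = [3, 4]

mapped = [x * 10 for x in a]                    # map: one output per input
filtered = [x for x in a if x % 2 == 1]         # filter: keep odd values
flat = [y for x in a for y in [x, -x]]          # flatMap: lists flattened into one
united = a + b                                  # union: duplicates are kept
subtracted = [x for x in a if x not in set(b)]  # subtract: drop values found in b
pairs = list(product(a, b))                     # cartesian: every (a, b) pair
```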

Special transformations for Pair RDDs

  • reduceByKey(func): merge the values for each key using func
  • groupByKey(): group the values for each key (takes no combining function)
  • sortByKey(): sort the pairs by key
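
A minimal plain-Python sketch of the three pair operations follows. The helper names `reduce_by_key`, `group_by_key`, and `sort_by_key` are hypothetical and only mimic the behavior on a list of (key, value) tuples; they are not the Spark API.

```python
from collections import defaultdict

# Hypothetical stand-in for a pair RDD: a list of (key, value) tuples.
pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

def reduce_by_key(func, kv_pairs):
    """Merge all values that share a key using func."""
    merged = {}
    for key, value in kv_pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def group_by_key(kv_pairs):
    """Collect all values that share a key into a list."""
    groups = defaultdict(list)
    for key, value in kv_pairs:
        groups[key].append(value)
    return dict(groups)

def sort_by_key(kv_pairs, ascending=True):
    """Order the pairs by key; ties keep their original order."""
    return sorted(kv_pairs, key=lambda kv: kv[0], reverse=not ascending)

summed = reduce_by_key(lambda a, b: a + b, pairs)
grouped = group_by_key(pairs)
ordered = sort_by_key(pairs)
```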

See also Database “Join”

Communication Complexity (Chapter)

Last edited: August 8, 2025

Communication Complexity tries to model one aspect of distributed computing: how much the parties must communicate to compute a function whose inputs are split between them.

See Communication Complexity

Let us consider two parties—Alice and Bob. They want to compute some function:

\begin{equation} f: \qty{0,1}^{*} \times \qty{0,1}^{ *} \to \qty{0,1} \end{equation}

on two inputs held by Alice and Bob respectively, \(x \in \qty{0,1}^{*}\) and \(y \in \qty{0,1}^{*}\), where \(|x| = |y| = n\) and \(n\) is very large (i.e. just sending all of \(x\) over to the other party and computing \(f\) on one end isn’t good).
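
To make the cost concrete, here is a small sketch that counts bits exchanged. It assumes inputs are given as Python strings of '0' and '1' characters; the function names are hypothetical. `trivial_protocol` is the baseline described above (send all of the input, \(n + 1\) bits in total), and `parity_protocol` shows one function, the XOR of every input bit, for which two bits are enough.

```python
def trivial_protocol(f, x, y):
    """Alice sends all of x; Bob evaluates f and sends back the 1-bit answer."""
    bits_sent = len(x)      # Alice -> Bob: n bits
    answer = f(x, y)        # Bob computes locally, no communication
    bits_sent += 1          # Bob -> Alice: 1 bit
    return answer, bits_sent

def parity(x, y):
    """f(x, y) = XOR of every bit of x and y."""
    return (x.count("1") + y.count("1")) % 2

def parity_protocol(x, y):
    """Alice sends only the parity of x; Bob combines it with the parity of y."""
    alice_bit = x.count("1") % 2              # Alice -> Bob: 1 bit
    answer = alice_bit ^ (y.count("1") % 2)   # Bob -> Alice: 1 bit
    return answer, 2

x, y = "1011", "0110"
trivial_result = trivial_protocol(parity, x, y)
cheap_result = parity_protocol(x, y)
```

Both protocols return the same answer here, but the second one's cost does not grow with \(n\).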

commutativity

Last edited: August 8, 2025

commutativity means that the operands of an operation can be reordered without changing the result.

That is:

\begin{equation} ABC = ACB \end{equation}
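
A quick way to see which operations have this property is to swap the operands and compare. The helper `commutes` below is a hypothetical name, and it only tests one pair of values, so a True result is evidence rather than a proof.

```python
import operator

def commutes(op, a, b):
    """Check whether op gives the same result with its operands swapped."""
    return op(a, b) == op(b, a)

add_ok = commutes(operator.add, 2, 5)          # 2 + 5 == 5 + 2
mul_ok = commutes(operator.mul, 2, 5)          # 2 * 5 == 5 * 2
sub_ok = commutes(operator.sub, 2, 5)          # 2 - 5 != 5 - 2
concat_ok = commutes(operator.add, "ab", "cd") # "abcd" != "cdab"
```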

comparison function

Last edited: August 8, 2025
  1. return < 0 if the first value should come before the second value
  2. return > 0 if the first value should come after the second value
  3. return 0 if the first and second values are equivalent
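
As one example of this contract, the sketch below sorts words by length in Python, where `functools.cmp_to_key` adapts a comparison function for use with `sorted`. The function `compare_by_length` and the word list are made up for illustration.

```python
from functools import cmp_to_key

def compare_by_length(first, second):
    """Negative if first is shorter, positive if longer, 0 if equal length."""
    return len(first) - len(second)

words = ["pear", "fig", "banana", "kiwi"]

# sorted() is stable, so words of equal length keep their original order.
ordered = sorted(words, key=cmp_to_key(compare_by_length))
```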