SparkContext.runJob

SparkContext.runJob(rdd, partitionFunc, partitions=None, allowLocal=False)
Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements.
If `partitions` is not specified, this will run over all partitions.
New in version 1.1.0.
Parameters
rdd : RDD
    target RDD to run tasks on
partitionFunc : function
    a function to run on each partition of the RDD
partitions : list of int, optional
    set of partitions to run on; some jobs may not want to compute on all
    partitions of the target RDD, e.g. for operations like first() (see the
    sketch below)
allowLocal : bool, default False
    this parameter has no effect
Returns
list
    results of the specified partitions
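The `partitions` argument makes it possible to evaluate only a subset of the RDD. A minimal sketch (assuming an active SparkContext `sc`), emulating a first()-style access that computes only partition 0:

>>> myRDD = sc.parallelize(range(100), 10)
>>> sc.runJob(myRDD, lambda part: list(part), partitions=[0])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]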
See also
SparkContext.cancelAllJobs()
Examples
>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part])
[0, 1, 4, 9, 16, 25]

>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part], [0, 2], True)
[0, 1, 16, 25]
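The examples above assume an interactive session where `sc` already exists. A self-contained sketch of the same calls (the local master and app name are illustrative choices, not part of the API):

from pyspark import SparkContext

sc = SparkContext("local[2]", "runJobExample")  # hypothetical local setup

myRDD = sc.parallelize(range(6), 3)  # partitions: [0, 1], [2, 3], [4, 5]

# All partitions: squares of every element
print(sc.runJob(myRDD, lambda part: [x * x for x in part]))
# [0, 1, 4, 9, 16, 25]

# Partitions 0 and 2 only; partition 1 ([2, 3]) is never computed
print(sc.runJob(myRDD, lambda part: [x * x for x in part], [0, 2]))
# [0, 1, 16, 25]

sc.stop()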