MLlib (DataFrame-based) for Spark Connect#
Warning
The namespace for this package may change in a future Spark version.
Pipeline APIs#
| Abstract class for transformers that transform one dataset into another. | |
| Abstract class for estimators that fit models to data. | |
| Abstract class for models that are fitted by estimators. | |
| Base class for evaluators that compute metrics from predictions. | |
| A simple pipeline, which acts as an estimator. | |
| Represents a compiled pipeline with transformers and fitted models. | |
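The relationship between these abstractions can be sketched in plain Python. This is only an illustration of the pattern described in the table, not the Spark Connect implementation: every class name below (`ToyTransformer`, `ToyEstimator`, `ToyPipeline`, and so on) is hypothetical, and a "dataset" is assumed to be a simple list of floats rather than a DataFrame.

```python
from typing import List, Sequence


class ToyTransformer:
    """Hypothetical stand-in: transforms one dataset into another."""

    def transform(self, data: Sequence[float]) -> List[float]:
        raise NotImplementedError


class ToyEstimator:
    """Hypothetical stand-in: fits a model (itself a transformer) to data."""

    def fit(self, data: Sequence[float]) -> ToyTransformer:
        raise NotImplementedError


class AddOne(ToyTransformer):
    """A stateless transformer that needs no fitting."""

    def transform(self, data):
        return [x + 1.0 for x in data]


class ToyMaxAbsModel(ToyTransformer):
    """A fitted model: remembers the statistic learned during fit()."""

    def __init__(self, max_abs: float):
        self.max_abs = max_abs

    def transform(self, data):
        return [x / self.max_abs for x in data]


class ToyMaxAbs(ToyEstimator):
    """An estimator whose fit() returns a ToyMaxAbsModel."""

    def fit(self, data):
        return ToyMaxAbsModel(max(abs(x) for x in data))


class ToyPipelineModel(ToyTransformer):
    """A compiled pipeline: only transformers and fitted models remain."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        out = list(data)
        for stage in self.stages:
            out = stage.transform(out)
        return out


class ToyPipeline(ToyEstimator):
    """A pipeline acts as an estimator: fit() fits each estimator stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        current = list(data)
        fitted = []
        for stage in self.stages:
            if isinstance(stage, ToyEstimator):
                stage = stage.fit(current)  # replace the estimator with its model
            fitted.append(stage)
            current = stage.transform(current)  # feed the next stage
        return ToyPipelineModel(fitted)


model = ToyPipeline([AddOne(), ToyMaxAbs()]).fit([1.0, 3.0])
result = model.transform([1.0, 3.0])
```

In this sketch, fitting replaces each estimator stage with the model it produces, so the returned object contains nothing that still needs training.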
Feature#
| Rescales each feature individually to the range [-1, 1] by dividing by the maximum absolute value of that feature. | |
| Model fitted by MaxAbsScaler. | |
| Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. | |
| Model fitted by StandardScaler. | |
| A feature transformer that merges multiple input columns into a single array-type column. | |
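The two scaling rules described above can be illustrated on a single feature column with a few lines of standard-library Python. This is a sketch of the arithmetic only, not of the Spark Connect classes: the helper names `max_abs_scale` and `standardize` are hypothetical, the use of the sample standard deviation (dividing by n - 1) is an assumption, and so is the choice of what to return for a constant column.

```python
import statistics


def max_abs_scale(column):
    """Divide every value by the column's maximum absolute value."""
    max_abs = max(abs(x) for x in column)
    if max_abs == 0:
        # Assumption: leave an all-zero column unchanged.
        return list(column)
    return [x / max_abs for x in column]


def standardize(column):
    """Subtract the column mean, then divide by the sample standard deviation."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)  # sample standard deviation (n - 1)
    if std == 0:
        # Assumption: map a constant column to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


scaled = max_abs_scale([-4.0, 0.0, 2.0])
standardized = standardize([1.0, 2.0, 3.0])
```

After max-abs scaling, every value lies in [-1, 1] and zeros stay zero. After standardization, the column has mean 0 and unit sample variance.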
Classification#
| Logistic regression estimator. | |
| Model fitted by LogisticRegression. | |
Functions#
| Converts a column of arrays of numeric type into a column of pyspark.ml.linalg.DenseVector instances. | |
| Converts a column of MLlib sparse/dense vectors into a column of dense arrays. | |
Tuning#
| K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. | |
| CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data. | |
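The fold arithmetic described above can be checked with a small standard-library sketch. It is not how Spark partitions a DataFrame: the helper `k_fold_splits` is hypothetical, rows are assumed to fit in a Python list, and the seeded shuffle is just one way to obtain random, non-overlapping folds.

```python
import random


def k_fold_splits(rows, k, seed=0):
    """Return k (training, test) pairs built from non-overlapping random folds."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # random partitioning, reproducible
    # Deal the shuffled indices into k folds so no index appears twice.
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test_idx = set(folds[i])
        train = [rows[j] for j in indices if j not in test_idx]
        test = [rows[j] for j in folds[i]]
        pairs.append((train, test))
    return pairs


rows = list(range(9))
pairs = k_fold_splits(rows, k=3)
```

With 9 rows and k=3, each of the 3 pairs holds 6 training rows (2/3 of the data) and 3 test rows (1/3), and every row appears in exactly one test fold.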
Evaluation#
| Evaluator for regression, which expects input columns prediction and label. | |
| Evaluator for binary classification, which expects input columns prediction and label. | |
| Evaluator for multiclass classification, which expects input columns prediction and label. | |
Utilities#
| The base interface that Estimator, Transformer, Model, and Evaluator classes must inherit to support saving and loading. | |
| Meta-algorithms such as Pipeline and CrossValidator must implement this interface. | |