Abstract class for transformers that transform one dataset into another.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
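As a rough illustration of the default copy behavior, the sketch below uses a hypothetical `ToyParams` class (not part of the library) in which plain dictionaries stand in for the embedded param maps. The attribute names `_paramMap` and `_defaultParamMap` are assumptions made for this example.

```python
import copy


class ToyParams(object):
    """Hypothetical stand-in for a component that carries params."""

    def __init__(self, uid):
        self.uid = uid
        self._paramMap = {}         # user-supplied values (assumed name)
        self._defaultParamMap = {}  # default values (assumed name)

    def copy(self, extra=None):
        extra = extra or {}
        that = copy.copy(self)  # shallow copy; the uid is carried over
        # Give the copy its own maps so later changes do not leak back.
        that._defaultParamMap = dict(self._defaultParamMap)
        that._paramMap = dict(self._paramMap)
        that._paramMap.update(extra)  # extra params override embedded ones
        return that


original = ToyParams("toy_1")
original._paramMap["threshold"] = 1.0
clone = original.copy({"threshold": -0.5})
```

The copy keeps the uid of the original, while the extra value only affects the copy.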
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
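The precedence rule can be sketched with plain dictionaries. The function name `merge_param_maps` and the string keys are hypothetical; the library keys its param maps differently.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later arguments win on conflict."""
    merged = dict(defaults)        # lowest priority
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest priority
    return merged


merged = merge_param_maps(
    defaults={"threshold": 0.0, "outputCol": "out"},
    user_supplied={"threshold": 1.0},
    extra={"threshold": -0.5},
)
```

Here `threshold` ends up with the extra value, and `outputCol` keeps its default because nothing overrides it.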
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
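The relationship between the lookup and check methods above can be sketched as follows. `ToyParams`, `set_value`, and the use of `KeyError` are assumptions for this example; the text only says that an error is raised when neither a user-supplied value nor a default exists.

```python
class ToyParams(object):
    """Hypothetical component with default and user-supplied param maps."""

    def __init__(self, defaults=None):
        self._defaultParamMap = dict(defaults or {})
        self._paramMap = {}

    def set_value(self, name, value):
        self._paramMap[name] = value

    def hasDefault(self, name):
        return name in self._defaultParamMap

    def isSet(self, name):
        return name in self._paramMap

    def isDefined(self, name):
        # Defined means: explicitly set by the user, or has a default.
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        if name in self._paramMap:
            return self._paramMap[name]
        if name in self._defaultParamMap:
            return self._defaultParamMap[name]
        raise KeyError(name)  # assumed error type


toy = ToyParams(defaults={"threshold": 0.0})
toy.set_value("inputCol", "values")
```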
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
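A minimal sketch of discovering params through dir(), assuming a simplified `Param` class and a hypothetical `ToyComponent`; the real classes carry more state than shown here.

```python
class Param(object):
    """Simplified stand-in for a param with a name and a doc string."""

    def __init__(self, name, doc):
        self.name = name
        self.doc = doc


class ToyComponent(object):
    def __init__(self):
        self.threshold = Param("threshold", "threshold for binarization")
        self.inputCol = Param("inputCol", "input column name")
        self.label = "not a param"  # ignored by list_params


def list_params(component):
    """Collect every attribute of type Param, ordered by param name."""
    found = []
    for attr_name in dir(component):
        attr = getattr(component, attr_name)
        if isinstance(attr, Param):
            found.append(attr)
    return sorted(found, key=lambda p: p.name)


names = [p.name for p in list_params(ToyComponent())]
```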
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Abstract class for estimators that fit models to data.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Abstract class for models that are fitted by estimators.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit() is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit() method is called on the input dataset to fit a model; the model, which is a transformer, is then used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform() method is called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers corresponding to the pipeline stages. If there are no stages, the pipeline acts as an identity transformer.
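The stage-by-stage fitting described above can be sketched with toy classes that operate on plain lists instead of DataFrames. All names here (`AddOne`, `CenterEstimator`, `Shift`, `fit_pipeline`, `transform_with`) are hypothetical and only mimic the control flow; they are not the library's implementation.

```python
class AddOne(object):
    """Toy transformer: adds 1 to every value."""

    def transform(self, dataset):
        return [x + 1 for x in dataset]


class Shift(object):
    """Toy model (itself a transformer) produced by CenterEstimator."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, dataset):
        return [x - self.offset for x in dataset]


class CenterEstimator(object):
    """Toy estimator: learns the mean and returns a Shift model."""

    def fit(self, dataset):
        return Shift(sum(dataset) / float(len(dataset)))


def fit_pipeline(stages, dataset):
    """Fit estimators in order, feeding each stage's output to the next."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):       # estimator: fit, then use the model
            model = stage.fit(dataset)
            fitted.append(model)
            dataset = model.transform(dataset)
        else:                           # transformer: apply directly
            fitted.append(stage)
            dataset = stage.transform(dataset)
    return fitted


def transform_with(fitted, dataset):
    """Apply the fitted stages in order; no stages means identity."""
    for stage in fitted:
        dataset = stage.transform(dataset)
    return dataset


fitted = fit_pipeline([AddOne(), CenterEstimator()], [1.0, 2.0, 3.0])
result = transform_with(fitted, [1.0, 2.0, 3.0])
```

In this toy run the estimator sees the output of `AddOne`, so it learns a mean of 3.0 rather than 2.0.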
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Get pipeline stages.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets params for Pipeline.
Sets the pipeline stages.
:param value: a list of transformers or estimators
:return: the pipeline instance
Represents a compiled pipeline with transformers and fitted models.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Binarize a column of continuous features given a threshold.
>>> df = sqlContext.createDataFrame([(0.5,)], ["values"])
>>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
>>> binarizer.transform(df).head().features
0.0
>>> binarizer.setParams(outputCol="freqs").transform(df).head().freqs
0.0
>>> params = {binarizer.threshold: -0.5, binarizer.outputCol: "vector"}
>>> binarizer.transform(df, params).head().vector
1.0
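The doctest output above is consistent with a simple per-value rule, sketched here in plain Python. The function name `binarize` is hypothetical, and the strict "greater than" comparison is an assumption about how values exactly equal to the threshold are treated.

```python
def binarize(value, threshold):
    """Return 1.0 if value is strictly above threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


above = binarize(0.5, -0.5)  # same inputs as the last doctest call
below = binarize(0.5, 1.0)   # same inputs as the first doctest call
```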
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets params for this Binarizer.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Maps a sequence of terms to their term frequencies using the hashing trick.
>>> df = sqlContext.createDataFrame([(["a", "b", "c"],)], ["words"])
>>> hashingTF = HashingTF(numFeatures=10, inputCol="words", outputCol="features")
>>> hashingTF.transform(df).head().features
SparseVector(10, {7: 1.0, 8: 1.0, 9: 1.0})
>>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs
SparseVector(10, {7: 1.0, 8: 1.0, 9: 1.0})
>>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
>>> hashingTF.transform(df, params).head().vector
SparseVector(5, {2: 1.0, 3: 1.0, 4: 1.0})
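The hashing trick can be sketched in plain Python as below. The hash function is an assumption: `zlib.crc32` is used only because it is deterministic across runs, so the bucket indices will generally differ from the ones in the doctest output above. The function name `hashing_tf` is hypothetical.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features):
    """Map terms to {bucket index: term frequency} via hashing."""
    counts = Counter()
    for term in terms:
        # Stand-in hash; not the hash function used by the library.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)


tf = hashing_tf(["a", "b", "c", "a"], 10)
```

Distinct terms may collide in the same bucket, which is the trade-off for a fixed-size output vector that needs no vocabulary.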
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of inputCol or its default value.
Gets the value of numFeatures or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of numFeatures.
Sets params for this HashingTF.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Compute the Inverse Document Frequency (IDF) given a collection of documents.
>>> from pyspark.mllib.linalg import DenseVector
>>> df = sqlContext.createDataFrame([(DenseVector([1.0, 2.0]),),
...     (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"])
>>> idf = IDF(minDocFreq=3, inputCol="tf", outputCol="idf")
>>> idf.fit(df).transform(df).head().idf
DenseVector([0.0, 0.0])
>>> idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs
DenseVector([0.0, 0.0])
>>> params = {idf.minDocFreq: 1, idf.outputCol: "vector"}
>>> idf.fit(df, params).transform(df).head().vector
DenseVector([0.2877, 0.0])
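The numbers in the doctest above can be reproduced with a short sketch, assuming the smoothed formula idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents in which the term is non-zero, and assuming terms with df below minDocFreq get an IDF of 0. The function names are hypothetical.

```python
import math


def fit_idf(docs, min_doc_freq=0):
    """Compute one IDF weight per vector position."""
    m = len(docs)
    idf = []
    for j in range(len(docs[0])):
        df = sum(1 for doc in docs if doc[j] > 0.0)
        if df >= min_doc_freq:
            idf.append(math.log((m + 1.0) / (df + 1.0)))
        else:
            idf.append(0.0)  # term filtered out by min_doc_freq
    return idf


def transform_tf(tf, idf):
    """Scale each term frequency by its IDF weight."""
    return [t * w for t, w in zip(tf, idf)]


docs = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.2]]
strict = transform_tf(docs[0], fit_idf(docs, min_doc_freq=3))
loose = transform_tf(docs[0], fit_idf(docs, min_doc_freq=1))
```

With `min_doc_freq=1`, the first term appears in 2 of 3 documents, giving log(4/3), which rounds to 0.2877; the second term appears in every document, giving log(1) = 0.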
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of minDocFreq.
Model fitted by IDF.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with the extra params, so both the Python wrapper and the Java model get copied.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Normalize a vector to have unit norm using the given p-norm.
>>> from pyspark.mllib.linalg import Vectors
>>> svec = Vectors.sparse(4, {1: 4.0, 3: 3.0})
>>> df = sqlContext.createDataFrame([(Vectors.dense([3.0, -4.0]), svec)], ["dense", "sparse"])
>>> normalizer = Normalizer(p=2.0, inputCol="dense", outputCol="features")
>>> normalizer.transform(df).head().features
DenseVector([0.6, -0.8])
>>> normalizer.setParams(inputCol="sparse", outputCol="freqs").transform(df).head().freqs
SparseVector(4, {1: 0.8, 3: 0.6})
>>> params = {normalizer.p: 1.0, normalizer.inputCol: "dense", normalizer.outputCol: "vector"}
>>> normalizer.transform(df, params).head().vector
DenseVector([0.4286, -0.5714])
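The dense results in the doctest above follow from dividing each entry by the vector's p-norm. A minimal sketch on plain lists, with a hypothetical function name and an assumed pass-through for all-zero vectors:

```python
def normalize(vector, p=2.0):
    """Scale a vector so that its p-norm equals 1."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        return list(vector)  # assumed handling of the zero vector
    return [x / norm for x in vector]


unit_l2 = normalize([3.0, -4.0], p=2.0)  # norm is 5
unit_l1 = normalize([3.0, -4.0], p=1.0)  # norm is 7
```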
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example, with 5 categories, an input value of 2.0 maps to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because including it would make the vector entries always sum to one and hence be linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0]. Note that this is different from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.
See also
StringIndexer for converting categorical values into category indices
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> model = stringIndexer.fit(stringIndDf)
>>> td = model.transform(stringIndDf)
>>> encoder = OneHotEncoder(inputCol="indexed", outputCol="features")
>>> encoder.transform(td).head().features
SparseVector(2, {0: 1.0})
>>> encoder.setParams(outputCol="freqs").transform(td).head().freqs
SparseVector(2, {0: 1.0})
>>> params = {encoder.dropLast: False, encoder.outputCol: "test"}
>>> encoder.transform(td, params).head().test
SparseVector(3, {0: 1.0})
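The effect of dropLast on the 5-category example from the description can be sketched with dense lists. The function name `one_hot` is hypothetical, and dense lists are used for readability even though the actual output vectors are sparse.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a binary vector."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    i = int(index)
    if i < size:  # the dropped last category maps to all zeros
        vector[i] = 1.0
    return vector


middle = one_hot(2.0, 5)
last_dropped = one_hot(4.0, 5)
last_kept = one_hot(4.0, 5, drop_last=False)
```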
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets params for this OneHotEncoder.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Perform feature expansion in a polynomial space. As the Wikipedia article on polynomial expansion (http://en.wikipedia.org/wiki/Polynomial_expansion) puts it, “In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition”. For example, expanding the 2-variable feature vector (x, y) with degree 2 gives (x, x * x, y, x * y, y * y).
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(Vectors.dense([0.5, 2.0]),)], ["dense"])
>>> px = PolynomialExpansion(degree=2, inputCol="dense", outputCol="expanded")
>>> px.transform(df).head().expanded
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
>>> px.setParams(outputCol="test").transform(df).head().test
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
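For the 2-variable, degree-2 case given in the description, the expansion and its term ordering can be written out directly. This sketch (with a hypothetical function name) covers only that one case, not general degrees or dimensions.

```python
def expand_degree2(x, y):
    """Degree-2 expansion of (x, y) in the order stated in the description."""
    return [x, x * x, y, x * y, y * y]


expanded = expand_degree2(0.5, 2.0)  # same input as the doctest above
```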
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets params for this PolynomialExpansion.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
A regex-based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (the default) or by repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens by a minimum length. It returns an array of strings, which can be empty.
>>> df = sqlContext.createDataFrame([("a b  c",)], ["text"])
>>> reTokenizer = RegexTokenizer(inputCol="text", outputCol="words")
>>> reTokenizer.transform(df).head()
Row(text=u'a b  c', words=[u'a', u'b', u'c'])
>>> # Change a parameter.
>>> reTokenizer.setParams(outputCol="tokens").transform(df).head()
Row(text=u'a b  c', tokens=[u'a', u'b', u'c'])
>>> # Temporarily modify a parameter.
>>> reTokenizer.transform(df, {reTokenizer.outputCol: "words"}).head()
Row(text=u'a b  c', words=[u'a', u'b', u'c'])
>>> reTokenizer.transform(df).head()
Row(text=u'a b  c', tokens=[u'a', u'b', u'c'])
>>> # Must use keyword arguments to specify params.
>>> reTokenizer.setParams("text")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
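The two tokenization modes can be approximated with Python's `re` module, as sketched below. This is only an analogy: the library uses Java regex syntax, which differs from Python's in places, and the function name, the default pattern `\s+`, and the keyword names are assumptions made for this example.

```python
import re


def regex_tokenize(text, pattern=r"\s+", gaps=True, min_token_length=1):
    """Split on the pattern (gaps=True) or collect its matches (gaps=False)."""
    if gaps:
        tokens = re.split(pattern, text)    # pattern describes separators
    else:
        tokens = re.findall(pattern, text)  # pattern describes tokens
    # Drop tokens shorter than the minimum length (also removes "").
    return [t for t in tokens if len(t) >= min_token_length]


by_split = regex_tokenize("a b  c")
by_match = regex_tokenize("a b  c", pattern=r"\w+", gaps=False)
long_only = regex_tokenize("a bb  ccc", min_token_length=2)
```

Both modes yield the same tokens for this input; they differ in whether the pattern describes the separators or the tokens themselves.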
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of minTokenLength.
Sets params for this RegexTokenizer.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])
>>> standardScaler = StandardScaler(inputCol="a", outputCol="scaled")
>>> model = standardScaler.fit(df)
>>> model.transform(df).collect()[1].scaled
DenseVector([1.4142])
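The value 1.4142 in the doctest above is consistent with dividing by the sample standard deviation (n - 1 in the denominator) without subtracting the mean. The sketch below works on a single column of plain floats; the function names and the `with_mean` and `with_std` switches are assumptions for this example, not the documented signature.

```python
import math


def fit_column_stats(values):
    """Return the mean and the sample standard deviation of one column."""
    n = len(values)
    mean = sum(values) / float(n)
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)


def scale(value, mean, std, with_mean=False, with_std=True):
    """Optionally center, then optionally scale to unit standard deviation."""
    if with_mean:
        value = value - mean
    if with_std and std != 0.0:
        value = value / std
    return value


mean, std = fit_column_stats([0.0, 2.0])
scaled = scale(2.0, mean, std)                    # scaling only
centered = scale(2.0, mean, std, with_mean=True)  # centering and scaling
```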
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets params for this StandardScaler.
Model fitted by StandardScaler.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with the extra params, so both the Python wrapper and the Java model get copied.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, it is cast to string and the string values are indexed. The indices are in [0, numLabels), ordered by label frequency, so the most frequent label gets index 0.
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> model = stringIndexer.fit(stringIndDf)
>>> td = model.transform(stringIndDf)
>>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
...     key=lambda x: x[0])
[(0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)]
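The frequency-based ordering can be sketched in plain Python. This is an illustration of the idea rather than the library's implementation; `fit_label_index` is a hypothetical helper, and the labels below are assumed values chosen so that the result matches the index pattern in the doctest above (the contents of `stringIndDf` are not shown in this section).

```python
from collections import Counter


def fit_label_index(labels):
    """Map each distinct label (cast to str) to an index ordered by frequency."""
    counts = Counter(str(label) for label in labels)
    # most_common() sorts by descending count; how ties are broken here is
    # an artifact of this sketch, not documented behaviour.
    return {label: float(i) for i, (label, _) in enumerate(counts.most_common())}


# Assumed labels for ids 0..5 (hypothetical data).
labels = ["a", "b", "c", "a", "a", "c"]
index = fit_label_index(labels)
indexed = [(i, index[label]) for i, label in enumerate(labels)]
print(indexed)
```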
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Model fitted by StringIndexer.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
>>> df = sqlContext.createDataFrame([("a b c",)], ["text"])
>>> tokenizer = Tokenizer(inputCol="text", outputCol="words")
>>> tokenizer.transform(df).head()
Row(text=u'a b c', words=[u'a', u'b', u'c'])
>>> # Change a parameter.
>>> tokenizer.setParams(outputCol="tokens").transform(df).head()
Row(text=u'a b c', tokens=[u'a', u'b', u'c'])
>>> # Temporarily modify a parameter.
>>> tokenizer.transform(df, {tokenizer.outputCol: "words"}).head()
Row(text=u'a b c', words=[u'a', u'b', u'c'])
>>> tokenizer.transform(df).head()
Row(text=u'a b c', tokens=[u'a', u'b', u'c'])
>>> # Must use keyword arguments to specify params.
>>> tokenizer.setParams("text")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
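The tokenization rule itself (lowercase, then split on white space) can be sketched without Spark. `tokenize` is a hypothetical helper; how the library treats consecutive or leading white space is not stated in this section, so the collapsing behaviour below is an assumption of the sketch.

```python
def tokenize(text):
    """Lowercase the input, then split it on white space."""
    # str.split() with no arguments collapses runs of white space and
    # drops leading/trailing white space (an assumption of this sketch).
    return text.lower().split()


tokens = tokenize("A b  C")
print(tokens)
```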
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
A feature transformer that merges multiple columns into a vector column.
>>> df = sqlContext.createDataFrame([(1, 0, 3)], ["a", "b", "c"])
>>> vecAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
>>> vecAssembler.transform(df).head().features
DenseVector([1.0, 0.0, 3.0])
>>> vecAssembler.setParams(outputCol="freqs").transform(df).head().freqs
DenseVector([1.0, 0.0, 3.0])
>>> params = {vecAssembler.inputCols: ["b", "a"], vecAssembler.outputCol: "vector"}
>>> vecAssembler.transform(df, params).head().vector
DenseVector([0.0, 1.0])
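The merging step can be pictured as concatenating the selected columns of each row, in the order given by inputCols. The following is a plain-Python sketch under that reading; `assemble` is a hypothetical helper, and its handling of list-valued columns is an assumption not covered by the doctest above.

```python
def assemble(row, input_cols):
    """Concatenate the named columns of a row into one flat list of floats."""
    vector = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            vector.extend(float(v) for v in value)  # vector-valued column (assumed)
        else:
            vector.append(float(value))             # scalar column
    return vector


row = {"a": 1, "b": 0, "c": 3}
print(assemble(row, ["a", "b", "c"]))
print(assemble(row, ["b", "a"]))
```

The order of `input_cols` determines the order of the output entries, which is why the last doctest call above yields the values of "b" and "a" in that order.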
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of inputCols or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Class for indexing categorical feature columns in a dataset of Vector.
- Automatically identify categorical features (default behavior)
  - This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.
  - Set maxCategories to the maximum number of categories any categorical feature should have.
  - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 has unique values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
- Index all features, if all features are categorical
  - If maxCategories is set to be very large, then this will build an index of unique values for all features.
  - Warning: This can cause problems if features are continuous since this will collect ALL unique values to the driver.
  - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 has unique values {1.0, 3.0, 5.0}. If maxCategories >= 3, then both features will be declared categorical.
This returns a model which can transform categorical features to use 0-based indices.
- This is not guaranteed to choose the same category index across multiple runs.
- If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0. This maintains vector sparsity.
- More stability may be added in the future.
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(Vectors.dense([-1.0, 0.0]),),
...     (Vectors.dense([0.0, 1.0]),), (Vectors.dense([0.0, 2.0]),)], ["a"])
>>> indexer = VectorIndexer(maxCategories=2, inputCol="a", outputCol="indexed")
>>> model = indexer.fit(df)
>>> model.transform(df).head().indexed
DenseVector([1.0, 0.0])
>>> indexer.setParams(outputCol="test").fit(df).transform(df).collect()[1].test
DenseVector([0.0, 1.0])
>>> params = {indexer.maxCategories: 3, indexer.outputCol: "vector"}
>>> model2 = indexer.fit(df, params)
>>> model2.transform(df).head().vector
DenseVector([1.0, 0.0])
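The maxCategories decision described above can be sketched in plain Python using the same three vectors as the doctest. This is a simplified illustration, not the library's implementation: `fit_category_maps` and `index_vector` are hypothetical helpers, and the ordering of non-zero categories is an arbitrary choice of this sketch (the text above only guarantees that value 0 maps to index 0).

```python
def fit_category_maps(vectors, max_categories):
    """Return {feature index: {value: category index}} for categorical features."""
    num_features = len(vectors[0])
    category_maps = {}
    for j in range(num_features):
        unique = sorted({v[j] for v in vectors})
        if len(unique) <= max_categories:
            # Put 0.0 first so that value 0 maps to index 0 (keeps sparsity).
            ordered = [u for u in unique if u == 0.0] + [u for u in unique if u != 0.0]
            category_maps[j] = {value: float(i) for i, value in enumerate(ordered)}
    return category_maps


def index_vector(vector, category_maps):
    """Replace categorical values by their indices; leave continuous ones as-is."""
    return [category_maps[j][x] if j in category_maps else x
            for j, x in enumerate(vector)]


data = [[-1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
maps2 = fit_category_maps(data, max_categories=2)  # only feature 0 is categorical
maps3 = fit_category_maps(data, max_categories=3)  # both features are categorical
print(index_vector(data[0], maps2), index_vector(data[1], maps2))
```

With `max_categories=2`, feature 1 has three distinct values and is therefore left untouched as a continuous feature.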
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of maxCategories.
Word2Vec trains a model of Map(String, Vector), i.e. it transforms each word into a vector code for use in further natural language processing or machine learning tasks.
>>> sent = ("a b " * 100 + "a c " * 10).split(" ")
>>> doc = sqlContext.createDataFrame([(sent,), (sent,)], ["sentence"])
>>> model = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model").fit(doc)
>>> model.transform(doc).head().model
DenseVector([-0.0422, -0.5138, -0.2546, 0.6885, 0.276])
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of inputCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Gets the value of seed or its default value.
Gets the value of stepSize or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of numPartitions.
Sets params for this Word2Vec.
Sets the value of vectorSize.
Model fitted by Word2Vec.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Logistic regression.
>>> from pyspark.sql import Row
>>> from pyspark.mllib.linalg import Vectors
>>> df = sc.parallelize([
...     Row(label=1.0, features=Vectors.dense(1.0)),
...     Row(label=0.0, features=Vectors.sparse(1, [], []))]).toDF()
>>> lr = LogisticRegression(maxIter=5, regParam=0.01)
>>> model = lr.fit(df)
>>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0))]).toDF()
>>> model.transform(test0).head().prediction
0.0
>>> model.weights
DenseVector([5.5...])
>>> model.intercept
-2.68...
>>> test1 = sc.parallelize([Row(features=Vectors.sparse(1, [0], [1.0]))]).toDF()
>>> model.transform(test1).head().prediction
1.0
>>> lr.setParams("vector")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
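The keyword-only restriction shown at the end of the doctest above could be enforced with a decorator along these lines. This is a hypothetical re-implementation for illustration; the `keyword_only` decorator and the `Stage` class below are written for this sketch and are not taken from the library source.

```python
import functools


def keyword_only(func):
    """Hypothetical decorator that rejects positional arguments other than self."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        return func(self, **kwargs)
    return wrapper


class Stage:
    """Hypothetical pipeline stage that keeps its params in a dict."""

    def __init__(self):
        self.params = {}

    @keyword_only
    def setParams(self, **kwargs):
        self.params.update(kwargs)
        return self  # returning self allows chained calls


stage = Stage().setParams(outputCol="tokens")
try:
    stage.setParams("text")
except TypeError as err:
    message = str(err)
print(message)
```

Forcing keywords keeps call sites readable when a method accepts many optional params, and avoids silent mistakes when the parameter order changes.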
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
param for the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
param for whether to fit an intercept term.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of probabilityCol or its default value.
Gets the value of regParam or its default value.
Gets the value of tol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of elasticNetParam.
Sets the value of featuresCol.
Sets the value of fitIntercept.
Sets params for logistic regression.
Sets the value of predictionCol.
Sets the value of probabilityCol.
param for threshold in binary classification prediction, in range [0, 1].
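How a threshold turns an estimated probability into a binary prediction can be sketched as follows. `predict` is a hypothetical helper; the default of 0.5 and the strict `>` comparison are assumptions of this sketch, and the weight and intercept are rounded stand-ins for the truncated values in the doctest above.

```python
import math


def predict(features, weights, intercept, threshold=0.5):
    """Return 1.0 if the estimated probability exceeds the threshold, else 0.0."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    probability = 1.0 / (1.0 + math.exp(-margin))  # logistic (sigmoid) function
    return 1.0 if probability > threshold else 0.0


# Rounded stand-ins for model.weights and model.intercept (illustrative only).
weights, intercept = [5.5], -2.68

print(predict([-1.0], weights, intercept))
print(predict([1.0], weights, intercept))
```

Raising the threshold makes the positive class harder to predict: with these stand-in values, the second example is no longer classified as 1.0 once the threshold is set to 0.99.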
Model fitted by LogisticRegression.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Decision tree learning algorithm (http://en.wikipedia.org/wiki/Decision_tree_learning) for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")
>>> model = dt.fit(td)
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
param for the criterion used for information gain calculation (case-insensitive).
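As a rough illustration of what an information gain criterion computes, the sketch below evaluates one candidate split under two common impurity measures, Gini impurity and entropy. Which criterion names this param actually accepts is not stated in this section, so both the function names and the choice of measures are assumptions.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = float(len(labels))
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = float(len(labels))
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = float(len(parent))
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


parent = [0, 0, 1, 1]
gain_gini = information_gain(parent, [0, 0], [1, 1], gini)
gain_entropy = information_gain(parent, [0, 0], [1, 1], entropy)
print(gain_gini, gain_entropy)
```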
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for the DecisionTreeClassifier.
Sets the value of predictionCol.
Model fitted by DecisionTreeClassifier.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Gradient-Boosted Trees (GBTs) learning algorithm (http://en.wikipedia.org/wiki/Gradient_boosting) for classification. It supports binary labels, as well as both continuous and categorical features. Note: Multiclass labels are not currently supported.
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="indexed")
>>> model = gbt.fit(td)
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxIter or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
param for the loss function which GBT tries to minimize (case-insensitive).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for Gradient Boosted Tree Classification.
Sets the value of predictionCol.
Sets the value of subsamplingRate.
Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of
Fraction of the training data used for learning each decision tree, in range (0, 1].
Model fitted by GBTClassifier.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Random Forest learning algorithm (http://en.wikipedia.org/wiki/Random_forest) for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> rf = RandomForestClassifier(numTrees=2, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
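The default copy behaviour described above can be sketched with a small stand-in class. `TinyParams` and its `_paramMap` attribute are hypothetical names used only for this illustration; the sketch shows the general idea of a shallow copy followed by copying the parameters over, not the library's actual code.

```python
import copy


class TinyParams(object):
    """Hypothetical stand-in for an object that carries a uid and a param map."""

    def __init__(self, uid):
        self.uid = uid
        self._paramMap = {}

    def copy(self, extra=None):
        that = copy.copy(self)                 # shallow copy keeps the same uid
        that._paramMap = dict(self._paramMap)  # give the copy its own param map
        that._paramMap.update(extra or {})     # extra params win over embedded ones
        return that


original = TinyParams("estimator_1")
original._paramMap["maxIter"] = 10
duplicate = original.copy({"maxIter": 5})
print(duplicate.uid, duplicate._paramMap, original._paramMap)
```

Giving the copy its own dictionary matters: without it, the shallow copy would share the param map with the original, and setting a param on one would change the other.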
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
param for The number of features to consider for splits at each tree node
Fits a model to the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | fitted model(s) | 
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
param for Criterion used for information gain calculation (case-insensitive).
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
param for Number of trees to train (>= 1)
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featureSubsetStrategy.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for random forest classification.
Sets the value of predictionCol.
Sets the value of subsamplingRate.
param for Fraction of the training data used for learning each decision tree, in range (0, 1].
Model fitted by RandomForestClassifier.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | transformed dataset | 
Alternating Least Squares (ALS) matrix factorization.
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called ‘factor’ matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
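The alternating scheme described above can be sketched in pure Python for the simplest case of rank 1, where each least-squares update has a closed form. This is an illustrative sketch only: it is not the blocked implementation, and the regularization value `reg`, the initial factors, and the helper names are assumptions.

```python
# Observed ratings keyed by (user, item).
ratings = {(0, 0): 4.0, (0, 1): 2.0, (1, 1): 3.0,
           (1, 2): 4.0, (2, 1): 1.0, (2, 2): 5.0}
reg = 0.1  # hypothetical regularization strength


def solve(fixed, observed_by_key):
    """Rank-1 least-squares update for one side, holding `fixed` constant."""
    solved = {}
    for key, observed in observed_by_key.items():
        numerator = sum(r * fixed[other] for other, r in observed)
        denominator = sum(fixed[other] ** 2 for other, _ in observed) + reg
        solved[key] = numerator / denominator
    return solved


def squared_error(x, y):
    """Sum of squared residuals over the observed ratings."""
    return sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())


# Index the ratings by user and by item.
by_user, by_item = {}, {}
for (u, i), r in ratings.items():
    by_user.setdefault(u, []).append((i, r))
    by_item.setdefault(i, []).append((u, r))

x = {u: 1.0 for u in by_user}  # user factors (rank 1)
y = {i: 1.0 for i in by_item}  # item factors (rank 1)
errors = [squared_error(x, y)]
for _ in range(10):
    x = solve(y, by_user)  # hold item factors constant, solve for users
    y = solve(x, by_item)  # hold user factors constant, solve for items
    errors.append(squared_error(x, y))

print(errors[0], errors[-1])
```

After a few alternations the squared error on the observed ratings falls well below its starting value. For higher ranks, each update becomes a small regularized linear system per user or item instead of a scalar division.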
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as “users” and “products”) into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user’s feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the “out-links” of each user (which blocks of products it will contribute to) and “in-link” information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users’ ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on “Collaborative Filtering for Implicit Feedback Datasets”, available at http://dx.doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.
Essentially, instead of finding low-rank approximations to the rating matrix R, this finds approximations for a preference matrix P, where the elements of P are 1 if r > 0 and 0 if r <= 0. The ratings then act as ‘confidence’ values related to the strength of the indicated user preferences rather than as explicit ratings given to items.
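The mapping from raw ratings to preferences and confidences can be sketched as below. The preference rule follows the text above; the linear confidence form c = 1 + alpha * r is taken from the cited paper, and the value of `alpha` and the function names are assumptions made for illustration.

```python
def preference(r):
    """Binary preference: 1 if the rating is positive, otherwise 0."""
    return 1.0 if r > 0 else 0.0


def confidence(r, alpha):
    """Linear confidence from the cited paper, for non-negative ratings."""
    return 1.0 + alpha * r


alpha = 2.0  # hypothetical confidence scaling factor
observations = [4.0, 0.0, 2.0]
for r in observations:
    print(r, preference(r), confidence(r, alpha))
```

A larger rating does not change the preference, which stays at 1, but it raises the confidence and therefore the weight that observation carries in the least-squares fit.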
>>> df = sqlContext.createDataFrame(
...     [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
...     ["user", "item", "rating"])
>>> als = ALS(rank=10, maxIter=5)
>>> model = als.fit(df)
>>> model.rank
10
>>> model.userFactors.orderBy("id").collect()
[Row(id=0, features=[...]), Row(id=1, ...), Row(id=2, ...)]
>>> test = sqlContext.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", "item"])
>>> predictions = sorted(model.transform(test).collect(), key=lambda r: r[0])
>>> predictions[0]
Row(user=0, item=2, prediction=0.39...)
>>> predictions[1]
Row(user=1, item=0, prediction=3.19...)
>>> predictions[2]
Row(user=2, item=0, prediction=-1.15...)
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | fitted model(s) | 
Gets the value of checkpointInterval or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of regParam or its default value.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of checkpointInterval.
Sets the value of implicitPrefs.
Sets the value of nonnegative.
Sets both numUserBlocks and numItemBlocks to the specified value.
Sets the value of numItemBlocks.
Sets the value of numUserBlocks.
Sets params for ALS.
Sets the value of predictionCol.
Model fitted by ALS.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | transformed dataset | 
Decision tree (http://en.wikipedia.org/wiki/Decision_tree_learning) learning algorithm for regression. It supports both continuous and categorical features.
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> dt = DecisionTreeRegressor(maxDepth=2)
>>> model = dt.fit(df)
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | fitted model(s) | 
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
param for Criterion used for information gain calculation (case-insensitive).
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for the DecisionTreeRegressor.
Sets the value of predictionCol.
Model fitted by DecisionTreeRegressor.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | transformed dataset | 
Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting) learning algorithm for regression. It supports both continuous and categorical features.
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> gbt = GBTRegressor(maxIter=5, maxDepth=2)
>>> model = gbt.fit(df)
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | fitted model(s) | 
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxIter or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
param for Loss function which GBT tries to minimize (case-insensitive).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for Gradient Boosted Tree Regression.
Sets the value of predictionCol.
Sets the value of subsamplingRate.
Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of
Fraction of the training data used for learning each decision tree, in range (0, 1].
Model fitted by GBTRegressor.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | transformed dataset | 
Linear regression.
The learning objective is to minimize the squared error, with regularization. The specific squared error loss function used is: L = 1/(2n) ||A weights - y||^2
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> lr = LinearRegression(maxIter=5, regParam=0.0)
>>> model = lr.fit(df)
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
-1.0
>>> model.weights
DenseVector([1.0])
>>> model.intercept
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> lr.setParams("vector")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
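The squared error loss given in the description above, L = 1/(2n) ||A weights - y||^2, can be evaluated by hand on the two-row dataset from the example. This is a minimal pure-Python sketch that omits the regularization term and the intercept; the function name `squared_error_loss` is an assumption.

```python
def squared_error_loss(A, weights, y):
    """Compute 1/(2n) * ||A weights - y||^2 without regularization."""
    n = len(y)
    residuals = [sum(a * w for a, w in zip(row, weights)) - target
                 for row, target in zip(A, y)]
    return sum(res ** 2 for res in residuals) / (2.0 * n)


A = [[1.0], [0.0]]  # feature rows from the example dataset
y = [1.0, 0.0]      # matching labels

print(squared_error_loss(A, [1.0], y))  # 0.0: weights [1.0] fit both rows exactly
print(squared_error_loss(A, [0.0], y))  # 0.25: one residual of -1 over 2n = 4
```

The zero loss at weights [1.0] is consistent with the fitted model in the example reporting `DenseVector([1.0])` and an intercept of 0.0.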
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
param for the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | fitted model(s) | 
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of regParam or its default value.
Gets the value of tol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of elasticNetParam.
Sets the value of featuresCol.
Sets params for linear regression.
Sets the value of predictionCol.
Model fitted by LinearRegression.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | transformed dataset | 
Random Forest (http://en.wikipedia.org/wiki/Random_forest) learning algorithm for regression. It supports both continuous and categorical features.
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
>>> model = rf.fit(df)
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
0.5
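A prediction of 0.5 from a two-tree forest is what averaging would give if one tree predicts 1.0 and the other 0.0. The sketch below shows that general idea for random forest regression; the two hand-written trees are hypothetical, and nothing here is taken from the fitted model in the example.

```python
def forest_predict(trees, features):
    """Average the predictions of the individual trees."""
    predictions = [tree(features) for tree in trees]
    return sum(predictions) / len(predictions)


def splitting_tree(features):
    """Hypothetical tree that learned a split on the first feature."""
    return 1.0 if features[0] > 0.5 else 0.0


def constant_tree(features):
    """Hypothetical tree whose training sample contained only label 0.0."""
    return 0.0


trees = [splitting_tree, constant_tree]
print(forest_predict(trees, [-1.0]))  # 0.0: both trees predict 0.0
print(forest_predict(trees, [1.0]))   # 0.5: the trees disagree, so the mean is 0.5
```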
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
param for The number of features to consider for splits at each tree node
Fits a model to the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | fitted model(s) | 
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
param for Criterion used for information gain calculation (case-insensitive).
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
param for Number of trees to train (>= 1)
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featureSubsetStrategy.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for random forest regression.
Sets the value of predictionCol.
Sets the value of subsamplingRate.
param for Fraction of the training data used for learning each decision tree, in range (0, 1].
Model fitted by RandomForestRegressor.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params, so both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | 
 | 
|---|---|
| Returns: | transformed dataset | 
Builder for a param grid used in grid search-based model selection.
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> output = ParamGridBuilder() \
...     .baseOn({lr.labelCol: 'l'}) \
...     .baseOn([lr.predictionCol, 'p']) \
...     .addGrid(lr.regParam, [1.0, 2.0]) \
...     .addGrid(lr.maxIter, [1, 5]) \
...     .build()
>>> expected = [
...     {lr.regParam: 1.0, lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'},
...     {lr.regParam: 2.0, lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'},
...     {lr.regParam: 1.0, lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'},
...     {lr.regParam: 2.0, lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}]
>>> len(output) == len(expected)
True
>>> all([m in expected for m in output])
True
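The grid in the example above is the Cartesian product of the per-param value lists, with the fixed base values added to every combination. A minimal sketch of that construction follows; `build_grid` and the use of string keys instead of Param objects are assumptions for illustration, not the builder's actual code.

```python
import itertools


def build_grid(base, grid):
    """Return one param map per combination of grid values, plus the base values."""
    names = sorted(grid)
    param_maps = []
    for combo in itertools.product(*[grid[name] for name in names]):
        param_map = dict(base)               # fixed values shared by every map
        param_map.update(zip(names, combo))  # one value per gridded param
        param_maps.append(param_map)
    return param_maps


base = {"labelCol": "l", "predictionCol": "p"}
grid = {"regParam": [1.0, 2.0], "maxIter": [1, 5]}

output = build_grid(base, grid)
print(len(output))
for param_map in output:
    print(param_map)
```

Two values for each of two params give 2 x 2 = 4 param maps, the same count as `expected` in the example.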
K-fold cross validation.
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator
>>> from pyspark.mllib.linalg import Vectors
>>> dataset = sqlContext.createDataFrame(
...     [(Vectors.dense([0.0]), 0.0),
...      (Vectors.dense([0.4]), 1.0),
...      (Vectors.dense([0.5]), 0.0),
...      (Vectors.dense([0.6]), 1.0),
...      (Vectors.dense([1.0]), 1.0)] * 10,
...     ["features", "label"])
>>> lr = LogisticRegression()
>>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>>> evaluator = BinaryClassificationEvaluator()
>>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
>>> cvModel = cv.fit(dataset)
>>> evaluator.evaluate(cvModel.transform(dataset))
0.8333...
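K-fold cross validation splits the data into k folds and uses each fold once for validation while training on the remaining folds. The sketch below shows only that splitting step in pure Python; the round-robin fold assignment and the name `k_fold_splits` are assumptions, and the way the library assigns rows to folds may differ.

```python
def k_fold_splits(rows, num_folds):
    """Yield (training, validation) pairs, one per fold."""
    # Round-robin assignment of rows to folds (an illustrative choice).
    folds = [rows[i::num_folds] for i in range(num_folds)]
    for held_out in range(num_folds):
        validation = folds[held_out]
        training = [row
                    for index, fold in enumerate(folds) if index != held_out
                    for row in fold]
        yield training, validation


rows = list(range(10))
splits = list(k_fold_splits(rows, 3))
for training, validation in splits:
    print(len(training), len(validation))
```

Each candidate param map would then be fitted on every training split and scored on the matching validation split, with the average score used to compare candidates.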
param for estimator to be cross-validated
param for estimator param maps
param for the evaluator used to select hyper-parameters that maximize the cross-validated metric
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
Fits a model to the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | fitted model(s) |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
param for number of folds for cross validation
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
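The dir()-based lookup mentioned above can be illustrated with a toy class. This is a sketch, not pyspark's code: the `Param` stand-in and `ToyEvaluator` are hypothetical, and the ordering relies on dir() returning attribute names sorted alphabetically.

```python
class Param:
    """Stand-in for a param descriptor (hypothetical, not pyspark's class)."""

    def __init__(self, name, doc):
        self.name = name
        self.doc = doc


class ToyEvaluator:
    metricName = Param("metricName", "metric name in evaluation")
    labelCol = Param("labelCol", "label column name")
    threshold = 0.5  # not a Param, so it is skipped

    @property
    def params(self):
        # Look names up on the class so this property is not re-entered.
        cls = type(self)
        found = [getattr(cls, name) for name in dir(cls)]
        return [attr for attr in found if isinstance(attr, Param)]


names = [p.name for p in ToyEvaluator().params]
print(names)  # ['labelCol', 'metricName']
```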
Sets the value of estimatorParamMaps.
Model from k-fold cross validation.
best model from cross validation
Creates a copy of this instance with a randomly generated uid and some extra params. This copies the underlying bestModel, creates a deep copy of the embedded paramMap, and copies the embedded and extra parameters over.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map. If there are conflicts, the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
| Parameters: | |
|---|---|
| Returns: | transformed dataset |
Base class for evaluators that compute metrics from predictions.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
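The default copy behaviour described above can be sketched as a shallow copy followed by a param-map merge. This is an illustration only: the class `ToyParams` and its `_paramMap` attribute are assumptions made for the example, not a statement about the library's internals.

```python
import copy


class ToyParams:
    """Toy holder of a uid and a param map (hypothetical)."""

    def __init__(self, uid):
        self.uid = uid
        self._paramMap = {}

    def copy(self, extra=None):
        that = copy.copy(self)                  # shallow copy keeps the same uid
        that._paramMap = dict(self._paramMap)   # give the copy its own map
        that._paramMap.update(extra or {})      # extra values win on conflict
        return that


original = ToyParams("eval_1")
original._paramMap["metricName"] = "rmse"
clone = original.copy({"metricName": "mae"})
print(clone.uid, clone._paramMap, original._paramMap)
```

Because the copy gets its own map, the extra value changes only the clone; the original keeps its metricName.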
Evaluates the output with optional parameters.
| Parameters: | |
|---|---|
| Returns: | metric |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map. If there are conflicts, the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Evaluator for binary classification, which expects two input columns: rawPrediction and label.
>>> from pyspark.mllib.linalg import Vectors
>>> scoreAndLabels = map(lambda x: (Vectors.dense([1.0 - x[0], x[0]]), x[1]),
...    [(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)])
>>> dataset = sqlContext.createDataFrame(scoreAndLabels, ["raw", "label"])
...
>>> evaluator = BinaryClassificationEvaluator(rawPredictionCol="raw")
>>> evaluator.evaluate(dataset)
0.70...
>>> evaluator.evaluate(dataset, {evaluator.metricName: "areaUnderPR"})
0.83...
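The first value in the example above can be checked by hand. A minimal sketch, assuming the area under the ROC curve is computed as the fraction of (positive, negative) pairs in which the positive row has the higher score, with ties counted as one half. The function name is hypothetical and this pairwise form is not how Spark builds the curve; it is used here because it is easy to verify on seven rows.

```python
def area_under_roc(scores_and_labels):
    """Pairwise (rank-based) estimate of the area under the ROC curve."""
    positives = [s for s, y in scores_and_labels if y == 1.0]
    negatives = [s for s, y in scores_and_labels if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Same (score, label) pairs as the example above.
pairs = [(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0),
         (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)]
auc = area_under_roc(pairs)
print(round(auc, 4))  # 0.7083
```

With 4 positive and 3 negative rows there are 12 pairs, of which 8.5 are counted as wins, which agrees with the 0.70... shown above.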
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Evaluates the output with optional parameters.
| Parameters: | |
|---|---|
| Returns: | metric |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map. If there are conflicts, the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of rawPredictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
param for metric name in evaluation (areaUnderROC|areaUnderPR)
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of metricName.
Sets params for binary classification evaluator.
Sets the value of rawPredictionCol.
Evaluator for regression, which expects two input columns: prediction and label.
>>> scoreAndLabels = [(-28.98343821, -27.0), (20.21491975, 21.5),
...   (-25.98418959, -22.0), (30.69731842, 33.0), (74.69283752, 71.0)]
>>> dataset = sqlContext.createDataFrame(scoreAndLabels, ["raw", "label"])
...
>>> evaluator = RegressionEvaluator(predictionCol="raw")
>>> evaluator.evaluate(dataset)
2.842...
>>> evaluator.evaluate(dataset, {evaluator.metricName: "r2"})
0.993...
>>> evaluator.evaluate(dataset, {evaluator.metricName: "mae"})
2.649...
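The three values in the example above can be reproduced with the textbook definitions of the metrics. A minimal plain-Python sketch, assuming r2 is defined as 1 - SS_err / SS_tot; the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(predictions_and_labels):
    """Compute mse, rmse, mae and r2 from (prediction, label) pairs."""
    n = len(predictions_and_labels)
    errors = [label - pred for pred, label in predictions_and_labels]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(label for _, label in predictions_and_labels) / n
    ss_tot = sum((label - mean_label) ** 2 for _, label in predictions_and_labels)
    r2 = 1.0 - (mse * n) / ss_tot   # 1 - SS_err / SS_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Same (prediction, label) pairs as the example above.
score_and_labels = [(-28.98343821, -27.0), (20.21491975, 21.5),
                    (-25.98418959, -22.0), (30.69731842, 33.0),
                    (74.69283752, 71.0)]
metrics = regression_metrics(score_and_labels)
print(metrics["rmse"], metrics["r2"], metrics["mae"])
```

In this sketch rmse corresponds to the 2.842... returned by the first call, and r2 and mae correspond to the 0.993... and 2.649... returned when metricName is overridden.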
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
Evaluates the output with optional parameters.
| Parameters: | |
|---|---|
| Returns: | metric |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map. If there are conflicts, the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
:param extra: extra param values
:return: merged param map
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
param for metric name in evaluation (mse|rmse|r2|mae)
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of metricName.
Sets params for regression evaluator.
Sets the value of predictionCol.