Feature Engineering

Tangram chooses which feature groups to generate based on the types of the input columns in the CSV. Tangram either infers the column types automatically, or you can specify them explicitly by passing a config file when invoking tangram train. See the guide Train with Custom Configuration.

Identity Feature Groups

Identity feature groups are used for Number and Enum columns when training Gradient Boosted Decision Trees. An Identity feature group for an Enum column is also known as Label Encoding.

Enum Column Example

For instance, suppose we have a column in our CSV called color which takes on three values: Red, Green, and Blue. The Identity feature group would assign the following feature values to each of the input’s enum variant values.

Input Value    Feature Value
OOV            0
Red            1
Green          2
Blue           3

The feature value 0 is reserved for Out of Vocabulary (OOV) values. These are values that we did not see in the training dataset but may appear in testing or when the model is deployed.
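
The mapping above can be sketched in a few lines of Python. This is only an illustration, not Tangram's implementation: the function names are hypothetical, and numbering the variants in order of first appearance in the training data is an assumption.

```python
def fit_identity_encoding(training_values):
    """Assign 1..n to the variants seen in training; 0 stays reserved for OOV.

    Assumption: variants are numbered in order of first appearance.
    """
    mapping = {}
    for value in training_values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def encode_identity(mapping, value):
    """Return the feature value for an input, or 0 if it was never seen in training."""
    return mapping.get(value, 0)


mapping = fit_identity_encoding(["Red", "Green", "Blue", "Green"])
encoded = [encode_identity(mapping, v) for v in ["Red", "Green", "Blue", "Purple"]]
print(encoded)  # [1, 2, 3, 0]
```

Purple never appears in the training values, so it falls back to the reserved OOV value 0.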

One Hot Encoded Feature Groups

One Hot Encoded feature groups are used for Enum columns when training linear models. If we were to use the Identity feature encoding in linear models, we would be assuming an order among the variants: Red < Green < Blue. To avoid imposing that order, we use a One Hot Encoding. The feature group consists of n + 1 features, where n is the number of variants in the Enum column: one feature for the Out of Vocabulary (OOV) value and one for each variant. The feature at index 0 is 1 if the input value is OOV, the feature at index i is 1 if the input value is equal to the i-th variant, and all other features are 0.

Example

For instance, suppose we have a column in our CSV called color which takes on three values: Red, Green, and Blue. The One Hot Encoding feature group would assign the following feature values to each of the input’s enum variant values.

Input Value    Feature Value
OOV            1 0 0 0
Red            0 1 0 0
Green          0 0 1 0
Blue           0 0 0 1
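
A minimal Python sketch of this encoding follows. The function name one_hot and the list-based representation are assumptions made for illustration and are not Tangram's API; the OOV feature sits at index 0, matching the table above.

```python
def one_hot(variants, value):
    """Return n + 1 features: index 0 is OOV, index i is the i-th variant."""
    features = [0] * (len(variants) + 1)
    # Unknown values light up the OOV slot at index 0.
    index = variants.index(value) + 1 if value in variants else 0
    features[index] = 1
    return features


variants = ["Red", "Green", "Blue"]
green = one_hot(variants, "Green")
unknown = one_hot(variants, "Purple")
print(green)    # [0, 0, 1, 0]
print(unknown)  # [1, 0, 0, 0]
```

Exactly one feature is set for every input, so no variant is treated as larger or smaller than another.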

Normalized Feature Groups

Normalized feature groups are used for Number columns when training linear models. The feature mapping transforms the input column values into a feature column with mean zero and unit variance.
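
As a sketch of this transformation, the snippet below subtracts the training mean and divides by the training standard deviation. The function names are hypothetical, and two details are assumptions rather than documented Tangram behavior: the use of the population standard deviation, and mapping a constant column to 0.

```python
import statistics


def fit_normalizer(training_values):
    """Compute the mean and standard deviation of the training column."""
    mean = statistics.fmean(training_values)
    # Assumption: population standard deviation (divide by N, not N - 1).
    std = statistics.pstdev(training_values)
    return mean, std


def normalize(value, mean, std):
    """Shift and scale a value so the training column has mean 0 and variance 1."""
    if std == 0:
        # Assumption: a constant column carries no signal, so map it to 0.
        return 0.0
    return (value - mean) / std


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_normalizer(column)
normalized = [normalize(v, mean, std) for v in column]
print(mean, std)  # 5.0 2.0
```

The statistics are computed once from the training data and then reused unchanged at prediction time.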

Bag of Words Feature Groups

Bag of Words feature groups are used to encode Text columns. A Bag of Words feature group consists of n features, one for each of the unique ngrams in the text column. The feature value for a given ngram depends on the strategy used: Present, Count, or TF-IDF. The Present strategy assigns 1 if the ngram appears in the text and 0 otherwise. The Count strategy assigns the number of times the ngram appears in the text. The TF-IDF strategy assigns that count weighted by TF-IDF (term frequency-inverse document frequency). See Bag of Words Model to learn more.
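
The three strategies can be compared with a small Python sketch. Everything here is an assumption for illustration: the function names, lowercased whitespace tokenization, the restriction to unigrams, and the idf formula log(N / df). Tangram's tokenizer and TF-IDF weighting may differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumption: lowercase, whitespace-separated unigrams."""
    return text.lower().split()


def fit_vocabulary(documents):
    """Return the sorted unique ngrams and how many documents contain each."""
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(tokenize(doc)))
    return sorted(document_frequency), document_frequency


def bag_of_words(text, vocab, document_frequency, n_docs, strategy):
    """Produce one feature per ngram in the vocabulary."""
    counts = Counter(tokenize(text))
    features = []
    for ngram in vocab:
        count = counts[ngram]
        if strategy == "present":
            features.append(1.0 if count > 0 else 0.0)
        elif strategy == "count":
            features.append(float(count))
        elif strategy == "tfidf":
            # Assumption: idf = log(N / df), one of several common variants.
            idf = math.log(n_docs / document_frequency[ngram])
            features.append(count * idf)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return features


docs = ["the cat sat", "the dog sat on the mat"]
vocab, df = fit_vocabulary(docs)
text = "the dog sat on the mat"
present = bag_of_words(text, vocab, df, len(docs), "present")
count = bag_of_words(text, vocab, df, len(docs), "count")
tfidf = bag_of_words(text, vocab, df, len(docs), "tfidf")
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(present)  # [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(count)    # [0.0, 1.0, 1.0, 1.0, 1.0, 2.0]
```

With this idf formula, ngrams that occur in every document (here "the" and "sat") receive a TF-IDF value of 0, while ngrams unique to one document keep a positive weight.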