An elegant way to do feature engineering — feature engineering foundations

After repeatedly writing use-case-specific, business-logic-coupled feature engineering code for a couple of years, I started wondering whether it is possible to have a model-agnostic (in both algorithms and frameworks), production-ready, data-scientist-friendly way to do feature engineering. I will summarize what we did for use cases in financial services and healthcare, and propose some good practices (calling them "best" would not be accurate; we have seen good and better, but never best) to handle common problems in the lifecycle of ML models.

The purpose of feature engineering is to prepare features for ML models to process, so we want the output of feature engineering to

  • improve the accuracy of the models

The discussion is divided into the following sections:

  • Feature engineering foundations

This article covers the first section. Much of the information in this post is drawn from the book “Machine Learning Design Patterns”.

Content

  • Naming convention

Naming convention

  • input: the real-world data fed to the model
  • feature: the transformed data that the model actually operates on

Univariate data representation methods

Numerical Inputs

Numerical inputs need to be scaled so that gradient-based models such as linear regression and neural networks converge quickly, and so that magnitudes are standardized across features.

  1. Linear scaling
  • Min-max scaling (see the sketch after this list)

2. Nonlinear transformation

When:

  • Data is skewed and neither uniformly distributed nor distributed like a bell curve

Methods:

  • bins/quantiles: one way to choose the buckets is histogram equalization, where the bin boundaries are chosen based on quantiles of the raw distribution, so each bucket ends up with roughly the same number of examples.
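A minimal sketch of both approaches, assuming scikit-learn and pandas are available (the column names and values below are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data: a roughly symmetric column and a heavily skewed column
# (hypothetical column names, for illustration only).
df = pd.DataFrame({
    "age": [18, 25, 31, 42, 57, 63, 70],
    "income": [20_000, 22_000, 25_000, 40_000, 90_000, 250_000, 1_200_000],
})

# 1. Linear scaling: min-max scaling maps values into [0, 1].
scaler = MinMaxScaler()
df["age_scaled"] = scaler.fit_transform(df[["age"]]).ravel()

# 2. Nonlinear transformation: quantile bucketing (histogram equalization).
# Each bucket holds roughly the same number of rows, so the skewed
# "income" column becomes an evenly populated categorical feature.
df["income_bucket"] = pd.qcut(df["income"], q=4, labels=False)

print(df)
```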

An Array of Numbers

  • Representing the input array in terms of its bulk statistics: average, median, minimum, maximum, and so forth.
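For instance, assuming each example carries a variable-length list of numbers (a hypothetical list of past purchase amounts per customer), the array can be summarized by fixed-size bulk statistics:

```python
import numpy as np

def bulk_stats(values):
    """Summarize a variable-length array of numbers as fixed-size features."""
    arr = np.asarray(values, dtype=float)
    return {
        "mean": arr.mean(),
        "median": np.median(arr),
        "min": arr.min(),
        "max": arr.max(),
        "count": arr.size,
    }

# Hypothetical example: past purchase amounts for one customer.
print(bulk_stats([12.0, 3.5, 40.0, 7.25]))
```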

Categorical Inputs

  • one-hot encoding
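A minimal sketch of one-hot encoding with scikit-learn (the category values are hypothetical):

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical input: the plan a customer subscribes to.
plans = [["basic"], ["premium"], ["basic"], ["enterprise"]]

# handle_unknown="ignore" maps categories unseen at training time to all
# zeros at serving time instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(plans).toarray()

print(encoder.categories_)
print(one_hot)
```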

An array of Categorical Inputs

  • counting/relative-frequency
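One way to read the counting/relative-frequency idea (a sketch, not necessarily the original implementation): an example that carries an array of categories can be represented by the relative frequency of each vocabulary item in that array. The vocabulary and event names below are assumptions for illustration.

```python
from collections import Counter

# Fixed vocabulary of possible categories (hypothetical page types).
VOCAB = ["home", "search", "product", "checkout"]

def relative_frequency(events):
    """Encode an array of categorical inputs as per-category relative frequencies."""
    counts = Counter(events)
    total = max(len(events), 1)
    return [counts.get(v, 0) / total for v in VOCAB]

# Hypothetical example: pages a user visited in one session.
print(relative_frequency(["home", "product", "product", "checkout"]))
# -> [0.25, 0.0, 0.5, 0.25]
```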

Other Inputs

Multi-variate data representation methods

Feature Cross

A feature cross concatenates two or more categorical features in order to capture the interaction between them. This lets a simple (e.g., linear) model learn nonlinear relationships between the features, which can keep the model simpler while improving performance.

Numeric features need to be preprocessed (bucketized) into categorical features before they can take part in a feature cross.

To handle the high cardinality that feature crosses introduce, we can use hashing or embeddings. L1 and L2 regularization are useful to encourage sparsity of the crossed features and reduce overfitting.
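A minimal sketch of a feature cross, assuming pandas (column names, bucket boundaries, and the bucket count are made up for illustration): a numeric feature is first bucketized, then concatenated with another categorical feature, and the crossed value is hashed to keep cardinality bounded.

```python
import pandas as pd

NUM_HASH_BUCKETS = 100  # assumed bucket count; tune for the real cardinality

df = pd.DataFrame({
    "age": [23, 37, 52, 68],             # numeric input
    "city": ["NYC", "SF", "NYC", "LA"],  # categorical input
})

# Bucketize the numeric feature so it can participate in a cross.
df["age_bucket"] = pd.cut(df["age"], bins=[0, 30, 45, 60, 120],
                          labels=["<30", "30-45", "45-60", "60+"])

# Feature cross: concatenate the two categorical values.
df["age_x_city"] = df["age_bucket"].astype(str) + "_x_" + df["city"]

# Hash the crossed feature into a fixed number of buckets to cap cardinality.
# Python's hash() is fine for a sketch; production code should use a stable
# hash (e.g., hashlib) so bucket assignments are reproducible across runs.
df["age_x_city_hashed"] = df["age_x_city"].apply(
    lambda v: hash(v) % NUM_HASH_BUCKETS
)

print(df[["age_x_city", "age_x_city_hashed"]])
```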

Concatenating representations

An input to a model can be represented as a number, as a category, as an image, or as free-form text. In real-world use cases, the information for a machine learning problem often comes from several such sources, and we can enable a single model to make decisions over this multi-input problem. To achieve this, we concatenate the encoded features before feeding them into the model.
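As a sketch, assuming scikit-learn (the column names and data are hypothetical): a numeric column, a categorical column, and a free-form text column are each encoded separately and then concatenated column-wise into a single feature matrix before going into the model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical multi-source input: numbers, a category, and free-form text.
df = pd.DataFrame({
    "claim_amount": [120.0, 5300.0, 89.5],
    "claim_type": ["auto", "home", "auto"],
    "notes": ["rear bumper damage", "water damage in basement", "broken mirror"],
})

# Encode each source with an appropriate representation, then concatenate
# the encodings into one feature matrix.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["claim_amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["claim_type"]),
    ("text", TfidfVectorizer(), "notes"),  # text column is passed as a string name
])

X = preprocessor.fit_transform(df)
print(X.shape)  # one row per example, concatenated encodings as columns
```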

Reference:

  • Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine Learning Design Patterns. O'Reilly Media.
