After a couple of years of repeatedly writing feature engineering code that was specific to each use case and tightly coupled to business logic, I started wondering whether there is a model-agnostic (across both algorithms and frameworks), production-ready, data-scientist-friendly way to do feature engineering. Here I summarize what we did for use cases in financial services and healthcare, and propose some best practices (the term is not quite accurate: we have seen good and better, but never best) for handling common problems in the lifecycle of ML models.

The purpose of feature engineering is to prepare features for ML models to process…


ML projects in the real world. Photo by elCarito on Unsplash

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It has been an important part of computer science for decades. In recent years, with better algorithms, abundant data, and more powerful computing resources, the performance of ML models has improved significantly, and ML has started to contribute more and more to business and industry use cases.

Unlike research practice, ML in industry requires a more standardized processing pipeline, more robust experiment analysis, and more affordable deployment, leading to the creation of tools that help companies bring theoretical ML…


These days, more and more organizations want a uniform tool for managing their large numbers of features. For a simple POC product, feature governance and lineage may seem like overkill, but for big, complex, and continuously evolving projects, they do promote higher feature quality and execution efficiency in subsequent data science projects.

To summarize, why do we need a feature store?

  • Standardizing the feature definitions. It should be an open, extensible, unified platform for feature storage.
  • Promoting feature reusability. It facilitates discovery and feature reuse across machine learning projects.
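The two points above can be illustrated with a toy in-memory registry. This is only a minimal sketch, not a real feature store: `FeatureDefinition`, `FeatureRegistry`, and the two example features are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeatureDefinition:
    """One standardized feature: a unique name, a description, and its logic."""
    name: str
    description: str
    compute: Callable[[dict], float]


class FeatureRegistry:
    """Hypothetical in-memory registry standing in for a feature store."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        # A single definition per name keeps feature definitions consistent.
        if feature.name in self._features:
            raise ValueError(f"feature already registered: {feature.name}")
        self._features[feature.name] = feature

    def search(self, keyword: str) -> List[str]:
        # Discovery: find existing features by name or description.
        keyword = keyword.lower()
        return sorted(
            name
            for name, feature in self._features.items()
            if keyword in name.lower() or keyword in feature.description.lower()
        )

    def compute(self, names: List[str], record: dict) -> Dict[str, float]:
        # Reuse: any project computes features from the shared definitions.
        return {name: self._features[name].compute(record) for name in names}


registry = FeatureRegistry()
registry.register(FeatureDefinition(
    "debt_to_income",
    "Total debt divided by annual income",
    lambda r: r["debt"] / r["income"],
))
registry.register(FeatureDefinition(
    "income_per_dependent",
    "Annual income per household member",
    lambda r: r["income"] / (1 + r["dependents"]),
))

record = {"debt": 20000.0, "income": 80000.0, "dependents": 3}
features = registry.compute(["debt_to_income", "income_per_dependent"], record)
```

A second project that needs `debt_to_income` can find it with `registry.search("debt")` and reuse the same definition instead of re-implementing it.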


Cited from: https://unsplash.com/photos/0rTCXZM7Xfo

Time series forecasting is a well-studied branch of statistics and machine learning, and a common statistical task in business. In the real world, time-series data sometimes needs to be combined with other data sources to construct more powerful machine learning models.

In this article, I would like to summarize common ways to combine time-series data and tabular data to complete a machine learning project.

1. Problem analysis

Before diving into methods, let us review the scenarios where we have time-series data and tabular data at the same time.
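As one illustrative scenario (an assumption for this sketch, not necessarily one of the cases the article covers), each customer has a static tabular row plus a history of transaction amounts. A simple way to combine the two is to aggregate each series into summary statistics and attach them to the tabular row. The data and the `summarize_series` and `combine` helpers below are hypothetical.

```python
from statistics import mean

# Hypothetical static (tabular) attributes, keyed by customer id.
customers = {
    "c1": {"age": 34, "segment": "retail"},
    "c2": {"age": 51, "segment": "business"},
}

# Hypothetical time series: transaction amounts in chronological order.
transactions = {
    "c1": [120.0, 80.0, 100.0],
    "c2": [40.0, 60.0],
}


def summarize_series(values):
    """Collapse one time series into a few fixed-size summary features."""
    return {
        "ts_mean": mean(values),
        "ts_last": values[-1],
        "ts_count": len(values),
    }


def combine(static_rows, series_by_id):
    """Attach the time-series summaries to each tabular row."""
    combined = {}
    for key, row in static_rows.items():
        combined[key] = {**row, **summarize_series(series_by_id[key])}
    return combined


rows = combine(customers, transactions)
```

The resulting rows have a fixed width, so they can be fed to any tabular model, at the cost of discarding the temporal ordering beyond what the chosen summaries capture.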


Updates

  • 2020–02–21: changed the title and content from “5 methods” to “X methods” to scale for future updates. Added an introduction to hydra.
  • 2021–01–27: added gin config in the reference
  • 2021–06–01: added dynaconf in the reference

Introduction

Config management (to be clear, not software configuration management) for Python applications is a common but non-trivial task. A well-designed config management approach should support flexible parameter settings and dynamic loading. Here, I would like to summarize the frequently used configuration management strategies. Hopefully, you can find the most suitable method for your applications.

  1. Config file (JSON / YAML / INI / *.cfg).
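As a minimal sketch of the config-file approach, the snippet below reads the same settings from INI and JSON text using only the standard library (`configparser` and `json`). The `training` section, its keys, and the `load_ini` / `load_json` helpers are hypothetical; in a real application the text would come from a file on disk, and YAML would need a third-party parser.

```python
import configparser
import json

# Hypothetical settings, inlined as strings so the example is self-contained.
INI_TEXT = """
[training]
learning_rate = 0.01
epochs = 20
"""

JSON_TEXT = '{"training": {"learning_rate": 0.01, "epochs": 20}}'


def load_ini(text):
    """Parse INI text; values are strings, so convert them explicitly."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["training"]
    return {
        "learning_rate": section.getfloat("learning_rate"),
        "epochs": section.getint("epochs"),
    }


def load_json(text):
    """Parse JSON text; numeric types are preserved by the format itself."""
    return json.loads(text)["training"]


ini_config = load_ini(INI_TEXT)
json_config = load_json(JSON_TEXT)
```

One practical difference shows up here: INI values always arrive as strings and need explicit conversion, while JSON carries numbers and nested structures natively.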

Kevin Du

Data scientist & MLE & SWE
