tutorial 2025-03-05 13 min read

Feature Engineering: The Data Structures of Machine Learning

Feature engineering explained through a software engineering lens. Learn how to transform raw data into ML-ready features, with practical patterns for tabular, text, and categorical data.


Features Are Your API to the Model

In software, you design data structures to represent your domain. In ML, features are that data structure — the representation you hand to the model. Bad features make good models fail. Good features make simple models work.

Feature engineering is the craft of translating raw data into representations that make the signal easy for a model to find.

The Two Jobs of Feature Engineering

  1. Make implicit structure explicit: If day-of-week matters, encode it. Don't make the model discover the concept of "weekend" from raw timestamps.
  2. Scale and normalize: Most models are sensitive to the magnitude of inputs. Bring your features to a common scale.
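Both jobs in one small sketch (the dates and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-03-01", "2025-03-03", "2025-03-08"]),  # Sat, Mon, Sat
    "amount": [10.0, 250.0, 40.0],
})

# Job 1: make implicit structure explicit -- expose "weekend" directly
df["is_weekend"] = (df["timestamp"].dt.dayofweek >= 5).astype(int)

# Job 2: bring magnitudes to a common scale (here: standardization by hand)
df["amount_scaled"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()
```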

Numerical Features

Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

salaries = np.array([45000, 65000, 120000, 95000, 200000]).reshape(-1, 1)

# Standardization: mean=0, std=1 (use for most models)
scaler = StandardScaler()
standardized = scaler.fit_transform(salaries)
# [-1.11, -0.74, 0.28, -0.19, 1.76]

# Min-max: range [0, 1] (use for neural networks, image data)
minmax = MinMaxScaler()
normalized = minmax.fit_transform(salaries)
# [0.0, 0.13, 0.48, 0.32, 1.0]

When to use which:

  • StandardScaler: linear models, SVM, PCA, most algorithms
  • MinMaxScaler: neural networks, when you need bounded outputs
  • Neither: tree-based models (XGBoost, Random Forest) — they're invariant to monotonic transforms
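You can verify the tree-invariance claim yourself. A minimal sketch with scikit-learn: a decision tree fit on raw values and one fit on log-transformed values pick different thresholds but induce the same partition, so their predictions agree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Same data on two scales: raw and log-transformed (a monotonic transform)
X_raw = np.array([[1.0], [2.0], [3.0], [10.0], [100.0], [1000.0]])
X_log = np.log(X_raw)
y = np.array([0, 0, 0, 1, 1, 1])

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_raw, y)
tree_log = DecisionTreeClassifier(random_state=0).fit(X_log, y)

# Split thresholds differ, but the induced partition -- and predictions -- don't
same = (tree_raw.predict(X_raw) == tree_log.predict(X_log)).all()
```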

Log Transforms for Skewed Data

import pandas as pd

# Price data: heavily right-skewed
prices = pd.Series([10, 12, 11, 500, 15, 13, 1200, 9])

print(prices.skew())            # 2.15 — very skewed
print(np.log1p(prices).skew())  # 1.48 — much less skewed

Log transforms are essential for features like prices, counts, and durations. They compress outliers and make multiplicative relationships additive (which linear models prefer).

Binning Continuous Features

Sometimes discretizing a continuous feature outperforms using it raw:

ages = pd.Series([23, 35, 42, 19, 67, 28, 55, 31])

# Equal-width bins
pd.cut(ages, bins=3)
# (18.952, 35.0], (18.952, 35.0], (35.0, 51.0], (18.952, 35.0], (51.0, 67.0], ...

# Equal-frequency bins (quantile-based)
pd.qcut(ages, q=3)
# Ensures roughly equal number of samples per bin

# Custom bins with business logic
age_groups = pd.cut(ages, bins=[0, 25, 35, 50, 100],
                    labels=["Gen Z", "Millennial", "Gen X", "Boomer"])

Use binning when you believe the relationship is non-linear and you're using a linear model, or when you want to encode domain knowledge about thresholds.

Categorical Features

One-Hot Encoding

# Low-cardinality categoricals (< ~20 unique values)
df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})

encoded = pd.get_dummies(df["color"], prefix="color")
#    color_blue  color_green  color_red
# 0       False        False       True
# 1        True        False      False

Watch out for the dummy variable trap: with k categories, you only need k-1 columns (the last is implied). Use drop_first=True for linear models.
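With drop_first=True, the alphabetically first category becomes the implicit baseline:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})

full = pd.get_dummies(df["color"], prefix="color")                      # 3 columns
reduced = pd.get_dummies(df["color"], prefix="color", drop_first=True)  # 2 columns

# "color_blue" is dropped: a row of all zeros now means blue
```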

Ordinal Encoding

from sklearn.preprocessing import OrdinalEncoder

# When categories have a natural order
sizes = np.array([["Small"], ["Large"], ["Medium"], ["XL"]])

encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large", "XL"]])
encoder.fit_transform(sizes)
# [[0.], [2.], [1.], [3.]]

Target Encoding (High-Cardinality)

For features with many categories (zip codes, user IDs):

# Replace category with mean of target for that category
# Example: encode "city" using mean house price per city
city_means = df.groupby("city")["price"].mean()
df["city_encoded"] = df["city"].map(city_means)

Critical: always compute target encoding on training data only, then apply to validation/test. Leaking test targets into features is a common bug.
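A leakage-safe version of the pattern above, fit on the training split only (the data is illustrative; note the fallback for categories unseen in training):

```python
import pandas as pd

train = pd.DataFrame({"city": ["A", "A", "B", "B", "C"],
                      "price": [100, 200, 300, 500, 250]})
test = pd.DataFrame({"city": ["A", "B", "D"]})  # "D" never appeared in training

# Fit: compute per-city means on the training targets only
city_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

# Apply: map onto both splits; unseen categories fall back to the global mean
train["city_encoded"] = train["city"].map(city_means)
test["city_encoded"] = test["city"].map(city_means).fillna(global_mean)
```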

Datetime Features

Timestamps are packed with signal — extract it explicitly:

df["timestamp"] = pd.to_datetime(df["timestamp"])

# Extract components
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # 0=Monday
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["is_business_hours"] = df["hour"].between(9, 17).astype(int)  # between() is inclusive on both ends

# Cyclical encoding: hour 23 and hour 0 are adjacent
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

The cyclical encoding (sin/cos) is the key insight: raw hour values treat hour 23 and hour 0 as far apart, but they're adjacent. Sin/cos encoding preserves this cyclical structure.
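You can see the effect by comparing distances in the two representations:

```python
import numpy as np

def cyc(hour):
    """Map an hour to its (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Raw representation: hour 23 looks maximally far from hour 0
raw_dist = abs(23 - 0)  # 23

# Cyclical representation: 23 and 0 are nearest neighbors,
# while 12 and 0 sit on opposite sides of the circle
d_23_0 = np.linalg.norm(cyc(23) - cyc(0))  # small
d_12_0 = np.linalg.norm(cyc(12) - cyc(0))  # large
```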

Text Features

Bag of Words

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets"
]

vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
X = vectorizer.fit_transform(docs)
# Sparse matrix: rows=documents, cols=words, values=TF-IDF scores

TF-IDF weights rare, discriminative words higher and common words lower. It's simple and still competitive for many classification tasks.

Embeddings (Modern Approach)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)
# Shape: (3, 384) — each doc is a 384-dimensional vector

Embeddings capture semantic meaning and outperform bag-of-words when you have limited labeled data or need to handle synonyms and paraphrases.

Interaction Features

Sometimes the signal is in the combination of two features, not either alone:

# CTR depends on both ad quality AND user intent match
df["ctr_signal"] = df["ad_quality_score"] * df["query_relevance_score"]

# House price depends on both size AND neighborhood
df["sqft_x_desirability"] = df["house_sqft"] * df["area_desirability_index"]

# Polynomial features (for linear models)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_poly = poly.fit_transform(X[["feature_a", "feature_b"]])

When to create interaction features: when you have domain knowledge that two features interact, and you're using a linear model (which can't discover interactions on its own). Tree-based models handle interactions automatically.
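A quick check of that claim on synthetic data where the target is purely the product of two features: a linear model on the raw features explains almost nothing, while adding the interaction term makes the fit exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, 500)
b = rng.uniform(-1, 1, 500)
y = a * b  # the signal lives entirely in the interaction

X_plain = np.column_stack([a, b])
X_inter = np.column_stack([a, b, a * b])

r2_plain = LinearRegression().fit(X_plain, y).score(X_plain, y)  # near 0
r2_inter = LinearRegression().fit(X_inter, y).score(X_inter, y)  # essentially 1
```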

The Feature Engineering Checklist

Before training any model, ask:

  • Are numerical features on comparable scales?
  • Are categorical features with many unique values encoded appropriately?
  • Are timestamps decomposed into meaningful components?
  • Is any signal buried in string fields that needs extraction?
  • Are there any features you could compute from existing features that encode domain logic?
  • Have you checked for data leakage (test data influencing feature computation)?
  • Are missing values handled explicitly?
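The first and last checklist items lend themselves to an automated pass. A rough sketch (the 100x range threshold is an arbitrary choice, not a standard):

```python
import pandas as pd

def audit_features(df: pd.DataFrame) -> dict:
    """Flag numeric columns on wildly different scales and columns with missing values."""
    numeric = df.select_dtypes("number")
    ranges = numeric.max() - numeric.min()
    return {
        # Columns whose range dwarfs the median range (arbitrary 100x threshold)
        "scale_outliers": ranges[ranges > 100 * ranges.median()].index.tolist(),
        "has_missing": df.columns[df.isna().any()].tolist(),
    }

df = pd.DataFrame({"age": [25, 40, 31],
                   "salary": [45000, None, 120000],
                   "score": [0.1, 0.5, 0.9]})
report = audit_features(df)
# {'scale_outliers': ['salary'], 'has_missing': ['salary']}
```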

A Note on Feature Importance

After training, always inspect which features your model uses most:

import xgboost as xgb
import pandas as pd

model = xgb.XGBClassifier()
model.fit(X_train, y_train)

importance = pd.Series(model.feature_importances_, index=feature_names)
importance.sort_values().plot(kind="barh")

If a feature you expected to be important isn't, either it doesn't matter, it's correlated with another feature, or your encoding is wrong.


For the next step, learn how features flow through a production system in our guide to feature stores.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.