Eric Jin
Python · Scikit-learn · Logistic Regression · Random Forest · EDA

Conversion Rate Optimization

ML model that identifies the top 10% of users, who account for 95% of conversions

Overview

Analyzed e-commerce user behavior data across 300K+ sessions to identify the drivers of conversion and built a classification pipeline that ranks users by purchase probability. The ranking supports targeted marketing campaigns with measurable ROI: at a $1 outreach cost and $40 profit per conversion, contacting a user breaks even at a predicted conversion probability of just 2.5% ($1 / $40).
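
The break-even figure above follows from setting expected profit per contacted user to zero. A minimal sketch of that arithmetic, assuming a flat cost per outreach and a fixed profit per conversion (the function names are illustrative, not taken from the project code):

```python
def break_even_probability(cost_per_contact, profit_per_conversion):
    """Conversion probability at which contacting a user has zero expected profit."""
    return cost_per_contact / profit_per_conversion


def expected_profit(p_convert, cost_per_contact=1.0, profit_per_conversion=40.0):
    """Expected profit from contacting one user with predicted probability p_convert."""
    return p_convert * profit_per_conversion - cost_per_contact


# Figures quoted in the overview: $1 outreach cost, $40 profit per conversion.
threshold = break_even_probability(1.0, 40.0)  # 0.025, i.e. 2.5%
```

Any user scored above `threshold` has positive expected profit under these assumptions, so the campaign rule reduces to contacting everyone whose predicted probability exceeds 2.5%.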

Methods

  • EDA across traffic source (Ads, SEO, Direct), country, age, and session engagement segments
  • Logistic Regression with one-hot encoding of categorical features
  • Random Forest (200 estimators) as a nonlinear comparison model
  • Lift curve and decile analysis to quantify campaign targeting efficiency
  • Feature importance via logistic coefficients and Random Forest impurity scores
  • Secondary model excluding session engagement to isolate static user signals
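
The two-model comparison can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, and the column names (`source`, `country`, `age`, `new_user`, `pages_visited`, `converted`), the generating rule, and the train/test split settings are all hypothetical. Only the one-hot encoding, the Logistic Regression model, and the 200-estimator Random Forest come from the methods listed above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical columns and generating rule).
rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    "source": rng.choice(["Ads", "SEO", "Direct"], size=n),
    "country": rng.choice(["US", "UK", "Germany"], size=n),
    "age": rng.integers(17, 70, size=n),
    "new_user": rng.integers(0, 2, size=n),
    "pages_visited": rng.poisson(5, size=n),
})
log_odds = -6.0 + 0.6 * df["pages_visited"] - 1.0 * df["new_user"]
df["converted"] = (rng.random(n) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

categorical = ["source", "country"]
numeric = ["age", "new_user", "pages_visited"]
X, y = df[categorical + numeric], df["converted"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def build_pipeline(model):
    """One-hot encode the categorical columns, pass numeric columns through unchanged."""
    encode = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )
    return Pipeline([("encode", encode), ("model", model)])


models = {
    "logreg": build_pipeline(LogisticRegression(max_iter=1000)),
    "forest": build_pipeline(RandomForestClassifier(n_estimators=200, random_state=0)),
}

# Fit each model and score it by ROC AUC on the held-out split.
auc = {}
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    auc[name] = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
```

Dropping `pages_visited` from `numeric` and refitting would give the analogue of the secondary model that excludes session engagement. The AUC values from this synthetic data are not comparable to the project's reported results.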

Key Findings

  • Session engagement (pages visited) accounts for ~90% of Random Forest feature importance — the single most predictive behavioral signal
  • Returning users convert at 7.2% vs. 1.4% for new users — a 5× difference
  • Top 10% of users scored by the model capture 95% of all conversions
  • Germany and UK users convert at 6.3% and 5.3%, nearly double the US rate
  • Without session engagement, model AUC drops from 0.986 to 0.82 — confirming it as the dominant predictor
  • The classes are nearly linearly separable in log-odds space, which likely explains why Logistic Regression edges out Random Forest (AUC 0.986 vs. 0.975)

Results

ROC AUC: 0.986 (Logistic Regression), 0.975 (Random Forest)

Precision 0.85, Recall 0.69 on converters at default threshold
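
For reference, precision and recall at a fixed threshold can be computed from scores as sketched below. The 0.5 cutoff is an assumption about what "default threshold" means here (it matches how scikit-learn classifiers typically turn probabilities into labels), and the function name is illustrative.

```python
def precision_recall(y_true, scores, threshold=0.5):
    """Precision and recall for the positive class (converters) at a score cutoff."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1  # correctly flagged converter
        elif predicted_positive:
            fp += 1  # flagged, but did not convert
        elif label == 1:
            fn += 1  # converter the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Lowering the threshold trades precision for recall, which is the relevant direction for a campaign that breaks even at a 2.5% predicted probability.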

Top decile conversion rate: 30.3% vs. 3.2% overall average — a 9.5× lift
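
A decile analysis of this kind can be sketched in plain Python as below. The function and field names are illustrative, and the sketch assumes at least one converter and at least as many users as bins.

```python
def decile_table(y_true, scores, n_bins=10):
    """Rank users by score (highest first) and summarize conversions per bin."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_conversions = sum(label for _, label in ranked)
    overall_rate = total_conversions / total

    rows = []
    captured = 0
    for b in range(n_bins):
        start = b * total // n_bins
        end = (b + 1) * total // n_bins
        conversions = sum(label for _, label in ranked[start:end])
        captured += conversions
        rate = conversions / (end - start)
        rows.append({
            "decile": b + 1,
            "conversion_rate": rate,
            "lift": rate / overall_rate,  # bin rate relative to the overall average
            "cumulative_capture": captured / total_conversions,
        })
    return rows
```

With equal-sized bins, the share of conversions captured by the top decile equals 10% times its lift. For the reported figures, 30.3% / 3.2% ≈ 9.47, and 0.10 × 9.47 ≈ 0.947, which agrees with the roughly 95% capture stated in the key findings.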

Targeted campaigns are viable at a predicted probability of ≥2.5%, far below the scores the model assigns to high-confidence users
