Eric Jin
Python, Scikit-learn, Logistic Regression, Random Forest, EDA

Conversion Rate Optimization

ML model that identifies the top 10% of users, who account for 95% of conversions

Overview

Analyzed e-commerce user behavior data across 300K+ sessions to identify the drivers of conversion and built a classification pipeline that ranks users by purchase probability. The model enables targeted marketing campaigns with measurable ROI: at a $1 outreach cost and $40 profit per conversion, contacting a user breaks even once the predicted conversion probability reaches 2.5% (1/40).
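The break-even figure comes from setting the expected profit of one outreach to zero: p × $40 − $1 = 0, so p = 1/40 = 2.5%. A minimal sketch of that arithmetic follows; the function names are hypothetical and only the $1 and $40 figures come from the project.

```python
def break_even_probability(outreach_cost: float, profit_per_conversion: float) -> float:
    """Smallest conversion probability at which one outreach has
    non-negative expected profit: p * profit - cost >= 0."""
    if profit_per_conversion <= 0:
        raise ValueError("profit_per_conversion must be positive")
    return outreach_cost / profit_per_conversion


def expected_profit(p: float, outreach_cost: float, profit_per_conversion: float) -> float:
    """Expected profit of contacting one user with conversion probability p."""
    return p * profit_per_conversion - outreach_cost


threshold = break_even_probability(outreach_cost=1.0, profit_per_conversion=40.0)
print(f"break-even probability: {threshold:.3f}")
print(f"expected profit at p=0.30: ${expected_profit(0.30, 1.0, 40.0):.2f}")
```

Any user scored above the threshold has positive expected value, so the cutoff depends only on the cost-to-profit ratio, not on the model.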

Methods

  • EDA across traffic source (Ads, SEO, Direct), country, age, and session engagement segments
  • Logistic Regression with one-hot encoding of categorical features
  • Random Forest (200 estimators) as a nonlinear comparison model
  • Lift curve and decile analysis to quantify campaign targeting efficiency
  • Feature importance via logistic coefficients and Random Forest impurity scores
  • Secondary model excluding session engagement to isolate static user signals
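The modeling steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the project's actual code: the column names (`country`, `source`, `age`, `new_user`, `pages_visited`, `converted`), the synthetic data, and the `build_model` helper are all hypothetical stand-ins. Only the one-hot encoding, the 200-tree Random Forest, and the secondary model without session engagement are taken from the list.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; real column names and distributions are unknown.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "country": rng.choice(["US", "UK", "Germany"], size=n),
    "source": rng.choice(["Ads", "SEO", "Direct"], size=n),
    "age": rng.integers(17, 70, size=n),
    "new_user": rng.integers(0, 2, size=n),
    "pages_visited": rng.poisson(5, size=n),
})
# Assumption: conversion is driven mostly by pages visited.
logit = -8.0 + 0.8 * df["pages_visited"] - 1.0 * df["new_user"]
df["converted"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

categorical = ["country", "source"]
all_numeric = ["age", "new_user", "pages_visited"]
static_numeric = ["age", "new_user"]  # excludes session engagement


def build_model(estimator, numeric_cols):
    """One-hot encode categoricals, pass the chosen numeric columns through,
    and drop every other column (ColumnTransformer's default remainder)."""
    preprocess = ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("numeric", "passthrough", numeric_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("classifier", estimator)])


X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="converted"), df["converted"],
    test_size=0.3, stratify=df["converted"], random_state=0,
)

models = {
    "logreg": build_model(LogisticRegression(max_iter=1000), all_numeric),
    "random_forest": build_model(
        RandomForestClassifier(n_estimators=200, random_state=0), all_numeric),
    "logreg_static": build_model(LogisticRegression(max_iter=1000), static_numeric),
}

aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: ROC AUC = {aucs[name]:.3f}")
```

Keeping preprocessing inside the `Pipeline` means the encoder is fitted on training data only, which avoids leaking test-set categories into the model.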

Key Findings

  • Session engagement (pages visited) accounts for ~90% of Random Forest feature importance, making it the single most predictive behavioral signal
  • Returning users convert at 7.2% vs. 1.4% for new users, roughly a 5× difference
  • Top 10% of users scored by the model capture 95% of all conversions
  • Germany and UK users convert at 6.3% and 5.3%, nearly double the US rate
  • Without session engagement, model AUC drops from 0.986 to 0.82, confirming it as the dominant predictor
  • The decision problem is near-linearly separable in log-odds space, which plausibly explains why Logistic Regression (AUC 0.986) edges out Random Forest (AUC 0.975)

Results

ROC AUC: 0.986 (Logistic Regression), 0.975 (Random Forest)

Precision 0.85 and recall 0.69 on converters at the default threshold
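Precision and recall depend on where the probability cutoff is placed. The sketch below uses a made-up toy sample (not project data) and a hypothetical helper, `precision_recall_at`, to show how moving from a 0.5 cutoff to the 2.5% break-even cutoff trades precision for recall. It assumes "default threshold" means a 0.5 cutoff, as in scikit-learn's `predict` for binary classifiers.

```python
def precision_recall_at(y_true, probs, threshold=0.5):
    """Precision and recall for the positive class when users with
    probability >= threshold are labeled as converters."""
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for y, hit in zip(y_true, predicted) if y == 1 and hit)
    fp = sum(1 for y, hit in zip(y_true, predicted) if y == 0 and hit)
    fn = sum(1 for y, hit in zip(y_true, predicted) if y == 1 and not hit)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and scores, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.90, 0.60, 0.10, 0.70, 0.04, 0.02, 0.01, 0.01]

for cutoff in (0.5, 0.025):
    precision, recall = precision_recall_at(y_true, probs, cutoff)
    print(f"cutoff={cutoff}: precision={precision:.2f}, recall={recall:.2f}")
```

For campaign targeting, the lower cutoff is the economically relevant one: a false positive costs only the outreach fee, while a missed converter forfeits the full profit.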

Top decile conversion rate: 30.3% vs. 3.2% overall average, a 9.5× lift
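Lift is the decile's conversion rate divided by the overall rate (30.3 / 3.2 ≈ 9.5). Because the top decile holds 10% of users, a 9.5× lift also means it captures about 95% of all conversions. A minimal sketch of a decile table follows; the `decile_table` helper and the toy data are hypothetical, not taken from the project.

```python
def decile_table(y_true, scores, n_bins=10):
    """Rank users by score (highest first), split them into n_bins
    near-equal groups, and report per-group conversion rate, lift over
    the base rate, and cumulative share of conversions captured."""
    n = len(y_true)
    if n != len(scores) or n < n_bins:
        raise ValueError("need equal-length inputs with at least n_bins rows")
    total_conversions = sum(y_true)
    if total_conversions == 0:
        raise ValueError("lift is undefined when there are no conversions")

    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    ranked = [y_true[i] for i in order]
    base_rate = total_conversions / n

    rows, captured = [], 0
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        conversions = sum(chunk)
        captured += conversions
        rate = conversions / len(chunk)
        rows.append({
            "decile": b + 1,
            "rate": rate,
            "lift": rate / base_rate,
            "cum_capture": captured / total_conversions,
        })
    return rows


# Toy data: 20 users, the 2 converters have the highest scores.
y_true = [1, 1] + [0] * 18
scores = [0.95, 0.90] + [0.5 - 0.02 * i for i in range(18)]

rows = decile_table(y_true, scores)
top = rows[0]
print(f"top decile: rate={top['rate']:.2f}, lift={top['lift']:.1f}, "
      f"captured={top['cum_capture']:.0%}")
```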

Targeted campaigns are viable at a predicted probability of ≥2.5%, far below the scores the model assigns to high-confidence users