Python · Scikit-learn · Logistic Regression · Random Forest · EDA
Conversion Rate Optimization
ML model that identifies the top 10% of users, who account for 95% of conversions
Overview
Analyzed e-commerce user behavior data across 300K+ sessions to identify the drivers of conversion and built a classification pipeline that ranks users by purchase probability. The model enables targeted marketing campaigns with measurable ROI: at a $1 outreach cost and $40 profit per conversion, contacting a user breaks even once their predicted conversion probability reaches 2.5% ($1 / $40).
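The break-even arithmetic above can be written out as a small sketch. The function names are illustrative, not taken from the project code, and the model assumes each contacted user costs a flat amount and yields a fixed profit on conversion.

```python
def break_even_probability(cost_per_contact: float, profit_per_conversion: float) -> float:
    """Smallest conversion probability at which contacting a user pays for itself."""
    if profit_per_conversion <= 0:
        raise ValueError("profit_per_conversion must be positive")
    return cost_per_contact / profit_per_conversion


def expected_profit(p_convert: float, cost_per_contact: float = 1.0,
                    profit_per_conversion: float = 40.0) -> float:
    """Expected profit from contacting one user with conversion probability p_convert."""
    return p_convert * profit_per_conversion - cost_per_contact


# $1 outreach cost and $40 profit per conversion, as stated in the overview.
threshold = break_even_probability(1.0, 40.0)
print(f"break-even probability: {threshold:.3f}")
```

Any user scored above this threshold has positive expected profit under these assumptions; users below it cost more to reach than they return on average.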
Methods
- EDA across traffic source (Ads, SEO, Direct), country, age, and session engagement segments
- Logistic Regression with one-hot encoding of categorical features
- Random Forest (200 estimators) as a nonlinear comparison model
- Lift curve and decile analysis to quantify campaign targeting efficiency
- Feature importance via logistic coefficients and Random Forest impurity scores
- Secondary model excluding session engagement to isolate static user signals
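A minimal sketch of how the two models listed above might be set up and compared in scikit-learn. The data here is synthetic and the column names (`source`, `country`, `age`, `new_user`, `pages_visited`) are assumptions for illustration; this is not the project's actual code or dataset, and the resulting scores will not match the reported ones.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in data; column names and distributions are hypothetical.
df = pd.DataFrame({
    "source": rng.choice(["Ads", "SEO", "Direct"], size=n),
    "country": rng.choice(["US", "UK", "Germany"], size=n),
    "age": rng.integers(18, 65, size=n),
    "new_user": rng.integers(0, 2, size=n),
    "pages_visited": rng.poisson(5, size=n),
})

# Toy label: conversion odds rise with pages visited and fall for new users.
logit = (-6.0 + 0.6 * df["pages_visited"] - 1.0 * df["new_user"]).to_numpy()
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

categorical = ["source", "country"]


def build_pipeline(model):
    """One-hot encode categorical columns, pass numeric columns through unchanged."""
    encode = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )
    return Pipeline([("encode", encode), ("model", model)])


X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

aucs = {}
for name, model in models.items():
    pipe = build_pipeline(model).fit(X_train, y_train)
    # Score with the predicted probability of the positive (converted) class.
    aucs[name] = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])

print(aucs)
```

The secondary model described in the last bullet could be approximated the same way by dropping the engagement column (here `pages_visited`) from the frame before fitting.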
Key Findings
- Session engagement (pages visited) accounts for ~90% of Random Forest feature importance — the single most predictive behavioral signal
- Returning users convert at 7.2% vs. 1.4% for new users — a 5× difference
- Top 10% of users scored by the model capture 95% of all conversions
- Germany and UK users convert at 6.3% and 5.3%, nearly double the US rate
- Without session engagement, model AUC drops from 0.986 to 0.82 — confirming it as the dominant predictor
- The decision problem appears near-linearly separable in log-odds space, which is consistent with Logistic Regression (AUC 0.986) edging out Random Forest (AUC 0.975)
Results
ROC AUC: 0.986 (Logistic Regression), 0.975 (Random Forest)
Precision 0.85, Recall 0.69 on converters at default threshold
Top decile conversion rate: 30.3% vs. 3.2% overall average — a 9.5× lift
Targeted campaigns break even at a predicted probability of 2.5% or higher, well below the 30.3% conversion rate observed in the top decile
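For readers unfamiliar with decile analysis, the sketch below shows one way lift and cumulative capture can be computed from model scores. It uses a tiny hand-made example, not the project's data, and the function name and output keys are illustrative. As a check on the reported figures, 30.3% / 3.2% is about 9.5, and a top decile converting at 9.5 times the average rate holds about 95% of all conversions.

```python
def decile_table(scores, labels, n_bins=10):
    """Rank users by score (highest first), split into equal bins, and report
    each bin's conversion rate, lift over the overall rate, and the cumulative
    share of all conversions captured so far."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    overall_rate = total_pos / len(labels)

    rows = []
    captured = 0
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        idx = order[start:end]
        pos = sum(labels[i] for i in idx)
        captured += pos
        rate = pos / len(idx)
        rows.append({
            "decile": b + 1,
            "rate": rate,
            "lift": rate / overall_rate,
            "cumulative_capture": captured / total_pos,
        })
    return rows


# Toy example: 40 users, 4 converters, 3 of them among the 4 highest scores.
scores = [i / 40 for i in range(40)]
labels = [1 if i in (39, 38, 37, 35) else 0 for i in range(40)]

rows = decile_table(scores, labels)
for row in rows[:2]:
    print(row)
```

In this toy case the top decile converts at 75% against a 10% average, so its lift is 7.5 and it captures 75% of conversions; the second decile brings cumulative capture to 100%.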
Eric Jin