Python · NLP · TF-IDF · Word2Vec · LinearSVC · Logistic Regression
Amazon Review Sentiment Analysis
A 94.3%-accuracy NLP pipeline classifying 568K+ Amazon product reviews
Overview
Built an end-to-end NLP pipeline to classify sentiment in 568K+ Amazon product reviews. Benchmarked TF-IDF and Word2Vec feature representations across Logistic Regression, Linear SVC, and Random Forest classifiers, analyzing the trade-offs among overall accuracy, class balance, and minority-class recall on a dataset with a 3.6:1 class imbalance.
Methods
- Custom text cleaning pipeline: HTML unescaping, URL removal, lowercasing, punctuation stripping, negation-preserving stopword removal, Porter stemming
- TF-IDF vectorization with unigrams and bigrams (min_df=2, max_df=0.95) producing a 1.37M-feature sparse matrix
- Word2Vec skip-gram embeddings (300 dimensions, window=5) trained on the full corpus
- Document vectors via mean-pooling of word embeddings
- Logistic Regression (standard and class_weight='balanced'), Linear SVC, Random Forest with class balancing
- Class imbalance analysis across 443K positive vs. 124K negative reviews
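The cleaning steps above can be sketched with the standard library. This is a minimal illustration: the stopword list is a tiny made-up subset (the real pipeline uses a full list, likewise keeping negation words out of it), and Porter stemming is omitted.

```python
import html
import re

# Tiny illustrative stopword list; negation words are deliberately excluded
# from removal so that sentiment-flipping context survives cleaning.
STOPWORDS = {"the", "a", "an", "is", "was", "this", "it", "and", "but", "at", "all"}
NEGATIONS = {"not", "no", "never", "nor"}

def clean_review(text: str) -> str:
    text = html.unescape(text)                 # &amp; -> &
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # strip punctuation and digits
    tokens = [t for t in text.split()
              if t in NEGATIONS or t not in STOPWORDS]
    return " ".join(tokens)

print(clean_review("This is &amp; was NOT a good buy! http://x.co/ab"))
```

Note how "not" survives even though it is a classic stopword: dropping it would turn "not good" into "good" before the vectorizer ever sees the text.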
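The TF-IDF + classifier setup can be sketched as below. The documents and labels here are a toy stand-in, not the Amazon data, and the min_df/max_df thresholds are relaxed because the corpus is tiny; the vectorizer and class_weight settings mirror the ones listed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the cleaned review corpus (illustrative only).
docs = [
    "good product highly recommend",   # pos
    "great quality good value",        # pos
    "highly recommend love it",        # pos
    "not good at all",                 # neg
    "bad quality broke fast",          # neg
    "not good would not buy again",    # neg
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Unigrams + bigrams, as in the project; the real run uses
# min_df=2, max_df=0.95 over the 568K-review corpus.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vec.fit_transform(docs)

# class_weight="balanced" reweights classes to offset the
# 3.6:1 positive/negative skew in the real data.
clf = LinearSVC(class_weight="balanced")
clf.fit(X, labels)

# The bigram "not good" becomes a feature unigrams alone would miss.
print("not good" in vec.vocabulary_)
```

Including bigrams is what lets a linear model key on phrase-level features such as "not good" rather than on the misleading unigram "good" alone.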
Key Findings
- Linear SVC with TF-IDF achieves the best overall performance (94.3% accuracy, macro F1: 0.91)
- Class balancing improves negative recall from 0.76 to 0.90, at the cost of ~1.1% overall accuracy
- Averaged Word2Vec embeddings lose sentiment polarity information, capping Random Forest at 90.2% accuracy
- TF-IDF bigrams capture phrase-level sentiment signals ('not good', 'highly recommend') that unigrams miss
- Mean-pooling neutralizes negation context — the core limitation of dense embedding approaches for sentiment
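The mean-pooling limitation can be seen with a toy example. The 2-d vectors below are hypothetical, not trained embeddings; dimension 0 stands in for sentiment polarity.

```python
# Hypothetical 2-d "embeddings"; dim 0 loosely encodes polarity.
emb = {
    "good":  [1.0, 0.2],
    "not":   [0.0, -0.3],
    "movie": [0.1, 0.0],
}

def doc_vector(tokens):
    """Mean-pool word vectors into a single document vector."""
    vecs = [emb[t] for t in tokens if t in emb]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

# "good movie" and "not good movie" land on the same (positive) side of
# dim 0: averaging merely dilutes "not" instead of flipping the polarity.
print(doc_vector(["good", "movie"])[0])         # positive
print(doc_vector(["not", "good", "movie"])[0])  # still positive
```

A linear classifier over these pooled vectors sees both documents as positive, which is exactly the failure mode that caps the Word2Vec models' negative-class recall.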
Results
Best model: LinearSVC — 94.3% accuracy, macro F1: 0.91 across 113K test samples
Negative recall: 0.83 (standard LR) → 0.90 (balanced LR) with minimal precision loss
TF-IDF + LinearSVC outperforms Word2Vec + RandomForest by 4.1% accuracy
Corpus: 568K reviews, 1.37M TF-IDF features, 300-dimensional Word2Vec embeddings
Eric Jin