Python · NLP · TF-IDF · Word2Vec · LinearSVC · Logistic Regression
Amazon Review Sentiment Analysis
A 94.3%-accuracy NLP pipeline classifying 568K+ Amazon product reviews
Overview
Built an end-to-end NLP pipeline to classify sentiment in 568K+ Amazon product reviews. Benchmarked TF-IDF and Word2Vec feature representations across Logistic Regression, Linear SVC, and Random Forest classifiers, analyzing the trade-offs among overall accuracy, class balance, and minority-class recall on a dataset with a 3.6:1 class imbalance.
Methods
- Custom text cleaning pipeline: HTML unescaping, URL removal, lowercasing, punctuation stripping, negation-preserving stopword removal, Porter stemming
- TF-IDF vectorization with unigrams and bigrams (min_df=2, max_df=0.95) producing a 1.37M-feature sparse matrix
- Word2Vec skip-gram embeddings (300 dimensions, window=5) trained on the full corpus
- Document vectors via mean-pooling of word embeddings
- Logistic Regression (standard and class_weight='balanced'), Linear SVC, Random Forest with class balancing
- Class imbalance analysis across 443K positive vs. 124K negative reviews
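The cleaning steps above can be sketched with the standard library. This is a minimal illustration: the stopword list is a tiny made-up subset (the real pipeline uses a full list, likewise keeping negation words out of it), and Porter stemming is omitted.

```python
import html
import re

# Tiny illustrative stopword list; negation words are deliberately excluded
# from removal so that sentiment-flipping context survives cleaning.
STOPWORDS = {"the", "a", "an", "is", "was", "this", "it", "and", "but", "at", "all"}
NEGATIONS = {"not", "no", "never", "nor"}

def clean_review(text: str) -> str:
    text = html.unescape(text)                 # &amp; -> &
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # strip punctuation and digits
    tokens = [t for t in text.split()
              if t in NEGATIONS or t not in STOPWORDS]
    return " ".join(tokens)

print(clean_review("This is &amp; was NOT a good buy! http://x.co/ab"))
```

Note how "not" survives even though it is a classic stopword: dropping it would turn "not good" into "good" before the vectorizer ever sees the text.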
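The TF-IDF + classifier setup can be sketched as below. The documents and labels here are a toy stand-in, not the Amazon data, and the min_df/max_df thresholds are relaxed because the corpus is tiny; the vectorizer and class_weight settings mirror the ones listed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the cleaned review corpus (illustrative only).
docs = [
    "good product highly recommend",   # pos
    "great quality good value",        # pos
    "highly recommend love it",        # pos
    "not good at all",                 # neg
    "bad quality broke fast",          # neg
    "not good would not buy again",    # neg
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Unigrams + bigrams, as in the project; the real run uses
# min_df=2, max_df=0.95 over the 568K-review corpus.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vec.fit_transform(docs)

# class_weight="balanced" reweights classes to offset the
# 3.6:1 positive/negative skew in the real data.
clf = LinearSVC(class_weight="balanced")
clf.fit(X, labels)

# The bigram "not good" becomes a feature unigrams alone would miss.
print("not good" in vec.vocabulary_)
```

Including bigrams is what lets a linear model key on phrase-level features such as "not good" rather than on the misleading unigram "good" alone.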
Key Findings
- Linear SVC with TF-IDF achieves the best overall performance (94.3% accuracy, macro F1: 0.91)
- Class balancing improves negative recall from 0.76 to 0.90, at the cost of ~1.1% overall accuracy
- Averaged Word2Vec embeddings lose sentiment polarity information, capping Random Forest at 90.2% accuracy
- TF-IDF bigrams capture phrase-level sentiment signals ('not good', 'highly recommend') that unigrams miss
- Mean-pooling neutralizes negation context — the core limitation of dense embedding approaches for sentiment
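The mean-pooling limitation can be seen with a toy example. The 2-d vectors below are hypothetical, not trained embeddings; dimension 0 stands in for sentiment polarity.

```python
# Hypothetical 2-d "embeddings"; dim 0 loosely encodes polarity.
emb = {
    "good":  [1.0, 0.2],
    "not":   [0.0, -0.3],
    "movie": [0.1, 0.0],
}

def doc_vector(tokens):
    """Mean-pool word vectors into a single document vector."""
    vecs = [emb[t] for t in tokens if t in emb]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

# "good movie" and "not good movie" land on the same (positive) side of
# dim 0: averaging merely dilutes "not" instead of flipping the polarity.
print(doc_vector(["good", "movie"])[0])         # positive
print(doc_vector(["not", "good", "movie"])[0])  # still positive
```

A linear classifier over these pooled vectors sees both documents as positive, which is exactly the failure mode that caps the Word2Vec models' negative-class recall.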
Results
Best model: LinearSVC — 94.3% accuracy, macro F1: 0.91 across 113K test samples
Negative recall: 0.83 (standard LR) → 0.90 (balanced LR) with minimal precision loss
TF-IDF + LinearSVC outperforms Word2Vec + RandomForest by 4.1% accuracy
Corpus: 568K reviews, 1.37M TF-IDF features, 300-dimensional Word2Vec embeddings
Eric Jin