Eric Jin
Python, NLP, TF-IDF, Word2Vec, LinearSVC, Logistic Regression

Amazon Review Sentiment Analysis

NLP pipeline that classifies 568K+ Amazon product reviews with 94.3% accuracy

Overview

Built an end-to-end NLP pipeline to classify sentiment in 568K+ Amazon product reviews. Benchmarked TF-IDF and Word2Vec feature representations with Logistic Regression, Linear SVC, and Random Forest classifiers, and analyzed the trade-offs between overall accuracy, class balance, and minority-class recall on a dataset with a 3.6:1 positive-to-negative class imbalance.

Methods

  • Custom text cleaning pipeline: HTML unescaping, URL removal, lowercasing, punctuation stripping, negation-preserving stopword removal, Porter stemming
  • TF-IDF vectorization with unigrams and bigrams (min_df=2, max_df=0.95) producing a 1.37M-feature sparse matrix
  • Word2Vec skip-gram embeddings (300 dimensions, window=5) trained on the full corpus
  • Document vectors via mean-pooling of word embeddings
  • Logistic Regression (standard and class_weight='balanced'), Linear SVC, Random Forest with class balancing
  • Class imbalance analysis across 443K positive vs. 124K negative reviews
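
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_review`, the tiny stopword and negation sets, and the URL regular expression are all assumptions. Stemming is left as a pluggable callable (for example NLTK's `PorterStemmer().stem` could be passed in) so the sketch runs with only the standard library.

```python
import html
import re
import string

# Hypothetical, deliberately tiny word lists; a real pipeline would use a fuller stopword list.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "of", "to", "was", "i", "not", "no"}
NEGATIONS = {"not", "no", "never", "nor"}  # always kept, even when they are also stopwords

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text, stem=lambda word: word):
    """Apply the cleaning steps in order; `stem` defaults to a no-op."""
    text = html.unescape(text)         # "&amp;" -> "&"
    text = URL_RE.sub(" ", text)       # remove URLs before punctuation is stripped
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    tokens = [t for t in text.split() if t in NEGATIONS or t not in STOPWORDS]
    return " ".join(stem(t) for t in tokens)


cleaned = clean_review("This is NOT a good product &amp; it broke! See http://example.com/x")
print(cleaned)
```

URLs are removed before punctuation is stripped, since stripping first would break them into ordinary-looking tokens. Contractions such as "don't" are split by the punctuation step here; handling them is outside this sketch.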

Key Findings

  • Linear SVC with TF-IDF achieves the best overall performance (94.3% accuracy, macro F1: 0.91)
  • Class balancing improves negative recall from 0.76 to 0.90, at the cost of ~1.1% overall accuracy
  • Averaged Word2Vec embeddings lose sentiment polarity information, capping Random Forest at 90.2% accuracy
  • TF-IDF bigrams capture phrase-level sentiment signals ('not good', 'highly recommend') that unigrams miss
  • Mean-pooling neutralizes negation context, which is the core limitation of averaged dense embeddings for sentiment
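
The last two findings can be illustrated with a toy example. Averaging is order-invariant, so two reviews containing the same words in a different order get the same document vector, while their bigram sets differ. The 2-dimensional vectors below are made up for illustration and are not trained embeddings; `mean_pool` and `bigrams` are hypothetical helper names.

```python
# Hypothetical 2-d "embeddings", chosen only to make the arithmetic easy to follow.
TOY_VECTORS = {
    "not": (0.0, 1.0),
    "good": (1.0, 0.0),
    "bad": (-1.0, 0.0),
    "but": (0.0, 0.0),
}


def mean_pool(tokens, vectors):
    """Average the vectors of all known tokens, component by component."""
    dims = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return tuple(0.0 for _ in range(dims))
    return tuple(sum(v[d] for v in known) / len(known) for d in range(dims))


def bigrams(tokens):
    """Return the set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))


review_a = "not good but bad".split()
review_b = "not bad but good".split()

# Same words, different order: the pooled vectors are identical.
same_vector = mean_pool(review_a, TOY_VECTORS) == mean_pool(review_b, TOY_VECTORS)
# The bigram features still tell the two reviews apart.
different_bigrams = bigrams(review_a) != bigrams(review_b)
print(same_vector, different_bigrams)
```

Only the order-invariance is demonstrated here. How much polarity a trained embedding retains after pooling is an empirical question that this sketch does not address.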

Results

Best model: LinearSVC, with 94.3% accuracy and macro F1 of 0.91 on 113K test samples
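
A TF-IDF plus LinearSVC setup of this kind might look like the following in scikit-learn. The `ngram_range`, `min_df`, and `max_df` values come from the Methods list. The toy corpus, its labels, and the default `LinearSVC` settings are assumptions made only so the sketch runs, and say nothing about the reported scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up, already-cleaned reviews; 1 = positive, 0 = negative.
docs = [
    "highly recommend great product",
    "great product highly recommend",
    "not good waste of money",
    "waste of money not good",
    "great taste highly recommend",
    "not good bad taste",
    "great product great taste",
    "bad product waste of money",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

model = make_pipeline(
    # Unigrams and bigrams; drop terms seen in fewer than 2 documents or in more than 95% of them.
    TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.95),
    LinearSVC(),  # default hyperparameters, assumed for this sketch
)
model.fit(docs, labels)

# Bigrams such as "not good" become features of their own.
vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
predictions = model.predict(["not good", "highly recommend"])
print("not good" in vocabulary, list(predictions))
```

On a real corpus the model would be fit on a training split and scored on held-out reviews, for instance with `sklearn.metrics.classification_report` to obtain per-class recall and macro F1.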

Negative recall: 0.83 (standard LR) → 0.90 (balanced LR) with minimal precision loss
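
For context on what balancing does: scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * count)`. The sketch below applies that formula to a scaled-down stand-in for the 443K positive and 124K negative reviews. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Stand-in labels with the same class ratio as the dataset, scaled down by 1000.
labels = ["pos"] * 443 + ["neg"] * 124
weights = balanced_class_weights(labels)
print(weights)  # roughly 0.64 for "pos" and 2.29 for "neg"
```

The negative class ends up weighted about 3.6 times as heavily as the positive class, mirroring the imbalance ratio, so errors on negative reviews cost more during training.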

TF-IDF + LinearSVC outperforms Word2Vec + RandomForest by 4.1 percentage points of accuracy (94.3% vs. 90.2%)

Corpus: 568K reviews, 1.37M TF-IDF features, 300-dimensional Word2Vec embeddings