Eric Jin
Python, LDA, NLP, Gensim, Scikit-learn, Topic Modeling

TripAdvisor Topic Modeling

LDA topic model linking hotel review themes to guest satisfaction ratings

Overview

Applied Latent Dirichlet Allocation (LDA) to TripAdvisor hotel reviews to surface the latent themes behind guest satisfaction and dissatisfaction. Linking the discovered topics to star ratings shows hospitality operators which themes accompany 5-star reviews and where service failures cluster.

Methods

  • Text preprocessing: lowercasing, regex cleaning, Porter stemming, domain stopword augmentation
  • CountVectorizer with unigrams and bigrams (max_df=0.8, min_df=15)
  • Custom domain stopword list to remove high-frequency noise terms ('hotel', 'room', 'stay', 'night')
  • LDA fitted across k ∈ {5, 10, 15, 20} topics
  • Coherence score (c_v) optimization via Gensim CoherenceModel for k selection
  • Topic-rating correlation: assigned dominant topic per review, computed mean star rating per topic
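The preprocessing and vocabulary-filtering steps above can be sketched with the standard library alone. This is an illustration of what lowercasing, regex cleaning, domain stopword removal, unigram/bigram extraction, and the min_df/max_df cutoffs do, not the project's actual code, which used CountVectorizer. The function names, the regex, the stem-then-filter order, and the toy reviews are all assumptions; the `stem` hook defaults to the identity so the sketch has no dependencies, and a Porter stemmer's stem method could be passed in instead.

```python
import re
from collections import Counter

# Domain stopwords named in the Methods list. A general English stopword
# list would normally be merged in as well (omitted here for brevity).
DOMAIN_STOPWORDS = {"hotel", "room", "stay", "night"}


def preprocess(text, stem=lambda token: token):
    """Lowercase, replace non-letters with spaces, stem, drop stopwords.

    `stem` is a hypothetical hook: pass e.g. nltk's PorterStemmer().stem
    to approximate the Porter stemming described in the source.
    """
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    stemmed = (stem(token) for token in cleaned.split())
    return [token for token in stemmed if token not in DOMAIN_STOPWORDS]


def ngrams(tokens):
    """Return the unigrams followed by space-joined adjacent bigrams."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def build_vocabulary(docs_terms, min_df=15, max_df=0.8):
    """Keep terms found in at least `min_df` documents (absolute count)
    and in at most a `max_df` fraction of all documents."""
    n_docs = len(docs_terms)
    doc_freq = Counter(term for terms in docs_terms for term in set(terms))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )


# Made-up reviews; min_df is lowered because the toy corpus is tiny.
reviews = [
    "The hotel staff were GREAT!!",
    "Great staff, tiny room.",
    "Front desk lost my booking",
    "Great location",
]
docs = [ngrams(preprocess(review)) for review in reviews]
print(build_vocabulary(docs, min_df=2, max_df=0.8))  # ['great', 'staff']
```

On a real corpus the defaults (min_df=15, max_df=0.8) match the cutoffs listed above: rare terms and near-universal terms are both dropped before the topic model sees them.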

Key Findings

  • k=10 selected as the best trade-off: coherence 0.42, with only marginal gains at k=15 and k=20
  • Topic 4 (staff excellence, location, 'wonderful') → avg rating 4.72 — the highest-satisfaction theme
  • Topic 2 (service, pool, amenities, 'excel') → avg rating 4.61 — premium property experiences
  • Topic 5 (front desk complaints, booking issues, 'nt') → avg rating 2.46 — clearest pain point
  • Topic 1 (bathroom, room size, 'small', 'star') → avg rating 2.53 — physical property complaints
  • Adding bigrams substantially improved topic coherence and semantic interpretability
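The per-topic average ratings listed above come from assigning each review its dominant topic and averaging star ratings within each group. A minimal sketch of that step follows, assuming a document-topic matrix is already available as a list of per-review weight lists. The function names and the toy numbers are invented for illustration, and topic indices here are 0-based, so they do not correspond to the topic numbers reported above.

```python
from collections import defaultdict


def dominant_topic(topic_weights):
    """Index of the highest-weight topic for one review (first wins ties)."""
    return max(range(len(topic_weights)), key=topic_weights.__getitem__)


def mean_rating_by_topic(doc_topic, ratings):
    """Group reviews by dominant topic and average their star ratings."""
    grouped = defaultdict(list)
    for weights, rating in zip(doc_topic, ratings):
        grouped[dominant_topic(weights)].append(rating)
    return {topic: sum(rs) / len(rs) for topic, rs in sorted(grouped.items())}


# Made-up inputs: four reviews, three topics.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
ratings = [5, 2, 4, 1]
print(mean_rating_by_topic(doc_topic, ratings))  # {0: 4.5, 1: 2.0, 2: 1.0}
```

Using only the dominant topic discards the rest of each review's topic mixture. That keeps the comparison easy to read, at the cost of treating a review that is 51% one topic the same as one that is 95%.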

Results

10 interpretable topics with avg star ratings spanning 2.46 to 4.72

Service excellence and location identified as primary drivers of 5-star reviews

Front desk/booking failures and room quality complaints are primary dissatisfaction themes

Coherence scores rose from 0.410 (k=5) to 0.425 (k=20), with clear diminishing returns beyond k=10
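The exact rule used to pick k is not stated here. One way to formalize "diminishing returns" is to walk the candidate values in increasing order and stop once the coherence gain from the next candidate falls below a threshold. The sketch below does that using only the three scores reported on this page (k=15 is not reported, so it is left out). The function name `select_k` and the 0.008 threshold are illustrative assumptions.

```python
def select_k(coherence_by_k, min_gain):
    """Return the last k reached before the coherence gain between
    consecutive candidates drops below `min_gain`."""
    ks = sorted(coherence_by_k)
    chosen = ks[0]
    for prev_k, next_k in zip(ks, ks[1:]):
        if coherence_by_k[next_k] - coherence_by_k[prev_k] < min_gain:
            break  # further topics add too little coherence
        chosen = next_k
    return chosen


# Scores reported above; the threshold is a hypothetical choice.
reported = {5: 0.410, 10: 0.42, 20: 0.425}
print(select_k(reported, min_gain=0.008))  # 10
```

This rule compares gains per step rather than per added topic, which is a simplification when the candidate values are unevenly spaced.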