Python, LDA, NLP, Gensim, Scikit-learn, Topic Modeling

TripAdvisor Topic Modeling

LDA topic model linking hotel review themes to guest satisfaction ratings

Overview

Applied Latent Dirichlet Allocation (LDA) to TripAdvisor hotel reviews to surface the latent themes behind guest satisfaction and dissatisfaction. Linking the discovered topics to star ratings shows hospitality operators which themes accompany 5-star reviews and where service failures cluster.

Methods

Text preprocessing: lowercasing, regex cleaning, Porter stemming
Custom domain stopword list added to the standard stopwords to remove high-frequency noise terms ('hotel', 'room', 'stay', 'night')
CountVectorizer with unigrams and bigrams (max_df=0.8, min_df=15)
LDA fitted across k ∈ {5, 10, 15, 20} topics
Coherence score (c_v) optimization via Gensim CoherenceModel for k selection
Topic-rating correlation: assigned dominant topic per review, computed mean star rating per topic
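
The steps above can be sketched roughly as follows with scikit-learn. This is a minimal illustration, not the project's actual code: the function names `clean` and `fit_topics`, the exact cleaning regex, and the use of scikit-learn's built-in English stopword list as the base list are all assumptions. Porter stemming is left out to keep the sketch dependency-light; a stemmer would normally run inside the cleaning step.

```python
import re

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# Domain stopwords named in the write-up; the full list is not given.
DOMAIN_STOPWORDS = {"hotel", "room", "stay", "night"}


def clean(text: str) -> str:
    """Lowercase and keep only letters and spaces (an assumed cleaning rule)."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def fit_topics(reviews, ratings, n_topics=10, max_df=0.8, min_df=15, seed=0):
    """Fit LDA on unigram+bigram counts; return mean rating per dominant topic."""
    vectorizer = CountVectorizer(
        preprocessor=clean,
        stop_words=list(ENGLISH_STOP_WORDS | DOMAIN_STOPWORDS),
        ngram_range=(1, 2),  # unigrams and bigrams
        max_df=max_df,       # drop terms in more than 80% of reviews
        min_df=min_df,       # drop terms in fewer than 15 reviews
    )
    counts = vectorizer.fit_transform(reviews)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(counts)  # shape: (n_reviews, n_topics)

    # Dominant topic = the topic with the largest weight in each review.
    dominant = doc_topic.argmax(axis=1)
    ratings = np.asarray(ratings, dtype=float)
    mean_rating = {
        int(t): float(ratings[dominant == t].mean()) for t in np.unique(dominant)
    }
    return vectorizer, lda, dominant, mean_rating
```

On the full review corpus the defaults mirror the settings listed above; on a small sample, `min_df` has to be lowered or the vocabulary will be empty. Topics that never come out as dominant for any review simply do not appear in `mean_rating`.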

Key Findings

k=10 selected as optimal: coherence 0.42, with only marginal gains at k=15 and k=20
Topic 4 (staff excellence, location, 'wonderful') → avg rating 4.72: the highest-satisfaction theme
Topic 2 (service, pool, amenities, 'excel', the Porter stem of "excellent") → avg rating 4.61: premium property experiences
Topic 5 (front desk complaints, booking issues, 'nt', likely the remnant of "n't" contractions after punctuation stripping) → avg rating 2.46: clearest pain point
Topic 1 (bathroom, room size, 'small', 'star') → avg rating 2.53: physical property complaints
Adding bigrams to the unigram vocabulary improved topic coherence and made the topics easier to interpret

Results

10 interpretable topics with avg star ratings spanning 2.46 to 4.72

Service excellence and location identified as primary drivers of 5-star reviews

Front desk/booking failures and room quality complaints are primary dissatisfaction themes

Coherence scores: 0.410 (k=5) → 0.425 (k=20), a gain of only 0.015 across the full range, with diminishing returns beyond k=10
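
The diminishing returns can be made concrete by dividing each coherence gain by the number of topics added. The sketch below uses only the three values reported in this write-up (k=15 is not reported, and the k=10 value is given to two decimals only), so the result is approximate; the helper name `gain_per_topic` is hypothetical.

```python
# Coherence values as reported in this write-up; k=15 is not reported,
# and the k=10 value is only given to two decimals.
reported = {5: 0.410, 10: 0.42, 20: 0.425}


def gain_per_topic(scores):
    """Coherence gained per additional topic between consecutive reported k."""
    ks = sorted(scores)
    return {
        (lo, hi): (scores[hi] - scores[lo]) / (hi - lo)
        for lo, hi in zip(ks, ks[1:])
    }


gains = gain_per_topic(reported)
for (lo, hi), g in gains.items():
    print(f"k={lo} -> k={hi}: {g:+.4f} coherence per added topic")
```

On these figures, each added topic contributes roughly 0.002 coherence between k=5 and k=10 but only about 0.0005 between k=10 and k=20, which is consistent with choosing k=10.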

Eric Jin