Python · LDA · NLP · Gensim · Scikit-learn · Topic Modeling
TripAdvisor Topic Modeling
LDA topic model linking hotel review themes to guest satisfaction ratings
Overview
Applied Latent Dirichlet Allocation (LDA) to TripAdvisor hotel reviews to surface the latent themes behind guest satisfaction and dissatisfaction. Linking the discovered topics to star ratings shows hospitality operators which themes accompany 5-star reviews and where service failures cluster.
Methods
- Text preprocessing: lowercasing, regex cleaning, Porter stemming, domain stopword augmentation
- CountVectorizer with unigrams and bigrams (max_df=0.8, min_df=15)
- Custom domain stopword list to remove high-frequency noise terms ('hotel', 'room', 'stay', 'night')
- LDA fitted across k ∈ {5, 10, 15, 20} topics
- Coherence score (c_v) optimization via Gensim CoherenceModel for k selection
- Topic-rating correlation: assigned dominant topic per review, computed mean star rating per topic
Key Findings
- k=10 selected as the working model: coherence 0.42, with only marginal gains at k=15 and k=20
- Topic 4 (staff excellence, location, 'wonderful') → avg rating 4.72, the highest-satisfaction theme
- Topic 2 (service, pool, amenities, 'excel', the Porter stem of "excellent") → avg rating 4.61, premium property experiences
- Topic 5 (front desk complaints, booking issues, 'nt', likely a residue of contractions such as "didn't") → avg rating 2.46, the clearest pain point
- Topic 1 (bathroom, room size, 'small', 'star') → avg rating 2.53, physical property complaints
- Bigram addition substantially improved topic coherence and semantic interpretability
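The per-topic averages above come from assigning each review its dominant topic and averaging star ratings within each group. A minimal sketch of that aggregation follows; the function name and the toy numbers are hypothetical and are not the project's data.

```python
from collections import defaultdict


def mean_rating_by_dominant_topic(doc_topics, ratings):
    """Average the star ratings of reviews grouped by their dominant topic.

    doc_topics: one sequence of topic weights per review.
    ratings:    one star rating per review, in the same order.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for weights, rating in zip(doc_topics, ratings):
        # Dominant topic = index of the largest topic weight.
        dominant = max(range(len(weights)), key=lambda i: weights[i])
        totals[dominant] += rating
        counts[dominant] += 1
    return {topic: totals[topic] / counts[topic] for topic in sorted(counts)}


# Hypothetical toy input: four reviews, three topics.
toy_doc_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
]
toy_ratings = [5, 2, 4, 1]

print(mean_rating_by_dominant_topic(toy_doc_topics, toy_ratings))
# {0: 4.5, 1: 1.5}
```

Topics that are never dominant for any review (topic 2 in the toy input) do not appear in the result, which is worth checking for before comparing averages across topics.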
Results
- 10 interpretable topics with avg star ratings spanning 2.46 to 4.72
- Service excellence and location identified as primary drivers of 5-star reviews
- Front desk/booking failures and room quality complaints are primary dissatisfaction themes
- Coherence scores rose from 0.410 (k=5) to 0.425 (k=20), with clear diminishing returns beyond k=10
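One way to make "diminishing returns" concrete is to pick the smallest k whose coherence falls within a small tolerance of the best score. The rule and the 0.01 tolerance below are assumptions for illustration, not the selection procedure documented for this project; the scores are the three reported above (the k=15 value is not given and is left out).

```python
def smallest_k_within_tolerance(scores, tolerance=0.01):
    """Return the smallest k whose coherence is within `tolerance` of the best.

    scores: mapping from number of topics k to coherence score.
    """
    best = max(scores.values())
    return min(k for k, score in scores.items() if best - score <= tolerance)


# Coherence values reported in this write-up.
reported = {5: 0.410, 10: 0.42, 20: 0.425}

print(smallest_k_within_tolerance(reported))
# 10
```

With this tolerance, k=5 is rejected because it trails the best score by 0.015, while k=10 trails by only 0.005 and is preferred over k=20 for being the simpler model.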
Eric Jin