Machine Learning and Statistical Data Analysis:
Analyzing and Categorizing Amazon Product Reviews

In the whirlwind of digital transformation, the ecommerce industry has taken the stage by storm. It's a world where clicking to shop has become the norm, and Amazon reigns as the undisputed market king. But it's not just about buying and selling—it's a realm of endless services. A crown jewel in this kingdom is the treasure trove of product reviews. Stars shining out of 5, comments, and feedback illuminate the path for new buyers and guide companies to polish their star products while steering clear of the ones that dim their brand's brilliance. 🌟🛒

Screenshot 2023-09-30 170206.png

Hence, as a part of my Machine Learning and Statistical Data Analysis final project, I will be delving into the Amazon Review Dataset provided. My aim? To extract valuable insights using a range of techniques! Imagine this: binary classification unveiling clear trends, multi-class classification painting a vivid picture, and clustering helping organize the chaos. Buckle up as I navigate through logistic regression, dance through decision tree classification, and orchestrate a symphony with random forest classification. As if that wasn't thrilling enough, k-means clustering will be my compass in this exhilarating journey to dissect and categorize the Amazon product reviews. 🚀🔍📊

Feature Selection

The goal is to evolve from a binary classifier to a multi-class classifier, embracing a five-class product rating scale (1 to 5).


For this transformation, I'm sticking to my trio of classifiers: Logistic Regression (saga solver), followed by Decision Tree and Random Forest Classifiers. When it came to clustering, the initial features didn't quite cut it, so I pivoted to "style" as a key feature. It stood out due to its diverse JSON values across categories.


Combining it with other variables didn't amplify the results significantly. Hence, in the end, "style" emerged as the standout feature for clustering. 🌟🎛️

Screenshot 2023-09-30 170949.png

Logistic Regression

In logistic regression, initial attempts with the default solver settings for both binary and multi-class classification didn't converge well. The 'liblinear' solver stood out for binary classification due to its suitability for noisy data and large feature sets. For multi-class, the 'saga' solver (a variant of stochastic average gradient descent) with the "multinomial" setting proved efficient. Hyperparameters like "C", "l1_ratio", and "max_iter" were carefully tuned to balance model fit, prevent overfitting, and ensure convergence, with "GridSearchCV" picking the best combination through 5-fold cross-validation. 🛠️🎯
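
The tuning loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic (five classes standing in for the 1 to 5 star ratings), the grid values are placeholders, and the elastic-net penalty is assumed because "l1_ratio" only has an effect with it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the review features; 5 classes mimic the star ratings.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# saga converges much faster on scaled features, so scaling is part of the pipeline.
# With saga, multi-class targets are fit with a multinomial loss by default.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", penalty="elasticnet", l1_ratio=0.5),
)

# Placeholder grid over the three hyperparameters named in the text.
param_grid = {
    "logisticregression__C": [0.1, 1.0],
    "logisticregression__l1_ratio": [0.0, 0.5, 1.0],
    "logisticregression__max_iter": [200, 1000],
}

# 5-fold cross-validation, as in the write-up; macro F1 is an assumed metric.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)  # macro F1 on the held-out split
print(search.best_params_, round(test_accuracy, 3))
```

For the binary case, the same pattern would apply with solver="liblinear" and an "l1" or "l2" penalty, since liblinear does not support elastic-net.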

Screenshot 2023-09-30 171424.png
Screenshot 2023-09-30 171618.png
Screenshot 2023-09-30 171646.png
Screenshot 2023-09-30 171748.png
Screenshot 2023-09-30 171845.png

Decision Tree Classifier

In experimenting with the Decision Tree Classifier, I initially observed a dip in F1 score, AUC, and accuracy. To improve performance, I focused on tuning two hyperparameters: "splitter" and "max_depth". The "splitter" setting controls how the split at each node is chosen, which directly affects accuracy and the shape of the resulting tree. Tuning "max_depth" caps how deep the tree can grow, striking a balance between overfitting and underfitting and ultimately enhancing predictive capabilities. Hyperparameter fine-tuning was done with "GridSearchCV". The confusion matrix and ROC curve for both binary and multi-class classification can be found below. 🌳📊🎯
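
As a rough sketch of that tuning step, the snippet below grid-searches "splitter" and "max_depth" and then computes a confusion matrix and a one-vs-rest ROC AUC for the five-class case. The synthetic data, the grid values, and the macro-F1 scoring are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-class data standing in for the review features.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder grid over the two hyperparameters discussed in the text.
param_grid = {
    "splitter": ["best", "random"],
    "max_depth": [3, 5, 10, None],
}
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="f1_macro"
)
tree_search.fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes.
y_pred = tree_search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Multi-class AUC via one-vs-rest, macro-averaged over the five classes.
y_proba = tree_search.predict_proba(X_test)
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr")

print(tree_search.best_params_)
print(cm)
print(round(auc_ovr, 3))
```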

Screenshot 2023-09-30 172326.png
Screenshot 2023-09-30 172457.png
Screenshot 2023-09-30 172524.png

Screenshot 2023-09-30 172551.png
Screenshot 2023-09-30 172621.png

Random Forest Classifier

In implementing the Random Forest Classifier, although the model had a lengthy runtime, the goal was to leverage its potential as a combination of decision trees for superior performance. Because each tree in the forest considers only a random subset of features when splitting a node, an improvement in scores and accuracy was anticipated. Hyperparameter tuning focused on "max_depth" and "max_features," the latter controlling how many features are considered at each node split. This tuning aimed to give each tree a suitable feature subset, ultimately elevating the classifier's performance. 🌲⚙️📈
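
A minimal sketch of that search over "max_depth" and "max_features" might look like the following. The synthetic data, the small forest size, the grid values, and the 5-fold macro-F1 scoring are all assumptions made to keep the example quick to run; they are not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic five-class data standing in for the review features.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# max_features sets how many features each split may consider:
# "sqrt" and "log2" are functions of the feature count, 0.5 means half of them.
param_grid = {
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2", 0.5],
}

# A small forest keeps the sketch fast; real runs typically use more trees.
forest_search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid, cv=5, scoring="f1_macro",
)
forest_search.fit(X_train, y_train)

y_pred = forest_search.predict(X_test)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(forest_search.best_params_, round(macro_f1, 3))
```

The long runtime mentioned above comes from the grid multiplying the cost of every forest: each parameter combination is refit once per fold, so trimming the grid or the number of trees is the simplest way to shorten it.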

Screenshot 2023-09-30 172856.png
Screenshot 2023-09-30 172923.png
Screenshot 2023-09-30 172959.png
Screenshot 2023-09-30 173042.png
Screenshot 2023-09-30 173112.png

In this gripping odyssey through the realm of classification models, one truth stood bold and resolute: Logistic Regression, when tuned to the beat of precision, emerged as the champion. It wielded its power, eclipsing the performance of its counterparts—the formidable Random Forest Classifier and the promising Decision Tree Classifier—in the arena of the Amazon Product Review Dataset. The evidence is undeniable: meticulous feature selection, judicious data pre-processing, and a symphony of hyper-parameter tuning orchestrate a triumphant crescendo, painting a portrait of how strategic model refinement can unlock unparalleled insights. In this data symposium, Logistic Regression donned the crown of triumph, showcasing that with the right melody, any dataset can dance to the tune of accurate predictions. 🏆🎶🚀

Complete Report

