This report explores the application of Word2Vec and its pretrained models to train a Language Model (LM) in Python. It evaluates model performance, compares algorithm accuracy before and after improvements, and analyzes the impact of data preprocessing.
The split_data function is implemented in Python 3.9 within the PyCharm environment. It reads, preprocesses, tokenizes, and splits a CSV file, playing a crucial role in the project. Key steps include opening the CSV, preprocessing the data (sentiment transformation), splitting the data, and saving the splits in CSV format.
This function opens a CSV file, preprocesses data, splits it into training and testing sets, and saves the split data into separate CSV files based on specified percentages.
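A minimal sketch of how split_data might look, assuming pandas and scikit-learn, an IMDB-style CSV with `review` and `sentiment` columns, and a configurable test percentage (column names, label mapping, and defaults are assumptions; tokenization is deferred to the embedding stage here):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(csv_path, train_path, test_path, test_size=0.2):
    """Read the raw CSV, map sentiment labels to integers,
    and write the train/test splits to separate CSV files."""
    df = pd.read_csv(csv_path)
    # Sentiment transformation: text labels -> numeric targets
    df["sentiment"] = df["sentiment"].map({"positive": 1, "negative": 0})
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=42)
    train_df.to_csv(train_path, index=False)
    test_df.to_csv(test_path, index=False)
```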
This function opens a CSV file containing the IMDB dataset, extracts features (reviews) and targets (sentiments), prepares data for training, and returns extracted features and targets for further analysis.
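A possible shape for this loader, again assuming the `review`/`sentiment` column names from the sketch above:

```python
import pandas as pd

def load_dataset(csv_path):
    """Load an IMDB CSV and return review texts plus numeric sentiment targets."""
    df = pd.read_csv(csv_path)
    features = df["review"].astype(str).tolist()   # raw review texts
    targets = df["sentiment"].tolist()             # 0/1 sentiment labels
    return features, targets
```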
This function converts text data into average word embeddings using a pretrained Word2Vec model, transforms text data into numerical representations, trains a Multi-Layer Perceptron (MLP) model, and returns the trained model for testing.
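A sketch of the averaging approach, assuming gensim's downloader API and the word2vec-google-news-300 vectors (the specific pretrained model and the MLP hyperparameters are assumptions); out-of-vocabulary tokens are skipped, and an all-zero vector stands in for reviews with no known words:

```python
import numpy as np
import gensim.downloader as api
from sklearn.neural_network import MLPClassifier

def average_embedding(text, w2v):
    """Average the Word2Vec vectors of in-vocabulary tokens;
    fall back to a zero vector when no token is known."""
    vectors = [w2v[tok] for tok in text.lower().split() if tok in w2v]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

def train_average_embedding_model(features, targets):
    """Embed each review as the mean of its word vectors, then fit an MLP."""
    w2v = api.load("word2vec-google-news-300")  # assumed pretrained model
    X = np.array([average_embedding(text, w2v) for text in features])
    model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)
    model.fit(X, targets)
    return model
```

Averaging collapses each variable-length review into a single fixed-length vector (300 dimensions for this model), which is what lets a standard MLP consume free-form text.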
This function creates word embeddings by applying a student-specific logarithmic transformation to Word2Vec word vectors, handles missing embeddings to avoid errors, trains an MLP model on the modified embeddings, and returns the trained model built on the student-specific embeddings.
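The exact student-specific transformation is not spelled out here, so the sketch below uses an illustrative element-wise signed log scaling, sign(v) * log(1 + |v|); only the handling of missing words mirrors the description:

```python
import numpy as np

def student_embedding(text, w2v):
    """Element-wise signed log transform of each word vector
    (illustrative; the actual student-specific transform may differ)."""
    vectors = []
    for tok in text.lower().split():
        if tok in w2v:                       # skip missing words to avoid KeyErrors
            v = w2v[tok]
            vectors.append(np.sign(v) * np.log1p(np.abs(v)))
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)
```

Training then mirrors train_average_embedding_model above, with student_embedding substituted for average_embedding.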
This function reads a test dataset, prepares features (numerical representations), predicts sentiments using a trained model, evaluates accuracy, and saves the output to a CSV file.
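A sketch of the evaluation step, reusing the hypothetical load_dataset and average_embedding helpers from the earlier sketches:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

def test_model(model, test_csv, w2v, output_csv):
    """Embed the test reviews, predict sentiments, report accuracy,
    and save predictions alongside the true labels."""
    features, targets = load_dataset(test_csv)            # helper sketched above
    X = np.array([average_embedding(t, w2v) for t in features])
    predictions = model.predict(X)
    print("Accuracy:", accuracy_score(targets, predictions))
    pd.DataFrame({"review": features, "actual": targets,
                  "predicted": predictions}).to_csv(output_csv, index=False)
```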
- Average Embedding Model Accuracy: 81.4%
- Student Embedding Model Accuracy: 81.22%
- Baseline Accuracy: Pretrained Word2Vec models show strong baseline accuracy, exceeding 81%, indicating effectiveness in sentiment analysis.
- Confusion Matrix Analysis: Balanced matrices for both models highlight accurate predictions, demonstrating Word2Vec embeddings' robustness in handling diverse sentiment patterns.
- Comparison with Logistic Regression: Word2Vec models, even with basic preprocessing, rival logistic regression, achieving accuracy over 70%, showcasing natural language understanding prowess.
- Impact of Preprocessing: Effective preprocessing, including sentiment transformation and data splitting, significantly contributes to high model accuracy.
- Embedding Generation Impact: The use of average and student-specific embeddings showcases Word2Vec's adaptability, providing diverse embeddings for improved sentiment analysis outcomes.
- Handling Missing Embeddings: The Student Embedding Model's resilience to missing words, managed through the logarithmic transformation, ensures a robust training process with minimized errors.
This report underscores the efficacy of pretrained Word2Vec models, emphasizing their adaptability, competitive accuracy, and potential across various natural language processing tasks.