This repository contains code for analyzing and visualizing a bibliographic graph modeled as a heterogeneous network. The pipeline covers data preparation, graph construction, node feature extraction, text processing and vectorization, dimensionality reduction, clustering, cluster visualization, and evaluation of variable distributions.
Clean the dataset by removing missing values and duplicates, select relevant columns, and save the cleaned dataset.
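
The cleaning step could look roughly like the sketch below. The column names, the toy data, and the output path are all hypothetical, since the dataset schema is not described here. Columns are selected first so that duplicates are judged only on the fields that are kept.

```python
import pandas as pd

# Hypothetical column names; the real dataset's schema is an assumption.
RELEVANT_COLUMNS = ["title", "cited_title", "topic"]


def clean_dataset(df: pd.DataFrame, columns=RELEVANT_COLUMNS) -> pd.DataFrame:
    """Keep the chosen columns, then drop rows with missing values and duplicates."""
    cleaned = df[columns].dropna().drop_duplicates()
    return cleaned.reset_index(drop=True)


# Toy data standing in for the raw bibliographic export.
raw = pd.DataFrame({
    "title": ["A", "A", "B", None],
    "cited_title": ["B", "B", "C", "A"],
    "topic": ["graphs", "graphs", "nlp", "nlp"],
    "unused": [1, 2, 3, 4],
})

cleaned = clean_dataset(raw)
# cleaned.to_csv("cleaned_dataset.csv", index=False)  # hypothetical output path
```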
Build a directed graph using NetworkX, where nodes represent paper titles and citations, and edges represent connections between them.
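
As a minimal sketch of the graph construction, assuming the cleaned data reduces to (paper title, cited title) pairs: the pairs below and the `kind` node attribute are illustrative assumptions, not the repository's actual schema.

```python
import networkx as nx

# Hypothetical (paper title, cited title) pairs standing in for the cleaned dataset.
citation_pairs = [
    ("Paper A", "Paper B"),
    ("Paper A", "Paper C"),
    ("Paper B", "Paper C"),
]

graph = nx.DiGraph()
for title, cited in citation_pairs:
    # A node seen as a citing paper is always tagged "paper".
    graph.add_node(title, kind="paper")
    # A cited work is tagged "citation" only if it is not already in the graph.
    if cited not in graph:
        graph.add_node(cited, kind="citation")
    # Edge direction: citing paper -> cited work.
    graph.add_edge(title, cited)
```

Tagging each node with a type is one simple way to keep the network heterogeneous while still using a single `DiGraph`.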
Extract node features, such as the number of topics or papers associated with each node.
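
One possible way to compute such counts is from node degrees. The sketch below assumes a graph whose edges run from paper nodes to topic nodes and whose nodes carry a hypothetical `kind` attribute; the actual feature definitions in the repository may differ.

```python
import networkx as nx

# Hypothetical paper -> topic links.
links = [("Paper A", "graphs"), ("Paper A", "nlp"), ("Paper B", "nlp")]

graph = nx.DiGraph()
for paper, topic in links:
    graph.add_node(paper, kind="paper")
    graph.add_node(topic, kind="topic")
    graph.add_edge(paper, topic)


def node_features(g: nx.DiGraph) -> dict:
    """Count topics per paper (out-degree) and papers per topic (in-degree)."""
    features = {}
    for node, data in g.nodes(data=True):
        if data["kind"] == "paper":
            features[node] = {"n_topics": g.out_degree(node)}
        else:
            features[node] = {"n_papers": g.in_degree(node)}
    return features


features = node_features(graph)
```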
Process and vectorize textual data using techniques like TF-IDF to convert text into numerical representations.
Apply Truncated SVD to reduce the dimensionality of the vectorized data while preserving essential information.
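
The reduction step might be sketched as follows. The corpus, the choice of two components, and the fixed random seed are assumptions; in practice the number of components would be tuned to the data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus; a TF-IDF matrix is rebuilt here so the example is self-contained.
titles = [
    "Graph neural networks for citation analysis",
    "Topic modeling of scientific abstracts",
    "Clustering citation networks with graph embeddings",
    "Neural topic models for scientific text",
]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(titles)

# TruncatedSVD accepts sparse input directly, so the matrix is never densified.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf_matrix)

# Fraction of the variance retained by the two components.
retained = svd.explained_variance_ratio_.sum()
```

Checking `retained` gives a rough sense of how much information survives the projection.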
Use K-means clustering to group similar data points based on their reduced dimensional representations.
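
A small K-means sketch is given below. It uses synthetic two-dimensional points in place of the real SVD output, and the number of clusters is an assumption chosen to match that toy data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the reduced vectors: two well-separated groups of points.
rng = np.random.default_rng(0)
reduced = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 2)),
    rng.normal(5.0, 0.3, size=(20, 2)),
])

# n_clusters=2 matches the toy data; on real data it would need to be chosen.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)
centers = kmeans.cluster_centers_
```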
Visualize the clusters using scatter plots to gain insights into the clustering results and patterns in the data.
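
The scatter plot could be produced along these lines with Matplotlib. The points, the labels, the axis names, and the output file name are all placeholders for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic 2-D points and hypothetical cluster labels standing in for real results.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(3.0, 0.5, size=(30, 2)),
])
labels = np.array([0] * 30 + [1] * 30)

fig, ax = plt.subplots(figsize=(6, 4))
# Color each point by its cluster label.
scatter = ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="viridis", s=20)
ax.set_xlabel("SVD component 1")
ax.set_ylabel("SVD component 2")
ax.set_title("K-means clusters")
ax.legend(*scatter.legend_elements(), title="Cluster")
fig.savefig("clusters.png", dpi=150)  # hypothetical output file name
```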
Evaluate the distribution of selected variables using count plots to identify patterns and imbalances.
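
A count plot for one variable might be sketched as below, assuming seaborn is available. The `cluster` column and its values are hypothetical; any categorical column from the dataset could be used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical variable: the cluster assigned to each record.
df = pd.DataFrame({"cluster": [0, 0, 0, 1, 2, 2]})

# Tabulate the same counts the plot shows, which makes imbalances easy to report.
counts = df["cluster"].value_counts().sort_index()

fig, ax = plt.subplots(figsize=(5, 3))
sns.countplot(data=df, x="cluster", ax=ax)
ax.set_ylabel("Number of records")
ax.set_title("Records per cluster")
```

A strongly uneven bar height across categories is the kind of imbalance this step is meant to reveal.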