
sustech-cse5002-project's Introduction

Author: Zhicong Sun
Date: 2021.5.30

Abstract

This is my mini project for CSE5002 Intelligent Data Analysis at the Southern University of Science and Technology.

1. Introduction

This is my mini project for CSE5002 Intelligent Data Analysis. The first section of this report describes the problem to be solved and the work done. The second section introduces the methods and materials used, mainly the process of data analysis and preprocessing. The third section presents two experiments in detail. Experiment 1 uses the attribute dataset and compares five classification models. Experiment 2 uses both the attribute dataset and the adjacency list dataset to train and test the final ten-layer neural network classifier. After adding the adjacency list dataset, the accuracy of the classifier improves significantly, by about 50 percentage points, and the final accuracy on the test dataset reaches 80%.

1.1 What is the problem to solve

In this mini project, an attributed social network at MIT (MIT, for short) is used as a toy example.
The original dataset comes from [1]. To simulate the above scenario, the related term to “age” is
“class year” in MIT dataset. Therefore, we adopt “class year” as the label in our mini project. We
have preprocessed the MIT dataset by removing the rows whose “class year” is 0, which
finally yields 5298 rows of data.
image.png
Assume that there are some missing labels of “class year”. We need to predict the missing labels
(a multi-class classification problem) based on two sources of information. One comes from node
attributes, while another is from network topology. Specifically, our dataset consists of

  • attr.csv: node_id,degree,gender,major,second_major,dormitory,high_school (5298 rows)
  • adjlist.csv: node_id,neighbor_id_1,neighbor_id_2,… (5298 rows)
  • label_train.csv: node_id,class_year (4000 rows)
  • label_test.csv: node_id,class_year (1298 rows)


where node_id (each corresponds to a person) ranges from 0 to 5297. In this mini project, our
training set contains node_id from 0 to 3999, and testing set contains node_id from 4000 to 5297.

The objective is to train a classifier, utilizing node attributes, or network topology, or both, to
make good predictions for the missing labels in testing set.
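As a sketch, the four files could be loaded and split with pandas. The column names come from the file descriptions above; the assumption that attr.csv has no header row, the `load_attr` helper, and the inline two-row sample are all hypothetical:

```python
import io
import pandas as pd

def load_attr(path_or_buf):
    """Load the attribute file, assumed to have no header row."""
    cols = ["node_id", "degree", "gender", "major",
            "second_major", "dormitory", "high_school"]
    return pd.read_csv(path_or_buf, header=None, names=cols)

# In the project this would be load_attr("attr.csv"); a tiny inline
# sample is used here so the sketch is self-contained.
sample = io.StringIO("0,5,1,200,0,150,3000\n1,2,2,201,202,151,3001\n")
attr = load_attr(sample)

# node_id 0-3999 belongs to the training split, 4000-5297 to testing.
train_attr = attr[attr["node_id"] <= 3999]
```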

[1] Traud, Amanda L., et al. "Comparing community structure to characteristics in online collegiate social networks." SIAM Review 53.3 (2011): 526-543.

1.2 What has this project done

The work done in this project can be divided into five parts:

  1. Data analysis
  2. Data preprocessing
  3. Model evaluation and selection
  4. Comparing the performance between one source of information and two sources of information
  5. Using the best classifier to predict the missing labels

1.3 How to set up the environment

Platform: MacBook Air
System: macOS Big Sur 11.2.1
Main running environment of this project:

  • Anaconda 2020.11
  • Spyder 4.1.5
  • python 3.8.5
  • pytorch 1.8.1
  • scikit-learn 0.23.2
  • numpy 1.19.2
  • pandas 1.1.3
  • matplotlib 3.3.2

1.4 How to use this source code

  • data_analysis.ipynb is used for data analysis.
  • clf_without_adjlist.ipynb corresponds to Experiment 1: it uses the attribute dataset to cross-validate and compare five models, and finally uses the ensemble model to train on all training data and predict on the test set.
  • clf_with_attr.ipynb corresponds to Experiment 2: it uses the attribute dataset and the adjacency list dataset with a ten-layer neural network classifier; the code first cross-validates on the training data, then trains on all training data and predicts on the test set.
  • The above 3 files need to be opened in Jupyter and each run in full; it is best not to run only a fragment.

2 Methods and Materials

2.1 Data analysis

2.1.1 First meeting with data

  1. **First meeting with attribute data and training labels:**

image.png
It can be seen that the attribute dataset has 5298 rows and 6 columns, which means there are 6 features. Because the dataset has no feature names, we name the columns feature1 to feature6 in both the training dataset and the testing dataset.
image.png
The format of the changed training dataset is:
image.png
image.png
The format of the changed testing dataset is:
image.png
image.png


2.1.2 Data exploration

  1. Check if the attribute data has missing values

There are no missing values, so we don't need to impute anything.
image.png
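The check above can be done with pandas' `isnull`; a minimal sketch, run here on a small hypothetical stand-in frame rather than the real attr.csv:

```python
import pandas as pd

# Stand-in for the real attribute table; the project runs this on attr.csv.
attr = pd.DataFrame({"feature1": [28, 40, 33],
                     "feature2": [1, 2, 1]})

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = attr.isnull().sum()
print(missing_per_column)
print("total missing:", missing_per_column.sum())
```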

  2. Label visualization

Label visualization of training dataset:
image.png
Label visualization of testing dataset:
image.png
We counted the number of different labels in the training dataset and the test dataset, as well as the total number of distinct labels across both:
image.png
There are 29 labels in the training dataset and 22 labels in the test dataset, and the total number of distinct labels is 32. This means that some labels of the test dataset are not available in the training dataset, and some labels of the training set do not appear in the test dataset. So we should encode all 32 labels.
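Encoding all 32 labels can be done by fitting an encoder on the union of both splits; a sketch with scikit-learn's `LabelEncoder`, using hypothetical toy label lists in place of the real class_year columns:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label arrays; in the project these are the class_year
# columns of label_train.csv and label_test.csv.
train_labels = [2008, 2009, 2010]
test_labels = [2009, 2011]          # 2011 never appears in training

# Fit on the union of both splits so every class year gets a code,
# even those present in only one split.
enc = LabelEncoder()
enc.fit(train_labels + test_labels)

y_train = enc.transform(train_labels)
y_test = enc.transform(test_labels)
print(len(enc.classes_))  # 4 distinct labels in this toy example
```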

  3. Correlation matrix visualization and statistics for features

We analyze the correlation of different features in order to study whether it is necessary to combine different features to construct new features. The following results show that the correlation between features is not high:
image.png
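A correlation matrix like the one described above can be computed with pandas' `corr`; the toy frame below is a placeholder for the real attribute columns:

```python
import pandas as pd

# Toy attribute frame; the project computes this on the real features.
df = pd.DataFrame({"feature3": [1, 2, 3, 4],
                   "feature4": [2, 1, 4, 3],
                   "feature5": [10, 20, 10, 20]})

corr = df.corr()          # Pearson correlation between feature pairs
print(corr.round(2))
```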

  4. Divide features into categorical values and continuous values


We counted the different values contained in each feature, the results are as follows:
image.png
Based on the above results, we can divide the features into categorical features and continuous features. feature1 and feature2 can be regarded as categorical features, while feature3, feature4, feature5 and feature6 can be regarded as continuous features:
image.png

  5. Continuous features visualization

image.png
image.png
image.png
image.png
By analyzing the above four figures, we can draw a conclusion:

  • The continuous features have almost no outliers.
  • feature3 and feature6 seem to follow a normal distribution.
  • It is hard to tell whether feature4 and feature5 follow a normal distribution, but I tend to treat them as normally distributed, because each would appear normal if we supplemented the data on the left side of the axis (even if that data is meaningless).
  6. Categorical features visualization

image.png
image.png
It is hard to determine which features are ordered, so we regard them all as unordered features and tolerate the information loss caused by treating them as unordered variables.

2.2 Data preprocessing

Based on the above data analysis, our data preprocessing can be divided into the following 3 steps:

  1. The categorical features in the attribute dataset are encoded as dummy variables.
  2. Do feature scaling for the continuous features; here we mainly focus on standardization.
  3. Process the adjacency list and convert it to the reachability matrix.

2.2.1 Dummy coding of categorical features

After dummy variable processing, the number of attributes increased from 6 to 13:
image.png
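The dummy-variable step can be sketched with pandas' `get_dummies`; the toy frame and its category codes below are hypothetical:

```python
import pandas as pd

# Toy frame: feature1 and feature2 stand in for the categorical columns;
# feature3 stands in for a continuous column and is left untouched.
df = pd.DataFrame({"feature1": [1, 2, 1],
                   "feature2": [0, 1, 2],
                   "feature3": [190.0, 200.5, 181.2]})

# One dummy column per category value; continuous columns pass through.
encoded = pd.get_dummies(df, columns=["feature1", "feature2"])
print(encoded.columns.tolist())
```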

2.2.2 Feature scaling of continuous features

After standardizing the continuous features, the attribute data is as follows:
image.png
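Standardization as described above can be sketched with scikit-learn's `StandardScaler`; the small array is a stand-in for feature3-feature6:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy continuous features; in the project these are feature3-feature6.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))
```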

2.2.3 Process the adjacency list

In the original adjacency list, the values in each row represent the neighbor samples connected with the current sample. We first convert this list to an adjacency matrix, and then to a reachability matrix. Every value in the reachability matrix is either zero or one: matrix(i, j) = 1 means that the i-th sample and the j-th sample are connected; otherwise they are not connected.
The purpose of this processing is to treat whether each sample is connected with any other sample as a feature. The final reachability matrix is a 5298 x 5298 matrix with the following format:
image.png
Note:

  • If the reachability matrix is to be used as training data, we need to combine this matrix with the other features.
  • In this case, we add 5298 features, so the total number of features reaches 5311.
  • These new features indicate whether the current sample point is connected with another sample point.
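The conversion described above can be sketched as follows; the function name and the tiny 4-node example are hypothetical, and only direct connections are marked (matching the definition that matrix(i, j) = 1 when samples i and j are connected):

```python
import numpy as np

def adjlist_to_matrix(lines, n):
    """Convert adjacency-list rows ('node_id,neighbor_1,...') into an
    n x n 0/1 matrix; M[i, j] = 1 when samples i and j are connected."""
    M = np.zeros((n, n), dtype=np.int8)
    for line in lines:
        ids = [int(x) for x in line.strip().split(",") if x != ""]
        node, neighbors = ids[0], ids[1:]
        for nb in neighbors:
            M[node, nb] = 1
            M[nb, node] = 1   # the friendship network is undirected
    return M

# Tiny 4-node example in place of the real 5298-line adjlist.csv.
M = adjlist_to_matrix(["0,1,2", "1,0", "2,0", "3"], n=4)
print(M)
```

Each row of the resulting matrix can then be concatenated to that sample's attribute features, giving the 13 + 5298 = 5311 features used later.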

3 Experiments and discussion

3.1 Experiment 1

3.1.1 Overview

Classification using the attribute dataset but not the adjacency dataset. This experiment has several characteristics:

  1. Dataset

This experiment only uses the attribute dataset, not the network topology data, to train and predict.

  2. Classifier

This experiment tests the performance of five models on the dataset. These five models are: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree, Neural Network and Ensemble Model.

  3. Procedure

This experiment first uses k-fold cross-validation to compare the performance of different models on the validation set, then uses the whole attribute training dataset to train the model with the best performance, and finally makes predictions on the testing dataset.
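The cross-validation step can be sketched with scikit-learn; the synthetic data, the subset of models shown, and their default settings are placeholders rather than the project's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the project uses the preprocessed attributes.
X, y = make_classification(n_samples=200, n_features=13, n_informative=8,
                           n_classes=4, random_state=0)

models = {"KNN": KNeighborsClassifier(),
          "SVM": SVC(),
          "DecisionTree": DecisionTreeClassifier(random_state=0)}

# 5-fold cross-validation, scored by accuracy as in the experiment.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```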

  4. Performance evaluation

The model performance evaluation criterion used in this experiment is accuracy. More complex evaluation criteria are not used because this is only our preliminary test. Although accuracy has its shortcomings, it most intuitively reflects prediction performance. In Experiment 2, we will use several measures to evaluate the performance of the model.

3.1.2 Results and discussion

Results of k-fold cross validation:

image.png

  • The testing accuracy in the above figure is the accuracy on the validation set.
  • These results show that the neural network performs best among all non-ensemble models, but its accuracy only reaches about 0.3.


Result of training the ensemble model on the whole attribute training dataset:

image.png
image.png

  • The accuracy of the ensemble model reaches 31%, and the Micro F1-score reaches 31%.
  • This shows that training the model on attribute data alone improves accuracy by only about 20% over random guessing.
  • We need to make full use of the information provided by the topology.

3.2 Experiment 2

3.2.1 Overview

Classification using both the adjacency dataset and the attribute dataset. This experiment has several characteristics:

  1. Dataset

This experiment uses both the attribute dataset and the adjacency dataset of the network topology to train and predict. The preprocessing of the training dataset and the test set is the same.

  2. Classifier

This experiment uses a single model: a neural network.
The network consists of an input layer, an output layer and 8 hidden layers. The number of neurons in the input layer is 5311 (equal to the number of features), and that in the output layer is 32 (equal to the total number of classes). The details are as follows:
image.png
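The architecture can be sketched in PyTorch with the stated sizes (5311 inputs, 32 outputs, 8 hidden layers). The hidden width of 256 is an assumption; the figure with the exact layer sizes is not reproduced here:

```python
import torch
import torch.nn as nn

# Input layer -> first hidden layer, then 7 more hidden layers, then output.
layers = [nn.Linear(5311, 256), nn.ReLU()]
for _ in range(7):
    layers += [nn.Linear(256, 256), nn.ReLU()]
layers.append(nn.Linear(256, 32))       # output layer, one unit per class
model = nn.Sequential(*layers)

x = torch.randn(4, 5311)                # a batch of 4 samples
logits = model(x)
print(logits.shape)                     # torch.Size([4, 32])
```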
In addition, compared with other activation functions, ReLU has the following advantages: for linear functions, ReLU is more expressive, especially in deep networks; for nonlinear functions, because the gradient on the non-negative interval is constant, there is no vanishing gradient problem, which keeps the convergence rate of the model stable. A brief description of the vanishing gradient problem: when the gradient is less than 1, the error between the predicted value and the true value decays with each layer of propagation. If sigmoid is used as the activation function in a deep model, this phenomenon is particularly obvious and causes model convergence to stall.

  3. Procedure

This experiment first uses k-fold cross validation to test the performance of models on the validation set, then uses the whole training dataset to train the model with the best performance, and finally makes prediction on the testing dataset.

  4. Performance evaluation

The evaluation criteria for classification problems include accuracy, precision, recall, F1-score, etc. We use **Accuracy, Precision and F1-score** to measure the quality of the model. Multi-class classification is measured differently from binary classification, so we use the Micro F1-score, Micro Precision, Micro Recall and Weighted Precision. Among these criteria, **Micro F1-score and Accuracy** are the most important ones we refer to.
Precision and recall are defined for binary classification problems; in multi-class problems, each class has its own precision, recall, TP, TN, FP and FN. In this project, P represents precision and R represents recall. The micro-averaging method first averages TP, TN, FP and FN across classes, and then calculates P, R and F1 from those averages.
The formulas of Micro F1-score are as follows:
image.png
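For reference, the standard micro-averaged definitions, consistent with the description above (sums over classes i can equivalently be replaced by averages), are:

```latex
P_{\text{micro}} = \frac{\sum_{i} TP_i}{\sum_{i} TP_i + \sum_{i} FP_i},
\qquad
R_{\text{micro}} = \frac{\sum_{i} TP_i}{\sum_{i} TP_i + \sum_{i} FN_i},
\qquad
F1_{\text{micro}} = \frac{2 \, P_{\text{micro}} \, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}
```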

3.2.2 Results and discussion

image.png
image.png

  • The accuracy of the current model reaches 80%, and the Micro F1-score reaches 80%, which indicates that the connection relationships between samples provide more information for the model and improve its fit.
  • Using both sources of information is better than using a single source of information.
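The metrics above can be computed with scikit-learn; the toy label arrays below are hypothetical, chosen only to illustrate that micro F1 equals accuracy in single-label multi-class problems:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical predictions; in the project these come from the trained network.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

acc = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro")
print(acc, micro_f1)   # micro F1 equals accuracy for single-label problems
```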

4 Limitations and future work

This project first visualized and statistically analyzed the dataset, divided the attribute data into categorical features and continuous features, processed the categorical features as dummy variables, and standardized the continuous features. Then two experiments were carried out.
Experiment 1 uses the attribute dataset to train five models with k-fold cross-validation, with accuracy used to evaluate the models. The accuracy of the models on the test set only reaches about 0.30, and the results show that the neural network may have the best performance. Experiment 2 uses the attribute dataset and the adjacency list dataset, which represents the network topology. The adjacency list is processed into a reachability matrix, and the connection relationship between each sample and every other sample becomes a feature. The neural network has ten layers and the model is trained in batches, using F1-score and accuracy as metrics. In the end, both the **accuracy and Micro F1-score** of the model on the test set reached 0.80, significantly better than all the models in Experiment 1.
The limitation of the final model proposed in this project is that the connection relationship between each sample and any other sample is used as a feature, which brings the number of features to 5,311. If it is a larger social network, the number of features will increase linearly with the number of samples. Therefore, the dimensionality reduction processing of this topology data can be a solution, and another solution is to use a graph neural network.
