This will be a notebook where I explore a data set with multiple algorithms and see what observations can be made
This algorithm (k-nearest neighbors, or KNN) classifies a data point by looking at the K training points closest to it in feature space. Changing K changes how many neighbors are consulted. The class that occurs most often among those neighbors is assigned to the data point.
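To make the voting step concrete, here is a minimal pure-Python sketch of a nearest-neighbor vote. The function name `knn_predict` and the toy points are made up for illustration; this is not how scikit-learn implements it internally, only the idea behind it.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbors and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
print(knn_predict(points, labels, (4.9, 5.0), k=3))
```

Because the vote is driven purely by distance, features measured on very different scales can dominate the result, which is worth keeping in mind for the columns used below.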
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the data set
data = pd.read_csv('data/6class.csv')
# Label encode the star color and spectral class
le = LabelEncoder()
data['Star color'] = le.fit_transform(data['Star color'])
data['Spectral Class'] = le.fit_transform(data['Spectral Class'])
X = data.drop(['Star color'], axis=1)
y = data['Star color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now let us take a look at what these neighborhoods look like
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data['Star color'], data['Radius(R/Ro)'], data['Absolute magnitude(Mv)'], c=data['Star color'], marker='o')
ax.set_xlabel('Color')
ax.set_ylabel('Radius(R/Ro)')
ax.set_zlabel('Absolute Magnitude(Mv)')
plt.show()
There appears to be an anomaly in the earlier colors: many stars share the same color and radius but vary widely in magnitude. Interesting.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data['Spectral Class'], data['Radius(R/Ro)'], data['Absolute magnitude(Mv)'], c=data['Star color'], marker='o')
ax.set_xlabel('Spectral Class')
ax.set_ylabel('Radius(R/Ro)')
ax.set_zlabel('Absolute Magnitude(Mv)')
plt.show()
The same anomaly appears again. I am assuming that those stars of that color also share the same spectral class.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data['Temperature (K)'], data['Radius(R/Ro)'], data['Absolute magnitude(Mv)'], c=data['Star color'], marker='o')
ax.set_xlabel('Temperature (K)')
ax.set_ylabel('Radius(R/Ro)')
ax.set_zlabel('Absolute Magnitude(Mv)')
plt.show()
So those stars share the same class, color, and temperature. Another observation that can be made is that as radius and temperature increase, magnitude appears to decrease (a lower absolute magnitude means a brighter star).
# Train the model
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
# Test the model
y_pred = knn.predict(X_test)
# Evaluate the model
score = knn.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.75
What does this mean? Looking back, the target here is star color. Star colors are discrete values, meaning they behave as steps and/or levels rather than slopes or gradients, which is what a classifier expects.
So as we can see, by using 2 neighbors, we have a fairly decent (for this case) accuracy of 75%.
Let us look at a different feature to test this classifier a bit more.
X = data.drop(['Temperature (K)'], axis='columns')
y = data['Temperature (K)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Test the model
y_pred = knn.predict(X_test)
# Evaluate the model
score = knn.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.0
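0% accuracy. What exactly is going on here? My first reading was that the data is inaccessible without temperature, as if temperature were the most salient feature. A more likely explanation is that temperature is a continuous measurement: the classifier treats every distinct temperature as its own class, and if few test temperatures also occur in the training set, an exact match becomes nearly impossible.
Let us try a new feature.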
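One check that helps with a result like this is to ask how many test labels occur in the training labels at all, since a classifier can only predict labels it has seen. The sketch below uses made-up temperatures and a hypothetical helper `label_overlap`; the fraction it returns is an upper bound on the accuracy any classifier could reach on that split.

```python
def label_overlap(train_labels, test_labels):
    """Fraction of test labels that occur at least once in the training labels."""
    seen = set(train_labels)
    hits = sum(1 for label in test_labels if label in seen)
    return hits / len(test_labels)


# Made-up temperatures in kelvin, chosen so that no test value repeats a training value
train_temps = [3068, 3042, 2600, 2800, 1939, 25000, 7740, 8570]
test_temps = [2840, 16500, 5300]

print(label_overlap(train_temps, test_temps))
```

In the notebook the same check could be run as `label_overlap(y_train, y_test)` on the temperature split; a value near zero would explain the 0% accuracy without saying anything about feature salience.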
X = data.drop(['Absolute magnitude(Mv)'], axis='columns')
y = data['Absolute magnitude(Mv)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
# Train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Test the model
y_pred = knn.predict(X_test)
# Evaluate the model
score = knn.score(X_test, y_test)
print("Accuracy:", score)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File c:\Users\thewa\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\multiclass.py:200, in check_classification_targets(y)
192 y_type = type_of_target(y, input_name="y")
193 if y_type not in [
194 "binary",
195 "multiclass",
(...)
198 "multilabel-sequences",
199 ]:
--> 200 raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
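The target is continuous, and KNeighborsClassifier only accepts discrete class labels, so it refuses to fit. Base KNN classification is the wrong tool for regression. So what do we do? Well of course, KNR, KNN's older and cooler brother: it predicts a number by averaging the targets of the nearest neighbors.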
from sklearn.neighbors import KNeighborsRegressor
# Train the model
knr = KNeighborsRegressor(n_neighbors=17)
knr.fit(X_train, y_train)
# Test the model
y_pred = knr.predict(X_test)
# Evaluate the model (for a regressor, score() returns R^2, not accuracy)
score = knr.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.8794034361057438
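For a regressor, `score` returns the coefficient of determination (R^2) rather than a share of exact hits, so the number printed here reads as "fraction of variance explained". The sketch below shows both halves of the idea in pure Python with made-up one-dimensional data; `knn_regress` and `r2_score` are illustrative helpers, not the scikit-learn functions.

```python
def knn_regress(train_x, train_y, query, k):
    """Predict a numeric target as the mean target of the k nearest training points (1-D feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(target for _, target in nearest) / k


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up data following y = 2x
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
test_x = [2.5, 4.5]
test_y = [5.0, 9.0]

predictions = [knn_regress(train_x, train_y, x, k=2) for x in test_x]
print(predictions)
print(r2_score(test_y, predictions))
```

An R^2 of 1 means perfect predictions, while always predicting the mean of the test targets gives 0, so 0.88 is a good fit but not "88% of stars predicted correctly".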
We have explored our neighboring stars and seen how the k-neighbors algorithm can help us traverse our data set. We have seen how k-neighbors finds the closest neighbors in a data set, that the classifier needs a discrete target while the regressor handles a continuous one, and how to use k-neighbors to make predictions about new data points. The number of neighbors K is a setting worth tuning, although it was not compared systematically here. Let's press on, and see what else we can learn.
It can be daunting looking at a new data set for the first time.
But thankfully rather than getting lost, the forest can help you find your way.
A random forest is an ensemble method that can also reveal the salient features within a dataset, in other words, which features contribute most to predicting the target.
It does so by building many decision trees, each trained on a random sample of the rows and considering a random subset of features at each split, and recording class "outcomes" at the leaf nodes.
A feature's importance is then based on how much the splits that use it reduce impurity, averaged over all the trees.
https://builtin.com/data-science/random-forest-algorithm
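The quantity behind impurity-based importance can be sketched in a few lines. The helpers `gini` and `impurity_decrease` below are illustrative names, and the labels are made up; a forest accumulates this kind of decrease for every split that uses a feature and then normalizes the totals, which this sketch does not attempt.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = ["red", "red", "blue", "blue"]
# A split that separates the classes perfectly removes all impurity
print(impurity_decrease(parent, ["red", "red"], ["blue", "blue"]))
# A split that leaves both children mixed removes none
print(impurity_decrease(parent, ["red", "blue"], ["red", "blue"]))
```

A feature that often produces splits like the first one ends up with a high importance score; one that mostly produces splits like the second ends up near zero.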
# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
# loading data
df = pd.read_csv('data/6class.csv')
We have to take care of string data, because the machine learning algorithms are allergic.
# Encoding string columns
le = LabelEncoder()
df['Star color'] = le.fit_transform(df['Star color'])
df['Spectral Class'] = le.fit_transform(df['Spectral Class'])
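As a rough picture of what the encoding step does, the sketch below maps each distinct string to an integer by numbering the sorted distinct values from 0. The helper `fit_label_encoding` and the color strings are made up for illustration.

```python
def fit_label_encoding(values):
    """Map each distinct value to an integer, numbering the sorted distinct values from 0."""
    classes = sorted(set(values))
    return {value: index for index, value in enumerate(classes)}


# Made-up color strings for illustration
colors = ["Red", "Blue", "White", "Red", "Blue"]
mapping = fit_label_encoding(colors)
encoded = [mapping[color] for color in colors]

print(mapping)
print(encoded)
```

Two caveats follow from this: the integers carry an alphabetical order that has no physical meaning, and strings that differ only in case or spacing would receive different codes. Since the same `le` object is fitted twice above, it only remembers the mapping for the column fitted last.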
We will be judging a star by its color, to tell if it is Hot or Not. In other words, we will be running a Random Forest Classifier to determine the salience of each feature in determining a star's color.
# defining the target and feature data
X = df.drop('Star color', axis=1)
y = df['Star color']
# training the model
model = RandomForestClassifier()
model.fit(X, y)
# checking the feature importance scores
feature_importance = model.feature_importances_
# printing the most important features
for name, importance in zip(X.columns, feature_importance):
    print(name, ':', importance)
# visualize
plt.figure(figsize=(15, 15))  # set the size before drawing so it applies to this figure
plt.bar(X.columns, feature_importance)
plt.title('Feature Importance in predicting Star Color')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.show()
Temperature (K) : 0.30311045924691504
Luminosity(L/Lo) : 0.15382759275535335
Radius(R/Ro) : 0.10960999576549249
Absolute magnitude(Mv) : 0.11047568775637619
Star type : 0.09795726399720774
Spectral Class : 0.2250190004786552
It appears that a star's temperature is the most important feature in deciding its color (about 0.30), with spectral class next (about 0.23). What an intriguing way to view the data! Let us press forward...
Using what I have experimented with in the data and what I have discovered through my resources, I wonder if it is possible to predict the likelihood of a Goldilocks star within a given data set.
This resource provided me with reliable information on what to look for: https://iopscience.iop.org/article/10.3847/2041-8213/ab0651/meta
It mentions that the majority of Goldilocks stars are of the K class, so we will be training a model to accurately predict the Spectral Class of a given star.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import BaggingClassifier
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('data/6class.csv')
le = LabelEncoder()
df['Star color'] = le.fit_transform(df['Star color'])
df['Spectral Class'] = le.fit_transform(df['Spectral Class'])
# Goldilocks stars are mostly K class
X = df.drop('Spectral Class', axis=1)
y = df['Spectral Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Why Logistic Regression? The choice depends on whether the target is discrete or continuous. Spectral class comes in discrete types, so this is a classification problem: we will be using a Logistic Regression model rather than a Linear Regression model, which predicts continuous values.
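The way a logistic model turns numbers into a discrete class can be sketched as follows. This assumes a one-vs-rest setup, where each class gets its own linear score that is squashed into a probability and the highest one wins; the weights, class names, and the helper `predict_one_vs_rest` are hypothetical, not values learned from the star data.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_one_vs_rest(features, weights_by_class, bias_by_class):
    """Score each class with its own linear model and return the highest-probability class."""
    probabilities = {}
    for label, weights in weights_by_class.items():
        score = sum(w * x for w, x in zip(weights, features)) + bias_by_class[label]
        probabilities[label] = sigmoid(score)
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


# Hypothetical weights for two made-up classes and two features
weights_by_class = {"K": [2.0, -1.0], "M": [-2.0, 1.0]}
bias_by_class = {"K": 0.0, "M": 0.0}

label, probabilities = predict_one_vs_rest([1.0, 0.5], weights_by_class, bias_by_class)
print(label, probabilities)
```

A linear regression would stop at the raw score, which has no natural interpretation as "class K" or "class M"; the sigmoid and the final comparison are what make the output a class.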
clf = LogisticRegression(solver="liblinear", max_iter=100, verbose=True)
clf.fit(X_train, y_train)
[LibLinear]
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
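# Note: labels=np.unique(y_pred) scores only the classes that were predicted, so this F1 is not directly comparable to the precision and recall above
f1 = f1_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))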
# Printing the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
Accuracy: 0.875
Precision: 0.797183794466403
Recall: 0.875
F1-score: 0.9279673814557535
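An F1 score higher than both precision and recall looks odd, and the likely cause is that the F1 call passes `labels=np.unique(y_pred)` while the other two metrics do not. The pure-Python sketch below, with made-up labels and illustrative helper names, shows how leaving out a class that is never predicted raises a support-weighted F1. It mirrors the idea only and is not scikit-learn's implementation.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, using 0.0 where a ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_f1(y_true, y_pred, labels=None):
    """Support-weighted mean of per-class F1 over `labels` (default: every label seen)."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = sum(support[label] for label in labels)
    return sum(support[label] * per_class_scores(y_true, y_pred, label)[2] for label in labels) / total


# Made-up labels: class "c" appears in the truth but is never predicted
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "a"]

print(weighted_f1(y_true, y_pred))                               # all classes
print(weighted_f1(y_true, y_pred, labels=sorted(set(y_pred))))   # predicted classes only
```

Here the score rises from 0.72 to 0.9 once the missed class is excluded, so for a like-for-like comparison all three metrics should be computed over the same set of labels.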
I like these results, but they could be better. Finding a new habitable star depends on it, so we cannot be sloppy.
Idea adapted from: https://towardsdatascience.com/logistic-regression-model-tuning-with-scikit-learn-part-1-425142e01af5
# Create first pipeline for base without reducing features.
pipe = Pipeline([('classifier' , RandomForestClassifier())])
# Create param grid.
param_grid = [
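# Note: the default solver (lbfgs) does not support 'l1', so those candidates fail to fit and score as NaN
{'classifier' : [LogisticRegression()],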
'classifier__penalty' : ['l2', 'l1'],
'classifier__C' : np.logspace(-4, 4, 20)},
{'classifier' : [RandomForestClassifier()],
'classifier__n_estimators' : list(range(10,101,10)),
'classifier__max_features' : list(range(6,32,5))}
]
# Create grid search object
clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)
# Fit on data
best_clf = clf.fit(X_train, y_train)
preds = best_clf.predict(X_test)
# Create second pipeline for feature reduction
pipe_2 = Pipeline([('reducer', PCA()),
('classifier', RandomForestClassifier())])
# Create a param grid for the second pipeline
param_grid_2 = [
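# Note: the default solver (lbfgs) does not support 'l1', so those candidates fail to fit and score as NaN
{'classifier' : [LogisticRegression()],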
'classifier__penalty' : ['l2', 'l1'],
'classifier__C' : np.logspace(-4, 4, 20)},
# Note: X has 6 columns, so PCA cannot produce 9 or 12 components and those candidates fail to fit
{'reducer__n_components' : [6,9,12],
'classifier' : [RandomForestClassifier()],
'classifier__n_estimators' : list(range(10,101,10)),
'classifier__max_features' : list(range(6,32,5))}
]
# Create grid search object
clf_2 = GridSearchCV(pipe_2, param_grid = param_grid_2, cv = 5, verbose=True, n_jobs=-1)
# Fit on data (note: this is the full X, y, which includes the test rows)
best_clf_2 = clf_2.fit(X, y)
# Predict on test data
preds_2 = best_clf_2.predict(X_test)
# Create third pipeline for ensemble
pipe_3 = Pipeline([('ensemble', BaggingClassifier())])
# Create param grid for the third pipeline
# Note: scikit-learn 1.2 renamed BaggingClassifier's base_estimator parameter to estimator
param_grid_3 = {'ensemble__base_estimator': [LogisticRegression(), RandomForestClassifier()],
'ensemble__n_estimators' : [10,20,100],
'ensemble__max_samples': [0.5, 0.7, 1.0]}
# Create grid search object
clf_3 = GridSearchCV(pipe_3, param_grid = param_grid_3, cv = 5, verbose=True, n_jobs=-1)
# Fit on data (note: this is the full X, y, which includes the test rows)
best_clf_3 = clf_3.fit(X, y)
# Predict on test data
preds_3 = best_clf_3.predict(X_test)
print('Model 1 Results:', accuracy_score(y_test,preds))
print('Model 2 Results:', accuracy_score(y_test,preds_2))
print('Model 3 Results:', accuracy_score(y_test,preds_3))
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Fitting 5 folds for each of 220 candidates, totalling 1100 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Model 1 Results: 0.8958333333333334
Model 2 Results: 1.0
Model 3 Results: 0.9791666666666666
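The candidate counts in the log can be reproduced by hand: each dictionary in a grid contributes the product of its option counts, the dictionaries are summed, and the total is multiplied by the number of folds. The sketch below uses a hypothetical helper `count_candidates` and placeholder values whose list lengths match the first grid above.

```python
from itertools import product


def count_candidates(param_grid):
    """Count parameter combinations: product of option counts per grid, summed over grids."""
    grids = param_grid if isinstance(param_grid, list) else [param_grid]
    return sum(len(list(product(*grid.values()))) for grid in grids)


# Option lists with the same lengths as the first grid above; the values are placeholders
grid_1 = [
    {"classifier": ["logreg"], "penalty": ["l2", "l1"], "C": list(range(20))},
    {"classifier": ["forest"],
     "n_estimators": list(range(10, 101, 10)),
     "max_features": list(range(6, 32, 5))},
]

candidates = count_candidates(grid_1)
print(candidates, candidates * 5)  # candidates, and total fits with cv=5
```

That gives 40 + 60 = 100 candidates and 500 fits, matching the first line of the log. The count includes combinations that fail to fit, so it says nothing about how many candidates produced a usable score.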
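WOW. Model 2 (PCA feature reduction) reports an accuracy of 100%! That number needs a caveat, though: Models 2 and 3 were fit on the full X and y, which include the test rows, so the grid search had already seen the stars it was then scored on. Only Model 1, which was fit on X_train, gives an honest comparison with the earlier 0.875, and it improves on it modestly (about 0.896).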
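One thing to keep in mind when reading these scores is that Models 2 and 3 are fit on the full `X, y`, which contains the test rows. The toy sketch below shows how much that can inflate a result: a 1-nearest-neighbor model simply memorizes any test point it was trained on. The data and the helper names are made up for illustration.

```python
import math


def nearest_label(train_points, train_labels, query):
    """1-nearest-neighbor prediction: return the label of the closest training point."""
    distances = [math.dist(point, query) for point in train_points]
    return train_labels[distances.index(min(distances))]


def accuracy(train_points, train_labels, test_points, test_labels):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(
        nearest_label(train_points, train_labels, point) == label
        for point, label in zip(test_points, test_labels)
    )
    return hits / len(test_points)


# Made-up data: each test point sits closer to the other class's training points
train_points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
train_labels = ["A", "A", "B", "B"]
test_points = [(3.2, 0.0), (1.8, 0.0)]
test_labels = ["A", "B"]

honest = accuracy(train_points, train_labels, test_points, test_labels)
leaky = accuracy(train_points + test_points, train_labels + test_labels, test_points, test_labels)
print(honest, leaky)
```

In the notebook, the like-for-like version would be to call `clf_2.fit(X_train, y_train)` and `clf_3.fit(X_train, y_train)` before predicting on `X_test`; the resulting scores are not known here and would need to be rerun.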
With a properly held-out evaluation confirming the best model, we could go ahead and use it to search for new star systems that are likely to be habitable.
We might investigate more advanced deep learning architectures in the future to solve this issue. To further enhance the results, we might add ensemble methods such as stacking and boosting. Additionally, to speed up and improve the efficiency of an algorithmic pipeline, we may consider distributed computing strategies such as Spark. In the end, everything is based on the project's requirements and scope.
https://iopscience.iop.org/article/10.3847/2041-8213/ab0651/meta
https://towardsdatascience.com/logistic-regression-model-tuning-with-scikit-learn-part-1-425142e01af5