serengil / chefboost

A Lightweight Decision Tree Framework supporting regular algorithms: ID3, C4.5, CART, CHAID and Regression Trees; some advanced techniques: Gradient Boosting, Random Forest and Adaboost, with categorical feature support, for Python

Home Page: https://www.youtube.com/watch?v=Z93qE5eb6eg&list=PLsS_1RYmYQQHp_xZObt76dpacY543GrJD&index=3

License: MIT License

Python 99.66% Makefile 0.10% Shell 0.23%
decision-trees gradient-boosting gradient-boosting-machine random-forest adaboost id3 c45-trees cart regression-tree gbm data-mining gradient-boosting-machines data-science kaggle gbdt gbrt machine-learning python categorical-features

chefboost's Introduction

👨‍🍳 ChefBoost


ChefBoost is a lightweight decision tree framework for Python with categorical feature support. It covers regular decision tree algorithms: ID3, C4.5, CART, CHAID and regression trees; also some advanced techniques: gradient boosting, random forest and adaboost. You just need to write a few lines of code to build decision trees with ChefBoost.

Installation - Demo

The easiest way to install the ChefBoost framework is to download it from PyPI. It will install the library itself and its prerequisites as well.

pip install chefboost

Then, you will be able to import the library and use its functionalities:

from chefboost import Chefboost as chef

Usage - Demo

Basically, you just need to pass the dataset as a pandas data frame and the optional tree configurations as illustrated below.

from chefboost import Chefboost as chef
import pandas as pd

df = pd.read_csv("dataset/golf.txt")
config = {'algorithm': 'C4.5'}
model = chef.fit(df, config = config, target_label = 'Decision')

Pre-processing

ChefBoost handles both numeric and nominal features and target values, in contrast to its alternatives. So, you don't have to apply any pre-processing to build trees.
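For instance, a data frame mixing nominal and numeric columns can be passed straight to the fit function. A minimal sketch, with made-up toy columns:

import pandas as pd
from chefboost import Chefboost as chef

# a toy frame mixing nominal (Outlook, Wind) and numeric (Temperature) features;
# no label encoding or one-hot encoding is needed beforehand
df = pd.DataFrame({
    "Outlook": ["Sunny", "Rain", "Overcast", "Sunny"],
    "Temperature": [85, 70, 83, 72],
    "Wind": ["Weak", "Strong", "Weak", "Strong"],
    "Decision": ["No", "Yes", "Yes", "No"],
})

model = chef.fit(df, config = {'algorithm': 'C4.5'}, target_label = 'Decision')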

Outcomes

Built decision trees are stored as Python if statements in the tests/outputs/rules directory. A sample of decision rules is shown below.

def findDecision(Outlook, Temperature, Humidity, Wind):
   if Outlook == 'Rain':
      if Wind == 'Weak':
         return 'Yes'
      elif Wind == 'Strong':
         return 'No'
      else:
         return 'No'
   elif Outlook == 'Sunny':
      if Humidity == 'High':
         return 'No'
      elif Humidity == 'Normal':
         return 'Yes'
      else:
         return 'Yes'
   elif Outlook == 'Overcast':
      return 'Yes'
   else:
      return 'Yes'

Testing for custom instances

Decision rules will be stored in the outputs/rules/ folder when you build decision trees. You can run the built decision tree on new instances as illustrated below.

prediction = chef.predict(model, param = ['Sunny', 'Hot', 'High', 'Weak'])

You can consume built decision trees directly as well. In this way, you can restore already built decision trees and skip the learning steps, or apply transfer learning. Loaded trees offer you the findDecision method to test new instances.

module_name = "outputs/rules/rules" #this will load outputs/rules/rules.py
tree = chef.restoreTree(module_name)
prediction = tree.findDecision(['Sunny', 'Hot', 'High', 'Weak'])

tests/global-unit-test.py will guide you on how to build different decision trees and make predictions.

Model save and restoration

You can save your trained models. This makes your model ready for transfer learning.

chef.save_model(model, "model.pkl")

In this way, you can use the same model later to just make predictions. This skips the training steps. Restoration requires the .py and .pkl files to be stored under outputs/rules.

model = chef.load_model("model.pkl")
prediction = chef.predict(model, ['Sunny',85,85,'Weak'])

Sample configurations

ChefBoost supports several decision tree, bagging and boosting algorithms. You just need to pass the configuration to use different algorithms.

Regular Decision Trees

Regular decision tree algorithms find the best feature and the best split point that maximize the information gain, then build the tree recursively in the child nodes.

config = {'algorithm': 'C4.5'} #Set algorithm to ID3, C4.5, CART, CHAID or Regression
model = chef.fit(df, config)

The following regular decision tree algorithms are wrapped in the library.

Algorithm    Metric                     Tutorial   Demo
ID3          Entropy, Information Gain  Tutorial   Demo
C4.5         Entropy, Gain Ratio        Tutorial   Demo
CART         GINI                       Tutorial   Demo
CHAID        Chi Square                 Tutorial   Demo
Regression   Standard Deviation         Tutorial   Demo
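
To make the metric column concrete, here is a minimal, illustrative sketch of the entropy and information gain that ID3 maximizes, written with plain pandas; it is not ChefBoost's internal implementation.

import math
import pandas as pd

def entropy(labels: pd.Series) -> float:
    # H(S) = -sum(p * log2(p)) over the class proportions p
    probs = labels.value_counts(normalize = True)
    return -sum(p * math.log2(p) for p in probs)

def information_gain(df: pd.DataFrame, feature: str, target: str = 'Decision') -> float:
    # Gain(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over the values v of feature A
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return entropy(df[target]) - weighted

For example, information_gain(df, 'Outlook') on the golf frame scores the Outlook split the way ID3 would.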

Gradient Boosting Tutorial, Demo

Gradient boosting is basically based on building a tree, and then building another based on the previous one's errors. In this way, it boosts results. The final prediction is the sum of each tree's prediction result.

config = {'enableGBM': True, 'epochs': 7, 'learning_rate': 1, 'max_depth': 5}
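
Conceptually, each epoch fits the next tree to the residuals of the running prediction, and predictions accumulate scaled by the learning rate. The loop below is only a sketch of that idea; the tree_output line is a stand-in for fitting a real regression tree, not ChefBoost's gbm module.

# illustrative boosting loop on three made-up target values
actual = [10.0, 12.0, 8.0]
prediction = [0.0, 0.0, 0.0]   # running sum of tree outputs
learning_rate = 1.0

for epoch in range(7):
    residuals = [y - p for y, p in zip(actual, prediction)]
    tree_output = [0.5 * r for r in residuals]   # stand-in for a fitted regression tree
    prediction = [p + learning_rate * t for p, t in zip(prediction, tree_output)]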

Random Forest Tutorial, Demo

Random forest basically splits the data set into several sub data sets and builds a different tree for each sub data set. Predictions will be the average of each tree's prediction result.

config = {'enableRandomForest': True, 'num_of_trees': 5}
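
The bagging idea behind it can be sketched as: draw bootstrap samples, train one tree per sample, and combine the per-tree outputs. In the sketch below, fit_tree and predict_tree are hypothetical stand-ins for a tree learner, not ChefBoost functions.

import pandas as pd

def bagged_prediction(df: pd.DataFrame, instance, fit_tree, predict_tree, num_of_trees = 5):
    predictions = []
    for i in range(num_of_trees):
        # bootstrap: sample with replacement, same size as the original data set
        sample = df.sample(frac = 1.0, replace = True, random_state = i)
        tree = fit_tree(sample)
        predictions.append(predict_tree(tree, instance))
    # average for regression; classification would take a majority vote instead
    return sum(predictions) / len(predictions)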

Adaboost Tutorial, Demo

Adaboost applies a decision stump instead of a decision tree. This is a weak classifier that aims to score at least 50%. It then increases the weights of misclassified instances and decreases the weights of correctly classified ones. In this way, it aims to reach a high score with weak classifiers.

config = {'enableAdaboost': True, 'num_of_weak_classifier': 4}
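
The reweighting step can be sketched as follows: the stump's weighted error determines its vote alpha, misclassified instances get exponentially larger weights, and correctly classified ones get smaller weights, so the next stump focuses on the hard cases. This is a generic AdaBoost sketch, not ChefBoost's exact code.

import math

def update_weights(weights, actual, predicted):
    # epsilon: weighted error rate of the current weak classifier, assumed to be in (0, 1)
    epsilon = sum(w for w, y, p in zip(weights, actual, predicted) if y != p)
    alpha = 0.5 * math.log((1 - epsilon) / epsilon)   # the classifier's vote
    new_weights = [
        w * math.exp(-alpha if y == p else alpha)
        for w, y, p in zip(weights, actual, predicted)
    ]
    total = sum(new_weights)
    return [w / total for w in new_weights], alpha   # weights normalized to sum to 1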

Feature Importance - Demo

Decision trees are naturally interpretable and explainable algorithms: a decision made by a single tree is clear. Still, we need some extra layers to understand the built models. Besides, random forest and GBM are hard to explain. Herein, feature importance is one of the most common ways to see the big picture and understand built models.

df = chef.feature_importance("outputs/rules/rules.py")

feature      final_importance
Humidity     0.3688
Wind         0.3688
Outlook      0.2624
Temperature  0.0000
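
Roughly speaking, a split's contribution to a feature's importance can be read off the node metadata that ChefBoost writes as JSON comments into rules.py: the metric mass at the node minus the metric mass remaining in its direct children, summed over all nodes splitting on that feature and then normalized. The helper below only sketches that idea and is not the library's exact formula.

def node_importance(node: dict, children: list) -> float:
    # node and children entries look like {"feature": ..., "instances": ..., "metric_value": ...}
    parent_score = node["instances"] * node["metric_value"]
    child_score = sum(c["instances"] * c["metric_value"] for c in children)
    return parent_score - child_score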

Parallelism

ChefBoost offers parallelism to speed up model building. Branches of a decision tree will be created in parallel in this way. You should set the enableParallelism argument to False in the configuration if you don't want to use parallelism. Its default value is True. If parallelism is enabled, it allocates half of the total number of cores in your environment.

if __name__ == '__main__':
   config = {'algorithm': 'C4.5', 'enableParallelism': True, 'num_cores': 2}
   model = chef.fit(df, config)

Notice that you have to place the training step in an if block that checks whether you are in the main module.

To disable parallelism, set the parameter to False.

config = {'algorithm': 'C4.5', 'enableParallelism': False}
model = chef.fit(df, config)

Contribution Tests

Pull requests are more than welcome! You should run the unit tests and linting locally with the make test and make lint commands before creating a PR. Once a PR is created, the GitHub test workflow will run automatically and unit test results will be available in GitHub Actions before approval.

Support

There are many ways to support a project - starring⭐️ the GitHub repos is just one 🙏

You can also support this work on Patreon

Citation

Please cite ChefBoost in your publications if it helps your research. Here is an example BibTeX entry:

@misc{serengil2021chefboost,
  author       = {Serengil, Sefik Ilkin},
  title        = {ChefBoost: A Lightweight Boosted Decision Tree Framework},
  month        = oct,
  year         = 2021,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.5576203},
  howpublished = {https://doi.org/10.5281/zenodo.5576203}
}

Also, if you use chefboost in your GitHub projects, please add chefboost to your requirements.txt.

Licence

ChefBoost is licensed under the MIT License - see LICENSE for more details.

chefboost's People

Contributors

anapaulamendes, jannisbush, nurettin, serengil


chefboost's Issues

spawn makes it unable to run on Linux

I guess it's because of this line in Chefboost.py: set_start_method("spawn", force=True)

I'm on Linux, and I'm unable to run chef.fit either in Jupyter or in a main block (if __name__ == '__main__':) unless I disable parallelism (enableParallelism: False)

Parallelism does not work properly

Hi,

I'm testing the library using the following code snippet:

config = {
    'algorithm': 'C4.5',
    #'enableParallelism': True, 'num_cores': 32,
}
model = chef.fit(df, config = config)

Which prints: "finished in 9.606534719467163 seconds"

Then, enabling parallelism by uncommenting the line in the config, it never finishes but uses 100% of the CPU for a really long time - much more than 10 seconds

Getting KeyError: 'Decision'

Trying to find the gain for SepalLengthCm in the Iris dataset, following this.

from chefboost.training import Training

config = {'algorithm': 'ID3'}
sorted(df['SepalLengthCm'].unique())

threshold = 6.0
idx = df[df['SepalLengthCm'] <= threshold].index
tmp_df = df.copy()
tmp_df['SepalLengthCm'] = '>'+str(threshold)
tmp_df.loc[idx, 'SepalLengthCm'] = '<='+str(threshold)

gain = Training.findGains(tmp_df, config)
print(threshold, ': ', gain)

Also

df = iris[['SepalLengthCm', 'y']]

When running this I get the following error

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\core\indexes\base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\_libs\index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\_libs\index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Decision'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Input In [10], in <cell line: 7>()
      4 tmp_df['SepalLengthCm'] = '>'+str(threshold)
      5 tmp_df.loc[idx, 'SepalLengthCm'] = '<='+str(threshold)
----> 7 gain = Training.findGains(tmp_df, config)
      8 print(threshold, ': ', gain)

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\chefboost\training\Training.py:107, in findGains(df, config)
    104 def findGains(df, config):
    106 	algorithm = config['algorithm']
--> 107 	decision_classes = df["Decision"].unique()
    109 	#-----------------------------
    111 	entropy = 0

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\core\frame.py:3505, in DataFrame.__getitem__(self, key)
   3503 if self.columns.nlevels > 1:
   3504     return self._getitem_multilevel(key)
-> 3505 indexer = self.columns.get_loc(key)
   3506 if is_integer(indexer):
   3507     indexer = [indexer]

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\core\indexes\base.py:3623, in Index.get_loc(self, key, method, tolerance)
   3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:
-> 3623     raise KeyError(key) from err
   3624 except TypeError:
   3625     # If we have a listlike key, _check_indexing_error will raise
   3626     #  InvalidIndexError. Otherwise we fall through and re-raise
   3627     #  the TypeError.
   3628     self._check_indexing_error(key)

KeyError: 'Decision'

findDecision(obj) and accuracy giving weird results

I try to run this code:

import Chefboost as chef
import pandas as pd

if __name__ == "__main__":
    df = pd.read_csv("dataset/golf.txt")
    
    config = {'algorithm': 'C4.5'}
    model = chef.fit(df, config)

and then when I check outputs/rules/rules.py this is what I get:

def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[2] == 'Rain':
      if obj[0] == 'Weak':
         return 'Yes'
      elif obj[0] == 'Strong':
         return 'No'
   elif obj[2] == 'Sunny':
      if obj[1] == 'High':
         return 'No'
      elif obj[1] == 'Normal':
         return 'Yes'
   elif obj[2] == 'Overcast':
      return 'Yes'

obj[0] isn't Outlook, but Wind. Also, sometimes I get 0% accuracy after running the code 2 or 3 times.

'numpy.float32' object has no attribute 'is_integer'

Tried to do the following on a dataset with float samples. (Running on Python 3.7)

configGBM = {'algorithm': 'C4.5', 'enableGBM': True, 'epochs': 7, 'learning_rate': 1, 'max_depth': 5}
modelGBM = chef.fit(train, config = configGBM)

Error Log:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/vk/rw3fbc110n3fsf_xhz6_r4m00000gn/T/ipykernel_67628/3037199772.py in <module>
      1 configGBM = {'algorithm': 'C4.5', 'enableGBM': True, 'epochs': 7, 'learning_rate': 1, 'max_depth': 5}
----> 2 modelGBM = chef.fit(train, config = configGBM)
    
/usr/local/lib/python3.7/site-packages/chefboost/Chefboost.py in fit(df, config, target_label, validation_df)
    190 
    191                 if df['Decision'].dtypes == 'object': #transform classification problem to regression
--> 192                         trees, alphas = gbm.classifier(df, config, header, dataset_features, validation_df = validation_df, process_id = process_id)
    193                         classification = True
    194 

/usr/local/lib/python3.7/site-packages/chefboost/tuning/gbm.py in classifier(df, config, header, dataset_features, validation_df, process_id)
    270                                 instance['P_'+str(j)] = probabilities[j]
    271 
--> 272                         worksheet.loc[row] = instance
    273 
    274                 for i in range(0, len(classes)):

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    721 
    722         iloc = self if self.name == "iloc" else self.obj.iloc
--> 723         iloc._setitem_with_indexer(indexer, value, self.name)
    724 
    725     def _validate_key(self, key, axis: int):

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
   1728         if take_split_path:
   1729             # We have to operate column-wise
-> 1730             self._setitem_with_indexer_split_path(indexer, value, name)
   1731         else:
   1732             self._setitem_single_block(indexer, value, name)

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer_split_path(self, indexer, value, name)
   1795                 # We are setting multiple columns in a single row.
   1796                 for loc, v in zip(ilocs, value):
-> 1797                     self._setitem_single_column(loc, v, pi)
   1798 
   1799             elif len(ilocs) == 1 and com.is_null_slice(pi) and len(self.obj) == 0:

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_single_column(self, loc, value, plane_indexer)
   1918             # set the item, possibly having a dtype change
   1919             ser = ser.copy()
-> 1920             ser._mgr = ser._mgr.setitem(indexer=(pi,), value=value)
   1921             ser._maybe_update_cacher(clear=True)
   1922 

/usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in setitem(self, indexer, value)
    353 
    354     def setitem(self: T, indexer, value) -> T:
--> 355         return self.apply("setitem", indexer=indexer, value=value)
    356 
    357     def putmask(self, mask, new, align: bool = True):

/usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    325                     applied = b.apply(f, **kwargs)
    326                 else:
--> 327                     applied = getattr(b, f)(**kwargs)
    328             except (TypeError, NotImplementedError):
    329                 if not ignore_failures:

/usr/local/lib/python3.7/site-packages/pandas/core/internals/blocks.py in setitem(self, indexer, value)
    924         # coerce if block dtype can store value
    925         values = self.values
--> 926         if not self._can_hold_element(value):
    927             # current dtype cannot store value, coerce to common dtype
    928             return self.coerce_to_target_dtype(value).setitem(indexer, value)

/usr/local/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _can_hold_element(self, element)
    620         """require the same dtype as ourselves"""
    621         element = extract_array(element, extract_numpy=True)
--> 622         return can_hold_element(self.values, element)
    623 
    624     @final

/usr/local/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in can_hold_element(arr, element)
   2181         if tipo is not None:
   2182             if tipo.kind not in ["i", "u"]:
-> 2183                 if is_float(element) and element.is_integer():
   2184                     return True
   2185                 # Anything other than integer we cannot hold

AttributeError: 'numpy.float32' object has no attribute 'is_integer'

Unreasonable training time when I make a simple change

So I am training a CHAID decision tree for multiclass classification, and the target variable is a string. Other than the target, I have 4 other features, two of which I want to be string type. When I train the model with only one feature as string, training takes about 15 minutes. But when I convert the other feature I wish to be treated as categorical to string, training takes forever (an entire day with no result).

What could be causing this?

'Series' object has no attribute 'Decision'

When running the golf example:

df = pd.read_csv("data/golf.txt")
config = {'algorithm': 'C4.5'}
model = chef.fit(df, config = config, target_label = 'Decision')

I get the following error:

[INFO]:  10 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_28440\547795482.py in ?()
     10 import pandas as pd
     11 
     12 df = pd.read_csv("data/golf.txt")
     13 config = {'algorithm': 'C4.5'}
---> 14 model = chef.fit(df, config = config, target_label = 'Decision')

C:\Lib\site-packages\chefboost\Chefboost.py in ?(df, config, target_label, validation_df)
    209                 if enableParallelism == True:
    210                         json_file = "outputs/rules/rules.json"
    211                         functions.createFile(json_file, "[\n")
    212 
--> 213 		trees = Training.buildDecisionTree(df, root = root, file = file, config = config
    214                                 , dataset_features = dataset_features
    215 				, parent_level = 0, leaf_id = 0, parents = 'root', validation_df = validation_df, main_process_id = process_id)
    216 

C:\Lib\site-packages\chefboost\\chefboost\training\Training.py in ?(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
    432                 pivot = pd.DataFrame(subdataset.Decision.value_counts()).reset_index()
    433                 pivot = pivot.rename(columns = {"Decision": "Instances","index": "Decision"})
    434                 pivot = pivot.sort_values(by = ["Instances"], ascending = False).reset_index()
    435 
--> 436                 else_decision = "return '%s'" % (pivot.iloc[0].Decision)
    437 
    438                 if enableParallelism != True:
    439                         functions.storeRule(file,(functions.formatRule(root), "else:"))

C:\Lib\site-packages\chefboost\Lib\site-packages\pandas\core\generic.py in ?(self, name)
   5985             and name not in self._accessors
   5986             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5987         ):
   5988             return self[name]
-> 5989         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'Decision'

I think someone ran into the same issue on Stack Overflow.

Any Tree Traversal API or Example?

I am interested in plotting chef trees, particularly the decision path for a sample.

A generic traversal iterator would allow users to dump rules in different formats or create various plots with networkx/pygraphviz/matplotlib/dtreeviz/treeinterpreter, e.g. https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/39772170

  1. Is there an example of a DFS/BFS generator for traversing the nodes?

ex sklearn DFS via structure & decision path
https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html

ex sklearn BFS generator
https://stackoverflow.com/questions/61203080/traversal-of-sklearn-decision-tree

  2. Does chef have anything like decision_path() in scikit?

decision_path()
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.decision_path

I don't mind adding this; I'm looking for a guide to the internals of chefboost - reconstructRules might be the closest to a traversal?

Reference
#20
#2

findDecision incorrect?

I have a CSV with pre-calculated cosine distances between face embeddings of people's images in my dataset, like this:

       Person1     Person2  Idx1  Idx2  Distance Decision
0   Aaron Paul  Aaron Paul     0     1    0.3245      Yes
1   Aaron Paul  Aaron Paul     0     2    0.2281      Yes
2   Aaron Paul  Aaron Paul     0     3    0.4737      Yes
3   Aaron Paul  Aaron Paul     0     4    0.4103      Yes
4   Aaron Paul  Aaron Paul     0     5    0.3236      Yes
5   Aaron Paul  Aaron Paul     0     6    0.3270      Yes
6   Aaron Paul  Aaron Paul     0     7    0.4873      Yes
7   Aaron Paul  Aaron Paul     0     8    0.3988      Yes
8   Aaron Paul  Aaron Paul     1     2    0.2357      Yes
9   Aaron Paul  Aaron Paul     1     3    0.2613      Yes
10  Aaron Paul  Aaron Paul     1     4    0.3827      Yes
11  Aaron Paul  Aaron Paul     1     5    0.2221      Yes
12  Aaron Paul  Aaron Paul     1     6    0.2183      Yes
13  Aaron Paul  Aaron Paul     1     7    0.4568      Yes
14  Aaron Paul  Aaron Paul     1     8    0.2391      Yes
15  Aaron Paul  Aaron Paul     2     3    0.4439      Yes
16  Aaron Paul  Aaron Paul     2     4    0.4086      Yes
17  Aaron Paul  Aaron Paul     2     5    0.2592      Yes
18  Aaron Paul  Aaron Paul     2     6    0.2863      Yes
19  Aaron Paul  Aaron Paul     2     7    0.4588      Yes

And I use this script to build the findDecision tree:

import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
tqdm.pandas()

if __name__ == '__main__':
	##############################################################################
	# Read the CSV to determine the best threshold...
	df = pd.read_csv(R"\\10.15.20.109\e$\MODELS\ProtecFR\Model\faces2.csv", encoding='UTF8')
	print(df.head(20))

	df1 = df[df['Decision'] == "Yes"]['Distance'].copy()
	df2 = df[df['Decision'] == "No"]['Distance'].copy()
	print(f"Count Yes: {df1.count()}")
	print(f"Average Yes: {round(df1.mean(), 4)}")
	print(f"Std. deviation Yes: {round(df1.std(), 4)}")
	print(f"Min Yes: {round(df1.min(), 4)}")
	print(f"Max Yes: {round(df1.max(), 4)}")
	print(f"Mode Yes: {round(df1.mode()[0], 4)}")

	print(f"Count No: {df2.count()}")
	print(f"Average No: {round(df2.mean(), 4)}")
	print(f"Std. deviation No: {round(df2.std(), 4)}")
	print(f"Min No: {round(df2.min(), 4)}")
	print(f"Max No: {round(df2.max(), 4)}")
	print(f"Mode No: {round(df2.mode()[0], 4)}")

	df1.plot.kde()
	df2.plot.kde()
	plt.legend(["Yes", "No"])
	plt.grid()
	plt.axhline(0,color='red')
	plt.axvline(0,color='red')
	plt.show()

	from chefboost import Chefboost as chef
	config = {'algorithm': 'C4.5'}

	tmp_df = df[['Distance', 'Decision']].copy()
	model = chef.fit(tmp_df, config)
	print (model)

The results I get are:

Count Yes: 108285
Average Yes: 0.4496
Std. deviation Yes: 0.1557
Min Yes: 0.0
Max Yes: 1.0644
Mode Yes: 0.3465

Count No: 59793700
Average No: 0.7976
Std. deviation No: 0.1112
Min No: 0.0
Max No: 1.2973
Mode No: 0.8114

[INFO]:  8 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...
-------------------------
finished in  135.35767483711243  seconds
-------------------------
Evaluate  train set
-------------------------
Accuracy:  99.81929981118321 % on  59901985  instances
Labels:  ['Yes' 'No']
Confusion matrix:  [[43, 1], [108242, 59793699]]
Precision:  97.7273 %, Recall:  0.0397 %, F1:  0.0794 %
{'trees': [<module 'outputs/rules/rules' from 'c:\\DESARROLLOS\\Python\\VID\\outputs/rules/rules.py'>], 'alphas': [], 'config': {'algorithm': 'C4.5', 'enableRandomForest': False, 'num_of_trees': 5, 'enableMultitasking': False, 'enableGBM': False, 'epochs': 10, 'learning_rate': 1, 'max_depth': 3, 'enableAdaboost': False, 'num_of_weak_classifier': 4, 'enableParallelism': True, 'num_cores': 8}, 'nan_values': [['Distance', None]]}

The plot is:

ArcFace-cosine

and outputs/rules/rules.py:

def findDecision(obj): #obj[0]: Distance
	# {"feature": "Distance", "instances": 59901985, "metric_value": 0.0191, "depth": 1}
	if obj[0]>0.0:
		return 'No'
	elif obj[0]<=0.0:
		return 'Yes'
	else: return 'Yes'

As you can see, it gives me a 0.0 threshold when it should be around 0.68.

Am I doing something wrong?

Regards

Only Regression Tree is Built

No matter what configuration I give, it seems to always default to building a Regression Tree.
I've also tried putting in a non-existent value, and there are no issues there: it just throws an error.

Code:

from chefboost import Chefboost as cb
import pandas as pd

df = pd.read_csv("~/Downloads/kmodes_fillna_cluster.csv")

config = {'algorithm': 'C4.5'}
model = cb.fit(df, config)

Output:

Regression  tree is going to be built...
MAE:  0.10815602836879432
RMSE:  0.2325467999874373
Mean:  0.2872340425531915
MAE / Mean:  37.654320987654316 %
RMSE / Mean:  80.96073777340409 %
finished in  1.4060499668121338  seconds

How to visualize

Thanks for chefboost, it helps me a lot, but I want to know how to visualize the decision tree built by chefboost, or how to know the number of leaves.

Indentation error

While using C4.5 I get an indentation error:

config = {'algorithm': 'C4.5'}
model = chef.fit(df, config = config, target_label = 'party')

[INFO]: 1 CPU cores will be allocated in parallel running
C4.5 tree is going to be built...
File "outputs/rules/rules.py", line 37
else: return 'no'
^
IndentationError: expected an indented block

Error for model code

I followed the instructions in the README, but I encounter an error when running the model code. Why might this be happening?
My Python version is 3.11.4.

24-04-12 13:57:40 - [INFO]: 16 CPU cores will be allocated in parallel running
24-04-12 13:57:40 - C4.5 tree is going to be built...

ImportError Traceback (most recent call last)
Cell In[4], line 1
----> 1 model = chef.fit(df, config = config, target_label = 'Decision')

File c:\Users\heaop\Documents\GitHub\chefboost\Chefboost.py:275, in fit(df, config, target_label, validation_df, silent)
272 json_file = "outputs/rules/rules.json"
273 functions.createFile(json_file, "[\n")
--> 275 trees = Training.buildDecisionTree(
276 df,
277 root=root,
278 file=file,
279 config=config,
280 dataset_features=dataset_features,
281 parent_level=0,
282 leaf_id=0,
283 parents="root",
284 validation_df=validation_df,
285 main_process_id=process_id,
286 )
288 if silent is False:
289 logger.info("-------------------------")

File c:\Users\heaop\Documents\GitHub\chefboost\training\Training.py:712, in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
703 if (
...
---> 20 raise ImportError(f"Module '{module_name}' not found")
22 module = importlib.util.module_from_spec(spec)
23 spec.loader.exec_module(module)

ImportError: Module 'outputs/rules/rules' not found

Parallelism does not seem to be working

Hi,

I've used the following code:

if __name__ == '__main__':
    config = {'algorithm': 'Regression', 'enableParallelism' : True, 'enableGBM': True, 'epochs': 10, 'learning_rate': 0.01}
    model = chef.fit(df_tree_train, config)

and when I check my CPU usage I see that only one core is being used. Why aren't all my cores being used?

Q: Are feature engineering tools mixed in for BT, RF, and GB?

From this scikit-learn tutorial it is easy to see that data whose features are not orthogonal to one another often produces subpar results.
There have been tools that help with this by mixing different columns in the dataset through "feature engineering".
Some notable ones include the following libraries:

Q: can this feature builder be used to create classifiers in tests/outputs/rules with engineered features?

Cannot install it

Hello!

I am trying to install chefboost in Windows without any success...

image

max_depth parameter not working

The max_depth parameter seems to not be working. The fit function fits a tree with maximal possible depth regardless of setting.

UnboundLocalError: local variable 'subdataset' referenced before assignment

Hi,

I am facing the below error while running CHAID. Though, when I used ID3, it ran successfully.

[INFO]: 4 CPU cores will be allocated in parallel running
CHAID tree is going to be built...

RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\Rahul.Chandel\Anaconda3\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 209, in createBranchWrapper
return func(*args)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 330, in createBranch
results = buildDecisionTree(subdataset, root, file, config, dataset_features
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 533, in buildDecisionTree
sub_results = createBranchWrapper(createBranch, input_param)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 209, in createBranchWrapper
return func(*args)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 330, in createBranch
results = buildDecisionTree(subdataset, root, file, config, dataset_features
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 533, in buildDecisionTree
sub_results = createBranchWrapper(createBranch, input_param)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 209, in createBranchWrapper
return func(*args)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 330, in createBranch
results = buildDecisionTree(subdataset, root, file, config, dataset_features
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 432, in buildDecisionTree
pivot = pd.DataFrame(subdataset.Decision.value_counts()).reset_index()
UnboundLocalError: local variable 'subdataset' referenced before assignment
"""

The above exception was the direct cause of the following exception:

UnboundLocalError Traceback (most recent call last)
in
7
8 df = df.drop('fill_ratio', axis =1)
----> 9 model = cb.fit(df, config = config)

~\Anaconda3\lib\site-packages\chefboost\Chefboost.py in fit(df, config, target_label, validation_df)
211 functions.createFile(json_file, "[\n")
212
--> 213 trees = Training.buildDecisionTree(df, root = root, file = file, config = config
214 , dataset_features = dataset_features
215 , parent_level = 0, leaf_id = 0, parents = 'root', validation_df = validation_df, main_process_id = process_id)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
519
520 for f in funclist:
--> 521 branch_results = f.get(timeout = 100000)
522
523 for branch_result in branch_results:

~\Anaconda3\lib\multiprocessing\pool.py in get(self, timeout)
769 return self._value
770 else:
--> 771 raise self._value
772
773 def _set(self, i, obj):

UnboundLocalError: local variable 'subdataset' referenced before assignment

What about plotting?

That.

What about plotting, dear friend? You have made awesome work, but it's still missing this great functionality.

I'm new to Python. Are the chefboost outputs prepared to be plotted, showing the tree in a picture, for example?

Thanks in advance!

feature_importance incorrect?

When I checked the function feature_importance(rules) in Chefboost.py, I found that the child_score of a node is calculated through "if child_rule["depth"] == current_depth + 1:". I don't know if I misunderstand the meaning of child_score, but I think the nodes at the next depth may not all be children of the current node.

here is some of my data: (Only in this section does WBC appear)

{"feature": "WBC", "instances": 21, "metric_value": 0.7025, "depth": 5}
if obj[24]<=13.55:
{"feature": "HGB", "instances": 20, "metric_value": 0.6098, "depth": 6}
if obj[26]<=65.0:
{"feature": "Infusion volume", "instances": 19, "metric_value": 0.4855, "depth": 7}
if obj[58]>500.0:
{"feature": "MV_A", "instances": 18, "metric_value": 0.3095, "depth": 8}
................
...............
elif obj[58]<=500.0:
return 'yes'
else: return 'yes'
elif obj[26]>65.0:
return 'yes'
else: return 'yes'
elif obj[24]>13.55:
return 'yes'
else: return 'yes'

I think the feature importance of WBC (before normalization) should be calculated like this:
WBC = 21 * 0.7025 - 20 * 0.6098

but in fact Chefboost calculates it like this:
{"feature": "WBC", "instances": 21, "metric_value": 0.7025, "depth": 5}
child: {'feature': 'HGB', 'instances': 667, 'metric_value': 0.0295, 'depth': 6}
child: {'feature': 'Infusion volume', 'instances': 16, 'metric_value': 0.3373, 'depth': 6}
child: {'feature': 'HGB', 'instances': 20, 'metric_value': 0.6098, 'depth': 6}
score: -22.516799999999996

WBC = 21 * 0.7025 - 667 * 0.0295 - 16 * 0.3373 - 20 * 0.6098 = -22.51679 (even a negative value)

As you can see, HGB (instances 667) and Infusion volume are considered as children in the calculation, so I wonder: which one is right?

CHAID model result always returns 0 accuracy

I'm using chefboost for the CHAID algorithm. The dataset contains 10000 rows and 7 columns, and the fit always returns 0 accuracy. What could be causing this? Can you help me?
I also want to visualize the tree graph; how can I do this?

config={"algorithm":"CHAID",'enableParallelism': False}
model=cb.fit(df.loc[:10000,independent_variable_columns],config,target_label='Decision')

CHAID tree is going to be built...

finished in 6.883694887161255 seconds

Evaluate train set

Accuracy: 0.0 % on 10001 instances
Labels: [0 1]
Confusion matrix: [[0, 0], [0, 0]]
Precision: 0.0 %, Recall: 0.0 %, F1: 0.0 %

Python 3.12 issue (no imp module)

When trying chefboost with Python 3.12, it fails because there is no imp module.


..../lib/python3.12/site-packages/chefboost/Chefboost.py", line 5, in <module>
    import imp
ModuleNotFoundError: No module named 'imp'

Getting None as predicted values

I am getting None as the predicted output; what would be the reason for it?

environment:
pandas==0.25.1
numpy==1.17.2
tqdm==4.36.1
Python 3.7.4

train data
test data

code:
chefboost_c45.txt
(unable to attach .py as Github doesn't allow, hence added .txt)

output:
C4.5 tree is going to be built...
Accuracy: 79.16666666666667 % on 24 instances
finished in 0.41808056831359863 seconds
Win
Win
Win
None
Win
Win
Win
Win
Win
Lose
Win
Lose

Also, does chefboost have support for getting the precision, recall, and F1 score?

classification returns irrelevant results in else conditions

For the configuration

config = {
        "algorithm": "ID3",
        # "enableRandomForest": True,
        # "num_of_trees": 3,
    }

I am getting the following tree for the car.data dataset.

def findDecision(obj): #obj[0]: buying, obj[1]: maint, obj[2]: doors, obj[3]: persons, obj[4]: lug_boot, obj[5]: safety
	# {"feature": "safety", "instances": 1728, "metric_value": 1.2057, "depth": 1}
	if obj[5] == 'low':
		return 'unacc'
	elif obj[5] == 'med':
		# {"feature": "persons", "instances": 576, "metric_value": 1.2152, "depth": 2}
		if obj[3] == '2':
			return 'unacc'
		elif obj[3] == '4':
			# {"feature": "buying", "instances": 192, "metric_value": 1.3543, "depth": 3}
			if obj[0] == 'vhigh':
				# {"feature": "maint", "instances": 48, "metric_value": 0.8113, "depth": 4}
				if obj[1] == 'vhigh':
					return 'unacc'
				elif obj[1] == 'high':
					return 'unacc'
				elif obj[1] == 'med':
					# {"feature": "lug_boot", "instances": 12, "metric_value": 1.0, "depth": 5}
					if obj[4] == 'small':
						return 'unacc'
					elif obj[4] == 'med':
						return 'unacc'
					elif obj[4] == 'big':
						return 'acc'
					else: return '4'
				elif obj[1] == 'low':
					# {"feature": "lug_boot", "instances": 12, "metric_value": 1.0, "depth": 5}
					if obj[4] == 'small':
						return 'unacc'
					elif obj[4] == 'med':
						return 'unacc'
					elif obj[4] == 'big':
						return 'acc'
					else: return '4'
				else: return '6'

As seen, the results should be nominal, but in the else conditions it somehow returns numbers.

Chefboost result is not returned

I used this library as in the documentation, with the same dataset, but the result wasn't returned.
It just shows the text below, but the process didn't complete.

from chefboost import Chefboost as chef
config = {'algorithm': 'CHAID'}
model = chef.fit(df, config)

[INFO]: 40 CPU cores will be allocated in parallel running
CHAID tree is going to be built...

Python version 3.8

Error while fitting the model

Following is the dataframe:
image
and following is the additional code:

df.rename(columns={'result': 'Decision'}, inplace=True)

Output:

Index(['Date', 'Country', 'League', 'Season', 'HomeTeam', 'AwayTeam',
       'home_goal', 'away_goal', 'Decision'],
      dtype='object')

Then:

config = {"algorithm" : "C4.5"}
model = chef.fit(df, config, target_label = "Decision")

I am getting an error:

[INFO]:  4 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_22413/130574452.py in ?()
----> 1 model = chef.fit(df, config,  target_label = "Decision")

~/anaconda3/envs/rover/lib/python3.10/site-packages/chefboost/Chefboost.py in ?(df, config, target_label, validation_df)
    209                 if enableParallelism == True:
    210                         json_file = "outputs/rules/rules.json"
    211                         functions.createFile(json_file, "[\n")
    212 
--> 213 		trees = Training.buildDecisionTree(df, root = root, file = file, config = config
    214                                 , dataset_features = dataset_features
    215 				, parent_level = 0, leaf_id = 0, parents = 'root', validation_df = validation_df, main_process_id = process_id)
    216 

~/anaconda3/envs/rover/lib/python3.10/site-packages/chefboost/training/Training.py in ?(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
    432                 pivot = pd.DataFrame(subdataset.Decision.value_counts()).reset_index()
    433                 pivot = pivot.rename(columns = {"Decision": "Instances","index": "Decision"})
    434                 pivot = pivot.sort_values(by = ["Instances"], ascending = False).reset_index()
    435 
--> 436                 else_decision = "return '%s'" % (pivot.iloc[0].Decision)
    437 
    438                 if enableParallelism != True:
    439                         functions.storeRule(file,(functions.formatRule(root), "else:"))

~/anaconda3/envs/rover/lib/python3.10/site-packages/pandas/core/generic.py in ?(self, name)
   6200             and name not in self._accessors
   6201             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6202         ):
   6203             return self[name]
-> 6204         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'Decision'

Even if I do not rename, it doesn't matter; it always throws this error.

Target label type

Is it true that the Decision column of the input training dataset should be string type?
I tried to feed an integer array at first and got 0 accuracy, but converting to a string array works.
