
4741project's Introduction

4741Project

Predicting Company Bankruptcy

cmh342, md794

Our project aims to predict whether or not a company will go bankrupt. Predictions of bankruptcy are crucial to the financial industry to make better credit and loan assessments. These predictions can have huge impacts on the rest of the world -- from the business owners and financial industry to the rest of the society. Being able to accurately and efficiently predict bankruptcies has immense value, and the dataset that we are working with can give us valuable insight into which factors play key roles in influencing the markets and the financial industry as a whole.

4741project's Issues

Peer Review from bsp73

Summary:

The question you are answering is whether or not it is possible to predict whether a company will go bankrupt based on a number of financial features. To do this, you are using a dataset containing information about companies in Taiwan and whether or not they went bankrupt. If it is possible to make accurate predictions, the model will be very useful for financial institutions in making loan decisions.

What I liked:

  1. I like that the question is whether we can predict bankruptcies and not how we can predict them, because it may not be possible to predict them well. If you can predict them well, you will have a very useful model.
  2. With over 6000 records and 96 features, the dataset will be very good for feature extraction, as there are so many features to base predictions on and a fair number of companies.
  3. The classification algorithms you mentioned seem like great options and it will be interesting to compare them.

Areas for improvement:

  1. I'm slightly confused by your paragraph about false positives and negatives. You explain why it's necessary to balance them, but don't really explain your specific strategy for doing so.
  2. You could explain a few more details about the dataset. For example, I'm not sure when the data for each business was recorded.
  3. Is it a problem that you are only looking at data from Taiwan? Are businesses run differently or is different data collected than in other regions?

Final Review

Predicting Company Bankruptcy
=============================

Summary
-------

This project is about predicting the bankruptcy of companies 
given a set of financial information about the company. The 
project uses data collected from the Taiwan Economic Journal 
within the 10-year period starting in 1999. The team seeks to 
use the results to develop a wider understanding of financial 
statement analysis.

Things I liked
--------------

1. There were definitely a lot of really good visualizations 
used in the report! They made understanding the data significantly 
easier.

2. The model methods were explained very well, as were the 
diagrams which corresponded to them.

3. In the model explanations, the practical upsides and downsides 
of each model (e.g., more false positives at higher class weights) 
were well explained.


Areas for improvement
---------------------

1. The correlation heatmap for all 96 features was pretty hard 
to read; the one with only 9 features would probably have been 
sufficient (and a better use of space). Also, many of the figures 
detailing descriptive statistics about the data could have been 
left in the project proposal rather than repeated in the final 
report.

2. Although fairness was discussed, it was not done through the 
explicit lens of whether or not the model constituted a "Weapon 
of Math Destruction".

3. In the Conclusion and Future Work section, there aren't really 
any specific recommendations given regarding what businesses should 
do to prevent themselves from going bankrupt, or any other 
recommendations (e.g., to someone looking to short a company's stock) 
to more broadly contextualize the work done in the report in light 
of the results.

Overall great job!

Final Review

Summary:
This project's objective is to predict whether a company will go bankrupt or not based on data of bankruptcies from the Taiwan Economic Journal between 1999 and 2009. The project tries to find the best predictors of bankruptcy while balancing the false negatives and false positives. The project uses multiple models and visualizations to understand which predictors we should be the most concerned about.

Things I like:

  • I like this idea in general as it could be used for small companies to measure their growth and check that they are on the right path.
  • The addition of finding a balance between the false positives and false negatives shows a good process for how we want to think about the models. The explanation of what would happen given a false negative or a false positive was well thought out, though I think it should have been mentioned sooner, in the introduction.
  • Most of the visualizations and graphs add meaning to the project and allow the team to focus on the important aspects of the figures.

Areas for improvement:

  • The first heatmap is really hard to read. I understand it is there as a visual, but since you already stated it was too small and too convoluted to understand and interpret, I feel like you should have not included it in the report.
  • I think there should be a more standardized way you are sizing the figures, as some are hard to read and others could be smaller.
  • In the conclusion and future work section, more could have been said about what can be done to limit incorrect identifications. In general, the future work part could have included more, such as questions that came up while doing the project that are separate from the initially posed question, or something the group may have wanted to try in order to limit incorrect identifications.

Midterm Review

This project will look at bankruptcy and financial data from 1999 to 2009 to learn to predict whether or not a company will go bankrupt. There were 6819 observations and 95 features, and all companies were in Taiwan.

(i) I really enjoyed the format and thought the report was very clear. (ii) An explanation of the importance of the project was given, and this helped contextualize the problem. I also enjoyed that the report talked about the traditional, non-machine-learning methods that are used. (iii) The plots and preliminary analyses made sense, and they seemed advanced (using a polynomial basis expansion, an elastic net penalty, etc.).

(i) The "F1 measure" was not explained, along with the "Precision" and "Recall." Explaining these would help make sense of the plots. (ii) Maybe including some more basic techniques (least squares regression, etc.) before the advanced ones would help show why more advanced models should be used. (iii) It might be useful to talk more about why some of the techniques were used and chosen, especially the more advanced ones (why they're better or better tailored for this problem).
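For reference, the three metrics the review mentions can be computed directly; a minimal sketch with scikit-learn on toy labels (invented here, not the project's data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: 1 = bankrupt, 0 = solvent (invented, not the project's data).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]

# Precision: of the companies flagged bankrupt, how many truly were? TP/(TP+FP)
precision = precision_score(y_true, y_pred)
# Recall: of the companies that went bankrupt, how many were caught? TP/(TP+FN)
recall = recall_score(y_true, y_pred)
# F1: harmonic mean of precision and recall, punishing imbalance between them.
f1 = f1_score(y_true, y_pred)

print(precision, recall, f1)  # 2 TP, 1 FP, 1 FN -> all three equal 2/3 here
```

With so few actual bankruptcies in the dataset, these metrics are far more informative than raw accuracy, which is presumably why the report plots them.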

Peer Review

Summary of project:
The project is about predicting which companies will go bankrupt based on features related to financial information and the financial ratios of each firm. They are using data provided by the Taiwan Stock Exchange, comprising 6819 records of Taiwanese companies with 96 features. Their objective is to train a classification model to make binary predictions on whether a company will go bankrupt or not.

Things I like:

  • The motivation for the problem's importance is well articulated. The project considers the value of predicting bankruptcy from various directions, including benefits to the financial industry (the credit system), stockholders, and society and the economy. This makes the problem feel very applicable to the real world and worth pursuing.
  • The project is very specific about how they will approach the problem. After going over the dataset and features, they have already decided on potential approaches for training, including SVMs, logistic regression, and decision trees.
  • The project had detailed consideration about the balancing of false positive and false negative cases, and how to use this information to perform better training. This is a very interesting perspective and I’m really curious to see how the ratio will influence the result.

Areas for improvement:

  • Considering that the dataset focuses mainly on Taiwan's market, is there any bias that may arise during training? Will the final model be applicable to companies outside Taiwan? It would be interesting to see the error rate on a dataset from another country's market after training on this one.
  • Have you considered the possibility of overfitting? The dataset already has 96 features. If you apply feature engineering to get more transformed features, the size of the dataset might be a little small. If that happens, how will you filter for the best features, and how will you prevent overfitting?
  • How will you avoid bias when you separate the dataset into training and testing data, given that firm sizes are not consistent? For example, what if, after random assignment, most of the test data are very small companies while the training data consists of relatively large companies? Will this become a problem?
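One standard guard against this kind of split imbalance is stratified sampling; a minimal sketch with scikit-learn on synthetic data (not the project's actual features), stratifying on the rare bankruptcy label:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (not the project's actual features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.03).astype(int)  # ~3% bankruptcies, mimicking the imbalance

# stratify=y keeps the bankrupt/solvent ratio nearly identical in both splits,
# so neither split is accidentally dominated by one kind of firm.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y.mean(), y_tr.mean(), y_te.mean())  # all roughly equal
```

The same idea would extend to firm size: bin size into quantiles and stratify on the bins (a hypothetical extension, since a firm-size column is assumed here, not confirmed by the proposal).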

Final Peer Review

This project uses a dataset with accounting information from companies in Taiwan. The main goal of the project is to predict company bankruptcy. However, the project also aims to balance the classifier's false positive and false negative rates. In particular, the project focuses on identifying companies that are most likely to go bankrupt, which requires a careful balancing of the false positive rate.

The group tested a logistic regression model as well as an AutoML model. Ultimately, the logistic regression model performed better, as it does a better job balancing false positives and negatives. There were only five features left in the logistic regression model. Interestingly, only one feature (Debt Ratio %) indicated an increased probability of bankruptcy; the other four features indicated a decreased probability.

Things I liked:

  • I thought it was great that you opted for ROC curves, the F1-measure, and a confusion matrix instead of using accuracy. This shows that you put a lot of thought into the distribution of labels in your dataset.
  • Even though the simpler logistic regression model ended up working better, I still thought it was really cool that you used AutoML here, and you did a great job explaining how H2O works.
  • I thought it was very interesting how you discussed fairness in your project when your data didn’t contain any protected attributes. Discussing the impacts of false positives and negatives was quite fascinating, especially since balancing those rates is very important for your model.

Areas for improvement/suggestions:

  • I think you probably could’ve omitted the correlation heatmap for all the features, since it is quite difficult to see.
  • Perhaps adding some intuition behind the coefficients in the logistic regression model would be useful (e.g., whether or not they make financial sense).
  • You briefly talked about class weights with regard to logistic regression and balancing false positive and negative rates. I think it would’ve been interesting to showcase different models with different weights and explain how those models could be used if someone, for instance, cared more about false positives.
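The class-weight comparison suggested here could be sketched as follows with scikit-learn (a minimal illustration on synthetic data; this is not the report's actual model or dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic data standing in for the financial features (an assumption;
# this is not the report's model or dataset).
rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 4))
y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 2.5).astype(int)  # rare positives

results = {}
for w in (1, 5, 20):  # weight placed on the rare "bankrupt" class
    clf = LogisticRegression(class_weight={0: 1, 1: w}).fit(X, y)
    tn, fp, fn, tp = confusion_matrix(y, clf.predict(X)).ravel()
    results[w] = (fp, fn)
    # Heavier weights catch more bankruptcies (fewer false negatives)
    # at the cost of flagging more healthy firms (more false positives).
    print(f"weight={w:>2}  false positives={fp:>4}  false negatives={fn:>4}")
```

Tabulating the trade-off for a few weights like this would let a reader pick the operating point matching their own cost for each error type, which is exactly the use case the review describes.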

Overall, this was an amazing project! I enjoyed reading it and thought you did a great job using the tools learned in the class.

Midterm Peer Review

Predicting Company Bankruptcy
=============================

Summary
-------

This project is about predicting the bankruptcy of companies 
given a set of financial information about the company. The 
project uses data collected from the Taiwan Economic Journal 
within the 10-year period starting in 1999. The team seeks to 
use the results to develop a wider understanding of financial 
statement analysis.

Things I liked
--------------

1. I liked that the introduction was thorough, and described 
not only the analysis that was to be done but also its purpose. 
The importance of the questions being answered by the 
analysis was very evident in the language used.

2. I also liked the description of the dataset. The precision in 
the quantity of observations and the total number of 
bankruptcies gave the report an air of legitimacy.

3. The separation of categories from each other was good to 
see, and made it easy to parse through exactly where the 
requirements for the midterm report were met.


Areas for improvement
---------------------

1. The font scaling on the graphics (the histogram and the 
matrices) was off, and was somewhat jarring to the eyes. 
The label font on the histogram was too small and the label
font on the matrices was too large, compared to that of the 
rest of the report. (They were also sans serif instead of serif.)

2. It seemed like there was room for more elaboration. 
Although the report did seem thorough, rescaling the figures 
could have freed space for additional detail while still 
fitting the page limit.

3. Not sure what the policy is on cover pages, but the report 
was supposed to be three pages or fewer.

Overall great job!

Midterm Peer Review

Summary:
The project intends to use data collected from the Taiwan Economic Journal to predict if a company will go bankrupt. The project is interested in high-risk companies that may have a high probability of going bankrupt.

What I like about the report:

  • Well-explained overfitting analysis with effective steps to avoid over- and underfitting
  • Good visualizations for data variability and outliers
  • Well-detailed feature explanations with thorough data cleaning

Suggestions:

  • The project states the next steps are to continue with modeling, but what specific models will be implemented?
  • Could feature expansion be implemented to reveal any correlation between features and bankruptcy?
  • How would validation be conducted on these future models?

Midterm Review

Things I like:

  • I like the analysis being done on the correlation between each feature and the label. I think this is a really useful way to get a good sense of your data and how your model will work, and it also helped identify that there were two identical columns, which we’ve seen can often be problematic.

  • You’ve carefully thought about how the model should actually be evaluated, based on the fact that your label is very skewed. I think that the idea of evaluating the model on an F1 score rather than a traditional accuracy is a really good idea.

  • I hadn’t heard of the elastic net penalty before, but it does seem like an interesting approach. The preliminary analyses are interesting and encouraging, and I think there’s a clear idea of what the next steps will be.

Areas for improvement:

  • I think that it would be useful to explicitly state the hypothesis set that you are considering, and also mention techniques to avoid under/overfitting that are specific to each model. You mentioned that you may use regularization or you may reduce the number of trees, but I think it would be good to specifically outline which model you are going to try first, and why.

  • A bit more detail in the last section might be useful. How are you going to reduce the false positives, and why do you think that a tree ensemble / an SVM would help you achieve this goal? Also, when you say “best” model in the last sentence, could you be more specific about what that means?

  • You mentioned in the last section that several of the features were correlated. I think going into a bit more detail there, or showing a visualization such as a correlation map, would be useful. In the future, if and when you do a tree model, I also think it would be good to include the feature importance plot.
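The correlation check suggested above might look like the following with pandas, on invented stand-ins for a few of the 96 ratios (the column names and data here are hypothetical, not from the actual dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
debt = rng.normal(size=n)
df = pd.DataFrame({
    # Invented column names standing in for a few of the 96 ratios.
    "debt_ratio": debt,
    "net_worth_to_assets": -debt + rng.normal(scale=0.1, size=n),  # near-duplicate
    "operating_margin": rng.normal(size=n),
})
df["bankrupt"] = (debt + rng.normal(size=n) > 1.5).astype(int)

# Correlation of each feature with the label:
print(df.corr()["bankrupt"].drop("bankrupt").sort_values())
# Redundant feature pairs (like the identical columns the midterm report
# found) appear as off-diagonal entries near +/-1:
print(df.drop(columns="bankrupt").corr().round(2))
```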

final peer review

This project is about predicting company bankruptcy based on data collected on bankruptcies from the Taiwan Economic Journal. The data ranged from 1999 to 2009. The specific objectives include: whether it’s possible to make accurate predictions given the financial information, balancing the ratio of false positives and negatives to achieve better results, identifying high-risk companies, and finding valuable predictors.
Things I liked:

  • I think the overall logical flow of this report is very clear; everything is well connected and extremely intuitive to read through. Since I will be doing finance in the future, this report is very interesting to me, as the content is very relevant to the real world and compelling.
  • I like that you started with the standard approach, logistic regression (which works better), but also tried out other models, and the explanations for those are also illustrated well.
  • I think the discussion about false positives and negatives is interesting; I haven’t seen it in other reports yet, and it demonstrates your work throughout the semester.

Suggestions for improvements:

  • Graphics-wise: Figure 1 and Figures 3-11 might be something you want to omit, since they can be described easily in words and are hard to see / take up a lot of space.
  • Is there a time dependency in the data? Some years during the decade (e.g., 2003 and 2008) experienced financial crises.
  • Maybe include more reasoning for why the models you chose make sense in the real world, not only mathematically.

Overall great job! I loved learning about your report.

Final Review

This project seeks to predict company bankruptcy from a collection of data from the Taiwan Economic Journal.

What I Liked:
(i) Well thought out data cleaning section which set up regressions nicely.
(ii) Thorough tuning of hyperparameters with all tests clearly visualized for the reader.
(iii) Well-written report with clear evidence of time and effort.

Suggestions:
(i) Could any models have been further improved with some of your findings? Could hyperparameters be tuned even further?
(ii) What other models would you try if there were more time for this project? Could any provide more accurate predictions? For example, could PCA work in this scenario?
(iii) Could more tests be done to better assess the accuracy of the models created?

midterm review

This project aims at predicting bankruptcy based on bankruptcy data from the Taiwan Economic Journal. Using this data, the group tries to figure out whether it is possible to accurately predict whether a company is going to go bankrupt given the financial information they have about the companies in their dataset. At this stage, the group has performed initial data cleaning with insightful exploratory data analysis and tested the data with several supervised classification models.

Things I liked:

  1. I like how you use cross-validation to try different models on your dataset to avoid overfitting and underfitting.
  2. I like the way you figured out that including polynomial features or interaction effects does not make a big difference.
  3. I like how you deal with missing data and the way you find correlations within your dataset.
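The cross-validation workflow praised above could be sketched like this (a minimal example with scikit-learn on synthetic data; the F1 scoring choice mirrors the report's stated metric, everything else is assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the dataset (not the project's actual data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + rng.normal(size=1000) > 1.5).astype(int)  # skewed label

# Stratified folds preserve the rare-class ratio in every fold, and
# scoring="f1" matches the report's preference for F1 over raw accuracy.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```

Swapping in a different estimator for `LogisticRegression()` is how the same loop would compare the other candidate models.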

Things I am concerned/can be improved:

  1. Your data mainly comes from the Taiwan Economic Journal, which covers a specific region in Asia. I would suggest using data from multiple sources to gain wider insight.
  2. I might use bagging or boosting as well.
  3. What models will you specifically implement in the next step?

Overall great job!

Peer Review

This project seeks to predict, given information on a company, whether it will go bankrupt. The authors plan to use data on around 7000 different Taiwanese companies' bankruptcies collected by the Taiwan Economic Journal between 1999 and 2009. Using features derived from this dataset, as well as further engineered features, the authors plan to try out several different types of classifiers to gain insight into which financial indicators best predict bankruptcy and eventually make a prediction.

What I like about this project:

  1. I think that this is a very important and useful question to answer in the financial world. A model that could accurately predict bankruptcies would be very interesting to investors or companies in making financial decisions since a company going bankrupt would have major financial repercussions for them.
  2. I like the discussion considering false positives and false negatives and the weights and relative importance placed on each of them. It will be very useful to consider when deciding on which loss function to use, especially given how many possibilities there are; I think it is good that you are already thinking about them.
  3. I think that, as you mentioned, it is a good idea to try out different types of classification models (SVMs, logistic regression, and decision trees), since one might fit the question better than others.
  4. I also think it's great that there is so much information available on the companies (96 features) in the dataset, it will definitely give you a lot of options when engineering features.

What I would add:

  1. I think the question of "will a company go bankrupt or not" is still a little vague. There is a big difference between predicting whether a company will go bankrupt within the next few months vs. within 10 years or so. It may be better to refine the question a bit, since, depending on which timeframe you want to predict in, the features could look very different (and different features may be more useful in one time span than another).
  2. Are you planning to generalize the model to countries other than Taiwan? If yes, it may be good to look for other datasets for other countries as well, since there may be other different information available on companies based on which country they are based in. Otherwise you may be able to do some research about which of the features in your Taiwan dataset are available for other countries' companies and try to only use those for your model.
  3. Adding onto the previous point, the dataset seems a little small (7000 companies), especially when 20% should be reserved for your test set. It might help to find some additional data sets for better results. If not, you could always consider using cross-validation with a large number of splits for your train/validation sets.
  4. Since this is almost a time-series-type prediction, I think it would be good to consider how you want to split up your test and training sets, as well as how much data (i.e., how far back) you want to use when engineering features for each company. I think this also relates back to my first point about deciding how far into the future you want to predict a company's bankruptcy.
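One way to respect the time ordering raised in the last point is a forward-chaining split; a minimal sketch with scikit-learn's TimeSeriesSplit, assuming the records could be sorted by reporting year (the public dataset does not expose an explicit date column, so this is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Pretend the records are sorted by reporting year; this ordering is
# an assumption made for illustration only.
n = 6819
X = np.arange(n).reshape(-1, 1)

# Each split trains only on earlier records and validates on later ones,
# so the model never "peeks" at future financial conditions.
splits = list(TimeSeriesSplit(n_splits=4).split(X))
for train_idx, test_idx in splits:
    print(f"train ends at row {train_idx[-1]}, test spans rows {test_idx[0]}-{test_idx[-1]}")
```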

Peer Review


This project is a binary classification problem on whether a company will go bankrupt or not. The dataset that they are using is information on about 7000 companies from the Taiwan Economic Journal between 1999 and 2009. The objective of the project is to predict whether a company will go bankrupt by selecting a subset of 96 features provided by the dataset, as well as generally gaining insight into the factors in determining whether a company will go bankrupt.


Things I like:

  • I think that the problem that you are trying to solve is extremely useful, and being successful in this project can give great insight into the determining factors of bankruptcy.
  • The dataset that is being used has a lot of features, so there is a lot of potential for the hypothesis set and feature engineering. The group can play around with these features a lot and potentially evaluate on different types of classification models (e.g. SVMs or logistic regression).
  • The group also considered various error metrics. Instead of simply trying to figure out the number of correctly classified examples, they also took into account the false positive and false negative rates, which I think are important evaluation metrics in these types of problems.

Questions/areas for improvement:

  • A question that I would pose is how would you take into account certain biases with respect to the time frame? For example, the examples in the dataset between 2007-2009 might look a lot different due to the recession, so I think this is something that could definitely affect the model.
  • Is the dataset large enough to tackle such a problem? If you do, for example, a 70-20-10 train-validation-test split on the data, will you have enough examples to train on, and enough examples to properly validate and evaluate the model?
  • I couldn't figure out how to access the dataset myself, but are there a lot of missing values in the dataset, and if so, how will you deal with these missing values? Going off the previous bullet point, dropping these values would make the dataset even smaller, so could you impute these values somehow?
  • Do you think that this Taiwan dataset will be predictive of datasets taken from other locations? You mentioned that Taiwan will have a specific set of rules for bankruptcy, so do you think that the insights you gain from this data will similarly apply to companies in the US or some other country?
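The imputation alternative floated two bullets up might look like the following (a sketch with scikit-learn's SimpleImputer on invented numbers, not the actual dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A tiny invented slice of financial ratios with gaps (not the real data).
X = np.array([
    [0.12, np.nan, 0.45],
    [0.30, 0.80, np.nan],
    [0.21, 0.60, 0.55],
    [np.nan, 0.70, 0.50],
])

# Median imputation keeps every row; dropping any row containing a NaN
# would discard 3 of the 4 records here, shrinking an already small set.
X_filled = SimpleImputer(strategy="median").fit_transform(X)
print(X_filled)
```

Each NaN is replaced by the median of its column's observed values, which is robust to the outliers that financial ratios often contain.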

Midterm Peer Review

This is a well-written and thorough proposal which does a good job highlighting the steps you have taken to pursue your project. Your project aims to use financial data from Taiwan in the years 1999 to 2009 to predict whether a company will go bankrupt.

The borrowing dependency in particular looks like it might not be a good indicator, because most companies are in the lowest bin, but it may be more useful in conjunction with other metrics. If you could display the frequency of bankruptcies in the bins chosen for your three graphs, that would be quite cool, and give a hint as to how a linear classifier might do! Good idea using an F1 estimator to help deal with the small number of positives in your dataset. It is interesting, although not particularly surprising, that a polynomial basis expansion doesn't affect your model's accuracy very much. Did you fit your models over every feature, or just the ones listed above?

I am also not sure whether false positives at a reasonable rate are a huge issue for a management team, because there is only limited money to invest with anyway. Having some predictor to help filter good from bad investments is already a useful tool. Is there time dependency in your data? That might be skewing results as well.
