
orie-4741-project's Introduction

ORIE-4741-Project

In this repo, we explore predicting IPO performance using big messy data.

orie-4741-project's People

Contributors: rayrayweng

orie-4741-project's Issues

Final Report Review

Dear Subordinates,

I have been waiting anxiously for your final report. Your proposed project to develop a model that predicts the future of an IPO based on a number of predictive features has numerous applications within our company if successful. After a thorough analysis of your report, I am very happy to say the wait was well worth it.

Likes:

  • The way you cleaned your data and explained it in layman's terms was great. It helped me fully understand the amount of thought and effort that was put into this project.
  • The explanations of your models were very thorough and diligent. I feel like I understand why you made the choices you did.
  • I appreciate your analysis of the models in terms of ‘fairness.’ This ensures that our company will not open itself to lawsuits and is acting morally by using the proposed algorithms.

Areas of Improvement:

  • A feature-analysis portion of the report would have been very interesting. Knowing which aspects of a company contribute most to its stock price is something that can be leveraged and would be invaluable in our business.
  • I would suggest using various AutoML algorithms to fine-tune your model parameters further.
  • The confusion matrices can be confusing at times; a simple predicted-vs.-actual plot would have provided more easily digestible insight into the success of your model.
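The predicted-vs.-actual plot suggested above could be sketched as follows. This is only an illustration: the data is synthetic and the column meanings and output filename are placeholders for the report's real model output.

```python
# Sketch of a predicted-vs-actual plot, with synthetic numbers
# standing in for the report's real model output.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
actual = rng.normal(0, 10, size=100)             # e.g. day-one % change
predicted = actual + rng.normal(0, 4, size=100)  # imperfect predictions

fig, ax = plt.subplots()
ax.scatter(actual, predicted, alpha=0.6)
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, linestyle="--", color="gray")  # perfect-prediction line
ax.set_xlabel("Actual % change")
ax.set_ylabel("Predicted % change")
fig.savefig("pred_vs_actual.png")
```

Points near the dashed diagonal are good predictions, so the reader can judge model quality at a glance.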

Overall, I thank you for embarking on this project, you have done your company proud. I look forward to discussing the implementation of your findings more in the future.

Sincerely,
Your boss

Peer Review

The project is about predicting the IPO performance of companies. The data is from Kaggle, and the name of the dataset is Stocks IPO information & results. The objective of this project is to determine the percentage change the stock will have after the first date.

What I like about the proposal

  • The objective is straightforward to understand.
  • Justification of the problem is apparent as I assume many people would want to know the answer to the proposed question.
  • The dataset seems to consist of a mix of numerical and categorical variables.

Areas for improvement

  • In the dataset portion, it is explained that the group will look at seemingly irrelevant factors, but it may have been better if the group could provide some examples.
  • The dataset consists of around 1600 columns - some of the columns can probably be dropped without too much analysis - which can make feature engineering very difficult.
  • Perhaps convert the percentage change to a binary outcome model, which can simplify the process as there is no need to feature transform the outcome variable.
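The binary-outcome conversion suggested above is a one-liner in pandas; a minimal sketch with a hypothetical column name:

```python
import pandas as pd

# Hypothetical column: turn the continuous day-one percentage change
# into a binary "did the price rise?" label.
df = pd.DataFrame({"day1_pct_change": [12.5, -3.1, 0.0, 47.2, -8.9]})
df["price_up"] = (df["day1_pct_change"] > 0).astype(int)
print(df["price_up"].tolist())  # [1, 0, 0, 1, 0]
```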

Midterm Peer Review

The project is about exploring IPO performance using big messy data. The dataset they are using contains 3762 company IPOs with 165 different columns. Their goal is to predict the price of an IPO of a company from different factors and investigate the factors that influence the success of the stock over a long time period.

Things I like:

  • I really like the histograms and tables provided in the report, they help readers better understand the data and the methods used in this project.
  • I also like the data cleaning process because each step is clearly stated and contains lots of detail. Specifically, I really like the way they dealt with missing price data on the 1st day. The new column that indicates whether the price goes up or down handles the missing data and could play an important role in the model's predictions.
  • Lastly, I really appreciate how each process is really well explained, and it is easy for readers to follow.

Three things to improve upon:

  • I think it would be helpful to add an introduction that explains what the project is about; you may also include how you are going to deal with underfitting.
  • It would be helpful to include why you chose those models over other models (the advantages of the models chosen), how you ran each model, and the parameters you used.
  • For the missing values, I think dropping all companies with those selected features missing is kind of abrupt, since this dataset does not contain a lot of data. You may lose important information by just dropping the missing rows. A better way could be to predict some of them from other columns or to add a new column that indicates whether the value is missing.
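The missing-indicator idea in the last point can be sketched in pandas (the revenue column and median fill here are hypothetical choices, not the report's method):

```python
import numpy as np
import pandas as pd

# Hypothetical feature: rather than dropping rows with missing "revenue",
# keep them, fill the hole, and record missingness in a new column.
df = pd.DataFrame({"revenue": [120.0, np.nan, 85.5, np.nan]})
df["revenue_missing"] = df["revenue"].isna().astype(int)
df["revenue"] = df["revenue"].fillna(df["revenue"].median())
print(df["revenue_missing"].tolist())  # [0, 1, 0, 1]
```

The model can then learn whether missingness itself is informative, instead of silently losing those rows.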

Overall, it is a well written report. Can’t wait to see the results!

Peer review 3

The project aims to discover which factors lead to day-one IPO success, in order to predict the percentage increase or decrease of a stock on day one. The dataset that they will use is from Kaggle, consisting of information across 3000 companies. This project would definitely be extremely useful in the finance world to be able to predict day-one stock prices.

LIKE:

  • The team's decision to focus specifically on the day-one trend should make the prediction more precise.
  • The team looked further ahead and examined the scope of impact this project could create.
  • The dataset is large, so it will be messy.

Improvements:

  • Suggest some of the features that you decided to look into, examples of the columns in the dataset
  • Explain the feasibility of the project, although I understand that there has already been a lot of analysis done on IPO
  • Explain how your team might approach the project differently when only looking at day one

Midterm Peer Review

What's it about?
The project is about predicting the price of an IPO of a company (specifically how its price changes on the first day of its offering).

What data are they using?
They are using a dataset of 3762 company IPOs with 165 different columns.

What's their objective?
Their objective is to predict the price of a company’s IPO based on different factors related to the company as investing in IPOs can be quite profitable.

Three things I like:

  • I like their explanations as this report was very easy to read and understand, and had a good mix of technical writing and readability.
  • I like the flow of their report, as the order of each section made sense and their references to where figures and facts could be found were quite helpful.
  • Lastly, I like how transparent their report was. Their data cleaning was difficult and I like that they shared that. If I didn’t know why their dataset only ended up being the size that it was, I would have been a lot more surprised or concerned, but their transparency was helpful and honest which I really appreciated.

Three things to improve upon:

  • My first possibility of improvement is to try to find other datasets to join on yours. 10 columns and 2,000 examples doesn’t feel like a lot, especially for this course, so I would try to find additional information about the companies, or their locations, or average price of similar companies. More data would go a long way I suspect, and I would definitely try to pursue that.
  • Next, the dataset is on a country level, but I would expect much more informative insights to appear on a city level. While this may not be easy (or possible) to find, I feel like certain US cities may have traditionally more successful IPOs (Silicon Valley type cities), and if this information could be easily acquired I think the extra granularity would greatly increase the prediction power.
  • Lastly, I think the inclusion of more detail would be helpful. My suspicion as to why you got such an extremely good Random Forest training error is that the tree was remarkably deep, but that it wasn’t a highly generalizable tree. However, you didn’t mention the parameters used for this tree, so it’s hard to give any helpful advice for improvement. If you just used the default settings, perhaps try a more shallow tree, or do some parameter hypertuning to choose the best ones and I think this would increase the model’s generalizability.
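The depth-tuning suggestion in the last point could look like the following sklearn sketch. The data is synthetic and the parameter grid is illustrative, not the report's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; the real report would use its IPO feature matrix.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Bound and tune tree depth by cross-validation instead of accepting the
# default unbounded depth, which tends to memorize the training set.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10, None], "n_estimators": [100, 300]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```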

Proposal Peer Review

Summary:
The team hopes to create a model that can accurately predict a stock's success based on IPO information. The objective is to create a highly profitable system of investing based on these early factors. The dataset used contains a variety of properties about stocks during their IPO period and includes about 3000 companies.

What I like about the proposal:

  • The introduction highlights why such predictions can be difficult (Uber vs Snowflake comparison).
  • The team seems to be trying to find original results rather than correlations that others have already found.
  • Financial instruments and their markets seem difficult to predict so the proposal is appropriate for the project.

Suggestions:

  • A better explanation or rewording on why it is worth considering the "seemingly irrelevant" factors would help.
  • For someone not in ORIE 4741, it might be worth explaining what the "impressive arsenal of data analysis techniques" includes.

Final Peer Review

This project aims to predict whether or not the price of a new company's stock will increase on its first day. The group uses a dataset on 3762 companies with information on their stock prices, sectors, revenue, etc. for this analysis. Ultimately, the goal of this project is to be able to predict which companies' IPOs are worth investing in.

Some things I liked:

  • I liked that you used different metrics to analyze the performance of your models. I especially think that the third metric, profit per stock buy, is helpful since ultimately your goal is to use this model to earn profits from investing in the "right" IPOs.
  • I think it is good that you used multiple different models, as well as hyperparameter tuning for them. Different types of models are better suited for different problems so it is always good to use multiple to be sure that you are using the best model possible.
  • I think your feature importance analysis for Random Forests was interesting and yielded some surprising results. It might have been good to include a sentence or so explaining why you think sector and whether a company is US-based were so important for prediction.

Some things to add:

  • I think a figure showing your preprocessing (for example when you talked about correcting features using histograms) would have been helpful to better understand your approach.
  • For the confusion matrices, I think it would have been clearer if you had labelled them with "price increased (predicted)" vs. "price increased (actual)" or something along those lines, instead of just 0 or 1. I think this would have made your figure more understandable for the reader.
  • Since your dataset had price data for 262 days, it would have been interesting to see if your model could have been trained/applied to a different day in the dataset and still be able to predict stock prices.
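The labelling suggestion above is easy to apply to a raw confusion matrix; a small sketch with toy labels (the label strings are the reviewer's suggestion, not the report's):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = price increased, 0 = price decreased.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = actual, columns = predicted
labelled = pd.DataFrame(
    cm,
    index=["price decreased (actual)", "price increased (actual)"],
    columns=["price decreased (predicted)", "price increased (predicted)"],
)
print(labelled)
```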

Final Peer Review

This project aims to determine how company factors influence the company's performance in the stock market.

Things I liked:

  • I liked the initial description and analysis of the dataset. The table 1 feature description was especially useful.
  • Good use of k-fold cross validation to balance the challenges of under- and over-fitting the data.
  • Interesting measures of model performance, specifically calculating the profit per stock buy. I think this was a very useful addition to the simpler measures of accuracy, since this is ultimately what the model would be used for.

Areas of Improvement:

  • The confusion matrices are a bit hard to read. It feels more natural to have the "true labels" axis increasing, kind of like a typical coordinate plane. Furthermore, the colors are hard to distinguish on the righthand half in your random forest and SVM graphs.
  • I think it could have been helpful to set aside some of the data as a validation set, so that after you trained and tuned using the validation set, you could produce a final model evaluation using the testing set.
  • Since your dataset includes all different company sectors and is not a very even distribution of companies per sector, it may be beneficial to split the data by sector and develop separate models for separate sectors. This could improve model accuracy.
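The train/validation/test workflow suggested above can be sketched with two successive splits (the split sizes here are arbitrary placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# First carve off a held-out test set, then split the remainder into
# train and validation; the test set is touched only once, at the end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Hyperparameters are chosen against the validation set; the test set then gives an unbiased final score.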

Interesting project!

Midterm Peer Review

The project is about predicting the IPO performance of companies. The dataset they are using contains information about 3762 company IPOs, with 1665 different columns. The project objective is to determine what factors influence day 1 success.

Things I like:

  • I like that you have added many plots and tables. They can help the reader to better understand the data you are analyzing.
  • The objective is clear and the report is well structured.
  • I like that you have already used three methods taught in class – logistic regression, random forest, and linear regression, and that you have compared the results from those models.

Areas for Improvement:

  • I think it might be better if you add an Introduction, explaining what the project is about.
  • You did not mention where the dataset comes from. You only mentioned that it contains information about the IPOs of 3762 companies.
  • Maybe explain why you chose to use those models and what are the advantages of using them.

Midterm Peer Review

The project is about exploring IPO performance by using big messy data. The dataset contains 3762 company IPOs with 165 different columns. Their objective is to predict the price of a company’s IPO based on different factors related to the company as investing in IPOs can be quite profitable.

Things I like:

  1. They interpreted the data very well, using different plots for different factors. The writing is also very easy to understand.
  2. The data cleaning process was done very well; they successfully got rid of useless data.
  3. The midterm report tried various ML models that we learned from class, such as logistic regression, random forest, and linear regression, and compared the results from those models.

Things to improve:

  1. It might be better if the report had an introduction explaining what the project is about and why you chose it.
  2. Evaluating a company is a very complex task; I think 10 columns and 2,000 examples may not be enough to develop a machine learning model.
  3. An IPO is also a process that may be affected by human factors, and figuring out how to control for these factors is very important.

Midterm Report

Dear Subordinates,

I've been looking forward to your midterm review since your initial proposal. Your proposed project to develop a model that predicts the future of an IPO greatly intrigued me, and I was excited to see how you'd execute this rather lofty goal. After reading your report I can say that overall, I am quite pleased with the progress you have made.

Firstly, allow me to address how you cleaned your data:
I would like to start by stating that I like how much investigation you did into your given dataset. Checking your given values against the Yahoo ticker prices was a smart move that possibly saved the integrity of your model. In addition, your decisions about when to fill in 'nan' values and when to just discard a feature completely were well thought out and executed. I am marginally concerned with the predictive power of your remaining features. It might be worth looking deeper into the features you abandoned to see if you could enter by hand some values that were missing. I believe that if your features become too generic, the model will not function as well as one would like.

Now allow me to provide insight into your data analysis:
Your insight into your feature space and the accompanying plots helped me visualize your data a lot better. Checking your "Y-value" column against your assumptions is always a powerful tool that can help identify messy aspects of your dataset, as well as a good self-check that everything is okay. I very much appreciate that you took the time to do this. I would like to see further analysis, however, on the other features you chose to keep and the names of the features you chose to discard. It's hard to get a handle on the predictive power lost to your dataset cleaning without this information.

Finally, let me state how I feel about your preliminary models/future plans:
I like the models you chose to train initially. They both act as a good lower bound for what can be accomplished. Providing the chart on feature importance also gives me further visual insight into your dataset, which I appreciate. I like your plan to further develop your dataset, and it is smart to state outright that you are not expecting a model with 80% accuracy. I think it would be worth investigating how K-fold cross-validation affects your model's predictive power as well, as this can be a great tool to prevent overfitting. It would also be worth your time to investigate forecasting, as this can be a powerful approach to predicting stocks.
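The K-fold cross-validation mentioned in the paragraph above might look like this in sklearn (synthetic data, with logistic regression as a stand-in model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data in place of the IPO feature matrix.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# 5-fold cross-validation: each fold is held out once, so the averaged
# score reflects performance on unseen data rather than memorization.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```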

Overall, I think your group is making good progress and I agree with the plan you have to move forward.

Sincerely,
Your Boss

Another Peer Review

Summary
Ray Weng and Dhruv Girgenti are undergoing a data exploration project seeking to predict day-one IPO performance. They are using a dataset that summarizes various factors for over 3,000 stocks. Their angle of approach is to explore seemingly irrelevant factors to find potential relationships in a company’s IPO success.

Compliments
One thing I like about your proposal is how specific your idea is. Taking on stock market data is a hard thing to do, and the more focused you are, i.e., choosing to only predict IPO performance, the better your odds of finding something valuable. I also like that you acknowledge how chaotic stock market data can be since that’s definitely a good thing to be aware of when using such a dataset. Lastly, it’s nice that you plan to look for correlations in seemingly irrelevant data since that’s much more interesting than finding obvious correlations, such as the connection between annual revenue and stock price.

Areas for Improvement
If you find the opportunity to, I would recommend making this project even more specific, if you can. The more focused your area of application, the more likely you are to make sense of the chaos. Since you don’t mention domain knowledge in your reasons for likely success, I’ll also mention that domain knowledge is SUPER important when dealing with stock market analysis, so the more articles you can read along your journey, the better. Lastly, keep in mind that there are many external factors that happen in the economy and politics that can affect the stock market, so perhaps some kind of controlling for IPO performance normalized against average performance in that sector for that day could improve your predictive ability.

Final Project Review

The project examines a data set containing information about initial public offerings (IPOs) for companies and aims at identifying the key factors that lead to a successful IPO. The dataset contains data on 3,762 companies with 1665 features. It includes data for opening, closing, high and low prices, and volume from opening day to the 262nd day after the IPO.

Things I like:

  • I like your problem statement and that the purpose is not only predictive but also aims at identifying factors and effects.
  • I liked your feature engineering and data cleaning - deriving new features from existing ones, etc. Good overview of the features you had left.
  • Clear and simple visualisation.
  • Good presentation of results.

Questions/improvements?

  • Perhaps the logistic regression could have been explored a bit more. Did you consider adding a regularization term (L1/L2) to logistic regression?
  • How did you estimate the error of the models? It seems like you used CV both for model selection and to estimate the error. You should not use the same error for both model selection and error estimation - a separate hold-out set or a nested CV is required to get an unbiased estimate. I am not sure I would trust the conclusions about the important features given the low performance - especially if a separate test set wasn't used, which would mean the presented error is biased.
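A nested-CV sketch touching both points above: an L2-regularized logistic regression is tuned in the inner loop, and the outer loop estimates generalization error on folds the inner search never saw. Data and the parameter grid are illustrative stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inner loop tunes the L2 penalty strength C; the outer loop scores the
# whole tuned pipeline on held-out folds, giving an unbiased estimate.
inner = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```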

Overall I liked the report and think you did a good job!

Peer Review 5

This project will investigate the predictors of day-one IPO performance by analyzing publicly available data from over 3000 historical stock IPOs.

This is a well structured problem with a clear motivation. The initial dataset identified for this project also contains a huge array of features for each datapoint, which should provide plenty of opportunities for the team to leverage tools from 4741. The focus on so called "irrelevant factors" is also quite promising, as that does appear to be the space in which the most interesting relationships are likely to be found.

By that same token, I think that this project might benefit from an examination of even more "irrelevant" data. Much of their chosen dataset focuses on factors that are clearly related to the inner workings of the business—factors which, from an adversarial perspective, are almost certainly taken into consideration by the organizations setting their initial prices. I expect that the vast majority of the features in the dataset will be much better predictors of an IPO's initial price than of its movement throughout the first day of trading.

Final Review

Summary:
The project's objective is to predict which companies' IPOs are worth investing in based on day one of trading. The project looks at the factors of companies that have day-one IPO success. It specifically uses logistic regression, decision trees, and random forests to understand which factors play into the success of an IPO. We also get more understanding of the models through confusion matrices for each model.

Things I like:

  • I like the level of detail you put into working on your dataset even before making EDAs. Some people don't realize that some of the data is wrong, and if they do, they just take it out of the dataset. But the team figured out what was wrong and fixed it in the dataset so that they would not lose any data.
  • I like the table used to describe the features used in the data analysis, as it allows the reader to step away from dense paragraphs and notice what is important.
  • The fairness portion brought up a very interesting discussion on making the wealthy even wealthier, and how the project affects this. The overall usefulness of the model is good for first-time investors.

Areas for improvement:

  • The size of the confusion matrices in comparison to the other figures is unnerving. There is no reason why the confusion matrices should be that big while all the bar graphs are that small.
  • Although it may be in your proposal or midterm report, you should still say where the dataset you are using comes from, as it would alert the reader that datasets from that source may contain misinformation, as shown by the checks the group made against Yahoo Finance.
  • In addition to the confusion matrix, I feel as if there could be a discussion about the false positives and false negatives as false positives would cause bad investments.

Final Project Peer Review

Summary

The purpose of this project is to determine which factors lead to IPO success, to help traders identify which companies are worth investing in. This ended up being a binary classification problem, with 1 meaning buy the stock when it opens and 0 meaning pass on the stock, since it will fall in value by closing. They used three classification models: logistic regression, random forest, and support vector machines. In the end, the best model was SVM, with a testing accuracy of 0.62 and an average profit per stock buy of 0.74 cents.

Things I Like

  • The data set seemed challenging to work with, since there were so many missing values and inconsistencies. I'm glad you guys were up for the challenge.
  • The random forest model was interesting. I like how you noticed it was overfitting and then corrected it using validation and hyperparameter tuning.
  • I like that you were aware of factors regarding fairness and wealth disparities and made sure your analysis was legally fair, with limited statistical bias.

Areas of Improvement

  • I was hoping for more justification behind choosing the classification models based on your data analysis, but it seems like they were chosen only because they are interesting to experiment with.
  • Some of the graphs seemed unnecessarily large, it made scrolling through the report somewhat inconvenient.
  • It would be nice if you expanded upon the parameters used in the SVM model. I am not sure why you decided to use an RBF kernel, for example, and I'm wondering if other kernels might perform better if you tested them with GridSearch.
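The kernel GridSearch idea in the last point could be sketched as follows, on synthetic stand-in data with an illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data in place of the IPO feature matrix.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Compare the RBF kernel against alternatives, tuning C at the same time.
grid = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["rbf", "linear", "poly"], "C": [0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_["kernel"])
```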

Great job, your model accuracies far exceeded my expectations for a problem this hard!
