Git Product home page Git Product logo

stock-sentiment's Introduction

Abstract

Our project is building a data search/visualization tool for novice investors (or advanced investors who are new to a particular sector) to become more informed about market conditions that may affect their portfolios. To do this, we will aggregate news datasets to extract stories that pertain to financial markets and group them by which companies they apply to. We’ll apply a basic sentiment analysis to classify the events based on whether they should have a positive or negative impact on market expectations, and then join this time series with other financial data (like ticker prices). We’ll then display it in a form that is intuitive for the user and efficient in terms of compute resources used when processing queries.

Problem/Planned Requirements

We will work with the following data:

  • Multiple datasets with historical stock price data
    • we will look into combining these intelligently and organizing them offline
    • it may be difficult to get sufficient volume of data from free sources since many financial exchanges have paywalls for their datasets, but this will also allow us to practice our data cleaning and inspection skills
  • “All the News 2.0” - 2M records on news articles from major sources
    • as a stretch goal, we could try to find more news datasets and combine them
    • however, we think that the deduplication and organization would be highly non trivial with news datasets so this is a stretch goal
  • Additional datasets for sentiment analysis (Reddit, Twitter, etc.)
    • for each source, we’ll stream the data similar to what would be obtained with an API call

In terms of the search interface, users can either get databases for a particular ticker and time interval, or they can generate a “live” dashboard view which streams the historical data.

Literature Review

This paper discusses new methods for the graphical representation of financial information in a multi-dimensional manner. The representations discussed in the paper strive to enable users to get an overview of the global economy insight. The paper mentions that as with the nature of aggregate data, there is no natural and obvious physical representation, thus intuitive and comprehensible visual metaphors need to be developed. While the paper is rather dated, many of the discussed principles of visualization and insights in displaying financial information remain relevant in contemporary contexts.

For example, the iceberg metaphor discussed in the paper (and in lecture) is particularly relevant in describing economic benchmarks—such as the gross national product (GNP), tax revenues, the national debt, and the unemployment rate etc. These only represent a very macroscopic view of the overall market and represent only the visible tip of a much larger and complex economic system, much like the tip of an iceberg that's visible above the water's surface. In the context of visualization, it is important to not only present what is on the surface but provide a comprehensive understanding of the entire economic situation that uncovers the majority of an iceberg's mass that lies beneath the water.

The paper introduces two distinct visualization approaches: geographic-centric and development-centric. In our project's specific context, we foresee the development-centric approach as being more relevant. This choice stems from our anticipation that users will harness our tools for company comparisons. By providing insights into growth stages and trajectories, this approach will empower them to make more informed decisions and draw valuable conclusions.

In lecture, we delved into techniques aimed at simplifying the process of joining multiple datasets. These methods include tools like FMSketch, which aids in estimating the count of unique elements within sets, helping us pinpoint keys for dataset joins. Additionally, locality-sensitive hashing techniques like Minhash facilitate the identification of overlaps across keys and the efficient calculation of Jaccard similarity. Lastly, data structures such as Bloom Filters enable the efficient merging of rows sharing the same key while minimizing disk I/O, a common bottleneck in such operations.

This paper builds on these techniques discussed in class and explores the implementation of a new MapReduce based parallel method for set similarity joining that leverages locality sensitive hashing to speed up the similarity checking process. The paper explores many similarity functions including Jaccard similarity coefficient and others such as Sorensen–Dice coefficient and Cosine coefficient.

Furthermore, the paper offers invaluable insights by presenting both pseudo and benchmark data for the MapReduce-based set similarity join. We intended to draw insights from these findings as they provide a solid foundation upon which we can base our own set similarity join implementation, streamlining our data processing and analysis tasks.

In this paper, the authors discuss recent trends in visualizing data from 10-K filings. Companies are required to file certain information with the SEC on a quarterly basis, and the 10-K provides valuable information to investors on the status of the company. However, the information in the 10-K forms can be difficult to process or fully understand. Similar to using bulleted lists and tables, infographics and charts are important ways to convey information in a less dense manner.

The authors analyze both quantitative and qualitative charts. They find that over half of the quantitative graphics are just bar charts, but that the number of charts used varies from industry to industry. They conclude that firms elect to use infographics when the information they want to share is more complex (as measured by future stock market volatility). One of the limitations of their work is that they don’t evaluate the honesty of the graphics

In developing this project, we considered scraping data from these 10-K filings in HTML form and displaying it as part of our dashboard. But we felt that the amount of data we could easily obtain might not be sufficient, since each company only files a 10-K once per quarter, so it would be poorly suited for online/streaming implementation tasks. Nevertheless, we will use the observations from this paper in order to ensure that our graphics are easy to understand and display relevant information for investors.

In this paper, the authors perform some experiments on developing a sentiment analysis-based model for predicting stock prices. While we won’t focus on engineering the ML models to improve their accuracy, we think that this paper is relevant because it describes the theoretical foundations of sentiment analysis (which they also perform on financial news articles).

One interesting point is that sentiment analysis is parametrized on a sentiment dictionary. In this paper, the authors explore using two different dictionaries. One of the directions we can consider for scaling our system is to use multiple sentiment dictionaries, as some of them may have “blind spots” for a particular industry. We could try to combine the labels generated by each sentiment dictionary using the techniques covered in class.

Plan

We will refer to these four sources and others in order to gain the necessary background knowledge for our implementation tasks. Our work will be novel in terms of making financial data more accessible to people who are still familiarizing themselves with the basics of financial markets. Many existing financial software applications, like the desktop app for TD Ameritrade or Bloomberg Terminal, process streams of data that are known to be clean and display processed data streams that are useful for day traders. The main differences are:

  • our system will focus on easy-to-understand information (in particular, a stream of news headlines rather than a plot of a quantity like open interest in put/call options)
  • our system will be capable of processing messy data

stock-sentiment's People

Contributors

ngaijustin avatar sanjitbasker avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.