
nlp_of_company_earnings's Introduction

Performing NLP on Transcripts of Multiple Companies' Earnings Calls

Seeking Alpha is a website that provides information on the stock market. One of the stock-market-related services it provides is transcripts of earnings calls held by a variety of companies. These earnings calls are updates on how a company performed during a period of time, usually a fiscal quarter.

Problem Statement

The client wanted the earnings call transcripts scraped from Seeking Alpha and analyzed for sentiment and complexity.

Data Collection

I gathered the data using Selenium. I started by collecting the URLs of all the transcripts I wanted, then visited each transcript page individually and scraped them one at a time.

To keep the website from questioning my code's humanity, I started by changing the user agent name. When using just one user agent stopped working, I had the browser close and then reopen with a new user agent for every pull. Eventually I also had to add a 20-second sleep timer, which pushed the full scrape of a single URL to almost a full minute. As a result, collecting the number of transcripts I wanted took a few days. A minimal sketch of this rotate-and-throttle approach is shown below.
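The following is a rough sketch of that approach, not the project's exact code: it assumes Selenium 4 (which can locate a Chrome driver on its own), the URL list is a placeholder, and the open_browser helper simply mirrors the one in the issue code further down.

from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def open_browser(user_agent):
    # Start a fresh Chrome session that reports the given user agent.
    opts = Options()
    opts.add_argument("user-agent=" + str(user_agent))
    return webdriver.Chrome(options=opts)

urls = ["https://seekingalpha.com/earnings/earnings-call-transcripts/"]  # placeholder list
for i, url in enumerate(urls):
    browser = open_browser("transcript-scraper-" + str(i))  # new "identity" per pull
    browser.get(url)
    page_source = browser.page_source  # parse with BeautifulSoup downstream
    browser.quit()
    sleep(20)  # throttle requests, as described above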

Data Analysis

This project gave me the opportunity to learn about the Gunning Fog formula for complexity. The full formula is as follows:
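Gunning Fog Index = 0.4 × [ (words / sentences) + 100 × (complex words / words) ]

Here "complex words" are words with three or more syllables, words / sentences is the average sentence length, and complex words / words is the share of long words.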

The idea is that the result represents the number of years of schooling one would need to easily understand the piece of text, so a score of 12 would correspond to a high school senior. In practice this interpretation breaks down, since many results end up at 24 or higher. The score is therefore best treated as a relative measure of complexity rather than a literal count of years in school.

Results

I was able to provide the information requested, creating Complexity and Sentiment score columns for both the speech and Q-and-A portions of the transcripts. A sketch of how such columns can be computed is shown below.
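As an illustration only (the column names and sample text here are hypothetical), the scores can be computed with the same TextBlob and textstat libraries that appear in the issue code further down:

import pandas as pd
import textstat
from textblob import TextBlob

# Hypothetical input: one row per transcript, with the prepared-remarks speech
# and the Q-and-A portion already split into separate text columns.
df = pd.DataFrame({
    "speech": ["Revenue grew this quarter thanks to strong demand."],
    "qa": ["Analysts asked about margins; management expects improvement."],
})

for part in ["speech", "qa"]:
    # TextBlob polarity ranges from -1 (negative) to +1 (positive).
    df[part + "_sentiment"] = df[part].apply(lambda t: TextBlob(str(t)).sentiment.polarity)
    # textstat's implementation of the Gunning Fog formula shown above.
    df[part + "_complexity"] = df[part].apply(textstat.gunning_fog)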

Future Steps

  • Setting up an AWS instance to run the code in the cloud so the work will not require a computer on hand, then slowly scraping the remainder of the transcripts with that instance. The original scraping was done on my own personal computer, and at roughly one transcript per minute it would be best to automate this on somebody else's machine.
  • Storing the scraped data in an SQL server. The data I scraped for 6,000 transcripts was over GitHub's 100MB limit in a few of its different forms, so over 100,000 transcripts would again be best suited to somebody else's machine, such as Amazon's machine(s) or another cloud service. A rough sketch of this step follows the list.
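A minimal sketch of that future step, using a local SQLite file as a stand-in for the eventual SQL server (the database file and table name are placeholders):

import sqlite3
import pandas as pd

conn = sqlite3.connect("transcripts.db")  # stand-in for a real SQL server

url_df = pd.read_csv("url_df_20_transcript.csv")  # output of the scraping step
url_df.to_sql("transcripts", conn, if_exists="replace", index=False)

# Later analysis can pull only the columns it needs instead of a 100MB+ CSV.
sample = pd.read_sql("SELECT header, url FROM transcripts LIMIT 5", conn)
conn.close()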

nlp_of_company_earnings's People

Contributors

terrajriley


nlp_of_company_earnings's Issues

Captcha problem

Hi Terra,

Thank you for sharing your great code! I adjusted your code but ran into a captcha problem both when scraping URLs and when scraping transcripts. Your design of using "unique" user agent names does not work in my code. The following is my adjusted code. Do you have any idea how to deal with the captcha issue? Thank you so much!

Import Libraries

Standard Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline

Scraping Libraries

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from time import sleep
import re

NLP Libraries

from textblob import TextBlob
import textstat

from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import datetime

def Sentamentize(text):
    return TextBlob(str(text)).sentiment.polarity

def open_browser(alt_user_name = 'Thank you for your website'):
    opts = Options()
    opts.add_argument("user-agent=" + str(alt_user_name))
    path = '../Garage/chromedriver' # Path to Chromedriver
    #return webdriver.Chrome(executable_path = path, options=opts)
    return webdriver.Chrome(ChromeDriverManager().install(), options=opts)

Data Collection

Scraping the URLs of all transcripts before the next step.

I run into captchas after only a few dozen URLs.

b_url = 'https://seekingalpha.com/earnings/earnings-call-transcripts/'
url_list = []
for page_num in range(1, 1001):
    browser = open_browser("Y'all are great " + str(page_num))
    # I put in a "unique" user agent name to help throw
    # off the captcha. This worked for only so long.
    current_ts_list = b_url + str(page_num)
    browser.get(current_ts_list)
    elements_list = browser.find_elements_by_class_name('dashboard-article-link')
    urls = [el.get_attribute('href') + '?part=single' for el in elements_list]
    headers = [el.text for el in elements_list]
    print('scraping', page_num)
    #sleep(20)
    for transcript_num in range(len(urls)):
        current_transcript = {
            'header' : headers[transcript_num],
            'url' : urls[transcript_num],
            'page_list': page_num
        }
        url_list.append(current_transcript)
    browser.close()

Scraping every individual URL in the df to return the full transcript.

url_df1 = pd.DataFrame(url_list)
pd.DataFrame(url_df1).to_csv("url_df_1000.csv")

Need to create a variable "transcript" before moving to the following steps.

url_df = pd.read_csv('url_df_20.csv')

I run into captchas after only several URLs.

for i in range(20): # This allows for timeout errors w/o crashing
    try:
        for row_num in range(len(url_df['transcript'])):
            if url_df['transcript'][row_num] == 0:

                # Go to url and obtain the full transcript
                browser = open_browser("Y'all are excellent " + str(row_num))
                # I put in a "unique" user agent name to help throw
                # off the captcha.  This worked for only so long.
                browser.get(url_df['url'][row_num])
                print('1')
                soup    = BeautifulSoup(browser.page_source)
                article = soup.find('article')
                url_df['transcript'][row_num] = [item.text for
                    item in article.find_all('p')]
                browser.close()
                sleep(20)
            else:
                pass
    except:
        print('--------')
        print('ERROR', i)
        print('--------')
        sleep(100)
        browser.close()
        #### Suggestions for the future scraper.
        # if browser:
        #    browser.close()
        # else:
        #    print("""Browser was already closed.
        #    This wasn't a timeout error""")
        # For info on catching the errors that we want try:
        # https://stackoverflow.com/questions/33239308/how-to-
        # get-exception-message-in-python-properly/33239954

url_df.head(3)
pd.DataFrame(url_df).to_csv("url_df_20_transcript.csv")
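Following the suggestion in the comments above, one possible refinement (not part of the original code) is to catch only the Selenium errors you expect instead of using a bare except, so unrelated bugs surface immediately. The fetch_page helper below is hypothetical:

from time import sleep

from selenium.common.exceptions import TimeoutException, WebDriverException

def fetch_page(browser, url):
    # Return the page source, retrying once after the Selenium errors we expect.
    try:
        browser.get(url)
        return browser.page_source
    except (TimeoutException, WebDriverException) as exc:
        # Report what actually went wrong instead of swallowing every exception.
        print('ERROR fetching', url, '->', exc)
        sleep(100)
        browser.get(url)
        return browser.page_source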
