
nlp_of_company_earnings's Introduction

Performing NLP on Transcripts of Multiple Companies' Earnings Calls

Seeking Alpha is a website that provides information on the stock market. One of the stock-market-related services it provides is transcripts of earnings calls held by a variety of companies. These earnings calls are updates on how a company performed during a period of time, usually a fiscal quarter.

Problem Statement

The client wanted the earnings call transcripts scraped from Seeking Alpha and analyzed for sentiment and complexity.

Data Collection

I gathered the data using Selenium. I started by collecting the URLs of all the transcripts I wanted, then visited each transcript page individually and scraped them one at a time.

To keep the website from questioning my code's humanity, I started by changing the user agent name. When using just one user agent stopped working, I had the browser close and then reopen with a new user agent for every pull. Eventually I also had to add a 20-second sleep timer, which pushed the full scrape of a single URL to almost a full minute. As a result, collecting the number of transcripts I wanted took a few days. A minimal sketch of this rotate-and-throttle approach is shown below.
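The following is a rough sketch of that approach, not the project's exact code: it assumes Selenium 4 (which can locate a Chrome driver on its own), the URL list is a placeholder, and the open_browser helper simply mirrors the one in the issue code further down.

from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def open_browser(user_agent):
    # Start a fresh Chrome session that reports the given user agent.
    opts = Options()
    opts.add_argument("user-agent=" + str(user_agent))
    return webdriver.Chrome(options=opts)

urls = ["https://seekingalpha.com/earnings/earnings-call-transcripts/"]  # placeholder list
for i, url in enumerate(urls):
    browser = open_browser("transcript-scraper-" + str(i))  # new "identity" per pull
    browser.get(url)
    page_source = browser.page_source  # parse with BeautifulSoup downstream
    browser.quit()
    sleep(20)  # throttle requests, as described above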

Data Analysis

This project gave me the opportunity to learn about the Gunning Fog formula for complexity. The full formula is as follows:
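Gunning Fog Index = 0.4 × [ (words / sentences) + 100 × (complex words / words) ]

Here "complex words" are words with three or more syllables, words / sentences is the average sentence length, and complex words / words is the share of long words.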

The idea is that the result represents the number of years of schooling one would need to easily understand the piece of text, so a score of 12 would correspond to a high school senior. In practice this interpretation breaks down, since many results end up at 24 or higher. The score is therefore best treated as a relative measure of complexity rather than a literal count of years in school.

Results

I was able to provide the information requested, creating Complexity and Sentiment score columns for both the speech and Q-and-A portions of the transcripts. A sketch of how such columns can be computed is shown below.
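As an illustration only (the column names and sample text here are hypothetical), the scores can be computed with the same TextBlob and textstat libraries that appear in the issue code further down:

import pandas as pd
import textstat
from textblob import TextBlob

# Hypothetical input: one row per transcript, with the prepared-remarks speech
# and the Q-and-A portion already split into separate text columns.
df = pd.DataFrame({
    "speech": ["Revenue grew this quarter thanks to strong demand."],
    "qa": ["Analysts asked about margins; management expects improvement."],
})

for part in ["speech", "qa"]:
    # TextBlob polarity ranges from -1 (negative) to +1 (positive).
    df[part + "_sentiment"] = df[part].apply(lambda t: TextBlob(str(t)).sentiment.polarity)
    # textstat's implementation of the Gunning Fog formula shown above.
    df[part + "_complexity"] = df[part].apply(textstat.gunning_fog)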

Future Steps

  • Setting up an AWS instance to run the code in the cloud so the work will not require a computer on hand, then slowly scraping the remainder of the transcripts with that instance. The original scraping was done on my own personal computer, and at roughly one transcript per minute it would be best to automate this on somebody else's machine.
  • Storing the scraped data in an SQL server. The data I scraped for 6,000 transcripts was over GitHub's 100MB limit in a few of its different forms, so over 100,000 transcripts would again be best suited to somebody else's machine, such as Amazon's machine(s) or another cloud service. A rough sketch of this step follows the list.
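A minimal sketch of that future step, using a local SQLite file as a stand-in for the eventual SQL server (the database file and table name are placeholders):

import sqlite3
import pandas as pd

conn = sqlite3.connect("transcripts.db")  # stand-in for a real SQL server

url_df = pd.read_csv("url_df_20_transcript.csv")  # output of the scraping step
url_df.to_sql("transcripts", conn, if_exists="replace", index=False)

# Later analysis can pull only the columns it needs instead of a 100MB+ CSV.
sample = pd.read_sql("SELECT header, url FROM transcripts LIMIT 5", conn)
conn.close()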

nlp_of_company_earnings's People

Contributors

terrajriley


nlp_of_company_earnings's Issues

Captcha problem

Hi Terra,

Thank you for sharing your great code! I adjusted your code but ran into a captcha problem both when scraping URLs and when scraping transcripts. Your design of using "unique" user agent names does not work in my code. The following is my adjusted code. Do you have any idea how to deal with the captcha issue? Thank you so much!

Import Libraries

Standard Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline

Scraping Libraries

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from time import sleep
import re

NLP Libraries

from textblob import TextBlob
import textstat

from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import datetime

def Sentamentize(text):
    return TextBlob(str(text)).sentiment.polarity

def open_browser(alt_user_name = 'Thank you for your website'):
    opts = Options()
    opts.add_argument("user-agent=" + str(alt_user_name))
    path = '../Garage/chromedriver' # Path to Chromedriver
    #return webdriver.Chrome(executable_path = path, options=opts)
    return webdriver.Chrome(ChromeDriverManager().install(), options=opts)

Data Collection

Scraping the URLs of all transcripts before the next step.

I run into captchas after only a few dozen URLs.

b_url = 'https://seekingalpha.com/earnings/earnings-call-transcripts/'
url_list = []
for page_num in range(1, 1001):
    browser = open_browser("Y'all are great " + str(page_num))
    # I put in a "unique" user agent name to help throw
    # off the captcha. This worked for only so long.
    current_ts_list = b_url + str(page_num)
    browser.get(current_ts_list)
    elements_list = browser.find_elements_by_class_name('dashboard-article-link')
    urls = [el.get_attribute('href') + '?part=single' for el in elements_list]
    headers = [el.text for el in elements_list]
    print('scraping', page_num)
    #sleep(20)
    for transcript_num in range(len(urls)):
        current_transcript = {
            'header' : headers[transcript_num],
            'url' : urls[transcript_num],
            'page_list': page_num
        }
        url_list.append(current_transcript)
    browser.close()

Scraping every individual URL in the df to return the full transcript.

url_df1 = pd.DataFrame(url_list)
pd.DataFrame(url_df1).to_csv("url_df_1000.csv")

Need to create a variable "transcript" before moving to the following steps.

url_df = pd.read_csv('url_df_20.csv')

I run into captchas after only several URLs.

for i in range(20): # This allows for timeout errors w/o crashing
    try:
        for row_num in range(len(url_df['transcript'])):
            if url_df['transcript'][row_num] == 0:

                # Go to url and obtain the full transcript
                browser = open_browser("Y'all are excellent " + str(row_num))
                # I put in a "unique" user agent name to help throw
                # off the captcha.  This worked for only so long.
                browser.get(url_df['url'][row_num])
                print('1')
                soup    = BeautifulSoup(browser.page_source)
                article = soup.find('article')
                url_df['transcript'][row_num] = [item.text for
                    item in article.find_all('p')]
                browser.close()
                sleep(20)
            else:
                pass
    except:
        print('--------')
        print('ERROR', i)
        print('--------')
        sleep(100)
        browser.close()
        #### Suggestions for the future scraper.
        # if browser:
        #    browser.close()
        # else:
        #    print("""Browser was already closed.
        #    This wasn't a timeout error""")
        # For info on catching the errors that we want try:
        # https://stackoverflow.com/questions/33239308/how-to-
        # get-exception-message-in-python-properly/33239954

url_df.head(3)
pd.DataFrame(url_df).to_csv("url_df_20_transcript.csv")
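Following the suggestion in the comments above, one possible refinement (not part of the original code) is to catch only the Selenium errors you expect instead of using a bare except, so unrelated bugs surface immediately. The fetch_page helper below is hypothetical:

from time import sleep

from selenium.common.exceptions import TimeoutException, WebDriverException

def fetch_page(browser, url):
    # Return the page source, retrying once after the Selenium errors we expect.
    try:
        browser.get(url)
        return browser.page_source
    except (TimeoutException, WebDriverException) as exc:
        # Report what actually went wrong instead of swallowing every exception.
        print('ERROR fetching', url, '->', exc)
        sleep(100)
        browser.get(url)
        return browser.page_source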
