For my Master's thesis, which involved constructing a stock market sentiment index, the first step was to write two scrapers that continuously collect the latest data. It was my first time using R for web scraping, and the learning curve was steep. When I came across Scrapy, I wondered how much effort it would take to rebuild the two scrapers with the same or even better functionality.
For the thesis, I needed to collect ad hoc announcements for companies traded on German exchanges. These are compulsory press releases for events such as the release of financial results, management changes, stock repurchases, M&A activities, etc. In addition, stock price data and company metadata are collected.
This spider collects English ad hoc announcements from dgap.de, along with some data about the publishing companies, such as the name, ISIN and WKN.
This spider downloads the price history for the stocks given by their ISIN, as well as company metadata, e.g. company sector & industry, country, etc.
Python>=3.4 is required.
Clone the repo and install the required libraries with:
$ pip install -r requirements.txt
Note: This has only been tested on Ubuntu.
In order to run the full scraping process (announcements, stock prices & metadata), navigate to the project root directory and run the following command:
$ python run.py
When executed for the first time, this downloads all the English ad hoc announcements (as of 2019, ~25K) and stores them in a file, adhoc.csv, in the data folder.
As soon as the scraping of all announcements is finished, it automatically starts to download the stock price data for all the companies based on the unique ISINs (as of 2019, ~1K). For each ISIN one file is written into the stocks or stocks_xetra subfolder of the data directory.
The corresponding company metadata for each ISIN is written to adhoc_stocks_metadata.csv in the data folder.
In subsequent runs, new announcements are scraped until the last announcement stored in adhoc.csv is encountered. The new announcements are then appended to the existing CSV file.
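The stop condition can be sketched roughly as follows. This is a minimal illustration, not the spider's actual code: it assumes the listing is visited newest first, and it checks against every stored `newsID` rather than only the last one. The function names and the listing data are hypothetical.

```python
import csv
import io
from itertools import takewhile


def stored_news_ids(csv_file):
    """Collect the newsID of every announcement already stored."""
    return {row["newsID"] for row in csv.DictReader(csv_file)}


def new_announcements(listing, known_ids):
    """Return announcements from a newest-first listing up to the first known one."""
    return list(takewhile(lambda a: a["newsID"] not in known_ids, listing))


# Stand-in for data/adhoc.csv, reduced to two columns for the example.
existing = io.StringIO(
    "newsID,headline\n"
    "1186915,OSRAM Licht AG: OSRAM clears way for ams takeover\n"
)
known = stored_news_ids(existing)

# Hypothetical newest-first listing as a spider might encounter it.
listing = [
    {"newsID": "1186920", "headline": "Newer announcement"},
    {"newsID": "1186915", "headline": "OSRAM Licht AG: OSRAM clears way for ams takeover"},
    {"newsID": "1186900", "headline": "Older announcement"},
]
fresh = new_announcements(listing, known)
print([a["newsID"] for a in fresh])
```

Only the announcement that precedes the first known `newsID` is returned; everything after it is assumed to be stored already.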
Once the new announcements have been scraped, the stock price data is downloaded for the ISINs in them, as well as for the ISINs in all announcements published up to 30 business days before the timestamp of the previously last announcement.
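To make the 30-business-day window concrete, here is a small sketch of how such a cutoff could be computed. It is an assumption about the logic, not the project's implementation: the helper name is made up, and only weekends are skipped, so public holidays count as business days here.

```python
from datetime import datetime, timedelta


def subtract_business_days(start, days):
    """Step back `days` weekdays (Mon-Fri) from `start`; holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Timestamp taken from the sample adhoc.csv row.
last_seen = datetime(2019, 8, 21, 20, 30)
cutoff = subtract_business_days(last_seen, 30)
print(cutoff)
```

Announcements with a timestamp on or after `cutoff` would then have their price data refreshed.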
If the new announcements contain ISINs that are not yet in adhoc_stocks_metadata.csv, their company metadata is appended to that file.
This way, you can easily run the scraper every day or week with a cron job.
adhoc.csv:
companyID | company_name | country | headline | isin | newsID | text | timestamp | url | wkn |
---|---|---|---|---|---|---|---|---|---|
372698 | OSRAM Licht AG | Deutschland | OSRAM Licht AG: OSRAM clears way for ams takeover | DE000LED4000 | 1186915 | The Managing Board of Osram Licht AG (Osram) has ... | 2019-08-21 20:30:00 | https://dgap.de/... | LED400 |
adhoc_stocks_metadata.csv:
arivaID | country | exchangeID | file_urls | files | foundingyear | industry | isin | listingdate | sector | security_name | stocktype | ticker |
---|---|---|---|---|---|---|---|---|---|---|---|---|
111942885 | Deutschland | 6 | http://www.ariva.de/quote/historic/historic.csv?secu=111942885&... | [{'url': 'http://www.ariva.de/quote/historic/historic.csv?secu=111942885&boerse_id=6&clean_split=1&clean_payout=1&clean_bezug=1&min_time=2000-01-01&max_time=2019-08-14&trenner=%3B&go=Download', 'path': 'ISIN_DE000LED4000.parquet', 'checksum': 'c450f18613637aa8f6deadaaaeaf68b9'}] | 1919 | Diversifizierte Industrieunternehmen | DE000LED4000 | 08.07.2013 | Industrie | osram_licht-aktie | Inlandsaktie | OSR |
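The download URL in the `files` column can be reproduced with the standard library. This is only a sketch: the query parameters are copied from the sample row above, the mapping of `arivaID` to `secu` and `exchangeID` to `boerse_id` is inferred from that row, and the function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.ariva.de/quote/historic/historic.csv"


def build_history_url(ariva_id, exchange_id, min_time, max_time):
    """Build a price-history URL shaped like the one in the sample row."""
    params = {
        "secu": ariva_id,          # assumed to be the arivaID column
        "boerse_id": exchange_id,  # assumed to be the exchangeID column
        "clean_split": 1,
        "clean_payout": 1,
        "clean_bezug": 1,
        "min_time": min_time,
        "max_time": max_time,
        "trenner": ";",            # separator; urlencode turns it into %3B
        "go": "Download",
    }
    return BASE_URL + "?" + urlencode(params)


url = build_history_url(111942885, 6, "2000-01-01", "2019-08-14")
print(url)
```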
ISIN_XXXXXXX.parquet: German header & format - delimited by semicolon & decimal comma
Datum, ISIN (MultiIndex) | Erster | Hoch | Tief | Schlusskurs | Volumen |
---|---|---|---|---|---|
2019-08-23, DE000LED4000 | 36.99 | 37.00 | 36.60 | 36.63 | 16109238 |
English translations:
Date, ISIN, Open, High, Low, Close, Volume
Note: Some files might also include a "Stücke" (Shares) column.
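If you prefer English column names, the translations above can be applied with a simple mapping. The dictionary and function names below are illustrative and not part of the project.

```python
# German header names and their English translations, as listed above.
GERMAN_TO_ENGLISH = {
    "Datum": "Date",
    "Erster": "Open",
    "Hoch": "High",
    "Tief": "Low",
    "Schlusskurs": "Close",
    "Volumen": "Volume",
    "Stücke": "Shares",
}


def translate_headers(columns):
    """Translate known German headers; leave anything else (e.g. ISIN) as-is."""
    return [GERMAN_TO_ENGLISH.get(name, name) for name in columns]


print(translate_headers(["Datum", "ISIN", "Erster", "Hoch", "Tief", "Schlusskurs", "Volumen"]))
```

When the file is loaded into a pandas DataFrame, the same dictionary should work with `df.rename(columns=GERMAN_TO_ENGLISH)`.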
After getting used to the inner workings and the many possible extensions of Scrapy, you'll want to use it for all your web scraping.
Compared to writing a scraper in R, it took less time and the resulting scrapers are faster, more robust and easily extensible.
Replacing the CSV and Parquet pipelines with a database pipeline would require only a few small adjustments. For my current needs it's sufficient to store the data locally as CSV and Parquet files.
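As a rough idea of what such a database pipeline might look like, here is a sketch using SQLite from the standard library. The class name, table name and column selection are hypothetical and not taken from this project; only the `open_spider`, `process_item` and `close_spider` hooks follow Scrapy's item pipeline interface.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical Scrapy item pipeline that stores announcements in SQLite."""

    def __init__(self, db_path="adhoc.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and ensure the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS announcements ("
            "newsID TEXT PRIMARY KEY, isin TEXT, headline TEXT, timestamp TEXT)"
        )

    def process_item(self, item, spider):
        # The primary key plus INSERT OR IGNORE skips already stored announcements.
        self.conn.execute(
            "INSERT OR IGNORE INTO announcements VALUES (?, ?, ?, ?)",
            (item["newsID"], item["isin"], item["headline"], item["timestamp"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


# Stand-alone demo with an in-memory database and the sample row from adhoc.csv.
pipeline = SQLitePipeline(":memory:")
pipeline.open_spider(spider=None)
sample = {
    "newsID": "1186915",
    "isin": "DE000LED4000",
    "headline": "OSRAM Licht AG: OSRAM clears way for ams takeover",
    "timestamp": "2019-08-21 20:30:00",
}
pipeline.process_item(sample, spider=None)
pipeline.process_item(sample, spider=None)  # duplicate, ignored
row_count = pipeline.conn.execute("SELECT COUNT(*) FROM announcements").fetchone()[0]
pipeline.close_spider(spider=None)
print(row_count)
```

In a Scrapy project, a class like this would be enabled through the `ITEM_PIPELINES` setting instead of being called by hand.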