
Open source cross sectional asset pricing

This repo accompanies our paper: Chen and Zimmermann (2021), "Open source cross-sectional asset pricing"

If you use data or code based on our work, please cite the paper:

@article{ChenZimmermann2021,
  title={Open Source Cross Sectional Asset Pricing},
  author={Chen, Andrew Y. and Tom Zimmermann},
  journal={Critical Finance Review},
  year={Forthcoming}
}

Data

If you are mostly interested in working with the data, we provide both stock-level signals (characteristics) and a variety of portfolio implementations for direct download at the dedicated data page. Please see the data page for answers to FAQs.

However, this repo may still be useful for understanding the data. For example, if you want to know exactly how we construct BrandInvest (Belo, Lin, and Vitorino 2014), you can just open BrandInvest.do in Signals/Code/Predictors/.


Code

The code is separated into three folders:

  1. Signals/Code/ Downloads data from WRDS and elsewhere. Constructs stock-level signals (characteristics) and outputs to Signals/Data/. Mostly written in Stata.
  2. Portfolios/Code/ Takes in signals from Signals/Data/ and outputs portfolios to Portfolios/Data/. Entirely in R.
  3. Shipping/Code/ You shouldn't need this. We use this to prepare data for sharing.

We separate the code so you can choose which parts you want to run. If you only want to create signals, you can run the files in Signals/Code/ and then do your thing. If you just want to create portfolios, you can skip Signals/Code/ by directly downloading its output via the data page. The whole thing is about 15,000 lines, so you might want to pick your battles.

More details are below.

1. Signals/Code/

master.do runs everything. It calls every .do file in the following folders:

  • DataDownloads/ downloads data from WRDS and elsewhere
  • Predictors/ constructs stock-level predictors and outputs to Signals/Data/Predictors/
  • Placebos/ constructs "not predictors" and "indirect evidence" signals and outputs to Signals/Data/Placebos/

master.do employs exception handling, so if any of these .do files errors out (due to a lack of a subscription, out-of-date code, etc.), it'll keep running and output as much as it can.
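
The pattern is roughly as follows (a minimal sketch built on Stata's capture, not the literal master.do code; the folder path and local names are illustrative):

* run every .do file, log failures, and keep going
local dofiles : dir "Signals/Code/DataDownloads" files "*.do"
foreach f of local dofiles {
    capture noisily do "Signals/Code/DataDownloads/`f'"
    if _rc != 0 {
        display as error "`f' exited with error code " _rc ", moving on"
    }
}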

The whole thing takes roughly 24 hours, but the predictors will be done much sooner, probably within 12 hours. You can keep track of how it's going by checking out the log files in Signals/Logs/.

Minimal Setup

In master.do, set pathProject to the root directory of the project (where SignalDoc.csv is located) and wrdsConnection to the name you selected for your ODBC connection to WRDS (a.k.a. dsn).
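
For example (values are illustrative; this assumes master.do stores the two settings as Stata globals):

* adjust to your own project root and ODBC DSN name
global pathProject "C:/Users/you/crosssection/"
global wrdsConnection "wrds"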

If you don't have an ODBC connection to WRDS, you'll need to set it up. WRDS provides instructions for Windows users and for WRDS cloud users. Note that wrdsConnection (name of the ODBC connection) in the WRDS cloud example is "wrds-postgres". If neither of these solutions works, please see our troubleshooting wiki.

Optional Setup

The minimal setup will allow you to produce the vast majority of signals. And due to the exception handling in master.do, the code will run even if you're not set up to produce the remainder.

But if you want signals that use IBES, 13F, OptionMetrics, or FRED data, or a handful of other signals, you'll want to do the following:

  • For IBES, 13F, OptionMetrics, and bid-ask-spread signals: Run Signals/Code/PrepScripts/master.sh on the WRDS Cloud, and download the output to Signals/Data/Prep/. See master.sh for more details. The most important files from this optional setup are iclink.csv and oclink.csv, which allow for merging IBES, OptionMetrics, and CRSP data. The code here relies heavily on code by Luis Palacios, Rabih Moussawi, Denys Glushkov, Stacey Jacobsen, Craig Holden, Mihail Velikov, Shane Corwin, and Paul Schultz.

  • For signals that use the VIX, inflation, or broker-dealer leverage, you will need to request an API key from FRED. Before you run the download scripts, save your API key in Stata, either via the context menu or via set fredkey (a sketch follows this list). See this Stata blog entry for more details.

  • For signals that use patent citations, BEA input-output tables, or Compustat customer data, the code uses Stata to call R scripts, which may need some setup. If you're on a Windows machine, you will need to point master.do to your R installation by setting RSCRIPT_PATH to the path of Rscript.exe (see the sketch after this list). If you're on Linux, you just need to make sure that the Rscript command is executable from the shell.
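
For the FRED item above, a minimal sketch (the key is a placeholder; request your own from FRED):

* store a FRED API key for the download scripts (placeholder key shown)
set fredkey 1234567890abcdef1234567890abcdef, permanently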
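
For the patent-citations item, a sketch of the Windows-only setting (assuming RSCRIPT_PATH is a Stata global; the path is illustrative):

* point master.do to your Rscript executable (adjust to your R version)
global RSCRIPT_PATH "C:/Program Files/R/R-4.0.5/bin/Rscript.exe"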

2. Portfolios/Code/

master.R runs everything. It:

  1. Takes in signal data located in Signals/Data/Predictors/ and Signals/Data/Placebos/
  2. Outputs portfolio data to Portfolios/Data/Portfolios/
  3. Outputs exhibits found in the paper to Results/

It also uses SignalDoc.csv as a guide for how to run the portfolios.

By default the code skips the daily portfolios (skipdaily = T) and takes about 8 hours, assuming you examine all 300 or so signals. However, the baseline portfolios (based on predictability results in the original papers) will be done in just 30 minutes. You can keep an eye on how it's going by checking the CSVs written to Portfolios/Data/Portfolios/; every 30 minutes or so the code should output another set of portfolios. Adding the daily portfolios (skipdaily = F) takes an additional 12 or so hours.

Minimal Setup

All you need to do is set pathProject in master.R to the project root directory (where SignalDoc.csv is). Then master.R will create portfolios for Price, Size, and STreversal in Portfolios/Data/Portfolios/.

Probable Setup

You probably want more than Price, Size, and STreversal portfolios, and so you probably want to set up more signal data before you run master.R.

There are a couple of ways to set up this signal data:

  • Run the code in Signals/Code/ (see above)
  • Download Firm Level Characteristics/Full Sets/PredictorsIndiv.zip and Firm Level Characteristics/Full Sets/PlacebosIndiv.zip via the data page and unzip to Signals/Data/Predictors/ and Signals/Data/Placebos/
  • Download only some selected csvs via the data page and place in Signals/Data/Predictors/ (e.g. just download BM.csv, AssetGrowth.csv, and EarningsSurprise.csv and put them in Signals/Data/Predictors/).

3. Shipping/Code/

This code zips up the data, runs some quality checks, and copies files for uploading to Gdrive. You shouldn't need to use this, but we keep it with the rest of the code for replicability.


Stata and R Setup

Stata code was tested on both Windows and Linux. Linux was Ubuntu 18.04.5 running Stata 16.1.

R code was tested on

  • Windows 10, Rstudio Version 1.4.1106, R 4.0.5, and Rtools 4.0.0
  • Ubuntu 18.04.5, Emacs 26.1, ESS 17.11, and R 4.0.2

To install the Windows R setup

  1. Download and install R from https://cran.r-project.org/bin/windows/base/old/
  2. Download and install Rtools from https://cran.r-project.org/bin/windows/Rtools/history.html
  3. Download and install Rstudio from https://www.rstudio.com/products/rstudio/download/
  4. Add Rtools to the path by running in R:
     writeLines('PATH="${RTOOLS40_HOME}\\usr\\bin;${PATH}"', con = "~/.Renviron")
     (see https://cran.r-project.org/bin/windows/Rtools/)

Git Integration

If you use RStudio, take a look at Hendrik Bruns' guide to set up version control.

As a stand-alone client for Windows, we recommend Sourcetree.

If you use Git, you should definitely add the following lines to .gitignore:

Signals/Data/**
Shipping/Data/**
Portfolios/Data/**

These folders contain a ton of data and will make Git slow to a crawl or crash.


Contribute

Please let us know if you find typos in the code or think that we should add additional signals. You can let us know about any suggested changes via pull requests for this repo. We will keep the code up to date for other researchers to use.

Contributors

alecerb, antoniogilderubio, chenandrewy, lena-will, mk0417, opensourceap, tomz23


Issues

Daily Portfolios

Hi, just exploring the dataset. For the daily portfolios, I want to make sure that I am using the most up-to-date data -- this one only goes up to early 2020, for example.


Compustat quarterly data

Signals/Code/DataDownloads/C_CompustatQuarterly.do

gen time_avail_m = mofd(datadate) + 3  // Assume data available with a 3 month lag

This may have two minor issues. If we compare the RDQ (which is the release date of earnings) and datadate:

  1. RDQ < datadate: more than 600 observations.
  2. RDQ > datadate + 90 (days): more than 50,000 observations (which are not publicly available with a 90-day lag).

For the above two scenarios, would it be better to set epspxq and other quarterly accounting items to missing?
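
A minimal sketch of that suggestion (hypothetical code, not from the repo; it assumes rdq and datadate are daily Stata dates, and the variable list is illustrative):

* blank out quarterly items whose RDQ contradicts the assumed 3-month lag
foreach v of varlist epspxq {
    replace `v' = . if !missing(rdq) & (rdq < datadate | rdq > datadate + 90)
}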

RSCRIPT_PATH

RSCRIPT_PATH does not need to be set on Linux systems, provided that Rscript is executable from the shell.

*_temp* variables not found

Hi,

I tried to run the Stata code but encountered some problems: "variable not found" errors for several *_temp* variables, namely sd10_tempEarnings, mean6_temp, and sd60_tempEpsDiff. I got stuck here and could not solve it. Could anyone please help?

If this issue is not actually an issue, please accept my apology. I am very new to Stata programming.

Thanks

FRED data

Thanks for your project; I really appreciate your work in making it open source.

Can I ask which package you are using to extract FRED data? Is it the one below?
https://github.com/sboysel/fredr

If so, there is an open issue for this package: sboysel/fredr#75

install.packages("fredr") is currently not working and should install this package from github: devtools::install_github("sboysel/fredr")

masterSAS.sas

Enhance the documentation by adding a link about "ssh to WRDS" in /Signals/Code/PrepScripts/masterSAS.sas.

Better deal with lags in ShareVol.do

ShareVol.do actually drops too many observations because it does not properly deal with observations at the beginning of the sample (where there aren't enough lags of the variables).

Fama-French 48 industry - ffind

I had a look at the source code of the ffind command. The behavior of ffind differs slightly from the FF48 industry classification in the French data library: ffind assigns all SIC codes that do not fall into any of the 47 industries to the Other (Almost Nothing) industry.
However, FF48 defines the Other industry as below:

4950-4959 
4960-4961 
4970-4971 
4990-4991 

ffind includes the following SIC codes in the Other industry: 900, 3990, 6797, 9995, 9997. Those SIC codes are not in the FF48 Other industry.

For your information, two files call the ffind command: Stata/14_PrepareAnnualCS.do and Stata/30_CreateSignalsNoFilterVersion.do.

Close Price in OptionMetricsProcessing.R

The OptionMetrics Manual describes the close field in secprd as "If this field is positive, then it is the closing price for the security on this date. If it is negative, then it is the average of the closing bid and ask prices for the security on this date. In case there are no valid bid or ask for the day, the record does not appear in the table at all." So I think there should be an abs function for close in this script.

Python

Who is going to be the brave soul that turns this into Python code? :)

Delisting return adjustment

Line 43 in Signals/Code/DataDownloads/I_CRSPmonthly.do: replace ret = ret + dlret.
In this way, 62 observations of delisting-adjusted ret are less than -1; the minimum ret is -1.99.

Would it be better to use replace ret = (1+ret) * (1+dlret) - 1, as suggested by WRDS? Then no delisting-adjusted return would be less than -1.
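
For intuition, with made-up numbers (not from the data):

* additive vs. compounded delisting adjustment
display -0.5 + -0.6                   // additive: -1.1, an impossible return
display (1 - 0.5) * (1 - 0.6) - 1     // compounded: -.8, bounded below by -1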

Manual download - sources

In your project there are mainly two datasets that need manual download:

|---- Calls 12_PrepareOtherData.do
|---- Uses pin1983-2001.dat which is currently not being downloaded automatically
|---- Uses SinStocksHong.xlsx which is currently not being downloaded automatically

Could you please indicate from which sources you downloaded the two files and whether they are publicly available?

Thanks for the great project!

WRDS_DL_IBES.sas

I am not able to run WRDS_DL_IBES.sas without including the location of the SAS macro:

%include '/wrdslin/wrdsmacros/iclink.sas';

t-stat results data

Is there perhaps a dataset where I can pick up the t-stats with minimal effort, to reconstruct the plots as in your paper? Thanks in advance.

Random order issue

I am not sure if my understanding here is correct or not. So could you please see my question below? Thanks in advance.

There might be a random order issue with the code block in lines 35 to 40 of Signals/Code/DataDownloads/C_CompustatQuarterly.do.

// Prepare year-to-date items
sort gvkey fyearq fqtr
foreach v of varlist sstky prstkcy oancfy fopty {
    gen `v'q = `v' if fqtr == 1
    by gvkey fyearq: replace `v'q = `v' - `v'[_n-1] if fqtr !=1
}

There are some observations with the same gvkey, fyearq and fqtr. In those cases, the sort order will be random (see page 2 of https://www.stata.com/manuals/dsort.pdf), so the calculation might not always use the most recent information.

When sorting by gvkey, fyearq and fqtr, the following example can happen:

gvkey   datadate   fyearq  fqtr  oancfy  oancfyq
001444  31mar2004  2004    1     -1.503  -1.503
001444  31oct2003  2004    1     -.736   -.736
001444  30jun2004  2004    2     -3.315  -2.579

For gvkey 001444 on 30jun2004, oancfyq is -2.579 which is -3.315 (30jun2004)+0.736 (31oct2003).

The following should make sure the order is correct:

sort gvkey fyearq fqtr datadate
foreach v of varlist sstky prstkcy oancfy fopty {
    gen `v'q = `v' if fqtr == 1
    by gvkey fyearq (fqtr datadate): replace `v'q = `v' - `v'[_n-1] if fqtr !=1
}

Check details of Accruals and AnnouncementReturn

A PhD student who would like to remain anonymous writes:

I am writing to confirm the detailed definition of two signals in SignalDocumentation.csv (AddInfo).

  1. Accruals (Sloan, 1996).

Detailed Definition: Annual change in current total assets (act) minus annual change in cash and short-term investments (che) minus annual change in current liabilities (lct) minus annual change in debt in current liabilities (dlc) minus change in income taxes (txp). All divided by average total assets (at) over this year and last year. Exclude if abs(prc) < 5.

In Sloan (1996), the two 'minus' terms seem to be 'plus' terms (see the note after this list). Also, Sloan (1996) subtracts depreciation and amortization expense.

  2. AnnouncementReturn (Chan, Jegadeesh and Lakonishok, 1996).

Detailed Definition: Get announcement date for quarterly earnings from IBES (fpi = 6). AnnouncementReturn is the sum of (ret - mktrf + rf) from one day before an earnings announcement to 2 days after the announcement.

The definition in Chan, Jegadeesh and Lakonishok (1996) seems to be 'from two days before an earnings announcement to one day after the announcement.'
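
On the first point, my reading of Sloan's (1996) balance-sheet definition in Compustat mnemonics (an interpretation, not the repo's documentation):

Accruals = [ (Δact − Δche) − (Δlct − Δdlc − Δtxp) − dp ] / average at

Distributing the outer minus, dlc and txp enter with a plus, and depreciation (dp) is subtracted, which is consistent with the student's reading.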

Missing Size characteristic?

Thank you for this amazing resource.
In the Appendix, it is mentioned that the Size factor (from Banz 1981) is available (it is undoubtedly an important one).
However, I fail to find it under its name (Size) in both the base and additional files.
Has the column name changed?
Many thanks!

Field descriptors for SignalBase.csv

Hi Guys

Thank you for a fantastic and fascinating article.

I have a few questions on the signalbase.csv dataset

  1. Is permno the Key/UniqueID for every company?
  2. Which field did you use for the Liquidity filter?

Kind regards

Daily Returns Statistics

Compounding the STreversal daily returns turns a dollar into a few billion dollars (all the other anomalies are squeezed to the bottom, in a normal range). I am wondering whether STreversal, IndRetBig, and IntMom might have some data error.


[FIX] 10_DownloadData.R Error: object 'Found' not found

Lines 326-329 read:

ipos = read_excel(path = tmp) %>% 
  transmute(Founding = Found,
            Offerdate = `Offer date`,
            CRSPperm = PERM)

Should Read:

ipos = read_excel(path = tmp) %>% 
  transmute(Founding = `Founding`,
            Offerdate = `Offer date`,
            CRSPperm = `CRSP PERM`)

(Note the backticks: quoted strings would just assign constant text rather than reference the columns.)

CRSP-IBES link table

The WRDS SAS macro keeps the ticker with the lower score when using the exchange ticker to complete the matching. However, there are four cases below where the scores are the same.

In the SAS macro, they sort by ticker and score and then keep the first ticker. This keeps the ticker with the lower score when the scores differ, but it randomly selects the match when the scores are the same.

11_PrepareLinkingTables.do requires a score of no more than 2, so the only question is the ticker PD2. How should the match be chosen given that the scores are the same: check this one further, or drop the ticker?

I do not think this will change the empirical results; this is just for your information.

ticker  permno  score
PD2     84155   0
PD2     90999   0
(permno 84155 was kept by the SAS code)

KRE1    65162   5
KRE1    65170   5
(permno 65170 was kept by the SAS code)

CMD1    52652   5
CMD1    63933   5
(permno 63933 was kept by the SAS code)

MP      13135   6
MP      55336   6
(permno 55336 was kept by the SAS code)
