
Open source cross sectional asset pricing

This repo accompanies our paper: Chen and Zimmermann (2021), "Open source cross-sectional asset pricing"

If you use data or code based on our work, please cite the paper:

@article{ChenZimmermann2021,
  title={Open Source Cross Sectional Asset Pricing},
  author={Chen, Andrew Y. and Tom Zimmermann},
  journal={Critical Finance Review},
  year={Forthcoming}
}

Data

If you are mostly interested in working with the data, we provide both stock-level signals (characteristics) and a variety of portfolio implementations for direct download at the dedicated data page. Please see the data page for answers to FAQs.

However, this repo may still be useful for understanding the data. For example, if you want to know exactly how we construct BrandInvest (Belo, Lin, and Vitorino 2014), you can just open BrandInvest.do in Signals/Code/Predictors/.


Code

The code is separated into three folders:

  1. Signals/Code/ Downloads data from WRDS and elsewhere. Constructs stock-level signals (characteristics) and outputs to Signals/Data/. Mostly written in Stata.
  2. Portfolios/Code/ Takes in signals from Signals/Data/ and outputs portfolios to Portfolios/Data/. Entirely in R.
  3. Shipping/Code/ You shouldn't need this. We use this to prepare data for sharing.

We separate the code so you can choose which parts you want to run. If you only want to create signals, you can run the files in Signals/Code/ and then do your thing. If you just want to create portfolios, you can skip Signals/Code/ by directly downloading its output via the data page. The whole thing is about 15,000 lines, so you might want to pick your battles.

More details are below.

1. Signals/Code/

master.do runs everything. It calls every .do file in the following folders:

  • DataDownloads/ downloads data from WRDS and elsewhere
  • Predictors/ constructs stock-level predictors and outputs to Signals/Data/Predictors/
  • Placebos/ constructs "not predictors" and "indirect evidence" signals and outputs to Signals/Data/Placebos/

master.do employs exception handling, so if any of these .do files errors out (due to a lack of a subscription, out-of-date code, etc.), it'll keep running and output as much as it can.
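
The pattern is roughly as follows (a minimal sketch built on Stata's capture, not the literal master.do code; the folder path and local names are illustrative):

* run every .do file, log failures, and keep going
local dofiles : dir "Signals/Code/DataDownloads" files "*.do"
foreach f of local dofiles {
    capture noisily do "Signals/Code/DataDownloads/`f'"
    if _rc != 0 {
        display as error "`f' exited with error code " _rc ", moving on"
    }
}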

The whole thing takes roughly 24 hours, but the predictors will be done much sooner, probably within 12 hours. You can keep track of how it's going by checking out the log files in Signals/Logs/.

Minimal Setup

In master.do, set pathProject to the root directory of the project (where SignalDoc.csv is located) and wrdsConnection to the name you selected for your ODBC connection to WRDS (a.k.a. dsn).
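
For example (values are illustrative; this assumes master.do stores the two settings as Stata globals):

* adjust to your own project root and ODBC DSN name
global pathProject "C:/Users/you/crosssection/"
global wrdsConnection "wrds"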

If you don't have an ODBC connection to WRDS, you'll need to set it up. WRDS provides instructions for Windows users and for WRDS cloud users. Note that wrdsConnection (name of the ODBC connection) in the WRDS cloud example is "wrds-postgres". If neither of these solutions works, please see our troubleshooting wiki.

Optional Setup

The minimal setup will allow you to produce the vast majority of signals. And due to the exception handling in master.do, the code will run even if you're not set up to produce the remainder.

But if you want signals that use IBES, 13F, OptionMetrics, or FRED data, or a handful of other signals, you'll want to do the following:

  • For IBES, 13F, OptionMetrics, and bid-ask-spread signals: Run Signals/Code/PrepScripts/master.sh on the WRDS Cloud, and download the output to Signals/Data/Prep/. See master.sh for more details. The most important files from this optional setup are iclink.csv and oclink.csv, which allow for merging IBES, OptionMetrics, and CRSP data. The code here relies heavily on code by Luis Palacios, Rabih Moussawi, Denys Glushkov, Stacey Jacobsen, Craig Holden, Mihail Velikov, Shane Corwin, and Paul Schultz.

  • For signals that use the VIX, inflation, or broker-dealer leverage, you will need to request an API key from FRED. Before you run the download scripts, save your API key in Stata, either via the context menu or via set fredkey (a sketch follows this list). See this Stata blog entry for more details.

  • For signals that use patent citations, BEA input-output tables, or Compustat customer data, the code uses Stata to call R scripts, which may need some setup. If you're on a Windows machine, you will need to point master.do to your R installation by setting RSCRIPT_PATH to the path of Rscript.exe (see the sketch after this list). If you're on Linux, you just need to make sure that the Rscript command is executable from the shell.
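
For the FRED item above, a minimal sketch (the key is a placeholder; request your own from FRED):

* store a FRED API key for the download scripts (placeholder key shown)
set fredkey 1234567890abcdef1234567890abcdef, permanently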
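
For the patent-citations item, a sketch of the Windows-only setting (assuming RSCRIPT_PATH is a Stata global; the path is illustrative):

* point master.do to your Rscript executable (adjust to your R version)
global RSCRIPT_PATH "C:/Program Files/R/R-4.0.5/bin/Rscript.exe"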

2. Portfolios/Code/

master.R runs everything. It:

  1. Takes in signal data located in Signals/Data/Predictors/ and Signals/Data/Placebos/
  2. Outputs portfolio data to Portfolios/Data/Portfolios/
  3. Outputs exhibits found in the paper to Results/

It also uses SignalDoc.csv as a guide for how to run the portfolios.

By default the code skips the daily portfolios (skipdaily = T) and takes about 8 hours, assuming you examine all 300 or so signals. However, the baseline portfolios (based on predictability results in the original papers) will be done in just 30 minutes. You can keep an eye on how it's going by checking the CSVs written to Portfolios/Data/Portfolios/; every 30 minutes or so the code should output another set of portfolios. Adding the daily portfolios (skipdaily = F) takes an additional 12 or so hours.

Minimal Setup

All you need to do is set pathProject in master.R to the project root directory (where SignalDoc.csv is). Then master.R will create portfolios for Price, Size, and STreversal in Portfolios/Data/Portfolios/.

Probable Setup

You probably want more than Price, Size, and STreversal portfolios, and so you probably want to set up more signal data before you run master.R.

There are a couple of ways to set up this signal data:

  • Run the code in Signals/Code/ (see above)
  • Download Firm Level Characteristics/Full Sets/PredictorsIndiv.zip and Firm Level Characteristics/Full Sets/PlacebosIndiv.zip via the data page and unzip to Signals/Data/Predictors/ and Signals/Data/Placebos/
  • Download only some selected csvs via the data page and place in Signals/Data/Predictors/ (e.g. just download BM.csv, AssetGrowth.csv, and EarningsSurprise.csv and put them in Signals/Data/Predictors/).

3. Shipping/Code/

This code zips up the data, runs some quality checks, and copies files for uploading to Gdrive. You shouldn't need to use this, but we keep it with the rest of the code for replicability.


Stata and R Setup

Stata code was tested on both Windows and Linux. Linux was Ubuntu 18.04.5 running Stata 16.1.

R code was tested on

  • Windows 10, Rstudio Version 1.4.1106, R 4.0.5, and Rtools 4.0.0
  • Ubuntu 18.04.5, Emacs 26.1, ESS 17.11, and R 4.0.2

To install the Windows R setup

  1. Download and install R from https://cran.r-project.org/bin/windows/base/old/
  2. Download and install Rtools from https://cran.r-project.org/bin/windows/Rtools/history.html
  3. Download and install Rstudio from https://www.rstudio.com/products/rstudio/download/
  4. Add Rtools to the path by running in R:
     writeLines('PATH="${RTOOLS40_HOME}\\usr\\bin;${PATH}"', con = "~/.Renviron")
     (see https://cran.r-project.org/bin/windows/Rtools/)

Git Integration

If you use RStudio, take a look at Hendrik Bruns' guide to set up version control.

As a stand-alone client for Windows, we recommend Sourcetree.

If you use Git, you should definitely add the following lines to .gitignore:

Signals/Data/**
Shipping/Data/**
Portfolios/Data/**

These folders contain a ton of data and will make Git slow to a crawl or crash.


Contribute

Please let us know if you find typos in the code or think that we should add additional signals. You can let us know about any suggested changes via pull requests for this repo. We will keep the code up to date for other researchers to use.

Contributors

alecerb, antoniogilderubio, chenandrewy, lena-will, mk0417, opensourceap, tomz23


Issues

Daily Portfolios

Hi, just exploring the dataset. For the daily portfolios, I want to make sure that I am using the most up-to-date data -- this one only goes up to early 2020, for example.


Compustat quarterly data

Signals/Code/DataDownloads/C_CompustatQuarterly.do

gen time_avail_m = mofd(datadate) + 3  // Assume data available with a 3 month lag

This may have two minor issues. If we compare the RDQ (which is the release date of earnings) and datadate:

  1. RDQ < datadate: more than 600 observations.
  2. RDQ > datadate + 90 (days): more than 50,000 observations (which are not publicly available with a 90-day lag).

For the above two scenarios, would it be better to set epspxq and other quarterly accounting items to missing?
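
A minimal sketch of that suggestion (hypothetical code, not from the repo; it assumes rdq and datadate are daily Stata dates, and the variable list is illustrative):

* blank out quarterly items whose RDQ contradicts the assumed 3-month lag
foreach v of varlist epspxq {
    replace `v' = . if !missing(rdq) & (rdq < datadate | rdq > datadate + 90)
}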

RSCRIPT_PATH

RSCRIPT_PATH does not need to be set on Linux systems, provided that Rscript is executable from the shell.

*_temp* variables not found

Hi,

I tried to run the Stata code but encountered some problems: "variable not found" errors for several *_temp* variables, namely sd10_tempEarnings, mean6_temp, and sd60_tempEpsDiff. I got stuck here and could not solve it. Could anyone please help?

If this issue is not actually an issue, please accept my apology. I am very new to Stata programming.

Thanks

FRED data

Thanks for your project; I really appreciate your work in making it open source.

Can I ask which package you are using to extract FRED data? Is it the one below?
https://github.com/sboysel/fredr

If so, there is an open issue for this package: sboysel/fredr#75

install.packages("fredr") is currently not working and should install this package from github: devtools::install_github("sboysel/fredr")

masterSAS.sas

Enhance the documentation by adding a link about "ssh to WRDS" in /Signals/Code/PrepScripts/masterSAS.sas.

Better deal with lags in ShareVol.do

ShareVol.do actually drops too many observations because it does not properly deal with observations at the beginning of the sample (where there aren't enough lags of the variables).

Fama-French 48 industry - ffind

I had a look at the source code of the ffind command. The behavior of ffind differs slightly from the FF48 industry classification in the French data library: ffind assigns all SIC codes that do not fall into any of the 47 industries to the Other (Almost Nothing) industry.
However, FF48 defines the Other industry as below:

4950-4959 
4960-4961 
4970-4971 
4990-4991 

ffind includes the following SIC codes in the Other industry: 900, 3990, 6797, 9995, 9997. Those SIC codes are not in the FF48 Other industry.

For your information, two files call the ffind command: Stata/14_PrepareAnnualCS.do and Stata/30_CreateSignalsNoFilterVersion.do.

Close Price in OptionMetricsProcessing.R

The OptionMetrics Manual describes the close field in secprd as "If this field is positive, then it is the closing price for the security on this date. If it is negative, then it is the average of the closing bid and ask prices for the security on this date. In case there are no valid bid or ask for the day, the record does not appear in the table at all." So I think there should be an abs function for close in this script.

Python

Who is going to be the brave soul that turns this into Python code? :)

Delisting return adjustment

Line 43 in Signals/Code/DataDownloads/I_CRSPmonthly.do: replace ret = ret + dlret.
In this way, 62 observations of delisting-adjusted ret are less than -1; the minimum ret is -1.99.

Would it be better to use replace ret = (1+ret) * (1+dlret) - 1, as suggested by WRDS? Then no delisting-adjusted return would be less than -1.
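
For intuition, with made-up numbers (not from the data):

* additive vs. compounded delisting adjustment
display -0.5 + -0.6                   // additive: -1.1, an impossible return
display (1 - 0.5) * (1 - 0.6) - 1     // compounded: -.8, bounded below by -1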

Manual download - sources

In your project there are mainly two datasets that need manual download:

|---- Calls 12_PrepareOtherData.do
|---- Uses pin1983-2001.dat which is currently not being downloaded automatically
|---- Uses SinStocksHong.xlsx which is currently not being downloaded automatically

Could you please indicate from which sources you downloaded the two files and whether they are publicly available?

Thanks for the great project!

WRDS_DL_IBES.sas

I am not able to run WRDS_DL_IBES.sas without including the location of the SAS macro:

%include '/wrdslin/wrdsmacros/iclink.sas';

t-stat results data

Is there perhaps a dataset where I can pick up the t-stats with minimal effort, to reconstruct the plots as in your paper? Thanks in advance.

Random order issue

I am not sure if my understanding here is correct or not. So could you please see my question below? Thanks in advance.

There might be a random order issue with the code block in lines 35 to 40 of Signals/Code/DataDownloads/C_CompustatQuarterly.do.

// Prepare year-to-date items
sort gvkey fyearq fqtr
foreach v of varlist sstky prstkcy oancfy fopty {
    gen `v'q = `v' if fqtr == 1
    by gvkey fyearq: replace `v'q = `v' - `v'[_n-1] if fqtr !=1
}

There are some observations with the same gvkey, fyearq and fqtr. In those cases, the sort order will be random (see page 2 of https://www.stata.com/manuals/dsort.pdf), so the calculation might not always use the most recent information.

When sorting by gvkey, fyearq and fqtr, the following example can happen:

gvkey   datadate   fyearq  fqtr  oancfy  oancfyq
001444  31mar2004  2004    1     -1.503  -1.503
001444  31oct2003  2004    1     -.736   -.736
001444  30jun2004  2004    2     -3.315  -2.579

For gvkey 001444 on 30jun2004, oancfyq is -2.579 which is -3.315 (30jun2004)+0.736 (31oct2003).

The following should make sure the order is correct:

sort gvkey fyearq fqtr datadate
foreach v of varlist sstky prstkcy oancfy fopty {
    gen `v'q = `v' if fqtr == 1
    by gvkey fyearq (fqtr datadate): replace `v'q = `v' - `v'[_n-1] if fqtr !=1
}

Check details of Accruals and AnnouncementReturn

A PhD student who would like to remain anonymous writes:

I am writing to confirm the detailed definition of two signals in SignalDocumentation.csv (AddInfo).

  1. Accruals (Sloan, 1996).

Detailed Definition: Annual change in current total assets (act) minus annual change in cash and short-term investments (che) minus annual change in current liabilities (lct) minus annual change in debt in current liabilities (dlc) minus change in income taxes (txp). All divided by average total assets (at) over this year and last year. Exclude if abs(prc) < 5.

In Sloan (1996), the two 'minus' terms seem to be 'plus' terms (see the note after this list). Also, Sloan (1996) subtracts depreciation and amortization expense.

  2. AnnouncementReturn (Chan, Jegadeesh and Lakonishok, 1996).

Detailed Definition: Get announcement date for quarterly earnings from IBES (fpi = 6). AnnouncementReturn is the sum of (ret - mktrf + rf) from one day before an earnings announcement to 2 days after the announcement.

The definition in Chan, Jegadeesh and Lakonishok (1996) seems to be 'from two days before an earnings announcement to one day after the announcement.'
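
On the first point, my reading of Sloan's (1996) balance-sheet definition in Compustat mnemonics (an interpretation, not the repo's documentation):

Accruals = [ (Δact − Δche) − (Δlct − Δdlc − Δtxp) − dp ] / average at

Distributing the outer minus, dlc and txp enter with a plus, and depreciation (dp) is subtracted, which is consistent with the student's reading.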

Missing Size characteristic?

Thank you for this amazing resource.
In the Appendix, it is mentioned that the Size factor (from Banz 1981) is available (it is undoubtedly an important one).
However, I fail to find it under its name (Size) in both the base and additional files.
Has the column name changed?
Many thanks!

Field descriptors for SignalBase.csv

Hi Guys

Thank you for a fantastic and fascinating article.

I have a few questions on the signalbase.csv dataset

  1. Is permno the Key/UniqueID for every company?
  2. Which field did you use for the Liquidity filter?

Kind regards

Daily Returns Statistics

Compounding the STreversal daily returns turns a dollar into a few billion dollars (all the other anomalies are squeezed to the bottom, in a normal range). I am wondering whether STreversal, IndRetBig, and IntMom might have some data error.


[FIX] 10_DownloadData.R Error: object 'Found' not found

Lines 326-329 read:

ipos = read_excel(path = tmp) %>% 
  transmute(Founding = Found,
            Offerdate = `Offer date`,
            CRSPperm = PERM)

Should Read:

ipos = read_excel(path = tmp) %>% 
  transmute(Founding = `Founding`,
            Offerdate = `Offer date`,
            CRSPperm = `CRSP PERM`)

(Note the backticks: quoted strings would just assign constant text rather than reference the columns.)

CRSP-IBES link table

The WRDS SAS macro keeps the ticker with the lower score when using the exchange ticker to complete the matching. However, there are four cases below where the scores are the same.

In the SAS macro, they sort by ticker and score and then keep the first ticker. This keeps the ticker with the lower score when the scores differ, but it randomly selects the match when the scores are the same.

11_PrepareLinkingTables.do requires a score of no more than 2, so the only question is the ticker PD2. How should the match be chosen given that the scores are the same: check this one further, or drop the ticker?

I do not think this will change the empirical results; this is just for your information.

ticker  permno  score
PD2     84155   0
PD2     90999   0
(permno 84155 was kept by the SAS code)

KRE1    65162   5
KRE1    65170   5
(permno 65170 was kept by the SAS code)

CMD1    52652   5
CMD1    63933   5
(permno 63933 was kept by the SAS code)

MP      13135   6
MP      55336   6
(permno 55336 was kept by the SAS code)
