This is the code I use to create an NCUSIP (historical CUSIP) to CIK (historical CIK) mapping.
A referee asked us to incorporate historical headquarters location data. Note that Compustat only provides CURRENT CUSIPs, current CIKs, and current headquarters.
- WRDS has a (historical) CIK-CUSIP (or even GVKEY-CIK) link table, but business schools (at least in the UK) seldom subscribe to it.
- This code is inspired by Leo Liu's CIK-CUSIP mapping, and I also borrow some code from him (thanks): https://github.com/leoliu0/cik-cusip-mapping You can also find a .csv file on Liu's webpage, but the date stamps are removed, which makes that mapping insufficient for my purposes.
- There is a paper studying this issue and building a mapping: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3530613 I did not find their code or mapping file, but it is still useful for learning more about this issue.
- Compustat only provides CURRENT CUSIPs, current CIKs, and current headquarters
- Historical headquarters data is available here; however, the historical CIK is the only identifier: https://sraf.nd.edu/sec-edgar-data/10-x-header-data/
- Therefore, the requirement is to build a link from Compustat GVKEY/CUSIP (current CUSIP) to historical CUSIP (e.g., CRSP NCUSIP), then to historical CIK, and finally to historical HQ locations.
- Manually, we can collect the mapping from 13D and 13G filings; this is also how WRDS creates its link table. WRDS: "This web query provides the historical link between a company's CIK and GVKEY. We create this link by first getting the CUSIP from a company's Schedule 13D/G. We then use the CUSIP to link the CIK from the header of the Schedule 13D/G to GVKEY in the Compustat tables."
- Get the list of all filings from SEC EDGAR. => "Download_SEC_File_List_CIK.py"
- Select only 13D and 13G filings. => "Download_SEC_13D13G_CUSIP-CIK_Mapping.py"
- Use regular expressions (RE) to extract the CUSIP and CIK from each 13D/13G filing. Each filing yields one pair, and all pairs together form the mapping. => "Download_SEC_13D13G_CUSIP-CIK_Mapping.py"
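As a sketch of the extraction step: the sample header and the regex patterns below are my assumptions, since real 13D/13G filings vary in format and a production run needs a larger pattern set.

```python
import re

# Hypothetical header snippet from a Schedule 13D/13G filing (formats vary).
text = """
CENTRAL INDEX KEY: 0000320193
CUSIP No. 037833100
"""

# Assumed patterns; real filings need several variants of each.
CIK_RE = re.compile(r"CENTRAL INDEX KEY:\s*(\d{10})", re.IGNORECASE)
CUSIP_RE = re.compile(r"CUSIP\s+No\.?\s*:?\s*([0-9A-Z]{6,9})", re.IGNORECASE)

def extract_pair(text):
    """Return (cusip, cik) if both are found in the filing text, else None."""
    cik = CIK_RE.search(text)
    cusip = CUSIP_RE.search(text)
    if cik and cusip:
        return cusip.group(1), cik.group(1)
    return None

print(extract_pair(text))  # -> ('037833100', '0000320193')
```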
1. The code is very time-consuming.
   1.1 I only use the first 200 lines of each report and drop all 13D/A and 13G/A (amendment) filings.
   1.2 Even so, it takes more than 10 hours to run locally on two laptops (an M2 MacBook Pro and a 16 GB AMD R9 Windows machine).
2. There are some missing values, but this is not a big issue for now. I will fix it later.
3. The code is difficult to deploy on a server.
   3.1 I tried to deploy it on AWS EC2 but failed every way; EDGAR seems to detect and ban the scraping. I will try to fix this later.
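One likely cause of the bans in 3.1 is an undeclared client: SEC's fair-access guidance asks for a User-Agent with contact details and caps request rates (documented at 10 requests/second). The sketch below shows the header and throttling; the contact string is a placeholder, and cloud IP ranges may still be blocked regardless.

```python
import time
import urllib.request

# SEC asks automated clients to identify themselves; replace with your
# own name and email (placeholder below).
HEADERS = {"User-Agent": "Your Name your.email@university.edu"}

def fetch(url: str) -> bytes:
    """Download one EDGAR URL with the declared User-Agent."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def polite_fetch_all(urls):
    """Yield each page, sleeping between requests to stay under the rate cap."""
    for url in urls:
        yield fetch(url)
        time.sleep(0.2)
```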
4. When using the mapping, note that the dates are not continuous: they are only the dates on which a 13D or 13G was filed. Filling forward between filings is therefore necessary.
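One way to fill forward is to turn each filing-date observation into a validity interval: a link holds from its filing date until the next filing for the same NCUSIP. This pandas sketch uses made-up column names and sample data, and the closing end date is an arbitrary choice:

```python
import pandas as pd

# Hypothetical mapping: one row per 13D/13G filing (column names assumed).
mapping = pd.DataFrame({
    "ncusip": ["12345678", "12345678", "87654321"],
    "cik": [1111, 1111, 2222],
    "filing_date": pd.to_datetime(["2010-03-01", "2013-06-15", "2011-01-10"]),
})

# Each link is valid from its filing date until the day before the next
# filing for the same NCUSIP; the last link runs to an arbitrary end date.
mapping = mapping.sort_values(["ncusip", "filing_date"])
mapping["valid_to"] = (
    mapping.groupby("ncusip")["filing_date"].shift(-1) - pd.Timedelta(days=1)
)
mapping["valid_to"] = mapping["valid_to"].fillna(pd.Timestamp("2023-12-31"))
```

A panel dataset can then be joined on `ncusip` with a date-range condition (or `pd.merge_asof` sorted by date) instead of an exact date match.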