David Byrne, Robert Goodhead, Michael McMahon, Conor Parle
This is a readme file, designed to allow researchers to apply the approach of Byrne et al. (2022a). The files here represent a minimal working example (MWE) and will parse 9 text files containing policy statements by the Federal Reserve in 2019.
You will need the following programmes on your system, and accessible from your system path: Python, Java JDK, Maven, Git Bash.
Our approach benefits from the use of multiple algorithms from the natural language processing (NLP) literature. In particular, we use:
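As a quick sanity check before starting, the sketch below tests whether each programme can be found on the system path. The executable names (`python`/`python3`, `javac`, `mvn`, `bash`) are assumptions about a typical installation and are not specified in this readme; adjust them to match your system.

```python
import shutil

# Assumed executable names for each required programme (not taken from this readme).
REQUIRED_TOOLS = {
    "Python": ["python", "python3"],
    "Java JDK": ["javac"],
    "Maven": ["mvn"],
    "Git Bash": ["bash"],
}


def find_missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the programmes for which no candidate executable is on the path."""
    missing = []
    for tool, candidates in required.items():
        # A programme counts as available if any one of its candidates resolves.
        if not any(which(name) for name in candidates):
            missing.append(tool)
    return missing


print("Missing from system path:", find_missing_tools() or "none")
```

An empty result only indicates that the executables were found, not that the installed versions are compatible with the tools used later.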
- the TMV algorithm of Ramm et al. (2017);
- the SUTime algorithm of Chang and Manning (2012);
- the MATE parser of Björkelund et al. (2010).
The present readme explains how and where to apply the codes of these other researchers. However, we do not provide access to these tools in the present repository, and refer users of our codes to the respective websites of these authors for the relevant codes.
The file structure for the working directory needs to be as follows:
- ./data
- ./mate_tools_working
- ./stanford-corenlp-4.0.0
- ./tmv_tool
The file structure for ./data/data_MWE needs to be as follows:
- ./data/data_MWE
  - /corpus
  - /corpus_prepared_for_mate
  - /mate_parsed_corpus
  - /mate_parsed_corpus_backup
  - /sent_corpus
  - /sent_dates_corpus
  - /sent_sutime_corpus
  - /tmv
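The folder layout listed above could be created with a short script along the following lines. This is a minimal sketch: the function name `create_layout` is hypothetical, and the tool folders it creates are empty placeholders that still need to be populated as described in the rest of this readme.

```python
from pathlib import Path

# Folder names as listed in this readme.
TOP_LEVEL = ["data", "mate_tools_working", "stanford-corenlp-4.0.0", "tmv_tool"]
DATA_MWE_SUBFOLDERS = [
    "corpus",
    "corpus_prepared_for_mate",
    "mate_parsed_corpus",
    "mate_parsed_corpus_backup",
    "sent_corpus",
    "sent_dates_corpus",
    "sent_sutime_corpus",
    "tmv",
]


def create_layout(working_dir):
    """Create the expected folders under working_dir and return their paths."""
    root = Path(working_dir)
    folders = [root / name for name in TOP_LEVEL]
    folders += [root / "data" / "data_MWE" / name for name in DATA_MWE_SUBFOLDERS]
    for folder in folders:
        # exist_ok=True leaves any folder that is already present untouched.
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling `create_layout(".")` from the working directory would set up the structure without overwriting existing content.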
The correct file structures for ./mate_tools_working, ./stanford-corenlp-4.0.0, and ./tmv_tool will be discussed subsequently.
Ensure the folder ./data/data_MWE/corpus includes the 9 text files of interest. Any text file containing English-language sentences will be parsed if it is placed in this folder, provided it follows the naming convention “filename_YYYYMMDD.txt”, where YYYYMMDD indicates the reference date of the document.
You will need to run the following Python scripts in sequence to pre-process the data:
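To check that a file name follows the convention before running the pipeline, a small helper such as the one below may be useful. It is an illustrative sketch and not part of our codes: the function name `reference_date` and the example file name in the comment are hypothetical.

```python
import re
from datetime import datetime

# Any non-empty stem, an underscore, eight digits, and the .txt extension.
FILENAME_PATTERN = re.compile(r"^(?P<stem>.+)_(?P<date>\d{8})\.txt$")


def reference_date(filename):
    """Return the reference date encoded in filename, or None if it does not conform."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    try:
        # strptime rejects impossible dates such as month 13 or 30 February.
        return datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        return None


# Example with a hypothetical file name:
# reference_date("statement_20190130.txt") gives the date 30 January 2019.
```

Files for which the helper returns `None` should be renamed before they are placed in ./data/data_MWE/corpus.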
- run_MWE_1a_preprocessing_minimal.py
- run_MWE_1c_preprocessing_for_mate.py
- run_MWE_1ca_dates_preprocessing_for_sutime.py
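The three scripts can be launched one after another by hand, or with a small wrapper such as the sketch below. It assumes that the scripts sit in the current working directory and take no command-line arguments, neither of which is stated in this readme; the function name `run_in_sequence` is hypothetical.

```python
import subprocess
import sys

# Script names and order as listed in this readme.
PREPROCESSING_SCRIPTS = [
    "run_MWE_1a_preprocessing_minimal.py",
    "run_MWE_1c_preprocessing_for_mate.py",
    "run_MWE_1ca_dates_preprocessing_for_sutime.py",
]


def run_in_sequence(scripts=PREPROCESSING_SCRIPTS, runner=subprocess.run):
    """Run each script with the current interpreter, stopping at the first failure."""
    for script in scripts:
        # check=True raises CalledProcessError if a script exits with an error,
        # so later steps never run on incomplete output.
        runner([sys.executable, script], check=True)
```

Calling `run_in_sequence()` executes the steps in the listed order; the `runner` argument exists only so that the wrapper can be tested without launching the scripts.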
Before one can run the TMV tool, one needs to set up the MATE parser.
- Download the file anna-3.61.jar from here and put it in a folder called ./mate_tools_working/anna.
- Download transition-1.30.jar from here and put it in a folder called ./mate_tools_working/transition.
- Download the parser + tagger from here, under “English Models”, which is a .mdl file called per-eng-S2b-40.mdl and put it in a folder called ./mate_tools_working/parser_tagger/.
- Download the parser .csh script from here, which is a tiny example script called parse-eng, and put it in ./mate_tools_working/parser_tagger/.
- Download the TMV tools (from the GitHub repository here) and unzip them inside ./tmv_tool. There should be folders called europarl, example-outputs, and tmv-annotator-tool, as well as the files .gitignore, LICENSE, and README.md.
- Note that you need the de-bugged version of the English variant of the tool. You should place the file TMV-EN_ecb_test_david.py (found in ./supplements) in the directory (./data/tmv-annotator-tool).
- You also need to add the file TMVtoHTML_ecb.py (found in ./supplements) in the directory (./data/tmv-annotator-tool).
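Once the downloads are in place, the MATE part of the setup can be verified with a check like the one below. This is a sketch under the assumption that the files keep exactly the names given in the list above (including `parse-eng` without an extension); the function name `missing_mate_files` is hypothetical.

```python
from pathlib import Path

# Relative locations inside ./mate_tools_working, as described in this readme.
EXPECTED_MATE_FILES = [
    "anna/anna-3.61.jar",
    "transition/transition-1.30.jar",
    "parser_tagger/per-eng-S2b-40.mdl",
    "parser_tagger/parse-eng",
]


def missing_mate_files(mate_dir="./mate_tools_working"):
    """Return the expected MATE files that are not present under mate_dir."""
    root = Path(mate_dir)
    return [rel for rel in EXPECTED_MATE_FILES if not (root / rel).is_file()]
```

An empty list from `missing_mate_files()` suggests the MATE files are where the later scripts expect them; it does not check the TMV files.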
You will need a version of Stanford CoreNLP on your system. In our applications, we used version 4.0.0, which is available here. While our codes may function with more recent versions, we cannot guarantee this will be the case.
The folder ./stanford-corenlp-4.0.0 should have the following file structure:
- ./stanford-corenlp-4.0.0
  - /.idea
  - /jars
  - /patterns
  - /sutime
  - /target
  - /tokensregex
Next, one needs to add two additional rules files that are bespoke to this paper and are not included in the core distribution of SUTime. To do this, move the files defs2.sutime.txt and english2.sutime.txt (found in ./supplements) to the folder ./stanford-corenlp-4.0.0/sutime.
One now needs to add a .java file, designed to extract the reference date from text file names, before applying SUTime to this reference date. To do this, move the two files run_sutime_on_MWE.java and run_sutime_on_MWE.class into ./stanford-corenlp-4.0.0. These two files are both found in ./supplements.
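The file placement for SUTime can also be scripted. The sketch below copies the four supplementary files rather than moving them, so that the originals remain in ./supplements; this is a deliberate deviation from the instructions above. The function name `install_sutime_supplements` and the default paths, which assume the script is run from the working directory, are hypothetical.

```python
import shutil
from pathlib import Path

# File name -> destination subfolder inside the CoreNLP folder ("." is its root).
SUPPLEMENT_TARGETS = {
    "defs2.sutime.txt": "sutime",
    "english2.sutime.txt": "sutime",
    "run_sutime_on_MWE.java": ".",
    "run_sutime_on_MWE.class": ".",
}


def install_sutime_supplements(supplements="./supplements",
                               corenlp="./stanford-corenlp-4.0.0"):
    """Copy the SUTime supplements into the CoreNLP folder and return the new paths."""
    copied = []
    for name, subfolder in SUPPLEMENT_TARGETS.items():
        target_dir = Path(corenlp) / subfolder
        target_dir.mkdir(parents=True, exist_ok=True)
        # copy2 preserves file metadata and returns the destination path.
        copied.append(Path(shutil.copy2(Path(supplements) / name, target_dir / name)))
    return copied
```

If a source file is missing, `shutil.copy2` raises `FileNotFoundError`, which signals that ./supplements is incomplete.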
To parse the data with TMV, one needs to run the following two files in sequence:
- bash_run_mate_on_MWE.sh
- bash_run_tmv_on_MWE.sh
To parse the data with SUTime, one needs to run the following file:
- bash_run_sutime_on_MWE.sh
The following routines apply a few cleaning operations, as detailed in Byrne et al. (2022a). They are specific to our investigations, and we leave it to future researchers to remove or modify any individual cleaning decision as they please.
- run_MWE_2a_tmv_1_data_input
- run_MWE_2a_tmv_1a_tempoword
- run_MWE_2b_sutime_datainput
- run_MWE_3a_tmv_preparation
- run_MWE_3b_sutime_preparation
The cleaned TMV and SUTime parsed data are stored, respectively, in the following .pkl files:
- ./data/data_MWE/data_TMV_cleaned.pkl
- ./data/data_MWE/data_SUTime_cleaned.pkl
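The output files can be read back with Python's standard `pickle` module, as in the sketch below. This readme does not state what type of object the files contain; if they hold objects from a third-party library (for example pandas), that library must be installed for unpickling to succeed. The function name `load_cleaned` is hypothetical.

```python
import pickle
from pathlib import Path


def load_cleaned(path):
    """Load one of the cleaned .pkl files and return the stored object."""
    with Path(path).open("rb") as handle:
        # Only unpickle files you trust: pickle can execute code on load.
        return pickle.load(handle)


# Example usage, assuming the pipeline has already been run:
# tmv_data = load_cleaned("./data/data_MWE/data_TMV_cleaned.pkl")
# sutime_data = load_cleaned("./data/data_MWE/data_SUTime_cleaned.pkl")
```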
Björkelund, Anders, Bernd Bohnet, Love Hafdell, and Pierre Nugues (2010), "A High-Performance Syntactic and Semantic Dependency Parser", Coling 2010, Demonstrations.
Byrne, David, Robert Goodhead, Michael McMahon, and Conor Parle (2022a), "Measuring the Temporal Dimension of Text: An Application to Policymaker Speeches", mimeo.
Byrne, David, Robert Goodhead, Michael McMahon, and Conor Parle (2022b), "Measuring the Temporal Dimension of Text: An Application to Policymaker Speeches", mimeo.
Chang, Angel X. and Christopher D. Manning (2012), "SUTime: A Library for Recognizing and Normalizing Time Expressions", 8th International Conference on Language Resources and Evaluation.
Ramm, Anita, Sharid Loáiciga, Annemarie Friedrich, and Alexander Fraser (2017), "Annotating Tense, Mood and Voice for English, French and German", Proceedings of ACL 2017, System Demonstrations.