Crystal-Ball-Minimal-Working-Example

David Byrne, Robert Goodhead, Michael McMahon, Conor Parle

This is a readme file designed to allow researchers to apply the approach of Byrne et al. (2022a). The files here represent a minimal working example (MWE) that will parse 9 text files containing policy statements issued by the Federal Reserve in 2019.

You will need the following programs on your system, accessible from your system path: Python, a Java JDK, Maven, and Git Bash.

Our approach benefits from the use of multiple algorithms from the natural language processing (NLP) literature. In particular,

  • the TMV algorithm of Ramm et al. (2017);
  • the SUTime algorithm of Chang and Manning (2012);
  • the MATE parser of Björkelund et al. (2010).

This readme explains how and where to apply the codes of these other researchers. However, we do not provide access to these tools in the present repository; users of our codes should obtain them from the respective authors' websites.

Step 1: Ensure the file structure is correct

The file structure for the working directory needs to be as follows:

  • ./data
  • ./mate_tools_working
  • ./stanford-corenlp-4.0.0
  • ./tmv_tool

The file structure for ./data/data_MWE needs to be as follows:

  • ./data/data_MWE
    • /corpus
    • /corpus_prepared_for_mate
    • /mate_parsed_corpus
    • /mate_parsed_corpus_backup
    • /sent_corpus
    • /sent_dates_corpus
    • /sent_sutime_corpus
    • /tmv

The correct file structures for ./mate_tools_working, ./stanford-corenlp-4.0.0, and ./tmv_tool will be discussed subsequently.
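If setting up from scratch, the folder skeleton can be created in one step. The following is a minimal sketch using only the Python standard library; the folder names are taken directly from the lists above, and the script itself is not part of the original codes.

```python
from pathlib import Path

# Top-level folders of the working directory
top_level = ["data", "mate_tools_working", "stanford-corenlp-4.0.0", "tmv_tool"]

# Sub-folders required under ./data/data_MWE
mwe_subfolders = [
    "corpus", "corpus_prepared_for_mate", "mate_parsed_corpus",
    "mate_parsed_corpus_backup", "sent_corpus", "sent_dates_corpus",
    "sent_sutime_corpus", "tmv",
]

for name in top_level:
    Path(name).mkdir(exist_ok=True)

for name in mwe_subfolders:
    Path("data", "data_MWE", name).mkdir(parents=True, exist_ok=True)
```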

Ensure the folder ./data/data_MWE/corpus includes the 9 text files of interest. In fact, any text file containing English-language sentences will be parsed if placed in this folder. It is necessary, however, that the files follow the naming convention “filename_YYYYMMDD.txt”, where YYYYMMDD indicates the reference date of the document.
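As a quick sanity check before running the pipeline, the reference date can be recovered from each file name. The sketch below assumes only the naming convention described above; the regular expression and the warning message are illustrative and not part of the original codes.

```python
import re
from datetime import datetime
from pathlib import Path

DATE_PATTERN = re.compile(r"_(\d{8})\.txt$")  # matches ..._YYYYMMDD.txt

for path in sorted(Path("data/data_MWE/corpus").glob("*.txt")):
    match = DATE_PATTERN.search(path.name)
    if match is None:
        print(f"WARNING: {path.name} does not follow filename_YYYYMMDD.txt")
        continue
    # Confirm the date string is a valid calendar date
    reference_date = datetime.strptime(match.group(1), "%Y%m%d").date()
    print(f"{path.name}: reference date {reference_date}")
```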

Step 2: Pre-process the textual data

You will need to run the following Python scripts in sequence to pre-process the data (a minimal driver is sketched after the list):

  • run_MWE_1a_preprocessing_minimal.py
  • run_MWE_1c_preprocessing_for_mate.py
  • run_MWE_1ca_dates_preprocessing_for_sutime.py
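The scripts can be run by hand or from a small driver such as the sketch below, which simply calls each script in order with the current Python interpreter and stops at the first failure. The driver itself is an assumption for convenience and is not part of the repository.

```python
import subprocess
import sys

preprocessing_scripts = [
    "run_MWE_1a_preprocessing_minimal.py",
    "run_MWE_1c_preprocessing_for_mate.py",
    "run_MWE_1ca_dates_preprocessing_for_sutime.py",
]

for script in preprocessing_scripts:
    # check=True aborts the sequence if a step exits with a non-zero status
    subprocess.run([sys.executable, script], check=True)
```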

Step 3: Set up the TMV tool

Before one can run the TMV tool, one needs to set up the MATE parser.

3a) Set up MATE

  • Download the file anna-3.61.jar from here and put it in a folder called ./mate_tools_working/anna.
  • Download transition-1.30.jar from here and put it in a folder called ./mate_tools_working/transition.
  • Download the parser and tagger from here, under “English Models”: a .mdl file called per-eng-S2b-40.mdl. Put it in a folder called ./mate_tools_working/parser_tagger/.
  • Download the parser .csh script from here, which is a tiny example script called parse-eng, and put it in ./mate_tools_working/parser_tagger/.

3b) Set up TMV

  • Download the TMV tools (from the GitHub repository here) and unzip them inside ./tmv_tool. There should then be folders called europarl, example-outputs, and tmv-annotator-tool, together with the files .gitignore, LICENSE, and README.md.
  • Note that you need the debugged version of the English variant of the tool: place the file TMV-EN_ecb_test_david.py (found in ./supplements) in the directory ./data/tmv-annotator-tool.
  • You also need to add the file TMVtoHTML_ecb.py (found in ./supplements) to the same directory (./data/tmv-annotator-tool). A short copy script is sketched below.
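The two supplement files can be copied by hand or with a short script such as the following sketch. The source and destination paths are taken from the step above; anything else is an assumption about your local layout.

```python
import shutil
from pathlib import Path

supplements = Path("supplements")
tmv_annotator_dir = Path("data/tmv-annotator-tool")  # destination given in the step above

for filename in ["TMV-EN_ecb_test_david.py", "TMVtoHTML_ecb.py"]:
    # Copy each debugged TMV script from ./supplements into the annotator directory
    shutil.copy(supplements / filename, tmv_annotator_dir / filename)
```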

Step 4: Set up SUTime

You will need a version of Stanford CoreNLP on your system. In our applications, we used version 4.0.0, which is available here. While our codes may function with more recent versions, we cannot guarantee this will be the case.

The folder ./stanford-corenlp-4.0.0 should have the following file structure:

  • ./stanford-corenlp-4.0.0
    • /.idea
    • /jars
    • /patterns
    • /sutime
    • /target
    • /tokensregex

Next, one needs to add two additional rules files, which are bespoke to this paper and are not included in the core distribution of SUTime. To do this, move the files defs2.sutime.txt and english2.sutime.txt (found in ./supplements) to the folder ./stanford-corenlp-4.0.0/sutime.

One now needs to add a .java file, designed to extract the reference date from each text file's name before applying SUTime relative to this reference date. To do this, move the two files run_sutime_on_MWE.java and run_sutime_on_MWE.class into ./stanford-corenlp-4.0.0. These two files are both found in ./supplements.

Step 5: Parse the data

To parse the data with TMV, one needs to run the following two files in sequence:

  • bash_run_mate_on_MWE.sh
  • bash_run_tmv_on_MWE.sh

To parse the data with SUTime, one needs to run the following file:

  • bash_run_sutime_on_MWE.sh

Step 6: Additional preparation routines

These routines apply a few cleaning operations, as detailed in Byrne et al. (2022a). The routines are specific to our investigations, and we leave it to future researchers to remove or modify any individual cleaning decision as they please.

  • run_MWE_2a_tmv_1_data_input
  • run_MWE_2a_tmv_1a_tempoword
  • run_MWE_2b_sutime_datainput
  • run_MWE_3a_tmv_preparation
  • run_MWE_3b_sutime_preparation

The cleaned TMV-parsed and SUTime-parsed data are stored, respectively, in the following .pkl files:

  • ./data/data_MWE/data_TMV_cleaned.pkl
  • ./data/data_MWE/data_SUTime_cleaned.pkl
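These outputs can then be loaded for analysis. The object types stored inside the pickles (e.g. pandas DataFrames) are not specified here, so the sketch below uses the standard pickle module, which returns whatever object was written.

```python
import pickle

# Load the cleaned TMV and SUTime outputs produced by the routines above
with open("data/data_MWE/data_TMV_cleaned.pkl", "rb") as handle:
    data_tmv = pickle.load(handle)

with open("data/data_MWE/data_SUTime_cleaned.pkl", "rb") as handle:
    data_sutime = pickle.load(handle)

print(type(data_tmv), type(data_sutime))
```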

References

Björkelund, Anders, Bernd Bohnet, Love Hafdell and Pierre Nugues (2010), "A High-Performance Syntactic and Semantic Dependency Parser", Coling 2010, Demonstrations.

Byrne, David, Robert Goodhead, Michael McMahon, and Conor Parle (2022a), "Measuring the Temporal Dimension of Text: An Application to Policymaker Speeches", mimeo.

Byrne, David, Robert Goodhead, Michael McMahon, and Conor Parle (2022b), "Measuring the Temporal Dimension of Text: An Application to Policymaker Speeches", mimeo.

Chang, Angel X. and Christopher D. Manning (2012), "SUTime: A Library for Recognizing and Normalizing Time Expressions", 8th International Conference on Language Resources and Evaluation.

Ramm, Anita, Sharid Loáiciga, Annemarie Friedrich, and Alexander Fraser (2017), "Annotating Tense, Mood and Voice for English, French and German", Proceedings of ACL 2017, System Demonstrations.
