TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages

The Topological Information Enhanced (TIE) model leverages the informative topological structures of web pages to tackle the web-based Structural Reading Comprehension (SRC) task, and achieved state-of-the-art results on the WebSRC dataset at the time of writing. This repository is the full implementation of TIE. For more details, please refer to our paper:

TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages

Requirements

The required Python packages are listed in requirements.txt. You can install them with

pip install -r requirements.txt

or

conda install --file requirements.txt

Data preparation

First, please follow the data pre-processing guidelines in the official WebSRC repository. Then, so that the NPR graph can be built efficiently later, we calculate the NPR relations between the valid tags of each web page and store them in a dictionary format. To do this, run

python src/data_preprocess.py --root_dir ./data --task rect_mask

The resulting dictionary for each web page is placed in the same directory as the corresponding HTML file, and its file name carries the additional suffix .relation.json.
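Before training, it can be useful to confirm that preprocessing produced a relation file for every page. The following is a minimal sketch rather than part of the repository: the helper name find_missing_relations is hypothetical, and because the text does not say whether the suffix is appended to the full file name (page.html.relation.json) or to the name without its extension (page.relation.json), the sketch accepts either form.

```python
from pathlib import Path

SUFFIX = ".relation.json"


def find_missing_relations(root_dir):
    """Return HTML files under root_dir that have no relation file beside them."""
    missing = []
    for html in sorted(Path(root_dir).rglob("*.html")):
        # Assumption: the suffix is appended either to the full file name
        # or to the name without ".html"; accept both.
        candidates = [
            html.with_name(html.name + SUFFIX),
            html.with_name(html.stem + SUFFIX),
        ]
        if not any(c.is_file() for c in candidates):
            missing.append(html)
    return missing


if __name__ == "__main__":
    for path in find_missing_relations("./data"):
        print("no relation file for", path)
```

An empty result suggests that every HTML file under ./data has a relation file next to it; any printed path points to a page that may need to be preprocessed again.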

Training

After completing the data preparation steps, TIE can be trained by running the train.sh file in the folder script/{backbone-PLM-for-CE}. The backbone model used for the Content Encoder of TIE is determined by the directory that contains the bash file. For example, to train TIE with MarkupLM as its Content Encoder, run

bash ./script/MarkupLM/train.sh

Moreover, to reproduce the experiments in the ablation study, you can use the argument --mask to specify the GAT masks used in TIE and the argument --direction to specify the relations used in the NPR graph.

Evaluation

Similarly, the bash files for evaluation can be found in the same directory as the bash file for training. The eval_stage_1.sh file evaluates the quality of TIE's tag predictions for all the saved checkpoints on the development set, while the eval_stage_2.sh file evaluates the final answer span predictions on the development set. For the latter, an additional token-level QA model, its model type, and a checkpoint of TIE need to be specified. For example, to evaluate the tag prediction quality of all the checkpoints saved by the previous example command, run

bash ./script/MarkupLM/eval_stage_1.sh

Then, for the answer refining stage, suppose that we use a MarkupLM model stored in the folder ./token_QA as the additional token-level QA model, and that the checkpoint we want to evaluate is located at ./result/MarkupLM/checkpoint-27000. Note that the previous example command stores the n-best answer tag predictions in a corresponding JSON file, which in this case is ./result/MarkupLM/nbest_predictions_27000.json. Therefore, to evaluate the final performance, run

bash ./script/MarkupLM/eval_stage_2.sh markuplm ./token_QA ./result/MarkupLM/nbest_predictions_27000.json
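When evaluating several checkpoints, the path of the n-best file can be derived from the checkpoint directory instead of being typed by hand. The sketch below is only an illustration: the helper name nbest_path_for is hypothetical, and it assumes the naming described in the text above, where checkpoint-27000 corresponds to nbest_predictions_27000.json in the same result folder.

```python
import re
from pathlib import Path


def nbest_path_for(checkpoint_dir):
    """Map .../checkpoint-<step> to .../nbest_predictions_<step>.json."""
    ckpt = Path(checkpoint_dir)
    # Assumption: checkpoint folders are named "checkpoint-<step>".
    match = re.fullmatch(r"checkpoint-(\d+)", ckpt.name)
    if match is None:
        raise ValueError(f"not a checkpoint directory: {checkpoint_dir}")
    return ckpt.parent / f"nbest_predictions_{match.group(1)}.json"


if __name__ == "__main__":
    print(nbest_path_for("./result/MarkupLM/checkpoint-27000"))
```

The returned path can then be passed as the last argument of eval_stage_2.sh, provided the file actually exists under that name.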

Reference

If you find TIE useful or inspiring in your research, please cite the corresponding paper. The BibTeX entry is listed below:

@article{zhao-etal-2022-tie,
  author    = {Zihan Zhao and
               Lu Chen and
               Ruisheng Cao and
               Hongshen Xu and
               Xingyu Chen and
               Kai Yu},
  title     = {{TIE:} Topological Information Enhanced Structural Reading Comprehension
               on Web Pages},
  journal   = {CoRR},
  volume    = {abs/2205.06435},
  year      = {2022},
  url       = {https://doi.org/10.48550/arXiv.2205.06435},
  doi       = {10.48550/arXiv.2205.06435},
  eprinttype = {arXiv},
  eprint    = {2205.06435}
}

License

This project is licensed under the license found in the LICENSE file. Portions of the source code are based on the official code of WebSRC and MarkupLM.
