Fast top-K Area Topics Extraction (FastKATE)

This repository contains the source code, data and API used in our recent paper: Fast Top-k Area Topics Extraction.

Prerequisites

The following dependencies are required and must be installed separately:

  • Python 3 (used to run our programs)
  • Aria2 (used to speed up downloading Wikipedia dumps)
  • WikiExtractor (used to extract plain text from Wikipedia dumps)

Then run git clone https://github.com/thuzhf/FastKATE.git to clone this repository into a directory named FastKATE. For convenience, place WikiExtractor and FastKATE under the same parent directory; the following steps refer to this parent directory as <PARENT>.
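
The layout described above can be checked before starting the longer steps. The following is a minimal sketch, not part of FastKATE: the function name check_prerequisites is hypothetical, the directory name wikiextractor is assumed from the WikiExtractor command used later in this README, FastKATE/src is assumed from the module paths, and aria2c is the name of Aria2's command-line executable.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(parent_dir):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: the expected paths are assumptions taken from the
    commands shown in this README, not something FastKATE itself enforces.
    """
    parent = Path(parent_dir)
    problems = []

    # The FastKATE programs are run with python3.
    if sys.version_info < (3,):
        problems.append("Python 3 is required")

    # Aria2 installs its command-line tool under the name aria2c.
    if shutil.which("aria2c") is None:
        problems.append("aria2c executable not found on PATH")

    # Assumed location of WikiExtractor, matching the command in the next section.
    extractor = parent / "wikiextractor" / "WikiExtractor.py"
    if not extractor.is_file():
        problems.append(f"WikiExtractor.py not found at {extractor}")

    # Assumed location of the FastKATE sources, matching FastKATE.src.* modules.
    if not (parent / "FastKATE" / "src").is_dir():
        problems.append(f"FastKATE/src not found under {parent}")

    return problems
```

Calling check_prerequisites with the path of <PARENT> returns the list of missing pieces, so an empty list suggests the layout matches what the later commands expect.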

Download and Preprocess Wikipedia Dumps

Our model relies on Wikipedia dumps, so these need to be downloaded first. The following steps use the Wikipedia dump with timestamp 20170901 as an example. Available timestamps can be found here.

  1. Run cd <PARENT> to enter the parent directory of FastKATE.

  2. Run python3 -m FastKATE.src.wiki_downloader 20170901 ./wikidata/ all to download all Wikipedia data that may be needed for timestamp 20170901 into the directory ./wikidata/. For quick help, run python3 -m FastKATE.src.wiki_downloader -h.

  3. Decompress all downloaded Wikipedia dumps into ./wikidata/, keeping each file's name minus its compression suffix (such as .gz or .bz2).

  4. Run python3 wikiextractor/WikiExtractor.py -o ./wikidata/preprocessed/ -b 64M --no-templates ./wikidata/enwiki-20170901-pages-articles-multistream.xml to preprocess the downloaded Wikipedia dump. For quick help, run python3 wikiextractor/WikiExtractor.py -h.
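
The decompression step above can be done with any tool. As one possible approach, here is a minimal Python sketch using only the standard library. The function name decompress_dumps is hypothetical and not part of FastKATE; it assumes that every .gz or .bz2 file in the data directory should be decompressed next to the original under the same name without the compression suffix.

```python
import bz2
import gzip
import shutil
from pathlib import Path

# Map each compression suffix to the standard-library opener that reads it.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def decompress_dumps(data_dir):
    """Decompress every .gz/.bz2 file in data_dir and return the new paths.

    Hypothetical helper, not part of FastKATE. Each output keeps the input's
    name minus its final suffix, e.g. dump.xml.bz2 becomes dump.xml.
    """
    written = []
    for src in sorted(Path(data_dir).iterdir()):
        opener = OPENERS.get(src.suffix)
        if opener is None or not src.is_file():
            continue  # not a compressed file we know how to handle
        dst = src.with_suffix("")  # drops only the trailing .gz or .bz2
        if dst.exists():
            continue  # already decompressed on an earlier run
        # Stream in 1 MiB chunks so large dumps never have to fit in memory.
        with opener(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout, length=1024 * 1024)
        written.append(dst)
    return written
```

Calling decompress_dumps("./wikidata/") from <PARENT> would leave the compressed files in place and write the decompressed copies beside them, so enough free disk space for both is needed.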

Generate Topic Embeddings

  1. Run cd <PARENT> to enter the parent directory of FastKATE.

  2. Run python3 -m FastKATE.src.topic_embeddings 20170901 ./wikidata/ to extract candidate topics (in the form of phrases) from the Wikipedia dump and generate a vector representation of each topic. For quick help, run: python3 -m FastKATE.src.topic_embeddings -h.

  3. A pretrained topic embeddings model (trained on the Wikipedia dump of timestamp 20161201 and used in our paper) can be downloaded here. It consists of 3 files; to use the pretrained model, download all 3 files and put them in the same folder.

  4. Our code can be easily modified to train topic embeddings on datasets other than Wikipedia. If you want to do this, please refer to the source code for details.

Extract Category Structure from Wikipedia

  1. Run cd <PARENT> to enter the parent directory of FastKATE.

  2. Run python3 -m FastKATE.src.taxonomy 20170901 ./wikidata/ to extract category structure from Wikipedia. For quick help, run: python3 -m FastKATE.src.taxonomy -h.

  3. A file containing the extracted category structure used in our paper can be downloaded here.
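
The topic embedding and category structure steps take the same two arguments (a timestamp and a data directory) and are both run from <PARENT>, so they can be scripted together. The sketch below is only an illustration: the function names build_step_commands and run_steps are hypothetical, while the module names and argument order are copied from the commands in this README.

```python
import subprocess
import sys


def build_step_commands(timestamp, data_dir):
    """Build the command lines for the embedding and taxonomy steps.

    Hypothetical helper: module names and argument order follow the README
    commands; sys.executable stands in for python3.
    """
    modules = ["FastKATE.src.topic_embeddings", "FastKATE.src.taxonomy"]
    return [[sys.executable, "-m", m, timestamp, data_dir] for m in modules]


def run_steps(parent_dir, timestamp, data_dir="./wikidata/"):
    """Run both steps in order from parent_dir, stopping at the first failure."""
    for cmd in build_step_commands(timestamp, data_dir):
        # cwd plays the role of `cd <PARENT>`; check=True raises
        # CalledProcessError if a step exits with a non-zero status.
        subprocess.run(cmd, cwd=parent_dir, check=True)
```

For example, run_steps("<PARENT>", "20170901") would run the two commands shown above one after the other, assuming the dump has already been downloaded and preprocessed.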

Fast top-K Area Topics Extraction (FastKATE) and its API

  1. Run cd <PARENT> to enter the parent directory of FastKATE.

  2. Run python3 -m FastKATE.src.api ./wikidata/ to run the extraction algorithm and set up the API. For quick help, run: python3 -m FastKATE.src.api -h. A currently running API can be visited here (its results now differ slightly from those in the original paper because we have integrated MAG and ACM CCS data to further improve them).

    • The inputs of the API are:

      • area: area name; should be lowercase; spaces should be replaced by _.
      • k: the number of topics to extract; should be a positive integer.
    • The output of the API is a dict in JSON format, which consists of:

      • area: the same as the input.
      • result: the top-k extracted topics of the given area, each accompanied by its relevance to the area and ranked by that relevance in descending order.
      • time: time consumed (in seconds).
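
The input rules and output keys above can be illustrated with a small client-side sketch. It is not the API's own code: the function names build_query and parse_response are hypothetical, and passing area and k as URL query parameters is an assumption, since this README does not state the request format or the endpoint address.

```python
import json
from urllib.parse import urlencode

# The three keys the README documents for the JSON output.
REQUIRED_KEYS = {"area", "result", "time"}


def build_query(area, k):
    """Normalize the inputs as described above and return a query string.

    Assumption: the API accepts `area` and `k` as URL query parameters.
    """
    if isinstance(k, bool) or not isinstance(k, int) or k <= 0:
        raise ValueError("k must be a positive integer")
    # Lowercase the area name and replace runs of whitespace with "_".
    normalized = "_".join(area.strip().lower().split())
    if not normalized:
        raise ValueError("area must not be empty")
    return urlencode({"area": normalized, "k": k})


def parse_response(body):
    """Decode a JSON response and check that the documented keys are present."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("response is not a JSON object")
    missing = REQUIRED_KEYS - set(payload)
    if missing:
        raise ValueError(f"response is missing keys: {sorted(missing)}")
    return payload
```

With these helpers, build_query("Machine Learning", 5) yields area=machine_learning&k=5, which would be appended to the API address. The structure of each entry in result is not specified here, so parse_response only checks the top-level keys and leaves result as decoded.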
