acl-org / acl-anthology

Data and software for building the ACL Anthology.

Home Page: https://aclanthology.org

License: Apache License 2.0

Shell 1.49% HTML 7.92% Perl 0.47% Python 87.40% Makefile 1.45% TeX 0.15% SCSS 0.80% Just 0.32%

acl acl-anthology computational-linguistics natural-language-processing library library-management-system

acl-anthology's Introduction

ACL Anthology

This repository contains:

Metadata for all papers, authors, and venues on the ACL Anthology website.
Code and instructions for generating the website.
A Python package for accessing the metadata, also available on PyPI.

The official home of this repository is https://github.com/acl-org/acl-anthology.

Using the acl-anthology-py Python package

Please see the separate README for the Python package for detailed information.

Generating the Anthology website

These are basic instructions on generating the ACL Anthology website as seen on https://aclanthology.org/.

Prerequisites

To build the Anthology website, you will need:

Python 3.8 or higher
Python packages listed in bin/requirements.txt; to install, run pip install -r bin/requirements.txt
Hugo 0.58.3 or higher (can be downloaded directly from their repo; the extended version is required!)
bibutils for creating non-BibTeX citation formats (not strictly required to build the website, but without them you need to invoke the build steps manually as laid out in the detailed README)
optional: If you install libyaml-dev and Cython before running make the first time, the libyaml C library will be used instead of a python implementation, speeding up the build.
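Before running make, it can help to confirm that the installed Hugo satisfies the requirement listed above. The following is a minimal sketch and not part of the repository: the helper names hugo_meets_requirement and check_local_hugo are hypothetical, and the check assumes that the output of hugo version contains a version token such as v0.58.3 and the word "extended" for extended builds.

```python
import re
import shutil
import subprocess

MIN_HUGO = (0, 58, 3)  # minimum version stated in the prerequisites


def hugo_meets_requirement(version_output, minimum=MIN_HUGO):
    """Return True if the text looks like an extended Hugo build >= minimum.

    Assumption: the version appears as 'v<major>.<minor>[.<patch>]' and
    extended builds mention the word 'extended' somewhere in the output.
    """
    match = re.search(r"v(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        return False
    version = tuple(int(part or 0) for part in match.groups())
    return version >= minimum and "extended" in version_output.lower()


def check_local_hugo():
    """Run `hugo version` if Hugo is on PATH; return None when it is missing."""
    if shutil.which("hugo") is None:
        return None
    result = subprocess.run(["hugo", "version"], capture_output=True, text=True)
    return hugo_meets_requirement(result.stdout)
```

Calling check_local_hugo() returns None when no hugo executable is found, and otherwise True or False depending on whether the reported build appears to be recent enough and extended.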

Building and deployment with GitHub

A GitHub Actions workflow performs deployment directly from GitHub. To use it, define this variable in your repository settings (web interface: settings -> secrets):

PUBLISH_SSH_KEY: the secret key in standard pem format for authentication (without a passphrase)

GitHub will then automatically build and deploy the current master whenever the master branch changes. This is done via the upload target in the Makefile.

Cloning

Clone the Anthology repo to your local machine:

$ git clone https://github.com/acl-org/acl-anthology

Generating

Provided you have correctly installed all requirements, building the website should be as simple as running make from the directory to which you cloned the repo.

The fully generated website will be in build/anthology afterwards. If any errors occur during this step, you can consult the detailed README for more information on the individual steps performed to build the site. You can see the resulting website by launching a local webserver with make serve, which will serve it at http://localhost:8000.

Note that building the website is quite a resource-intensive process; particularly the last step, invoking Hugo, uses about 18 GB of system memory. Building the anthology takes about 10 minutes on a laptop with an SSD.

(Note: This does not mean you need this amount of RAM in your system; in fact, the website builds fine on a laptop with 8 GB of RAM. The system might temporarily slow down due to swapping, however. The figure of approx. 18 GB is the maximum RAM usage reported when running hugo --minify --stepAnalysis.)

The anthology can be viewed locally by running hugo server in the hugo/ directory. Note that it rebuilds the site and therefore takes about a minute to start.

Hosting a mirror of the ACL anthology

First, creating a mirror is slow and stresses the ACL Anthology infrastructure because on initial setup you have to download every single file of the anthology from the official webserver. This can take up to 8 hours no matter how fast your connection is. So please don't play around with this just for fun.

If you want to host a mirror, you have to set two environment variables:

ANTHOLOGY_PREFIX the http prefix your mirror will be reachable under e.g. https://example.com/my-awesome-mirror or http://aclanthology.lst.uni-saarland.de (Notice that there is no slash at the end!)
ANTHOLOGYFILES the directory under which papers, attachments etc. will reside on your webserver. This directory needs to be readable by your webserver (obviously) but should not be a subdirectory of the anthology mirror directory.
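Because a trailing slash in ANTHOLOGY_PREFIX is easy to overlook, the two variables can be checked before starting a long build. This is a minimal sketch rather than anything shipped with the repository: the function name check_mirror_settings is hypothetical, and the requirement that the prefix start with http:// or https:// is an assumption drawn from the two example values above.

```python
import os


def check_mirror_settings(env):
    """Return a list of problems found in the two mirror variables."""
    problems = []
    prefix = env.get("ANTHOLOGY_PREFIX", "")
    files_dir = env.get("ANTHOLOGYFILES", "")
    if not prefix:
        problems.append("ANTHOLOGY_PREFIX is not set")
    else:
        # Assumption: the prefix is a full http(s) URL, as in the examples.
        if not prefix.startswith(("http://", "https://")):
            problems.append("ANTHOLOGY_PREFIX should start with http:// or https://")
        if prefix.endswith("/"):
            problems.append("ANTHOLOGY_PREFIX must not end with a slash")
    if not files_dir:
        problems.append("ANTHOLOGYFILES is not set")
    return problems


if __name__ == "__main__":
    # Check the variables of the current shell environment.
    for problem in check_mirror_settings(os.environ):
        print("warning:", problem)
```

The function takes any mapping, so it can be tried against a plain dictionary without touching the real environment.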

With these variables set, you run make to create the pages and make mirror to mirror all additional files into the build/anthology-files directory. If you created a mirror before already, it will only download the missing files.

If you want to mirror the papers but not all attachments, you can run make mirror-no-attachments instead.

You then rsync the build/website/ directory to your webserver or, if you serve the mirror in a subdirectory FOO, you mirror build/website/FOO. The build/anthology-files directory needs to be rsync-ed to the ANTHOLOGYFILES directory of your webserver.

As you probably want to keep the mirror up to date, you can modify the shell script bin/acl-mirror-cronjob.sh to your needs.

You will need this software on the server

rsync
git
python3
hugo > 0.58
python3-venv

If you want the build process to be fast, install cython3 and libyaml-dev (see above).

Note that generating the anthology takes quite a bit of RAM, so make sure it is available on your machine.

Contributing

If you'd like to contribute to the ACL Anthology, please take a look at:

our GitHub issues page
the detailed README which contains more in-depth information about generating and modifying the website.

History

This repo was originally wing-nus/acl and has been transferred over to acl-org as of 5 June 2017.

License

The code for building the ACL Anthology is distributed under the Apache License, v2.0.

acl-anthology's People

danielgildea mjpost chbrown bnmin robvanderg rspeer davidweichiang jtrmal akoehn aryamccarthy villalbamartin danyaljj kilian-gebhardt mayhewsw nschneid knmnyn politzerahles pkolachi drelhaj olamyy bryant1410 dbonadiman talschuster anidhi hajic jplalor akornilo sashank06 bowbowbow tushaargvs ibraheemtuffaha manuelciosici guyaglionby ddua accreator mayankjobanputra edwinzhng spencerwhitehead databill86 sjmielke jia-zh mardub1635 chchenhui drvenabili fagan2888 strubell mitcho jowagner rossanacunha bigaidream-experiments mattshardlow nikitasrivatsan franck-dernoncourt christophalt marcschulder witty-kitty azurah andreasvc namrathaurs tzawa dragomirradev fmmb belindal chrizba kawine dojoteef liulinlin90 nelson-liu jetrunner vered1986 timoeller pmarcis indexfziq andyweizhao mk-nakano jonmay brucewlee zsquaredz svirpioj rmalouf shimorina gcelano hadyelsahar mstelling mikewangwzhl ir-anthology heddayam xinru1414 potthast hfxunlp sarmilaupadhyaya sonvx neubig adamjankaczmarek davidstap ceramisch huyen144 yashalshakti esha-sg akash418

acl-anthology's Issues

Bibtex are @Misc instead of @InProceedings

Reported via email.

The bibtex entries are shown as @misc instead of @InProceedings:
https://aclanthology.info/papers/W13-2206/w13-2206

NAACL 2016 proceedings URL is 404

On http://aclanthology.info/events/naacl-2016, the NAACL 2016 proceedings URL (http://aclweb.org/anthology/N16-1) is 404.

Handle Errata

S/N: 9
Title: Handle Errata

Ingestion of new proceedings causes inconsistency in the database

When adding data for a new conference, and under some not-yet-determined circumstances, the database can end up in an inconsistent state. This is evidenced by error pages in the search functionality, as seen in issues #32, #39, and others.

INLG 2017 proceedings not listed for SIGGEN

I know the INLG 2017 proceedings are in the anthology (pdf, bib), but they're not listed on the SIGGEN page and don't appear in the search results.

Docker Container with minimal install

S/N: 23
Priority: Low
Title: Docker Container with minimal install
Proposer: Nitin Madnani
Notes: In progress

Handle video links from old anthology

S/N: 16
Priority: High
Title: Handle video links from old anthology

Move issues over to Github issues

S/N: 14
Title: Move issues over to Github issues
Notes: Assigned to Christian Federmann

OAI-PMH functionality

S/N: 22
Priority: Low
Title: OAI-PMH functionality

Show DOIs in single BibTex file

S/N: 18
Priority: Medium
Title: Show DOIs in single BibTex file

Number of items per page doesn't work if query contains number

Compare:

Queries for words without numbers in them work well (e.g. "framenet"), and I can select the number of items per page. But I cannot choose the number of items per page for queries like "flickr30k" without the website breaking down:

The page you were looking for doesn't exist.

You may have mistyped the address or the page may have moved.

If you are the application owner check the logs for more information.

CL titles not properly appearing in the Rails app.

Slightly related text:
I still see one problem with the CL articles:
in input files J10.xml through J16.xml, the journal title
"Computational Linguistics" is missing in the
header and therefore in the generated .bib files.

DBLP update

S/N: 4
Title: DBLP update

coli.uni-saarland.de VM server needs more disk space

We are at over 97% full on the server.

@villalbamartin, @CTNLP can you requisition more server space for the extra PDF files?

Move ingestion Q to Github Issues

S/N: 24
Priority: Low
Title: Move ingestion Q to Github Issues
Notes: Assigned to Christian Federmann

Try to get ACL, etc. indexed again

S/N: 15
Priority: high
Title: Try to get ACL, etc. indexed again

"Page doesn't exist" errors when searching by author name

I doubt this issue is unknown, but I couldn't find it in the issue list here on GitHub:

When I search for an author by name, quite often I get "The page you were looking for doesn't exist." errors even though there are papers by this author in the anthology. E.g. when looking for Hannah Rashkin (see Screenshot) I get this error, but there is a paper by her: http://aclanthology.info/papers/connotation-frames-a-data-driven-investigation.bib

ACM update

S/N: 5
Title: ACM update

Non-ACL events prior to 2000 are 404

Links to PDFs give a "not found" error, for example:
https://aclanthology.info/pdf/C/C98/C98-1001.pdf

connection with github does not work on the production machine

I can't seem to pull or push from the production machine in Saarland with the aclanthology user.
Can we debug this problem?

Presentation / Poster handling

S/N: 7
Title: Presentation / Poster handling

rake:import_xml undesired behavior when using quotes

As seen in bug #51, giving rake:import_xml a command-line argument that is not exactly 3 characters long deletes and re-imports the entire database from scratch. This is technically correct, but wrong for all practical purposes.

Expected behavior: the command should give an error message, something like "that file does not exist, and therefore cannot be imported".
Actual behavior: the database is deleted and re-imported from scratch.
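The expected behavior can be illustrated with a short sketch. The actual task is a Rake task in the Rails app, so the Python below is only a language-neutral illustration of the "fail instead of falling back" idea; the function name resolve_import_file, the import_dir parameter, and the "<argument>.xml" file layout are all hypothetical.

```python
from pathlib import Path


def resolve_import_file(argument, import_dir):
    """Return the XML file for `argument`, or raise instead of falling back.

    Hypothetical layout: each volume lives in '<import_dir>/<argument>.xml'.
    """
    if not argument:
        # An empty argument must never be read as "re-import everything".
        raise ValueError("no volume given; refusing to re-import everything")
    candidate = Path(import_dir) / (argument + ".xml")
    if not candidate.is_file():
        raise FileNotFoundError(
            f"{candidate} does not exist, and therefore cannot be imported"
        )
    return candidate
```

With a guard of this kind, a mistyped or wrongly quoted argument stops the import with an error message before any table is touched.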

Get author and paper statistics automated

S/N: 20
Date: 2016-09-09
Priority: Low
Title: Get author and paper statistics automated

ACL 2017: Bibtex and style files for auto inclusion of DOIs

S/N: 21
Date: 2016-09-09
Priority: Medium
Title: ACL 2017: Bibtex and style files for auto inclusion of DOIs

Ingestion Pipeline documentation

Documentation for ingesting new documents from STARTv2 into the Rails app shall be put up on the GitHub Wiki.

Move anthology PDFs over to aclanthology.info to get coverage by Scholar

S/N: 12
Priority: high
Title: Move anthology PDFs over to aclanthology.info to get coverage by Scholar
Proposer: Darcy Dapra

Presentation / Poster Link handling

S/N: 8
Title: Presentation / Poster Link handling

Paper pdf returning 403

The paper Assessing the Challenge of Fine-Grained Named Entity Recognition and Classification is returning a 403 when accessed through the Anthology interface. This happens because the paper's PDF has been moved, from http://www.aclweb.org/anthology/W10-2415.pdf to http://www.aclweb.org/anthology/W10-2415.old.pdf.

The paper can be still accessed through Google, but not through the Anthology. The person who pointed this out to me also mentioned that there's a revised version of it, available here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.357.7217&rep=rep1&type=pdf .

Should we replace the PDF link?

Character ń appears as ? in export files

The diacritic character ń is causing problems while generating export files. I have just checked my publications list:

http://aclanthology.info/catalog?utf8=%E2%9C%93&search_field=all_fields&q=Agnieszka+Falenska

The three papers which were published under surname "Faleńska" and not "Falenska" have errors in bibtex files (but also all other export files). There is "Fale?ska" without n on the authors list (it should be \'{n} ).

It might cause problems for anybody who would like to cite those papers. And also for any other authors who have ń in their surnames.
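One way to produce the \'{n} form is to decompose each character and turn its combining mark into a LaTeX accent command. This is a minimal sketch, not the Anthology's actual export code: the function name to_bibtex and the small accent table are illustrative assumptions, and the table is deliberately incomplete (for example, it does not handle letters such as ł or the dotless i).

```python
import unicodedata

# Combining marks mapped to LaTeX accent commands (a small, incomplete table).
LATEX_ACCENTS = {
    "\u0301": "'",  # acute, as in ń
    "\u0300": "`",  # grave
    "\u0308": '"',  # diaeresis
    "\u0302": "^",  # circumflex
    "\u0303": "~",  # tilde
}


def to_bibtex(text):
    """Rewrite accented letters as LaTeX accent commands, e.g. ń -> \\'{n}."""
    out = []
    for char in text:
        # NFD splits a precomposed letter into base letter + combining mark.
        decomposed = unicodedata.normalize("NFD", char)
        if len(decomposed) == 2 and decomposed[1] in LATEX_ACCENTS:
            out.append("\\" + LATEX_ACCENTS[decomposed[1]] + "{" + decomposed[0] + "}")
        else:
            out.append(char)
    return "".join(out)


print(to_bibtex("Faleńska"))  # prints: Fale\'{n}ska
```

Characters without a known accent mapping pass through unchanged, so plain ASCII names such as "Falenska" are not affected.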

Search not looking into content of PDF files

The current search does not explore the text of the papers themselves, but only their metadata. As it has been pointed out in issue #39, this is less than ideal.

The search functionality would be greatly enhanced if we could look into the content of the PDF files themselves. This is likely to be quite complicated.

I'm opening this issue as a feature proposal, in order to collect ideas.

Auto create all types of bib files

S/N: 6
Date: 2016-06-30
Title: Auto create all types of bib files
Proposer: Mark Steedman

Add users to trial VM

We now have a trial virtual machine with the computational linguistics group in Saarbrücken. All the volunteers should get accounts for this machine.

ERROR: cannot truncate a table referenced in a foreign key constraint

Did something change with respect to ingestion?

I didn't have this problem with re-ingesting proceedings but now it is happening. This is for W17.xml to change one author in W17-74.

aclanthology@aclanthology:~/acl-anthology$ rake import:xml["W17"] --trace
** Invoke import:xml (first_time)
** Invoke environment (first_time)
** Execute environment
** Execute import:xml
PG::FeatureNotSupported: ERROR: cannot truncate a table referenced in a foreign key constraint
DETAIL: Table "papers_people" references "people".
HINT: Truncate table "papers_people" at the same time, or use TRUNCATE ... CASCADE.
: TRUNCATE TABLE people RESTART IDENTITY;
rake aborted!
ActiveRecord::StatementInvalid: PG::FeatureNotSupported: ERROR: cannot truncate a table referenced in a foreign key constraint
DETAIL: Table "papers_people" references "people".
HINT: Truncate table "papers_people" at the same time, or use TRUNCATE ... CASCADE.

Fix old anthology ingestion to handle multi line authors

S/N: 11
Title: Fix old anthology ingestion to handle multi line authors

ACL 2017 not indexing correctly on Google scholar.

The dates and the publisher information are not being picked up by Google Scholar. Also, a lot of the references are not indexed with the correct metadata (i.e., they don't get added to the list of citations of the papers they cite).

Handle problem with single bibtex encoding

S/N: 17
Priority: Medium
Title: Handle problem with single bibtex encoding
Proposer: Mark Steedman

Inconsistencies of entry types of workshop paper

I noticed that for papers in workshop proceedings the bib file sometimes uses the entry type "inbook" (which from my point of view doesn't make sense), while for others "inproceedings" is used.

Compare http://aclanthology.info/papers/feelings-from-the-past-adapting-affective-lexicons-for-historical-emotion-analysis.bib and http://aclanthology.info/papers/semeval-2007-task-14-affective-text.bib

Handle two word last names

S/N: 1
Title: Add support for two-word last names
Proposed by: Benjamin Van Durme

rake import:sigs[true] misses updating semitic

Go fix import of sigs to deal with other named semitic.yaml

XML for J series missing title of journal

I still see one problem with the CL articles:
in input files J10.xml through J16.xml, the journal title
"Computational Linguistics" is missing in the
header and therefore in the generated .bib files.

I guess this is a problem with the web scraper?

Search via ... engines need to be integrated with current anthology

Suggested by Zeerak Waseem @zeerakw , @leondz and @evanmiltenburg @emilybender. See Twitter thread: https://twitter.com/evanmiltenburg/status/950367985889398790

Broken pdf links and incorrect url in bib files

The links to the individual pdf files referred to from https://aclanthology.info/volumes/proceedings-of-the-first-international-workshop-on-tree-adjoining-grammar-and-related-frameworks-tag-1 are broken.

The prefix that is used is "https://aclanthology.info/pdf/" (e.g., https://aclanthology.info/pdf/W/W90/W90-0200.pdf) while it should be (at least it's a link that works) "http://www.aclweb.org/anthology/" (e.g., http://www.aclweb.org/anthology/W/W90/W90-0200.pdf, or http://www.aclweb.org/anthology/W90-0200).

Moreover, in the bib files, the prefix that is used is http://aclanthology.coli.uni-saarland.de/pdf (e.g., http://aclanthology.coli.uni-saarland.de/pdf/W/W90/W90-0200.pdf)

Add stats and permission for supplementary materials

S/N: 10
Title: Add stats and permission for supplementary materials
Proposer: Diane Litman

Create a plan for mirrors.

In the not too distant future we should establish a number of mirrors that ensure that the core information of the anthology is always available. We will have to discuss exactly what these mirrors should do and where to get the resources for them.

P17-1.bib

Hi Min-Yen,

I think you are the maintainer of the ACL Anthology? I found an error in the ACL 2017 bibtex file which makes it unparseable. There is a missing comma at the end of line 4491 of this file:

    http://aclweb.org/anthology/P/P17/P17-1.bib

Sincerely,

Ingest Volume, Issue for journal articles

S/N: 19
Date: 2016-09-09
Title: Ingest Volume, Issue for journal articles
Proposer: Dan Gildea
Notes: Assigned to Dan Gildea

LREC 2014 proceedings are 404

On http://aclanthology.info/events/lrec-2014, the LREC 2014 proceedings URL (http://aclweb.org/anthology/L14-1) is 404.

On the same page, http://aclweb.org/anthology/L14-1000 is also 404.

Author order incorrect just on paper list page (D17 specifically)

via email.

The author order seems likely to be a UI issue in the new anthology.
When I click on each of the papers, I see a different (correct) order of authors on the paper details page.
I also compared this to the old anthology. The old anthology appears to have the authors in the correct order too.

My guess is that the metadata has the correct order, but the web rendering is somehow reordering the authors on the page (only in paper list in the new anthology).

See screenshots:

DOI update

S/N: 3
Title: DOI update
Notes: synced at 16 Dec 2015