
arxiv-browse's Introduction

arxiv-browse

Running Browse with the Flask development server

You can run the browse app directly.

make venv

or

python --version
# 3.10.x
python -m venv ./venv
source ./venv/bin/activate
pip install poetry==1.3.2
poetry install
python main.py

Then go to http://127.0.0.1:8080/abs/0906.5132

This will monitor for any changes to the Python code and restart the server. Unfortunately static files and templates are not monitored, so you'll have to manually restart to see those changes take effect.

By default, the application uses the directory trees in tests/data/abs_files and tests/data/cache when looking for document metadata and PDF files. These paths can be overridden via environment variables (see browse/config.py).
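As a rough illustration of that override behaviour, the sketch below reads a setting from the environment and falls back to the test-data default. Pairing ABS_PATH_ROOT and DOCUMENT_CACHE_PATH with these two directories is an assumption based on the variable names used elsewhere in this README; browse/config.py remains the authority on names and defaults.

```python
import os
from typing import Mapping, Optional

# Assumed pairing of variable names and defaults; check browse/config.py.
DEFAULTS = {
    "ABS_PATH_ROOT": "tests/data/abs_files",
    "DOCUMENT_CACHE_PATH": "tests/data/cache",
}


def resolve_setting(name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the environment value for ``name`` or fall back to the default."""
    env = os.environ if environ is None else environ
    return env.get(name, DEFAULTS[name])
```

Passing an explicit mapping instead of os.environ makes the behaviour easy to try out without touching the real environment.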

Running Browse with .env file

First, create a '.env' file somewhere. Using tests/.env is suggested.

export GOOGLE_APPLICATION_CREDENTIALS=<Your SA credential>
export BROWSE_SQLALCHEMY_DATABASE_URI="mysql://browse:<BROWSE_PASSWORD>@127.0.0.1:1234/arXiv"
export DOCUMENT_ABSTRACT_SERVICE=browse.services.documents.db_docs
export ABS_PATH_ROOT=gs://arxiv-production-data
export DOCUMENT_CACHE_PATH=gs://arxiv-production-data/ps_cache
export DOCUMENT_LISTING_PATH=gs://arxiv-production-data/ftp
export DISSEMINATION_STORAGE_PREFIX=gs://arxiv-production-data
export LATEXML_ENABLED=True
export LATEXML_INSTANCE_CONNECTION_NAME=arxiv-production:us-central1:latexml-db
export LATEXML_BASE_URL=https://browse.arxiv.org/latexml
export FLASKS3_ACTIVE=1
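To sanity-check a file in this export-style format before launching the app, something like the following sketch could be used. It is a hypothetical helper written for this README, not part of arxiv-browse, and it handles only simple `export KEY=value` lines with optional quoting.

```python
import shlex
from typing import Dict


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse ``export KEY=value`` lines, skipping blanks and comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, ignore it
        # shlex.split strips surrounding quotes the way a shell would
        parts = shlex.split(value)
        settings[key.strip()] = parts[0] if parts else ""
    return settings
```

The returned dictionary can then be compared against the variable names listed above to spot a missing or misspelled entry.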

You need service account (SA) credentials to access the database, and cloud-sql-proxy must be running. For LATEXML_INSTANCE_CONNECTION_NAME, you may need to ask someone who knows the database. (Note that this is a production DB.)

You can find the browse password here: https://console.cloud.google.com/security/secret-manager/secret/browse-sqlalchemy-db-uri/versions?project=arxiv-production
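If the password contains characters such as @ or /, it generally has to be percent-encoded before it is placed in a database URI. The sketch below assembles a URI of the same shape as BROWSE_SQLALCHEMY_DATABASE_URI above; the function name is hypothetical, and the defaults simply mirror the host, port, user, and database used in this README.

```python
from urllib.parse import quote_plus


def browse_db_uri(password: str, host: str = "127.0.0.1", port: int = 1234,
                  user: str = "browse", database: str = "arXiv") -> str:
    """Build a mysql:// URI, percent-encoding the password."""
    return f"mysql://{user}:{quote_plus(password)}@{host}:{port}/{database}"
```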

If you use PyCharm, set the run configuration's script to main.py, enable env files, and add tests/.env:

docs/development/pycharm-run-setting.png

SA Credentials

Your SA needs the following roles:

  • Cloud SQL Client
  • Secret Manager Secret Accessor
  • Storage Object Viewer

Save the private key somewhere on your local machine. Optionally, save it in 1Password.

Running cloud-sql-proxy

Once you have the Google SA private key, you can run the cloud-sql-proxy.

main proxy

NOTE: cloud_sql_proxy (old) and cloud-sql-proxy (new) take different options. This section describes only the new one, since you probably don't have the old one.

cloud-sql-proxy --address 0.0.0.0 --port 1234 arxiv-production:us-east4:arxiv-production-rep4

If the proxy is working, you can use the mysql client to connect to the database.

mysql -u browse -p --host 127.0.0.1 --port 1234 arXiv
Enter password: 
...
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show tables;
+------------------------------------------+
| Tables_in_arXiv                          |
+------------------------------------------+
| Subscription_UniversalInstitution        |
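If the mysql client cannot connect, a quick first check is whether anything is listening on the proxy port at all. The sketch below is a generic TCP check written for this README, not part of arxiv-browse; the host and port defaults mirror the cloud-sql-proxy command above.

```python
import socket


def proxy_is_listening(host: str = "127.0.0.1", port: int = 1234,
                       timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A True result only shows that the port accepts connections; it says nothing about whether the credentials behind the proxy are valid.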

Test suite

Run the main test suite with the following command:

pytest tests

Running Browse in Docker

You can also run the browse app in Docker. The following commands will build and run browse using defaults for the configuration parameters and will use the test data from tests/data. Install Docker if you haven't already, then run the following:

script/start

This command will build the docker image and run it. If all goes well, http://localhost:8000/ will render the home page.

Configuration Parameters

See browse/config.py for configuration parameters and defaults. Any of these can be overridden with environment variables.

Serving static files on S3

We use Flask-S3 to serve static files via S3.

After looking up the AWS keys, region, and bucket, run:

cd arxiv-browse
git pull
AWS_ACCESS_KEY_ID=x AWS_SECRET_ACCESS_KEY=x \
 AWS_REGION=us-east-1 FLASKS3_BUCKET_NAME=arxiv-web-static1 \
 pipenv run python upload_static_assets.py

In AWS -> CloudFront, select the static.arxiv.org distribution -> Invalidations -> Create invalidation, and enter a list of URL file paths, e.g. /static/browse/0.3.4/css/arXiv.css.
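When several assets change in one release, the invalidation list can be generated rather than typed by hand. This sketch assumes the /static/browse/<version>/<asset> layout seen in the single example path above; the function name is hypothetical.

```python
from typing import Iterable, List


def invalidation_paths(version: str, assets: Iterable[str]) -> List[str]:
    """Build CloudFront invalidation paths for the assets of one release."""
    return [f"/static/browse/{version}/{asset.lstrip('/')}" for asset in assets]
```

The resulting strings can be pasted, one per line, into the Create invalidation form.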

It may help to use a web browser's Inspect -> Network panel to find the active release version.
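The same information can also be pulled out of saved page source. The sketch below scans HTML for static asset URLs and returns the release versions it finds; the URL pattern is inferred from the example path /static/browse/0.3.4/css/arXiv.css and may not cover every asset.

```python
import re
from typing import List

# Assumed URL layout: /static/browse/<version>/..., as in the example path.
_STATIC_VERSION = re.compile(r"/static/browse/([0-9][0-9A-Za-z.\-]*)/")


def release_versions(html: str) -> List[str]:
    """Return the distinct release versions referenced in page source, in order."""
    seen: List[str] = []
    for version in _STATIC_VERSION.findall(html):
        if version not in seen:
            seen.append(version)
    return seen
```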

Tests and linting for PRs

A GitHub Action runs on PRs that merge into develop. PRs for which these tests fail are blocked. It is the equivalent of running:

# Fail if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

pytest tests
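To run the same two checks locally in one step, a small wrapper along these lines could be used. It is a sketch, not the project's actual workflow definition: the commands are copied from above, while the function name and the stop-at-first-failure behaviour are assumptions.

```python
import subprocess
from typing import Callable, List, Sequence

CHECKS: List[List[str]] = [
    ["flake8", ".", "--count", "--select=E9,F63,F7,F82", "--show-source", "--statistics"],
    ["pytest", "tests"],
]


def run_checks(runner: Callable[[Sequence[str]], int] = subprocess.call) -> int:
    """Run each check in order; return the first non-zero exit code, else 0."""
    for command in CHECKS:
        code = runner(command)
        if code != 0:
            return code
    return 0
```

The runner parameter exists so the sequencing can be exercised without flake8 or pytest installed; by default the commands are executed with subprocess.call.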

Setting up pytest in PyCharm

docs/development/pycharm-run-setting.png

arxiv-browse's People

Contributors

aliabid94, alisonhofer, antonfefilov, axtonpitt, bbarker, bdc34, bej9038, bmaltzan, cbf66, davidlfielding, dependabot[bot], domenicrosati, eawoods, erickpeirson, gragtah, jaimiemurdock, jimentwood, jonathanhyoung, jweiskoff, kanhari, kyokukou, ludoviofb, mhl10, mnazzaro, ntai-arxiv, osanseviero, rstojnic, sbbcornell, shinminjeong, zeke


arxiv-browse's Issues

Add doi and url to the bibtex export of preprints

Is your feature request related to a problem? Please describe.
Many journal bibtex style files (.bst) do not support arXiv preprints but most support doi or url fields. When citing an arXiv preprint for some journal publications, I thus have to add these fields manually.
Here are examples of bibtex exports before and after these extra fields have been added:
bibtex_export_after.txt
bibtex_export_before.txt

Describe the solution you'd like
Add the arXiv DOI and URL as fields in the BibTeX export for preprints.

Additional context
I've created a corresponding pull request at #436
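For illustration, an export with the two extra fields might be produced by something like the sketch below. It is not the code from the pull request: the function name, the @misc entry type, and the field order are assumptions, and the attached before/after files are not reproduced here. The DOI uses the 10.48550/arXiv.<id> form that arXiv registers for its preprints.

```python
from typing import List


def bibtex_entry(key: str, arxiv_id: str, title: str, authors: List[str],
                 year: int, primary_class: str) -> str:
    """Format a preprint as BibTeX, including ``doi`` and ``url`` fields."""
    fields = [
        ("title", title),
        ("author", " and ".join(authors)),
        ("year", str(year)),
        ("eprint", arxiv_id),
        ("archivePrefix", "arXiv"),
        ("primaryClass", primary_class),
        ("doi", f"10.48550/arXiv.{arxiv_id}"),      # added field
        ("url", f"https://arxiv.org/abs/{arxiv_id}"),  # added field
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"
```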

CORE Recommender links

As reported by @jimentwood:

When you right click on a paper recommended in the "Recommenders and Search Tools" (Part of arxiv labs under an article), and click "open link in new tab", it will reopen the same url you're currently on. Expectation is that it opens the URL of the actual recommended paper. If you left click on the recommended paper directly, everything behaves as expected and the recommended paper opens in new tab.

Another way to test for this is to right click, then choose "Copy link location" in the context menu; you will see that it is the same link as the original.

Define canonical URIs of categories

Is your feature request related to a problem? Please describe.

I'd like to represent arXiv categories in RDF (using SKOS) so that links between categories and publications, and between categories and other taxonomies, can be expressed as RDF triples. So far there is no official URI form for identifiers such as cs and cs.AI.

Describe the solution you'd like

  1. Define clean and durable URIs such as https://arxiv.org/category/cs.AI or https://arxiv.org/category_taxonomy/cs.AI
  2. Let those URIs resolve to anything but 404 (e.g. redirect with HTTP Status code 302 to https://arxiv.org/list/cs.AI/recent)
  3. Maybe later add dedicated pages for each category URI and return RDF if requested (low priority)

The solution (1+2) is likely fewer than 10 lines of code:

from typing import Any
from flask import redirect
from werkzeug.exceptions import NotFound

@blueprint.route("/category_taxonomy/<string:category_id>", methods=["GET"])
def category_uri(category_id: str) -> Any:
    # TODO: define is_known_category to check whether the category exists
    if is_known_category(category_id):
        return redirect(f"/list/{category_id}/recent", code=302)
    raise NotFound
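The helper left as a TODO could be as small as an exact-match lookup. In this sketch the category set is a hypothetical sample; a real implementation would consult arXiv's category taxonomy instead.

```python
from typing import Collection

# Hypothetical sample; a real check would consult arXiv's category taxonomy.
SAMPLE_CATEGORIES = frozenset({"cs", "cs.AI", "math", "math.CO"})


def is_known_category(category_id: str,
                      known: Collection[str] = SAMPLE_CATEGORIES) -> bool:
    """Return True if ``category_id`` exactly matches a known category."""
    return category_id in known
```

Exact matching means anything not on the list (different case, path fragments) falls through to the 404 branch rather than being redirected.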

Describe alternatives you've considered

  • Use listing pages such as https://arxiv.org/list/cs.AI/recent as URIs, but these reference lists of publications, not the categories
  • Define my own, incompatible URI schema such as http://example.org/arxiv-categories/cs.AI

Additional context

In Wikidata there is https://www.wikidata.org/wiki/Property:P820 to express the arXiv category, but there is no https://www.wikidata.org/wiki/Property:P1921 for it, as there is for other identifiers.
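To make the RDF use case concrete, the sketch below emits N-Triples that describe a category as a SKOS concept. The base URI is the one proposed in this issue, not an existing arXiv URI scheme, and treating the archive (cs) as the broader concept of a subject class (cs.AI) is a modelling assumption.

```python
from typing import List

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "https://arxiv.org/category/"  # proposed in this issue, not an existing scheme


def category_triples(category_id: str) -> List[str]:
    """Return N-Triples describing a category as a SKOS concept."""
    subject = f"<{BASE}{category_id}>"
    triples = [
        f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .",
        f'{subject} <{SKOS}notation> "{category_id}" .',
    ]
    archive, dot, _ = category_id.partition(".")
    if dot:  # assumption: cs.AI is narrower than cs
        triples.append(f"{subject} <{SKOS}broader> <{BASE}{archive}> .")
    return triples
```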

Getting the development environment running

I'm working on a new arXiv Labs integration for Replicate.com, but having a bit of trouble getting the development environment set up.

Environment

I'm on an M1 Mac running Big Sur 11.5.2

$ uname -a
Darwin ezekiels-mbp.lan 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64

What I've tried

I created a new virtual environment using pyenv:

pyenv virtualenv arxiv
pyenv activate arxiv

Next I installed pipenv in that virtualenv

pip install pipenv

Next I looked in the Pipfile and saw this:

[requires]
python_version = "3.6"

Next I attempted to install Python 3.6:

pyenv install 3.6.15

That failed, though, as there are no distributions of Python 3.6 for the ARM64 architecture of M1 Macs. That's no surprise, since Python 3.6 is now End of Life.


I tried bumping the Python version in the Pipfile to 3.9 and re-running pipenv install but then started getting errors about mysqlclient not being installed.


I removed Pipfile.lock before running pipenv install. Still no luck.


I created a fresh virtualenv running Python 3.7, but that failed with a different error: ModuleNotFoundError: No module named '_ctypes'


Before I dig in further, I figured I'd open an issue here and see if anyone might have some suggestions on how I can unblock myself. Thanks for any help!

`citation_doi` meta tag missing from head

Describe the bug
The citation_doi meta tag is missing from the head.

To Reproduce

  1. View an article and note there is an assigned DOI
  2. View page source and note citation_doi meta tag is not defined

Expected behavior
There should be a citation_doi meta tag in the head.

Screenshots
image

Desktop (please complete the following information):

  • OS: macOS 13.4.1
  • Browser: Chrome
  • Version: 115.0.5790.114

Additional context
Possibly related to #330? citation_doi should be defined in metatags.py but isn't.
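A minimal sketch of what emitting the tag could look like is given here. It is standalone and does not reflect how metatags.py is actually structured; the function name is hypothetical.

```python
from html import escape
from typing import Optional


def citation_doi_meta(doi: Optional[str]) -> str:
    """Return a citation_doi meta tag, or an empty string when no DOI exists."""
    if not doi:
        return ""
    return f'<meta name="citation_doi" content="{escape(doi, quote=True)}"/>'
```

Returning an empty string for articles without a DOI keeps the head free of empty tags.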

deep link to labs tabs with URL fragments

Is your feature request related to a problem? Please describe.

I would like to share a link with someone to show them the new Replicate Demos integration. Currently I have to say something like "visit this URL, scroll down, and click the Demos tab".

Describe the solution you'd like

I would like to be able to tack #demos onto the end of a URL and have the site automatically open that tab and scroll it into view.

Screen Shot 2022-03-09 at 11 41 26 AM

I'm happy to implement this but wanted to get a sense first whether this would be a welcome change.

cc @mhl10 @SBBCornell

Bibliographic Explorer - Semantic Scholar request fails - CORS Missing Allow Origin

Describe the bug
The request to Semantic Scholar is missing a header causing the request to fail.

To Reproduce
Steps to reproduce the behavior:

  1. Go to any paper's url (example)
  2. Ensure 'Bibliographic Explorer' is on
  3. Observe error:
    image

My browser's dev console notes "CORS Missing Allow Origin" on the preflight request. Adding Access-Control-Allow-Origin: * to the request manually allows the request to successfully complete.
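Browsers look for Access-Control-Allow-Origin on the server's response. As a hedged illustration of the check a browser applies, the sketch below tests a set of response headers against a page origin. It ignores credentialed requests and the other CORS headers, and the function name is hypothetical.

```python
from typing import Mapping


def cors_allows(response_headers: Mapping[str, str], origin: str) -> bool:
    """Return True if Access-Control-Allow-Origin admits ``origin``."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in response_headers.items()}
    allowed = lowered.get("access-control-allow-origin")
    return allowed is not None and allowed in ("*", origin)
```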

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Firefox 91, Chrome 92

urlize for comments with urls containing `?` works differently in list and single view

When looking at the list and individual views of a recently published arXiv paper of mine, I noticed that the comment gets urlized in different ways.

This is the raw comment:

8 pages, 15 figures. This research was originally conducted as part of the Master's practical seminar "Computational Ethics" at LMU Munich in the winter term of 2021, supervised by Prof. Dr. Francois Bry. Significant contributions by Matthias Fruth are acknowledged. The model available on the web under https://netlogoweb.org/web?https://neothethird.gitlab.io/ceth-seminar/model.nlogo

In the individual view, the URL gets parsed correctly:

image

In the list view, the ? is not understood correctly:

image
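For comparison, here is a small sketch of a urlizer that keeps ? (and anything after it) inside the link. It is not the implementation used by either arXiv view; the URL pattern and the trailing-punctuation rule are simplifying assumptions.

```python
import re
from html import escape

_URL = re.compile(r'https?://[^\s<>"]+')
_TRAILING = ".,;:!?)"  # punctuation that usually ends a sentence, not a URL


def urlize(text: str) -> str:
    """Wrap each http(s) URL in an anchor tag, keeping '?' inside the URL."""
    out = []
    last = 0
    for match in _URL.finditer(text):
        url = match.group(0).rstrip(_TRAILING)
        out.append(escape(text[last:match.start()]))
        href = escape(url, quote=True)
        out.append(f'<a href="{href}">{href}</a>')
        last = match.start() + len(url)
    out.append(escape(text[last:]))
    return "".join(out)
```

Applied to the comment above, the whole address, including the part after the ?, ends up in a single link.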

Can't load dependencies

Describe the bug
I think I am missing a weird dependency. I keep running app.py and adding dependencies as they are found, but have gotten stuck here:

Traceback (most recent call last):
  File "/.../arxiv-browse-0.3.2.5/app.py", line 2, in <module>
    from browse.factory import create_web_app
  File "/..../arxiv-browse-0.3.2.5/browse/factory.py", line 15, in <module>
    from arxiv.users.auth import Auth
ModuleNotFoundError: No module named 'arxiv.users'

Process finished with exit code 1

When I try to run pip install arxiv-auth:

Collecting arxiv-auth
  Using cached arxiv-auth-0.4.2.tar.gz (35 kB)
Collecting pycountry
  Using cached pycountry-20.7.3.tar.gz (10.1 MB)
Requirement already satisfied: sqlalchemy in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from arxiv-auth) (1.3.19)
Collecting mimesis
  Using cached mimesis-4.1.2.tar.gz (2.8 MB)
Collecting mysqlclient
  Using cached mysqlclient-2.0.1.tar.gz (87 kB)
    ERROR: Command errored out with exit status 1:
     command: /usr/local/bin/python3.8 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/xh/xfrbbd5d74g0p6wcxx2ymq3m0000gn/T/pip-install-kj4hzv7p/mysqlclient/setup.py'"'"'; __file__='"'"'/private/var/folders/xh/xfrbbd5d74g0p6wcxx2ymq3m0000gn/T/pip-install-kj4hzv7p/mysqlclient/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/xh/xfrbbd5d74g0p6wcxx2ymq3m0000gn/T/pip-pip-egg-info-f06mv1uj
         cwd: /private/var/folders/xh/xfrbbd5d74g0p6wcxx2ymq3m0000gn/T/pip-install-kj4hzv7p/mysqlclient/
    Complete output (12 lines):
    /bin/sh: mysql_config: command not found
    /bin/sh: mariadb_config: command not found
    /bin/sh: mysql_config: command not found
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/xh/xfrbbd5d74g0p6wcxx2ymq3m0000gn/T/pip-install-kj4hzv7p/mysqlclient/setup.py", line 15, in <module>
        metadata, options = get_config()
      File "/private/var/folders/xh/xfrbbd5d74g0p6wcxx2ymq3m0000gn/T/pip-install-kj4hzv7p/mysqlclient/setup_posix.py", line 65, in get_config
        libs = mysql_config("libs")
      File "/private/var/folders/xh/xfrbbd5d74g0p6wcxx2ymq3m0000gn/T/pip-install-kj4hzv7p/mysqlclient/setup_posix.py", line 31, in mysql_config
        raise OSError("{} not found".format(_mysql_config_path))
    OSError: mysql_config not found
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
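The log ends with mysql_config not found, which suggests the mysqlclient build could not locate the MySQL or MariaDB client development tools. A quick way to check for them before retrying is sketched below; the function names are hypothetical, and the idea that either helper is enough is inferred from the log, which tries both mysql_config and mariadb_config.

```python
import shutil
from typing import Callable, List, Optional, Sequence


def missing_tools(tools: Sequence[str] = ("mysql_config", "mariadb_config"),
                  which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the tools from ``tools`` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def can_build_mysqlclient(which: Callable[[str], Optional[str]] = shutil.which) -> bool:
    """True if at least one of the two config helpers is on PATH (assumption)."""
    return len(missing_tools(which=which)) < 2
```

If both helpers are missing, installing the MySQL or MariaDB client development package for the platform is the usual next step.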

Note that I am using PyCharm, but I don't think that makes a difference.

To Reproduce
Steps to reproduce the behavior:
  1. Download the zip file
  2. Uncompress it
  3. Set it up as a project
  4. Run, install the missing dependency, and repeat

Expected behavior
Run sample program

Screenshots
N/A

Desktop (please complete the following information):

  • OS: OSX
  • Browser: N/A
  • Python version: 3.8.2
  • PyCharm: 2020.2
    Other versions:
    image

image

Add a contributing guide

Is your feature request related to a problem? Please describe.

I'm a first-time contributor, part of the team at @replicate (the folks who built @arxiv-vanity). 👋🏼

I'm starting to work on an arXiv Labs integration and trying to get my dev env set up. I looked for a CONTRIBUTING.md at the top-level of the repo, but don't see one.

Describe the solution you'd like

I think it would be great to have a contributing guide with details about the project code of conduct, how to make contributions, how to get a development environment up and running, etc.

The existing README here could be used as a starting point for describing the development environment.

Describe alternatives you've considered

The contributing guide could also live in the README, but I think it would be preferable to keep the README short and sweet, with more of an overview of what this repo actually is/does/contains, plus links to things like the contributing guide, the org-wide contributing guide, etc.

Additional context

I'd be happy to help on this once I've got my development environment working. See #253

cc @erickpeirson @mhl10 @bdc34, because I see you three have worked on these docs in the past.

Inline Commenting

Not sure if this is the right place to post, but I was wondering if there was any talk of a commenting feature on papers. Specifically, the ability to highlight a passage and leave a comment, much like in Google Docs.

Thanks,
Bryan
