
allofplos's People

Contributors

dependabot[bot], egh, eseiver, evarghese, jgarst, maxdrohde, mpacer, napsternxg, sbassi


allofplos's Issues

create csv file of article metadata

Create a CSV file that contains several basic metadata fields for all PLOS articles. It can be queried directly for a number of functions instead of iterating over XML, and it requires less overhead than the full planned SQLite database. It will probably be too big to store directly on GitHub, but it could essentially be another "allofplos" object that is downloaded separately and also maintained by the internal PLOS server. I will work on this in a new branch.
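As a rough illustration of the idea, here is a minimal sketch that writes a metadata table and then answers a lookup from it without touching any XML. The column names (`doi`, `title`) and both helper functions are hypothetical; the issue does not say which fields the real file would carry.

```python
import csv
import io

# Hypothetical column set; the real file would likely hold more fields.
FIELDS = ["doi", "title"]


def write_metadata_csv(records, handle):
    """Write one CSV row per article record (a dict keyed by FIELDS)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells instead of raising.
        writer.writerow({field: record.get(field, "") for field in FIELDS})


def titles_by_doi(handle):
    """Build a DOI -> title lookup straight from the CSV, no XML parsing."""
    return {row["doi"]: row["title"] for row in csv.DictReader(handle)}


records = [
    {"doi": "10.1371/journal.pbio.0030408",
     "title": "Stimulating the Brain Makes the Fingers More Sensitive"},
    {"doi": "10.1371/journal.pmed.0030132",
     "title": "Bigger and Better: How Pfizer Redefined Erectile Dysfunction"},
]

buffer = io.StringIO(newline="")  # stands in for a file on disk
write_metadata_csv(records, buffer)
buffer.seek(0)
lookup = titles_by_doi(buffer)
```

In practice the buffer would be a real file opened with `newline=""`, and the records would come from iterating over the corpus once.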

Add DOI field check for article.doi

Right now, self.doi in the Article class (article_class.py) relies on the user entering the DOI correctly and on the accompanying local filename matching. It would be good to also check that it matches the DOI field in the local XML. That DOI is the element.text of the element at the XPath location /article/front/article-meta/article-id with element.attrib = {"pub-id-type": "doi"}.
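A minimal sketch of such a check, assuming the standard library's `xml.etree.ElementTree` as a stand-in for whatever parser the Article class uses. The function names and the tiny sample document are hypothetical and only illustrate the XPath lookup described above.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a PLOS article file; real files carry far more metadata.
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="publisher-id">placeholder-id</article-id>
      <article-id pub-id-type="doi">10.1371/journal.pbio.0030408</article-id>
    </article-meta>
  </front>
</article>"""


def doi_from_xml(xml_text):
    """Return the DOI stored in the article-id element, or None if absent."""
    root = ET.fromstring(xml_text)  # root is the <article> element
    element = root.find("./front/article-meta/article-id[@pub-id-type='doi']")
    if element is None or not element.text:
        return None
    return element.text.strip()


def doi_matches_xml(expected_doi, xml_text):
    """True only when the XML holds a DOI and it equals the expected one."""
    return doi_from_xml(xml_text) == expected_doi
```

Calling `doi_matches_xml(article_doi, xml_text)` when an Article is created would catch a mistyped DOI or a misnamed file early, rather than at the first property access.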

Article.filename is confusingly named

I would have thought Article.filename would have been the filename not the filepath, but right now it appears to be the filepath.

I think we should introduce a new property on Article to represent the filename as opposed to the file path.

We might then also want a way to set the value of an Article from a filename rather than a filepath, but that matters a lot less if we're discouraging people from creating Article objects directly.

To be consistent with the naming in Corpus, we could rename filename to filepath and call the new property file instead of filename.
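One possible shape for that split, sketched on a hypothetical stand-in class rather than the real Article. The names `ArticleSketch` and `from_file` are assumptions for illustration only.

```python
import os


class ArticleSketch:
    """Hypothetical stand-in for Article, showing only the naming split."""

    def __init__(self, filepath):
        self.filepath = filepath  # full path to the local XML file

    @property
    def directory(self):
        """Directory that contains the XML file."""
        return os.path.dirname(self.filepath)

    @property
    def file(self):
        """Bare file name, without any directory component."""
        return os.path.basename(self.filepath)

    @classmethod
    def from_file(cls, file, directory):
        """Build an instance from a bare file name plus its directory."""
        return cls(os.path.join(directory, file))


article = ArticleSketch.from_file("journal.pbio.2001413.xml", "corpus_dir")
```

Keeping `filepath` as the stored value and deriving `file` and `directory` from it avoids the two ever disagreeing.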

Use git tags to designate releases on GitHub as well

GitHub has a notion of releases that people (e.g., me) will sometimes go to when they want to look at the history of specific versions of a package.

You can access this feature (I believe) by giving a tag with a semver-style version number to the commit that corresponds to the release.

for example:

git checkout d732feb2
git tag 0.8.1
git push --tags

would create a tagged release attached to the code base at the commit with the d732feb2 hash (the merge commit for #21).

uncorrected proofs should be in subdirectory of corpus dir

If allofplos is pip installed, updating it will erase the uncorrected proofs list and potentially the corpus itself. One thing that would help is to make the uncorrected proofs list a hidden (dot-prefixed) system file in the corpus directory that is explicitly ignored otherwise.
Edit: this won't solve the problem, but a subdirectory instead of a list would eliminate the need for said list.

check requirements file

Hi @sbassi ,
I used pipreqs allofplos (source) and it gave me a very different list of requirements than our current list:

requests==2.18.4
nbformat==4.4.0
numpy==1.13.1
lxml==3.8.0
tqdm==4.15.0
download==0.3.1
Could you verify which third-party packages are actually needed?
Thanks!

Issues with running script as per instructions

This came up over an email with @sbassi and @eseiver

Once we use explicit relative imports or absolute imports, you can no longer run plos_corpus.py as a script. It instead needs to be run as a module.

Even then, it likely will not work if run from inside the package itself.

More ways of doing this work if we use explicit relative imports rather than absolute imports (and that from what I can tell is better practice anyway).

So we should change the README, but also use explicit relative imports.

Some resources leading to this way of thinking:

Guido van Rossum on scripts inside module directories:

The only use case [for this feature] seems to be running scripts that happen to be living inside a module's directory, which I've always seen as an antipattern.

Nick Coghlan's excellent post about issues with Python import statements:

Unfortunately, [importing a package twice] is still a really easy guideline to violate, as it happens automatically if you attempt to run a module inside a package from the command line by filename rather than using the -m switch.

python -m project.example.tests.test_foo
python -c "from project.example.tests.test_foo import main; main()"

Note that if the project exclusively uses explicit relative imports for intra-package references, the … two commands shown may actually work for Python 3.3 and later versions. Any absolute imports that expect “example” to be a top level package will still break though.

Since Python 2.6, however, the following also works properly:

# working directory: project
python -m example.tests.test_foo

This last approach is actually how I prefer to use my shell when programming in Python - leave my working directory set to the project directory, and then use the -m switch to execute relevant submodules like tests or command line tools. If I need to work in a different directory for some reason, well, that’s why I also like to have multiple shell sessions open.

progressbar custom error msg

In case someone doesn't have progressbar installed, we should wrap the import in a custom try/except for ModuleNotFoundError that tells people specifically to pip install progressbar2, since the name in the import statement doesn't match the package name.
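A minimal sketch of such a guard, written as a small reusable helper. The helper name `import_or_explain` and the message wording are assumptions; the same try/except could just as well sit inline around the import.

```python
import importlib


def import_or_explain(module_name, pip_name):
    """Import module_name, or fail with a message naming the pip package."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        # Re-raise with the install hint, keeping the original as the cause.
        raise ModuleNotFoundError(
            "Could not import '{}'. Install it with: pip install {}".format(
                module_name, pip_name)
        ) from err


# Intended use in the package (not executed here, since the module may be absent):
# progressbar = import_or_explain("progressbar", "progressbar2")
```

Because the helper re-raises the same exception type, callers that already catch ModuleNotFoundError or ImportError keep working.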

Bug in __str__ method for Article

I just ran into an issue while working on #89:

specifically when you try to print the article for '10.1371/journal.pbio.0030408'
you get

In [9]: print(corp[2])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-46ca5218206f> in <module>()
----> 1 print(corp[2])

~/jupyter/allofplos/allofplos/article_class.py in __str__(self, exclude_refs)
    122             root = tree.getroot()
    123             back = tree.xpath('./back')
--> 124             root.remove(back[0])
    125         local_xml = et.tostring(tree,
    126                                 method='xml',

IndexError: list index out of range

Its __repr__ is fine:

DOI: 10.1371/journal.pbio.0030408
Title: Stimulating the Brain Makes the Fingers More Sensitive

but it looks like the __str__ method assumes that tree.xpath('./back') will be nonempty, when sometimes it will be empty.

I don't know if this is an error in the content but it's worth noting.
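A minimal sketch of the guard, assuming the fix is simply to skip the removal when no back element exists. It uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs on its own, and the function name is hypothetical. With lxml, the equivalent check would be to test whether the list returned by tree.xpath('./back') is empty before indexing it.

```python
import xml.etree.ElementTree as ET


def xml_without_back(xml_text):
    """Serialize an article with its <back> section removed, if it has one."""
    root = ET.fromstring(xml_text)
    back = root.find("back")  # None when the article has no <back> element
    if back is not None:
        root.remove(back)
    return ET.tostring(root, encoding="unicode")


with_back = "<article><body>text</body><back><ref-list/></back></article>"
without_back = "<article><body>text</body></article>"
```

Both inputs now serialize cleanly; the second one previously corresponded to the IndexError path.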

test error in Debian fresh install

The package was installed with pip install allofplos
(not by cloning the repo)

Also installed the test part with:

pip install -U allofplos[test]

root@7881ab553829:~# python -c "from allofplos import get_corpus_dir; print(get_corpus_dir())"
/usr/local/lib/python3.6/site-packages/allofplos/allofplos_xml

When I try to run the tests I get:

root@7881ab553829:~# pytest --pyargs allofplos
========================================================= test session starts =========================================================
platform linux -- Python 3.6.5, pytest-3.5.0, py-1.5.3, pluggy-0.6.0
rootdir: /root, inifile:
collected 20 items                                                                                                                    

allofplos/tests/test_corpus.py .FFLocal article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pbio.2002354.xml
FLocal article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pbio.2002354.xml
FLocal article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pbio.2002354.xml
FLocal article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pbio.2002354.xml
FFFFFF                                                                                     [ 60%]
allofplos/tests/test_unittests.py ...Local article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pone.0185809.xml
FLocal article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pbio.2001413.xml
FLocal article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/plos.correction.3155a3e9-5fbe-435c-a07a-e9a4846ec0b6.xml
FLocal article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pbio.2002399.xml
F.                                                                                      [100%]

============================================================== FAILURES ===============================================================
___________________________________________________________ test_corpus_len ___________________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f60581bd9e8>

    def test_corpus_len(corpus):
>       assert len(corpus) == 5

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:26: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:33: in __len__
    return len(self.dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:115: in dois
    return list(self.iter_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:98: in iter_dois
    return (x[1] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f60581bd9e8>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
__________________________________________________________ test_corpus_iter_ __________________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efbe208>

    def test_corpus_iter_(corpus):
>       article_dois = {article.doi for article in corpus}

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:29: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:36: in __iter__
    return (article for article in self.random_article_generator)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:132: in random_article_generator
    for doi in self.iter_random_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:152: in iter_random_dois
    return (doi for doi in self.random.sample(self.dois, len(self)))
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:115: in dois
    return list(self.iter_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:98: in iter_dois
    return (x[1] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efbe208>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
____________________________________________________ test_corpus_contains_article _____________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efb2668>
no_article = DOI: 10.1371/journal.pmed.0030132
Title: Bigger and Better: How Pfizer Redefined Erectile Dysfunction
yes_article = <[AttributeError("'NoneType' object has no attribute 'getroot'") raised in repr()] Article object at 0x7f604efb2b38>

    def test_corpus_contains_article(corpus, no_article, yes_article):
>       assert yes_article in corpus

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:39: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:62: in __contains__
    is_in = value.doi in self.dois and value.directory == self.directory
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:115: in dois
    return list(self.iter_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:98: in iter_dois
    return (x[1] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efb2668>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
______________________________________________________ test_corpus_contains_doi _______________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efaf208>
no_article = DOI: 10.1371/journal.pmed.0030132
Title: Bigger and Better: How Pfizer Redefined Erectile Dysfunction
yes_article = <[AttributeError("'NoneType' object has no attribute 'getroot'") raised in repr()] Article object at 0x7f604efaf2e8>

    def test_corpus_contains_doi(corpus, no_article, yes_article):
>       assert yes_article.doi in corpus

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:43: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:64: in __contains__
    doi_in = value in self.dois
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:115: in dois
    return list(self.iter_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:98: in iter_dois
    return (x[1] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efaf208>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
____________________________________________________ test_corpus_contains_filepath ____________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efcc0f0>
no_article = DOI: 10.1371/journal.pmed.0030132
Title: Bigger and Better: How Pfizer Redefined Erectile Dysfunction
yes_article = <[AttributeError("'NoneType' object has no attribute 'getroot'") raised in repr()] Article object at 0x7f604efcc1d0>

    def test_corpus_contains_filepath(corpus, no_article, yes_article):
        ## check for filepath, which is currently called filename on Article
>       assert yes_article.filename in corpus

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:48: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:64: in __contains__
    doi_in = value in self.dois
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:115: in dois
    return list(self.iter_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:98: in iter_dois
    return (x[1] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efcc0f0>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
______________________________________________________ test_corpus_contains_file ______________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efe04a8>
no_article = DOI: 10.1371/journal.pmed.0030132
Title: Bigger and Better: How Pfizer Redefined Erectile Dysfunction
yes_article = <[AttributeError("'NoneType' object has no attribute 'getroot'") raised in repr()] Article object at 0x7f604efe0ba8>

    def test_corpus_contains_file(corpus, no_article, yes_article):
        ## check for filename, which is currently unavailable on Article
>       assert os.path.basename(yes_article.filename) in corpus

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:53: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:64: in __contains__
    doi_in = value in self.dois
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:115: in dois
    return list(self.iter_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:98: in iter_dois
    return (x[1] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efe04a8>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
_____________________________________________________ test_corpus_random_article ______________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efd8f28>

    def test_corpus_random_article(corpus):
>       article = corpus.random_article

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:57: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:136: in random_article
    return next(self.random_article_generator)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:132: in random_article_generator
    for doi in self.iter_random_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:152: in iter_random_dois
    return (doi for doi in self.random.sample(self.dois, len(self)))
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:115: in dois
    return list(self.iter_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:98: in iter_dois
    return (x[1] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efd8f28>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
________________________________________________________ test_corpus_indexing _________________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efe0400>

    def test_corpus_indexing(corpus):
>       assert corpus["10.1371/journal.pbio.2001413"] == corpus[0]

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:61: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:45: in __getitem__
    elif key not in self.dois:
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:115: in dois
    return list(self.iter_dois)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:98: in iter_dois
    return (x[1] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efe0400>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
_________________________________________________________ test_iter_file_doi __________________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efabf28>

    def test_iter_file_doi(corpus):
        expected = {
         'journal.pbio.2001413.xml': '10.1371/journal.pbio.2001413',
         'journal.pbio.2002354.xml': '10.1371/journal.pbio.2002354',
         'journal.pbio.2002399.xml': '10.1371/journal.pbio.2002399',
         'journal.pone.0185809.xml': '10.1371/journal.pone.0185809',
         'plos.correction.3155a3e9-5fbe-435c-a07a-e9a4846ec0b6.xml':
             '10.1371/annotation/3155a3e9-5fbe-435c-a07a-e9a4846ec0b6',
         }
>       assert expected == {f:doi for f, doi in corpus.iter_file_doi}

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:74: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efabf28>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
___________________________________________________________ test_filepaths ____________________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efd8240>

    def test_filepaths(corpus):
>       assert set(corpus.filepaths) == set(listdir_nohidden(TESTDATADIR))

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:78: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:120: in filepaths
    return list(self.iter_filepaths)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:103: in iter_filepaths
    return (os.path.join(self.directory, fname) for fname in self.iter_files)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:89: in iter_files
    return (x[0] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efd8240>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
_____________________________________________________________ test_files ______________________________________________________________

corpus = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efd0a20>

    def test_files(corpus):
        annote_file = 'plos.correction.3155a3e9-5fbe-435c-a07a-e9a4846ec0b6.xml'
>       assert annote_file in corpus.files

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_corpus.py:82: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:109: in files
    return list(self.iter_files)
/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:89: in iter_files
    return (x[0] for x in self.iter_file_doi)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[FileNotFoundError("[Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'") raised in repr()] Corpus object at 0x7f604efd0a20>

    @property
    def iter_file_doi(self):
        """Generator that returns filename, doi tuples for every file in the corpus.
    
            Used to generate both DOI and file generators for the corpus.
            """
        return ((file_, filename_to_doi(file_))
>               for file_ in sorted(os.listdir(self.directory))
                if file_.endswith(self.extension) and 'DS_Store' not in file_)
E       FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/site-packages/allofplos/tests/testdata'

/usr/local/lib/python3.6/site-packages/allofplos/corpus/corpus.py:77: FileNotFoundError
__________________________________________________ TestArticleClass.test_class_doi1 ___________________________________________________

self = <allofplos.tests.test_unittests.TestArticleClass testMethod=test_class_doi1>

    def test_class_doi1(self):
        """Tests the methods and properties of the Article class
            Test article DOI: 10.1371/journal.pone.0185809
            TODO: there is a socket warning from requests module. See https://github.com/requests/requests/issues/3912
            XML file is in test directory
            """
        article = Article(class_doi, directory=TESTDATADIR)
        # self.assertEqual(article.check_if_doi_resolves(), "works", 'check_if_doi_resolves does not transform correctly for {}'.format(article.doi))
        # self.assertEqual(article.check_if_link_works(), True, 'check_if_link_works does not transform correctly for {}'.format(article.doi))
>       self.assertEqual(article.amendment, False, 'amendment does not transform correctly for {}'.format(article.doi))

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_unittests.py:77: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:1180: in amendment
    if self.type_ in ['correction', 'retraction', 'expression-of-concern']:
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:1095: in type_
    "article"])
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:201: in get_element_xpath
    root = self.root
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'NoneType' object has no attribute 'getroot'") raised in repr()] Article object at 0x7f604efe8438>

    @property
    def root(self):
        """Get the root (base) element of an article.
            """
>       return self.tree.getroot()
E       AttributeError: 'NoneType' object has no attribute 'getroot'

/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:807: AttributeError
-------------------------------------------------------- Captured stdout call ---------------------------------------------------------
Local article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pone.0185809.xml
__________________________________________________ TestArticleClass.test_example_doi __________________________________________________

self = <allofplos.tests.test_unittests.TestArticleClass testMethod=test_example_doi>

    def test_example_doi(self):
        """Tests the methods and properties of the Article class
            Test article DOI: 10.1371/journal.pbio.2001413
            XML file is in test directory
            """
        article = Article(example_doi, directory=TESTDATADIR)
        # self.assertEqual(article.check_if_doi_resolves(), "works", 'check_if_doi_resolves does not transform correctly for {}'.format(article.doi))
        # self.assertEqual(article.check_if_link_works(), True, 'check_if_link_works does not transform correctly for {}'.format(article.doi))
>       self.assertEqual(article.amendment, False, 'amendment does not transform correctly for {}'.format(article.doi))

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_unittests.py:118: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:1180: in amendment
    if self.type_ in ['correction', 'retraction', 'expression-of-concern']:
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:1095: in type_
    "article"])
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:201: in get_element_xpath
    root = self.root
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'NoneType' object has no attribute 'getroot'") raised in repr()] Article object at 0x7f604eee9208>

    @property
    def root(self):
        """Get the root (base) element of an article.
            """
>       return self.tree.getroot()
E       AttributeError: 'NoneType' object has no attribute 'getroot'

/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:807: AttributeError
-------------------------------------------------------- Captured stdout call ---------------------------------------------------------
Local article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pbio.2001413.xml
_________________________________________________ TestArticleClass.test_example_doi2 __________________________________________________

self = <allofplos.tests.test_unittests.TestArticleClass testMethod=test_example_doi2>

    def test_example_doi2(self):
        """Tests the methods and properties of the Article class
            Test article DOI: 10.1371/annotation/3155a3e9-5fbe-435c-a07a-e9a4846ec0b6
            XML file is in test directory
            """
        article = Article(example_doi2, directory=TESTDATADIR)
        # self.assertEqual(article.check_if_doi_resolves(), "works", 'check_if_doi_resolves does not transform correctly for {}'.format(article.doi))
        # self.assertEqual(article.check_if_link_works(), True, 'check_if_link_works does not transform correctly for {}'.format(article.doi))
>       self.assertEqual(article.amendment, True, 'amendment does not transform correctly for {}'.format(article.doi))

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_unittests.py:157: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:1180: in amendment
    if self.type_ in ['correction', 'retraction', 'expression-of-concern']:
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:1095: in type_
    "article"])
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:201: in get_element_xpath
    root = self.root
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'NoneType' object has no attribute 'getroot'") raised in repr()] Article object at 0x7f604eeeb908>

    @property
    def root(self):
        """Get the root (base) element of an article.
            """
>       return self.tree.getroot()
E       AttributeError: 'NoneType' object has no attribute 'getroot'

/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:807: AttributeError
-------------------------------------------------------- Captured stdout call ---------------------------------------------------------
Local article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/plos.correction.3155a3e9-5fbe-435c-a07a-e9a4846ec0b6.xml
____________________________________________________ TestArticleClass.test_proofs _____________________________________________________

self = <allofplos.tests.test_unittests.TestArticleClass testMethod=test_proofs>

    def test_proofs(self):
        """Tests whether uncorrected proofs and VOR updates are being detected correctly."""
        os.environ['PLOS_CORPUS'] = TESTDATADIR
        article = Article(example_uncorrected_doi)
>       self.assertTrue(article.proof == 'uncorrected_proof')

/usr/local/lib/python3.6/site-packages/allofplos/tests/test_unittests.py:193: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:897: in proof
    xpath_results = self.get_element_xpath()
/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:201: in get_element_xpath
    root = self.root
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'NoneType' object has no attribute 'getroot'") raised in repr()] Article object at 0x7f604ebc84a8>

    @property
    def root(self):
        """Get the root (base) element of an article.
            """
>       return self.tree.getroot()
E       AttributeError: 'NoneType' object has no attribute 'getroot'

/usr/local/lib/python3.6/site-packages/allofplos/article_class.py:807: AttributeError
-------------------------------------------------------- Captured stdout call ---------------------------------------------------------
Local article file not found: /usr/local/lib/python3.6/site-packages/allofplos/tests/testdata/journal.pbio.2002399.xml
================================================ 15 failed, 5 passed in 26.18 seconds =================================================
root@7881ab553829:~# 

The problem is that the testdata directory is not installed. Maybe it is because there is no __init__.py file inside?

parameter for corpusdir location in plos_corpus.py

In preparation for pip install allofplos, there needs to be a way for the user to specify the path to corpusdir. Right now, the xml folder is created within the repository, and most people won't want to store datasets with executables. Some ideas:

  • default corpusdir location that would work across platforms, or change depending on OS (~/Documents/ ?)
  • request and store user input for the path to corpusdir, default to cwd/pwd
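
A minimal sketch of how these ideas could be combined, assuming an explicit path takes priority, then an environment variable, then a folder under the current working directory. The function name `get_corpus_dir` and the default folder name are hypothetical; `PLOS_CORPUS` is the variable name used elsewhere in the test suite.

```python
import os
from pathlib import Path

# Hypothetical default folder name; the issue leaves the real default open.
DEFAULT_CORPUS_DIRNAME = "allofplos_xml"


def get_corpus_dir(cli_path=None, environ=None):
    """Return the corpus directory, preferring explicit input over defaults."""
    environ = os.environ if environ is None else environ
    if cli_path:
        # 1. A path given directly by the user wins.
        return Path(cli_path).expanduser()
    env_path = environ.get("PLOS_CORPUS")
    if env_path:
        # 2. Otherwise fall back to the environment variable.
        return Path(env_path).expanduser()
    # 3. Finally, default to a folder under the current working directory.
    return Path.cwd() / DEFAULT_CORPUS_DIRNAME
```

Passing `environ` as a parameter is only there to make the lookup easy to test without touching the real environment.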

Extend unit tests

Right now, the unit tests do simple transformations from DOIs to URLs to file paths. What might be other unit tests to include? Ideas:

  • whether the example url links to XML are valid & the downloaded file is valid
  • whether an example solr query returns expected results
  • whether XML parsing functions work as expected

Need to also allow people to test the corpus even if they haven't downloaded the corpus and if they don't have internet access. This should involve a very small directory of testing files included in the repo

Add archiving parameter for old versions of articles

Expanding on @mpacer's thoughts about allowing multiple corpora, create a default setting where when a new version of an article is downloaded, it automatically archives the old version in a different folder. This would probably be best as a sub-directory of corpus.
If a user wanted to go back into the past and get old versions of all articles (e.g. all uncorrected proofs), there would need to be a way to do that as well.
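
One possible shape for the archiving step, as a rough sketch only: before a new version is written, the existing file is moved into a timestamped sub-directory. The function name, the `archive` folder name, and the timestamp layout are all assumptions, not existing allofplos behaviour.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_old_version(article_path, archive_dirname="archive"):
    """Move an existing article file into a timestamped archive sub-directory.

    Returns the new path, or None if there was nothing to archive.
    """
    article_path = Path(article_path)
    if not article_path.exists():
        return None
    # One folder per archiving run, so several old versions can coexist.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_dir = article_path.parent / archive_dirname / stamp
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / article_path.name
    shutil.move(str(article_path), str(target))
    return target
```

Because the archive lives inside the corpus directory, any code that lists the corpus would need to skip the `archive` folder.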

Use python_requires to force pip to not download if on python2

You need to use python_requires in your setup.py in order to restrict builds that occur via pip which issues a query to PyPI.

Below is a long explanation of what's happening and what you need to do, but the tldr is:

I installed allofplos on python2 with no difficulty, that's not supposed to happen.

What's wrong

I understand based on

if sys.version_info.major < 3:
    sys.exit('Sorry, Python < 3.4 is not supported')
elif sys.version_info.minor < 4:
    sys.exit('Sorry, Python < 3.4 is not supported')

That installing on python 2 (or even 3.3) is not supposed to be possible.

Unfortunately, it is. E.g., I installed allofplos on python 2:

(dev2) ~/jupyter/eg_notebooks $ python --version
Python 2.7.13 :: Continuum Analytics, Inc.
(dev2) ~/jupyter/eg_notebooks $ pip --version
pip 9.0.1 from ~/anaconda3/envs/dev2/lib/python2.7/site-packages (python 2.7)
(dev2) ~/jupyter/eg_notebooks $ pip install allofplos
Collecting allofplos
  Downloading allofplos-0.8.1-py2.py3-none-any.whl
⋮
Successfully installed allofplos-0.8.1 certifi-2017.7.27.1 chardet-3.0.4 idna-2.6 lxml-4.0.0 progressbar2-3.34.3 python-utils-2.2.0 requests-2.18.4 tqdm-4.17.1 urllib3-1.22

A PR is forthcoming, specifically to add python_requires >= 3.4 to your setup(…) args & fix a couple of other things.

Why python_requires works

When the package is pulled down from PyPI by someone with pip >= 9, pip checks their Python version before it downloads the file and skips any release whose python_requires their version does not match.

Why your current approach fails

Your current approach fails for a subtle reason: it only catches people who run python setup.py directly or who have a really old version of pip. As mentioned in the talk linked below, everyone should be discouraged from running python setup.py directly, but you've got the right idea in trying to catch those who do.

This also means you need another catch inside the setup.py in case someone tried to download it with pip but their pip version is way less than 9.0.0.
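
As a small sketch of such a guard (the helper name `python_is_supported` is made up for illustration), comparing the version as a tuple avoids a quirk of testing major and minor separately, which would also reject a hypothetical Python 4.0:

```python
import sys

MINIMUM_PYTHON = (3, 4)  # taken from the error message in the existing check


def python_is_supported(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if (major, minor) is at least the minimum version.

    Tuples compare element by element, so (2, 7) and (3, 3) fail
    while (3, 4) and (4, 0) pass.
    """
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    sys.exit("Sorry, Python < 3.4 is not supported")
```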

This is how we handle the pip version problem in IPython's setup.py:

if sys.version_info < (3, 3):
    pip_message = 'This may be due to an out of date pip. Make sure you have pip >= 9.0.1.'
    try:
        import pip
        pip_version = tuple([int(x) for x in pip.__version__.split('.')[:3]])
        if pip_version < (9, 0, 1) :
            pip_message = 'Your pip version is out of date, please install pip >= 9.0.1. '\
            'pip {} detected.'.format(pip.__version__)
        else:
            # pip is new enough - it must be something else
            pip_message = ''
    except Exception:
        pass

This also requires that anyone building the package from source has setuptools >= 24.2.0. If you don't want to make that a requirement on the actual library, then you'll need a different way to communicate that dependency (possibly with a dev_requirements.txt and a CONTRIBUTING.md?).

Resources for learning more

I mentioned this to @eseiver, but in case she didn't pass it along there's a great talk on this here: https://www.youtube.com/watch?v=2DkfPzWWC2Q. 😉

If you'd prefer to read the docs though:

https://packaging.python.org/tutorials/distributing-packages/#python-requires

Python 2 pip install should not work

allofplos only works with Python 3, however, a Python 2 environment is able to install it.

To Reproduce:

$ mkvirtualenv 2plos
(2plos) $ python --version
Python 2.7.15
(2plos) $ pip install allofplos
pip install allofplos
Collecting allofplos
  Downloading https://files.pythonhosted.org/packages/88/a9/1c0e4f8bffc137f4aad5097682d5fa23504f97d50b1333e3658800c1b0db/allofplos-0.8.1-py2.py3-none-any.whl
Collecting certifi==2017.7.27.1 (from allofplos)
  Using cached https://files.pythonhosted.org/packages/40/66/06130724e8205fc8c105db7edb92871c7fff7d31324d7f4405c762624a43/certifi-2017.7.27.1-py2.py3-none-any.whl
Collecting chardet>=3.0.4 (from allofplos)
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting python-utils>=2.2.0 (from allofplos)
  Using cached https://files.pythonhosted.org/packages/eb/a0/19119d8b7c05be49baf6c593f11c432d571b70d805f2fe94c0585e55e4c8/python_utils-2.3.0-py2.py3-none-any.whl
Collecting urllib3==1.22 (from allofplos)
  Using cached https://files.pythonhosted.org/packages/63/cb/6965947c13a94236f6d4b8223e21beb4d576dc72e8130bd7880f600839b8/urllib3-1.22-py2.py3-none-any.whl
Collecting progressbar2>=3.34.3 (from allofplos)
  Downloading https://files.pythonhosted.org/packages/4f/6f/acb2dd76f2c77527584bd3a4c2509782bb35c481c610521fc3656de5a9e0/progressbar2-3.38.0-py2.py3-none-any.whl
Collecting idna>=2.6 (from allofplos)
  Using cached https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl
Collecting tqdm==4.17.1 (from allofplos)
  Using cached https://files.pythonhosted.org/packages/eb/90/123ff39f7e454566bcd7482244ca9893df92c28364351a3308b7091effa9/tqdm-4.17.1-py2.py3-none-any.whl
Collecting lxml>=4.0.0 (from allofplos)
  Downloading https://files.pythonhosted.org/packages/af/09/cdb478d8b0392edd4047c5d1f7e6a1fb5e0e7a2f8f14fcf05c6e9ae9edff/lxml-4.2.3-cp27-cp27mu-manylinux1_x86_64.whl (5.8MB)
    100% |████████████████████████████████| 5.8MB 3.1MB/s 
Collecting six>=1.11.0 (from allofplos)
  Using cached https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Collecting requests>=2.18.4 (from allofplos)
  Using cached https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl
Installing collected packages: certifi, chardet, six, python-utils, urllib3, progressbar2, idna, tqdm, lxml, requests, allofplos
Successfully installed allofplos-0.8.1 certifi-2017.7.27.1 chardet-3.0.4 idna-2.7 lxml-4.2.3 progressbar2-3.38.0 python-utils-2.3.0 requests-2.19.1 six-1.11.0 tqdm-4.17.1 urllib3-1.22

remove dependence on internal URLs

all queries should go through EXT_URL, even for the internal allofplos server, per PLOS teams. also, change the download function from an etree-modulated unicode string to a requests string where the header request specifies allofplos.
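
A rough sketch of what an identifying header could look like. This uses the standard library's `urllib.request` rather than the requests package so that it stays self-contained, and the header value is an assumption; the issue only says the header should specify allofplos.

```python
from urllib.request import Request

# Hypothetical value; the issue does not say what the header should contain.
USER_AGENT = "allofplos"


def build_article_request(url, user_agent=USER_AGENT):
    """Build (but do not send) a GET request that identifies the client."""
    return Request(url, headers={"User-Agent": user_agent})
```

With the requests package, the equivalent would be passing the same dictionary as `headers=` to `requests.get`.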

Workflow needed for no-NOR repubs on JIRA

[From old repo, in progress]

Because we currently can't track republications with Solr, and some repubs aren't accompanied by a notice of correction, the Production team has created a label for JIRA tickets where there will be a repub with no notice (NOR). Using JIRA's API, it should be possible to track this label and check those for partial DOIs to then check crepo for new versions of these articles: https://jira.plos.org/jira/browse/PUBSERV-1031?filter=15858
This relies on access to internal PLOS servers and should not be a default option for non-PLOS users of this repo.

Running `python -mallofplos.plos_corpus` fails.

Hi,

When running python -mallofplos.plos_corpus this morning, I received the following error:

λ python -mallofplos.plos_corpus
Checking for new articles...
4405 new articles to download.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3637/3637 [08:40<00:00,  6.99it/s]

4405 new articles downloaded.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/egh/.virtualenvs/allofplos/lib/python3.5/site-packages/allofplos/plos_corpus.py", line 845, in <module>
    main()
  File "/home/egh/.virtualenvs/allofplos/lib/python3.5/site-packages/allofplos/plos_corpus.py", line 841, in main
    plos_network=plos_network)
  File "/home/egh/.virtualenvs/allofplos/lib/python3.5/site-packages/allofplos/plos_corpus.py", line 588, in download_check_and_move
    amended_articles = check_for_amended_articles(directory=tempdir)
  File "/home/egh/.virtualenvs/allofplos/lib/python3.5/site-packages/allofplos/plos_corpus.py", line 353, in check_for_amended_articles
    article = Article.from_filename(article_file)
  File "/home/egh/.virtualenvs/allofplos/lib/python3.5/site-packages/allofplos/article_class.py", line 1272, in from_filename
    return cls(filename_to_doi(filename))
  File "/home/egh/.virtualenvs/allofplos/lib/python3.5/site-packages/allofplos/article_class.py", line 42, in __init__
    self.doi = doi
  File "/home/egh/.virtualenvs/allofplos/lib/python3.5/site-packages/allofplos/article_class.py", line 112, in doi
    raise Exception("Invalid format for PLOS DOI")
Exception: Invalid format for PLOS DOI

Document different pathnames for articles string formats

Regarding the pathname formats described here:

https://github.com/mpacer/allofplos/blob/e8a9aee1c5eb587936cdf922f7ccb3fd01f10dc8/allofplos/plos_regex.py#L44-L53

We should be better about describing that, specifically:

We could also use a formal spec in the docs that says what the actual different formats are… I don't know if we should really have a single pair of examples here as opposed to a few examples in its own doc page about said spec.
~ @mpacer #70 (comment)

This should probably happen after the initial Docs PR (#66) is merged.

New workflow needed for incremental updates

[From old repo]

right now, the download_check_and_move function has three methods of looking for updated article XML after entirely new articles have been downloaded to the temporary download directory:

  1. check for new corrections articles in the temp directory & see if the accompanying corrected articles have updated XML, downloading any corrected articles with new versions
  2. check Solr for version-of-record (VOR) updates to uncorrected proofs (status: Not working)
  3. check all XML directly for updated uncorrected proofs in uncorrected_proofs_list.txt.

If an article's XML is updated for any reason other than corrections or VOR, it currently cannot be detected by searching Solr. The only way to be sure is to check every article's XML manually, as in the revisiondate_sanity_check function in corpus_analysis.py, which is time-consuming and hits journals.plos.org pretty inefficiently. JIRA no-NOR ticket labels can help with this to some degree (see #20), but that doesn't work outside of PLOS.

One solution would be using a hashtable, as in https://github.com/PLOS/allofplos_upload/issues/6. Is there any other way @sbassi?
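
To make the hashtable idea concrete, here is a minimal sketch that assumes a manifest mapping each filename to a SHA-256 digest. How such a manifest would be published by the server is not specified in this issue; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def build_manifest(directory, extension=".xml"):
    """Map each article filename in a directory to its digest."""
    return {
        path.name: file_digest(path)
        for path in sorted(Path(directory).iterdir())
        if path.name.endswith(extension)
    }


def changed_files(old_manifest, new_manifest):
    """List files that are new or whose digest differs from the old manifest."""
    return sorted(
        name for name, digest in new_manifest.items()
        if old_manifest.get(name) != digest
    )
```

Comparing a local manifest against a remote one would then identify updated articles in a single request, instead of checking every article's XML individually.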

Unable to run plos_corpus

Because of the import structure in the allofplos package I am unable to run the plos_corpus as a script file exposed using setup.py install.

New Article element types

I've been moving us toward having functionality for different types of elements inside Articles inside their own classes rather than having such a monumental Article class.

This will help in a lot of ways, but the biggest have to do with bringing new people into the codebase. Specifically, by localising the functionality, it'll be easier for people to trace down bugs, know where to introduce new functionality, test, write documentation, &c.

So I figured that we should start compiling a list of element classes that should be created as we realise that they exist in the article itself:

  • License (complete w/ #81)
  • Journal (complete w/#81)
  • Dates
  • Doi
  • Urls
    • Url
  • Filename
  • Contributors (collection)
    • Contributor(base_class, raise NotImplementedErrors in init.py)
      • Author(Contributor)
      • Editor(Contributor)
  • Counts (maybe… not sure)

In the case of doi, url and filename the methods currently used to transform them could be contained within the classes, eliminating the need for the transformations scripts and tying the validation directly to the particular kinds of elements.

And there may be others but those are the ones that jump out at me at the moment.
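
For the Contributor hierarchy, something along these lines could work. It is a sketch only: the attribute names (`given_names`, `surname`, `contrib_type`) are assumptions and do not reflect the current Article class.

```python
class Contributor:
    """Base class; subclasses declare which contributor type they represent."""

    contrib_type = None

    def __init__(self, given_names, surname):
        if self.contrib_type is None:
            # The base class is not meant to be instantiated directly.
            raise NotImplementedError(
                "Use a Contributor subclass such as Author or Editor")
        self.given_names = given_names
        self.surname = surname

    @property
    def full_name(self):
        return "{} {}".format(self.given_names, self.surname).strip()

    def __repr__(self):
        return "{}({!r})".format(type(self).__name__, self.full_name)


class Author(Contributor):
    contrib_type = "author"


class Editor(Contributor):
    contrib_type = "editor"
```

A Contributors collection could then simply hold a list of these objects and filter on `contrib_type`.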

@eseiver @sbassi

update py files to include Article class

  • remove duplicate functions covered by the Article class from samples.corpus_analysis and plos_corpus
  • refactor functions in those classes to include Article class, if any

add Article().body property

It would be great for people who want to see the full text of the article without all the metadata to be able to get a string of just the content in the body. It should probably exclude references, captions, etc and things that don't appear in the main body of the PDF, but keep section headers (Methods, Results, etc).
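
A rough sketch of the extraction, written against the standard library's `xml.etree.ElementTree` so it runs on its own (allofplos itself uses lxml). The set of skipped tags is an assumption about what counts as outside the main text; references are excluded automatically because they live under `back`, not `body`.

```python
import xml.etree.ElementTree as ET

# Assumed set of elements to leave out of the plain-text body.
SKIPPED_TAGS = {"fig", "table-wrap", "supplementary-material"}
# Elements after which a line break is inserted so words do not run together.
BLOCK_TAGS = {"sec", "title", "p"}


def _collect_text(element, parts):
    """Append the text of an element and its children, skipping some tags."""
    if element.tag in SKIPPED_TAGS:
        return  # drop the content; the caller still keeps the tail text
    if element.text:
        parts.append(element.text)
    for child in element:
        _collect_text(child, parts)
        if child.tail:
            parts.append(child.tail)
    if element.tag in BLOCK_TAGS:
        parts.append("\n")


def body_text(xml_string):
    """Return the body of an article as one whitespace-normalised string."""
    root = ET.fromstring(xml_string)  # root is the <article> element
    body = root.find("body")
    if body is None:
        return ""
    parts = []
    _collect_text(body, parts)
    return " ".join("".join(parts).split())
```

Section headers are kept because `title` elements inside `sec` are traversed like any other text.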

include 'starter pack' XML files in repository

without article files, allofplos can't do anything. we should ship with a small number of XML files directly in the repo so that people can use Article class etc right out of the box. they could be in a dedicated sample folder with a variable in plos_regex or transformations for the corpus location (sampledir, maybe?)
each article is about 100KB, so 25 articles would be 2.5 MB or 50 articles would be 5 MB.
this would work well with a new Corpus class.

Cannot get corpus dir

According to the README, I can get the corpus dir. But when I run the recommended command, I get an error

λ python -c "from allofplos.plos_regex import corpusdir; print(corpusdir)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'corpusdir'

Environment:

λ python --version
Python 3.5.2
λ pip freeze
allofplos==0.10.2
certifi==2018.1.18
chardet==3.0.4
idna==2.6
lxml==4.1.1
peewee==3.1.0
pkg-resources==0.0.0
python-utils==2.3.0
requests==2.18.4
six==1.11.0
tqdm==4.19.5
Unidecode==1.0.22
urllib3==1.22

Add self.license property to Article class

in article_class.py, add a property to display the Creative Commons license on an article.
The xpath location is '/article/front/article-meta/permissions/license'
note that for NLM 3.0 and JATS 1.1d3 DTDs, these fields look different.
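
A minimal sketch of reading that element, again using the standard library parser for self-containment. It assumes the license URL, when present, is stored in an `xlink:href` attribute; since the fields differ between DTD versions, the link may be missing and only the text available.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def license_info(xml_string):
    """Return the license link and text of an article, or None if absent."""
    root = ET.fromstring(xml_string)  # root is the <article> element
    element = root.find("front/article-meta/permissions/license")
    if element is None:
        return None
    # Collapse all nested text into a single normalised string.
    text = " ".join("".join(element.itertext()).split())
    return {"href": element.get(XLINK_HREF), "text": text}
```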

Update broken b/c related-doi

The latest PLOS corpus update didn't work because it couldn't parse the related DOI from the recent retraction pone.0194455. Will probably need to issue a patch.

set Article().text_viewer at corpus level

Now that the easiest way to cycle through article objects is to create a fully new one each time (instead of resetting the doi via article.doi = doi), Article().text_viewer is harder to set. If this value could be set at the Corpus level and passed to each Article object, that would be great!
cc @mpacer

Write out release instructions as a bullet pointed list

A public release how-to (usually consisting of a bullet-pointed list) allows all maintainers to have the ability to publish a release of the new version of the package by following straightforward steps.

This allows the release procedure to be evaluated & commented on by contributors.

For example:

It would be easier for me to comment on how to avoid some of the issues I've most recently commented on in #20 if these instructions were made explicit. It's hard for me to do more than guess about what went wrong because I can't work off of that document.

Extend Solr search as class from existing python file

[From old repository]

Right now get_solr_records (to be renamed search_solr_records) is a single function with several parameters. It would be better to flesh out this functionality as a Request class in the vein of the Python 2.7 Solr wrapper from the PLOS ALMs package (Cameron Neylon's version, which is the best and most up-to-date).
It would need to be made Python 3 compatible, but I think would provide a better method of querying Solr than the current version.
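
As a starting point for discussion, a request class might look like the sketch below. It only builds the query URL and does not send it. The class name, its defaults, and the endpoint are assumptions and should be checked against the PLOS Search API documentation; `q`, `fl`, `rows`, `start` and `wt` are standard Solr parameters.

```python
from urllib.parse import urlencode

# Assumed endpoint for the public PLOS Search API; verify before use.
SEARCH_URL = "http://api.plos.org/search"


class SolrRequest:
    """Hold the parameters of one Solr query and build its URL."""

    def __init__(self, query, fields=("id",), rows=10, start=0,
                 base_url=SEARCH_URL):
        self.query = query
        self.fields = tuple(fields)
        self.rows = rows
        self.start = start
        self.base_url = base_url

    @property
    def params(self):
        return {
            "q": self.query,
            "fl": ",".join(self.fields),
            "rows": self.rows,
            "start": self.start,
            "wt": "json",
        }

    @property
    def url(self):
        return "{}?{}".format(self.base_url, urlencode(self.params))

    def next_page(self):
        """Return a new request for the following page of results."""
        return SolrRequest(self.query, self.fields, self.rows,
                           self.start + self.rows, self.base_url)
```

Keeping the parameters on an object makes paging and reuse easier than threading several arguments through a single function.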
