rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library

Home Page: http://biostor.org

PHP 81.66% XSLT 1.09% JavaScript 3.50% HTML 13.47% Shell 0.28%

bhl biodiversity-heritage-library biodiversity-informatics articles

biostor's Introduction

biostor

SSH keys

Use github SSH keys (see https://pagodabox.io/docs/setting_up_ssh-osx-linux). The following command puts the public key into the clipboard:

pbcopy < ~/.ssh/github_rsa.pub

You can then paste this key into the Pagodabox site.

Multiple keys

I found that sometimes Pagodabox would expect the GitHub SSH key, and at other times the one I’d generated for Pagodabox, so I pasted both keys into the Pagodabox admin panel.

Pushing to Pagodabox

You need to add Pagoda as a remote repository:

git remote add pagoda [email protected]:apps/biostor.git

Then push the changes:

git push pagoda --all

Nanobox

Nanobox is the replacement for Pagodabox. It requires a different boxfile. Note that this file needs to specify some additional PHP extensions that others depend upon. For example, adding xsl by itself generated linking errors until dom was also added.

run.config:
  # install php and associated runtimes
  engine: php
  # php engine configuration (php version, extensions, etc)
  engine.config:
    # sets the php version to 5.6
    runtime: php-5.6
    extensions:
      - curl
      - exif
      - gd
      - mbstring
      - dom # need this for xsl to work
      - xml # need this for utf8_decode
      - xsl
      
web.main:

  start:
    php: start-php
    apache: start-apache
    
  # the path to a logfile you want streamed to the nanobox dashboard
  log_watch:
    apache[access]: /data/var/log/apache/access.log
    apache[error]: /data/var/log/apache/error.log
    php[error]: /data/var/log/php/php_error.log
    php[fpm]: /data/var/log/php/php_fpm.log

To make a local version of BioStor, type:

nanobox deploy dry-run

To deploy to nanobox add a remote, e.g.:

nanobox remote add happy-hog

Then deploy:

nanobox deploy

Things to remember

Make sure that your DNS for your website points to the IP address (A-Record) of the nanobox app (find the A-Record in the “Network” tab).
Add New Relic key to the “CONFIG” tab.

Nanobox resources required

Initially ran on Google Cloud using f1-micro (1 vCPU, 0.6 GB memory), which Google Cloud reported was overused. Can add more resources via nanobox “scale” which sets up a new server. Need to explore why we need more resources.

Cloudflare

nanobox

Type	Name	Value
A	biostor.org	IP address from nanobox

heroku

Type	Name	Value
CNAME	biostor.org	biostor.org.herokudns.com
CNAME	www	biostor.org.herokudns.com

Cloudflare applies CNAME flattening to

Heroku

Deploy to Heroku.

New Relic on Heroku

Check it is installed:

heroku run env --app biostor | grep NEW_RELIC

CouchDB on Bitnami

Create an instance at https://google.bitnami.com/vms

Note that you need to follow the steps here https://docs.bitnami.com/google/infrastructure/couchdb/#how-to-connect-to-couchdb-from-a-different-machine in order to be able to connect. Click on “Launch ssh console” and edit the local.ini file:

sudo nano /opt/bitnami/couchdb/etc/local.ini

Change the bind_address from 127.0.0.1 to 0.0.0.0:

[chttpd]
port = 5984
bind_address = 0.0.0.0
...

[httpd]
bind_address = 0.0.0.0
...

Reboot the VM.

Firewall

Note that now we also need to add firewall access, see https://docs.bitnami.com/google/faq/administration/use-firewall/

Replicate

curl http://localhost:5984/_replicate -H 'Content-Type: application/json' -d '{ "source": "biostor", "target": "http://admin:<password>@IP-SERVER:5984/biostor"}'

Monitoring

Added the New Relic key. After a while New Relic shows data for the app: https://rpm.newrelic.com/accounts/691868/applications/8332767

Replication

Launch this from local machine to replicate CouchDB with Cloudant.

curl http://localhost:5984/_replicate -H 'Content-Type: application/json' -d '{ "source": "biostor", "target": "https://<username>:<password>@rdmpage.cloudant.com/biostor"}'

IBM hosted Cloudant

curl http://localhost:5984/_replicate -H 'Content-Type: application/json' -d '{ "source": "biostor", "target": "https://<username>:<password>@4c577ff8-0f3d-4292-9624-41c1693c433b-bluemix.cloudant.com/biostor" }'
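
The three curl commands above differ only in the target URL. As a minimal sketch, the same request can be built in Python; the helper name `build_replication_request`, the host `example.cloudant.com` and the credentials are placeholders invented for illustration, not values from this project. The one thing it adds over the curl form is percent-encoding of the credentials, so a password containing `@` or `:` does not corrupt the target URL.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_replication_request(source, target_host, database, username, password,
                              couch_url="http://localhost:5984", scheme="https"):
    """Build (but do not send) a POST to CouchDB's /_replicate endpoint."""
    # Percent-encode the credentials so reserved characters survive in the URL.
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    target = f"{scheme}://{auth}@{target_host}/{database}"
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return Request(
        f"{couch_url}/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_replication_request(
    source="biostor",
    target_host="example.cloudant.com",  # hypothetical host
    database="biostor",
    username="user",                     # placeholder credentials
    password="p@ss:word",
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```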

Image proxy

BioStor uses CloudFlare http://cloudflare.com to provide caching, and by default CloudFlare doesn’t cache images with dynamic URLs (i.e., it expects a URL to have a file extension). I’ve borrowed heavily from https://github.com/andrieslouw/imagesweserv to create an image proxy that fetches images from BHL, then outputs them such that CloudFlare will treat them as static images and cache them.

Future ideas

Interfaces

For a very different interface to historical texts see the UK Medical Heritage Library.

Backup

See details in “backup” folder.

biostor's Issues

Add articles for Contributions in Science ISSN 0459-8113

Lots of articles already in BioStor, but there doesn't seem to be an easily accessible list of articles from this journal.

Annals of the Cape provincial Museums

Add Articles for Volume 30 of Zoologica, ISSN 0044-507X

The final volume of this publication made it into BHL today! I'm sending what I hope is the final set of article definitions for this title. I added BHL start page numbers to the Notes field in the RIS file. There is no 'Database Provider' for this publication.
zoologica_vol_30.txt.zip

Book chapters not displayed properly

Metadata summary for individual book chapters lacks most of the details, e.g. http://biostor.org/reference/160666. Need to add code to display book, editor(s) and publisher info.

Error message undefined pages

See reference at bottom of search http://biostor.org/search/author:%22Charles%20O%20Handley%22 http://biostor.org/reference/126620

Displays

Published in 1966 Notice: Undefined property: stdClass::$pages in /data/index.php on line 220 , on pages

More BHL items replaced by others

A few more instances of articles being attached to BHL items that have been replaced occurred during this weekend's data harvest.

Here is the list of no-longer-active item IDs and their replacements:

ID Replaced By
17635 84886
17695 84741
18270 84781
18295 84760
45725 87759

small typo in http://biostor.org/reference/105196

Hi Rod, a user let us know that the article name should be "Cephalopoden der Aru-und Kei-Inseln. Anhang: Revision der Gattung Sepioteuthis" not sepioLeuthis

2 broken links in BHL fixed. Broken links in corresponding BioStor articles remain.

Biostor part:69541 BHL part:41905 BHL page id: 32543277
Biostor part: 6002 BHL part:24303 BHL page id:753928

Replaced BHL items

Via email from @mlichtenberg:

Over time, re-scans of books have resulted in BHL items being replaced with other items. This has resulted in about 2500 articles in BHL and BioStor that are associated with about 100 inactive items. (For example, http://biostor.org/reference/134917.)

At the end of this message is a list of no-longer-active BHL item IDs along with their replacement IDs. Given this list, would it be difficult for you to reprocess the articles that are attached to those items so that they are reassigned to the correct BHL item and page IDs?

If you could do that, it would fix both BioStor and BHL. BHL should pick up the updates the next time BioStor data is harvested.

ID            Replaced By
13942    54265
13961    54271
13976    86336
13978    54725
43946    130127
13936    54259
13950    54269
13953    54267
13967    89035
14000    87380
14017    89038
14018    89037
119060  211380
13946    54274
13957    54437
13971    65863
13996    89024
14014    89030
14021    87374
14400    86341
46567    13718
13941    54264
13945    86340
13958    54743
13975    86335
13992    65873
14009    89021
13948    54272
13955    54435
13962    54732
13991    54728
13998    54735
14005    87377
14012    89029
36407    133894
13947    54273
13956    54436
13981    86337
13990    54727
14015    89031
53247    55101
13949    54270
13954    54266
13979    54721
13988    65870
14013    89405
14022    87373
81625    34159
105675  101305
119064  212127
13935    54258
13968    89036
13985    65867
13986    65868
14003    54734
13939    54262
13964    54730
13989    65872
14007    87375
13938    54261
13965    54741
13974    86342
126229  127883
13943    54717
13960    54733
13977    54724
13994    89027
14011    54744
50997    183290
149492  182036
12974    54720
13959    54742
13993    89026
13995    89028
14008    89020
14010    89022
81620    105193
105799  101303
13951    54719
13952    54268
13966    54745
14001    87379
14002    54746
46849    183041
13937    54260
13973    54723
13980    54726
13987    65869
14016    87378
25429    202453
13940    54263
13963    54731
13997    89023
13999    89032
14004    89033
14006    87376
105316  100979
13933    54249
13934    54257
13969    65861
13970    65862
13983    65865
13984    65866
14019    86338
14020    86343
20991    100809
105673  101293
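
One way to apply a list like this is to parse it into a lookup table and rewrite the item ID on each affected article. The sketch below assumes articles are available as simple records with an `item_id` field; that record shape and the `biostor_id` values are hypothetical. It only remaps item IDs. Page IDs would still need to be matched against the replacement item separately.

```python
def parse_replacements(text):
    """Parse 'old_id new_id' lines into a dict, skipping headers and blanks."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            mapping[int(parts[0])] = int(parts[1])
    return mapping


def resolve(item_id, mapping):
    """Follow replacements until an active item is reached (cycle-safe)."""
    seen = set()
    while item_id in mapping and item_id not in seen:
        seen.add(item_id)
        item_id = mapping[item_id]
    return item_id


sample = """ID            Replaced By
13942    54265
13961    54271
119060  211380
"""
replacements = parse_replacements(sample)

# Hypothetical article records; only item_id matters here.
articles = [
    {"biostor_id": 1, "item_id": 13942},
    {"biostor_id": 2, "item_id": 99999},  # not in the list, left unchanged
]
updated = [dict(a, item_id=resolve(a["item_id"], replacements)) for a in articles]
```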

Define lacking articles in Zoologica, New York Zoological Society

@rdmpage My enthusiasm about the definition of articles in Kirtlandia inspired me to finish the work that I was doing on Zoologica. I got complete article metadata from the librarians at the Wildlife Conservation Society and cleaned it up. I also removed all citations that are already defined in BHL in order to prevent duplication. I did this by pulling all articles defined in Zoologica in BHL in EndNote format using the BHL APIs and then importing those citations into EndNote. I could spot the missing articles fairly easily once everything was in EndNote. Again, there's an RIS file inside the zip.
zoologica_nodups.txt.zip

Index chapters so we can display all chapters in a book

Chapters in a book are not grouped together in the way that journal articles are. For example, http://biostor.org/oclc/3483827 doesn't display anything, but there are several chapters in this book in BioStor, e.g. http://biostor.org/reference/160666. Need to model chapters from a single book properly.

Duplicate reference, both with blank start page

"Rhinobryon negrensis, a new genus and species of Characid fishes from the Rio Negro, Brazil" occurs twice, http://biostor.org/reference/78122 and http://biostor.org/reference/80006 both with blank start page suggesting pagination has changed. Also blank in Internet Archive PDF uploaded from BioStor.

BHLFEED-58218 - article in BioStor is pointing to the wrong thing in BHL

FYI Looks like this article in BioStor is pointing to the wrong thing in BHL, see http://biostor.org/reference/626. We've removed the reference from BHL as the correct one already exists at http://biodiversitylibrary.org/part/19424.

More informative <title> tag for browser window

   ===============================================
                          Bruxelles
                           2015_08_03


   Dear Roderic Page

   | The new version is still a work in progress,
   | and I'd welcome any feedback and/or comments.

   Looks good - but would it be possible to have a more
   informative uri title than <title>Titles</title> ?
   This would help e.g. bookmarking

    Yours sincerely / Vriendelijke groeten / Bien à vous

    Richard Hardwick
    ===============================================
    email  [email protected]
    ===============================================

Incorrect pages

http://biostor.gopagoda.io/reference/biostor/4298

BioStor articles not harvested by BHL

Limit on number of articles returned by my API that is used by BHL has probably resulted in some articles being missed.

Rod,

BHL did not pick up the new articles Susan references in her email below because the BHL item (48981) does not appear in this list: http://direct.biostor.org/itemsince.php?since=2017-01-01.

It seems that the reason it does not appear is that the list is limited to 1000 items. I never noticed that before. (Although looking now at the first emails we exchanged about the BioStor API back in February of 2012, I see that you wrote “I've limited this to 1000 items so the database doesn't keel over”… uh oh!)

I am now wondering how many other articles have been missed over time. Two questions: First, is there a way to get more than 1000 items in the response? If not, what is an expected upper limit on the number of items to be processed in a single day… maybe BHL needs to harvest more frequently than once a week. Second, can you generate a list of all BHL item IDs for which there are articles in BioStor? Such a list would help me get an idea how many items were missed by the harvesting process.

Thanks,

MIKE
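
A possible way to harvest around a fixed cap is to query by date window and split any window whose result hits the cap. This is a sketch under a significant assumption: the API in the email only takes a `since` date, so the `fetch(start, end)` callable below is hypothetical and would need an end date (or equivalent) added on the server side. A result of exactly `limit` items is treated as possibly truncated.

```python
from datetime import date, timedelta

LIMIT = 1000  # the cap mentioned in the email


def harvest(fetch, start, end, limit=LIMIT):
    """Collect item IDs changed in [start, end], bisecting capped windows.

    fetch(start, end) is a hypothetical callable returning at most
    `limit` item IDs for that inclusive date range.
    """
    items = fetch(start, end)
    if len(items) < limit or start == end:
        # Either complete, or a single day that cannot be split further.
        return set(items)
    mid = start + timedelta(days=(end - start).days // 2)
    left = harvest(fetch, start, mid, limit)
    right = harvest(fetch, mid + timedelta(days=1), end, limit)
    return left | right


# Tiny stand-in for the real API, capped at 3 items per response.
CHANGES = [
    (101, date(2017, 1, 1)),
    (102, date(2017, 1, 2)),
    (103, date(2017, 1, 2)),
    (104, date(2017, 1, 5)),
    (105, date(2017, 1, 9)),
]


def fake_fetch(start, end):
    hits = [item for item, changed in CHANGES if start <= changed <= end]
    return hits[:3]


found = harvest(fake_fetch, date(2017, 1, 1), date(2017, 1, 10), limit=3)
```

If a single day ever exceeds the cap, splitting by date cannot help, and the cap itself (or the harvest frequency) would have to change.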

Missing pages for http://biostor.org/reference/50092

http://biostor.org/reference/50092 is missing page images. Has the item for Tijdschrift Voor Entomologie 1999 gone?

Index Raptor Research

BHL received permission to digitize and upload to BHL the journals from Raptor Research Foundation.

Permission given for the following titles:

Raptor Research News (v.1-5. 1967-1971)
Raptor Research (v.6-20, 1972-1986)
Journal of Raptor Research (v.21-39, 1987-2005)
Wingspan (v.1 (1992) to present)

These are now in BHL and I will begin downloading citations from Web of Science. I'll let you know when these files are ready for Biostor.

Add articles for Amphipacifica (ISSN 1189-9905)

Search returns no hits in browser and PHP error but API OK

The query http://biostor.org/?q=Cabamofa in the browser returns no results and a PHP error Warning: Invalid argument supplied for foreach() in /data/index.php on line 1249 ac527ae but the API call returns the expected hit http://biostor.org/api.php?q=Cabamofa

Remove BioStor ID 12538, Biotropica journal not in BHL

Hi Rod, this citation http://biostor.org/reference/12538 is incorrect. I have removed it from BHL. Thought you might want to remove from BioStor. Unfortunately we do not have Biotropica digitized in BHL. Thanks!

Move BHL and JournalMap APIs from biostor-classic

Currently BHL and JournalMap get their BioStor updates from the old BioStor.

Page images vanished

http://biostor.gopagoda.io/reference/13047

Add articles for Arnoldia, 0004-2633

@rdmpage This publication will be scanned and added to BHL shortly. I received a full set of article metadata for the publication from the Arnold Arboretum and will provide 'scrubbed' article metadata after the content is uploaded. Is this content within scope for BioStor? Can we handle these articles the same way that we handled the articles for other publications, e.g. Phasmid Studies?

Support abstract and subject(s) when importing and exporting article metadata.

No thumbnail and no pages

http://biostor.gopagoda.io/reference/144556

Missing page images

Images are missing for http://biostor.org/reference/12882 (and other articles in Annals of the Missouri Botanical Garden)

The Wilson Bulletin

Use metadata from JSTOR

BHL item removed leading to missing pages in Transactions of the Entomological Society of London

http://biostor.org/issn/0035-8894/year/1901

Phasmid Studies

@rdmpage I want to let you know that I got a bunch of citations from Phasmida Species File. I'm cleaning them up and expect to be done soon. When I finish, I'll upload an RIS file. BHL title http://www.biodiversitylibrary.org/bibliography/119312.

Phytologia 0031-9430 and a question

@rdmpage I want to be sure that there's no duplication of effort between you and the team that I work with when gathering citations/references. (We did some work to create article level metadata for Quaestiones Entomologicae and I just noticed that you pulled most if not all of the articles for that publication in from elsewhere. Did you pull the citations/references in from Wikidata? If so would you be willing to share the technique that you used with me?) I got what I think is a complete list of citations for Phytologia from the Index of American Botanical Literature. We're happy to fill the article definition gaps for this publication but don't want to invest the time if you're on the verge of filling the gaps in a different way. Are you working on Phytologia or should we go ahead and identify the gaps? Thanks for letting me know.

Add articles for Annals of the Carnegie Museum (ISSN 0097-4463)

Insects of Samoa and other Samoan terrestrial arthropods missing pages

Lots of articles have missing pages as BHL has deleted some items (sigh)

Add view counter to article pages

Add a view counter to each article page to track and display usage. See Project Counter https://www.projectcounter.org/code-of-practice/appendices/850-2/ for list of bots, crawlers, etc. to exclude.

Could simply add $_SERVER to CouchDB as JSON record and create views that count visits by IP address.
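
A rough sketch of the counting step, written in plain Python rather than as a CouchDB view: each hit is a dict using the `$_SERVER` key names `REQUEST_URI`, `REMOTE_ADDR` and `HTTP_USER_AGENT`. The `BOT_MARKERS` tuple is a stand-in for illustration only and is not the Project COUNTER exclusion list.

```python
from collections import Counter

# Illustrative substrings only; the real exclusion list is much longer.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(user_agent):
    """Crude check: does the user agent contain a known bot marker?"""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)


def count_views(hits):
    """Count distinct visitor IP addresses per URI, skipping likely bots."""
    seen = set()
    counts = Counter()
    for hit in hits:
        if is_bot(hit.get("HTTP_USER_AGENT")):
            continue
        key = (hit["REQUEST_URI"], hit["REMOTE_ADDR"])
        if key not in seen:
            seen.add(key)
            counts[hit["REQUEST_URI"]] += 1
    return counts


hits = [
    {"REQUEST_URI": "/reference/1", "REMOTE_ADDR": "192.0.2.1", "HTTP_USER_AGENT": "Mozilla/5.0"},
    {"REQUEST_URI": "/reference/1", "REMOTE_ADDR": "192.0.2.1", "HTTP_USER_AGENT": "Mozilla/5.0"},
    {"REQUEST_URI": "/reference/1", "REMOTE_ADDR": "192.0.2.2", "HTTP_USER_AGENT": "Mozilla/5.0"},
    {"REQUEST_URI": "/reference/1", "REMOTE_ADDR": "192.0.2.3", "HTTP_USER_AGENT": "ExampleBot/1.0"},
    {"REQUEST_URI": "/reference/2", "REMOTE_ADDR": "192.0.2.1"},
]
views = count_views(hits)
```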

Link together individual articles that belong to same series

Articles such as http://biostor.org/?q=Heterocera+from+Costa+Rica are in effect one article over several journal issues. It would be useful to link these together. One possibility is to use http://schema.org/ListItem terms previousItem and nextItem.
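
As an illustration of the ListItem idea, the sketch below generates JSON-LD for an ordered series, linking each part to its neighbours with `previousItem` and `nextItem`. The example URLs and the `#series-item` identifier suffix are invented for the example; how BioStor would actually identify the parts of a series is left open.

```python
import json


def link_series(urls):
    """Return a schema.org ItemList whose ListItems point at their neighbours."""

    def node_id(url):
        return url + "#series-item"  # hypothetical identifier scheme

    items = []
    for index, url in enumerate(urls):
        entry = {
            "@type": "ListItem",
            "@id": node_id(url),
            "position": index + 1,
            "item": {"@id": url},
        }
        if index > 0:
            entry["previousItem"] = {"@id": node_id(urls[index - 1])}
        if index < len(urls) - 1:
            entry["nextItem"] = {"@id": node_id(urls[index + 1])}
        items.append(entry)
    return {
        "@context": "https://schema.org",
        "@type": "ItemList",
        "itemListElement": items,
    }


series = link_series([
    "https://example.org/part/1",  # placeholder URLs, not real references
    "https://example.org/part/2",
    "https://example.org/part/3",
])
series_json = json.dumps(series, indent=2)
```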

Add articles for Journal of the Entomological Society of British Columbia

http://journal.entsocbc.ca/index.php/journal/issue/archive

Index The Journal of the East Africa and Uganda Natural History Society

Rod,

BHL has recently digitized some items for the Journal of the East Africa and Uganda Natural History Society; see http://www.biodiversitylibrary.org/bibliography/14163#/summary. We'd like to article-ize this content. The publisher, East Africa Natural History Society, gave us article citations for this content back in the Citebank days. I'm hoping we can pass the citations on to you to index via BioStor. One of the challenges with the journal is the inconsistent numbering of volumes and issues. Also, many of the volumes were bound together, which means there will be multiple page 1s in a single item. I recall you saying this makes it more challenging for BioStor. I have attached some sample data for your review just to see if it contains enough data to do the matching. Let me know your thoughts.
JEANH_1910_1918.xlsx

Trish

Example of article that is in BioStor and Pensoft and has XML

http://biostor.gopagoda.io/reference/biostor/145125, see also http://dx.doi.org/10.3897/nl.37.7954

BioStor ID 90301 had wrong BHL item attached

Hi Rod, Biostor ID 90301 was originally attached to item ID 19556 = v.18 (1846) but it needed to be attached instead to the second series v.18 under item ID 19544. I have corrected this in BHL, see http://www.biodiversitylibrary.org/page/2313485. Wanted to let you know to update your system just in case.

PHP errors if search fails to return results

php[error] [25-Nov-2016 12:39:35 UTC] PHP Notice:  Undefined property: stdClass::$total_rows in /data/index.php on line 1110
php[error] [25-Nov-2016 12:39:35 UTC] PHP Notice:  Undefined property: stdClass::$total_rows in /data/index.php on line 1113
php[error] [25-Nov-2016 12:39:35 UTC] PHP Notice:  Undefined property: stdClass::$total_rows in /data/index.php on line 1135
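
These notices suggest the code reads `total_rows` from a response that does not always include it, for example when the search backend returns an error object. The project code is PHP, where the usual guard is `isset()`. The sketch below shows the same defensive pattern in Python; the function name and the response shapes are assumptions, not the actual BioStor code.

```python
import json


def summarise_search(response_text):
    """Return (total, rows) from a CouchDB-style response, tolerating gaps."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    # Fall back to an empty list so callers can always iterate over rows.
    rows = data.get("rows")
    if not isinstance(rows, list):
        rows = []

    # Fall back to the number of rows when total_rows is absent.
    total = data.get("total_rows")
    if not isinstance(total, int):
        total = len(rows)
    return total, rows


ok = summarise_search('{"total_rows": 2, "rows": [{"id": "a"}, {"id": "b"}]}')
failed = summarise_search('{"error": "not_found"}')
garbage = summarise_search("not json")
```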

Fully index Amphibian & Reptile Conservation

As discussed in a recent email, please fully index Amphibian & Reptile Conservation in BioStor and BHL.

Use Wikipedia/Dbpedia as source of citation data

Dbpedia has extracted citations from Wikipedia, see http://wiki.dbpedia.org/ideas/idea/261/dbpedia-citations-reference-challenge/

Could use this to search for articles in BHL

Incorrect pages

http://biostor.gopagoda.io/reference/4307 starts with blank pages

Incomplete (duplicate?) articles

BHL picked up a few new articles (BioStor IDs 50280, 50281, 50282, and 50284) missing a lot of metadata (including titles) this weekend. Each new article includes only a single page, and in each case that page duplicates the start page of another article that was defined long ago.

You can see these articles with these API calls:

http://direct.biostor.org/itemarticles.php?item=25762
http://direct.biostor.org/itemarticles.php?item=25760

Based on the dates attached to them, it looks like these articles were actually defined in BioStor a long time ago. Also, I have made a few changes to BHL's data harvesting process recently. Let me know if the appearance of these articles in the API is not new; maybe the BHL changes have resulted in the harvesting process no longer ignoring something that should be ignored.

Kirtlandia, BHL bib 121359

Hi Rod,
I apologize for not being in touch sooner. We got complete article metadata for Kirtlandia from the library at the Carnegie Museum of Natural History. I did a lot of scrubbing of the data in OpenRefine, including some rudimentary standardization of the author names. I then dumped the metadata out of OpenRefine as a TSV, loaded it into EndNote and dumped it out of EndNote as an RIS file. Finally I turned it into a zip file because that seems to be a requirement for adding the metadata to a GitHub issue. I noticed that there was a single Kirtlandia article already defined in BioStor and BHL so I removed it from the attached metadata.

Please let me know if this metadata is all right or if you can recommend ways to improve it. The other thing that occurs to me is that the EABL team could share a cloud-based EndNote library with you and you could grab metadata from there.

As always, thanks very, very much.
Susan Lynch
kirtlandia6.txt.zip

Index Myriapodologica

Small journal being added to BioStor, see http://biostor.org/issn/0163-5395/year/1978

Missing page caused by duplicate page name

http://biostor.org/reference/160487 shows only one plate, but http://direct.biostor.org/reference/160487 has two. I think the two plates are both labelled "Pl XXV" and my code assumes unique page names :(
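
If the cause is as suspected, the failure mode is easy to reproduce: a lookup keyed on page name keeps only the last page with that name. A minimal sketch of the problem and one possible fix, which stores a list of page IDs per name; the page IDs here are made up.

```python
from collections import defaultdict

# Hypothetical (page_id, page_name) pairs with a repeated plate label.
pages = [(1001, "Pl XXIV"), (1002, "Pl XXV"), (1003, "Pl XXV")]

# Naive index: the second "Pl XXV" silently overwrites the first.
naive = {name: page_id for page_id, name in pages}


def index_pages(pages):
    """Map each page name to every page ID carrying it, in original order."""
    by_name = defaultdict(list)
    for page_id, name in pages:
        by_name[name].append(page_id)
    return dict(by_name)


index = index_pages(pages)
```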

Entomological News pagination changed

Check http://www.biodiversitylibrary.org/item/21125 as the BHL pagination has changed.

Question about italics for botanical scientific names.

@rdmpage I've been given thousands of good citations for Phytologia, ISSN 0031-9430, from the Index of American Botanical Literature. There's a potential for many more citations from this source. In the csv files that I received from the system administrator (EMU at NYBG), the article titles contain mark-up to cause scientific names to be presented on the UI in italics as they should be. For example, I was given:
A new variety of <Astragalus hyalilnus> (Fabaceae) from Wyoming
Italics are used when this is displayed in IABL. See http://sweetgum.nybg.org/science/iabl/iabl_details.php?irn=468377
In BHL: http://www.biodiversitylibrary.org/part/184351
In BioStor: http://biostor.org/reference/175861

In BioStor and BHL, it seems that we have no way to display marked up text in italics. Is this true? I can easily strip out this markup using a regular expression and OpenRefine but it seems a shame to do so. A subject matter expert (NYBG botanist) did a lot of work to markup the scientific names and it's the right thing to do. If I discard the markup now we'll probably never get it back... I'd very much like your advice on this.
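
One option that avoids discarding the markup is to keep the marked-up title as the stored value and derive both a plain and an HTML form from it. The sketch below assumes angle brackets are used only to delimit scientific names, as in the example above; a title with a literal `<` would be misread. The function names are illustrative.

```python
import html
import re

# Matches one <...> span containing no nested angle brackets.
NAME_MARKUP = re.compile(r"<([^<>]+)>")


def strip_markup(title):
    """Plain-text form: drop the angle brackets, keep the name."""
    return NAME_MARKUP.sub(r"\1", title)


def to_html(title):
    """HTML form: escape everything, wrapping marked names in <i> tags."""
    parts = []
    last = 0
    for match in NAME_MARKUP.finditer(title):
        parts.append(html.escape(title[last:match.start()]))
        parts.append("<i>" + html.escape(match.group(1)) + "</i>")
        last = match.end()
    parts.append(html.escape(title[last:]))
    return "".join(parts)


title = "A new variety of <Astragalus hyalilnus> (Fabaceae) from Wyoming"
plain = strip_markup(title)
marked = to_html(title)
```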

Index Entomologica Americana

Rod

We'd like to index 3 journals from the New York Entomological Society

Entomologica Americana (v.1 (1885) to v.49 (1975)) http://biodiversitylibrary.org/bibliography/9429
Journal of the New York Entomological Society (v.1 (1893) to v.107 (1999)) http://biodiversitylibrary.org/bibliography/8089
Bulletin of the Brooklyn Entomological Society (v.1 (1878) to v.60 (1965)) http://biodiversitylibrary.org/bibliography/16211

Let's start with Entomologica Americana. Some of this has been indexed already by BioStor so I'm trying to figure out where we need to gap fill. A BHL staff member informed me that Web of Science has 749 citations for Entomologica Americana. Since I don't have access to Web of Science myself, the staff member would need to download them for me. What I'm wondering is: do you have access to Web of Science, and are you able to verify which citations they have vs those in BioStor and BHL? If it's easier, I can have the staff member download the citations and we can send them to you. Let me know what you think is most efficient.
Trish