biostor's Introduction

biostor

Join the chat at https://gitter.im/biostor/Lobby

SSH keys

Use GitHub SSH keys (see https://pagodabox.io/docs/setting_up_ssh-osx-linux). The following command copies the public key to the clipboard:

pbcopy < ~/.ssh/github_rsa.pub

You can then paste this key into the Pagodabox site.

Multiple keys

I found that Pagodabox would sometimes expect the GitHub SSH key and at other times the one I’d generated for Pagodabox, so I pasted both keys into the Pagodabox admin panel.

Pushing to Pagodabox

You need to add Pagoda as a remote repository:

git remote add pagoda [email protected]:apps/biostor.git

Then push the changes

git push pagoda --all

Nanobox

Nanobox is the replacement for Pagodabox. It requires a different boxfile. Note that this file needs to specify some additional PHP extensions that others depend upon. For example, adding xsl by itself generated linking errors until dom was also added.

run.config:
  # install php and associated runtimes
  engine: php
  # php engine configuration (php version, extensions, etc)
  engine.config:
    # sets the php version to 5.6
    runtime: php-5.6
    extensions:
      - curl
      - exif
      - gd
      - mbstring
      - dom # need this for xsl to work
      - xml # need this for utf8_decode
      - xsl
      
web.main:

  start:
    php: start-php
    apache: start-apache
    
  # the path to a logfile you want streamed to the nanobox dashboard
  log_watch:
    apache[access]: /data/var/log/apache/access.log
    apache[error]: /data/var/log/apache/error.log
    php[error]: /data/var/log/php/php_error.log
    php[fpm]: /data/var/log/php/php_fpm.log
        

To make a local version of BioStor, type:

nanobox deploy dry-run

To deploy to nanobox add a remote, e.g.:

nanobox remote add happy-hog

Then deploy:

nanobox deploy

Things to remember

  • Make sure that your DNS for your website points to the IP address (A-Record) of the nanobox app (find the A-Record in the “Network” tab).

  • Add the New Relic key to the “CONFIG” tab.

Nanobox resources required

BioStor initially ran on Google Cloud on an f1-micro instance (1 vCPU, 0.6 GB memory), which Google Cloud reported as overused. More resources can be added via nanobox “scale”, which sets up a new server. We still need to explore why more resources are needed.

Cloudflare

nanobox

Type  Name         Value
A     biostor.org  IP address from nanobox

heroku

Type   Name         Value
CNAME  biostor.org  biostor.org.herokudns.com
CNAME  www          biostor.org.herokudns.com

Cloudflare applies CNAME flattening to

Heroku

Deploy to Heroku.

New Relic on Heroku

Check it is installed:

heroku run env --app biostor | grep NEW_RELIC

CouchDB on Bitnami

Create an instance at https://google.bitnami.com/vms

Note that you need to follow the steps at https://docs.bitnami.com/google/infrastructure/couchdb/#how-to-connect-to-couchdb-from-a-different-machine before you can connect from a different machine. Click on “Launch ssh console” and edit the local.ini file:

sudo nano /opt/bitnami/couchdb/etc/local.ini

Change the bind_address from 127.0.0.1 to 0.0.0.0:

[chttpd]
port = 5984
bind_address = 0.0.0.0
...

[httpd]
bind_address = 0.0.0.0
...

Reboot the VM.

Firewall

Note that now we also need to add firewall access, see https://docs.bitnami.com/google/faq/administration/use-firewall/

Replicate

curl http://localhost:5984/_replicate -H 'Content-Type: application/json' -d '{ "source": "biostor", "target": "http://admin:<password>@IP-SERVER:5984/biostor"}'

Monitoring

Added the New Relic key. After a while New Relic shows data for the app: https://rpm.newrelic.com/accounts/691868/applications/8332767

Replication

Run this from the local machine to replicate the local CouchDB database to Cloudant.

curl http://localhost:5984/_replicate -H 'Content-Type: application/json' -d '{ "source": "biostor", "target": "https://<username>:<password>@rdmpage.cloudant.com/biostor"}'

IBM hosted Cloudant

curl http://localhost:5984/_replicate -H 'Content-Type: application/json' -d '{ "source": "biostor", "target": "https://<username>:<password>@4c577ff8-0f3d-4292-9624-41c1693c433b-bluemix.cloudant.com/biostor" }'
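The replication commands in this README all POST the same JSON shape to CouchDB's `_replicate` endpoint and differ only in the target URL. As a minimal sketch, assuming Python is available on the local machine, the request could be built like this. The function name `build_replication_request` and its `couch_url` parameter are hypothetical and not part of BioStor; the placeholders `<username>` and `<password>` are left unfilled, as in the commands above.

```python
import json
import urllib.request


def build_replication_request(source, target, couch_url="http://localhost:5984"):
    """Build (but do not send) a POST request for CouchDB's _replicate endpoint.

    `source` and `target` are a database name or a full database URL,
    exactly as in the curl commands in this README.
    """
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        couch_url + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_replication_request(
        "biostor",
        "https://<username>:<password>@rdmpage.cloudant.com/biostor",
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
    # Sending is left out on purpose; with real credentials it would be:
    # urllib.request.urlopen(request)
```

Keeping the request construction separate from sending makes it easy to inspect the JSON body before any data is copied to a remote server.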

Image proxy

BioStor uses CloudFlare http://cloudflare.com to provide caching, and by default CloudFlare doesn’t cache images with dynamic URLs (i.e., it expects a URL to have a file extension). I’ve borrowed heavily from https://github.com/andrieslouw/imagesweserv to create an image proxy that fetches images from BHL, then outputs them such that CloudFlare will treat them as static images and cache them.
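To illustrate the distinction the proxy relies on, here is a rough Python sketch that checks whether a URL path ends in an image file extension. It is not the proxy itself and not CloudFlare's actual rule set: the extension list is a small illustrative subset, and the function name and the example.org URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative subset only; the real list of cached extensions is CloudFlare's.
STATIC_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def has_static_image_extension(url):
    """Return True if the URL path (ignoring any query string) ends in an image extension."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in STATIC_IMAGE_EXTENSIONS


if __name__ == "__main__":
    # A dynamic-looking URL without an extension.
    print(has_static_image_extension("https://example.org/pageimage/12345"))
    # The same image exposed under a path that ends in .jpg.
    print(has_static_image_extension("https://example.org/proxy/12345.jpg"))
```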

Future ideas

Interfaces

For a very different interface to historical texts see the UK Medical Heritage Library.

Backup

See details in “backup” folder.

biostor's People

Contributors

gitter-badger, katrinleinweber, rdmpage, waffle-iron

biostor's Issues

More BHL items replaced by others

A few more instances of articles being attached to BHL items that have been replaced occurred during this weekend's data harvest.

Here is the list of no-longer-active item IDs and their replacements:

ID       Replaced By
17635    84886
17695    84741
18270    84781
18295    84760
45725    87759
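A list like the one above can be applied as a simple lookup. The following Python sketch is only an illustration of reassigning item IDs. The article records, with an `item_id` field, are a hypothetical shape rather than BioStor's actual data model, and reassigning page IDs is not covered because that needs page-level data from BHL.

```python
# Replacement pairs copied from the list above (old item ID -> new item ID).
REPLACED_ITEMS = {
    17635: 84886,
    17695: 84741,
    18270: 84781,
    18295: 84760,
    45725: 87759,
}


def current_item_id(item_id, replaced=REPLACED_ITEMS):
    """Follow replacements until an ID is reached that has not itself been replaced."""
    seen = set()
    while item_id in replaced and item_id not in seen:
        seen.add(item_id)  # guards against an accidental cycle in the list
        item_id = replaced[item_id]
    return item_id


def reassign_items(articles, replaced=REPLACED_ITEMS):
    """Return copies of the article records with up-to-date item IDs."""
    return [
        dict(article, item_id=current_item_id(article["item_id"], replaced))
        for article in articles
    ]


if __name__ == "__main__":
    # Hypothetical article records, for illustration only.
    articles = [{"id": 1, "item_id": 17635}, {"id": 2, "item_id": 12345}]
    print(reassign_items(articles))
```

Following replacements in a loop means an item that is replaced more than once over time still resolves to its latest ID.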

Replaced BHL items

Via email from @mlichtenberg:

Over time, re-scans of books have resulted in BHL items being replaced with other items. This has resulted in about 2500 articles in BHL and BioStor that are associated with about 100 inactive items. (For example, http://biostor.org/reference/134917.)

At the end of this message is a list of no-longer-active BHL item IDs along with their replacement IDs. Given this list, would it be difficult for you to reprocess the articles that are attached to those items so that they are reassigned to the correct BHL item and page IDs?

If you could do that, it would fix both BioStor and BHL. BHL should pick up the updates the next time BioStor data is harvested.

ID            Replaced By
13942    54265
13961    54271
13976    86336
13978    54725
43946    130127
13936    54259
13950    54269
13953    54267
13967    89035
14000    87380
14017    89038
14018    89037
119060  211380
13946    54274
13957    54437
13971    65863
13996    89024
14014    89030
14021    87374
14400    86341
46567    13718
13941    54264
13945    86340
13958    54743
13975    86335
13992    65873
14009    89021
13948    54272
13955    54435
13962    54732
13991    54728
13998    54735
14005    87377
14012    89029
36407    133894
13947    54273
13956    54436
13981    86337
13990    54727
14015    89031
53247    55101
13949    54270
13954    54266
13979    54721
13988    65870
14013    89405
14022    87373
81625    34159
105675  101305
119064  212127
13935    54258
13968    89036
13985    65867
13986    65868
14003    54734
13939    54262
13964    54730
13989    65872
14007    87375
13938    54261
13965    54741
13974    86342
126229  127883
13943    54717
13960    54733
13977    54724
13994    89027
14011    54744
50997    183290
149492  182036
12974    54720
13959    54742
13993    89026
13995    89028
14008    89020
14010    89022
81620    105193
105799  101303
13951    54719
13952    54268
13966    54745
14001    87379
14002    54746
46849    183041
13937    54260
13973    54723
13980    54726
13987    65869
14016    87378
25429    202453
13940    54263
13963    54731
13997    89023
13999    89032
14004    89033
14006    87376
105316  100979
13933    54249
13934    54257
13969    65861
13970    65862
13983    65865
13984    65866
14019    86338
14020    86343
20991    100809
105673  101293

Define lacking articles in Zoologica, New York Zoological Society

@rdmpage My enthusiasm about the definition of articles in Kirtlandia inspired me to finish the work that I was doing on Zoologica. I got complete article metadata from the librarians at the Wildlife Conservation Society and cleaned it up. I also removed all citations that are already defined in BHL in order to prevent duplication. I did this by pulling all articles defined in Zoologica in BHL in EndNote format using the BHL APIs and then importing those citations into EndNote. I could spot the missing articles fairly easily once everything was in EndNote. Again, there's an RIS file inside the zip.
zoologica_nodups.txt.zip

More informative <title> tag for browser window

   ===============================================
                          Bruxelles
                           2015_08_03


   Dear Roderic Page

   | The new version is still a work in progress,
   | and I'd welcome any feedback and/or comments.

   Looks good - but would it be possible to have a more
   informative uri title than <title>Titles</title> ?
   This would help e.g. bookmarking

    Yours sincerely / Vriendelijke groeten / Bien à vous

    Richard Hardwick
    ===============================================
    email  [email protected]
    ===============================================

BioStor articles not harvested by BHL

The limit on the number of items returned by my API, which BHL uses for harvesting, has probably resulted in some articles being missed.

Rod,

BHL did not pick up the new articles Susan references in her email below because the BHL item (48981) does not appear in this list: http://direct.biostor.org/itemsince.php?since=2017-01-01.

It seems that the reason it does not appear is that the list is limited to 1000 items. I never noticed that before. (Although looking now at the first emails we exchanged about the BioStor API back in February of 2012, I see that you wrote “I've limited this to 1000 items so the database doesn't keel over”… uh oh!)

I am now wondering how many other articles have been missed over time. Two questions: First, is there a way to get more than 1000 items in the response? If not, what is an expected upper limit on the number of items to be processed in a single day… maybe BHL needs to harvest more frequently than once a week. Second, can you generate a list of all BHL item IDs for which there are articles in BioStor? Such a list would help me get an idea how many items were missed by the harvesting process.

Thanks,

MIKE
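One way a harvester could notice this problem is to treat any response that reaches the limit as possibly truncated. The Python sketch below assumes only what the email states: the `itemsince.php?since=` URL and the limit of 1000 items. The function names are hypothetical, and the response format is not assumed, so the check works on any already-parsed list of items.

```python
from datetime import date

ITEM_LIMIT = 1000  # the limit mentioned in the email above


def itemsince_url(since, base="http://direct.biostor.org/itemsince.php"):
    """Build the harvesting URL for items changed since the given date."""
    return f"{base}?since={since.isoformat()}"


def may_be_truncated(items, limit=ITEM_LIMIT):
    """Return True if the list is long enough that the limit may have cut it off."""
    return len(items) >= limit


if __name__ == "__main__":
    print(itemsince_url(date(2017, 1, 1)))
    # A stand-in list of 1000 item IDs, not real data.
    print(may_be_truncated(list(range(1000))))
```

A harvester that sees `may_be_truncated` return True could harvest again with a more recent `since` date, or more often than once a week, as suggested in the email.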

Index Raptor Research

BHL received permission to digitize and upload to BHL the journals from Raptor Research Foundation.

Permission given for the following titles:

Raptor Research News (v.1-5, 1967-1971)
Raptor Research (v.6-20, 1972-1986)
Journal of Raptor Research (v.21-39, 1987-2005)
Wingspan (v.1 (1992) to present)

These are now in BHL and I will begin downloading citations from Web of Science. I'll let you know when these files are ready for BioStor.

Add articles for Arnoldia, 0004-2633

@rdmpage This publication will be scanned and added to BHL shortly. I received a full set of article metadata for the publication from the Arnold Arboretum and will provide 'scrubbed' article metadata after the content is uploaded. Is this content within scope for BioStor? Can we handle these articles the same way that we handled the articles for other publications, e.g. Phasmid Studies?

Phytologia 0031-9430 and a question

@rdmpage I want to be sure that there's no duplication of effort between you and the team that I work with when gathering citations/references. (We did some work to create article level metadata for Quaestiones Entomologicae and I just noticed that you pulled most if not all of the articles for that publication in from elsewhere. Did you pull the citations/references in from Wikidata? If so would you be willing to share the technique that you used with me?) I got what I think is a complete list of citations for Phytologia from the Index of American Botanical Literature. We're happy to fill the article definition gaps for this publication but don't want to invest the time if you're on the verge of filling the gaps in a different way. Are you working on Phytologia or should we go ahead and identify the gaps? Thanks for letting me know.

Index The Journal of the East Africa and Uganda Natural History Society

Rod,

BHL has recently digitized some items for the Journal of the East Africa and Uganda Natural History Society (see http://www.biodiversitylibrary.org/bibliography/14163#/summary) and we'd like to article-ize this content. The publisher, East Africa Natural History Society, gave us article citations for this content back in the Citebank days. I'm hoping we can pass the citations on to you to index via BioStor. One of the challenges with the journal is the inconsistent numbering of volumes and issues. Also, many of the volumes were bound together, which means there will be multiple page 1s in a single item. I recall you saying this makes it more challenging for BioStor. I have attached some sample data for your review just to see if it contains enough data to do the matching. Let me know your thoughts.
JEANH_1910_1918.xlsx

Trish

PHP errors if search fails to return results

php[error] [25-Nov-2016 12:39:35 UTC] PHP Notice:  Undefined property: stdClass::$total_rows in /data/index.php on line 1110
php[error] [25-Nov-2016 12:39:35 UTC] PHP Notice:  Undefined property: stdClass::$total_rows in /data/index.php on line 1113
php[error] [25-Nov-2016 12:39:35 UTC] PHP Notice:  Undefined property: stdClass::$total_rows in /data/index.php on line 1135
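The notices suggest that index.php reads `total_rows` from a decoded response that lacks that field when a search fails. The actual fix belongs in the PHP code, which is not shown here. As a language-neutral illustration, the same defensive read is sketched below in Python, with a hypothetical function name and made-up response strings.

```python
import json


def total_rows(response_text):
    """Return total_rows from a JSON response, or 0 if it is missing or unusable."""
    try:
        document = json.loads(response_text)
    except ValueError:
        # Not valid JSON at all, e.g. an HTML error page.
        return 0
    if not isinstance(document, dict):
        return 0
    value = document.get("total_rows", 0)
    return value if isinstance(value, int) else 0


if __name__ == "__main__":
    print(total_rows('{"total_rows": 3, "rows": []}'))
    print(total_rows('{"error": "bad_request"}'))
```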

Incomplete (duplicate?) articles

BHL picked up a few new articles (BioStor IDs 50280, 50281, 50282, and 50284) missing a lot of metadata (including titles) this weekend. Each new article includes only a single page, and in each case that page duplicates the start page of another article that was defined long ago.

You can see these articles with these API calls:

http://direct.biostor.org/itemarticles.php?item=25762
http://direct.biostor.org/itemarticles.php?item=25760

Based on the dates attached to them, it looks like these articles were actually defined in BioStor a long time ago. Also, I have made a few changes to BHL's data harvesting process recently. Let me know if the appearance of these articles in the API is not new; maybe the BHL changes have resulted in the harvesting process no longer ignoring something that should be ignored.

Kirtlandia, BHL bib 121359

Hi Rod,
I apologize for not being in touch sooner. We got complete article metadata for Kirtlandia from the library at the Carnegie Museum of Natural History. I did a lot of scrubbing of the data in OpenRefine, including some rudimentary standardization of the author names. I then dumped the metadata out of OpenRefine as a tsv, loaded it into EndNote and dumped it out of EndNote as an RIS file. Finally I turned it into a zip file because that seems to be a requirement for adding the metadata to a GitHub issue. I noticed that there was a single Kirtlandia article already defined in BioStor and BHL so I removed it from the attached metadata.

Please let me know if this metadata is all right or if you can recommend ways to improve it. The other thing that occurs to me is that the EABL team could share a cloud-based EndNote library with you and you could grab metadata from there.

As always, thanks very, very much.
Susan Lynch
kirtlandia6.txt.zip

Question about italics for botanical scientific names.

@rdmpage I've been given thousands of good citations for Phytologia, ISSN 0031-9430, from the Index of American Botanical Literature. There's a potential for many more citations from this source. In the csv files that I received from the system administrator (EMU at NYBG), the article titles contain mark-up to cause scientific names to be presented on the UI in italics as they should be. For example, I was given:
A new variety of <Astragalus hyalilnus> (Fabaceae) from Wyoming
Italics are used when this is displayed in IABL. See http://sweetgum.nybg.org/science/iabl/iabl_details.php?irn=468377
In BHL: http://www.biodiversitylibrary.org/part/184351
In BioStor: http://biostor.org/reference/175861

In BioStor and BHL, it seems that we have no way to display marked up text in italics. Is this true? I can easily strip out this markup using a regular expression and OpenRefine but it seems a shame to do so. A subject matter expert (NYBG botanist) did a lot of work to mark up the scientific names and it's the right thing to do. If I discard the markup now we'll probably never get it back... I'd very much like your advice on this.
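Both options can be sketched with one regular expression, shown here in Python rather than OpenRefine. This assumes that angle brackets in these titles are used only to mark scientific names, as in the example title above. The function names are hypothetical, and the `<i>` output is one possible target format, not something BioStor or BHL is known to support.

```python
import re

# Matches text wrapped in angle brackets, e.g. <Astragalus hyalilnus>.
MARKED_NAME = re.compile(r"<([^<>]+)>")


def italicise(title):
    """Convert <Name> markup to HTML italics, keeping the botanist's work."""
    return MARKED_NAME.sub(r"<i>\1</i>", title)


def strip_markup(title):
    """Remove the angle brackets and keep the plain name."""
    return MARKED_NAME.sub(r"\1", title)


if __name__ == "__main__":
    title = "A new variety of <Astragalus hyalilnus> (Fabaceae) from Wyoming"
    print(italicise(title))
    print(strip_markup(title))
```

Storing the marked-up title and stripping it only at display time would preserve the markup until italics can be shown.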

Index Entomologica Americana

Rod

We'd like to index 3 journals from the New York Entomological Society

  1. Entomologica Americana (v.1 (1885) to v.49 (1975)) http://biodiversitylibrary.org/bibliography/9429
  2. Journal of the New York Entomological Society (v.1 (1893) to v.107 (1999)) http://biodiversitylibrary.org/bibliography/8089
  3. Bulletin of the Brooklyn Entomological Society (v.1 (1878) to v.60 (1965)) http://biodiversitylibrary.org/bibliography/16211

Let's start with Entomologica Americana. Some of this has been indexed already by BioStor so I'm trying to figure out where we need to gap fill. A BHL staff member informed me that Web of Science has 749 citations for Entomologica Americana. Since I don't have access to Web of Science myself the staff member would need to download them for me. What I'm wondering is whether you have access to Web of Science and are able to verify which citations they have vs those in BioStor and BHL. If it's easier I can have the staff member download the citations and we can send them to you. Let me know what you think is most efficient.
Trish
