GNfinder finds scientific names in UTF8 texts, PDF files, MS Word/Excel documents, URLs etc.

License: MIT License

gnfinder's Introduction

Global Names Finder (GNfinder)

Try GNfinder online or learn about its API.

GNfinder is a very fast finder of scientific names that uses dictionary and NLP approaches. On a modern multiprocessor laptop it is able to process 15 million pages per hour. It works with many file formats and can verify names against many biological databases. Full functionality requires an Internet connection.

GNfinder is also available via web or as a RESTful API.

Citing
Features
Installation
Configuration
Usage
Projects based on GNfinder
Development
Testing

Citing

Zenodo DOI can be used to cite GNfinder.

Features

Multiplatform app (supports Linux, Windows, Mac OS X).
Self-contained with no external dependencies: only the gnfinder or gnfinder.exe binary (~15 MB) is needed. However, an Internet connection is required for name verification.
Includes REST API and web-based User Interface.
Takes UTF8-encoded text and returns CSV, TSV or JSON-formatted output that contains the detected scientific names.
Extracts text from PDF, MS Word, MS Excel, HTML, XML, RTF, JPG, TIFF, GIF and other files for name detection.
Downloads a web page from a given URL for name detection.
Optionally detects the language of the text automatically and adjusts the Bayes algorithm for that language. English and German are currently supported.
Uses complementary heuristic and natural language processing algorithms.
Optionally verifies found names against multiple biodiversity databases using gnindex service.
Detection of nomenclatural annotations like sp. nov., comb. nov., ssp. nov., nom. nov. and their variants.
Ability to see words that surround detected name-strings.
The library can be used concurrently to significantly improve speed. On a server with 40 threads it is able to detect names on 50 million pages in approximately 3 hours using both heuristic and Bayes algorithms. See the bhlindex project for an example.

Installation

Homebrew on Mac OS X, Linux, and Linux on Windows (WSL2)

Homebrew is a popular package manager for open source software, originally developed for Mac OS X. It is now also available on Linux and can be used on MS Windows 10 or 11 if Windows Subsystem for Linux (WSL) is installed.

Note that Homebrew requires some other programs to be installed, such as Curl, Git and a compiler (GCC on Linux, Xcode on Mac). If that is too much, go to the Linux and Mac without Homebrew section.

Install Homebrew according to their instructions.

Install GNfinder with:

brew tap gnames/gn
brew install gnfinder
# to upgrade
brew upgrade gnfinder

Arch Linux AUR package

The AUR package is located at https://aur.archlinux.org/packages/gnfinder. Install it by hand, or with an AUR helper like yay or pacaur.

yay -S gnfinder
# or
pacaur -S gnfinder

Manual Install

GNfinder consists of just one executable file, so it is easy to install by hand. To do that, download the binary executable for your operating system from the latest release.

Linux and Mac without Homebrew

Move gnfinder executable somewhere in your PATH (for example /usr/local/bin)

sudo mv path_to/gnfinder /usr/local/bin

Windows without Homebrew and WSL

It is possible to use GNfinder natively on Windows, without Homebrew or Linux installed.

One possible way would be to create a default folder for executables and place gnfinder there.

Press the Windows+R key combination and type "cmd". In the terminal window that appears, type:

mkdir C:\bin
copy path_to\gnfinder.exe C:\bin

Add C:\bin directory to your PATH environment variable.

Go

Install Go v1.19 or higher.

git clone https://github.com/gnames/gnfinder
cd gnfinder
make tools
make install

Configuration

When you run gnfinder command for the first time, it will create a gnfinder.yml configuration file.

The file is located in one of the following places, depending on the operating system:

MS Windows: C:\Users\<username>\AppData\Roaming\gnfinder.yml

Mac OS: $HOME/.config/gnfinder.yml

Linux: $HOME/.config/gnfinder.yml

This file lets you set options that modify the behaviour of GNfinder according to your needs, so you do not have to enter the same command line flags again and again.

Command line flags will override the settings in the configuration file.

It is also possible to set up environment variables. They override the settings from both the configuration file and the command line flags.

Settings	Environment variables
BayesOddsThreshold	GNF_BAYES_ODDS_THRESHOLD
DataSources	GNF_DATA_SOURCES
Format	GNF_FORMAT
InputTextOnly	GNF_INPUT_TEXT_ONLY
IncludeInputText	GNF_INCLUDE_INPUT_TEXT
Language	GNF_LANGUAGE
TikaURL	GNF_TIKA_URL
TokensAround	GNF_TOKENS_AROUND
VerifierURL	GNF_VERIFIER_URL
WithAllMatches	GNF_WITH_ALL_MATCHES
WithAmbiguousNames	GNF_WITH_AMBIGUOUS_NAMES
WithBayesOddsDetails	GNF_WITH_BAYES_ODDS_DETAILS
WithOddsAdjustment	GNF_WITH_ODDS_ADJUSTMENT
WithPlainInput	GNF_WITH_PLAIN_INPUT
WithPositionInBytes	GNF_WITH_POSITION_IN_BYTES
WithUniqueNames	GNF_WITH_UNIQUE_NAMES
WithVerification	GNF_WITH_VERIFICATION
WithoutBayes	GNF_WITHOUT_BAYES
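
The precedence described above (configuration file, then command line flags, then environment variables) can be sketched in Go. This is only an illustration of the ordering, not GNfinder's actual configuration code; the function resolveFormat and its parameters are hypothetical, and only the GNF_FORMAT variable name comes from the table.

```go
package main

import (
	"fmt"
	"os"
)

// resolveFormat is a hypothetical helper that picks the output format using
// the precedence stated in this section:
// configuration file < command line flag < environment variable.
// An empty string means "not set" at that level.
func resolveFormat(fromConfig, fromFlag, fromEnv string) string {
	format := fromConfig
	if fromFlag != "" {
		format = fromFlag
	}
	if fromEnv != "" {
		format = fromEnv
	}
	return format
}

func main() {
	// GNF_FORMAT is the environment variable listed for the Format setting.
	env := os.Getenv("GNF_FORMAT")
	// With GNF_FORMAT unset this prints the flag value; with it set, the
	// environment value wins.
	fmt.Println(resolveFormat("csv", "pretty", env))
}
```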

Usage

Usage of a web-based application.

GNfinder can be found at https://finder.globalnames.org.

Usage of RESTful API

API is located at https://finder.globalnames.org/api/v1.

The best source for API usage is its documentation.

If you want to start your own API endpoint (for example on localhost, port 8080) use:

gnfinder -p 8080
curl localhost:8080/api/v1/ping

To upload a file and detect names from its content:

curl -v -F verification=true -F file=@/path/to/test.txt https://gnfinder.globalnames.org/api/v1/find
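
The same upload can be prepared from Go with the standard library. The sketch below builds a multipart request with the two form fields used by the curl command (verification and file) and targets a locally started server on port 8080. The helper name newFindRequest is hypothetical, and the shape of the response is not covered here; consult the API documentation for that.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// newFindRequest builds a multipart/form-data POST that mirrors
//   curl -F verification=true -F file=@<fileName> <endpoint>
func newFindRequest(endpoint, fileName string, content []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	if err := w.WriteField("verification", "true"); err != nil {
		return nil, err
	}
	part, err := w.CreateFormFile("file", fileName)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(content); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; without it the body is invalid.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	req, err := newFindRequest(
		"http://localhost:8080/api/v1/find",
		"test.txt",
		[]byte("Pomatomus saltator and Parus major"),
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)
	// To send the request: resp, err := http.DefaultClient.Do(req)
}
```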

Usage as a command line app

To see flags and usage:

gnfinder --help
# or just
gnfinder

To see the version of its binary:

gnfinder -V

Examples:

Starting as a web-application and an API server on port 8080

gnfinder -p 8080

Getting names from a UTF8-encoded file without remote Tika service.

# -U flag prevents use of remote Apache Tika service for file conversion to
# UTF8-encoded plain text
# -U flag is optional, but it removes unnecessary remote call to Tika.

gnfinder file_with_names.txt -U

Getting names from a UTF8-encoded file in tab-separated values (TSV) format

gnfinder file_with_names.txt -U -f tsv

Getting names from a file that is not a plain UTF8-encoded text

gnfinder file.pdf

Getting names from a URL

gnfinder https://en.wikipedia.org/wiki/Raccoon

Getting unique names from a file in JSON format. Disables -w flag.

gnfinder file_with_names.txt -u -f pretty

Getting names from a file in JSON format, and using jq to process JSON

gnfinder file_with_names.txt -f compact | jq

Getting data from a pipe forcing English language and verification

echo "Pomatomus saltator and Parus major" | gnfinder -v -l eng
echo "Pomatomus saltator and Parus major" | gnfinder --verify --lang eng

Limit matches to NCBI and Encyclopedia of Life. For the list of data source ids go to gnverifier's data sources page.

echo "And Parus major" | gnfinder -v -l eng -s "4,12"
echo "And Parus major" | gnfinder --verify --lang eng --sources "4,12"

Preserve uninomial names that are also common words.

echo "Cancer is a genus" | gnfinder -A
echo "America is also a genus" | gnfinder --ambiguous-uninomials

Show all matches, not only the best result.

echo "Pomatomus saltator and Parus major" | gnfinder -M
echo "Pomatomus saltator and Parus major" | gnfinder --all-matches

Show all matches, but only for selected data-sources.

echo "Pomatomus saltator and Parus major" | gnfinder -M -s 1,12

Adjusting prior odds using information about found names. The odds are calculated as "found names number / (capitalized words number - found names number)". Such an adjustment decreases the odds for texts with very few names and increases them for texts with a lot of found names.

gnfinder -a -d -f pretty file_with_names.txt
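
The prior odds formula quoted above is simple arithmetic, and a small Go sketch makes its effect concrete. The function name adjustedPriorOdds and the guard for a non-positive denominator are assumptions made for this illustration; how GNfinder itself treats such edge cases is not stated here.

```go
package main

import "fmt"

// adjustedPriorOdds applies the formula from the text:
//   found names / (capitalized words - found names)
// It reports ok=false when the denominator is not positive, because the
// formula is undefined there (this guard is an assumption of the sketch).
func adjustedPriorOdds(foundNames, capitalizedWords int) (odds float64, ok bool) {
	rest := capitalizedWords - foundNames
	if foundNames < 0 || rest <= 0 {
		return 0, false
	}
	return float64(foundNames) / float64(rest), true
}

func main() {
	sparse, _ := adjustedPriorOdds(2, 202) // few names among many capitalized words
	dense, _ := adjustedPriorOdds(50, 100) // half of the capitalized words are names
	fmt.Println(sparse, dense)
}
```

A text with 2 names among 202 capitalized words gets odds of 0.01, while a text where 50 of 100 capitalized words are names gets odds of 1.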

Returning 5 words before and after a found name-candidate. This flag is ignored if unique names are returned.

gnfinder -w 5 file_with_names.txt
gnfinder --words-around 5 file_with_names.txt
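
To illustrate what "words around" means, here is a minimal Go sketch that returns up to n tokens on each side of a name inside a token slice. It splits on whitespace only and is not GNfinder's tokenizer; the function wordsAround is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// wordsAround returns up to n tokens before and after the name that occupies
// tokens[start:end]. It assumes 0 <= start <= end <= len(tokens) and n >= 0.
func wordsAround(tokens []string, start, end, n int) (before, after []string) {
	from := start - n
	if from < 0 {
		from = 0
	}
	to := end + n
	if to > len(tokens) {
		to = len(tokens)
	}
	return tokens[from:start], tokens[end:to]
}

func main() {
	tokens := strings.Fields("The great tit Parus major is a common garden bird")
	// "Parus major" occupies tokens 3 and 4.
	before, after := wordsAround(tokens, 3, 5, 2)
	fmt.Println(before, after) // [great tit] [is a]
}
```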

Getting data from a file and redirecting result to another file

gnfinder file1.txt > file2.json

Detection of nomenclatural annotations

echo "Parus major sp. n." | gnfinder

Returning found names positions in the number of bytes from the beginning of the text instead of the number of UTF-8 characters

echo "Это Parus major" | gnfinder -b
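
The difference between the two ways of counting matters for any text with non-ASCII characters. The Go sketch below computes both offsets for the example string using only the standard library. It shows the arithmetic and does not call GNfinder; the function name offsets is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// offsets returns the start of substr in text, counted both in bytes and in
// UTF-8 characters (code points). It returns -1, -1 if substr is absent.
func offsets(text, substr string) (byteOffset, charOffset int) {
	byteOffset = strings.Index(text, substr)
	if byteOffset < 0 {
		return -1, -1
	}
	charOffset = utf8.RuneCountInString(text[:byteOffset])
	return byteOffset, charOffset
}

func main() {
	b, c := offsets("Это Parus major", "Parus major")
	fmt.Printf("bytes: %d, characters: %d\n", b, c)
	// Output: bytes: 7, characters: 4
}
```

Each of the three Cyrillic letters takes two bytes in UTF-8, so the name starts at byte 7 but at character 4.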

There is also a tutorial about processing many PDF files in parallel.

Usage as a library

import (
  "fmt"

  "github.com/gnames/gnfinder"
  "github.com/gnames/gnfinder/ent/nlp"
  "github.com/gnames/gnfinder/io/dict"
)

func Example() {
  txt := `Blue Adussel (Mytilus edulis) grows to about two
inches the first year,Pardosa moesta Banks, 1892`
  cfg := gnfinder.NewConfig()
  dictionary := dict.LoadDictionary()
  weights := nlp.BayesWeights()
  gnf := gnfinder.New(cfg, dictionary, weights)
  res := gnf.Find(txt)
  name := res.Names[0]
  fmt.Printf(
    "Name: %s, start: %d, end: %d",
    name.Name,
    name.OffsetStart,
    name.OffsetEnd,
  )
  // Output:
  // Name: Mytilus edulis, start: 13, end: 29
}

Usage as a docker container

docker pull gnames/gnfinder

# run GNfinder server, and map it to port 8888 on the host machine
docker run -d -p 8888:8778 --name gnfinder gnames/gnfinder

Projects based on GNfinder

gnfinder-plus allows working with MS Docs and PDF files without remote services (requires a local install of the poppler package).

bhlindex creates an index of scientific names for Biodiversity Heritage Library (BHL).

bhlnames adds synonymy and currently accepted names to searches in BHL, connects publications to pages in BHL.

Development

To install the latest GNfinder

git clone https://github.com/gnames/gnfinder
cd gnfinder
make tools
make install

Testing

From the root of the project:

make tools
# run make install for CLI testing
make install

To run tests go to the root directory of the project and run

go test ./...

#or

make test

gnfinder's Issues

Improve dictionaries

In #43 @Adafede wrote:
Hi,

Wonderful for the doc! Thank you for all your hard work and the nice last versions you developed. Now that I am able to build from scratch I had the opportunity to have a more careful look at the dictionaries and related files.

I therefore have a question:
Is there a precise reason why you chose to put piper in your black uninomials dictionary?

I am asking because I am interested in this species and cannot afford to lose it during text recognition. Since I was not able to build from scratch until now, I used the R taxize package which, in a way, finds more results than GNFinder, although it uses your tool! I think this is because they removed some entries from your black dictionaries, but I'm not sure about it. @sckott

I could build my own version on my side and modify it accordingly (commenting out piper, for example), but I think it is more useful to discuss this point publicly and maybe find common solutions, so that not everyone ends up with their own version.

If you are interested in it, I already retrieved manually some problematic entries that are found in some biological databases and lead to aberrant results, such as (in alphabetical order):

japanese yew (Taxus cuspidata)
anaerobic (is already in your list, but probably taxize ignores it)
candidatus (same as anaerobic)
chinensis
green (same as anaerobic)
megaleia
ootheca
peripatoides
red (same as anaerobic)
sinensis
tasmanian
uncultured

Many thanks again

As a user of gRPC server I want to get nomenclatural annotations and the words surrounding a name

As a User I want to be able to collect a number of tokens surrounding a name.

Additional tokens would demonstrate a context in which a name candidate exists. It would help to weed out false positives when a name is also a 'normal' word in some language.

As a User I want to be able to detect language for every new use of FindNames

If language is not given by an option, detect language for every FindNames, and return back not only a supported language, but also the code of detected language, as well as information if the language was forced by the language option or not.

As a User I want to try several times when I get any verification error

Currently, verification against gnindex is retried only on timeouts. We need to retry on every error that we get.
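
A generic retry loop of the kind requested here could look like the following Go sketch. It retries on any error, not only on timeouts. The names retry and errTemporary, the attempt count and the delay are all assumptions made for illustration and do not reflect the project's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTemporary stands in for any error returned by a verification call.
var errTemporary = errors.New("temporary verification failure")

// retry calls fn until it succeeds or attempts are exhausted, sleeping for
// delay between tries. It returns nil on success or the last error seen.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return err
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTemporary // the first two calls fail
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>
}
```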

As a Developer I want to have a Dockerfile to simplify deployment of gnfinder server

gRPC does not work with diacritics in UTF-8 files

Name verification breaks on large texts

How to reproduce:

from the root of the project:

make
gnfinder  testdata/seashells_book.txt -c

produces:

fatal error: concurrent map iteration and map write

goroutine 4 [running]:
runtime.throw(0x93a6f2, 0x26)
	/usr/local/go/src/runtime/panic.go:619 +0x81 fp=0xc420045640 sp=0xc420045620 pc=0x42b1a1
runtime.mapiternext(0xc420045758)
	/usr/local/go/src/runtime/hashmap.go:747 +0x55c fp=0xc4200456d0 sp=0xc420045640 pc=0x40a40c
github.com/gnames/gnfinder/resolver.prepareJobs(0xc42f6e1530, 0xc4200926c0, 0xc4200b4140)
	/home/dimus/go/src/github.com/gnames/gnfinder/resolver/resolver.go:114 +0xf7 fp=0xc4200457c8 sp=0xc4200456d0 pc=0x6fafe7
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4200457d0 sp=0xc4200457c8 pc=0x457161
created by github.com/gnames/gnfinder/resolver.Run
	/home/dimus/go/src/github.com/gnames/gnfinder/resolver/resolver.go:47 +0x20d
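
The "concurrent map iteration and map write" error is raised by the Go runtime when one goroutine ranges over a map while another writes to it. One common remedy is to guard the map with a mutex and iterate over a copy. The Go sketch below shows that pattern with a hypothetical nameSet type; it is not the fix that was applied in gnfinder's resolver package.

```go
package main

import (
	"fmt"
	"sync"
)

// nameSet is a set of name-strings that is safe for concurrent use.
type nameSet struct {
	mu    sync.RWMutex
	names map[string]struct{}
}

func newNameSet() *nameSet {
	return &nameSet{names: make(map[string]struct{})}
}

// Add inserts a name under the write lock.
func (s *nameSet) Add(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.names[name] = struct{}{}
}

// Snapshot copies the keys under the read lock, so callers can iterate over
// the result without racing against concurrent writers.
func (s *nameSet) Snapshot() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]string, 0, len(s.names))
	for n := range s.names {
		out = append(out, n)
	}
	return out
}

func main() {
	set := newNameSet()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			set.Add(fmt.Sprintf("Name %d", i))
			_ = set.Snapshot() // iterating while other goroutines write
		}(i)
	}
	wg.Wait()
	fmt.Println(len(set.Snapshot())) // 100
}
```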

As a User I want to see help message when entering just `gnfinder` alone to cmd line

As a User (when Bayes name-finding is enabled) I want all found names to have odds data

As a User I want to be able to find names in files where there is no space after comma

In CSV files there is usually no space after commas. As a result, scientific names are not found in them, even if names are present.
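
The problem can be reproduced with plain whitespace splitting, which glues a name to its neighbouring CSV fields. The Go sketch below contrasts that with splitting on both whitespace and commas. It is a simplified illustration, not GNfinder's tokenizer, and the helper splitTokens is hypothetical; a real tokenizer may need to keep the commas, for example to recognize authorship strings.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitTokens splits a line on whitespace and commas, so that names in CSV
// rows are separated from the neighbouring fields.
func splitTokens(line string) []string {
	return strings.FieldsFunc(line, func(r rune) bool {
		return r == ',' || unicode.IsSpace(r)
	})
}

func main() {
	row := "1,Parus major,Linnaeus"
	// Whitespace-only splitting yields two tokens: "1,Parus" and "major,Linnaeus".
	fmt.Println(len(strings.Fields(row)))
	// Splitting on commas as well yields: 1, Parus, major, Linnaeus.
	fmt.Println(splitTokens(row))
}
```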

grpc command made root command unreachable

Black / grey dictionaries

Hi,

First of all, thank you for your wonderful work!

I am using the binary executable and I was wondering if it was possible to "bypass" the black/grey dictionaries, or to modify them, or to allow "piper", as an example.

Thank you very much in advance.

P.S.: I'm using a MacOS device and I wasn't able to build it from scratch using make deps

As a gRPC User I want to know the version of gnfinder

Tokenizer breaks if a volume ends with a dash followed by space.

expand verification to the same level as GNRD data for primary and preferred sources

As a User I do not want Verification results to crash the program if I use its data

Sometimes there might be an error at the level of the resolution service. In such cases there is no BestResult at all.
We already save the error, but we return a null pointer for BestResult, and it crashes the program if users try to access details of the BestResult. We should return an empty BestResult instead of a null pointer.
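
The proposed behaviour can be sketched in Go as follows. The types BestResult and Verification and the constructor newVerification are hypothetical stand-ins with made-up fields, used only to show the idea of substituting an empty value for a nil pointer while still recording the error.

```go
package main

import "fmt"

// BestResult and Verification are simplified, hypothetical stand-ins for the
// verification output; the real structures have more fields.
type BestResult struct {
	MatchedName string
	CurrentName string
}

type Verification struct {
	BestResult *BestResult
	Error      string
}

// newVerification records the error, if any, and guarantees that BestResult
// is never nil, so callers can read its fields without a nil check.
func newVerification(best *BestResult, err error) Verification {
	v := Verification{BestResult: best}
	if err != nil {
		v.Error = err.Error()
	}
	if v.BestResult == nil {
		v.BestResult = &BestResult{}
	}
	return v
}

func main() {
	v := newVerification(nil, fmt.Errorf("resolution service unavailable"))
	// Accessing a field is safe even though no result was returned.
	fmt.Printf("%q %q\n", v.BestResult.MatchedName, v.Error)
}
```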

As a User I want to know edit distance for fuzzy matches

As a Developer I want gnfinder to be "go-gettable"

If we use gnfinder as a module we need to include generated files as well, because they are created by makefile, but module import does not create them by itself.

Wrong output when querying Catalogue of Life.

Observed problem

It looks like the wrong output is returned when matching against the Catalogue of Life.

Example

Example on the entry Solanum tuberosum.

Expected output

Expected output, according to Catalogue of Life web page : Solanum tuberosum (this is indeed the accepted name). See Catalogue of Life output for 'Solanum tuberosum ' query.

Actual output

However, the output of GNFinder is Solanum etuberosum (see below).

{
      "type": "Binomial",
      "verbatim": "(Solanum tuberosum),",
      "name": "Solanum tuberosum",
      "odds": 815712.9591463371,
      "odds_details": {
        "Name": {
          "abbr": {
            "false": 0.8679430877999654
          },
          "uniEnd3": {
            "num": 23.597054331788886
          },
          "spLen": {
            "9": 10.081642477424332
          },
          "spDict": {
            "WhiteSpecies": 5628.6125203841275
          },
          "spEnd3": {
            "sum": 172.68753397354592
          },
          "PriorOdds": {
            "true": 0.1
          },
          "uniLen": {
            "7": 0.8324223452545503
          },
          "uniDict": {
            "GreyGenus": 1.4684399638974754
          }
        }
      },
      "start": 29336,
      "end": 29356,
      "annotation": "",
      "verification": {
        "dataSourceId": 1,
        "dataSourceTitle": "Catalogue of Life",
        "taxonId": "6208dd5855b41dfa4f99a4a2d0a55854",
        "matchedName": "Solanum tuberosum Bert. ex Walp.",
        "currentName": "Solanum etuberosum Lindl.",
        "isSynonym": true,
        "classificationPath": "Plantae|Tracheophyta|Magnoliopsida|Solanales|Solanaceae|Solanum|Solanum etuberosum",
        "dataSourcesNum": 27,
        "dataSourceQuality": "HasCuratedSources",
        "matchType": "ExactCanonicalMatch",
        "preferredResults": [
          {
            "dataSourceId": 11,
            "dataSourceTitle": "GBIF Backbone Taxonomy",
            "nameId": "4261d820-2c48-5313-a720-1d90bedc0c6a",
            "name": "Solanum tuberosum Bertero",
            "taxonId": "8555981"
          }
        ],
        "retries": 1
      }

Possible explanation

In fact it appears that, when matching Catalogue of Life, the returned entry is the first row.
@Adafede observed that the entries of Catalogue of Life are in fact ordered, 1 by Rank and 2 by Alphabetical order (in this case Solanum tuberosum Bert. ex Walp. > Solanum tuberosum L. > Solanum tuberosum Poepp. ex Walp.)

Expected behaviour of GNFinder

In these cases, first filter by Name status = Accepted name and then return the corresponding output.
How could this be done? Is it doable on the GNFinder side or should it be taken care of at Catalogue of Life?

Many thanks

Note that this behaviour is observed for a large number of entries. Another example: Pisonia grandis (accepted name) query returns Pisonia umbellifera

{
      "type": "Binomial",
      "verbatim": "Pisonia grandis",
      "name": "Pisonia grandis",
      "odds": 10804591456.75299,
      "odds_details": {
        "Name": {
          "spLen": {
            "7": 4.425451988126818
          },
          "spDict": {
            "WhiteSpecies": 5628.6125203841275
          },
          "spEnd3": {
            "dis": 105.1141511143323
          },
          "PriorOdds": {
            "true": 0.1
          },
          "uniLen": {
            "7": 0.8324223452545503
          },
          "uniDict": {
            "WhiteGenus": 20194.430603370172
          },
          "abbr": {
            "false": 0.8679430877999654
          },
          "uniEnd3": {
            "nia": 2.8282746815509467
          }
        }
      },
      "start": 23593,
      "end": 23608,
      "annotation": "",
      "verification": {
        "dataSourceId": 1,
        "dataSourceTitle": "Catalogue of Life",
        "taxonId": "c4b0b41c2961b29ea3b447b6b903ad68",
        "matchedName": "Pisonia grandis A.Cunn. ex Hook. fil.",
        "currentName": "Pisonia umbellifera (J. \u0026 G. Forst.) Seem.",
        "isSynonym": true,
        "classificationPath": "Plantae|Tracheophyta|Magnoliopsida|Caryophyllales|Nyctaginaceae|Pisonia|Pisonia umbellifera",
        "dataSourcesNum": 18,
        "dataSourceQuality": "HasCuratedSources",
        "matchType": "ExactCanonicalMatch",
        "preferredResults": [
          {
            "dataSourceId": 11,
            "dataSourceTitle": "GBIF Backbone Taxonomy",
            "nameId": "c331072e-c786-5d82-bc3c-c4ff938d6250",
            "name": "Pisonia grandis A.Cunn.",
            "taxonId": "8638411"
          }
        ],
        "retries": 1
      }

As a Remote User I want to access gnfinder's functionality via HTTP API

As a User I want to use improved gnindex API

As a User I want to be able to validate found names

As a User I want to be able to pick preferred data sources to verify names against

Getting Started with OSX

Hi Dmitry, Great to see this and many thanks. I have been trying to get the app up and running in both Mac OSX and Windows but am not getting past first base. In Mac OSX with gnfinder in the applications folder it launches but then simply shows a start up screen. On windows I seem to need the path to launch but am not getting past first base with that either. When you have some time would you mind jotting down some notes in the read me on how to actually get started - for complete idiots like me. Great to see this turning in to a user application and I need to follow up with you by email on something else. Best, Paul

As a Developer I want to have benchmarks in tests

As a User I want to know if a name appears in curated databases

As a User I want gRPC server to return the position of every name-string on a page

MatchType: JSON vs protob

There seems to be a little difference between JSON and protobuf responses. In particular, there is a distinction between ExactMatch and variations of CanonicalMatch. However, in protob responses canonical is no longer reflected, and even two distinct cases are translated to the same type in getMatchType.

An earlier version of the code (I could not determine exactly when it changed) looked like this:

func getMatchType(match string) protob.MatchType {
	switch match {
	case "ExactMatch":
		return protob.MatchType_EXACT
	case "ExactCanonicalMatch":
		return protob.MatchType_CANONICAL_EXACT
	case "FuzzyCanonicalMatch":
		return protob.MatchType_CANONICAL_FUZZY
	case "ExactPartialMatch":
		return protob.MatchType_PARTIAL_EXACT
	case "FuzzyPartialMatch":
		return protob.MatchType_PARTIAL_FUZZY
	}
	return protob.MatchType_NONE
}

Is the current implementation OK?

As a Developer I want to access gnfinder from Ruby or Python as a shared library

Remove botanical authors from uninomials

As a User I want gnfinder verification to work with updated gnindex API

As a User I want an improvement in protobuf output

Problems:

The type field returns a string, and the strings are cryptic. For example, it is hard to understand what ProbablyBinomial or Binomial(nlp) means.
Solution: break it into 2 fields:

cardinality: 0-3 (0 is no match, 1 uninomial, 2 binomial, 3 trinomial)
method: enum array with following possible results: [HEURISTIC] or [HEURISTIC, BAYES]

Add more information to resolution data with fields:
matchedCardinality
matchedCanonicalSimple
matchedCanonicalFull
currentNameCardinality
currentNameCanonicalSimple
currentNameCanonicalFull

As a User I want a NoMatch status when verification failed

As a User and Developer I want cleaner, more usable options for language and heuristic/bayes engines

There are many possible combinations of language and engine options. It makes sense to build them in a way that provides better defaults and gives the flexibility to set new options not when the GNfinder object is created, but at the point of running a text search. I think it makes sense to return to the original settings after running with these options for one text.

No options: run both heuristic and bayes engines, have default language as english, no language detection.

OptBayes: run heuristic, bayes is true or false, language provided by OptLanguage or OptDetect

OptLanguage: run heuristic, bayes depends on OptBayes, have provided language, no language detection

OptDetectLanguage: run heuristic, bayes depends on OptBayes, have detected language, run language detection.

As a User I want to have canonical form of matched name returned

ExactMatch has edit_distance > 0 from time to time

Caused gnames/bhlindex#28

As a Developer I want to be able to test cli application

Wrongly found names (part of names)

@diatomsRcool found the following names that are detected wrongly. I have to update black and grey lists accordingly. Some names we cannot do much about, for example Phylloscopus collybita versus because there are legitimate species with versus as an epithet.

Oikopleura dioica genome (genome to black dict)
Phylloscopus collybita versus (versus is in grey dict already)
Ciona intestinalis revealed (revealed to black dict)
Prochlorococcus isolates (isolates to black dict)
Synechococcus clade (clade to black dict)
Rna genes may (genes to black, may grey already)
A. lituratus data (data is in grey already)
Oikopleura lineage (lineage to black)
Pinoyscincus character (there is 'Poecilasthena character Prout, 1932' *sighs to grey)
Those are the highlights (hm weird cannot reproduce this one)
Age refugia (both Age and refugia exist as genus (in grey dict already) and sp epithet correspondingly)

So in many of these cases we have to rely on verification

As a developer I want to be able to send an array of texts to c-lib and get back an array of answers

As a User of gRPC service I want to set a list of data sources IDs to verify name-strings against

JSON and protobuf output for Odds overflows int32

Change output for Odds to Log10
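
The motivation can be checked with a few lines of Go: an odds value like the one in the Pisonia grandis report above (10804591456.75299) is larger than the int32 maximum, while its base-10 logarithm is a small number. The function name oddsLog10 is an assumption for this sketch and not necessarily the name of the field or function used by the project.

```go
package main

import (
	"fmt"
	"math"
)

// oddsLog10 converts odds to a base-10 logarithm. Like math.Log10, it returns
// -Inf for zero and NaN for negative input.
func oddsLog10(odds float64) float64 {
	return math.Log10(odds)
}

func main() {
	odds := 10804591456.75299 // odds value taken from the report above
	fmt.Println(odds > math.MaxInt32)     // true: does not fit into int32
	fmt.Printf("%.2f\n", oddsLog10(odds)) // 10.03
}
```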

As a Developer I want to include Go modules to make builds more stable

As a User I want to retry failed remote requests for data verification

As a user I want to force Bayes on or off

Currently, we only can force Bayes on, but sometimes we need to switch it off unconditionally as well.

As a User I want to find names in many small documents faster

Currently finding names in an equal amount of text that is split into 500 pages is much slower than finding them in one big text. We need to fix that.

Verification fields show up even when verification is not done

As a User I want to know if there are new species/new combination annotation next to a name.

As a User I want to have similar output from gRPC as from command line gnfinder

As a Developer I want to refactor gnfinder code to make more sensible packages

Currently gnfinder code has all the smells of a Ruby developer who is starting with Go. There is a util package, and there are packages that do not make much sense.

It is better to reorganize this code with a user of the library in mind, so that people are not forced to import many packages to do name finding. Ideally, someone who needs to do name finding has to import just one package, and that package takes care of all the dependencies needed for its functionality.

Tests also need to be reorganized and placed in the packages they belong to.