wcvpy's Issues

Spelling corrections

Once the genus has been found, use a fuzzy string-matching library such as fuzzywuzzy to match species names, similar to the approach in Taxonstand.
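
A minimal sketch of this idea, assuming a toy checklist held in a dict. It uses the standard library's difflib in place of fuzzywuzzy so that it runs without extra dependencies; the CHECKLIST contents, the fuzzy_match_species name and the 0.8 cutoff are illustrative assumptions, not part of wcvpy.

```python
import difflib
from typing import Dict, List, Optional

# Hypothetical toy checklist: genus -> species epithets. In practice this
# would be built from the WCVP taxa table.
CHECKLIST: Dict[str, List[str]] = {
    "Urechites": ["karwinskii", "andrieuxii"],
    "Rothmannia": ["engleriana", "capensis"],
}


def fuzzy_match_species(name: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest 'Genus epithet' in the checklist, or None."""
    parts = name.split()
    if len(parts) < 2:
        return None
    genus, epithet = parts[0].capitalize(), parts[1].lower()
    candidates = CHECKLIST.get(genus)
    if not candidates:
        return None  # genus not found, so do not guess a species
    # Only epithets within the matched genus are compared
    close = difflib.get_close_matches(epithet, candidates, n=1, cutoff=cutoff)
    return f"{genus} {close[0]}" if close else None


print(fuzzy_match_species("Urechites karwinsky"))
```

Restricting the comparison to epithets within an already matched genus keeps the candidate set small and limits the chance of a fuzzy match jumping to an unrelated taxon.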

Allow specifying wcvp version in name matching methods

Currently the name matching methods, e.g. get_accepted_info_from_names_in_column, use the default version of WCVP returned by get_all_taxa. These methods should be updated so that a different version can be specified.

tqdm get_accepted_info_from_ids_in_column

Add tqdm and/or optimise get_accepted_info_from_ids_in_column method.

Could be done with a simple merge (without constructing the dicts) if get_all_taxa is loaded with the correct columns
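
A rough sketch of the merge-based approach, assuming the checklist has already been loaded with an ID column and the accepted-information columns. The column names and the tiny tables below are made up for illustration and are not necessarily those returned by get_all_taxa.

```python
import pandas as pd

# Hypothetical miniature checklist; the column names are illustrative only.
taxa = pd.DataFrame({
    "ipni_id": ["1-1", "2-1", "3-1"],
    "accepted_name": ["Aus bus", "Aus bus", "Cus dus"],
    "accepted_ipni_id": ["1-1", "1-1", "3-1"],
})

# Hypothetical input data with a column of IDs to resolve
records = pd.DataFrame({"sample": ["a", "b", "c"], "id": ["2-1", "3-1", "9-9"]})

# A left merge keeps every input row; IDs missing from the checklist get NaN
merged = records.merge(taxa, how="left", left_on="id", right_on="ipni_id")
print(merged[["sample", "id", "accepted_name"]])
```

Because the merge is a single vectorised operation, no per-row progress bar should be needed for this step.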

String tidying

There are some ways of tidying input names which improve (direct) matching without introducing uncertainty in the match. Some of these have been implemented, e.g. fixing casing and ensuring there are spaces after hybrid characters, but there is probably more to do here, e.g.

  • reducing double spaces to single spaces
  • ensuring spaces after infraspecific epithets and hybrid characters
  • trying to match when full stops have been removed (many manual inputs omit full stops and then don't directly match with WCVP or via KNMS, e.g. "Aspidosperma subincanum Mart" instead of "Aspidosperma subincanum Mart.")

Note that some of these are purely tidying i.e. there are no double spaces in WCVP so it is worth removing them. Others are equivalent variations e.g. removing or adding full stops doesn't change which name should be matched to but can help find the match.

  • extend use of 'equivalent variations'
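
The tidying steps and the 'equivalent variations' idea above could be sketched as follows. This is only an illustration: the function names, the set of hybrid characters and the list of rank markers are assumptions, and the existing wcvpy implementation may differ.

```python
import re
from typing import List

HYBRID_CHARS = "×+"  # assumed set of hybrid markers
INFRA_RANKS = ("subsp.", "var.", "f.")  # assumed, not exhaustive


def tidy_name(name: str) -> str:
    """Tidying that should not change which name is matched."""
    # Ensure a space after hybrid characters, e.g. '×Hoodiapelia'
    name = re.sub(r"([{}])(?=\S)".format(re.escape(HYBRID_CHARS)), r"\1 ", name)
    # Ensure a space after rank markers, e.g. 'var.adscendens'
    for rank in INFRA_RANKS:
        name = re.sub(r"(?<=\s){}(?=[a-z])".format(re.escape(rank)), rank + " ", name)
    # Reduce runs of spaces to a single space and trim the ends
    return re.sub(r" {2,}", " ", name).strip()


def equivalent_variations(name: str) -> List[str]:
    """Variants that should refer to the same name, in the order to try them."""
    tidied = tidy_name(name)
    if not tidied:
        return []
    # Adding or removing a final full stop does not change the intended name
    other = tidied[:-1] if tidied.endswith(".") else tidied + "."
    return [tidied, other]


print(equivalent_variations("Aspidosperma  subincanum Mart"))
```

Keeping pure tidying separate from equivalent variations means the tidied form can always replace the input, while the variations are only tried when the tidied form fails to match.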

Common misspellings

There are some common prefixes/suffixes which are spelt differently to the given accepted names and make name matching difficult. For example, Urechites karwinsky in (B. del Castillo et al., Tetrahedron Lett., 1970, 11, 1219 - 1220.) refers to Urechites karwinskii Müll.Arg. This name won't be appropriately matched in the current version. It is also not matched by the Kew Name Matching Service.
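
One possible way to handle such cases is to generate spelling variants of the species epithet and try each against the checklist. The sketch below assumes a small, hand-picked table of ending swaps; ENDING_SWAPS and the function names are hypothetical and would need extending with real examples.

```python
from typing import Dict, List

# Assumed, non-exhaustive ending swaps (found ending -> alternatives).
# Longer endings come first so that 'ii' is checked before 'i'.
ENDING_SWAPS: Dict[str, List[str]] = {"ii": ["i"], "i": ["ii"], "y": ["ii", "i"]}


def epithet_variants(epithet: str) -> List[str]:
    """Alternative spellings of an epithet based on its ending."""
    for ending, alternatives in ENDING_SWAPS.items():
        if epithet.endswith(ending):
            stem = epithet[: -len(ending)]
            return [stem + alt for alt in alternatives]
    return []


def name_variants(binomial: str) -> List[str]:
    """Apply epithet_variants to the second word of a 'Genus epithet' string."""
    parts = binomial.split()
    if len(parts) != 2:
        return []
    return [f"{parts[0]} {variant}" for variant in epithet_variants(parts[1])]


print(name_variants("Urechites karwinsky"))
```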

Family resolutions

Allowing addition of a 'Family' column in input data could speed up resolutions and make them more accurate

Space-like characters

Characters that may look like spaces in data can cause more conservative matching, as the characters differ from WCVP data
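
A small sketch of how such characters might be normalised before matching. The ZERO_WIDTH selection and the normalise_spaces name are assumptions for illustration; whether zero-width characters should be dropped or replaced by a space would depend on the data.

```python
# Zero-width characters that str.split() does not treat as whitespace
# (an assumed, non-exhaustive selection).
ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"


def normalise_spaces(name: str) -> str:
    """Replace space-like characters with single ordinary spaces."""
    # Drop zero-width characters entirely
    name = name.translate({ord(ch): None for ch in ZERO_WIDTH})
    # str.split() with no arguments splits on any Unicode whitespace,
    # including no-break (U+00A0) and thin (U+2009) spaces
    return " ".join(name.split())


print(normalise_spaces("Aspidosperma\u00a0subincanum\u2009Mart."))
```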

Allow variety of input formats

Input currently has to be pandas dataframe + name of column with names in, but would be useful to allow varied formats e.g. simple list of names

http requests

Old versions will raise KeyError: 'last-modified' when trying to access an up-to-date zip version, because the request gets a 301 response.
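
As an illustration of a more defensive check, the sketch below inspects the status code before reading the header, so a redirect produces an informative error rather than a KeyError. The function name is hypothetical, and it assumes the status code and headers (with lower-case keys) have already been taken from the response object.

```python
from typing import Mapping, Optional


def get_last_modified(status_code: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the 'last-modified' header, or None if the server did not send it."""
    if 300 <= status_code < 400:
        location = headers.get("location", "an unknown location")
        raise RuntimeError(
            f"Request was redirected ({status_code}) to {location}; "
            "the download URL may be out of date."
        )
    # .get avoids a KeyError when the header is missing
    return headers.get("last-modified")


print(get_last_modified(200, {"last-modified": "Mon, 01 Jan 2024 00:00:00 GMT"}))
```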

Known issues

Some examples of known issues are given in examples_to_fix.csv in unit_tests/test_inputs.

Optimisation

The final step in get_accepted_info_from_names_in_column of appending the found accepted info to the original dataframe can be very slow for large dataframes and this could be optimised.

Known issues

The following (hard cases) fail in test_name_matching.py in unit_tests:

  • From test_capitals_db.csv the following resolve to 'Rothmannia':
    • 'ROTHMANIA ENGLERIANA (K. SCHUM.) KEAV'
    • 'ROTHMANIA ENGLERIANA (K. Schum) Keav'
    • 'ROTHMANNIA ENGLERIANA (K. SCHUM.) KEAV'
  • From hybrid_list.csv 'Sarcorhiza' resolves to NaN

Matching names in Phylogenetic Trees

The following is some experimental, unoptimised code for matching names in a phylogenetic tree in newick format:

import os
import re

import pandas as pd

# get_accepted_info_from_names_in_column and wcvp_accepted_columns come from wcvpy.
# get_binomial_from_label, tree_file and standard_tree_file are assumed to be defined elsewhere.


def substitute_name_in_tree(tree_string: str, old_name: str, new_name: str) -> str:
    # Note ape will not read spaces, so add underscores back to names
    replacement = new_name.replace(' ', '_')
    # Escape old_name so that characters such as '.' are matched literally, and pass the
    # replacement as a function so that it is never interpreted as a regex template
    return re.sub(r'\b{}(?=;|:)'.format(re.escape(old_name)), lambda _: replacement, tree_string)


def relabel_tree():
    # Only the first line is read, so the newick string must be on a single line
    with open(tree_file, "r") as f:
        tree_string = f.readline()
    # From https://stackoverflow.com/questions/45668107/python-regex-parsing-newick-format
    rx = r'[(),]+([^;:]+)\b'
    name_list = re.findall(rx, tree_string)
    binomial_names = [get_binomial_from_label(x) for x in name_list]
    zipped = list(zip(name_list, binomial_names))

    df = pd.DataFrame(zipped, columns=['tree_name', 'binomial_name'])

    acc_name_df = get_accepted_info_from_names_in_column(df, 'binomial_name', match_level='fuzzy')
    # Save the matches and read them back, so that they can be inspected
    acc_name_df.to_csv(os.path.join('inputs', 'acc_name_tree.csv'))
    acc_name_df = pd.read_csv(os.path.join('inputs', 'acc_name_tree.csv'), index_col=0)

    # Catch words in tree string by left hand word boundaries (generic) and right hand ; or : characters
    for index, row in acc_name_df.iterrows():
        print(f'{index} out of {len(acc_name_df)}')
        if isinstance(row[wcvp_accepted_columns['name']], str):
            tree_string = substitute_name_in_tree(tree_string, row['tree_name'], row[wcvp_accepted_columns['name']])
        else:
            # If not matched, use old name. This could be changed to a generic string to be dropped later.
            tree_string = substitute_name_in_tree(tree_string, row['tree_name'], row['binomial_name'])

    with open(standard_tree_file, "w") as f:
        f.write(tree_string)

However, in this case resolution requirements may be different depending on planned phylogenetic analyses. For example resolving species with misspelled/unknown species epithets to genera (as in 'full' matching) could cause issues if trying to induce genus-level subtrees.

KNMS -> OpenRefine

AFAIK KNMS is currently unsupported (and has been for a while!).
It may be useful to include openrefine as a matching step for when KNMS is eventually taken offline.

WCVP issues

Here I highlight some potential issues within WCVP, in particular the 'string issues' below hinder name matching. These are by no means exhaustive, but may indicate areas to explore further.

  • Some records in WCVP are not given accepted information
  • Sometimes POWO and WCVP don't agree (mostly due to short lag in POWO updates?)
  • Accepted names are not always unique (without author information) e.g. Helichrysum oligocephalum
  • Some taxa are not given ipni ids, including some accepted taxa
  • Some (accepted) taxa are not given any author information e.g. Caralluma adscendens var. adscendens

String issues

These issues are mostly resolved when matching with the automatchnames package

  • Some taxon names and author strings include double spaces
  • Some taxon names end in full stops and have other issues
  • Some entries have strange use of full stops

Comparison tests

Write tests to compare this implementation with other methods (e.g. those surveyed in "Grenié, Matthias, Emilio Berti, Juan Carvajal‐Quintero, Gala Mona Louise Dädlow, Alban Sagouis, and Marten Winter. “Harmonizing Taxon Names in Biodiversity Data: A Review of Tools, Databases and Best Practices.” Methods in Ecology and Evolution, February 18, 2022, 2041-210X.13802. https://doi.org/10.1111/2041-210X.13802.")

Optimisations

  • Optimise loading/parsing of the checklist -- 20s is slow
  • Compression and decompression of parsed WCVP is slow, either improve compression or remove this step
  • Allow using checklist as a parameter to avoid repeated loading
  • The final step in get_accepted_info_from_names_in_column of appending the found accepted info to the original dataframe can be very slow for large dataframes and this could be optimised.
  • Improve speed of autoresolution by creating dataframe with potential names (e.g. split on words) and then merge with checklist
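
The last point above could look something like the following sketch, which builds candidate names from the leading words of the input and keeps the longest one found in the checklist. A plain set lookup stands in for the dataframe merge here, and the function names and toy checklist are assumptions for illustration.

```python
from typing import List, Optional, Set


def candidate_names(name: str) -> List[str]:
    """Leading-word prefixes of a name, longest first."""
    words = name.split()
    return [" ".join(words[:i]) for i in range(len(words), 0, -1)]


def longest_match(name: str, checklist: Set[str]) -> Optional[str]:
    """Return the longest candidate present in the checklist, or None."""
    return next((c for c in candidate_names(name) if c in checklist), None)


# Hypothetical checklist of taxon names without authors
checklist = {"Aspidosperma", "Aspidosperma subincanum"}
print(longest_match("Aspidosperma subincanum Mart. ex A.DC.", checklist))
```

With dataframes, the same idea would amount to exploding each input name into one row per candidate and merging once against the checklist, instead of resolving names one at a time.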

× [Genus] in WCVP

In WCVP e.g. × Hoodiapelia is given as accepted name and taxon name but genus is listed as Hoodiapelia, which causes confusion.
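
When comparing the two fields, one workaround is to strip the hybrid marker before extracting the genus. The helper below is a hypothetical sketch and is not a wcvpy function; the set of hybrid markers is an assumption.

```python
def genus_without_hybrid_marker(taxon_name: str) -> str:
    """Return the first word of a name, ignoring a leading hybrid marker."""
    # Remove leading hybrid markers and whitespace, e.g. '× Hoodiapelia'
    stripped = taxon_name.lstrip("×+ \t")
    words = stripped.split()
    return words[0] if words else ""


print(genus_without_hybrid_marker("× Hoodiapelia"))
```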

General Improvements

  • Improve outputs to align with PfH, in particular include 'matched to' (with authors).
  • If Kew sftp is down, the package won't load even a local copy of the checklist.
  • Improve uninstall by adding dummy wcvp file
  • Output matched wcvp id
  • Include authors of accepted name?
  • Improve 'matched_by' output:
    • some taxa are not given author information in WCVP (e.g. Caralluma adscendens var. adscendens) and these are being incorrectly given 'direct_match_w_author'
    • Output a column with the match state e.g. 'unique', 'ambiguous' or 'unmatched'
    • Add method to summarise matched_by column of output
  • When first running the package, if the WCVP download is interrupted the associated zip file is unusable and an error is raised on the next usage. Fix this by catching errors and redownloading.
  • _capitalize_first_letter_of_taxon method raises error when inputting ''
  • Add checks for all input string parameters e.g. catch errors in spelling of family names, taxon ranks etc.
  • Improve handling of different encodings
  • String cleaning:
    • Use OpenRefine string transformers?
    • remove characters like "
    • remove full stops at end of entire string (both in given name and wcvp) -- very common that authors are given with/without full stops which can deter matching
  • Add extra steps prior to autoresolution: (1) using openrefine (2) some sort of fuzzy matching
  • When genus has been found, use algorithm like fuzzywuzzy to match species names. Similar to approach in taxonstand
  • Improve support for common misspellings e.g. y -> ii. OpenRefine improves this but still some issues
  • Input currently has to be pandas dataframe + name of column with names in, but would be useful to allow varied formats e.g. simple list of names
  • Add versioning to distribution lists
  • Add distribution plotting methods
  • Add a genus column to outputs with acc_data['accepted_genus'] = acc_data[wcvp_accepted_columns['name']].apply(get_genus_from_full_name)
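
For the interrupted-download item in the list above, one possible fix is to validate the zip file before use and fetch it again if it is unreadable. In this sketch, ensure_valid_zip and the download callable are hypothetical stand-ins for whatever the package uses to fetch the checklist.

```python
import os
import zipfile
from typing import Callable


def ensure_valid_zip(path: str, download: Callable[[str], None]) -> None:
    """Re-download `path` if it is missing or not a readable zip archive."""
    if os.path.exists(path):
        try:
            with zipfile.ZipFile(path) as archive:
                # testzip returns None when every member passes its CRC check
                if archive.testzip() is None:
                    return  # archive is intact, nothing to do
        except zipfile.BadZipFile:
            pass  # truncated or corrupt, e.g. after an interrupted download
        os.remove(path)
    download(path)
```

Calling a check like this before the checklist is loaded would mean a corrupt file is replaced rather than raising an error on every subsequent use.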

KRS

Could use epithets from KRS to improve matching
