The wcvpy from alrichardbollans

Spelling corrections

Once the genus has been found, use a fuzzy-matching algorithm such as fuzzywuzzy to match species names, similar to the approach in Taxonstand.
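A minimal sketch of the idea, using difflib from the standard library as a stand-in for fuzzywuzzy. The function name, the toy checklist and the 0.8 cutoff are all assumptions for illustration, not part of the package:

```python
import difflib


def match_species_in_genus(name, checklist, cutoff=0.8):
    """Return the closest checklist binomial sharing the submitted genus, or None."""
    parts = name.split()
    if len(parts) < 2:
        return None
    genus, epithet = parts[0].capitalize(), parts[1].lower()
    # Only compare against epithets recorded under the already-matched genus
    epithets = {}
    for binomial in checklist:
        words = binomial.split()
        if len(words) >= 2 and words[0] == genus:
            epithets[words[1]] = binomial
    close = difflib.get_close_matches(epithet, list(epithets), n=1, cutoff=cutoff)
    return epithets[close[0]] if close else None


# Toy checklist (hypothetical subset of names)
checklist = ['Urechites karwinskii', 'Urechites lutea', 'Aspidosperma subincanum']
print(match_species_in_genus('Urechites karwinsky', checklist))
```

Restricting candidates to the matched genus keeps the comparison set small and avoids fuzzy matches across genera; the cutoff would need tuning against real data.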

Allow specifying wcvp version in name matching methods

Currently the name matching methods, e.g. get_accepted_info_from_names_in_column, use the default version of WCVP given by get_all_taxa. These methods should be updated so that a different version can be specified.

Add support for name matching in phylogenetic trees

tqdm get_accepted_info_from_ids_in_column

Add tqdm to and/or optimise the get_accepted_info_from_ids_in_column method.

This could be done with a simple merge (without constructing the dicts) if get_all_taxa is loaded with the correct columns.
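The merge idea might look roughly like the following. The toy checklist stands in for the output of get_all_taxa, and the column names (`ipni_id`, `accepted_name`, `id`) and ID values are hypothetical:

```python
import pandas as pd

# Toy stand-in for the checklist returned by get_all_taxa; column names are assumed
taxa = pd.DataFrame({
    'ipni_id': ['1-1', '2-2'],
    'accepted_name': ['Aspidosperma subincanum', 'Urechites karwinskii'],
})
df = pd.DataFrame({'id': ['2-2', '9-9', '1-1']})

# A left merge keeps every input row in its original order; unmatched IDs get NaN.
# validate raises an error if an ID appears more than once in the checklist.
merged = df.merge(taxa, how='left', left_on='id', right_on='ipni_id',
                  validate='many_to_one')
print(merged[['id', 'accepted_name']])
```

A single vectorised merge avoids a Python-level loop over rows, which is why it may remove the need for a progress bar altogether.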

Improving direct matches

Include direct matches on names with author strings included (lower-casing everything before comparison).

String tidying

Some ways of tidying input names improve (direct) matching without introducing uncertainty in the match. Some of these have been implemented, e.g. fixing casing and ensuring there are spaces after hybrid characters, but there is probably more to do here, e.g.:

reducing double spaces to single spaces
ensuring spaces after infraspecific epithets and hybrid characters
trying to match when full stops have been removed (many manual inputs omit full stops and then don't match directly with WCVP or via KNMS, e.g. "Aspidosperma subincanum Mart" instead of "Aspidosperma subincanum Mart.")

Note that some of these are purely tidying, i.e. there are no double spaces in WCVP so it is worth removing them. Others are equivalent variations, e.g. removing or adding full stops doesn't change which name should be matched but can help find the match.
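The distinction between pure tidying and equivalent variations could be sketched as below. Both function names are hypothetical, and only the multiplication sign is treated as a hybrid character here:

```python
import re


def tidy_name(name):
    """Apply tidying steps that should not change which name is matched."""
    # Ensure a space follows a hybrid character, e.g. '×Sarcorhiza' -> '× Sarcorhiza'
    name = re.sub(r'×(?=\S)', '× ', name)
    # Reduce runs of spaces to a single space and trim the ends
    return re.sub(r' {2,}', ' ', name).strip()


def full_stop_variants(name):
    """Return the name both with and without a trailing full stop."""
    stripped = name.rstrip('.')
    return {stripped, stripped + '.'}


print(tidy_name('Aspidosperma  subincanum Mart'))
print(full_stop_variants('Aspidosperma subincanum Mart'))
```

Tidying returns a single cleaned string, whereas an equivalent variation returns several candidates, each of which can be tried as a direct match.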

Extend use of 'equivalent variations'

Common misspellings

Some common prefixes/suffixes are spelt differently from those in the accepted names, which makes name matching difficult. For example, Urechites karwinsky in (B. del Castillo et al., Tetrahedron Lett., 1970, 11, 1219-1220) refers to Urechites karwinskii Müll.Arg. This name won't be matched appropriately in the current version. It is also not matched by the Kew Name Matching Service.
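One possible approach, sketched under assumptions: treat certain epithet endings as interchangeable and generate the variants to try as direct matches. The group of endings below is illustrative only and is not taken from the package:

```python
# Hypothetical group of interchangeable epithet endings; extend as needed
ENDING_GROUPS = [('ii', 'i', 'y')]


def ending_variants(epithet):
    """Return the epithet together with variants that swap interchangeable endings."""
    variants = {epithet}
    for group in ENDING_GROUPS:
        # Try longer endings first so 'ii' is stripped before 'i'
        for ending in sorted(group, key=len, reverse=True):
            if epithet.endswith(ending):
                stem = epithet[:-len(ending)]
                variants.update(stem + other for other in group)
                break
    return variants


print(sorted(ending_variants('karwinsky')))
```

Each variant can then be looked up in the checklist under the submitted genus, so a match is only accepted when the variant exists as a real name.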

Improve logs and console messages

Logs and console messages could both be more informative

Moving portals to POWO

May need to update taxa methods related to WCVP following:
https://powo.science.kew.org/upcoming-changes

Family resolutions

Allowing a 'Family' column in the input data could speed up resolutions and make them more accurate.

Space-like characters

Characters in the data that look like spaces but are not can cause more conservative matching, as they differ from the characters in WCVP data.
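A small sketch of one way to handle this, assuming it is safe to map every Unicode whitespace character to an ordinary space. The function name is hypothetical:

```python
import re


def normalise_spaces(name):
    """Replace any run of Unicode whitespace with a single ordinary space."""
    # On str patterns in Python 3, \s also matches characters such as the
    # non-breaking space (U+00A0) and the em space (U+2003)
    return re.sub(r'\s+', ' ', name).strip()


print(normalise_spaces('Aspidosperma\u00a0subincanum'))
```

Characters that are not classed as whitespace, such as the zero-width space (U+200B), are not covered by this and would need separate handling.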

Allow variety of input formats

Input currently has to be a pandas DataFrame plus the name of the column containing the names, but it would be useful to allow other formats, e.g. a simple list of names.

Implement degrees of strictness in name matching

Allow the strictness of matching to be specified. In particular, the final stages could be switched on or off, and each stage could expose further filters/parameters.

http requests

Old versions will raise "KeyError: 'last-modified'" when trying to access an up-to-date zip version, as the request gets a 301 response.

Known issues

Some examples of known issues are given in examples_to_fix.csv in unit_tests/test_inputs.

Optimisation

The final step in get_accepted_info_from_names_in_column of appending the found accepted info to the original dataframe can be very slow for large dataframes and this could be optimised.

Known issues

The following (hard cases) fail in test_name_matching.py in unit_tests:

From test_capitals_db.csv the following resolve to 'Rothmannia':
- 'ROTHMANIA ENGLERIANA (K. SCHUM.) KEAV'
- 'ROTHMANIA ENGLERIANA (K. Schum) Keav'
- 'ROTHMANNIA ENGLERIANA (K. SCHUM.) KEAV'
From hybrid_list.csv 'Sarcorhiza' resolves to NaN

Temp output final resolutions

Some rows in the temp output of final resolutions are missing values (this does not affect the final output, however).

Distribution data

Include support for downloading distribution data

Matching names in Phylogenetic Trees

The following is some experimental, unoptimised code for matching names in a phylogenetic tree in newick format:

import os
import re

import pandas as pd
# Import paths below are assumed; adjust to the installed version of the package
from wcvpy.wcvp_download import wcvp_accepted_columns
from wcvpy.wcvp_name_matching import get_accepted_info_from_names_in_column


def substitute_name_in_tree(tree_string: str, old_name: str, new_name: str) -> str:
    # Note ape will not read spaces, so add underscores back to names
    new_label = new_name.replace(' ', '_')
    # Escape the old label so characters such as '.' are matched literally, and use a
    # function replacement so backslashes in the new label are not treated as escapes
    pattern = r'\b{}(?=;|:)'.format(re.escape(old_name))
    return re.sub(pattern, lambda _: new_label, tree_string)


def relabel_tree(tree_file: str, standard_tree_file: str, get_binomial_from_label):
    # get_binomial_from_label is not defined in this snippet: pass in a function that
    # converts a tip label to a binomial name
    with open(tree_file, "r") as f:
        tree_string = f.readline()
    # From https://stackoverflow.com/questions/45668107/python-regex-parsing-newick-format
    rx = r'[(),]+([^;:]+)\b'
    name_list = re.findall(rx, tree_string)
    binomial_names = [get_binomial_from_label(x) for x in name_list]
    zipped = list(zip(name_list, binomial_names))

    df = pd.DataFrame(zipped, columns=['tree_name', 'binomial_name'])

    acc_name_df = get_accepted_info_from_names_in_column(df, 'binomial_name', match_level='fuzzy')
    # Write the matches to disk and read them back
    acc_name_df.to_csv(os.path.join('inputs', 'acc_name_tree.csv'))
    acc_name_df = pd.read_csv(os.path.join('inputs', 'acc_name_tree.csv'), index_col=0)

    # Catch words in tree string by left hand word boundaries (generic) and right hand ; or : characters
    accepted_name_col = wcvp_accepted_columns['name']
    for index, row in acc_name_df.iterrows():
        print(f'{index} out of {len(acc_name_df)}')
        if isinstance(row[accepted_name_col], str):
            tree_string = substitute_name_in_tree(tree_string, row['tree_name'], row[accepted_name_col])
        else:
            # If not matched, use old name. This could be changed to a generic string to be dropped later.
            tree_string = substitute_name_in_tree(tree_string, row['tree_name'], row['binomial_name'])

    with open(standard_tree_file, "w") as f:
        f.write(tree_string)

However, in this case resolution requirements may differ depending on the planned phylogenetic analyses. For example, resolving species with misspelled/unknown species epithets to genera (as in 'full' matching) could cause issues when trying to induce genus-level subtrees.

KNMS -> OpenRefine

AFAIK KNMS is currently unsupported (and has been for a while!).
It may be useful to include OpenRefine as a matching step for when KNMS is eventually taken offline.

WCVP issues

Here I highlight some potential issues within WCVP; in particular, the 'string issues' below hinder name matching. These are by no means exhaustive, but may indicate areas to explore further.

Some records in WCVP are not given accepted information
Sometimes POWO and WCVP don't agree (mostly due to short lag in POWO updates?)
Accepted names are not always unique (without author information) e.g. Helichrysum oligocephalum
Some taxa are not given IPNI IDs, including some accepted taxa
Some (accepted) taxa are not given any author information e.g. Caralluma adscendens var. adscendens

String issues

These issues are mostly resolved when matching with the automatchnames package

Some taxon names and author strings include double spaces:
Some taxon names end in full stops and have other issues:
Some entries have strange use of full stops:

Fix install

Fix the install so that the package is installed together with its requirements.

Comparison tests

Write tests to compare this implementation with other methods (e.g. those surveyed in "Grenié, Matthias, Emilio Berti, Juan Carvajal‐Quintero, Gala Mona Louise Dädlow, Alban Sagouis, and Marten Winter. “Harmonizing Taxon Names in Biodiversity Data: A Review of Tools, Databases and Best Practices.” Methods in Ecology and Evolution, February 18, 2022, 2041-210X.13802. https://doi.org/10.1111/2041-210X.13802.")

Optimisations

Optimise loading/parsing of the checklist -- 20s is slow
Compression and decompression of parsed WCVP is slow, either improve compression or remove this step
Allow using checklist as a parameter to avoid repeated loading
The final step in get_accepted_info_from_names_in_column of appending the found accepted info to the original dataframe can be very slow for large dataframes and this could be optimised.
Improve speed of autoresolution by creating dataframe with potential names (e.g. split on words) and then merge with checklist
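The last idea above, generating potential names by splitting on words, could be sketched as follows. A plain set lookup is used here in place of the DataFrame merge with the checklist, and both function names are hypothetical:

```python
def candidate_names(name):
    """Return progressively shorter word prefixes of a submitted name, longest first."""
    words = name.split()
    return [' '.join(words[:n]) for n in range(len(words), 0, -1)]


def first_checklist_hit(name, checklist_names):
    """Return the longest candidate that appears in the checklist, or None."""
    for candidate in candidate_names(name):
        if candidate in checklist_names:
            return candidate
    return None


# Toy checklist (hypothetical subset of names)
checklist_names = {'Aspidosperma subincanum', 'Aspidosperma'}
print(first_checklist_hit('Aspidosperma subincanum Mart ex DC', checklist_names))
```

With a DataFrame, the same candidates could be exploded into one row per candidate and merged with the checklist in a single operation, keeping the longest hit per input name.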
