alrichardbollans / wcvpy
Packages for downloading the World Checklist of Vascular Plants (WCVP) and resolving plant names to it
License: GNU General Public License v3.0
Once the genus has been found, use a fuzzy-matching algorithm such as fuzzywuzzy to match species names, similar to the approach in Taxonstand.
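A minimal sketch of this idea, using the standard library's difflib in place of fuzzywuzzy. The function name, the lookup of epithets per genus, and the 0.8 cutoff are illustrative assumptions and not part of wcvpy.

```python
import difflib
from typing import Dict, List, Optional


def fuzzy_match_epithet(name: str, epithets_by_genus: Dict[str, List[str]],
                        cutoff: float = 0.8) -> Optional[str]:
    """Return the closest 'Genus epithet' for a name whose genus is already known.

    `epithets_by_genus` is a hypothetical lookup of known epithets per genus.
    """
    parts = name.split()
    if len(parts) < 2:
        return None
    genus, epithet = parts[0], parts[1]
    candidates = epithets_by_genus.get(genus, [])
    # get_close_matches ranks candidates by difflib's similarity ratio
    close = difflib.get_close_matches(epithet, candidates, n=1, cutoff=cutoff)
    return f'{genus} {close[0]}' if close else None


# Illustrative lookup containing only the epithet mentioned in these issues
known = {'Urechites': ['karwinskii']}
print(fuzzy_match_epithet('Urechites karwinsky', known))
```

Restricting candidates to the matched genus keeps the comparison small and avoids fuzzy matches across unrelated genera.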
Currently the name matching methods, e.g. get_accepted_info_from_names_in_column, use the default version of WCVP given by get_all_taxa. However, this should be updated so that a different version can be specified in the methods.
Add tqdm and/or optimise the get_accepted_info_from_ids_in_column method. This could be done with a simple merge (without constructing the dicts) if get_all_taxa is loaded with the correct columns.
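The merge-based approach could look roughly like the following sketch. The `taxa` frame is a stand-in for the table returned by get_all_taxa, and all column names and values here are placeholders rather than the real WCVP schema.

```python
import pandas as pd

# Stand-in for the taxa table; column names and values are placeholders.
taxa = pd.DataFrame({
    'plant_name_id': [1, 2, 3],
    'accepted_name': ['Aus bus', 'Aus bus', 'Cus dus'],
    'accepted_family': ['Aaceae', 'Aaceae', 'Caceae'],
})

# Input data with an ID column to resolve.
df = pd.DataFrame({'id': [3, 1, 99]})

# A left merge keeps every input row in order; unmatched IDs get NaN
# in the accepted columns, with no per-row dictionary lookups.
resolved = df.merge(taxa, how='left', left_on='id', right_on='plant_name_id')
print(resolved)
```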
Include direct matches with author names included (lower case all).
There are some ways of tidying input names which improve (direct) matching and don't introduce uncertainty in the match. Some of these have been implemented e.g. fixing casing, ensuring there are spaces after hybrid characters, but there is probably more to do here e.g.
Note that some of these are purely tidying i.e. there are no double spaces in WCVP so it is worth removing them. Others are equivalent variations e.g. removing or adding full stops doesn't change which name should be matched to but can help find the match.
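As a rough illustration of tidying that should not introduce uncertainty, the sketch below ensures a space after hybrid characters and removes double spaces. The function name and the choice of '×' and '+' as hybrid characters are assumptions and do not reflect wcvpy's actual implementation.

```python
import re

HYBRID_CHARS = '×+'  # assumed hybrid and graft-chimaera markers


def tidy_name(name: str) -> str:
    """Apply tidying steps that should not change which name is meant."""
    # Ensure a space follows a hybrid character, e.g. '×Hoodiapelia' -> '× Hoodiapelia'
    name = re.sub(r'([{}])(?=\S)'.format(re.escape(HYBRID_CHARS)), r'\1 ', name)
    # Collapse runs of whitespace and trim the ends
    name = re.sub(r'\s+', ' ', name).strip()
    return name


print(tidy_name('×Hoodiapelia'))
print(tidy_name('  Rothmannia   longiflora '))
```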
There are some common prefixes/suffixes which are spelt differently to given accepted names and make name matching difficult. For example Urechites karwinsky in (B. del Castillo et al., Tetrahedron Lett., 1970, 11, 1219 - 1220.) refers to Urechites karwinskii Müll.Arg.. This name won't be appropriately matched in the current version. It is also not matched by the Kew Name Matching Service
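One possible way to handle such cases is to generate alternative spellings of an epithet by swapping commonly confused endings, then try direct matching on each variant. The sketch below is only an illustration: the list of endings is hypothetical and would need curating, and the function is not part of wcvpy.

```python
from typing import List

# Hypothetical pairs of interchangeable epithet endings.
ENDING_VARIANTS = [('y', 'ii'), ('i', 'ii')]


def epithet_variants(epithet: str) -> List[str]:
    """Return alternative spellings made by swapping one ending for another."""
    variants: List[str] = []
    for a, b in ENDING_VARIANTS:
        for old, new in ((a, b), (b, a)):
            if not epithet.endswith(old):
                continue
            # Do not lengthen an ending that is already in its longer form,
            # e.g. avoid turning '-ii' into '-iii' via the 'i' -> 'ii' rule.
            if len(new) > len(old) and epithet.endswith(new):
                continue
            candidate = epithet[:len(epithet) - len(old)] + new
            if candidate not in variants:
                variants.append(candidate)
    return variants


print(epithet_variants('karwinsky'))
```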
Logs and console messages could both be more informative
May need to update taxa methods related to WCVP following:
https://powo.science.kew.org/upcoming-changes
Allowing addition of a 'Family' column in input data could speed up resolutions and make them more accurate
Characters that may look like spaces in data can cause more conservative matching, as the characters differ from WCVP data
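A sketch of one way to normalise such characters before matching, using only the standard library. The function name is hypothetical, and treating every Unicode format character as removable is an assumption that may be too aggressive for some data.

```python
import re
import unicodedata


def normalise_spaces(name: str) -> str:
    """Replace Unicode space separators with ASCII spaces and drop format characters."""
    cleaned = []
    for ch in name:
        category = unicodedata.category(ch)
        if category == 'Zs':    # any space separator, e.g. U+00A0 no-break space
            cleaned.append(' ')
        elif category == 'Cf':  # format characters, e.g. U+200B zero-width space
            continue
        else:
            cleaned.append(ch)
    # Collapse any repeated spaces created by the substitution
    return re.sub(r' +', ' ', ''.join(cleaned)).strip()


print(normalise_spaces('Rothmannia\u00a0longiflora'))
```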
Input currently has to be pandas dataframe + name of column with names in, but would be useful to allow varied formats e.g. simple list of names
Allow for specifying how strict matching is done. Particularly with final stages, can filter whether these are done or not. Also within each stage there could be more filters/parameters.
Old versions will raise 'KeyError: 'last-modified'' when trying to access an up to date zip version as the request gets a 301 response
Some examples of known issues are given in examples_to_fix.csv in unit_tests/test_inputs.
The final step in get_accepted_info_from_names_in_column of appending the found accepted info to the original dataframe can be very slow for large dataframes and this could be optimised.
The following (hard cases) fail in test_name_matching.py in unit_tests:
test_capitals_db.csv
the following resolve to 'Rothmannia':
hybrid_list.csv
'Sarcorhiza' resolves to NaN
Some rows in final resolution temp output are missing values (does not affect final output however)
Include support for downloading distribution data
The following is some experimental, unoptimised code for matching names in a phylogenetic tree in newick format:
import os
import re

import pandas as pd

# get_accepted_info_from_names_in_column and wcvp_accepted_columns come from wcvpy;
# tree_file, standard_tree_file and get_binomial_from_label are assumed to be defined elsewhere.


def substitute_name_in_tree(tree_string: str, old_name: str, new_name: str) -> str:
    # Catch names in the tree string by a left hand word boundary (generic) and a right hand ; or : character.
    # old_name is escaped so that characters such as '.' are matched literally.
    pattern = r'\b{}(?=;|:)'.format(re.escape(old_name))
    # Note ape will not read spaces, so add underscores back to names
    return re.sub(pattern, new_name.replace(' ', '_'), tree_string)


def relabel_tree():
    with open(tree_file, 'r') as f:
        tree_string = f.readline()
    # From https://stackoverflow.com/questions/45668107/python-regex-parsing-newick-format
    rx = r'[(),]+([^;:]+)\b'
    name_list = re.findall(rx, tree_string)
    binomial_names = [get_binomial_from_label(x) for x in name_list]
    df = pd.DataFrame(list(zip(name_list, binomial_names)), columns=['tree_name', 'binomial_name'])
    acc_name_df = get_accepted_info_from_names_in_column(df, 'binomial_name', match_level='fuzzy')
    acc_name_df.to_csv(os.path.join('inputs', 'acc_name_tree.csv'))
    acc_name_df = pd.read_csv(os.path.join('inputs', 'acc_name_tree.csv'), index_col=0)
    for index, row in acc_name_df.iterrows():
        print(f'{index} out of {len(acc_name_df)}')
        accepted_name = row[wcvp_accepted_columns['name']]
        if isinstance(accepted_name, str):
            tree_string = substitute_name_in_tree(tree_string, row['tree_name'], accepted_name)
        else:
            # If not matched, use old name. This could be changed to a generic string to be dropped later.
            tree_string = substitute_name_in_tree(tree_string, row['tree_name'], row['binomial_name'])
    with open(standard_tree_file, 'w') as f:
        f.write(tree_string)
However, in this case resolution requirements may be different depending on planned phylogenetic analyses. For example resolving species with misspelled/unknown species epithets to genera (as in 'full' matching) could cause issues if trying to induce genus-level subtrees.
AFAIK KNMS is currently unsupported (and has been for a while!).
It may be useful to include OpenRefine as a matching step for when KNMS is eventually taken offline.
Here I highlight some potential issues within WCVP, in particular the 'string issues' below hinder name matching. These are by no means exhaustive, but may indicate areas to explore further.
These issues are mostly resolved when matching with the automatchnames package
Fix the install so that the package installs together with its requirements.
Write tests to compare this implementation with other methods (e.g. those surveyed in "Grenié, Matthias, Emilio Berti, Juan Carvajal‐Quintero, Gala Mona Louise Dädlow, Alban Sagouis, and Marten Winter. “Harmonizing Taxon Names in Biodiversity Data: A Review of Tools, Databases and Best Practices.” Methods in Ecology and Evolution, February 18, 2022, 2041-210X.13802. https://doi.org/10.1111/2041-210X.13802.")
In WCVP e.g. × Hoodiapelia is given as accepted name and taxon name but genus is listed as Hoodiapelia, which causes confusion.
acc_data['accepted_genus'] = acc_data[wcvp_accepted_columns['name']].apply(get_genus_from_full_name)
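To illustrate the kind of handling involved, the sketch below extracts a genus from a full name while keeping a leading hybrid marker. It is a hypothetical helper written for this note, not the get_genus_from_full_name function used above, and the set of markers is an assumption.

```python
HYBRID_MARKERS = ('×', '+')  # assumed hybrid and graft-chimaera markers


def genus_with_hybrid_marker(full_name: str) -> str:
    """Return the genus part of a name, keeping a leading hybrid marker if present."""
    parts = full_name.split()
    if not parts:
        return ''
    if parts[0] in HYBRID_MARKERS and len(parts) > 1:
        # Keep the marker attached so '× Hoodiapelia' is not reduced to 'Hoodiapelia'
        return f'{parts[0]} {parts[1]}'
    return parts[0]


print(genus_with_hybrid_marker('× Hoodiapelia'))
print(genus_with_hybrid_marker('Rothmannia longiflora'))
```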
Could use epithets from KRS to improve matching
Account for spelling mistakes in names
Outputting the family name of each taxon would be really helpful