
taxonomic's Introduction

Table of Contents

Introduction

Our project aims to automate taxonomic data capture from scientific reports, a task currently performed manually. The captured information can then be uploaded to searchable databases where it can be accessed by the public. Automating this process will save our client time, effort and money that can be better spent elsewhere. We are developing a web-based application that extracts taxonomic information from XML or PDF articles and outputs it in a straightforward format which the client can then check and edit. XML articles can be downloaded from Pensoft publications (https://zookeys.pensoft.net/articles), and PDF articles should be academic articles in the biological sciences. The application is deployed at https://taxonomic-tl.herokuapp.com , where users can upload such articles and download the results.

Diagram

Project Value

Having a comprehensive database of taxon data is useful to those studying biology, as they can easily search the database for materials describing the taxa they study. These biological studies can help with sustainable management of biodiversity, conservation and protection of species, biosecurity, management of invasive species and much more. Our project will build a web-based automatic data collection application, improving data collection efficiency for the Australian Biological Resources Study (ABRS), optimising extraction accuracy, reducing manual labour and saving money.

Client Expectation

The client expects us to analyse documents in PDF or XML form and return taxonomic information in a straightforward format, which the client can then check and edit. The program should be intuitive, accurate and time-saving.

System Diagram

Diagram

User Interface Prototype

Diagram

User Story Map

Diagram

Milestones

General:

  • CSV output
  • Passed user acceptance testing
  • Well formatted output
  • Generate a report on lessons learnt and issues encountered while extracting data

UI:

  • Specify UI input and output preview display
  • Design all required windows; make the output editable

PDF:

  • Acceptable level of name border recognition
  • Reference parsing
  • Field association (connecting relevant information)

XML:

  • Change the output format to CSV.
  • Add and change output fields according to the client's requirements.
  • Map the output to the standard schema.
  • Test more cases from ZooKeys and from other publishers whose articles also use the TaxPub XML structure; improve extraction accuracy.
  • Analyse and report on failure cases.

Server:

  • Contact ANU to find out whether we can establish a server there.
  • Research other platforms, such as Heroku, where the server could be hosted.
  • Learn about server deployment and produce a break-down plan for implementation.

Testing:

  • Main function testing
  • Client acceptance testing
  • User's manual

Schedule

Weeks 2-3:

  • Re-establish communication with client
  • Create plan for the semester
  • Update necessary documentation (landing page, new Google files, risk register, milestones, schedules)
  • Prepare for audit (ppt)
  • Contact ANU to find out whether we can establish a server there; if not applicable, research other server platforms (Heroku)
  • Study server deployment
  • Specify UI input and output preview display
  • Change output format to CSV

Weeks 4-5:

  • Improve UI (output display)
  • Work on parsing bibliography references (PDF)
  • Add and change output fields according to the client's requirements and feedback; work on different ZooKeys cases (XML)
  • Try to reduce reliance on webservers
  • Study server deployment; make a breakdown plan and kick off deployment (Server)

Weeks 5-6:

  • Improve UI (output display)
  • Improve UI (input display)
  • Attempt to increase "border word" list / Integrate other name detection (PDF)
  • Add some holotype/coordinate detection (PDF)
  • Work on different ZooKeys cases; improve accuracy; make the output conform to the standard schema (XML)
  • Continue deploying server, link UI with server (Server)
  • Design testing for accuracy. Start build testing accuracy (Testing)
  • Prepare for audit

Mid-break:

  • Improve UI (editable output)
  • Improve UI (output display)
  • Improve UI (input display)
  • Try to link different attributes within the text (e.g. genders with holotypes, holotypes with species) (PDF)
  • Try to highlight the extracted information in the original PDF (PDF)
  • Work on cases of species combination, or changes to a species' category (XML)
  • Finish mapping the original output to the standard schemas (XML)
  • Work on other publishers' TaxPub cases; improve accuracy; make the output conform to the standard schema (XML)
  • Continue deploying the server and link the backend with it; provide the client with a usable website (Server)
  • Test accuracy (Testing)

Week 7-8:

  • Improve UI (editable output)
  • PDF output in standard schema (PDF)
  • Update the server to ensure results follow the standard schema (Server)
  • Finish testing accuracy (Testing)
  • Design and implement the accuracy algorithm (Testing)

Week 9-10:

  • Testing with client
  • Acting on client feedback
  • Prepare for audit
  • Generate a report on lessons learnt and issues encountered while extracting data

Week 11-12:

  • Finalising project
  • User acceptance testing

Progress

Semester 1

Audit 1:

  • Researched relevant biology/taxonomy information
  • Identified risks
  • Created a GitHub project
  • Communicated with clients to identify the problem and their underlying needs

Audit 2:

  • Created the general structure of the project and the structure of each part (frontend, backend, UI, testing).
  • Made statement of work (SOW).
  • Allocated tasks and roles according to the SOW.
  • Created UI prototype.
  • Started researching GoldenGate.
  • Implemented basic functions to identify new species and genus names in the abstracts of XML-formatted articles.

Audit 3:

  • Communicated with clients about the output schema and updated our understanding of their needs.
  • Designed poster.
  • Finished the research on GoldenGate; started using NLP to process PDF articles.
  • Found other taxonomic information related to new species/genera, agents and references in XML articles, and matched them.
  • Formatted the extracted data according to the client's needs.
  • Used Flask (a micro web framework written in Python) to connect with the backend.
  • Wrote unit tests.

Semester 2

This semester we made a schedule for the whole semester in week 2. Each week in the group meeting, each member puts forward their own tasks according to the schedule and reports on the previous week's tasks, in particular analysing the reasons for any delays and proposing new solutions. We use GitHub issues to document everyone's achievements and reflections.

Audit 1:

  • Created a plan for the semester
  • Updated necessary documentation, e.g. SOW, risk register, GitHub

Audit 2:

  • Deployed the server.
  • Reached agreement on the accuracy-testing plan and began building tests.
  • PDF extraction can now parse references.
  • Integrated the XML extraction program and packed the output files into a zip file.
  • Obtained populated standard schemas; mapping XML outputs to the standard schemas (in progress).
  • Acceptance criteria.

Risk Management

As the project is being implemented as part of a secure system, it is important that it does not present any new vulnerabilities to that system. This can be achieved by being considerate of the environment in which our project will be deployed and using appropriate programming techniques.

Team Member Roles

Team Member     Uni ID     Role
Jing Li         u6531952   Project Manager, Developer (XML extraction)
Biwei Cao       u5926643   Developer (XML extraction), Documentation
Jiaqi Zhang     u6089193   Developer (Testing)
Joshua Trevor   u6405233   Developer (PDF extraction), Spokesperson
Yanlong LI      u5890571   Developer (front end)
Yuan Yao        u5945391   Developer (data interaction), Documentation

Communication Tools

  1. Email
  2. Facebook Messenger

Development Environment

  • Language:
    • Backend is written in Python
    • Frontend uses HTML, CSS and JavaScript
  • Testing:
    • Unit testing during development (black-box/white-box)
    • A/B testing for the final stage

Development Tools

  1. PyCharm (Python IDE)
  2. Flask (a micro web framework)
  3. Dreamweaver (HTML, CSS, JavaScript)

Decision-Making Procedures

Diagram

Testing

  1. Testing Process

Diagram

Meeting Agendas

Audit Presentation

Other Resources

Handover

taxonomic's People

Contributors

tarasom123, jingli201802, claire0212, tyraeldlee, superfeone, joshuatrevor, afuchs1


taxonomic's Issues

Scrape citethisforme output

From the response generated by citethisforme, grab the appropriate fields from the page source and translate them into a dictionary that can be used to populate bibliography.csv.
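A minimal sketch of this kind of field scraping. The `<span class="…">` selectors below are hypothetical placeholders; the real citethisforme page source must be inspected to find the actual markup:

```python
import re

def scrape_citation_fields(page_source):
    """Pull citation fields out of a citethisforme response page.

    The span class names are assumptions, not the real selectors --
    adjust them after inspecting the actual page source.
    """
    fields = {}
    for key in ("author", "title", "year", "journal"):
        match = re.search(rf'<span class="{key}">(.*?)</span>',
                          page_source, re.S)
        if match:
            fields[key] = match.group(1).strip()
    return fields
```

A dedicated HTML parser would be more robust than regex for production use; this only illustrates the field-to-dictionary mapping.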

Correct "kindOfNameUsage" column

Instead of describing whether the name is scientific or common, the column should describe whether the name is an instance of sp. nov., comb. nov., gen. nov., etc.
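One way this classification could be sketched, assuming the nomenclatural act marker appears in the name string itself (the pattern list and labels are illustrative, not the project's actual code):

```python
import re

# Hypothetical mapping from nomenclatural act markers to usage labels.
USAGE_PATTERNS = [
    (re.compile(r"\bsp\.?\s*nov\.?", re.I), "sp. nov."),
    (re.compile(r"\bcomb\.?\s*nov\.?", re.I), "comb. nov."),
    (re.compile(r"\bgen\.?\s*nov\.?", re.I), "gen. nov."),
]

def kind_of_name_usage(name_string):
    """Classify a name string by nomenclatural act rather than by
    scientific/common (a sketch of the corrected column)."""
    for pattern, label in USAGE_PATTERNS:
        if pattern.search(name_string):
            return label
    return None
```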

Add clean error handling for webservice failure

If either citethisforme or gnparser cannot be accessed, it is important that the client is made aware of this:

A) So they can tell that the issue is not an error in our program.
B) So that, if the issue persists, they can consult our documentation to understand the nature of the problem and possible solutions. (It may be that instead of replacing the webservice, our interface with that service just has to be adapted to changes its maintainers have made.)
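A standard-library sketch of what "clean error handling" could look like here: wrap the low-level failure in a message that names the external service, so the client can see the problem is outside our program (the wrapper function and message wording are assumptions):

```python
import urllib.request
import urllib.error

def call_webservice(url, timeout=10):
    """Fetch a URL, converting low-level failures into a clear message
    that names the external service -- a sketch, not the project's
    actual error-handling code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as err:
        raise RuntimeError(
            f"The external service at {url} could not be reached "
            f"({err.reason}). This is not an error in the extraction "
            "program; if the problem persists, consult the documentation "
            "for possible solutions."
        ) from err
```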

Finishing mapping original output to TNC

Totally 4 sheets.
TNC_TaxonomicName is almost finished. To do:

  • The fullNameWithAuthorship needs to import article author (if not mention taxonomic author) from Tarasom functions.
  • Find subspecies, common names.

TNC_TaxonomicNameUsage and TNC_BibliographicReference are in process..

Remap output to TNU

The client has specified an output format which uses four classes. In order to fit this format the PDF output needs to be in the form of four CSVs.

Definition of done:
Executing the PDF extraction code will produce a folder with four CSVs, each containing the list of instances of a particular class in the TNU schema.
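The output step could be sketched as below. The four class names and the per-class column layout are assumptions based on the issue description, not the schema's actual definitions:

```python
import csv
import os

# Class names assumed from the issue description; exact file names may differ.
TNU_CLASSES = ("TaxonomicName", "TaxonomicNameUsage",
               "Typification", "BibliographicReference")

def write_tnu_outputs(records, out_dir):
    """Write one CSV per TNU class into out_dir, creating an empty file
    for classes with no records so the folder always holds four CSVs."""
    os.makedirs(out_dir, exist_ok=True)
    for cls in TNU_CLASSES:
        rows = records.get(cls, [])
        fieldnames = list(rows[0].keys()) if rows else ["id"]
        with open(os.path.join(out_dir, f"{cls}.csv"), "w",
                  newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
```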

Implement support for multiple name authors

Currently, even though the authorship value in GNParser's JSON output is always a list, only the first value is used (usually there is only one value).

Definition of done:
The authorship list in GNParser's output is concatenated into a single string and this string is inserted into the final CSV output of the program.
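The concatenation step could be as small as the helper below. The JSON nesting shown (`{"authorship": {"authors": [...]}}`) is an assumption about GNParser's output shape; the keys should be adjusted to match the actual response:

```python
def join_authorship(parsed):
    """Concatenate every author in a GNParser-style result into one
    string, instead of taking only the first list entry.

    The key names are assumptions -- check them against real output.
    """
    authors = parsed.get("authorship", {}).get("authors", [])
    return ", ".join(authors)
```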

Server

Use Flask to connect the UI with the PDF and XML extraction code: accept PDF or XML input and produce output accordingly.
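A minimal sketch of such a Flask endpoint, dispatching on file extension. The route name, form field name, and the extractor hooks in the comments are all hypothetical, not the project's real API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/extract", methods=["POST"])
def extract():
    """Dispatch an uploaded article to the right extractor by extension."""
    upload = request.files.get("article")
    if upload is None:
        return jsonify(error="no file uploaded"), 400
    name = (upload.filename or "").lower()
    if name.endswith(".xml"):
        # The real app would call its XML extraction code here,
        # e.g. result = extract_xml(upload)  (function name assumed)
        result = {"source": "xml"}
    elif name.endswith(".pdf"):
        # ...and its PDF extraction code here.
        result = {"source": "pdf"}
    else:
        return jsonify(error="unsupported file type"), 400
    return jsonify(result)
```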

Constrict type detection

Currently typification.csv records too much information, accidentally capturing some type descriptions. Each type should be described in around one sentence, and this sentence should contain gender, location and identifier.
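One way to constrain the output to roughly one sentence, sketched with a naive sentence splitter and an assumed keyword list (the keywords and splitting rule are illustrative only):

```python
import re

def first_type_sentence(description):
    """Return only the first sentence of a type description that mentions
    a type keyword -- a sketch of limiting typification output to ~1
    sentence. Keyword list is an assumption."""
    for sentence in re.split(r"(?<=[.!?])\s+", description):
        if re.search(r"\b(holotype|paratype|lectotype)\b", sentence, re.I):
            return sentence.strip()
    return None
```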

Correct TNU id's

Currently, TNU IDs are generated in the TNU CSV. This is incorrect: they should be generated in TaxonomicName.csv and then passed on to the TNU CSV and typification.csv respectively.
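The fix amounts to minting each ID once, upstream, and threading it through the dependent rows. A sketch (the ID format and row fields are assumptions):

```python
import itertools

_ids = itertools.count(1)

def new_taxonomic_name_row(name):
    """Mint the ID here, in the TaxonomicName record, so downstream
    rows reuse it instead of generating their own (ID format assumed)."""
    return {"id": f"TN{next(_ids):04d}", "scientificName": name}

def new_tnu_row(taxonomic_name_row):
    # Reference the upstream ID rather than minting a fresh one.
    return {"taxonomicNameId": taxonomic_name_row["id"]}
```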

Correct spacing issues caused by PYPDF2

Currently PyPDF2 (the library used to convert PDFs to strings) adds many unnecessary and unpredictable line breaks, which make it difficult to parse important information (e.g. references).

Possible solutions include:

  • Replacing PyPDF2 with a more suitable tool
  • Attempting to correct the line breaks manually
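The manual-correction option could start from a heuristic like the one below: re-join hyphenated words, then treat single newlines as soft wraps while keeping blank lines as paragraph breaks. This is a sketch; real PyPDF2 output will need more rules:

```python
import re

def repair_linebreaks(text):
    """Heuristically remove spurious line breaks from extracted PDF text.

    Joins words hyphenated across lines, then replaces remaining single
    newlines with spaces, preserving blank lines as paragraph breaks.
    """
    text = re.sub(r"-\n(?=\w)", "", text)          # re-join hyphenated words
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)   # single newline -> space
    return text
```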

Improve name detection

Name detection can be improved in two significant ways:

  • By accounting for punctuation that indicates certain words don't belong in the name, e.g. capital letters where there shouldn't be any, unclosed brackets and misplaced full stops.

  • By checking whether GNParser returns an "unparsed tail" string containing words adjacent to the sp./gen./comb. marker. Because those words can be assumed to be part of the name, this error means too many words were included in front of the name (so remove the front word and try again).
    Since GNParser is a webservice, it is important to do these checks in batched rounds rather than one at a time: checking one by one would increase latency immensely and may also cause GNParser to flag the IP address of the program's machine as malicious.
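The retry-and-batch logic above could be sketched as two small generators (the batch size is an assumption, and the actual GNParser call is left out):

```python
def trim_candidates(name):
    """Yield successively shorter candidates, dropping one front word
    each time, for retrying after GNParser reports an unparsed tail."""
    words = name.split()
    for i in range(len(words)):
        yield " ".join(words[i:])

def batches(candidates, size=50):
    """Group candidates so GNParser is queried once per round rather
    than once per name (batch size of 50 is an assumption)."""
    for i in range(0, len(candidates), size):
        yield candidates[i:i + size]
```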

Change coordinate scope

Include coordinates described using decimal degrees as well as DMS, and allow more room for variation in spacing, punctuation, etc. in the regex.
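A sketch of such a broadened pattern, covering both DMS (e.g. 35°16'51"S) and decimal degrees (e.g. 149.13°E). This is illustrative, not the project's actual regex, and would need tuning against real articles:

```python
import re

# DMS: degrees, minutes, optional seconds, hemisphere letter.
DMS = r"""\d{1,3}\s*°\s*\d{1,2}\s*['′]\s*(?:\d{1,2}(?:\.\d+)?\s*["″])?\s*[NSEW]"""
# Decimal degrees with optional sign, degree symbol and hemisphere letter.
DEC = r"[-+]?\d{1,3}\.\d+\s*°?\s*[NSEW]?"
COORD = re.compile("(?:" + DMS + ")|(?:" + DEC + ")")

def find_coordinates(text):
    """Return every DMS or decimal coordinate matched in the text."""
    return COORD.findall(text)
```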
