
gsq-metadata-extraction's Introduction

Metadata extraction and generation

Aim

The aim is to use machine learning (ML) to extract and generate metadata from existing files. A model (neural network) will be created, trained and improved to process the existing files and their file structures and extract labelled metadata such as project date-time, dimensionality, project name and other metadata.

Desired attributes are:

Attributes
SurveyNum
SurveyName
LineName
SurveyType
PrimaryDataType
SecondaryDataType
TertiaryDataType
Quaternary
File_Range
First_SP_CDP
Last_SP_CDP
CompletionYear
TenureType
Operator Name
GSQBarcode
EnergySource
LookupDOSFilePath
Source Of Data

Subprojects

A folder exists for each subproject. Each folder contains a further README file with more information.

  • Filepath metadata extraction: extract metadata from the file path of each file. (Current)
  • Document metadata extraction: extract metadata from documents such as PDF reports.
  • Hierarchical analysis and metadata sharing: copy metadata to related files based on the hierarchical structure.
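
The third subproject, sharing metadata along the folder hierarchy, could look roughly like the sketch below. It assumes that metadata is already known for some folders and that each file inherits the fields of all its ancestor folders, with nearer folders overriding those higher up. The function name `inherit_metadata`, the example paths and the field values are hypothetical and are not taken from the subproject itself.

```python
from pathlib import PurePosixPath


def inherit_metadata(file_path, folder_metadata):
    """Merge metadata from every ancestor folder of file_path.

    folder_metadata maps folder paths to dicts of known fields.
    Folders closer to the file override values from folders higher up.
    """
    merged = {}
    # .parents lists the nearest folder first, so reverse it to start
    # at the root and let nearer folders overwrite earlier values.
    for folder in reversed(list(PurePosixPath(file_path).parents)):
        merged.update(folder_metadata.get(str(folder), {}))
    return merged


# Hypothetical example data, for illustration only.
known = {
    "/surveys/S001": {"SurveyNum": "S001", "SurveyType": "2D"},
    "/surveys/S001/line_A": {"LineName": "A"},
}
result = inherit_metadata("/surveys/S001/line_A/stack.sgy", known)
print(result)
```

Letting the nearest folder win is one possible rule. A real implementation might instead flag conflicting values for human review.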

Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| ML | Machine learning |
| TF | TensorFlow |
| NN | Neural network |
| RNN | Recurrent neural network |
| DNN | Deep neural network |
| CNN | Convolutional neural network |
| DRNN | Deep recurrent neural network |
| LSTM | Long short-term memory |

Definitions

| Word / Phrase | Meaning |
| --- | --- |
| Sequence | A sequence of objects, for example text |
| Temporal inputs / data | Sequential data, often of varying size |
| Latent | Hidden or internal |

Research

TensorFlow

TensorFlow (TF) is a widely used, scalable ML toolkit that provides the machine learning functionality this project needs, so TF will be used for this project.

The ML process will iteratively generate and improve a model that can be used to predict metadata from new file path inputs. Once a prediction has been made, it can be confirmed or corrected by a human, and this response can be fed back into the ML process to improve the model's accuracy.
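
The human feedback step can be pictured as turning each reviewed prediction into a new labelled example. The sketch below is only an illustration of that idea: the function name `to_training_example`, the record layout and the example paths are assumptions, not part of the project.

```python
def to_training_example(path, predicted, corrected=None):
    """Turn a reviewed prediction into a labelled training example.

    corrected is None when the reviewer confirms the prediction;
    otherwise it holds the value the reviewer entered instead.
    """
    label = predicted if corrected is None else corrected
    return {"path": path, "label": label, "confirmed": corrected is None}


# Hypothetical review session: one prediction confirmed, one corrected.
training_set = []
training_set.append(to_training_example("/surveys/S001/report.pdf", "S001"))
training_set.append(
    to_training_example("/surveys/S0O2/report.pdf", "S0O2", corrected="S002")
)
print(training_set)
```

The accumulated examples would then be added to the training data before the next training round.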

Recurrent Neural Networks (RNN)

An RNN can process input of varying lengths, so rather than feeding in the entire path at once, the path is fed in character by character. The RNN will use Long Short-Term Memory (LSTM) to process the entire path character by character and continuously output the information discovered up to that point.
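
Feeding a path character by character requires each character to be turned into a number first. The following is a minimal sketch of one common way to do this, assuming a vocabulary built from the training paths. The helper names `build_vocab` and `encode`, the reserved ids 0 (padding) and 1 (unknown character), and the example paths are assumptions made for illustration.

```python
PAD_ID = 0      # assumed id reserved for padding shorter sequences
UNKNOWN_ID = 1  # assumed id for characters not seen when building the vocabulary


def build_vocab(paths):
    """Map every character seen in the paths to an integer id, starting at 2."""
    chars = sorted(set("".join(paths)))
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(path, vocab):
    """Encode a path as a list of character ids, one id per character."""
    return [vocab.get(ch, UNKNOWN_ID) for ch in path]


# Hypothetical paths of different lengths.
vocab = build_vocab(["A/1", "B/22"])
print(encode("A/1", vocab))   # three ids
print(encode("B/22", vocab))  # four ids
```

The encoded sequences keep the length of the original path, which is why a network that accepts variable-length input is needed.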

Unlike regular neural networks (NN), which process each input separately, an RNN feeds data from the previous steps (the previously processed characters) into the current computation.
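
The difference can be shown with a deliberately tiny recurrent cell. This sketch uses a plain recurrent step with a single scalar hidden state, not an LSTM, and the weights are arbitrary fixed numbers rather than trained values. All names are hypothetical.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: the new state depends on the input and the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, w_h=0.8):
    """Process the inputs one at a time, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_h=w_h)
        states.append(h)
    return states


# The same input value twice gives two different states,
# because the second step also sees the state left by the first.
recurrent = run_rnn([1.0, 1.0])
# With w_h = 0 the previous state is ignored, as in a regular NN,
# so identical inputs give identical outputs.
independent = run_rnn([1.0, 1.0], w_h=0.0)
print(recurrent, independent)
```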

Deep Neural Network (DNN)

This network will likely become a Deep Neural Network (DNN), which means it has more than one hidden layer. Note that an NN can be both recurrent (RNN) and deep (DNN).

DNNs are known for handling more complex data better and are thus likely to increase model accuracy. Like RNNs, DNNs are more complex than regular NNs.
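
A network that is both recurrent and deep can be sketched in TensorFlow by stacking more than one LSTM layer. The example below is a rough illustration under several assumptions: it uses the `tf.keras` API of TensorFlow 2.x, it predicts a single categorical attribute rather than all the desired attributes, and the function name `build_drnn` and the layer sizes are hypothetical. It is not the project's actual model.

```python
def build_drnn(vocab_size, num_classes, embed_dim=32, units=64):
    """Build a small stacked-LSTM classifier over sequences of character ids."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow 2.x is installed

    model = tf.keras.Sequential([
        # Turn each character id into a dense vector; id 0 is treated as padding.
        tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True),
        # First recurrent layer returns its output at every character
        # so that the second recurrent layer can consume the full sequence.
        tf.keras.layers.LSTM(units, return_sequences=True),
        # Second recurrent layer returns only its final output.
        tf.keras.layers.LSTM(units),
        # One probability per possible value of the predicted attribute.
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_drnn(vocab_size=40, num_classes=5)`, for instance, would give a model with two hidden recurrent layers. Removing the first LSTM layer leaves a single-layer RNN for comparison.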

Advantages and disadvantages of different neural networks

| Type | Advantage | Disadvantage |
| --- | --- | --- |
| RNN | Temporal data | More complex |
| | Variable input and output size | Harder to train |
| | Known for better accuracy | Good at predicting how the input might continue (not what we want) |
| DNN | Better accuracy for complex data | More complex |
| | Better incorporates past inputs | Slightly slower |
| LSTM | Variable input and output size | Very complex |
| | Known for very good accuracy | Slower |
| Plain NN | Very fast to train | Likely less accurate |
| | Simpler model | Fixed input and output size |
| Convolutional NN | | Convolution does not apply to text-based data |

In conclusion, we will initially attempt to implement and test a Deep Recurrent Neural Network (DRNN) and possibly compare it to RNN and DNN implementations, depending on performance results.

Contacts

Geoscience Information Team, Geological Survey of Queensland, Department of Resources, Brisbane, QLD, Australia, [email protected]

Contributors

chrisskorka, lukehauck
