textmatching's Introduction

TextMatching

Match a text file against a repository of text files, sorting by similarity.

This program finds the file most similar to a given sample file in a repository of candidate files.

Below is the output for sample "FAIRY TALES By The Brothers Grimm" against a repository of 10 other books.

0.0320757 Repo\THE ADVENTURES OF TOM SAWYER.txt
0.0363329 Repo\A TALE OF TWO CITIES - A STORY OF THE FRENCH REVOLUTION.txt
0.0388528 Repo\ALICE’S ADVENTURES IN WONDERLAND.txt
0.0440605 Repo\MOBY-DICK or, THE WHALE.txt
0.046679 Repo\THE ADVENTURES OF SHERLOCK HOLMES.txt
0.0472574 Repo\The Iliad of Homer.txt
0.0511793 Repo\The Romance of Lust.txt
0.053746 Repo\PRIDE AND PREJUDICE.txt
0.0543531 Repo\BEOWULF - AN ANGLO-SAXON EPIC POEM.txt
0.0557194 Repo\Frankenstein; or, the Modern Prometheus.txt

The whole repository is listed from most similar to least similar, so a lower score means a closer match. As you can see, fairy tales come first and a horror book comes last.
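The README does not say how the score is computed. As a rough illustration of the ranking idea only, here is a minimal Python sketch that scores each candidate by the distance between character-trigram frequency profiles. The feature choice, the distance measure, and the names trigram_profile, distance, and rank_repository are all assumptions made for this example, not the program's actual method.

```python
import math
from collections import Counter
from pathlib import Path


def trigram_profile(text):
    """Relative frequencies of character trigrams (an assumed feature choice)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}


def distance(p, q):
    """Euclidean distance between two profiles; 0 means identical profiles."""
    keys = p.keys() | q.keys()
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def rank_repository(sample_path, repo_dir):
    """Return (distance, path) pairs for every .txt file, closest match first."""
    sample = trigram_profile(
        Path(sample_path).read_text(encoding="utf-8", errors="replace"))
    results = []
    for path in Path(repo_dir).glob("*.txt"):
        candidate = trigram_profile(
            path.read_text(encoding="utf-8", errors="replace"))
        results.append((distance(sample, candidate), path))
    # Sort on the score only, so the smallest distance comes first.
    return sorted(results, key=lambda pair: pair[0])
```

Calling rank_repository("sample.txt", "Repo") and printing each pair would give a listing in the same shape as the one above, although the numbers would differ because the scoring here is only a stand-in.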

Commercially, this can be used to match the page a user is currently viewing against a repository of advertisement pages, to find the most relevant advertisement.

Another application is matching a job description, or the resume of a suitable candidate who is unwilling to join, against a repository of resumes to find a similar candidate.

Usage

MatchText.exe <Sample File> <Repository Directory>

textmatching's People

Contributors

srogatch

textmatching's Issues

Bug: Possible difference issue

Setup: Win10x64

Steps:

  1. Extracted program from the available 64-bit download
  2. Launched the program with the available sample files (matchtext.exe sample.txt repo); everything worked as expected
  3. To test out the difference function:
  • I copied the sample.txt exactly as-is to the Repo directory as "Fairy Tales.txt"
  • Another copy was created with one line deleted
  • A 3rd copy was made with two lines deleted

Result:

The numerical results increased across the three copies: 0, 1.3, 2.8. Note that the files are output in the correct order, but the scores seem odd.

Expected result:

An exact copy should probably have the highest number, followed by the file with one line deleted, and then the file with two lines deleted e.g. 9, 7, 6 or some similar scoring that declines in the series.
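One possible reading is that the reported numbers behave like distances, where 0 means identical and larger values mean less similar. Under that assumption, a declining score of the kind described here could be derived from them. The sketch below uses the mapping 1 / (1 + d), which is purely an illustrative choice and not something the program is known to do; similarity_from_distance is a hypothetical helper name.

```python
def similarity_from_distance(distance):
    """Map a non-negative distance to a score in (0, 1]; distance 0 gives 1.0."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


# Values reported above: exact copy, one line deleted, two lines deleted.
reported = [0, 1.3, 2.8]
scores = [similarity_from_distance(d) for d in reported]
```

With this mapping the exact copy gets the highest score and the two edited copies get progressively lower ones, which matches the declining series asked for.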

Request: binary distribution / build steps

Hey, I was looking to test this out, but as a non-developer, compiling code isn't something I spend a lot of time on. A friend with some dotnet experience helped me through some of the process, but we kept running into minor issues. I'd love to use the program to compare several text files I have that currently require a lot of manual effort. Could you post a quick process on how to compile, point to a resource, or post a binary version?

Thanks!
