paper-checker's Introduction

PaperCheck

PaperCheck is a Python script that searches for simple grammar mistakes in scientific English texts. Unlike other grammar checkers, it is free and tailored to scientific texts such as papers. It can find words that pass a spell check but are most likely not intended in a scientific context, such as "angel" instead of "angle".

Getting Started

git clone https://github.com/emareg/paper-checker.git
cd paper-checker
make setup

Afterwards, you can use the script in two ways:

1. Run the python file

python3 papercheck.py -sgy example/testfile.tex

2. Compile as a stand-alone executable (Unix only)

make
./papercheck -sgy example/testfile.tex

Supported file types: .tex .txt .md .pdf

The issues found are displayed in the terminal and also written to papercheck_report.html.

System wide installation

make install

This will copy the stand-alone executable to ~/.local/bin

Install as a Python package

pip3 install .
cd example
python3 -m papercheck -sgy testfile.tex

Features

Spell Checker (-s option)

Will highlight spelling errors. The script uses a small basic dictionary plus additional self-made dictionaries for terms such as

  • technical: “microcontroller”, “superframe”, “bitmask”
  • mathematical: “eigenvector”, “linearization”
  • chemical: todo

The larger standard dictionaries are unsuitable because they

  • contain errors such as “longitudianl” or “schemati”
  • mask informal plural forms such as “vertexes” which should be “vertices”
  • include obsolete forms such as “latence” which should be “latency”

Grammar Checker (-g option)

Will highlight simple grammar mistakes such as

  • misuse of “a” or “an”
  • doubled auxiliary verbs (e.g. “is are”)
  • doubled determiners (e.g. “this the”)
  • confused “then” vs. “than”
  • confused “to” vs. “too”
  • wrong person-verb combination (e.g. “This were”)
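
As an illustration of the first rule in this list, here is a minimal sketch of an "a"/"an" check. It is not the project's implementation: the function names and both prefix lists are hypothetical, and the heuristic only looks at spelling, so it needs explicit exceptions for words such as "unanimous" (consonant sound) or "hour" (vowel sound).

```python
import re

# Hypothetical exception lists; a real checker would need far more entries.
CONSONANT_SOUND_PREFIXES = ("unanim", "uniq", "unit", "univers", "unif",
                            "user", "usual", "euro")   # "a unanimous vote"
VOWEL_SOUND_PREFIXES = ("hour", "honest", "honor", "heir")  # "an hour"


def expected_article(word):
    """Return "a" or "an" for the word that follows (rough heuristic)."""
    lowered = word.lower()
    if lowered.startswith(VOWEL_SOUND_PREFIXES):
        return "an"
    if lowered.startswith(CONSONANT_SOUND_PREFIXES):
        return "a"
    return "an" if lowered[0] in "aeiou" else "a"


def find_article_errors(text):
    """Yield (article, word) pairs where the article looks wrong."""
    pattern = r"\b(a|an)\s+([A-Za-z]+)"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        article, word = match.group(1), match.group(2)
        if article.lower() != expected_article(word):
            yield article, word
```

Calling `list(find_article_errors("an unanimous vote"))` reports the pair `("an", "unanimous")`, while "a unanimous vote" passes.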

Style Checker (-y option)

Will highlight language that could be improved such as

  • wrong words in scientific context (e.g. “angle” vs. “angel”)
  • unexplained acronyms
  • informal words (e.g. use “entire” instead of “whole”)
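
The acronym check could be sketched roughly as follows. This is an assumption about how such a check might work, not the project's code: it treats an acronym as "explained" if it appears at least once in parentheses, as in "central processing unit (CPU)". The function name is hypothetical.

```python
import re


def unexplained_acronyms(text):
    """Return acronyms that never appear in the form "(ABC)"."""
    # All-caps tokens of two or more letters, with an optional plural "s".
    used = re.findall(r"\b[A-Z]{2,}s?\b", text)
    # Acronyms introduced in parentheses count as explained.
    defined = set(re.findall(r"\(([A-Z]{2,})s?\)", text))
    missing = []
    for acronym in used:
        base = acronym.rstrip("s")
        if base not in defined and base not in missing:
            missing.append(base)
    return missing
```

For the text "The central processing unit (CPU) talks to the GPU.", only "GPU" is reported.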

Plagiarism Checker (-p option)

experimental!

The script will try to find significant sentences, which are then compared to Google search results. This is a crude approach, but it is useful as a minimal-effort check at zero cost.

TeX checker

When you run the script on .tex files, it will also check for certain TeX problems such as

  • unused labels
  • missing periods in figure/table captions
  • unescaped math operators in math mode, e.g. $sin$ instead of $\sin$
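
A minimal sketch of the unused-label check, assuming labels are declared with \label{...} and referenced through a small set of common commands. The command list and the function name are assumptions; the actual checker may recognise other commands.

```python
import re


def unused_labels(tex):
    """Return labels that are declared but never referenced."""
    labels = re.findall(r"\\label\{([^}]+)\}", tex)
    referenced = set()
    ref_pattern = r"\\(?:ref|eqref|autoref|cref|Cref|pageref)\{([^}]+)\}"
    for match in re.finditer(ref_pattern, tex):
        # cleveref-style commands accept comma-separated keys.
        referenced.update(key.strip() for key in match.group(1).split(","))
    return [label for label in labels if label not in referenced]
```

For the input `\label{fig:a} \label{tab:b} see \ref{fig:a}`, the result is `["tab:b"]`.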

Related Work

  • LanguageTool: Grammar, Style and Spell Checker written in Java
  • textidote: uses LanguageTool on .tex files

So why not use LanguageTool? It is large, slow and not tailored for scientific/technical texts. However, I recommend using LanguageTool in addition.

paper-checker's People

Contributors

adriankast, alxhoff, christiankral, egekorkan, emareg, hofbi, ihaveint

paper-checker's Issues

[question][plagiarism] Would it be better to use the Google Search API instead of normal Google search

Background information

Google provides an API endpoint for search queries that returns JSON-based responses containing all the information the plagiarism checker needs. For example, searching for "lectures" in the custom search engine from the API documentation returns the following:

{
  "kind": "customsearch#search",
  "url": {
    "type": "application/json",
    "template": "https://...."
  },
  "queries": {
    "request": [
      {
//...
      }
    ],
    "nextPage": [
      {
        "title": "Google Custom Search - lectures",
        "totalResults": "781000000",
        "searchTerms": "lectures",
        "count": 10,
        "startIndex": 11,
        "inputEncoding": "utf8",
        "outputEncoding": "utf8",
        "safe": "off",
        "cx": "017576662512468239146:omuauf_lfve"
      }
    ]
  },
  "context": {
    "title": "CS Curriculum",
    "facets": [
      [
        {
          "anchor": "Lectures",
          "label": "lectures",
          "label_with_op": "more:lectures"
        }
      ],
      [
        {
          "anchor": "Assignments",
          "label": "assignments",
          "label_with_op": "more:assignments"
        }
      ],
      [
        {
          "anchor": "Reference",
          "label": "reference",
          "label_with_op": "more:reference"
        }
      ]
    ]
  },
  "searchInformation": {
    "searchTime": 0.350489,
    "formattedSearchTime": "0.35",
    "totalResults": "781000000",
    "formattedTotalResults": "781,000,000"
  },
  "items": [
    {
      "kind": "customsearch#result",
      "title": "Introduction to Machine Learning",
      "htmlTitle": "Introduction to Machine Learning",
      "link": "https://see.stanford.edu/Course/CS229",
      "displayLink": "see.stanford.edu",
      "snippet": "Slides from Andrew's lecture on getting machine learning algorithms to work in \npractice can be found here. Previous projects: A list of last year's final projects ...",
      "htmlSnippet": "Slides from Andrew's \u003cb\u003electure\u003c/b\u003e on getting machine learning algorithms to work in \u003cbr\u003e\npractice can be found here. Previous projects: A list of last year's final projects ...",
      "cacheId": "vB97xQjhxVcJ",
      "formattedUrl": "https://see.stanford.edu/Course/CS229",
      "htmlFormattedUrl": "https://see.stanford.edu/Course/CS229",
      "pagemap": {
        "cse_thumbnail": [
          {
            "src": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ2_-hJWbczpcTOUvBJuymIrbHevHrTlAL-EhyPo--xfmFh0F0Ts8iCmOc",
            "width": "148",
            "height": "208"
          }
        ],
        "metatags": [
          {
            "viewport": "width=device-width, initial-scale=1"
          }
        ],
        "cse_image": [
          {
            "src": "https://see.stanford.edu/Content/Images/Instructors/ng.jpg"
          }
        ]
      },
      "labels": [
        {
          "name": "lectures",
          "displayName": "Lectures",
          "label_with_op": "more:lectures"
        }
      ]
    },
// There are more results here
  ]
}

For more information: https://developers.google.com/custom-search/v1/overview

Question

Should this type of search replace the current search, or should it be added as an additional search?

Advantages

  • No need to parse raw HTML/DOM to get the information needed
  • Being able to create custom search engines that can focus on specific websites: This can improve the detection capability

Disadvantages

  • Need to create a custom search engine using a Google account
  • Need to create an API key: this is the most troublesome part. Either we create one API key and push it to the repository, which means anyone can use it for searches that are then linked to the key's owner (probably @emareg), or every user has to supply their own API key when using the plagiarism checker. The second option is clean and safe, but it means some work for every user.
  • There is a limit on how many requests can be made for free per search engine
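
To make the per-user key option concrete, here is a minimal sketch that builds a request URL for the Custom Search JSON API and extracts links and snippets from a response shaped like the JSON above. The environment variable names PAPERCHECK_GOOGLE_API_KEY and PAPERCHECK_GOOGLE_CX and both function names are hypothetical; the sketch does not perform the HTTP request itself.

```python
import os
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(sentence, api_key=None, engine_id=None):
    """Build the request URL for an exact-phrase search of one sentence."""
    # Hypothetical variable names: each user supplies their own credentials.
    api_key = api_key or os.environ.get("PAPERCHECK_GOOGLE_API_KEY")
    engine_id = engine_id or os.environ.get("PAPERCHECK_GOOGLE_CX")
    if not api_key or not engine_id:
        raise RuntimeError("API key and search engine ID (cx) are required")
    query = urlencode({"key": api_key, "cx": engine_id, "q": f'"{sentence}"'})
    return f"{API_ENDPOINT}?{query}"


def extract_hits(response):
    """Return (link, snippet) pairs from a decoded JSON response."""
    return [(item["link"], item.get("snippet", ""))
            for item in response.get("items", [])]
```

Reading the key from the environment keeps it out of the repository, at the cost of a one-time setup step for each user.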

Dependency list is missing

Currently, bs4 must be installed for the plagiarism checker to work. It would be good to have a requirements.txt or another way to install dependencies automatically.
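
Until a dependency list exists, a start-up check could at least tell the user what is missing. The sketch below is only an assumption about how that might look; the function name is hypothetical. Note that the bs4 module is provided by the beautifulsoup4 package on PyPI.

```python
import importlib.util


def missing_modules(modules):
    """Return the top-level module names that cannot be imported."""
    return [name for name in modules
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_modules(["bs4"]):
        # bs4 is installed with: pip3 install beautifulsoup4
        print(f"Missing dependency: {name}")
```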

Multi-line captions

Hey,
in multiline captions the checker complains about missing dots.

\caption{
    abcdef hijkl.
}

Instead, it recommends:

\caption{
    abcdef hijkl.
.}

Also, it recommends "an unanimous", although "a unanimous" is correct. (The word begins with a "y" sound, as in "a year".)

Thanks!
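
One possible way to handle the multi-line caption case is to strip surrounding whitespace, including newlines, before testing for the final period. This is a sketch under the assumption that captions contain no nested braces; both function names are hypothetical.

```python
import re


def find_captions(tex):
    """Return the bodies of all \\caption{...} commands (no nested braces)."""
    return re.findall(r"\\caption\{([^{}]*)\}", tex)


def caption_missing_period(caption_body):
    """True if a non-empty caption does not end with sentence punctuation."""
    text = caption_body.strip()  # also removes leading/trailing newlines
    return bool(text) and not text.endswith((".", "!", "?"))
```

With this, the caption from the report above is accepted, while `\caption{No dot}` is still flagged.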

Plagiarism Checker does not output to the report

It is misleading that, even though the plagiarism check is performed and logged to the terminal, the results do not show up in the HTML report. It should at least be documented that its output appears only in the terminal/stdout.

names_geo.dic & names_people.dic files are missing

In spelling.py, the files names_geo.dic and names_people.dic are used on lines 207 and 208,
but they are missing from the src/dictionary folder.

dictionary = read_dictionary(dictionary, 'src/dictionary/names_geo.dic')
dictionary = read_dictionary(dictionary, 'src/dictionary/names_people.dic')
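
A tolerant loader would let the spell checker continue when an optional dictionary is absent. The following is a sketch only: it assumes the dictionary is a Python set and that .dic files hold one word per line, neither of which is confirmed by the report. The name read_optional_dictionary is hypothetical.

```python
import os
import warnings


def read_optional_dictionary(dictionary, path):
    """Add the words from path to the set; warn and skip if it is missing."""
    if not os.path.isfile(path):
        warnings.warn(f"dictionary file not found, skipping: {path}")
        return dictionary
    with open(path, encoding="utf-8") as handle:
        # Assumed format: one word per line, blank lines ignored.
        dictionary.update(line.strip() for line in handle if line.strip())
    return dictionary
```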

hyphenated wrapped words not resolving

Hyphenated words that happen to fall at the end of a line are reconstructed without the hyphen.

In my paper I have this example.
"...but with very contrasting power-
performance thread..."

This becomes
"but with very contrasting powerperformance thread"

after pdf2text. No idea if it's solvable but thought I'd let you know.
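
One heuristic that might help: when joining a word split across a line break, drop the hyphen only if the joined word is in the dictionary, and keep it otherwise. This is a sketch of that idea, not a tested fix for the PDF extraction; the function name and the known_words parameter are hypothetical.

```python
import re


def join_wrapped_words(text, known_words):
    """Rejoin words split by a hyphen at a line break."""
    def rejoin(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in known_words:
            return joined          # ordinary line-break hyphenation
        return f"{left}-{right}"   # unknown when joined: keep the hyphen

    return re.sub(r"(\w+)-\s*\n\s*(\w+)", rejoin, text)
```

With a dictionary containing "implementation" but not "powerperformance", "implemen-" followed by "tation" becomes "implementation", while "power-" followed by "performance" stays "power-performance".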

Text statistics ignores LaTeX environments

Reproduce: run papercheck on a .tex file including a table (e.g. example/testfile.tex)
Expected: textstats should count at least one table
Actual: textstats shows 0 as table count

Problem:
For some reason, textstats.py does not count \begin{table}, even though the regex seems to be correct.
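
For comparison while debugging, here is a small stand-alone counter for LaTeX environments. It does not reproduce or diagnose the code in textstats.py; the function name is hypothetical. If this counts tables correctly on the raw file, the text may be losing its environments before the statistics step runs, which is only a guess.

```python
import re
from collections import Counter


def count_environments(tex):
    """Count \\begin{name} occurrences, treating starred variants alike."""
    # A raw string keeps the backslash and braces literal in the pattern.
    names = re.findall(r"\\begin\{([A-Za-z]+)\*?\}", tex)
    return Counter(names)
```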

Output file argument is unused

The CLI provides the option "-o FILE, --output FILE  write report to FILE", which is never used. The output filename is hardcoded.
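
A sketch of how the option could be wired through with argparse, keeping papercheck_report.html as the default. The functions parse_args and write_report are hypothetical and stand in for whatever the script actually uses.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; -o falls back to the current fixed name."""
    parser = argparse.ArgumentParser(prog="papercheck")
    parser.add_argument("file", help="input file (.tex, .txt, .md, .pdf)")
    parser.add_argument("-o", "--output", metavar="FILE",
                        default="papercheck_report.html",
                        help="write report to FILE")
    return parser.parse_args(argv)


def write_report(html, path):
    """Write the report to the path chosen on the command line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
```

The caller would then pass `args.output` to `write_report` instead of a hardcoded filename.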

Check if variables are used in math mode

Another useful check for the TeX checker would be to identify variables that appear in the running text without math mode ($variable$). This mainly happens with variables that have no subscript or superscript and consist of a single character.
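
The proposed check could be sketched like this: collect single-letter variables that appear as $x$ somewhere in the document, then look for the same letter as a stand-alone word outside math mode. The sketch handles only inline $...$ math and ignores the ordinary words "a", "A" and "I"; the function name is hypothetical.

```python
import re


def variables_outside_math(tex):
    """Return single-letter math variables that also occur in plain text."""
    # Letters that are typeset at least once as inline math, e.g. $v$.
    math_vars = set(re.findall(r"\$\s*([A-Za-z])\s*\$", tex))
    # Remove inline math so only the running text remains.
    plain = re.sub(r"\$[^$]*\$", " ", tex)
    # Stand-alone letters, not part of a word, a command or a contraction.
    words = set(re.findall(r"(?<![\\\w'])([A-Za-z])(?!\w)", plain))
    return sorted((math_vars & words) - {"a", "A", "I"})
```

For "The speed $v$ grows. We plot v over time, where $t$ is time.", the result is `["v"]`.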

Executable zip file is broken

The __main__.py searches for a subfolder papercheck which is not present in the zip file:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./paperchecker/__main__.py", line 67, in <module>
ModuleNotFoundError: No module named 'papercheck'

Line 67 of __main__.py reads: from papercheck.checker.grammar import checkGrammar
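
For the import above to work from an executable zip, the archive needs __main__.py at its root and a papercheck/ directory next to it, because Python puts the archive itself on sys.path. A small check along these lines could run as part of the build; it is a sketch with hypothetical function names, not part of the existing Makefile.

```python
import zipfile


def missing_entries(names, package="papercheck"):
    """Return required archive entries that are absent from names."""
    missing = []
    if "__main__.py" not in names:
        missing.append("__main__.py")
    if not any(name.startswith(package + "/") for name in names):
        missing.append(package + "/")
    return missing


def check_archive(zip_path, package="papercheck"):
    """Inspect a zip file on disk and report what is missing."""
    with zipfile.ZipFile(zip_path) as archive:
        return missing_entries(archive.namelist(), package)
```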

Color Legend in the report

It would be nice if each generated report had a legend at the top explaining the meaning of the different colors. Additionally, for each option in the readme, it would be nice to add a line like "The results are then shown in {color}".
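
A legend like this could be generated from a single mapping of check names to colors, so the report and the readme stay in sync. The colors and the function name below are placeholders, not the ones the report actually uses.

```python
from html import escape

# Placeholder colors; the real report may use different ones.
LEGEND = {"spelling": "red", "grammar": "orange", "style": "blue"}


def render_legend(colors):
    """Render an HTML list with one colored swatch per check."""
    items = "".join(
        f'<li><span style="background:{escape(color)}">&nbsp;&nbsp;</span> '
        f"{escape(label)}</li>"
        for label, color in colors.items()
    )
    return f'<ul class="legend">{items}</ul>'
```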

False positives

Here I list some false positives that were reported. They could easily be reproduced in test cases.

  • Text: by at least
    Output: You have repeated an adposition, which is probably not intended.: by at → by
  • Text: from \cite{...} with
    Output: You have repeated an adposition, which is probably not intended.: from with → from
  • Text: of \cite{...}
    Output: Do not use prepositions to end your sentences.: of. → .
  • Text: for in-depth
    Output: You have repeated an adposition, which is probably not intended.: for in- → for -
  • Text: so far
    Output: Informal word, could be substituted. so ⇒ Therefore,
  • Text: \num{554400}
    Output: Large number, you should use a thousand separator.: 554400
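
The two \cite cases suggest that the citation is dropped before the grammar rules run, which leaves "from with" and "of." behind. That cause is a guess. If it holds, replacing citations with a neutral placeholder instead of deleting them would avoid both reports. A sketch with a hypothetical function name and placeholder:

```python
import re


def mask_citations(text, placeholder="REF"):
    """Replace \\cite{...} and \\ref{...} with a placeholder noun."""
    return re.sub(r"\\(?:cite|ref)\{[^{}]*\}", placeholder, text)
```

After masking, "from \cite{foo} with" reads "from REF with", so the two prepositions are no longer adjacent.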
