emareg / paper-checker

Find simple grammar mistakes in scientific documents.

PaperCheck

PaperCheck is a Python script that searches for simple grammar mistakes in scientific English texts. Unlike other grammar checkers, it is free and tailored to scientific texts such as papers. It can find words that pass a spell check but are most likely not intended in a scientific context, such as "angel" instead of "angle".

Getting Started

git clone https://github.com/emareg/paper-checker.git
cd paper-checker
make setup

Afterwards, you can use the script in two ways:

1. Run the python file

python3 papercheck.py -sgy example/testfile.tex

2. Compile as a stand-alone executable (Unix only)

make
./papercheck -sgy example/testfile.tex

Supported file types: .tex .txt .md .pdf

The issues found are displayed in the terminal and also written to papercheck_report.html

System wide installation

make install

This will copy the stand-alone executable to ~/.local/bin

Install as a Python package

pip3 install .
cd example
python3 -m papercheck -sgy testfile.tex

Features

Spell Checker (`-s` option)

Will highlight spelling errors. The script uses a small basic dictionary plus some additional self-made dictionaries for terms such as

technical: “microcontroller”, “superframe”, “bitmask”
mathematical: “eigenvector”, “linearization”
chemical: todo

The larger standard dictionaries are unsuitable because they

contain errors such as “longitudianl” or “schemati”
mask informal plural forms such as “vertexes” which should be “vertices”
include obsolete forms such as “latence” which should be “latency”

Grammar Checker (`-g` option)

Will highlight simple grammar mistakes such as

misuse of “a” or “an”
doubled auxiliary verbs (e.g. “is are”)
doubled determiners (e.g. “this the”)
confused “then” vs. “than”
confused “to” vs. “too”
wrong person-verb combination (e.g. “This were”)
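
Two of the checks listed above (doubled auxiliary verbs and doubled determiners) can be illustrated with plain regular expressions. This is only a minimal sketch and not PaperCheck's actual implementation: the word lists and the function name `find_doubled_words` are assumptions made for the example.

```python
import re

# Hypothetical word lists; the lists used by the real checker are not shown here.
AUXILIARIES = r"(?:is|are|was|were|am)"
DETERMINERS = r"(?:the|this|these|those|a|an)"

DOUBLED_PATTERNS = {
    "doubled auxiliary verb": re.compile(
        rf"\b{AUXILIARIES}\s+{AUXILIARIES}\b", re.IGNORECASE
    ),
    "doubled determiner": re.compile(
        rf"\b{DETERMINERS}\s+{DETERMINERS}\b", re.IGNORECASE
    ),
}


def find_doubled_words(text):
    """Return (label, matched text) pairs for each suspicious word pair."""
    findings = []
    for label, pattern in DOUBLED_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group(0)))
    return findings


print(find_doubled_words("The result is are shown in this the figure."))
```

Such patterns are crude: they know nothing about sentence structure, so every word added to the lists can introduce new false positives.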

Style Checker (`-y` option)

Will highlight language that could be improved such as

wrong words in scientific context (e.g. “angle” vs. “angel”)
unexplained acronyms
less formal words (e.g. use “entire” instead of “whole”)
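
The acronym check could work roughly as follows. This sketch assumes that an acronym counts as explained when it appears at least once in parentheses, as in "wireless sensor network (WSN)". The function name `unexplained_acronyms` and this rule are assumptions for illustration, not the script's confirmed behaviour.

```python
import re


def unexplained_acronyms(text):
    """Return acronyms that never appear in parentheses after a long form."""
    # Acronyms introduced as "... long form (ABC)".
    explained = set(re.findall(r"\(([A-Z]{2,})\)", text))
    # Every run of two or more capital letters counts as an acronym use.
    used = set(re.findall(r"\b[A-Z]{2,}\b", text))
    return sorted(used - explained)


sample = "A wireless sensor network (WSN) uses TDMA. The WSN is small."
print(unexplained_acronyms(sample))
```

A stricter version would also require the parenthesised definition to come before the first use.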

Plagiarism Checker (`-p` option)

experimental!

The script will try to find significant sentences, which are then compared to Google search results. This is a very poor approach but useful as a minimal effort with zero cost.

TeX checker

When you run the script on .tex files, it will also check for certain TeX problems such as

unused labels
missing periods in figure/table captions
math operators typed without a backslash in math mode, e.g. $sin$ instead of $\sin$
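
Two of these TeX checks can be sketched with regular expressions. The function names, the list of reference commands, and the list of operator names below are assumptions for illustration; the actual checker may use different rules.

```python
import re


def unused_labels(tex):
    """Return labels that are defined but never referenced."""
    labels = set(re.findall(r"\\label\{([^}]+)\}", tex))
    referenced = set()
    # Assumed set of referencing commands; extend as needed.
    ref_pattern = r"\\(?:ref|eqref|autoref|cref|Cref|pageref)\{([^}]+)\}"
    for keys in re.findall(ref_pattern, tex):
        referenced.update(key.strip() for key in keys.split(","))
    return sorted(labels - referenced)


def bare_operators(tex):
    """Find operator names such as 'sin' written without a backslash in $...$."""
    hits = []
    for math in re.findall(r"\$([^$]+)\$", tex):
        hits += re.findall(
            r"(?<![\\a-zA-Z])(?:sin|cos|tan|log|exp|max|min)(?![a-zA-Z])", math
        )
    return hits


sample = (
    r"\section{A}\label{sec:a} See Fig.~\ref{fig:b}. "
    r"\label{fig:b} $y = sin(x) + \cos(x)$"
)
print(unused_labels(sample))
print(bare_operators(sample))
```

The sketch only looks at inline `$...$` math and ignores display environments such as `equation`.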

Related Work

LanguageTool: Grammar, Style and Spell Checker written in Java
textidote: uses LanguageTool on .tex files

So why not use LanguageTool? It is large, slow and not tailored for scientific/technical texts. However, I recommend using LanguageTool in addition.

paper-checker's Issues

[question][plagiarism] Would it be better to use the Google Search API instead of normal Google search?

Background information

Google provides an API endpoint for search queries that returns JSON-based responses containing all the information that the plagiarism checker needs. For example, searching for lectures in the custom search engine of the API documentation returns the following:

{
  "kind": "customsearch#search",
  "url": {
    "type": "application/json",
    "template": "https://...."
  },
  "queries": {
    "request": [
      {
//...
      }
    ],
    "nextPage": [
      {
        "title": "Google Custom Search - lectures",
        "totalResults": "781000000",
        "searchTerms": "lectures",
        "count": 10,
        "startIndex": 11,
        "inputEncoding": "utf8",
        "outputEncoding": "utf8",
        "safe": "off",
        "cx": "017576662512468239146:omuauf_lfve"
      }
    ]
  },
  "context": {
    "title": "CS Curriculum",
    "facets": [
      [
        {
          "anchor": "Lectures",
          "label": "lectures",
          "label_with_op": "more:lectures"
        }
      ],
      [
        {
          "anchor": "Assignments",
          "label": "assignments",
          "label_with_op": "more:assignments"
        }
      ],
      [
        {
          "anchor": "Reference",
          "label": "reference",
          "label_with_op": "more:reference"
        }
      ]
    ]
  },
  "searchInformation": {
    "searchTime": 0.350489,
    "formattedSearchTime": "0.35",
    "totalResults": "781000000",
    "formattedTotalResults": "781,000,000"
  },
  "items": [
    {
      "kind": "customsearch#result",
      "title": "Introduction to Machine Learning",
      "htmlTitle": "Introduction to Machine Learning",
      "link": "https://see.stanford.edu/Course/CS229",
      "displayLink": "see.stanford.edu",
      "snippet": "Slides from Andrew's lecture on getting machine learning algorithms to work in \npractice can be found here. Previous projects: A list of last year's final projects ...",
      "htmlSnippet": "Slides from Andrew&#39;s \u003cb\u003electure\u003c/b\u003e on getting machine learning algorithms to work in \u003cbr\u003e\npractice can be found here. Previous projects: A list of last year&#39;s final projects&nbsp;...",
      "cacheId": "vB97xQjhxVcJ",
      "formattedUrl": "https://see.stanford.edu/Course/CS229",
      "htmlFormattedUrl": "https://see.stanford.edu/Course/CS229",
      "pagemap": {
        "cse_thumbnail": [
          {
            "src": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ2_-hJWbczpcTOUvBJuymIrbHevHrTlAL-EhyPo--xfmFh0F0Ts8iCmOc",
            "width": "148",
            "height": "208"
          }
        ],
        "metatags": [
          {
            "viewport": "width=device-width, initial-scale=1"
          }
        ],
        "cse_image": [
          {
            "src": "https://see.stanford.edu/Content/Images/Instructors/ng.jpg"
          }
        ]
      },
      "labels": [
        {
          "name": "lectures",
          "displayName": "Lectures",
          "label_with_op": "more:lectures"
        }
      ]
    },
// There are more results here
  ]
}

For more information: https://developers.google.com/custom-search/v1/overview
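
As a rough sketch of how the plagiarism checker could consume such a response, the following uses only the Python standard library. The `key`, `cx` and `q` query parameters belong to the Custom Search JSON API; the function names are assumptions made for this example and are not part of paper-checker.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id):
    """Build the request URL for one search query."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return f"{API_ENDPOINT}?{urlencode(params)}"


def extract_results(response):
    """Reduce a decoded JSON response to (title, link, snippet) tuples."""
    return [
        (item["title"], item["link"], item.get("snippet", ""))
        for item in response.get("items", [])
    ]


def search(query, api_key, engine_id):
    """Perform the request; needs network access and valid credentials."""
    with urlopen(build_search_url(query, api_key, engine_id)) as reply:
        return extract_results(json.load(reply))
```

Wrapping a sentence in double quotes before passing it as `query` asks for an exact-phrase match, which is closer to what a plagiarism check needs.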

Question

Should this type of search replace the current search, or should it be added as an additional search?

Advantages

No need to parse raw HTML/DOM to get the information needed
Being able to create custom search engines that can focus on specific websites: This can improve the detection capability

Disadvantages

Need to create a custom search engine using a Google account
Creating an API key: This is the most troublesome part. Either we create an API key and push it into the repository, which means that anyone can use it for searches that are then attributed to the owner of the key (probably @emareg), or everyone who wants to use the plagiarism checker has to supply their own API key. The latter is clean and safe but means extra work for every user.
There is a limitation on how many requests can be done for free for a certain search engine
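
If every user supplies their own key, one low-effort way is to read it from environment variables so that nothing secret ends up in the repository. The variable names `PAPERCHECK_GOOGLE_API_KEY` and `PAPERCHECK_GOOGLE_CX` and the function name are hypothetical and chosen only for this sketch.

```python
import os


def load_api_credentials(environ=os.environ):
    """Return (api_key, engine_id), or None if either value is missing."""
    # Hypothetical variable names; nothing in the project defines them yet.
    api_key = environ.get("PAPERCHECK_GOOGLE_API_KEY")
    engine_id = environ.get("PAPERCHECK_GOOGLE_CX")
    if not api_key or not engine_id:
        return None
    return api_key, engine_id


credentials = load_api_credentials()
if credentials is None:
    print("No API credentials found, falling back to the normal search.")
```

Returning None instead of raising lets the caller keep the current search as a fallback when no key is configured.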

Dependency list is missing

Currently, bs4 must be installed for the plagiarism checker. It would be good to have a requirements.txt or another way to install dependencies automatically.
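
Until a dependency list exists, the script could at least report missing modules before the plagiarism check starts. This is a sketch with a hypothetical helper name; `bs4` is the import name of the `beautifulsoup4` package.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["bs4"])
if missing:
    print("Plagiarism check disabled, missing modules:", ", ".join(missing))
```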

Multi-line captions

Hey,
in multiline captions the checker complains about missing dots.

\caption{
    abcdef hijkl.
}

It instead recommends:

\caption{
    abcdef hijkl.
.}

Also, it recommends "an unanimous", although "a unanimous" is correct: the word begins with a "y" sound, as in "a year".

Thanks!
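
One way to avoid the multi-line caption problem is to extract the caption body with brace matching and strip surrounding whitespace before testing for the final period. This is a sketch under assumptions: the function names are invented here, and escaped braces such as `\{` are not handled.

```python
def caption_bodies(tex):
    """Return the text inside each \\caption{...}, honouring nested braces."""
    marker = "\\caption{"
    bodies = []
    start = tex.find(marker)
    while start != -1:
        begin = start + len(marker)
        depth, i = 1, begin
        while i < len(tex) and depth:
            if tex[i] == "{":
                depth += 1
            elif tex[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:  # skip captions with unbalanced braces
            bodies.append(tex[begin:i - 1])
        start = tex.find(marker, i)
    return bodies


def caption_missing_period(body):
    """A caption is fine if its last non-whitespace character is a period."""
    return not body.strip().endswith(".")


sample = "\\caption{\n    abcdef hijkl.\n}\n\\caption{No dot \\textbf{here}}"
print([caption_missing_period(body) for body in caption_bodies(sample)])
```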

Plagiarism Checker does not output to the report

It is misleading that, even though the plagiarism check is run and logged to the terminal, the results do not show up in the HTML report. It should at least be documented that its output appears only in the terminal/stdout.

names_geo.dic & names_people.dic files are missing

In spelling.py, the files names_geo.dic and names_people.dic are used on lines 207 and 208, but they are missing from the src/dictionary folder.

dictionary = read_dictionary(dictionary, 'src/dictionary/names_geo.dic')
dictionary = read_dictionary(dictionary, 'src/dictionary/names_people.dic')

hyphenated wrapped words not resolving

Hyphenated words that just happen to fall at the end of a line are reconstructed without the hyphen.

In my paper I have this example.
"...but with very contrasting power-
performance thread....."

This becomes
"but with very contrasting powerperformance thread"

after pdf2text. No idea if it's solvable but thought I'd let you know.
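
If the extracted text still contains the original line breaks, a dictionary lookup can decide whether the hyphen belongs to the word. This is a heuristic sketch, not a fix taken from the project: the function name and the `dictionary` argument (a set of known lower-case words) are assumptions.

```python
import re


def join_wrapped_words(text, dictionary):
    """Rejoin words split at a line end; keep the hyphen unless the joined
    word is a known dictionary word."""

    def rejoin(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in dictionary:
            return joined  # ordinary line-break hyphenation
        return f"{left}-{right}"  # probably a real compound

    return re.sub(r"(\w+)-\s*\n\s*(\w+)", rejoin, text)


words = {"implementation"}
text = "contrasting power-\nperformance thread and an implemen-\ntation"
print(join_wrapped_words(text, words))
```

The heuristic fails for compounds whose joined form happens to be a dictionary word, so it reduces the problem rather than solving it.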

Text statistics ignores LaTeX environments

Reproduce: run papercheck on a .tex file including a table (e.g. example/testfile.tex)
Expected: textstats should count at least one table
Actual: textstats shows 0 as table count

Problem:
For some reason, textstats.py does not count \begin{table}, even though the regex seems to be correct.
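
For comparison, a minimal environment counter applied to the raw .tex source could look like this. It is a sketch with an invented function name and not the code in textstats.py. If the count is taken after LaTeX markup has already been stripped from the text, the result would be 0; that is only one possible cause and has not been verified.

```python
import re


def count_environments(tex, name):
    """Count \\begin{name} and \\begin{name*} in raw LaTeX source."""
    pattern = re.compile(r"\\begin\{" + re.escape(name) + r"\*?\}")
    return len(pattern.findall(tex))


sample = (
    r"\begin{table}[h] x \end{table} "
    r"\begin{table*} y \end{table*} "
    r"\begin{tabular}{ll}\end{tabular}"
)
print(count_environments(sample, "table"))
```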

Output file argument is unused

The CLI provides the option -o FILE, --output FILE (write report to FILE), but it is never used. The output filename is hardcoded.
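
Wiring the option through could look like the following argparse sketch. The default filename matches the report name mentioned in the README; the function names and the single positional argument are assumptions and do not mirror the real CLI.

```python
import argparse


def parse_args(argv=None):
    """Parse a reduced, illustrative version of the command line."""
    parser = argparse.ArgumentParser(prog="papercheck")
    parser.add_argument("file", help="document to check")
    parser.add_argument(
        "-o", "--output", metavar="FILE",
        default="papercheck_report.html",
        help="write report to FILE",
    )
    return parser.parse_args(argv)


def write_report(html, args):
    """Write the report to the path chosen on the command line."""
    with open(args.output, "w", encoding="utf-8") as handle:
        handle.write(html)


args = parse_args(["paper.tex", "-o", "out.html"])
print(args.output)
```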

Check if variables are used in math mode

Another useful check for the TeX checker would be to identify variables which are used in the text but not in math mode ($variable$). This mainly happens if variables don't have a subscript or superscript and are just a single character.
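
A possible heuristic for this check: collect the single-letter symbols that occur inside `$...$` and then flag the same letters when they appear as standalone tokens in the surrounding prose. The function name is hypothetical, and the letters "a", "A" and "I" are skipped because they are ordinary English words.

```python
import re


def variables_outside_math(tex):
    """Return single-letter math symbols that also appear bare in the prose."""
    math_vars = set()
    for math in re.findall(r"\$([^$]+)\$", tex):
        # Single letters that are not part of a command or a longer name.
        math_vars.update(
            re.findall(r"(?<![\\a-zA-Z])[a-zA-Z](?![a-zA-Z])", math)
        )
    # Blank out inline math, then look for standalone letters in the rest.
    prose = re.sub(r"\$[^$]+\$", " ", tex)
    candidates = re.findall(r"(?<![\\\w])[b-zB-HJ-Z](?![\w'])", prose)
    return sorted(set(candidates) & math_vars)


sample = "Let $n$ be the size and $k$ the rank. Then n grows while a $k$ stays fixed."
print(variables_outside_math(sample))
```

Abbreviations such as "e.g." also contain standalone letters, so a real check would need an exception list.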

Spellcheck color in the line number does not match the text highlighting color

When using the -s option for spellchecking, with or without the other options, the spelling errors are marked in yellow/orange next to the line numbers on the left, whereas they are marked in pink in the text.

Executable zip file is broken

The __main__.py searches for a subfolder papercheck which is not present in the zip file:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./paperchecker/__main__.py", line 67, in <module>
ModuleNotFoundError: No module named 'papercheck'

__main__.py:67 states from papercheck.checker.grammar import checkGrammar

Color Legend in the report

It would be nice if each generated report had a legend at the top that explains the meaning of the different colors. Additionally, the readme could include a line for each option such as "The results are then shown in {color}".

False positives

Here I list some false positives that were reported. They could easily be reproduced in test cases.

Text	Output
by at least	You have repeated an adposition, which is probably not intended.: by at → by
from \cite{...} with	You have repeated an adposition, which is probably not intended.: from with → from
of \cite{...}	Do not use prepositions to end your sentences.: of. → .
for in-depth	You have repeated an adposition, which is probably not intended.: for in- → for -
so far	Informal word, could be substituted. so ⇒ Therefore,
\num{554400}	Large number, you should use a thousand separator.: 554400