emareg / paper-checker

Find simple grammar mistakes in scientific documents.


grammar-checker spell-checker academic latex

paper-checker's Issues

Color Legend in the report

It would be nice if each generated report had a legend at the top explaining what the different colors mean. Additionally, in the README, it would be nice to add a line for each option such as "The results are then shown in {color}".

hyphenated wrapped words not resolving

Hyphenated words that happen to fall at the end of a line are reconstructed without the hyphen.

In my paper I have this example:
"...but with very contrasting power-
performance thread..."

This becomes
"but with very contrasting powerperformance thread"

after pdf2text. No idea if it's solvable but thought I'd let you know.
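One possible approach, sketched below, is to drop the line-end hyphen only when the joined form is a known word and to keep it otherwise. This is a minimal sketch, not the project's code: `join_wrapped_words` and `KNOWN_WORDS` are hypothetical names, and a real implementation would presumably consult the checker's own dictionaries.

```python
import re

# Hypothetical word list; a real checker would use its own dictionary files.
KNOWN_WORDS = {"performance", "power", "implementation"}

# A word, a hyphen at the very end of a line, then the continuation word.
_LINE_END_HYPHEN = re.compile(r"(\w+)-\n\s*(\w+)")


def join_wrapped_words(text, known_words=KNOWN_WORDS):
    """Rejoin words that were split by a hyphen at a line end.

    The hyphen is dropped only when the joined form is a known word;
    otherwise it is kept, assuming the hyphen belongs to a real compound.
    """
    def _join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in known_words:
            return joined
        return left + "-" + right

    return _LINE_END_HYPHEN.sub(_join, text)
```

With this heuristic, "power-" followed by "performance" stays "power-performance", while a split such as "implemen-" / "tation" is joined into "implementation". Compounds whose joined form also happens to be a dictionary word would still be merged incorrectly.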

Dependency list is missing

Currently, bs4 has to be installed manually for the plagiarism checker. It would be good to have a requirements.txt or some other way to install dependencies automatically.

[question][plagiarism] Would it be better to use the Google Search API instead of normal Google search?

Background information

Google provides an API endpoint for search queries that returns JSON responses containing all the information the plagiarism checker needs. For example, searching for "lectures" in the custom search engine from the API documentation returns the following:

{
  "kind": "customsearch#search",
  "url": {
    "type": "application/json",
    "template": "https://...."
  },
  "queries": {
    "request": [
      {
//...
      }
    ],
    "nextPage": [
      {
        "title": "Google Custom Search - lectures",
        "totalResults": "781000000",
        "searchTerms": "lectures",
        "count": 10,
        "startIndex": 11,
        "inputEncoding": "utf8",
        "outputEncoding": "utf8",
        "safe": "off",
        "cx": "017576662512468239146:omuauf_lfve"
      }
    ]
  },
  "context": {
    "title": "CS Curriculum",
    "facets": [
      [
        {
          "anchor": "Lectures",
          "label": "lectures",
          "label_with_op": "more:lectures"
        }
      ],
      [
        {
          "anchor": "Assignments",
          "label": "assignments",
          "label_with_op": "more:assignments"
        }
      ],
      [
        {
          "anchor": "Reference",
          "label": "reference",
          "label_with_op": "more:reference"
        }
      ]
    ]
  },
  "searchInformation": {
    "searchTime": 0.350489,
    "formattedSearchTime": "0.35",
    "totalResults": "781000000",
    "formattedTotalResults": "781,000,000"
  },
  "items": [
    {
      "kind": "customsearch#result",
      "title": "Introduction to Machine Learning",
      "htmlTitle": "Introduction to Machine Learning",
      "link": "https://see.stanford.edu/Course/CS229",
      "displayLink": "see.stanford.edu",
      "snippet": "Slides from Andrew's lecture on getting machine learning algorithms to work in \npractice can be found here. Previous projects: A list of last year's final projects ...",
      "htmlSnippet": "Slides from Andrew&#39;s \u003cb\u003electure\u003c/b\u003e on getting machine learning algorithms to work in \u003cbr\u003e\npractice can be found here. Previous projects: A list of last year&#39;s final projects&nbsp;...",
      "cacheId": "vB97xQjhxVcJ",
      "formattedUrl": "https://see.stanford.edu/Course/CS229",
      "htmlFormattedUrl": "https://see.stanford.edu/Course/CS229",
      "pagemap": {
        "cse_thumbnail": [
          {
            "src": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ2_-hJWbczpcTOUvBJuymIrbHevHrTlAL-EhyPo--xfmFh0F0Ts8iCmOc",
            "width": "148",
            "height": "208"
          }
        ],
        "metatags": [
          {
            "viewport": "width=device-width, initial-scale=1"
          }
        ],
        "cse_image": [
          {
            "src": "https://see.stanford.edu/Content/Images/Instructors/ng.jpg"
          }
        ]
      },
      "labels": [
        {
          "name": "lectures",
          "displayName": "Lectures",
          "label_with_op": "more:lectures"
        }
      ]
    },
// There are more results here
  ]
}

For more information: https://developers.google.com/custom-search/v1/overview
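To illustrate how little parsing such a response needs, here is a minimal sketch that pulls the title, link, and snippet out of each entry in "items". The function name `extract_results` is an assumption for illustration; only the field names are taken from the response shown above.

```python
import json


def extract_results(response_text):
    """Return (title, link, snippet) tuples from a Custom Search JSON response."""
    data = json.loads(response_text)
    # Default to an empty list in case "items" is missing from the response.
    return [
        (item.get("title", ""), item.get("link", ""), item.get("snippet", ""))
        for item in data.get("items", [])
    ]


# A trimmed-down stand-in for the response above.
sample = json.dumps({
    "items": [
        {
            "title": "Introduction to Machine Learning",
            "link": "https://see.stanford.edu/Course/CS229",
            "snippet": "Slides from Andrew's lecture ...",
        }
    ]
})

for title, link, snippet in extract_results(sample):
    print(title, link)
```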

Question

Should this type of search replace the current search, or should it be added as an additional search?

Advantages

No need to parse raw HTML/DOM to get the information needed
Being able to create custom search engines that can focus on specific websites: This can improve the detection capability

Disadvantages

Need to create a custom search engine using a Google account
Creating an API key: This is the most troublesome part. Either we create an API key and push it to the repository, which means anyone can use it for searches that are then linked to the owner of the key (probably @emareg), or everyone who wants to use the plagiarism checker has to add their own API key. The second option is clean and safe, but it means extra work for every user.
There is a limit on how many requests can be made for free for a given search engine
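The per-user key option could be sketched as follows: the key and the search engine ID are read from environment variables, and the API is used only when both are set. The variable names `PAPERCHECKER_GOOGLE_API_KEY` and `PAPERCHECKER_GOOGLE_CX` and the function `build_search_url` are hypothetical; the endpoint and the `key`, `cx`, and `q` parameters are those of the Custom Search JSON API.

```python
import os
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, env=os.environ):
    """Build a Custom Search request URL, or return None if no key is configured."""
    key = env.get("PAPERCHECKER_GOOGLE_API_KEY")  # hypothetical variable name
    cx = env.get("PAPERCHECKER_GOOGLE_CX")        # hypothetical variable name
    if not key or not cx:
        # No credentials: the caller could fall back to the existing search.
        return None
    return API_ENDPOINT + "?" + urlencode({"key": key, "cx": cx, "q": query})
```

This way no key needs to be committed to the repository, and users without a key keep the current behavior.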

names_geo.dic & names_people.dic files are missing

In spelling.py, the files names_geo.dic and names_people.dic are used on lines 207 and 208,
but they are missing from the src/dictionary folder.

dictionary = read_dictionary(dictionary, 'src/dictionary/names_geo.dic')
dictionary = read_dictionary(dictionary, 'src/dictionary/names_people.dic')
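Until the files are added, one workaround might be to treat them as optional. The sketch below is only an assumption about how that could look: `read_optional_dictionaries` is a hypothetical helper, and `reader` stands in for a function with the same call shape as `read_dictionary` above (it takes the dictionary and a path and returns the updated dictionary).

```python
import os
import sys


def read_optional_dictionaries(dictionary, paths, reader):
    """Apply reader to each existing path; warn about and skip missing files."""
    for path in paths:
        if os.path.isfile(path):
            dictionary = reader(dictionary, path)
        else:
            print("warning: dictionary file not found, skipping: " + path,
                  file=sys.stderr)
    return dictionary
```

It could be called as `read_optional_dictionaries(dictionary, ['src/dictionary/names_geo.dic', 'src/dictionary/names_people.dic'], read_dictionary)`.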

Multi-line captions

Hey,
in multi-line captions the checker complains about missing periods.

\caption{
    abcdef hijkl.
}

It instead recommends:

\caption{
    abcdef hijkl.
.}

Also, it recommends "an unanimous", although "a unanimous" is correct ("unanimous" begins with a "y" sound, as in "a year").

Thanks!
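The caption case looks like a whitespace problem: the period is followed by a newline before the closing brace. A minimal sketch of a check that ignores surrounding whitespace is shown below. `caption_needs_period` is a hypothetical name, and the sketch assumes the caption body has already been extracted from `\caption{...}`.

```python
def caption_needs_period(caption_body):
    """Return True if a non-empty caption does not end with a period.

    Leading and trailing whitespace, including newlines, is ignored,
    so a period followed by a line break still counts.
    """
    text = caption_body.strip()
    return bool(text) and not text.endswith(".")
```

For the caption above, the body is "\n    abcdef hijkl.\n", which strips to "abcdef hijkl." and therefore needs no extra period.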

Spellcheck color in the line number does not match the text highlighting color

When using the -s option for spellchecking, with or without the other options, the spelling errors are marked with yellow/orange on the left at the line numbers, whereas they are marked with pink in the text.

Plagiarism Checker does not output to the report

It is misleading that, even though the plagiarism check is run and logged to the terminal, the results do not show up in the HTML report. It should at least be documented that its output appears only in the terminal/stdout.

False positives

Here I list some false positives that were reported. They could easily be reproduced in test cases.

Text	Output
by at least	You have repeated an adposition, which is probably not intended.: by at → by
from \cite{...} with	You have repeated an adposition, which is probably not intended.: from with → from
of \cite{...}	Do not use prepositions to end your sentences.: of. → .
for in-depth	You have repeated an adposition, which is probably not intended.: for in- → for -
so far	Informal word, could be substituted. so ⇒ Therefore,
\num{554400}	Large number, you should use a thousand separator.: 554400
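The rows involving \cite{...} and \num{...} suggest that the LaTeX commands are dropped before the grammar rules run, leaving "from with" or a bare number behind. One possible mitigation, sketched here as an assumption rather than a description of the checker's internals, is to replace such commands with neutral placeholders first. `mask_latex_commands` and the placeholder words are hypothetical.

```python
import re

# Match \cite{...} and \num{...} with no nested braces in the argument.
_CITE = re.compile(r"\\cite\{[^}]*\}")
_NUM = re.compile(r"\\num\{[^}]*\}")


def mask_latex_commands(text):
    """Replace \\cite{...} and \\num{...} with placeholder words.

    The placeholders keep a token between the neighbouring words, so
    "from \\cite{x} with" is no longer seen as "from with".
    """
    text = _CITE.sub("CITATION", text)
    return _NUM.sub("NUMBER", text)
```

This would not help with "by at least", "for in-depth", or "so far", which would need exceptions in the rules themselves.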

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./paperchecker/__main__.py", line 67, in <module>
ModuleNotFoundError: No module named 'papercheck'

__main__.py:67 reads: from papercheck.checker.grammar import checkGrammar
