sveetch / py-html-checker

A tool to raise quality issues about HTML pages

License: MIT License


py-html-checker's Introduction

Py Html Checker

This is an interface around the Nu Html Checker (v.Nu) to validate documents, either from a list of pages or from a sitemap.

Requires

Python>=3.4;
Java>=8 (openjdk8 or oraclejdk8);
Virtualenv (recommended);
Pip (recommended);

Dependencies

requests;
click>=7.0,<8.0 (CLI only);
colorama (CLI only);
colorlog (CLI only);
Jinja2>=2.10,<3.0 (Jinja only);
Pygments (Jinja only);

Install

pip install py-html-checker[cli,jinja]

If you don't plan to use it from the command line (for example, only as a module) or for HTML export, you can skip the cli and jinja extras:

pip install py-html-checker

Usage

Validate one or many pages

With the page command you can validate one or many pages. The command accepts one or many paths, and each path can be either a URL or a file path (absolute or relative to your current location):

htmlcheck page ping.html
htmlcheck page http://perdu.com
htmlcheck page ping.html http://perdu.com foo/bar.html

Validate all paths from a sitemap

With the site command you can validate every page referenced in a sitemap.xml file. The command accepts only one argument, the sitemap path, which can be either a URL or a file path (absolute or relative to your current location).

Note that for a sitemap file, the URLs it references must be absolute or relative to your current location. For a sitemap URL, the URLs it references must be absolute (with a leading http):

htmlcheck site sitemap.xml
htmlcheck site http://perdu.com/sitemap.xml
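
As a rough illustration of what reading a sitemap involves, the sketch below extracts the referenced URLs with the Python standard library. It is not the package's actual implementation: the function name `sitemap_urls` and the sample document are made up for the example, and only the sitemaps.org namespace is standard.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> values of a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, for illustration only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://perdu.com/</loc></url>
  <url><loc>http://perdu.com/foo/bar.html</loc></url>
</urlset>"""

print(sitemap_urls(sample))
```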

Manage verbosity

The default command line verbosity is the "Info" level. You can set it to the "Debug" level to also get more information about what the command line is doing:

htmlcheck -v 5 site sitemap.xml

Or make the output totally silent (beware that no errors will be written to the output except critical command line errors):

htmlcheck -v 0 site sitemap.xml

Common options

--destination: Directory path where report files are written. If no destination is given, every file is printed out. You can use a dot to write files to your current directory, a relative path or an absolute path. The path can start with ~ to point to your user home directory.
--exporter: Select the export format. The default format is logging, which just prints out report messages. There is also a json format to create JSON files for reports, and an html format to create HTML files.
--pack/--no-pack: Pack reports into a single file or not. The default is to pack everything into a single file. '--no-pack' creates a file for each report, then an export summary. If you don't plan to use packed export, it is recommended to define a destination directory with '--destination', otherwise every file will just be printed out in a single output. This option has no effect with the logging format.
--safe: Invalid paths won't break script execution, which will continue to the end. This is mostly for the rare case where an invalid source triggers a bug in report parsing or in the validator.
--split: Execute validation for each path in its own distinct instance. Useful for very large path lists, which may otherwise display nothing until every path has been validated. However, for small or moderate path lists it will be slower than packed execution.
--user-agent: A custom user-agent to use for every request.
--Xss: Java thread stack size. Useful in cases where the validator raises a 'StackOverflowError'. Set it to something like '512k'.

Specific formats options

html

--template-dir: A path to a directory holding your custom templates. Your template directory must contain the summary, report and audit main templates as well as a main.css file.

Specific 'site' options

--sitemap-only: For the site command only. This only fetches and parses the given sitemap path without validating its items, which is useful to check a sitemap before using it for validation.

CLI help

See the command line help for more details:

htmlcheck -h
htmlcheck page -h
htmlcheck site -h


py-html-checker's Issues

Option to use --filterpattern

The validator's --filterpattern option seems useful with frameworks such as Angular, to ignore messages about attributes the framework requires, like ng-something="".

This should be a multiple-value option whose values are automatically joined (the validator expects only one regex).

Consider adding this option to the CLI.
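
A minimal sketch of the joining step, assuming the values are simply combined into one alternation. The helper name `join_filter_patterns` and the sample patterns are hypothetical, and Python's `re` module is used here only as a sanity check, since the validator itself runs on Java, whose regex syntax may differ in details.

```python
import re


def join_filter_patterns(patterns):
    """Combine several regexes into a single alternation.

    Each pattern is wrapped in a non-capturing group so that an
    alternation inside one pattern cannot leak into its neighbours.
    """
    return "|".join("(?:{})".format(pattern) for pattern in patterns)


# Hypothetical patterns, for illustration only.
combined = join_filter_patterns([r".*ng-.*", r".*data-legacy.*"])
print(combined)

# Quick check that the combined expression still matches a sample message.
print(bool(re.fullmatch(combined, "Attribute ng-click not allowed on element a.")))
```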

Export report to HTML

Currently there is only a logging report.

We also need to export to HTML. The HTML has to be a single page, in a single file with everything included.

It has to be clean and ergonomic, and it would be nice for it to print correctly (so it can be printed to a PDF).

Jinja will probably be used as a dependency to build the HTML.

CLI option to silence errors on invalid paths

Currently, an invalid path aborts the page and site commands.

It may be useful to let this kind of error pass with an option like --safe: the error would be reported in the export and command execution would continue for valid paths.

At least for the site command, this would be really useful to cope with sitemaps that mention URLs which no longer exist or which fail.

Validation differences between CLI and online service

As can be seen on https://validator.w3.org/nu/?showsource=yes&showoutline=yes&doc=https%3A%2F%2Femencia.com

The reported validation differs between the CLI (html_checker or vnu.jar directly) and the online service (https://validator.w3.org/nu/). The former does not report an error about a missing "href" attribute on an <a> element, which the online validator does report.

There is no issue about this on the validator GitHub, nor any vnu.jar option to work around it. This needs to be covered in tests to find out what is going on (if no existing issue can be found, consider posting one on the validator repo).

Summary report for unique errors

When reporting on a huge site, it can become hard to make an inventory of all error kinds, since you need to crawl every page report with errors to collect each unique error kind.

For example, on 500 pages reporting a total of 200 errors, you may have 80 about an invalid link closing tag, 60 about a missing "alt" attribute on images, and the rest spread over 5 other kinds of errors.

In the end, there are only 7 different kinds of errors that you could fix quickly, but this is hard to know without reading all 200 errors.

There should be a summary of unique errors, grouped by error title rather than by content.
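
The grouping could be sketched as follows, assuming a report that maps each path to a list of message dictionaries with a "message" key. That structure, the function name `summarize_messages` and the sample data are assumptions made for the example, not the package's actual report format.

```python
from collections import Counter


def summarize_messages(report):
    """Count how many times each message text occurs across all pages.

    `report` is assumed to map each path to a list of message dicts,
    each carrying at least a "message" key.
    """
    counts = Counter()
    for messages in report.values():
        counts.update(item["message"] for item in messages)
    # Most frequent kinds first.
    return counts.most_common()


# Hypothetical report content, for illustration only.
sample_report = {
    "http://perdu.com/": [
        {"message": "Stray end tag."},
        {"message": "Missing alt attribute."},
    ],
    "foo/bar.html": [
        {"message": "Stray end tag."},
    ],
}

for message, total in summarize_messages(sample_report):
    print(total, message)
```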

Add tox usage

For sanity, check that the tox config is OK and works well. This will ensure that packaging and everything else is sound.

_JAVA_OPTIONS environment variable breaks JSON parsing

java -jar c:\users\acbaraka\appdata\local\programs\python\python39\lib\site-packages\html_checker/vnujar/vnu.jar --format json --exit-zero-always --user-agent Validator.nu/LV py-html-checker/0.4.0 C:\test.jhtml

Will produce an initial first line:

Picked up _JAVA_OPTIONS: -foobar

if the _JAVA_OPTIONS environment variable is set. Users can work around this by unsetting the variable but some detection for this might be best in the long run.
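
One possible form of such detection is to drop the JVM notice lines before parsing. The sketch below assumes the notice always starts with "Picked up " on its own line, as in the output shown above; the helper name `strip_jvm_notices` is hypothetical.

```python
import json


def strip_jvm_notices(raw):
    """Remove 'Picked up ...' lines that the JVM may print before the JSON."""
    kept = [
        line
        for line in raw.splitlines()
        if not line.startswith("Picked up ")
    ]
    return "\n".join(kept)


# Output shaped like the case described above, with a minimal JSON payload.
raw_output = 'Picked up _JAVA_OPTIONS: -foobar\n{"messages": []}'

print(json.loads(strip_jvm_notices(raw_output)))
```

An alternative would be to run the validator with an environment from which the variable has been removed, which avoids touching the output at all.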

Report export

At its current stage, the validator report is just an ordered dictionary of paths.

We need a human-readable export such as HTML (keep it clean and simple), and possibly terminal output using colored logs for work process info, validation errors and validation warnings.

Validator CLI

There is no command line yet to use the validator.

It should be able to validate either directly given paths or a sitemap.xml.

Nonexistent firstColumn key in some reports

Seen during a stress test using the sitemap.xml from a Richie instance.

This seems to be a rare case where firstColumn does not exist in a report. The details have not been investigated yet, but the source may be empty for this error.
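
A defensive way to read such keys is sketched below, assuming each message is a plain dictionary as found in the validator's JSON output. The helper name `message_position` is hypothetical.

```python
def message_position(message):
    """Return (line, column) from a message dict, using None for absent keys."""
    return message.get("lastLine"), message.get("firstColumn")


print(message_position({"lastLine": 12, "firstColumn": 4}))
print(message_position({"lastLine": 12}))
```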

Don't allow to validate the same path multiple times

If user has given the same path many times, the path should be validated only once.

Need to check whether it's better to filter this at the CLI level or in validator.validate().
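
Wherever the filtering ends up, it can be done while keeping the order in which the paths were given. A minimal sketch, with a hypothetical helper name:

```python
def unique_paths(paths):
    """Return paths without duplicates, keeping the first occurrence order."""
    seen = set()
    result = []
    for path in paths:
        if path not in seen:
            seen.add(path)
            result.append(path)
    return result


print(unique_paths(["ping.html", "http://perdu.com", "ping.html"]))
```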

Rename "report" module as "export"

The report itself is built by the validator, and the current report module only contains code to manage exports from a given report.

Option to execute each url on a new validator instance

Currently, the validator tool is executed once for all found paths (for both the site and page commands).

This is good for performance, but for a very long list of paths (50 or more) it keeps the script silent for a long time.

Tested on a sitemap with ~900 paths, the script stayed silent for ~6 min.

We need an option to force opening a new validator instance for each path, so that reports can be exported to the logging output in real time and show that the script is still working.
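
The idea can be sketched with a generator that hands back each report as soon as its path is done. Here `validate` stands for any callable that takes a list of paths and returns a report; it and `fake_validate` are placeholders for the example, not the package's API.

```python
def validate_each(paths, validate):
    """Yield (path, report) as soon as each single path has been validated."""
    for path in paths:
        # One call per path, so the caller can output results progressively.
        yield path, validate([path])


def fake_validate(paths):
    """Stand-in validator returning an empty message list for each path."""
    return {path: [] for path in paths}


for path, report in validate_each(["ping.html", "foo/bar.html"], fake_validate):
    print(path, report)
```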

Option for destination directory with date

Not sure yet how this should be implemented, but we need an option or trick to create a directory with a date stamp, to quickly make a new report without overwriting the previous one and without having to give a new directory name.

It seems this should be a new option like --date-dir which would just create a directory like 2020-12-31. Additionally, when combined with --destination, the date directory would be created in the defined destination (otherwise it is created in the current directory).
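
A minimal sketch of that behavior, using only the standard library. The function name `dated_destination` and its `today` parameter (there to make the result predictable) are hypothetical.

```python
import datetime
import os


def dated_destination(destination=".", today=None):
    """Create and return a directory named after the date, inside destination."""
    today = today or datetime.date.today()
    # isoformat() gives names like 2020-12-31.
    path = os.path.join(os.path.expanduser(destination), today.isoformat())
    os.makedirs(path, exist_ok=True)
    return path
```

For example, `dated_destination("reports")` would create `reports/` followed by the current date, and calling it with no argument would create the date directory in the current directory.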

Better commandline name

As suggested in https://smallstep.com/blog/the-poetics-of-cli-command-names/

In particular, remove the - from the name.

Not able to use library

I am trying to validate a URL using htmlcheck page http://someurl.xyz but I get an error like "Unable to reach interpreter to run validator: [WinError 2] The system cannot find the file specified".

Apart from this, is there any way to use the HTML checker from within Python code?
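
Since Java is listed in the requirements, one plausible cause of this error is that the java executable cannot be found on the PATH. This is an assumption, not a confirmed diagnosis. A quick way to check from Python, with a hypothetical helper name:

```python
import shutil


def find_java(executable="java"):
    """Return the full path of the Java executable, or None if not on PATH."""
    return shutil.which(executable)


location = find_java()
if location is None:
    print("java was not found on PATH")
else:
    print("java found at", location)
```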

Report live run option

This would bring a new feature to automatically run an HTTP server to serve a report.

The proposal is to add a new common argument like "--serve" (which would require additional optional arguments for the interface to listen on) for "page" and "site".

When used, after the report has been successfully built, it would start a lightweight HTTP server (like PyCharm) bound to the report directory.

Using this new argument without "exporter" set to "html" should raise a critical error, since there is no HTML file to serve with other formats.

The HTTP server output should display a message about using CTRL+C to stop it.
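
A sketch of such a server using Python's standard http.server module (Python 3.7 or later for the `directory` argument and `ThreadingHTTPServer`). The function names, default host and default port are assumptions made for the example.

```python
import functools
import http.server


def build_report_server(directory, host="127.0.0.1", port=8000):
    """Create (but do not start) an HTTP server rooted at `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


def serve_report(directory, host="127.0.0.1", port=8000):
    """Serve the report directory until interrupted with CTRL+C."""
    server = build_report_server(directory, host, port)
    print("Serving {} on http://{}:{}".format(directory, host, port))
    print("Use CTRL+C to stop.")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        # Release the listening socket.
        server.server_close()
```

Calling `serve_report("reports")` blocks until CTRL+C is pressed, which matches the behavior described above.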

User agent

The default behavior of the html_checker validator should be to send its name and version, possibly accompanied by the vnu version if that does not cost too much (invoking vnu.jar before launching validation just for version info seems excessive).

Optionally, the validator should accept a custom user agent, so it can be set from a CLI option.

HTML export is not available even when the 'jinja' extra requirement is installed

With a recent fresh install, even with the jinja extra requirement enabled, the "html" choice for the --exporter argument is not available.

From the tests we can see there is a problem with the Jinja dependency "MarkupSafe":

ImportError while importing test module '/home/thenonda/Applications/py-html-checker/tests/034_export_jinja.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
.venv/lib/python3.8/site-packages/_pytest/python.py:608: in _importtestmodule
    mod = import_path(self.path, mode=importmode, root=self.config.rootpath)
.venv/lib/python3.8/site-packages/_pytest/pathlib.py:533: in import_path
    importlib.import_module(module_name)
/usr/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1014: in _gcd_import
    ???
<frozen importlib._bootstrap>:991: in _find_and_load
    ???
<frozen importlib._bootstrap>:975: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:671: in _load_unlocked
    ???
.venv/lib/python3.8/site-packages/_pytest/assertion/rewrite.py:168: in exec_module
    exec(co, module.__dict__)
tests/034_export_jinja.py:5: in <module>
    from jinja2 import Environment, Template
.venv/lib/python3.8/site-packages/_pytest/assertion/rewrite.py:168: in exec_module
    exec(co, module.__dict__)
.venv/lib/python3.8/site-packages/jinja2/__init__.py:12: in <module>
    from .environment import Environment
.venv/lib/python3.8/site-packages/_pytest/assertion/rewrite.py:168: in exec_module
    exec(co, module.__dict__)
.venv/lib/python3.8/site-packages/jinja2/environment.py:25: in <module>
    from .defaults import BLOCK_END_STRING
.venv/lib/python3.8/site-packages/_pytest/assertion/rewrite.py:168: in exec_module
    exec(co, module.__dict__)
.venv/lib/python3.8/site-packages/jinja2/defaults.py:3: in <module>
    from .filters import FILTERS as DEFAULT_FILTERS  # noqa: F401
.venv/lib/python3.8/site-packages/_pytest/assertion/rewrite.py:168: in exec_module
    exec(co, module.__dict__)
.venv/lib/python3.8/site-packages/jinja2/filters.py:13: in <module>
    from markupsafe import soft_unicode
E   ImportError: cannot import name 'soft_unicode' from 'markupsafe' (/home/thenonda/Applications/py-html-checker/.venv/lib/python3.8/site-packages/markupsafe/__init__.py)

It seems recent MarkupSafe releases have dropped the soft_unicode function which Jinja relies on.

Include HTML source in report

Like the online validator, the report should contain the full checked HTML source, so the user can look at it for each validation message, since the truncated extract in each message is not always helpful.

Unfortunately the vnu validator does not seem to have any option to include the source in its report; it seems to do so only in its frontend.

This will force us to change the way we use the validator. Instead of just giving paths/URLs to its CLI, we will need some trick like getting the URL source with the requests lib, writing it to a temporary file which is given to the validator, and carrying the source in the report. File paths will be simpler, since we will just need to read the file to get the source.
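
The temporary file step could look like the sketch below. It assumes the source has already been retrieved as text (for a URL, for instance with `requests.get(url).text`); the helper name `stage_source` is hypothetical.

```python
import os
import tempfile


def stage_source(source, suffix=".html"):
    """Write `source` to a temporary file and return the file path.

    The caller is responsible for deleting the file once validated.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "w", encoding="utf-8") as fp:
        fp.write(source)
    return path


staged = stage_source("<!DOCTYPE html><title>Ping</title><p>Pong</p>")
print(staged)
os.remove(staged)
```

The same source string can then be stored alongside the validation messages, so the report carries it without a second fetch.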

This is a major breaking change in the API since it will break many tests.

Finish validator tests

Currently the validator tests are still broken until finished. The wrapper may not be finished yet either, since it is developed with TDD and so depends on the tests.

vnu.jar options support

Watch for available vnu.jar options and possibly add support for them.

Required

User agent has to be supported from the first release (tracked in another issue);
Xss for Java block allocation;
no-stream is for the rare cases where validation should continue even after a critical error;

Maybe

css support may be interesting enough;
svg maybe;
asciiquotes is for exotic use cases and has the lowest priority;
the option about HTML extensions may not be a very common case;
filter options, to filter messages with regexes, may be interesting but are a rare case and not very simple to cover;

Won't do

Format is not eligible since html_checker relies on the JSON report only;
error and exit code handling is only managed internally in html_checker;
no-langdetect is to avoid the warning about a missing lang attribute; I would prefer to keep this common warning since overruling it is not advised;

Catch internal exceptions from cli

Currently an internal exception (based on ddd) is just raised as is.

CLI commands should catch them and emit critical logs instead of a plain full exception traceback.
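
The pattern could be sketched as below, with a wrapper that turns internal errors into a critical log and a non-zero exit code. The exception class `HtmlCheckerError`, the logger name and the `run_command` helper are all hypothetical stand-ins, not names taken from the package.

```python
import logging

logger = logging.getLogger("htmlcheck-sketch")


class HtmlCheckerError(Exception):
    """Hypothetical base class for the package's internal exceptions."""


def run_command(func, *args, **kwargs):
    """Run a command callable and return an exit code.

    Internal errors are logged at critical level instead of surfacing
    as a traceback; any other exception still propagates.
    """
    try:
        func(*args, **kwargs)
    except HtmlCheckerError as exc:
        logger.critical(str(exc))
        return 1
    return 0


def failing_command():
    raise HtmlCheckerError("Given path does not exist: ping.html")


print(run_command(failing_command))
```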

'dry run' option

Required at least for the site command, to only open and parse the sitemap and output the count of URL items found.

Probably not useful for the page command.

It may be called something other than --dry-run, since it's not really a dry run: the sitemap is actually opened.