Comments (7)
@RochaStratovan Can you update/request an update to MDL 0.13.0, released in October 2023?
With this version in hand, neither your README.md file nor a toy test file (cf. archive attached below) reports a problem.
from markdownlint.
Will do. Thank you.
from markdownlint.
Hmmmmmm..... so I agree it doesn't happen for the README.md file I posted. I was also able to reproduce it within my environment with that file, and now with MDL 0.13.0 it passes.
However, it is still failing with my full README.md file. I'm trying to figure out more to share with you.
from markdownlint.
It seems like it's getting a UTF-8 failure on a different README.md file now.
The problem no longer happens for that "small" example, but it's still happening on my larger files. I started the "minification" process again to find the problem.
Updated file:
README_new.md
Updated failure message
rocha@e20c13008e8e:~/JRRTEST2$ mdl README_new.md
Traceback (most recent call last):
9: from /usr/local/bin/mdl:23:in `<main>'
8: from /usr/local/bin/mdl:23:in `load'
7: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/bin/mdl:10:in `<top (required)>'
6: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `run'
5: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `each'
4: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:91:in `block in run'
3: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new_from_file'
2: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new'
1: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `initialize'
/var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `split': invalid byte sequence in UTF-8 (ArgumentError)
Updated GitLab Rendered Output with what I think are the problem characters highlighted in yellow
![utf-8-new](https://private-user-images.githubusercontent.com/40575252/331319955-c4a8d054-d6fa-4947-87e1-236865073073.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE1OTA2NzYsIm5iZiI6MTcyMTU5MDM3NiwicGF0aCI6Ii80MDU3NTI1Mi8zMzEzMTk5NTUtYzRhOGQwNTQtZDZmYS00OTQ3LTg3ZTEtMjM2ODY1MDczMDczLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MjElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzIxVDE5MzI1NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWFlZjJhZjIzMDVhMGY3YzcyNWE5MGYxYWJmNDU1N2JjMzQ4NmMwMjE1YzI3NGRjYmU3YjlkMmI3Njc3OGJmOTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.B2Gr8UzL9G7GWFwVtHqs7N7ySvnovQ7gGIG8esFBHgc)
What it looks like in a vi session, with yellow highlights again
![utf-8-new-vi](https://private-user-images.githubusercontent.com/40575252/331319985-a7574dc2-a13b-424e-a65c-13aae5d717c8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE1OTA2NzYsIm5iZiI6MTcyMTU5MDM3NiwicGF0aCI6Ii80MDU3NTI1Mi8zMzEzMTk5ODUtYTc1NzRkYzItYTEzYi00MjRlLWE2NWMtMTNhYWU1ZDcxN2M4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MjElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzIxVDE5MzI1NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJhMThkZTQ0NTFjYTg2N2Q2MTZiYTBjNzZhMTBmNGM3M2NlNzQyMzJkNTI4N2EzODdhMThkYjFlY2NlOWEzNDYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.R-XKyGbYPINQwJ9txgCNK0LtlQE4aa3PlvJVS2m45IA)
from markdownlint.
@RochaStratovan I was able to replicate your findings.
the background of the story:
The cause is the character encoding, i.e. what the operating system/the editor uses as code page. Originally, there was 7-bit ASCII, which can store only 2^7 = 128 characters (some non-visible/control characters, A-Z, a-z, 0-9, a few special characters) for US American English. Because that's not enough to cover other languages and other scripts, Unicode encodings are the better way today. While working with contemporary Python, you may encounter lines like
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
records = []
with open("example.txt", mode="r", encoding="utf-8") as source:
    records = source.readlines()
to be explicit about the file encoding of the Python script file itself (line 2), and/or about the file the script processes (the `encoding` argument of `open`). `utf-8` is very frequent, but not the only Unicode encoding around (for instance `utf-16` and `utf-32`). Between the two (ASCII and Unicode) are many other code tables, which may depend on the language/script as well as on the release and setup of the operating system/editor used. However, I wouldn't consider this a bug related to markdownlint.
The character in particular here is the (R) / ®.
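To illustrate the byte-level cause, here is a minimal sketch in Python (for illustration only, not taken from markdownlint; the helper name `is_valid_utf8` is made up). In ISO-8859-1 the ® is stored as the single byte 0xAE, whereas UTF-8 encodes it as the two bytes 0xC2 0xAE. A lone 0xAE is not a valid UTF-8 sequence, so a strict UTF-8 decoder rejects it.

```python
# Illustrative only: the same character (U+00AE, the registered sign)
# stored in two different encodings.
latin1_bytes = "\u00ae".encode("iso-8859-1")  # one byte
utf8_bytes = "\u00ae".encode("utf-8")         # two bytes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes.hex(), is_valid_utf8(latin1_bytes))  # ae False
print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))      # c2ae True
```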
how to prevent this obstacle with files created in future
Check that the editor you use is set to UTF-8. Judging by your screen photo, I presume Windows is (one/the) operating system you use. In the case of notepad++ (project page, entry on portableapps) you can set this parameter here:
In case you prefer the cross-platform geany (project page, entry on portableapps), go to Edit -> Preferences, tab Files:
The two serve only as examples; feel free to use the editor which suits your needs best. Equally, it might be worth double-checking whether (presuming you use git from Windows) the setup of your instance of git uses Linux line endings. (This is one of the questions on an early pane during the installation.)
how to resolve the current obstacle
You have to edit the files in question, which requires i) identifying "the ones" in the first place, and ii) adjusting the code page used for them. The following approach requires some basic Unix/Linux commands; in case you don't have access to Linux Ubuntu, Debian, suse, or Fedora, etc., you can equally resort to the minimal (Bash) shell provided e.g., by TortoiseGit via the pull-down menu there.
- step 1: using the minimal git shell, enter the folder with the files to be checked. It may require a couple of `cd` to change into the corresponding directory.
- step 2: run e.g., `file *.md` to check all files in the present folder and at the present level of hierarchy.

      $ file *.md
      backup.md:      ISO-8859 text, with very long lines (456)
      no_r_backup.md: ASCII text, with very long lines (456)
      out.md:         Unicode text, UTF-8 text, with very long lines (456)
      README_new.md:  ISO-8859 text, with very long lines (456)

  In the listing above, in addition to your file (note `ISO-8859`), there are an unchanged backup to work with, one where I manually removed the ®, and one converted to `utf-8`.
- step 3: for the conversion of the code page, there is the `iconv` utility. For each file, you state the current encoding (`-f`, as in "from ..."), the new encoding you want to convert to (`-t`), and where to save the resulting output. In case you have access to a Linux installation, the pattern is

      iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file -o out.file

  I equally attempted a conversion in an old installation of Windows with the minimal bash shell by `tortoise git` and noticed the `-o` flag did not work well. Instead, I had to redirect the result into a new file, i.e. a pattern of

      iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file > out.file

  Personally, I prefer a conversion that writes a new file first (which can be checked) over an approach which attempts an automatic overwrite of the original file (which can cause you to lose the file in question for good). The transliteration (`//TRANSLIT`) can possibly be dropped if you convert files from/into an encoding of the identical (for instance Latin) script. If there are multiple files to adjust, then the small bash script provided here might be helpful.
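For those without `file` and `iconv` at hand, the steps above can also be sketched in Python. This is only a sketch under assumptions: the function names and the `_utf8` suffix are made up for illustration, and the source encoding is assumed to be ISO-8859-1 for every file that is not already valid UTF-8 (which you should verify first, e.g. with `file`). As in the `iconv` approach, the originals are left untouched and the result goes to a new file.

```python
from __future__ import annotations

from pathlib import Path


def is_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8.

    Pure ASCII files count as UTF-8 too, since ASCII is a subset of it.
    """
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "iso-8859-1") -> None:
    """Decode src with the assumed source encoding and write dst as UTF-8."""
    text = src.read_bytes().decode(source_encoding)
    # Working on bytes avoids any automatic newline translation.
    dst.write_bytes(text.encode("utf-8"))


def convert_folder(folder: Path, source_encoding: str = "iso-8859-1") -> list[Path]:
    """Convert every non-UTF-8 *.md file in folder; return the files written."""
    written = []
    for src in sorted(folder.glob("*.md")):
        if is_utf8(src):
            continue  # nothing to do for ASCII/UTF-8 files
        dst = src.with_name(src.stem + "_utf8" + src.suffix)  # hypothetical naming
        convert_to_utf8(src, dst, source_encoding)
        written.append(dst)
    return written
```

Calling `convert_folder(Path("."))` in the folder from step 1 would, under these assumptions, write e.g. `README_new_utf8.md` next to `README_new.md`, which can then be checked before replacing the original.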
from markdownlint.
Hello @nbehrnd,
Thank you for the detailed analysis and answer. I understand the problem; however, I don't agree with what I think you are proposing as the solution. I believe you are suggesting that in order to avoid/prevent MDL from crashing, we should modify the input tools.
First, this isn't really a solution that scales. We have many developers that contribute to the documentation at our company and they use various tools such as:
- vi
- emacs
- visual studio text editor
- notepad[++]
just to name a few.
Second, I would categorize this as an issue with MDL. It crashes on text files that standard text editors can handle. When my devs and I see this crash, it's an MDL error. I agree as a workaround they could scan the text file to find the symbols that MDL is crashing on, but that doesn't take away from the fact that this is an issue with the MDL parsing logic.
MDL is a great tool. It just needs a few fixes such as this to be a bit more robust.
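As a rough sketch of what more robust handling could look like (written in Python purely for illustration; MDL itself is Ruby and this is not its code, and `read_lenient` is a hypothetical helper): decode strictly first, and only on failure fall back to replacing the undecodable bytes, so the tool can point at the affected lines rather than abort with a traceback.

```python
from __future__ import annotations


def read_lenient(data: bytes) -> tuple[str, list[int]]:
    """Decode bytes as UTF-8 without raising (hypothetical helper).

    Returns the decoded text and the 1-based numbers of the lines that
    contained undecodable bytes (empty if the input was valid UTF-8).
    """
    try:
        return data.decode("utf-8"), []
    except UnicodeDecodeError:
        # Fallback: every undecodable byte becomes U+FFFD (the replacement
        # character). In this fallback path, a U+FFFD that was already in
        # the input would be reported too.
        text = data.decode("utf-8", errors="replace")
        bad_lines = [
            number
            for number, line in enumerate(text.split("\n"), start=1)
            if "\ufffd" in line
        ]
        return text, bad_lines


text, bad_lines = read_lenient(b"ok\nBrand\xae\nfine\n")
print(bad_lines)  # [2]
```

With something along these lines, a linter could keep checking the rest of the file and report "invalid UTF-8 on line 2" as an ordinary finding.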
from markdownlint.