
Comments (7)

nbehrnd commented on July 22, 2024

@RochaStratovan Can you update/request an update to MDL 0.13.0, released in October 2023?

With this version in hand, neither your README.md file nor a toy test file (cf. the archive attached below) reports a problem.

2024-05-16_test_mdl.zip

from markdownlint.

RochaStratovan commented on July 22, 2024

Will do. Thank you.


RochaStratovan commented on July 22, 2024

Hmmmmmm..... so I agree it doesn't happen for the README.md file I posted. I was able to reproduce the failure within my environment with that file, and now with MDL 0.13.0 it passes.

However, it is still failing with my full README.md file. I'm trying to figure out more to share with you.


RochaStratovan commented on July 22, 2024

@nbehrnd,

It seems like it's getting a UTF-8 failure on a different README.md file now.

The problem no longer happens for that "small" example, but it's still happening on my larger files. I started the "minification" process again to find the problem.


Updated file:
README_new.md

Updated failure message

rocha@e20c13008e8e:~/JRRTEST2$ mdl README_new.md
Traceback (most recent call last):
        9: from /usr/local/bin/mdl:23:in `<main>'
        8: from /usr/local/bin/mdl:23:in `load'
        7: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/bin/mdl:10:in `<top (required)>'
        6: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `run'
        5: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `each'
        4: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:91:in `block in run'
        3: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new_from_file'
        2: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new'
        1: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `initialize'
/var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `split': invalid byte sequence in UTF-8 (ArgumentError)

Updated GitLab Rendered Output with what I think are the problem characters highlighted in yellow

utf-8-new

What it looks like in a vi session, with yellow highlights again

utf-8-new-vi


nbehrnd commented on July 22, 2024

@RochaStratovan I was able to replicate your findings.

the background of the story:

The cause is character encoding, i.e. which code page the operating system or the editor uses. Originally, there was 7-bit ASCII, which can store only 2^7 = 128 characters (some non-visible control characters, A-Z, a-z, 0-9, and a few special characters), enough for US American English. Because that is not enough to cover other languages and scripts, Unicode encodings are the better choice today. While working with contemporary Python, you may encounter lines like

#!/usr/bin/env python3
# -*- coding: utf-8 -*- 
import os
records = []

with open("example.txt", mode="r", encoding="utf-8") as source:
    records = source.readlines()

to be explicit about the file encoding of the Python script itself (line 2) and/or of the file the script processes (line 6). UTF-8 is very common, but it is not the only Unicode encoding around (there are also UTF-16 and UTF-32, for instance). Between ASCII and Unicode there are many other code pages, which may depend on the language/script as well as on the release and setup of the operating system or editor used. However, I wouldn't consider this a bug related to markdownlint.

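The character in question here is the registered sign, (R) / ®. In ISO-8859-1 it is stored as the single byte 0xAE, which on its own is not a valid UTF-8 sequence.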
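To illustrate why this one character trips a reader that expects UTF-8, here is a minimal Python sketch. It only shows the byte-level difference between the two encodings and is not taken from MDL's code; the helper name `is_valid_utf8` is made up for this example.

```python
# The registered sign (U+00AE) encoded as ISO-8859-1 is the single byte 0xAE.
latin1_bytes = "\u00ae".encode("iso-8859-1")

# The same character encoded as UTF-8 takes two bytes, 0xC2 0xAE.
utf8_bytes = "\u00ae".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A lone 0xAE is a UTF-8 continuation byte without a lead byte, so it is rejected.
print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

A file saved as ISO-8859-1 therefore contains a byte that a strict UTF-8 reader cannot interpret, which matches the "invalid byte sequence in UTF-8" message in the traceback.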

how to prevent this obstacle with files created in future

Check whether the editor you use can be switched to UTF-8. Judging by your screenshot, I presume Windows is one of (or the) operating systems you use. In the case of Notepad++ (project page, entry on portableapps), you can set this parameter here:

npp

If you prefer the cross-platform Geany (project page, entry on portableapps), go to Edit -> Preferences, tab Files:

geany

These two are only examples; feel free to use the editor which suits your needs best. Equally, it might be worth double-checking (presuming you use git from Windows) whether your installation of git is set up to use Linux line endings. (This is one of the questions on an early pane during the installation.)

how to resolve the current obstacle

You have to edit the files in question, which requires i) identifying them in the first place, and ii) adjusting the code page used for them. The following approach requires some basic Unix/Linux commands; in case you don't have access to Linux (Ubuntu, Debian, SUSE, Fedora, etc.), you can equally resort to the minimal (Bash) shell provided, e.g., by TortoiseGit and available from the pull-down menu there.

  • step 1: using the minimal git shell, enter the folder with the files to be checked. It may take a couple of cd commands to change into the corresponding directory.

  • step 2: run e.g. file *.md to check all files in the present folder (at this level of the hierarchy only).

    $ file *.md
    backup.md:      ISO-8859 text, with very long lines (456)
    no_r_backup.md: ASCII text, with very long lines (456)
    out.md:         Unicode text, UTF-8 text, with very long lines (456)
    README_new.md:  ISO-8859 text, with very long lines (456)

    The listing above shows, in addition to your file (note the ISO-8859), an unchanged backup to work with, one copy where I manually removed the ®, and one converted to UTF-8.

  • step 3: to convert the code page, there is the iconv utility. For each file, you state the current encoding (-f, as in "from ..."), the new encoding you want to convert to (-t), and where to save the resulting output. If you have access to a Linux installation, the pattern is

    iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file -o out.file

    I also attempted a conversion on an old installation of Windows with the minimal bash shell by TortoiseGit and noticed that the -o flag did not work well. Instead, I had to redirect the result into a new file, i.e. a pattern of

    iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file > out.file

    Personally, I prefer a conversion that writes a new file first (which can be checked) over an approach which attempts to overwrite the original file automatically (which can cause you to lose the file in question for good). The transliteration (//TRANSLIT) can possibly be dropped if you convert files from/into an encoding of the same (for instance Latin) script. If there are multiple files to adjust, then the small bash script provided here might be helpful.
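For anyone without iconv at hand, steps 2 and 3 can also be sketched in Python. This is only an illustration under stated assumptions: the source encoding is taken to be ISO-8859-1 (as `file` reported above), the function names are made up for this example, and the `.utf8.md` suffix for the new files is an arbitrary choice. As with the iconv pattern, each conversion writes a new file and leaves the original untouched.

```python
from pathlib import Path
from typing import List


def is_valid_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode as UTF-8 (plain ASCII also passes)."""
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "iso-8859-1") -> None:
    """Decode src with the assumed source encoding and write UTF-8 bytes to dst."""
    text = src.read_bytes().decode(source_encoding)
    # Working on bytes keeps the original line endings as they are.
    dst.write_bytes(text.encode("utf-8"))


def convert_folder(folder: Path, source_encoding: str = "iso-8859-1") -> List[Path]:
    """Convert every non-UTF-8 *.md file in folder (not recursive) to a new file."""
    written = []
    for src in sorted(folder.glob("*.md")):
        # Skip files that are already fine, and output of an earlier run.
        if src.name.endswith(".utf8.md") or is_valid_utf8(src):
            continue
        dst = src.with_name(src.stem + ".utf8.md")  # hypothetical naming scheme
        convert_to_utf8(src, dst, source_encoding)
        written.append(dst)
    return written


# Usage, from the folder that holds the Markdown files:
#     for path in convert_folder(Path(".")):
#         print("wrote", path)
```

Note that the sketch cannot detect the source encoding; if a file was saved in a different code page, the `source_encoding` argument has to be set accordingly, just like the -f flag of iconv.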


RochaStratovan commented on July 22, 2024

Hello @nbehrnd,

Thank you for the detailed analysis and answer. I understand the problem; however, I don't agree with what I think you are proposing as the solution. I believe you are suggesting that, in order to prevent MDL from crashing, we should modify the input tools.

First, this isn't really a solution that scales. We have many developers that contribute to the documentation at our company and they use various tools such as:

  1. vi
  2. emacs
  3. visual studio text editor
  4. notepad[++]

just to name a few.

Second, I would categorize this as an issue with MDL. It crashes on text files that standard text editors can handle. When my devs and I see this crash, it's an MDL error. I agree as a workaround they could scan the text file to find the symbols that MDL is crashing on, but that doesn't take away from the fact that this is an issue with the MDL parsing logic.

MDL is a great tool. It just needs a few fixes such as this to be a bit more robust.


nbehrnd commented on July 22, 2024
