johanneskaufmann / html-to-markdown

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.

License: MIT License

Go 100.00%

go golang html html-to-markdown markdown goquery converter

html-to-markdown's Introduction

Hi, I'm Johannes 👋

Experience in developing REST/RPC APIs with Golang on AWS. Also using React, Redux, Webpack, SCSS and GraphQL on the Frontend. I am always excited to learn new skills: Flutter, Elm and a never-ending supply of AWS Services 😉

html-to-markdown - Golang library that converts HTML to Markdown. Even works with entire websites and can be extended through rules.

html-to-markdown's Issues

🐛 Bug: code tag nested inside pre tag is not recognized

This code doesn't seem to work

🐛 `&lt;` and `&gt;` should not be converted to `<` and `>`

Describe the bug

&lt; and &gt; should not be converted to < and >; it breaks the resulting Markdown.

HTML Input

&lt;not a tag&gt;

Generated Markdown

<not a tag>

Expected Markdown

&lt;not a tag&gt;

Additional context
Markdown parsers take <not a tag> as a tag and do not show it. That's not what is in the HTML though.

Example: https://spec.commonmark.org/dingus/?text=%3Cnot%20a%20tag%3E%0A%0A%26lt%3Bsecond%26gt%3B
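
One way to get the expected output is to re-encode the text before it is written out. The following is a minimal sketch using only the Go standard library; `escapeText` is a hypothetical helper and not part of html-to-markdown. Note that `html.EscapeString` also encodes `&` and quote characters, which may be more than is wanted here.

```go
package main

import (
	"fmt"
	"html"
)

// escapeText is a hypothetical helper (not part of html-to-markdown).
// It re-encodes characters that a Markdown parser could read as HTML,
// so that text such as "<not a tag>" stays visible after rendering.
func escapeText(text string) string {
	return html.EscapeString(text)
}

func main() {
	fmt.Println(escapeText("<not a tag>")) // prints: &lt;not a tag&gt;
}
```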

📣 Plans for V2

The V2 of the library is in the works. It is a rewrite from the ground up — even more accurate than the current version.

Some new features:

Nested lists: More edge cases around (deeply) nested lists are supported
Smart escaping: Only escape characters if they would be mistaken for markdown syntax
...

➡️ What are some things that you would want to see? How could the API be improved? What currently annoys you?

💬 Who is using it?

People are using the library for different use cases. Some are using it for better readability of websites, others for migrating content. Knowing the use cases helps to prioritize features and plugins. So I would be interested in...

➡️ Who is using it? Could you let me know what you're using the library for? How was the experience?

🐛 Spaces missing before em elements

Describe the bug
Spaces missing before em elements

HTML Input

<ul style="list-style-type:disc">
    <li>All manually reviewed <em>Drosophila melanogaster</em> entries</li>
    <li>All manually reviewed <em>Drosophila pseudoobscura pseudoobscura</em> entries</li>
</ul>

Generated Markdown

- All manually reviewed_Drosophila melanogaster_ entries
- All manually reviewed_Drosophila pseudoobscura pseudoobscura_ entries

Expected Markdown

- All manually reviewed _Drosophila melanogaster_ entries
- All manually reviewed _Drosophila pseudoobscura pseudoobscura_ entries

Additional context
Also seeing this with the GitHubFlavored plugin.

P.S. Thanks a lot for developing this package - it's very handy!

Can't convert a table in HTML to Markdown

Usage of IsInlineElement function

From what I can see, multipleNewLinesRegex is used because consecutive inline elements can produce excessive newlines, since a newline is added before and after the content of some elements.
Could we use the IsInlineElement function to add only the newlines that are required?
For example, if we encounter an inline element and the previous sibling isn't an inline element, we could add a newline as a prefix.

Plugins list ?

Just wondering if there is a list of plugins beyond the ones in this repo.

I could just search but figured it’s worth asking

🐛 Bug with square brackets

Describe the bug

Found an issue with square brackets in the input which is confusing me. They end up being converted to \$& in the output. This seems to happen whether they are written in the HTML as [], &#91;, or &lbrack;.

HTML Input

<p>first [literal] brackets</p>
<p>then &#91;one&#93; way to escape</p>
<p>then &lbrack;another&rbrack; one</p>

Generated Markdown

first \$&literal\$& brackets

then \$&one\$& way to escape

then \$&another\$& one

Expected Markdown

first \[literal\] brackets

then &#91;one&#93; way to escape

then &lbrack;another&rbrack; one

Additional context

I had this issue come up with some options configured, but then went ahead and removed all configuration to test and I'm still seeing it. Is it something on my end I'm doing incorrectly perhaps? I'm not very experienced with golang so it's possible I'm making a silly error.
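
The `\$&` in the generated output resembles a JavaScript-style replacement token (`$&` means "the whole match" in JavaScript). Go's `regexp` package writes that as `${0}`. Whether this is the actual cause inside the library is an assumption; the sketch below only shows how bracket escaping can be written with Go's replacement syntax. `escapeBrackets` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// brackets matches a single literal "[" or "]".
var brackets = regexp.MustCompile(`[\[\]]`)

// escapeBrackets is a hypothetical helper, not the library's code.
// In a Go replacement template, "${0}" stands for the whole match,
// so each bracket is emitted with a backslash in front of it.
func escapeBrackets(s string) string {
	return brackets.ReplaceAllString(s, `\${0}`)
}

func main() {
	fmt.Println(escapeBrackets("first [literal] brackets"))
}
```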

Inside a code tag, "_" should not become "\_".

var html2 = `<code>last_30_days</code>`

out

`last\_30\_days`

want

`last_30_days`
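
The wanted behaviour amounts to skipping the escaping step for text that ends up inside a code span. A minimal sketch, assuming the caller already knows whether the text sits inside `<code>`; `escapeUnderscores` and its `inCode` parameter are hypothetical and not the library's API:

```go
package main

import (
	"fmt"
	"strings"
)

// escapeUnderscores is a hypothetical helper. It escapes "_" in normal
// text but returns the content of code spans unchanged, because
// Markdown does not interpret emphasis markers inside backticks.
func escapeUnderscores(text string, inCode bool) string {
	if inCode {
		return text
	}
	return strings.ReplaceAll(text, "_", `\_`)
}

func main() {
	fmt.Println("`" + escapeUnderscores("last_30_days", true) + "`")
	fmt.Println(escapeUnderscores("last_30_days", false))
}
```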

Image URLs

properly handle image urls that are absolute

Clarify license information for content in testdata/TestRealWorld/

Another thing found while working on Debian packaging -- could you clarify the license(s) for the files in testdata/TestRealWorld/, particularly content in the bonnerruderverein.de and snippets directories? It's not clear if the original author(s) made the content available under the MIT license, public domain, etc. (The content from the Golang website is fine, as each page has license information in its footer.)

Use of escape.Markdown for #text elements

Hello,

I'm using your library for a markdown generation tool for static site generators. The Rule interface is just perfect!

The use of escape for #text elements mostly seems like a problem to me as I read through the code. Would you be able to explain why it is used? I couldn't understand why certain characters need to be escaped in the first place.

Thanks!

Unexpected result with additional rule for custom self-closing tags

I was following this example to write a rule to process custom <mention> tags in my input: https://github.com/JohannesKaufmann/html-to-markdown/blob/master/examples/custom_tag/main.go

The result was quite surprising, but I'm not sure whether this is a bug, misuse, or a limitation of the library.

Code:

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `
	test
	
	<mention user="user1" />
	<mention user="user2" />
	<mention user="user3" />

	blabla
	`

	rule := md.Rule{
		Filter: []string{"mention"},
		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			result := "@"

			u, ok := selec.Attr("user")
			if ok {
				result += u
			} else {
				result += "unknown"
			}

			return &result
		},
	}

	conv := md.NewConverter("", true, nil)
	conv.AddRules(rule)

	markdown, err := conv.ConvertString(html)
	if err != nil {
		log.Fatalln(err)
	}

	fmt.Println("markdown:\n", markdown)
}

Expected output:

markdown:
 test
	
 @user1
 @user2
 @user3

 blabla

Observed output:

markdown:
 test

 @user1

Moreover, if I add these print statements to debug what happens in the Replacement calls, it gets even weirder:

		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			result := "@"

			u, ok := selec.Attr("user")
			if ok {
				result += u
			} else {
				result += "unknown"
			}

			html, err := selec.Html()
			if err != nil {
				log.Fatalln(err)
			}

			fmt.Println("content:", content)
			fmt.Println("selec:", html)
			fmt.Println("result:", result)

			return &result
		},

Output:

content: 

 blabla  

selec:

        blabla 

result: @user3 
content: @user3
selec:
        <mention user="user3">

        blabla
        </mention>
result: @user2
content: @user2
selec:
        <mention user="user2">
        <mention user="user3">

        blabla
        </mention></mention>
result: @user1
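
The debug output shows the three `<mention>` elements nested inside each other. HTML parsers ignore the trailing slash on elements that are not void elements, so `<mention ... />` is treated as an opening tag. One possible workaround is to rewrite such tags before conversion. This is a sketch under that assumption; `expandMentions` is a hypothetical helper and the regular expression only covers the simple attribute syntax used in this report.

```go
package main

import (
	"fmt"
	"regexp"
)

// selfClosingMention matches <mention ... /> written in XML style.
var selfClosingMention = regexp.MustCompile(`<mention\b([^>]*?)\s*/>`)

// expandMentions is a hypothetical pre-processing helper. It rewrites
// <mention ... /> as <mention ...></mention> so that an HTML parser
// closes each element before the next one starts.
func expandMentions(input string) string {
	return selfClosingMention.ReplaceAllString(input, `<mention${1}></mention>`)
}

func main() {
	fmt.Println(expandMentions(`<mention user="user1" />`))
	fmt.Println(expandMentions(`<mention user="user2" />`))
}
```

The rewritten string can then be passed to ConvertString as in the code above.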

dashes in existing frontmatter in source HTML files become escaped.

Describe the bug

dashes in existing frontmatter in source HTML files become escaped.

HTML Input

---
type: page
layout: reisebericht
title: Hamar
date: '2018-10-24 22:32:03 +0100'
weight: 2
tags:
- Norwegen
- Hamar
url: /2018/10-norwegen/02-hamar/
description: Ein Spaziergang durch Hamar, Shoppen und ein Besuch im Norsk jernbanemusem, dem norwegischen Eisenbahnmuseum.
image: files/2018/10-Norwegen/Hamar_Titel.jpg
---

Generated Markdown

\-\-\-
type: page
layout: reisebericht
title: Hamar
date: '2018-10-24 22:32:03 +0100'
weight: 2
tags:
\- Norwegen
\- Hamar
url: /2018/10-norwegen/02-hamar/
description: Ein Spaziergang durch Hamar, Shoppen und ein Besuch im Norsk jernbanemusem, dem norwegischen Eisenbahnmuseum.
image: files/2018/10-Norwegen/Hamar\_Titel.jpg
\-\-\-

Expected Markdown

---
type: page
layout: reisebericht
title: Hamar
date: '2018-10-24 22:32:03 +0100'
weight: 2
tags:
- Norwegen
- Hamar
url: /2018/10-norwegen/02-hamar/
description: Ein Spaziergang durch Hamar, Shoppen und ein Besuch im Norsk jernbanemusem, dem norwegischen Eisenbahnmuseum.
image: files/2018/10-Norwegen/Hamar_Titel.jpg
---

Additional context
This problem occurs when the source files already contain frontmatter (for example when converting Hugo .html files to .md).
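
A possible workaround is to split the front matter off before conversion and prepend it unchanged afterwards. The sketch below assumes the front matter is delimited by `---` lines with Unix newlines and that the closing delimiter is followed by a newline; `splitFrontMatter` is a hypothetical helper, not part of the library.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFrontMatter is a hypothetical helper. If doc starts with a
// "---" line and contains a later "---" line, it returns the front
// matter block (delimiters included) and the remaining body.
// Otherwise it returns an empty front matter and the whole document.
func splitFrontMatter(doc string) (front, body string) {
	const delim = "---\n"
	if !strings.HasPrefix(doc, delim) {
		return "", doc
	}
	end := strings.Index(doc[len(delim):], "\n"+delim)
	if end < 0 {
		return "", doc
	}
	cut := len(delim) + end + 1 + len(delim)
	return doc[:cut], doc[cut:]
}

func main() {
	front, body := splitFrontMatter("---\ntitle: Hamar\n---\n<p>Hi</p>\n")
	fmt.Print(front) // kept as-is, never passed through escaping
	fmt.Print(body)  // only this part would be converted
}
```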

FAIL: TestRealWorld/snippets/tweet

While working on packaging version 1.3.5 of this library for inclusion in Debian, I encountered the following test failure, due to a space just before the <br> tag:

=== RUN   TestRealWorld/snippets/tweet
    commonmark_test.go:74: Result did not match the golden fixture. Diff is below:
        
        --- Expected
        +++ Actual
        @@ -2,3 +2,3 @@
         <br>
        -As a company, it’s our responsibility to better support our Black associates, customers and allies. We know there is more work to do and will keep you updated on our progress, this is only the beginning. Black Lives Matter.<br>
        +As a company, it’s our responsibility to better support our Black associates, customers and allies. We know there is more work to do and will keep you updated on our progress, this is only the beginning. Black Lives Matter. <br>
         <img src="https://cdn.substack.com/image/fetch/w_600,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fpbs.substack.com%2Fmedia%2FEaVVy4aXsAglkCk.jpg" alt=""><br>
        
--- FAIL: TestRealWorld (0.08s)

Looking at that bit of test code, my initial guess is that it has something to do with Debian having a much newer version of github.com/yuin/goldmark (1.4.13) than what is pinned to in this project's go.mod file (1.2.0), but I haven't investigated much further.

🐛 Bug: Support `<tt>` for code next to `<code>` tags

Describe the bug
Unfortunately, some sites don't use semantic markup, e.g.,
http://math.andrej.com/2007/09/28/seemingly-impossible-functional-programs/
but instead specify the font directly using tt. Since markdown draws no distinction b/w code and things simply formatted in "typewriter style", these should be recognized as well (or, at least, as a plugin).

HTML Input

<tt>Some typewriter text</tt>

Generated Markdown

Some typewriter text

Expected Markdown

`Some typewriter text`

Additional context
N/A
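
Until `<tt>` is handled by the converter, one workaround is to rewrite the tags before conversion. This is a minimal sketch, assuming the tags carry no attributes and are written in lowercase; `rewriteTT` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// ttToCode replaces attribute-less <tt> tags with <code> tags.
var ttToCode = strings.NewReplacer("<tt>", "<code>", "</tt>", "</code>")

// rewriteTT is a hypothetical pre-processing helper. The result can be
// passed to the converter, which already turns <code> into backticks.
func rewriteTT(input string) string {
	return ttToCode.Replace(input)
}

func main() {
	fmt.Println(rewriteTT("<tt>Some typewriter text</tt>"))
}
```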

The YoutubeEmbed Plugin

How to use the YoutubeEmbed Plugin pls?

This,

html-to-markdown/plugin/youtube.go

Line 16 in 8eb812b

var EXPERIMENTALYoutubeEmbed = []md.Rule{

unlike other plugins, is a var instead of a function. Would you,

change it to a normal function, and
remove the EXPERIMENTAL part from its name please?

Ref, the request is coming from suntong/html2md#7

thanks

don't escape twice

if a markdown character is already escaped
\* item
it is escaped a second time
\\* item

wanted: stay with \* item
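
The wanted behaviour can be sketched as an escaping pass that copies existing backslash pairs through unchanged. This is an illustration only; `escapeAsterisks` is a hypothetical helper and it handles just the `*` character from the example above.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeAsterisks is a hypothetical helper. It puts a backslash in
// front of every "*" unless that "*" is already escaped.
func escapeAsterisks(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		switch {
		case s[i] == '\\' && i+1 < len(s):
			// Copy an existing escape pair unchanged.
			b.WriteByte(s[i])
			b.WriteByte(s[i+1])
			i++
		case s[i] == '*':
			b.WriteString(`\*`)
		default:
			b.WriteByte(s[i])
		}
	}
	return b.String()
}

func main() {
	fmt.Println(escapeAsterisks(`\* item`)) // already escaped, unchanged
	fmt.Println(escapeAsterisks(`* item`))  // escaped once
}
```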

Option/Flag to completely modify before-after hooks

First, thanks for the great library
I was wondering whether there could be a way to override or toggle the before/after hooks.
In my current use case I don't want multiple newlines to be removed, which is what the default before hook does.
Thanks

Mention wrapper program in README.md?

Hi @JohannesKaufmann

I love your project so much that I added a wrapper program to it:

$ html2md -i https://github.com/suntong/lang
[Homepage](https://github.com/)
. . . 


$ html2md -i https://github.com/suntong/lang -s 'div#readme'   
## README.md

# lang -- programming languages demos

Would it be OK that I PR to README.md to mention html2md when it is ready? So far I'm having these planned out:

$ html2md
HTML to Markdown
Version 0.1.0 built on 2020-07-26
Copyright (C) 2020, Tong Sun

HTML to Markdown converter on command line

Usage:
  html2md [Options...]

Options:

  -h, --help                       display help information 
  -i, --in                        *The html/xml file to read from (or stdin) 
  -d, --domain                     Domain of the web page, needed for links when --in is not url 
  -s, --sel                        CSS/goquery selectors [=body]
  -v, --verbose                    Verbose mode (Multiple -v options increase the verbosity.) 

      --opt-heading-style          Option HeadingStyle 
      --opt-horizontal-rule        Option HorizontalRule 
      --opt-bullet-list-marker     Option BulletListMarker 
      --opt-code-block-style       Option CodeBlockStyle 
      --opt-fence                  Option Fence 
      --opt-em-delimiter           Option EmDelimiter 
      --opt-strong-delimiter       Option StrongDelimiter 
      --opt-link-style             Option LinkStyle 
      --opt-link-reference-style   Option LinkReferenceStyle 

  -A, --plugin-conf-attachment     Plugin ConfluenceAttachments 
  -C, --plugin-conf-code           Plugin ConfluenceCodeBlock 
  -F, --plugin-frontmatter         Plugin FrontMatter 
  -G, --plugin-gfm                 Plugin GitHubFlavored 
  -S, --plugin-strikethrough       Plugin Strikethrough 
  -T, --plugin-table               Plugin Table 
  -L, --plugin-task-list           Plugin TaskListItems 
  -V, --plugin-vimeo               Plugin VimeoEmbed 
  -Y, --plugin-youtube             Plugin YoutubeEmbed

Thanks

🐛 Bug Can not handle img

Describe the bug
A clear and concise description of what the bug is.

HTML Input

<figure><img class="lazyload inited loaded" data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png" data-width="800" data-height="600" src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png"><figcaption></figcaption></figure>

Generated Markdown

<img class="lazyload inited loaded" data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png" data-width="800" data-height="600" src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png">

Expected Markdown

nothing

Trying to get in touch with you regarding a security issue

Hi there,

I couldn't find a SECURITY.md in your repository and so am not sure how to best contact you privately to disclose the security issue.

Can you add a SECURITY.md file with your e-mail to your repository, so that I know who to contact? GitHub suggests that a security policy is the best way to make sure security issues are responsibly disclosed.

Once you've done that, please let me know so I can ping you the info.

Thanks! (cc @JamieSlome)

Extra elements in <code> blocks

Some websites use <code> blocks with <span> elements inside. This seems to be the case when the syntax highlighting is computed server-side, rather than in the browser with a JS library such as prettify.

To reproduce:

func main() {
	converter := md.NewConverter("", true, nil)
	url := "https://atomizedobjects.com/blog/javascript/how-to-get-the-last-segment-of-a-url-in-javascript"
	markdown, _ := converter.ConvertURL(url)
	fmt.Println(markdown)
}

What I get (scrolling down a bit):

```js
window<span class="token punctuation">.</span>location<span class="token punctuation">.</span>pathname<span class="token punctuation">.</span><span class="token function">split</span><span class="token punctuation">(</span><span class="token string">"/"</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">filter</span><span class="token punctuation">(</span><span class="token parameter">entry</span> <span class="token operator">=></span> entry <span class="token operator">!==</span> <span class="token string">""</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token comment">// ["blog", "javascript", "how-to-get-the-last-segment-of-a-url-in-javascript"]</span>
```

What you get if you just remove all <span> elements from the generated markdown:

window.location.pathname.split("/").filter(entry => entry !== "");
// ["blog", "javascript", "how-to-get-the-last-segment-of-a-url-in-javascript"]

I know that an easy workaround on my side would be to just clean things up with goquery, but I figured it would be better to have it fixed here directly.

Thanks!
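
The clean-up mentioned above can also be approximated on the generated text. The following is a naive sketch, assuming the code content contains only simple tags such as `<span ...>`; a regular expression is not a full HTML parser, and `stripTags` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// tagPattern matches an opening or closing tag such as <span ...> or </span>.
var tagPattern = regexp.MustCompile(`</?[a-zA-Z][^>]*>`)

// stripTags is a hypothetical helper. It removes tags from highlighted
// code and decodes entities such as &lt; back into plain characters.
func stripTags(code string) string {
	return html.UnescapeString(tagPattern.ReplaceAllString(code, ""))
}

func main() {
	in := `window<span class="token punctuation">.</span>location`
	fmt.Println(stripTags(in))
}
```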

🐛 Bug: `<br />` is converted into two new lines (\n\n)

Describe the bug

In my testing I've found that the HTML tag <br /> gets turned into two new lines (\n\n).

Example:

(⎈ |local:default)
prologic@Jamess-iMac
Mon Aug 02 11:37:55
~/tmp/html2md
 (master) 130
$ ./html2md -i
Hello<br />World
Hello

World

HTML Input

Hello<br />World

Generated Markdown

Hello

World

Expected Markdown

Hello
World

Additional context

Is there any way to control this behaviour? I get that this might be getting interpreted as a "paragraph", but I would only expect that if there are two <br />(s) or an actual paragraph <p>...</p>. Thanks!
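
For context: in CommonMark a single newline inside a paragraph is a soft break, and a hard line break is written as two trailing spaces (or a backslash) before the newline. A minimal sketch, assuming the text segments around each <br /> are already available as strings; `joinWithHardBreaks` is a hypothetical name and not part of html-to-markdown.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWithHardBreaks is a hypothetical helper. It joins the segments
// that were separated by <br /> with a CommonMark hard line break:
// two trailing spaces followed by a single newline.
func joinWithHardBreaks(segments []string) string {
	return strings.Join(segments, "  \n")
}

func main() {
	fmt.Printf("%q\n", joinWithHardBreaks([]string{"Hello", "World"}))
}
```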

https domain

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
)

func main() {
	content := `<img src="/uploads/1.jpg">`

	converter := md.NewConverter("https://www.test.com", true, nil)
	markdown, err := converter.ConvertString(content)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("md ->", markdown)
}

Generated Markdown

![](http://https:%2F%2Fwww.test.com/uploads/1.jpg)

Expected Markdown

![](https://www.test.com/uploads/1.jpg)
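
The generated URL suggests that a scheme was added in front of a value that already contained one; that reading is an assumption. For comparison, this sketch resolves a relative path against a base URL with the standard `net/url` package. `resolve` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// resolve is a hypothetical helper. It resolves ref against base the
// way a browser would resolve a relative src or href attribute.
func resolve(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	abs, err := resolve("https://www.test.com", "/uploads/1.jpg")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(abs) // prints: https://www.test.com/uploads/1.jpg
}
```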

go get unable to handle certain filenames

Hello,

Thank you for this library.
However I am unable to use the latest version.

$ go get github.com/JohannesKaufmann/html-to-markdown@master
verifying github.com/JohannesKaufmann/[email protected]/go.mod: github.com/JohannesKaufmann/[email protected]/go.mod: reading https://sum.golang.org/lookup/github.com/!johannes!kaufmann/[email protected]: 410 Gone

Visiting the sum.golang.org link in the error

not found: create zip: malformed file path "testdata/TestFromString/<br>_adds_new_line_break.golden": invalid char '<'

Would you please consider updating the file names so that the issue is resolved?
Thanks
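
A check like the following could flag such file names before a release. It is a sketch only: the character set is an assumption built from the reports in this tracker ('<' here, ':' in another report, with '>' added as a guess), and the full set rejected by the Go module tooling is larger. `hasProblemChar` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// problemChars is an assumed, incomplete list of characters that the
// Go module tooling has been reported to reject in file paths.
const problemChars = "<>:"

// hasProblemChar is a hypothetical helper that reports whether path
// contains any of the characters listed in problemChars.
func hasProblemChar(path string) bool {
	return strings.ContainsAny(path, problemChars)
}

func main() {
	fmt.Println(hasProblemChar("testdata/TestFromString/<br>_adds_new_line_break.golden"))
	fmt.Println(hasProblemChar("testdata/TestFromString/br_adds_new_line_break.golden"))
}
```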

Incorrect coding of `<code><...></code>`

HTML Input

<code>
<a href="#Blabla">
	<img src="http://bla.bla/img/img.svg" style="height:auto" width="200px"/>
</a>
</code>

Generated Markdown

 `



`

Expected Markdown

```
<a href="#Blabla">
	<img src="http://bla.bla/img/img.svg" style="height:auto" width="200px"/>
</a>
```

Add `dl dt dd` tags support

I want to add processing of the specified tags. It seems that I should add a commonmark rule and tests, right?

But I'm not quite sure how they should be represented in Markdown.

UPD: oh, I opened this as a bug issue, sorry. I can't remove the label.

Broken output with new lines between tags

The problem may appear in a wider amount of cases, but what I've got so far is the following:

There are text posts with links to videos in specific tags

<video>https://youtu.be/SoMeViD</video>\r\n<video>https://youtu.be/SoMeViD</video>

html-to-markdown doesn't understand them, which is absolutely fine; I just want it to leave them for further processing. When there is only one, or they are separated by other elements, there is no problem at all and everything works perfectly. However, when there are two or more, it results in:

https://youtu.be/BpDqa2K0hvIhttps://youtu.be/GfE2D62bMTE

Or, if I wanted to make a regular link from it, or embed in iframe I would get this:
https://youtu.be/BpDqa2K0hvIhttps://youtu.be/GfE2D62bMTE

I think that in such cases the separators between tags, such as a space, \t, &nbsp;, \n, or \r\n, should be kept.

Provide a cmd package

🐛 `start` parameter of `<ol>` tag is ignored

Describe the bug

The start parameter in <ol> tags specifies what number in a sequence to start with. This is often used when there's something that needs to be inserted between the entries, like a code block:

HTML Input

<ol start=3><li>Echo the word "foo"</ol></li>
<pre><code>echo('foo')</code></pre>
<ol start=4><li>Now echo "bar"</ol></li>
<pre><code>echo('bar')</code></pre>

Generated Markdown

1. Echo the word "foo"

```
echo('foo')
```

1. Now echo "bar"

```
echo('bar')
```

Expected Markdown

3. Echo the word "foo"

```
echo('foo')
```

4. Now echo "bar"

```
echo('bar')
```
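
The expected numbering can be derived from the `start` attribute plus the position of the item. A minimal sketch, assuming the attribute value is available as a string; `itemPrefix` is a hypothetical helper and not the library's implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// itemPrefix is a hypothetical helper. start is the raw value of the
// <ol start="..."> attribute ("" when absent) and index is the
// zero-based position of the <li> inside the list.
func itemPrefix(start string, index int) string {
	first, err := strconv.Atoi(start)
	if err != nil {
		first = 1 // missing or non-numeric start attribute
	}
	return strconv.Itoa(first+index) + ". "
}

func main() {
	fmt.Println(itemPrefix("3", 0) + `Echo the word "foo"`)
	fmt.Println(itemPrefix("4", 0) + `Now echo "bar"`)
}
```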

Is `Converter` safe for use by multiple goroutines?

This should be documented. Is it safe to use by multiple goroutines? Am I expected to use one single instance of Converter with same configuration across my app, or to create new in each case? What's the design, what are performance considerations?

PS: there is sync.RWMutex within Converter struct, so the answer is probably yes, but, again, this should be documented to not guess or reverse engineer.

Proper spaces missing

Check out the outputs from #21 & #22:

Only ~blue ones~~left~

[go](/topics/go "Topic: go")[golang](/topics/golang "Topic: golang")[html](/topics/html "Topic: html")[html-to-markdown](/topics/html-to-markdown "Topic: html-to-markdown")[markdown](/topics/markdown "Topic: markdown")

I think proper spaces are missing between items (between "ones" and "left", and between all the tags)

🐛 Bug: Support MathJax custom tags

Describe the bug
MathJax is a JavaScript library that lets authors add "custom tags" such as $...$ to HTML, which are then turned into, e.g., MathML or whatever the browser supports.

Depending on the Markdown implementation math is either not supported at all -- or directly through the same syntax. Either way, it'd probably make most sense to simply keep $...$ expressions intact and not escape strings contained therein. While a simple filter for that would certainly work, MathJax allows supporting different escape characters than $...$ for inline- and $$...$$ for display-math, e.g., from the article https://math.andrej.com/2007/09/28/seemingly-impossible-functional-programs/:

<script>
window.MathJax = {
  tex: {
    tags: "ams",
    inlineMath: [ ['$','$'], ['\\(', '\\)'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
  },
  options: {
    skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
  },
  loader: {
    load: ['[tex]/amscd']
  }
};
</script>

This would necessitate parsing JS though ...

HTML Input

some formula: $\lambda$

Generated Markdown

some formula: $\\lambda$

Expected Markdown

some formula: $\lambda$

Additional context
This filter (or "unfilter") may be only activated, if MathJax is detected, and otherwise disabled. Further, as mentioned earlier, a more sophisticated parsing of the HTML may be used to detect the precise math-HTML tags used or make them configurable at the least.
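
The simple filter mentioned above could look roughly like this. It is a sketch that assumes `$...$` is the only inline delimiter; display math with `$$...$$` and configurable delimiters are not handled. `escapeOutsideMath` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeOutsideMath is a hypothetical helper. It doubles backslashes
// in normal text but leaves the content between a pair of "$"
// delimiters untouched. A trailing segment after an unmatched "$"
// is treated as normal text.
func escapeOutsideMath(s string) string {
	parts := strings.Split(s, "$")
	for i := range parts {
		inMath := i%2 == 1 && i != len(parts)-1
		if !inMath {
			parts[i] = strings.ReplaceAll(parts[i], `\`, `\\`)
		}
	}
	return strings.Join(parts, "$")
}

func main() {
	fmt.Println(escapeOutsideMath(`some formula: $\lambda$`))
}
```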

🐛 Bug? `<br>` turns one line break into two

Describe the bug
A clear and concise description of what the bug is --
<br> turns one line break into two.

HTML Input

foo<br>bar

Generated Markdown

foo

bar

Expected Markdown

foo
bar

Additional context

The problem was initially reported at suntong/html2md#15

Does not work with go modules because of : in filename

With new go+go modules, I get

go get github.com/JohannesKaufmann/html-to-markdown: no matching versions for query "upgrade"

:-(

Nested lists aren't converted correctly

Describe the bug
I'm seeing a problem converting nested HTML lists. The problem appears with either ordered (<ol>) or unordered (<ul>) lists.

HTML Input

<ol>
	<li>One</li>
	<ol>
		<li>One point one</li>
		<li>One point two</li>
	</ol>
</ol>

Generated Markdown

1. One

1. One point one
2. One point two

Expected Markdown

1. One
    1. One point one
    2. One point two

Additional context
I see this with the latest version (1.2.1). I'm using the following test code to check this:

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
)

func main() {
	converter := md.NewConverter("", true, nil)

	html := `
<ol>
	<li>One</li>
	<ol>
		<li>One point one</li>
		<li>One point two</li>
	</ol>
</ol>
`

	markdown, err := converter.ConvertString(html)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("md ->\n%s\n", markdown)
}

Thanks for the library!

HTML not supported.

package main

import (
	"testing"
	md "github.com/JohannesKaufmann/html-to-markdown"
)

var html = `
<p>1. xxx <br/>2. xxxx<br/>3. xxx</p><p><span class="img-wrap"><img src="xxx"></span><br>4. golang<br>a. xx<br>b. xx</p>
`

func Test_md(t *testing.T) {
	converter := md.NewConverter("", true, nil)
	mdStr, _ := converter.ConvertString(html)
	println(mdStr)
}

output

1\. xxx 2\. xxxx3\. xxx

![](xxx)4\. golanga. xxb. xx

want

1. xxx 
2. xxxx
3. xxx

![](xxx)
4. golang
a. xx
b. xx

Fix punctuation with rules?

Hi!

As I'm writing a scraper for a website, I'd like to fix some minor punctuation issues before saving the text, for example misplaced spaces next to parentheses, as in: Lorem ( ipsum dolor) sit amet or consectetur (adipiscing ) elit.

Do you think writing a converter rule (converter.AddRules) is the right way to remove this kind of error? I'd also like to replace some quotation marks and add italics for quotations…

Hoping it's the right place for this kind of question!
Best, Laurent

Spacing & numbering issues with nested lists

Describe the bug

I see a couple issues with nested lists.

One issue is that there are extra line breaks between list items in nested lists. When I render this in my application, it wraps text with a <p> if there's an extra line break (which has implications for margin/padding).

Another (small) issue I see is that numbering gets off for numbered lists. I realize this doesn't matter with Markdown, but I thought I'd note it.

HTML Input

<p>
  The Corinthos Center for Cancer will be partially closed for remodeling
  starting <strong>4/15/21</strong>. Patients should be redirected as space
  permits in the following order:
</p>
<ol>
  <li>Metro Court West.</li>
  <li>Richie General.</li>
  <ol>
    <li>This place is ok.</li>
    <li>Watch out for the doctors.</li>
    <ol>
      <li>They bite.</li>
      <li>But not hard.</li>
    </ol>
  </ol>
  <li>Port Charles Main.</li>
</ol>
<p>For further information about appointment changes, contact:</p>
<ul>
  <li>Dorothy Hardy</li>
  <ul>
    <li><em>Head of Operations</em></li>
    <ul>
      <li><em>Interim</em></li>
    </ul>
  </ul>
  <li>[email protected]</li>
  <li>555-555-5555</li>
</ul>
<p>
  <em>The remodel is </em
  ><a href="http://www.google.com/" target="_self"><em>expected</em></a
  ><em> to complete in June 2021.</em>
  <strong><em>Timeframe subject to change</em></strong
  ><em>.</em>
</p>

Generated Markdown

The Corinthos Center for Cancer will be partially closed for remodeling
starting **4/15/21**. Patients should be redirected as space
permits in the following order:

1. Metro Court West.
2. Richie General.

   1. This place is ok.
   2. Watch out for the doctors.
      1. They bite.
      2. But not hard.

4. Port Charles Main.

For further information about appointment changes, contact:

- Dorothy Hardy

  - _Head of Operations_
    - _Interim_

- [email protected]
- 555-555-5555

_The remodel is_ [_expected_](http://www.google.com/) _to complete in June 2021._ **_Timeframe subject to change_** _._

Note how there are extra line breaks after "2. Richie General.", " 2. But not hard.", "- Dorothy Hardy", and " - Interim".

Also note how "4. Port Charles Main." should be "3. Port Charles Main.".

Expected Markdown

The Corinthos Center for Cancer will be partially closed for remodeling
starting **4/15/21**. Patients should be redirected as space
permits in the following order:

1. Metro Court West.
2. Richie General.
   1. This place is ok.
   2. Watch out for the doctors.
      1. They bite.
      2. But not hard.
3. Port Charles Main.

For further information about appointment changes, contact:

- Dorothy Hardy
  - _Head of Operations_
    - _Interim_
- [email protected]
- 555-555-5555

_The remodel is_ [_expected_](http://www.google.com/) _to complete in June 2021._ **_Timeframe subject to change_** _._

Additional context

I see this with the latest version (1.3.0). I'm using no plugins.

Thanks for the utility!

More Tests

Potential issue in the Table plugin with the isFirstTbody logic

Hello, in the table.go plugin there's an issue with the firstSibling logic in the isFirstTbody function.

func isFirstTbody(s *goquery.Selection) bool {
	firstSibling := s.Siblings().Eq(0) // TODO: previousSibling
	if s.Is("tbody") && firstSibling.Length() == 0 {
		return true
	}
	return false
}

I'm retrieving tables from Confluence HTML, which has the form tbody-tr-th. Somehow firstSibling.Length() is not 0. I haven't figured it out completely, but when I comment that check out it seems to do what it's supposed to do, although that might introduce a new bug :).

github.com/JohannesKaufmann/html-to-markdown v1.3.6
github.com/PuerkitoBio/goquery v1.8.0

The domain parameter for NewConverter

domain is used for links and images to convert relative urls ("/image.png") to absolute urls.

However, I found it not working:

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	url := "https://github.com/JohannesKaufmann/html-to-markdown"
	doc, err := goquery.NewDocument(url)
	if err != nil {
		log.Fatal(err)
	}
	content := doc.Find("div.BorderGrid-row.hide-sm.hide-md > div")

	conv := md.NewConverter(md.DomainFromURL(url), true, nil)
	markdown := conv.Convert(content)

	fmt.Println(markdown)
}

go run /tmp/h2m-test.go
## About

_ Convert HTML to Markdown. Even works with whole websites.


### Topics

[go](/topics/go "Topic: go")[golang](/topics/golang "Topic: golang")[html](/topics/html "Topic: html")[html-to-markdown](/topics/html-to-markdown "Topic: html-to-markdown")[markdown](/topics/markdown "Topic: markdown")

### Resources

[Readme](#readme)

### License

[MIT License](/JohannesKaufmann/html-to-markdown/blob/master/LICENSE)

I.e., none of the links and images are converted from relative urls ("/image.png") to absolute urls.

🐛 Bug Consecutive missing spaces

Describe the bug
Consecutive missing spaces
like

import"fmt"
fortrue

missing spaces!!!!

HTML Input

<div class="example_code">
<span style="color: #b1b100; font-weight: bold;">package</span> main<br>
<br>
<span style="color: #b1b100; font-weight: bold;">import</span> <span style="color: #cc66cc;">"fmt"</span><br>
<br>
<span style="color: #993333;">func</span> main<span style="color: #339933;">()</span> <span style="color: #339933;">{</span><br>
&nbsp; &nbsp; <span style="color: #b1b100; font-weight: bold;">for</span> <span style="color: #000000; font-weight: bold;">true</span> &nbsp;<span style="color: #339933;">{</span><br>
&nbsp; &nbsp; &nbsp; &nbsp; fmt<span style="color: #339933;">.</span>Printf<span style="color: #339933;">(</span><span style="color: #cc66cc;">"xxxxx。<span style="color: #000099; font-weight: bold;">\n</span>"</span><span style="color: #339933;">);</span><br>
&nbsp; &nbsp; <span style="color: #339933;">}</span><br>
<span style="color: #339933;">}</span><br>
</div>

Generated Markdown

package main



import"fmt"



func main(){

fortrue{


        fmt.Printf("xxxxx。\n");

}

}

Expected Markdown

package main

import "fmt"

func main(){

        for true{
        
                fmt.Printf("xxxxx。\n");
        
        }
}

Brackets escaping is currently disabled

Describe the bug
Brackets are currently not being escaped by html-to-markdown

HTML Input

[this should be escaped](http://test)

Generated Markdown

[this should be escaped](http://test)

Expected Markdown

\[this should be escaped\](http://test)

Additional context
What would it take for the bracket escaping to be re-enabled in escape.go? I see that it was previously disabled due to issues with the regex. Does it simply require a more robust regular expression?

<p><!--any comments-->sample markdown.</p>

<h2 id="subsub-SecondHeading">Second Heading</h2>

Then, markdown file is

<!--any comments-->sample markdown.

## Second Heading

Many thanks!

Configure elements to keep in `<code>`

Describe the bug
A clear and concise description of what the bug is.

HTML Input

<p>The ordinal number "fifth" can be abbreviated in various languages as follows:</p>
<ul>
	<li><code>English: 5<sup>th</sup></code></li>
	<li>French: 5<sup>ème</sup></li>
</ul>

Generated Markdown

The ordinal number "fifth" can be abbreviated in various languages as follows:

- `English: 5th`
- French: 5<sup>ème</sup>

Expected Markdown

The ordinal number "fifth" can be abbreviated in various languages as follows:

- `English: 5<sup>th</sup>`
- French: 5<sup>ème</sup>

Additional context
I use NewConverter("", true, nil).Keep("sup") to convert.
