html-to-markdown's Introduction

Hi, I'm Johannes 👋

Experience in developing REST/RPC APIs with Golang on AWS. Also using React, Redux, Webpack, SCSS and GraphQL on the Frontend. I am always excited to learn new skills: Flutter, Elm and a never-ending supply of AWS Services 😉


  • html-to-markdown - Golang library that converts HTML to Markdown. Even works with entire websites and can be extended through rules.

html-to-markdown's People

Contributors

bubenkoff, dependabot[bot], devabreu, hilmanski, johanneskaufmann, mmelvin0, skarlso, suntong, vivook, wcalandro

html-to-markdown's Issues

📣 Plans for V2

The V2 of the library is in the works. It is a rewrite from the ground up — even more accurate than the current version.

Some new features:

  • Nested lists: More edge cases around (deeply) nested lists are supported
  • Smart escaping: Only escape characters if they would be mistaken for markdown syntax
  • ...

➡️ What are some things that you would want to see? How could the API be improved? What currently annoys you?

💬 Who is using it?

People are using the library for different use cases. Some are using it for better readability of websites, others for migrating content. Knowing the use cases helps to prioritize features and plugins. So I would be interested in...

➡️ Who is using it? Could you let me know what you're using the library for? How was the experience?

🐛 Spaces missing before em elements

Describe the bug
Spaces missing before em elements

HTML Input

<ul style="list-style-type:disc">
    <li>All manually reviewed <em>Drosophila melanogaster</em> entries</li>
    <li>All manually reviewed <em>Drosophila pseudoobscura pseudoobscura</em> entries</li>
</ul>

Generated Markdown

- All manually reviewed_Drosophila melanogaster_ entries
- All manually reviewed_Drosophila pseudoobscura pseudoobscura_ entries

Expected Markdown

- All manually reviewed _Drosophila melanogaster_ entries
- All manually reviewed _Drosophila pseudoobscura pseudoobscura_ entries

Additional context
Also seeing this with the GitHubFlavored plugin.

P.S. Thanks a lot for developing this package - it's very handy!
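
The lost space most likely lives at the end of the text node in front of the <em>. A minimal sketch of one way to keep it, assuming text nodes are collapsed rather than trimmed (collapseText is a hypothetical helper, not part of the library):

```go
package main

import (
	"fmt"
	"regexp"
)

// whitespaceRun matches one or more whitespace characters (space, tab, newline).
var whitespaceRun = regexp.MustCompile(`\s+`)

// collapseText is a hypothetical helper: it collapses every run of whitespace
// in a text node to a single space but, unlike strings.TrimSpace, keeps one
// space at the start or end when the node had whitespace there.
func collapseText(text string) string {
	return whitespaceRun.ReplaceAllString(text, " ")
}

func main() {
	// The three pieces a converter would see for the first list item.
	before := collapseText("All manually reviewed ")
	em := "_" + "Drosophila melanogaster" + "_"
	after := collapseText(" entries")

	fmt.Println("- " + before + em + after)
	// prints: - All manually reviewed _Drosophila melanogaster_ entries
}
```

The same idea would cover a whitespace-only text node between two inline elements, which collapses to a single space instead of disappearing.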

Usage of IsInlineElement function

So, as far as I can tell, multipleNewLinesRegex is used because consecutive inline elements can produce excessive newlines: for some elements we add a newline before and after the content.
Could we use the IsInlineElement function to add only the newlines that are required?
For example, if we encounter an inline element whose previous sibling isn't an inline element, we could add a newline as a prefix.
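
The proposal could be sketched roughly as follows. The tag set, isInline and needsNewlinePrefix are illustrative stand-ins (the real check would use the library's IsInlineElement), and treating "no previous sibling" as "no prefix" is an assumption:

```go
package main

import "fmt"

// inlineTags is a deliberately small, illustrative subset of inline elements.
var inlineTags = map[string]bool{
	"a": true, "code": true, "em": true, "span": true, "strong": true,
}

// isInline is a stand-in for the library's IsInlineElement.
func isInline(tag string) bool { return inlineTags[tag] }

// needsNewlinePrefix follows the proposal: an inline element gets a newline
// prefix only when its previous sibling is a non-inline element.
// prevTag is "" when there is no previous sibling.
func needsNewlinePrefix(prevTag, tag string) bool {
	return isInline(tag) && prevTag != "" && !isInline(prevTag)
}

func main() {
	fmt.Println(needsNewlinePrefix("p", "em"))    // true
	fmt.Println(needsNewlinePrefix("span", "em")) // false
	fmt.Println(needsNewlinePrefix("", "em"))     // false
}
```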

Plugins list?

Just wondering if there is a list of plugins beyond the ones in this repo.

I could just search, but figured it's worth asking.

🐛 Bug with square brackets

Describe the bug

Found an issue with square brackets in the input which is confusing me. They end up being converted to \$& in the output. This seems to happen whether they are written in the html as [], &lbrack;, or &#91;.

HTML Input

<p>first [literal] brackets</p>
<p>then &#91;one&#93; way to escape</p>
<p>then &lbrack;another&rbrack; one</p>

Generated Markdown

first \$&literal\$& brackets

then \$&one\$& way to escape

then \$&another\$& one

Expected Markdown

first \[literal\] brackets

then &#91;one&#93; way to escape

then &lbrack;another&rbrack; one

Additional context

I had this issue come up with some options configured, but then went ahead and removed all configuration to test and I'm still seeing it. Is it something on my end I'm doing incorrectly perhaps? I'm not very experienced with golang so it's possible I'm making a silly error.
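
The \$& in the output looks like a replacement template that was never expanded. $& is the JavaScript placeholder for the whole match, whereas Go's regexp package uses ${0} and writes unknown placeholders out literally. Whether that is what happens inside the library is an assumption. The standalone sketch below only demonstrates the difference:

```go
package main

import (
	"fmt"
	"regexp"
)

// brackets matches a literal opening or closing square bracket.
var brackets = regexp.MustCompile(`[\[\]]`)

// escapeJSStyle uses the JavaScript placeholder `$&`. Go's regexp package does
// not recognise it, so the replacement text is emitted literally.
func escapeJSStyle(s string) string {
	return brackets.ReplaceAllString(s, `\$&`)
}

// escapeGoStyle uses `${0}`, Go's placeholder for the whole match.
func escapeGoStyle(s string) string {
	return brackets.ReplaceAllString(s, `\${0}`)
}

func main() {
	in := "first [literal] brackets"
	fmt.Println(escapeJSStyle(in)) // first \$&literal\$& brackets
	fmt.Println(escapeGoStyle(in)) // first \[literal\] brackets
}
```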

Image URLs

Properly handle image URLs that are absolute.

Clarify license information for content in testdata/TestRealWorld/

Another thing found while working on Debian packaging -- could you clarify the license(s) for the files in testdata/TestRealWorld/, particularly content in the bonnerruderverein.de and snippets directories? It's not clear if the original author(s) made the content available under the MIT license, public domain, etc. (The content from the Golang website is fine, as each page has license information in its footer.)

Use of escape.Markdown for #text elements

Hello,

I'm using your library for a markdown generation tool for static site generators. The Rule interface is just perfect!

The use of escape for #text elements looks like it could be a problem for me as I read through the code. Could you explain why it is used? I couldn't work out why certain characters need to be escaped in the first place.

Thanks!

Unexpected result with additional rule for custom self-closing tags

I was following this example to write a rule to process custom <mention> tags in my input: https://github.com/JohannesKaufmann/html-to-markdown/blob/master/examples/custom_tag/main.go

The result was quite surprising, but I'm not sure whether this is a bug, misuse on my part, or a limitation of the library.

Code:

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `
	test
	
	<mention user="user1" />
	<mention user="user2" />
	<mention user="user3" />

	blabla
	`

	rule := md.Rule{
		Filter: []string{"mention"},
		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			result := "@"

			u, ok := selec.Attr("user")
			if ok {
				result += u
			} else {
				result += "unknown"
			}

			return &result
		},
	}

	conv := md.NewConverter("", true, nil)
	conv.AddRules(rule)

	markdown, err := conv.ConvertString(html)
	if err != nil {
		log.Fatalln(err)
	}

	fmt.Println("markdown:\n", markdown)
}

Expected output:

markdown:
 test
	
 @user1
 @user2
 @user3

 blabla

Observed output:

markdown:
 test

 @user1

Moreover, if I add these print statements to debug what is going on in the Replacement calls, it becomes even weirder:

		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			result := "@"

			u, ok := selec.Attr("user")
			if ok {
				result += u
			} else {
				result += "unknown"
			}

			html, err := selec.Html()
			if err != nil {
				log.Fatalln(err)
			}

			fmt.Println("content:", content)
			fmt.Println("selec:", html)
			fmt.Println("result:", result)

			return &result
		},

Output:

content: 

 blabla  

selec:

        blabla 

result: @user3 
content: @user3
selec:
        <mention user="user3">

        blabla
        </mention>
result: @user2
content: @user2
selec:
        <mention user="user2">
        <mention user="user3">

        blabla
        </mention></mention>
result: @user1
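
The nested <mention> elements in the debug output suggest that the HTML parser ignores the trailing "/" on unknown elements, so each tag stays open and swallows everything after it. One possible workaround, sketched under that assumption, is to rewrite the self-closing form before conversion. expandSelfClosing is a hypothetical helper, and the regular expression does not handle attribute values that contain ">":

```go
package main

import (
	"fmt"
	"regexp"
)

// selfClosingMention matches `<mention ... />` and captures the attribute text.
var selfClosingMention = regexp.MustCompile(`<mention\b([^>]*?)\s*/>`)

// expandSelfClosing rewrites `<mention ... />` as `<mention ...></mention>` so
// that an HTML5 parser sees an explicitly closed element.
func expandSelfClosing(html string) string {
	return selfClosingMention.ReplaceAllString(html, `<mention${1}></mention>`)
}

func main() {
	in := `<mention user="user1" />
<mention user="user2" />`
	fmt.Println(expandSelfClosing(in))
}
```

The rewritten string could then be passed to conv.ConvertString in place of the original input.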

dashes in existing frontmatter in source HTML files become escaped.

Describe the bug

dashes in existing frontmatter in source HTML files become escaped.

HTML Input

---
type: page
layout: reisebericht
title: Hamar
date: '2018-10-24 22:32:03 +0100'
weight: 2
tags:
- Norwegen
- Hamar
url: /2018/10-norwegen/02-hamar/
description: Ein Spaziergang durch Hamar, Shoppen und ein Besuch im Norsk jernbanemusem, dem norwegischen Eisenbahnmuseum.
image: files/2018/10-Norwegen/Hamar_Titel.jpg
---

Generated Markdown

\-\-\-
type: page
layout: reisebericht
title: Hamar
date: '2018-10-24 22:32:03 +0100'
weight: 2
tags:
\- Norwegen
\- Hamar
url: /2018/10-norwegen/02-hamar/
description: Ein Spaziergang durch Hamar, Shoppen und ein Besuch im Norsk jernbanemusem, dem norwegischen Eisenbahnmuseum.
image: files/2018/10-Norwegen/Hamar\_Titel.jpg
\-\-\-

Expected Markdown

---
type: page
layout: reisebericht
title: Hamar
date: '2018-10-24 22:32:03 +0100'
weight: 2
tags:
- Norwegen
- Hamar
url: /2018/10-norwegen/02-hamar/
description: Ein Spaziergang durch Hamar, Shoppen und ein Besuch im Norsk jernbanemusem, dem norwegischen Eisenbahnmuseum.
image: files/2018/10-Norwegen/Hamar_Titel.jpg
---

Additional context
This problem occurs when the source files already contain frontmatter (for example when converting Hugo .html files to .md).
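
A possible workaround on the caller's side is to cut the front matter off before conversion and prepend it again afterwards. The sketch below assumes the block starts on the first line and that both delimiters are exactly "---" followed by a newline. splitFrontMatter is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// splitFrontMatter separates a leading `---` ... `---` block from the rest of
// the document. If src does not start with such a block, front is empty and
// body is src unchanged.
func splitFrontMatter(src string) (front, body string) {
	const delim = "---\n"
	if !strings.HasPrefix(src, delim) {
		return "", src
	}
	// Look for the closing delimiter on a line of its own.
	end := strings.Index(src[len(delim):], "\n"+delim)
	if end < 0 {
		return "", src
	}
	cut := len(delim) + end + len("\n"+delim)
	return src[:cut], src[cut:]
}

func main() {
	src := "---\ntitle: Hamar\ntags:\n- Norwegen\n---\n<p>Hello</p>\n"
	front, body := splitFrontMatter(src)
	fmt.Print(front) // passed through untouched
	fmt.Print(body)  // only this part would go to the converter
}
```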

FAIL: TestRealWorld/snippets/tweet

While working on packaging version 1.3.5 of this library for inclusion in Debian, I encountered the following test failure, due to a space just before the <br> tag:

=== RUN   TestRealWorld/snippets/tweet
    commonmark_test.go:74: Result did not match the golden fixture. Diff is below:
        
        --- Expected
        +++ Actual
        @@ -2,3 +2,3 @@
         <br>
        -As a company, it’s our responsibility to better support our Black associates, customers and allies. We know there is more work to do and will keep you updated on our progress, this is only the beginning. Black Lives Matter.<br>
        +As a company, it’s our responsibility to better support our Black associates, customers and allies. We know there is more work to do and will keep you updated on our progress, this is only the beginning. Black Lives Matter. <br>
         <img src="https://cdn.substack.com/image/fetch/w_600,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fpbs.substack.com%2Fmedia%2FEaVVy4aXsAglkCk.jpg" alt=""><br>
        
--- FAIL: TestRealWorld (0.08s)

Looking at that bit of test code, my initial guess is that it has something to do with Debian having a much newer version of github.com/yuin/goldmark (1.4.13) than what is pinned to in this project's go.mod file (1.2.0), but I haven't investigated much further.

🐛 Bug: Support `<tt>` for code next to `<code>` tags

Describe the bug
Unfortunately, some sites don't use semantic markup, e.g.,
http://math.andrej.com/2007/09/28/seemingly-impossible-functional-programs/
but instead specify the font directly using tt. Since markdown draws no distinction between code and things simply formatted in "typewriter style", these should be recognized as well (or, at least, via a plugin).

HTML Input

<tt>Some typewriter text</tt>

Generated Markdown

Some typewriter text

Expected Markdown

`Some typewriter text`

Additional context
N/A

don't escape twice

If a markdown character is already escaped:
\* item
it is escaped a second time:
\\* item

Wanted: stay with \* item
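
The requested behaviour could look roughly like this. It is a sketch only: escapeOnce is a hypothetical function that handles just "*" and "_", and it assumes a backslash in the input is meant as an escape rather than as literal text:

```go
package main

import (
	"fmt"
	"strings"
)

// escapeOnce puts a backslash in front of each `*` or `_` unless that
// character is already preceded by an unescaped backslash.
func escapeOnce(s string) string {
	var b strings.Builder
	escaped := false // true when the previous rune was an unescaped backslash
	for _, r := range s {
		switch {
		case escaped:
			// r is already escaped by the preceding backslash: copy it as-is.
			escaped = false
		case r == '\\':
			escaped = true
		case r == '*' || r == '_':
			b.WriteByte('\\')
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(escapeOnce(`* item`))  // \* item
	fmt.Println(escapeOnce(`\* item`)) // \* item
}
```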

Mention wrapper program in README.md?

Hi @JohannesKaufmann

I love your project so much that I added a wrapper program to it:

$ html2md -i https://github.com/suntong/lang
[Homepage](https://github.com/)
. . . 


$ html2md -i https://github.com/suntong/lang -s 'div#readme'   
## README.md

# lang -- programming languages demos

Would it be OK that I PR to README.md to mention html2md when it is ready? So far I'm having these planned out:

$ html2md
HTML to Markdown
Version 0.1.0 built on 2020-07-26
Copyright (C) 2020, Tong Sun

HTML to Markdown converter on command line

Usage:
  html2md [Options...]

Options:

  -h, --help                       display help information 
  -i, --in                        *The html/xml file to read from (or stdin) 
  -d, --domain                     Domain of the web page, needed for links when --in is not url 
  -s, --sel                        CSS/goquery selectors [=body]
  -v, --verbose                    Verbose mode (Multiple -v options increase the verbosity.) 

      --opt-heading-style          Option HeadingStyle 
      --opt-horizontal-rule        Option HorizontalRule 
      --opt-bullet-list-marker     Option BulletListMarker 
      --opt-code-block-style       Option CodeBlockStyle 
      --opt-fence                  Option Fence 
      --opt-em-delimiter           Option EmDelimiter 
      --opt-strong-delimiter       Option StrongDelimiter 
      --opt-link-style             Option LinkStyle 
      --opt-link-reference-style   Option LinkReferenceStyle 

  -A, --plugin-conf-attachment     Plugin ConfluenceAttachments 
  -C, --plugin-conf-code           Plugin ConfluenceCodeBlock 
  -F, --plugin-frontmatter         Plugin FrontMatter 
  -G, --plugin-gfm                 Plugin GitHubFlavored 
  -S, --plugin-strikethrough       Plugin Strikethrough 
  -T, --plugin-table               Plugin Table 
  -L, --plugin-task-list           Plugin TaskListItems 
  -V, --plugin-vimeo               Plugin VimeoEmbed 
  -Y, --plugin-youtube             Plugin YoutubeEmbed 

Thanks

🐛 Bug Can not handle img

Describe the bug
A clear and concise description of what the bug is.

HTML Input

<figure><img class="lazyload inited loaded" data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png" data-width="800" data-height="600" src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png"><figcaption></figcaption></figure>

Generated Markdown

<img class="lazyload inited loaded" data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png" data-width="800" data-height="600" src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png">

Expected Markdown

nothing

Trying to get in touch with you regarding a security issue

Hi there,

I couldn't find a SECURITY.md in your repository and so am not sure how to best contact you privately to disclose the security issue.

Can you add a SECURITY.md file with your e-mail to your repository, so that I know who to contact? GitHub suggests that a security policy is the best way to make sure security issues are responsibly disclosed.

Once you've done that, please let me know so I can ping you the info.

Thanks! (cc @JamieSlome)

Extra <span> elements in <code> blocks

Some websites use <code> blocks with <span> elements inside. It seems to be the case when the syntax highlighting is computed server-side, rather than on the browser with some JS library such as prettify.

To reproduce:

func main() {
	converter := md.NewConverter("", true, nil)
	url := "https://atomizedobjects.com/blog/javascript/how-to-get-the-last-segment-of-a-url-in-javascript"
	markdown, err := converter.ConvertURL(url)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(markdown)
}

What I get (scrolling down a bit):

```js
window<span class="token punctuation">.</span>location<span class="token punctuation">.</span>pathname<span class="token punctuation">.</span><span class="token function">split</span><span class="token punctuation">(</span><span class="token string">"/"</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">filter</span><span class="token punctuation">(</span><span class="token parameter">entry</span> <span class="token operator">=></span> entry <span class="token operator">!==</span> <span class="token string">""</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token comment">// ["blog", "javascript", "how-to-get-the-last-segment-of-a-url-in-javascript"]</span>
```

What you get if you just remove all <span> elements from the generated markdown:

window.location.pathname.split("/").filter(entry => entry !== "");
// ["blog", "javascript", "how-to-get-the-last-segment-of-a-url-in-javascript"]

I know that an easy workaround on my side would be to just clean things up with goquery, but I figured it would be better to have it fixed here directly.

Thanks!

🐛 Bug <br> is converted into two new lines (\n\n)

Describe the bug

In my testing I've found that the HTML tag <br /> gets turned into two new lines (\n\n);

Example:

(⎈ |local:default)
prologic@Jamess-iMac
Mon Aug 02 11:37:55
~/tmp/html2md
 (master) 130
$ ./html2md -i
Hello<br />World
Hello

World

HTML Input

Hello<br />World

Generated Markdown

Hello

World

Expected Markdown

Hello
World

Additional context

Is there any way to control this behaviour? I get that this might be getting interpreted as a "paragraph", but I would only expect that if there are two <br />(s) or an actual paragraph <p>...</p>. Thanks!

https domain

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
)

func main() {
	content := `<img src="/uploads/1.jpg">`

	converter := md.NewConverter("https://www.test.com", true, nil)
	markdown, err := converter.ConvertString(content)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("md ->", markdown)
}

Generated Markdown

![](http://https:%2F%2Fwww.test.com/uploads/1.jpg)

Expected Markdown

![](https://www.test.com/uploads/1.jpg)
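
The generated URL suggests the whole domain string, scheme included, was treated as a host name. A sketch of resolving a relative reference against a domain that may or may not carry a scheme, using only net/url. absoluteURL is a hypothetical helper, and falling back to http for a bare host is an assumption:

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"strings"
)

// absoluteURL resolves ref against domain. domain may be a bare host
// ("www.test.com") or include a scheme ("https://www.test.com").
// Defaulting to http for a bare host is an assumption of this sketch.
func absoluteURL(domain, ref string) (string, error) {
	if !strings.Contains(domain, "://") {
		domain = "http://" + domain
	}
	base, err := url.Parse(domain)
	if err != nil {
		return "", err
	}
	// (*url.URL).Parse resolves ref relative to base; absolute refs are kept.
	u, err := base.Parse(ref)
	if err != nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	abs, err := absoluteURL("https://www.test.com", "/uploads/1.jpg")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(abs) // https://www.test.com/uploads/1.jpg
}
```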

go get unable to handle certain filenames

Hello,

Thank you for this library.
However I am unable to use the latest version.

$ go get github.com/JohannesKaufmann/html-to-markdown@master
verifying github.com/JohannesKaufmann/[email protected]/go.mod: github.com/JohannesKaufmann/[email protected]/go.mod: reading https://sum.golang.org/lookup/github.com/!johannes!kaufmann/[email protected]: 410 Gone

Visiting the sum.golang.org link in the error

not found: create zip: malformed file path "testdata/TestFromString/<br>_adds_new_line_break.golden": invalid char '<'

Would you please consider updating the file names so that the issue is resolved?
Thanks
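
One way to keep descriptive test names while producing file names that the module tooling accepts is to map the offending characters to an underscore when deriving the golden file name. The character set below is an assumption based on the "invalid char '<'" error and on my reading of the module path rules, and sanitizeFileName is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// disallowed lists characters assumed to be rejected in module file names.
const disallowed = "\"'*<>?`|/\\:"

// sanitizeFileName replaces every disallowed character with an underscore.
func sanitizeFileName(name string) string {
	return strings.Map(func(r rune) rune {
		if strings.ContainsRune(disallowed, r) {
			return '_'
		}
		return r
	}, name)
}

func main() {
	fmt.Println(sanitizeFileName("<br>_adds_new_line_break.golden"))
	// prints: _br__adds_new_line_break.golden
}
```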

Incorrect coding of `<code><...></code>`

HTML Input

<code>
<a href="#Blabla">
	<img src="http://bla.bla/img/img.svg" style="height:auto" width="200px"/>
</a>
</code>

Generated Markdown

 `



`

Expected Markdown

```
<a href="#Blabla">
	<img src="http://bla.bla/img/img.svg" style="height:auto" width="200px"/>
</a>
```

Add `dl dt dd` tags support

I want to add processing for the dl, dt and dd tags. It seems that I should add a commonmark rule and tests, right?

But I'm not quite sure how they should be represented in markdown.

UPD: Oh, I opened this as a bug issue, sorry. I can't remove the label.

Broken output with new lines between tags

The problem may appear in a wider range of cases, but what I've got so far is the following:

There are text posts with links to videos in specific tags

<video>https://youtu.be/SoMeViD</video>\r\n<video>https://youtu.be/SoMeViD</video>

html-to-markdown doesn't understand them, which is absolutely fine; I just want it to leave them as they are for further processing. When there is one, or when they are separated by some elements, there is no problem at all and everything works perfectly. However, when there are two or more, it results in:

https://youtu.be/BpDqa2K0hvIhttps://youtu.be/GfE2D62bMTE

Or, if I wanted to make a regular link from it, or embed in iframe I would get this:
https://youtu.be/BpDqa2K0hvIhttps://youtu.be/GfE2D62bMTE

I think in such a case separators between tags, such as a space, \t, &nbsp;, \n, or \r\n, should be kept.

🐛 `start` parameter of `<ol>` tag is ignored

Describe the bug

The start parameter in <ol> tags specifies what number in a sequence to start with. This is often used when there's something that needs to be inserted between the entries, like a code block:

HTML Input

<ol start=3><li>Echo the word "foo"</ol></li>
<pre><code>echo('foo')</code></pre>
<ol start=4><li>Now echo "bar"</ol></li>
<pre><code>echo('bar')</code></pre>

Generated Markdown

1. Echo the word "foo"

```
echo('foo')
```

1. Now echo "bar"

```
echo('bar')
```

Expected Markdown

3. Echo the word "foo"

```
echo('foo')
```

4. Now echo "bar"

```
echo('bar')
```
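
A sketch of how the start attribute could drive the numbering. parseStart and renderOrderedList are hypothetical helpers, and falling back to 1 for a missing or invalid attribute is an assumption:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStart converts the value of the start attribute to an int and falls
// back to 1 when the attribute is missing or not a number.
func parseStart(attr string) int {
	n, err := strconv.Atoi(attr)
	if err != nil {
		return 1
	}
	return n
}

// renderOrderedList numbers the items beginning at start.
func renderOrderedList(items []string, start int) string {
	var b strings.Builder
	for i, item := range items {
		fmt.Fprintf(&b, "%d. %s\n", start+i, item)
	}
	return b.String()
}

func main() {
	fmt.Print(renderOrderedList([]string{`Echo the word "foo"`}, parseStart("3")))
	// prints: 3. Echo the word "foo"
}
```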

Is `Converter` safe for use by multiple goroutines?

This should be documented. Is it safe to use by multiple goroutines? Am I expected to use one single instance of Converter with same configuration across my app, or to create new in each case? What's the design, what are performance considerations?

PS: there is a sync.RWMutex within the Converter struct, so the answer is probably yes, but again, this should be documented so that users don't have to guess or reverse engineer it.

Proper spaces missing

Check out the outputs from #21 & #22:

Only ~blue ones~~left~

[go](/topics/go "Topic: go")[golang](/topics/golang "Topic: golang")[html](/topics/html "Topic: html")[html-to-markdown](/topics/html-to-markdown "Topic: html-to-markdown")[markdown](/topics/markdown "Topic: markdown")

I think proper spaces are missing between items (between "ones" and "left", and between all the tags)

🐛 Bug: Support MathJax custom tags

Describe the bug
MathJax is a JavaScript library that allows adding "custom tags" such as $...$ to HTML, which are then turned into, e.g., MathML or whatever the browser supports.

Depending on the Markdown implementation math is either not supported at all -- or directly through the same syntax. Either way, it'd probably make most sense to simply keep $...$ expressions intact and not escape strings contained therein. While a simple filter for that would certainly work, MathJax allows supporting different escape characters than $...$ for inline- and $$...$$ for display-math, e.g., from the article https://math.andrej.com/2007/09/28/seemingly-impossible-functional-programs/:

<script>
window.MathJax = {
  tex: {
    tags: "ams",
    inlineMath: [ ['$','$'], ['\\(', '\\)'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
  },
  options: {
    skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
  },
  loader: {
    load: ['[tex]/amscd']
  }
};
</script>

This would necessitate parsing JS though ...

HTML Input

some formula: $\lambda$

Generated Markdown

some formula: $\\lambda$

Expected Markdown

some formula: $\lambda$

Additional context
This filter (or "unfilter") may be only activated, if MathJax is detected, and otherwise disabled. Further, as mentioned earlier, a more sophisticated parsing of the HTML may be used to detect the precise math-HTML tags used or make them configurable at the least.
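
For the default delimiters only, the "unfilter" could be sketched as below: math spans are copied verbatim and everything else goes through the normal escaping. escapeOutsideMath and escapeBackslashes are hypothetical, the pattern ignores \( ... \) and any custom MathJax configuration, and it does not handle escaped dollar signs:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// mathSpan matches $$...$$ (display) or $...$ (inline) without nested dollars.
var mathSpan = regexp.MustCompile(`\$\$[^$]+\$\$|\$[^$]+\$`)

// escapeBackslashes stands in for the real escaping step.
func escapeBackslashes(s string) string {
	return strings.ReplaceAll(s, `\`, `\\`)
}

// escapeOutsideMath applies escape to the text between math spans and copies
// the math spans themselves unchanged.
func escapeOutsideMath(s string, escape func(string) string) string {
	var b strings.Builder
	last := 0
	for _, loc := range mathSpan.FindAllStringIndex(s, -1) {
		b.WriteString(escape(s[last:loc[0]]))
		b.WriteString(s[loc[0]:loc[1]])
		last = loc[1]
	}
	b.WriteString(escape(s[last:]))
	return b.String()
}

func main() {
	fmt.Println(escapeOutsideMath(`some formula: $\lambda$`, escapeBackslashes))
	// prints: some formula: $\lambda$
}
```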

Nested lists aren't converted correctly

Describe the bug
I'm seeing a problem converting nested HTML lists. The problem appears with either ordered (<ol>) or unordered (<ul>) lists.

HTML Input

<ol>
	<li>One</li>
	<ol>
		<li>One point one</li>
		<li>One point two</li>
	</ol>
</ol>

Generated Markdown

1. One

1. One point one
2. One point two

Expected Markdown

1. One
    1. One point one
    2. One point two

Additional context
I see this with the latest version (1.2.1). I'm using the following test code to check this:

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
)

func main() {
	converter := md.NewConverter("", true, nil)

	html := `
<ol>
	<li>One</li>
	<ol>
		<li>One point one</li>
		<li>One point two</li>
	</ol>
</ol>
`

	markdown, err := converter.ConvertString(html)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("md ->\n%s\n", markdown)
}

Thanks for the library!
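
The expected output amounts to indenting each nesting level by four spaces and restarting the numbering per level. A standalone sketch of that rendering, assuming the nested <ol> is treated as belonging to the preceding <li> (listItem and renderList are hypothetical, not the library's types):

```go
package main

import (
	"fmt"
	"strings"
)

// listItem is a minimal tree node for an ordered list.
type listItem struct {
	text     string
	children []listItem
}

// renderList writes the items at the given depth, indenting each level by
// four spaces and numbering every level from 1.
func renderList(items []listItem, depth int) string {
	var b strings.Builder
	indent := strings.Repeat("    ", depth)
	for i, it := range items {
		fmt.Fprintf(&b, "%s%d. %s\n", indent, i+1, it.text)
		b.WriteString(renderList(it.children, depth+1))
	}
	return b.String()
}

func main() {
	list := []listItem{
		{text: "One", children: []listItem{
			{text: "One point one"},
			{text: "One point two"},
		}},
	}
	fmt.Print(renderList(list, 0))
}
```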

HTML <br> not supported.

var html =`
<p>1. xxx <br/>2. xxxx<br/>3. xxx</p><p><span class="img-wrap"><img src="xxx"></span><br>4. golang<br>a. xx<br>b. xx</p>
`

func Test_md(t *testing.T) {
	converter := md.NewConverter("", true, nil)
	mdStr, err := converter.ConvertString(html)
	if err != nil {
		t.Fatal(err)
	}
	println(mdStr)
}

output

1\. xxx 2\. xxxx3\. xxx

![](xxx)4\. golanga. xxb. xx

want

1. xxx 
2. xxxx
3. xxx

![](xxx)
4. golang
a. xx
b. xx

Fix punctuation with rules?

Hi!

As I'm writing a scraper for a website, I'd like to fix some minor punctuation issues before saving the text, such as misplaced spaces next to parentheses, as in: Lorem ( ipsum dolor) sit amet or consectetur (adipiscing ) elit.

Do you think writing a converter rule (converter.AddRules) is the right solution to remove this kind of error? I'd also like to replace some quotation marks, and add italics for quotations…

Hoping it's the right place for this kind of question!
Best, Laurent
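
Since these fixes apply to plain text rather than to a particular HTML element, one option is to run them on the converted markdown string instead of in a rule. A sketch with the standard regexp package (fixParenSpacing is a hypothetical helper):

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// spaceAfterOpen matches "(" followed by whitespace.
	spaceAfterOpen = regexp.MustCompile(`\(\s+`)
	// spaceBeforeClose matches whitespace followed by ")".
	spaceBeforeClose = regexp.MustCompile(`\s+\)`)
)

// fixParenSpacing removes whitespace directly inside parentheses.
func fixParenSpacing(s string) string {
	s = spaceAfterOpen.ReplaceAllString(s, "(")
	return spaceBeforeClose.ReplaceAllString(s, ")")
}

func main() {
	fmt.Println(fixParenSpacing("Lorem ( ipsum dolor) sit amet"))
	fmt.Println(fixParenSpacing("consectetur (adipiscing ) elit"))
}
```

Note that this would also touch parentheses inside code blocks and link targets, so it may need to be restricted to ordinary text.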

Spacing & numbering issues with nested lists

Describe the bug

I see a couple issues with nested lists.

One issue is that there are extra line breaks between list items in nested lists. When I render this in my application, it wraps text with a <p> if there's an extra line break (which has implications for margin/padding).

Another (small) issue I see is that numbering gets off for numbered lists. I realize this doesn't matter with Markdown, but I thought I'd note it.

HTML Input

<p>
  The Corinthos Center for Cancer will be partially closed for remodeling
  starting <strong>4/15/21</strong>. Patients should be redirected as space
  permits in the following order:
</p>
<ol>
  <li>Metro Court West.</li>
  <li>Richie General.</li>
  <ol>
    <li>This place is ok.</li>
    <li>Watch out for the doctors.</li>
    <ol>
      <li>They bite.</li>
      <li>But not hard.</li>
    </ol>
  </ol>
  <li>Port Charles Main.</li>
</ol>
<p>For further information about appointment changes, contact:</p>
<ul>
  <li>Dorothy Hardy</li>
  <ul>
    <li><em>Head of Operations</em></li>
    <ul>
      <li><em>Interim</em></li>
    </ul>
  </ul>
  <li>[email protected]</li>
  <li>555-555-5555</li>
</ul>
<p>
  <em>The remodel is </em
  ><a href="http://www.google.com/" target="_self"><em>expected</em></a
  ><em> to complete in June 2021.</em>
  <strong><em>Timeframe subject to change</em></strong
  ><em>.</em>
</p>

Generated Markdown

The Corinthos Center for Cancer will be partially closed for remodeling
starting **4/15/21**. Patients should be redirected as space
permits in the following order:

1. Metro Court West.
2. Richie General.

   1. This place is ok.
   2. Watch out for the doctors.
      1. They bite.
      2. But not hard.

4. Port Charles Main.

For further information about appointment changes, contact:

- Dorothy Hardy

  - _Head of Operations_
    - _Interim_

- [email protected]
- 555-555-5555

_The remodel is_ [_expected_](http://www.google.com/) _to complete in June 2021._ **_Timeframe subject to change_** _._

Note how there are extra line breaks after "2. Richie General.", " 2. But not hard.", "- Dorothy Hardy", and " - Interim".

Also note how "4. Port Charles Main." should be "3. Port Charles Main.".

Expected Markdown

The Corinthos Center for Cancer will be partially closed for remodeling
starting **4/15/21**. Patients should be redirected as space
permits in the following order:

1. Metro Court West.
2. Richie General.
   1. This place is ok.
   2. Watch out for the doctors.
      1. They bite.
      2. But not hard.
3. Port Charles Main.

For further information about appointment changes, contact:

- Dorothy Hardy
  - _Head of Operations_
    - _Interim_
- [email protected]
- 555-555-5555

_The remodel is_ [_expected_](http://www.google.com/) _to complete in June 2021._ **_Timeframe subject to change_** _._

Additional context

I see this with the latest version (1.3.0). I'm using no plugins.

Thanks for the utility!

More Tests

  • Test ConvertX methods
  • Test the outside api
  • Test more plugins (e.g. Table)
  • Test Edge Cases (for example from turndown)
  • Test escape
  • Add Test Coverage Badge

Potential issue in the Table plugin with the isFirstTbody logic

Hello, in the table.go plugin there's an issue with the firstSibling logic in the isFirstTbody function.

func isFirstTbody(s *goquery.Selection) bool {
	firstSibling := s.Siblings().Eq(0) // TODO: previousSibling
	if s.Is("tbody") && firstSibling.Length() == 0 {
		return true
	}
	return false
}

I'm retrieving tables from Confluence HTML in the tbody-tr-th format. Somehow firstSibling.Length() is not 0. I haven't figured it out completely, but when I comment that check out it seems to do what it's supposed to do, although it might introduce a new bug :).

github.com/JohannesKaufmann/html-to-markdown v1.3.6
github.com/PuerkitoBio/goquery v1.8.0

The domain parameter for NewConverter

domain is used for links and images to convert relative urls ("/image.png") to absolute urls.

However, I found it not working:

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	url := "https://github.com/JohannesKaufmann/html-to-markdown"
	doc, err := goquery.NewDocument(url)
	if err != nil {
		log.Fatal(err)
	}
	content := doc.Find("div.BorderGrid-row.hide-sm.hide-md > div")

	conv := md.NewConverter(md.DomainFromURL(url), true, nil)
	markdown := conv.Convert(content)

	fmt.Println(markdown)
}

$ go run /tmp/h2m-test.go
## About

_ Convert HTML to Markdown. Even works with whole websites.


### Topics

[go](/topics/go "Topic: go")[golang](/topics/golang "Topic: golang")[html](/topics/html "Topic: html")[html-to-markdown](/topics/html-to-markdown "Topic: html-to-markdown")[markdown](/topics/markdown "Topic: markdown")

### Resources

[Readme](#readme)

### License

[MIT License](/JohannesKaufmann/html-to-markdown/blob/master/LICENSE)

I.e., none of the links and images are converted from relative urls ("/image.png") to absolute urls.

🐛 Bug Consecutive <span> missing spaces

Describe the bug
Consecutive <span> elements lose the spaces between them, for example:

import"fmt"
fortrue

The spaces between the words are missing.

HTML Input

<div class="example_code">
<span style="color: #b1b100; font-weight: bold;">package</span> main<br>
<br>
<span style="color: #b1b100; font-weight: bold;">import</span> <span style="color: #cc66cc;">"fmt"</span><br>
<br>
<span style="color: #993333;">func</span> main<span style="color: #339933;">()</span> <span style="color: #339933;">{</span><br>
&nbsp; &nbsp; <span style="color: #b1b100; font-weight: bold;">for</span> <span style="color: #000000; font-weight: bold;">true</span> &nbsp;<span style="color: #339933;">{</span><br>
&nbsp; &nbsp; &nbsp; &nbsp; fmt<span style="color: #339933;">.</span>Printf<span style="color: #339933;">(</span><span style="color: #cc66cc;">"xxxxx。<span style="color: #000099; font-weight: bold;">\n</span>"</span><span style="color: #339933;">);</span><br>
&nbsp; &nbsp; <span style="color: #339933;">}</span><br>
<span style="color: #339933;">}</span><br>
</div>

Generated Markdown

package main



import"fmt"



func main(){

fortrue{


        fmt.Printf("xxxxx。\n");

}

}

Expected Markdown

package main

import "fmt"

func main(){

        for true{
        
                fmt.Printf("xxxxx。\n");
        
        }
}

Brackets escaping is currently disabled

Describe the bug
Brackets are currently not being escaped by html-to-markdown

HTML Input

[this should be escaped](http://test)

Generated Markdown

[this should be escaped](http://test)

Expected Markdown

\[this should be escaped\](http://test)

Additional context
What would it take for the bracket escaping to be re-enabled in escape.go? I see that it was previously disabled due to issues with the regex. Does it simply require a more robust regular expression?

Is there a way to export the output to an md file directly?

Is there a way for the code to save the output into an md file instead of printing the md output in the terminal? So example.html would automatically be saved as example.md.

And ideally it would work for directories or multiple files. Any way to achieve that now or would this have to be added as a feature?
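
Until something like this is built in, a small wrapper can do it. The sketch below uses only the standard library. markdownPath and convertFile are hypothetical helpers, and the convert parameter is a placeholder for the real conversion (for example a closure around converter.ConvertString):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// markdownPath maps "dir/example.html" to "dir/example.md".
func markdownPath(htmlPath string) string {
	return strings.TrimSuffix(htmlPath, filepath.Ext(htmlPath)) + ".md"
}

// convertFile reads htmlPath, converts its content with the supplied function
// and writes the result next to the source file.
func convertFile(htmlPath string, convert func(string) (string, error)) error {
	src, err := os.ReadFile(htmlPath)
	if err != nil {
		return err
	}
	out, err := convert(string(src))
	if err != nil {
		return err
	}
	return os.WriteFile(markdownPath(htmlPath), []byte(out), 0o644)
}

func main() {
	// Stand-in conversion so the sketch runs without the library.
	passthrough := func(html string) (string, error) { return html, nil }

	// Every command-line argument is treated as an HTML file to convert.
	for _, p := range os.Args[1:] {
		if err := convertFile(p, passthrough); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}
```

Invoked with a shell glob such as *.html as arguments, this would cover multiple files. Walking a directory tree could be added with filepath.WalkDir.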

Missed space between two links

HTML: <p><a href="http://first.com">first</a> <a href="http://second.com">second</a></p>
Result: [first](http://first.com)[second](http://second.com)
Expected: [first](http://first.com) [second](http://second.com)

Configure elements to keep in `<code>`

Describe the bug
A clear and concise description of what the bug is.

HTML Input

<p>The ordinal number "fifth" can be abbreviated in various languages as follows:</p>
<ul>
	<li><code>English: 5<sup>th</sup></code></li>
	<li>French: 5<sup>ème</sup></li>
</ul>

Generated Markdown

The ordinal number "fifth" can be abbreviated in various languages as follows:

- `English: 5th`
- French: 5<sup>ème</sup>

Expected Markdown

The ordinal number "fifth" can be abbreviated in various languages as follows:

- `English: 5<sup>th</sup>`
- French: 5<sup>ème</sup>

Additional context
I use NewConverter("", true, nil).Keep("sup") to convert.
