xijo / reverse_markdown

Ruby gem to convert html into markdown

License: Do What The F*ck You Want To Public License

Ruby 85.78% HTML 14.22%

reverse_markdown's Introduction

Summary

Transform html into markdown. Useful for example if you want to import html into your markdown based application.

Changelog

See Change Log

Requirements

Nokogiri
Ruby 2.0.0 or higher

Installation

Install the gem

[sudo] gem install reverse_markdown

or add it to your Gemfile

gem 'reverse_markdown'

Features

Supports all the established html tags like h1, h2, h3, h4, h5, h6, p, em, strong, i, b, blockquote, code, img, a, hr, li, ol, ul, table, tr, th, td, br, figure
Module based - if you miss a tag, just add it
Can deal with nested lists
Inline and block code is supported
Supports blockquote

Usage

Ruby

You can convert html content as string or Nokogiri document:

input  = '<strong>feelings</strong>'
result = ReverseMarkdown.convert input
result.inspect # " **feelings** "

Commandline

It's also possible to convert html files to markdown using the binary:

$ reverse_markdown file.html > file.md
$ cat file.html | reverse_markdown > file.md

Configuration

The following options are available:

unknown_tags (default pass_through) - how to handle unknown tags. Valid options are:
- pass_through - Include the unknown tag completely into the result
- drop - Drop the unknown tag and its content
- bypass - Ignore the unknown tag but try to convert its content
- raise - Raise an error to let you know
github_flavored (default false) - use GitHub flavored markdown (so far only code blocks are supported)
tag_border (default ' ') - how to handle tag borders. valid options are:
- ' ' - Add whitespace if there is none at tag borders.
- '' - Do not add whitespace.

As options

Just pass your chosen configuration options in after the input. The given options will last for this operation only.

ReverseMarkdown.convert(input, unknown_tags: :raise, github_flavored: true)

Preconfigure

Or configure it block-style at the initializer level. These settings apply to all conversions until they are changed.

ReverseMarkdown.config do |config|
  config.unknown_tags     = :bypass
  config.github_flavored  = true
  config.tag_border  = ''
end

Related stuff

Write custom converters - Wiki entry about how to write your own converter
html_massage - A gem by Harlan T. Wood to convert regular sites into markdown using reverse_markdown
word-to-markdown - Convert word docs into markdown while using reverse_markdown, by Ben Balter
markdown syntax - The markdown syntax specification
github flavored markdown - GitHub's extension to markdown
wmd-editor - Markdown flavored text editor

Thanks

Thanks to all contributors and all other helpers:

Empact Ben Woosley
harlantwood Harlan T. Wood
aprescott Adam Prescott
danschultzer Dan Schultzer
Benjamin-Dobell Benjamin Dobell
schkovich Goran Miskovic
craig-day Craig Day
grmartin Glenn R. Martin
willglynn Will Glynn


reverse_markdown's Issues

Blockquotes are assumed to be free of any HTML elements

It looks like <blockquote>s are assumed to be free of any HTML elements. If input like <blockquote>...</blockquote> is given, the output is broken:

ReverseMarkdown.parse("<blockquote><p>foo</p></blockquote>")
#=> ">\n\n> foo"

i.e.,:

>

> foo

The desired output here is simply

> foo

Likewise, if a <blockquote> contains, say, a list:

ReverseMarkdown.parse("<blockquote><ul><li>foo</li></ul></blockquote>")
#=> ">\n- foo"

i.e.,

>
- foo

The desired output is

> - foo

<blockquote> should probably be unwrapped, then its contents parsed to Markdown, then the final result prepended with >, so that input like:

<blockquote>
<p>Some text.</p>
<p>Some more text.</p>
</blockquote>

Results in:

> Some text.
>
> Some more text.
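
The "convert the contents first, then prefix the result" idea suggested above could be sketched as follows. This is plain Ruby, not the gem's blockquote converter; `quote_markdown` is a hypothetical helper that assumes the inner HTML has already been converted to Markdown.

```ruby
# Hypothetical helper (not part of reverse_markdown): prefix
# already-converted Markdown with "> ". Blank lines become a bare ">"
# so that the paragraphs stay inside one quote block.
def quote_markdown(inner_markdown)
  inner_markdown.strip.lines.map do |line|
    text = line.chomp
    text.empty? ? '>' : "> #{text}"
  end.join("\n")
end

puts quote_markdown("Some text.\n\nSome more text.\n")
puts quote_markdown("- foo\n")
```

Stripping the inner Markdown before prefixing is what avoids the stray leading ">" line shown in the broken output.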

Nested indentation - confirm bug?

Before I dive in and try and fix it, can you confirm that this behavior is undesirable?

Given a nested list between adjacent list items like this:

<ul>
  <li>alpha</li>
  <li>bravo
    <ul>
      <li>bravo alpha</li>
      <li>bravo bravo
        <ul>
          <li>bravo bravo alpha</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>charlie</li>
  <li>delta</li>
</ul>

An extra newline seems to be inserted. So instead of getting this:

- alpha
- bravo
  - bravo alpha
  - bravo bravo
    - bravo bravo alpha
- charlie
- delta

Reverse markdown produces this:

- alpha
- bravo
  - bravo alpha
  - bravo bravo
    - bravo bravo alpha

- charlie
- delta

Seems wrong to me, but just wanted to check before I dug around for a fix. Markdown has surprised me before.

convert error

I get this error when converting HTML to markdown. Is this a bug?
ArgumentError: negative argument
from /home/case18/.rvm/gems/ruby-2.1.0/gems/reverse_markdown-0.5.0/lib/reverse_markdown/converters/li.rb:22:in `*'
from /home/case18/.rvm/gems/ruby-2.1.0/gems/reverse_markdown-0.5.0/lib/reverse_markdown/converters/li.rb:22:in `indentation_for'
from /home/case18/.rvm/gems/ruby-2.1.0/gems/reverse_markdown-0.5.0/lib/reverse_markdown/converters/li.rb:6:in `convert'
from /home/case18/.rvm/gems/ruby-2.1.0/gems/reverse_markdown-0.5.0/lib/reverse_markdown/converters/base.rb:11:in `treat'
from /home/case18/.rvm/gems/ruby-2.1.0/gems/reverse_markdown-0.5.0/lib/reverse_markdown/converters/base.rb:6:in `block in treat_children'
from /home/case18/.rvm/gems/ruby-2.1.0/gems/nokogiri-1.6.1/lib/nokogiri/xml/node_set.rb:237:in `block in each'
from /home/case18/.rvm/gems/ruby-2.1.0/gems/nokogiri-1.6.1/lib/nokogiri/xml/node_set.rb:236:in `upto'
from /home/case18/.rvm/gems/ruby-2.1.0/gems/nokogiri-1.6.1/lib/nokogiri/xml/node_set.rb:236:in `each'
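
The top frame of the trace is `*`, which is consistent with a string being multiplied by a negative count. A minimal sketch of that failure and one way to guard against it, assuming the indentation is built as "string times depth" (the `indentation_for` below is a hypothetical stand-in, not the gem's code):

```ruby
# Hypothetical stand-in for an indentation helper. `depth` is the
# nesting level computed for a list item (an assumption for this sketch).
def indentation_for(depth)
  '  ' * [depth, 0].max # clamp so a negative depth cannot raise
end

begin
  '  ' * -1 # String#* rejects negative counts
rescue ArgumentError => e
  puts "#{e.class}: #{e.message}"
end

p indentation_for(-1)
p indentation_for(2)
```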

Incorrect markdown from adjacent strong tags with non-strong text after

ReverseMarkdown.convert('<strong>abc<strong> <strong>123</strong><br><br><strong>xyz</strong> hi')

"**abc 123  \n  \nxyz hi**  \n  "

Expected is

"**abc** **123**\n\n**xyz** hi"

Support for <colgroup> and <col>

Add support for <colgroup> and <col> which could be simply ignored...

http://www.w3schools.com/tags/tag_colgroup.asp

tag_border option is being ignored.

The tag_border option can theoretically be used to specify whether tags have a whitespace border or not:

#default:
ReverseMarkdown.convert("markdownFoo")
#no whitespace:
ReverseMarkdown.convert("markdownFoo", tag_border:'')

This option doesn't work at the moment, because:

tag_border is only used from cleaner.rb=>tidy
The call cleaner.tidy(result) in reverse_markdown.rb happens outside of the config block, so...
Inside of cleaner we revert to the default config, ignoring the input tag_border.

The bug can be fixed straightforwardly by moving the cleaner.tidy call into the config block. I'll add a PR in a bit demonstrating the fix.

Justification: Using ReverseMarkdown to parse tricky mixed tags of this form

<a href=\"http://Google.de\">Google.de</a>

results in the following output

[_ **Google** _ **.de**](http://Google.de)

The underscore tags aren't parsed correctly by markdown (or by redcarpet) because they're separated from the bold tag by the whitespaces added by the cleaner.

Emphasis trailing/leading whitespace

In relation to #37 I found an issue. It seems there is only proper clearing of whitespace with double emphasis:

2.1.1 :025 > ReverseMarkdown.convert '<strong> test </strong>'
 => " **test** " 
2.1.1 :026 > ReverseMarkdown.convert '<em> test </em>'
 => "_ test _"

From reading the cleaner.rb file I see that in most places emphasis is only defined as double asterisks, not single or triple. I suggest that the RegExp be changed to handle this more carefully.

Process footnotes

I need to process footnotes and came up with this extension, if you think it could be a generic way to process footnotes, I'll be happy to open a PR.

# frozen_string_literal: true

require 'reverse_markdown'

# Sent as a patch to reverse_markdown
# https://github.com/xijo/reverse_markdown/issues/101
module ReverseMarkdown
  module Converters
    class Footnote < A
      def convert(node, state = {})
        # If the link has a circular reference, we need to check if it's
        # inside a paragraph or it's the first element of a paragraph or
        # list item.
        if node['id'] && node['href']&.start_with?('#')
          parent = node.parent

          # The link could be contained in a <sup>
          until %w[p li].include?(parent.name) do
            parent = parent.parent

            # Don't go further than this
            break if parent.name == 'body'
          end

          first_child = parent.first_element_child

          # If it's the first link on the parent, it's the footnote
          # itself, otherwise it's the reference.
          if first_child == node || first_child.children.include?(node)
            "[^#{node['href'].tr('#', '')}]:"
          else
            "[^#{node['id']}]"
          end
        # Just process the link.
        else
          super
        end
      end
    end

    register :a, Footnote.new
  end
end

Edit: The footnote was pointing to its id instead of the href for the reference. Also the markdown syntax was incorrect.

could there be an option for inline links

[an url](http://example.com/link)

Because for long documents, the reference style is hard to edit. Thank you!

Github tables without a header row

I have some tables that are coded in HTML without a header row, like this...

<table>
<tr>
<td><b> header 1 </b></td> <td> <b> header 2 </b> </td>
<td> item 3 </td> <td> item 4 </td>
</tr>
</table>

| header 1 | header 2 |
| item 3 | item 4 |

Suggestion: github_flavored should treat the first row as the header row to match the Github markdown spec for tables. The markdown shown above is not being converted back to HTML properly by Redcarpet or the github_markup gem (Ruby). Thanks!
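
To illustrate the suggestion, here is a small sketch of how rows could be rendered with the first row promoted to a header. `gfm_table` is a hypothetical helper written for this example, not an API of the gem; it only shows the separator row that GitHub flavored markdown expects after the header.

```ruby
# Hypothetical helper: render rows (arrays of cell strings) as a
# GitHub flavored markdown table, treating the first row as the header.
def gfm_table(rows)
  header, *body = rows
  separator = Array.new(header.size, '---')
  [header, separator, *body]
    .map { |cells| "| #{cells.join(' | ')} |" }
    .join("\n")
end

puts gfm_table([['header 1', 'header 2'], ['item 3', 'item 4']])
```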

unknown end tag: br

Please add line breaks.

Add option to convert HTML entities such as &nbsp; to space or unicode non-breaking space

Is there a way to get &nbsp; to be converted to space?

Pass unsupported tags instead of dropping them

What's the logic behind dropping all unsupported tags, rather than passing them through?

HTML is valid within markdown. If I have a non-markdownable entity (e.g. a script tag) reverse_markdown should pass it through to the output (or at least give me the option to do so).

Whitespace before links

When links are directly preceded by non-whitespace characters, a space is added before the link. Hence, a link wrapped in quotes, parens, etc, will come out appearing poorly formatted. Since whitespace is not required before a link in markdown, it shouldn't be added when reversing.

irb(main):004:0> ReverseMarkdown.convert '<p>I like this "<a href="http://daringfireball.net/projects/markdown/">markdown</a>" stuff!</p>'
=> "I like this \" [markdown](http://daringfireball.net/projects/markdown/)\" stuff!\n\n"

Expected:
=> "I like this \"[markdown](http://daringfireball.net/projects/markdown/)\" stuff!\n\n"

An exception may be where a ! precedes the link in the HTML, making the converted markdown an inline image instead of a link.

Doesn't handle <br> within <strong>

ReverseMarkdown.convert("<strong>hello<br><br>world</strong>")

**hello  
  
world**

Expected:

**hello**  
  
**world**

Untidy tags produce invalid Markdown

I may work on a patch when I have the time, but I'll leave this here for discussion :)

I'm converting a large site with user-edited HTML content through a WYSIWYG editor, so I'm finding many cases where stuff like this word is actually word, which this gem converts into _wo__rd_. I think it'd be good to add an option to sanitize HTML by removing empty tags and tidying a bit before, but I'm not sure if it'd be a task for reverse markdown. Maybe passing something that can do the cleanup, like Loofah?

Document GitHub style fenced code blocks support

There seems to be a :github_style_code_blocks option, but it isn't documented. Would be a good selling point if it were in the README!

Text inside backticks is escaped

When I try to convert a <p> element containing text inside backticks, underscores (and asterisks)
are escaped:

ReverseMarkdown.convert("<p>`foo_bar`</p>")
=> "`foo\\_bar`"

Then, when I try to convert back the result with RedCarpet then I get this:

=> "<p><code>foo\\_bar</code></p>\n"

which is inconsistent with the reversed string due to the escaping.
Shouldn't reverse markdown recognise the backticks and skip escaping inside them?
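
One way to picture "skip escaping inside backticks" is the sketch below. It is a standalone illustration in plain Ruby and makes no claim about how the gem's text converter is structured; `escape_outside_backticks` is a hypothetical name.

```ruby
# Hypothetical helper: escape "_" and "*" only outside `code spans`.
def escape_outside_backticks(text)
  # Splitting on a capture group keeps the code spans in the result.
  text.split(/(`[^`]*`)/).map do |part|
    if part.match?(/\A`[^`]*`\z/)
      part # a complete code span, left untouched
    else
      part.gsub(/[_*]/) { |char| "\\#{char}" }
    end
  end.join
end

puts escape_outside_backticks('`foo_bar` and baz_qux')
```

An unclosed backtick is treated as ordinary text here, so its underscores are still escaped.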

Consider changing the license to something listed on Open Source Initiative list of approved licenses

As per https://opensource.org/minutes20090304 this was rejected as redundant with Fair license https://opensource.org/licenses/Fair

Would you consider using Fair license or CC0 or X11 or Apache 2.0 instead?

https://www.gnu.org/licenses/license-list.html#WTFPL FSF recommends using X11 license or Apache 2.0

This is stopping gitlab from updating licensee, which now includes a dependency on reverse_markdown

https://gitlab.com/gitlab-org/gitaly/-/issues/2856#note_438114315

When encountering a formatting tag, spaces are lost

When passing in text with formatting, such as <strong> or <em>, any spaces around the tags are lost, causing the words to run together.

Example:

Elephants are large land mammals in two extant genera of the family Elephantidae: Elephas and Loxodonta, with the third genus Mammuthus extinct.

produces the following output:

**Elephants**are large land mammals in two extant genera of the family Elephantidae:*Elephas*and*Loxodonta*, with the third genus*Mammuthus*extinct.

Unwanted new line characters within lists with paragraphs

The library is adding what I believe to be unintended newline characters, when parsing a document with li > p structure with newlines in between the 2 nodes:

[6] pry(main)> ReverseMarkdown.convert("<ul> <li><p>a</p></li></ul>")
=> "- a\n\n"
[7] pry(main)> ReverseMarkdown.convert("<ul><li><p>a</p></li></ul>")
=> "- a\n\n"
[8] pry(main)> ReverseMarkdown.convert("<ul><li> <p>a</p></li></ul>")
=> "-  \n\na\n\n"
[9] pry(main)> ReverseMarkdown.convert("<ul><li>\n<p>a</p>\n</li></ul>")
=> "- \n\na\n\n"

The first two examples work as intended, but if you add a space or newline character between the <li> and the <p>, the library changes its behaviour and introduces the problem.

Funnily enough this scenario was accounted for in the list specs here, but the corresponding assertion is being skipped here - and it has been this way since 2012 (cd24cc3).

Anyway I'll open a PR shortly with a fix proposal. 👍

HTML string containing underscores gets escaped and shown in output markdown

Ruby 2.3.0

Rails 4.2.5

Rails Console output:

2.3.0 :005 > html_str = "<p><strong>Username</strong> : %{user_name}</p>"
 => "<p><strong>Username</strong> : %{user_name}</p>" 
2.3.0 :006 > md = ReverseMarkdown.convert(html_str)
 => " **Username** : %{user\\_name}\n\n" 
2.3.0 :007 > puts md
 **Username** : %{user\_name}
 => nil

As can be seen when my HTML string contains words separated by underscore(s), the underscore(s) gets escaped which is correct. But is there any way we can hide those in the output string? I have a use-case wherein I want to convert the HTML string to its Markdown version and render the markdown version as it is.

Incorrect markdown with non-breaking space before

abc  => _abc _, which is invalid according to redcarpet.

Unsupported tags: h5, h6, i, b

The README says you support h5 and h6, but they don't work:

ReverseMarkdown.parse("<h5>hi</h5>")
#=> "hi"
ReverseMarkdown.parse("<h6>hi</h6>")
#=> "hi"

Additionally, b and i tags aren't supported, although strong and em are.

Raise on character encoding errors

I've been using Reverse Markdown and it works great most of the time. I've run into one issue that I thought I'd get your opinion on.

Sometimes the HTML documents I'm converting have character encoding problems, leading to the dreaded ArgumentError: invalid byte sequence in UTF-8.

In other places I'm fixing this by coercing the lines of a file to UTF8 as I read them. I've discovered that when you parse a line you can generally just force_encoding on it, and that will convert typographic marks and whatnot pretty well, but occasionally you'll run into issues where it's not enough and you have to be more aggressive, ie. the following:

def clean_line(line)
  # encoding must be utf8, if non-utf8 characters are encountered we remove them.
  # Weirdly though, this can fail, but then doesn't blow up until you call something else on the string...
  line.force_encoding("UTF-8").strip # strip will make this raise if it didn't work
rescue
  # ... in that case we want to selectively remove the offending characters.
  line.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

I end up using this same code to scrub HTML before I enter it into ReverseMarkdown, but it would probably be more efficient to handle it inside the gem - and would save other people from this same headache.

Are you interested in handling encoding errors inside the gem? If yes, you can use that code, or I can try to circle back with a PR. If not, no worries, just thought it might be worth considering.
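
For comparison, a similar clean-up can be sketched with String#scrub, which exists in Ruby 2.1 and later (the README lists Ruby 2.0.0 as the minimum, so this is an assumption about the runtime). `scrub_html` is a hypothetical pre-processing helper, not something the gem provides.

```ruby
# Hypothetical pre-processing step: return a valid UTF-8 copy of `html`,
# dropping any bytes that are not valid UTF-8.
def scrub_html(html)
  utf8 = html.dup.force_encoding(Encoding::UTF_8)
  utf8.valid_encoding? ? utf8 : utf8.scrub('')
end

dirty = "caf\xE9 au lait".b # a Latin-1 byte inside a binary string
clean = scrub_html(dirty)
p clean
p clean.valid_encoding?
```

Checking `valid_encoding?` up front avoids relying on a later string operation to raise.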

Thanks for a great gem!

Blockquote not correctly closed

Hi I have the following HTML:

<blockquote>  This is a quote </blockquote>

<img alt="alt" src="https://path/to/file.jpg" />

The generated markdown for this is

> This is a quote ![alt](https://path/to/file.jpg)

Somehow the newline before the image is missing after a blockquote. It happens for all blockquote elements I tested that were followed by an image. If there is text after the quote everything works fine.

I'm using reverse_markdown version 1.0.3 and ruby 2.3.1p112

Improper space parsing within codeblock

The use of String#squeeze in the file below is messing with codeblock's indentation.

reverse_markdown/lib/reverse_markdown/converters/text.rb

Line 51 in 7da325d

text.tr("\n\t", ' ').squeeze(' ')

I.E.

# Original code block
def original_value
  if assigned?
    original_attribute.original_value
  else
    type_cast(value_before_type_cast)
  end
end

# After ReverseMarkdown
def original_value
 if assigned?
 original_attribute.original_value
 else
 type_cast(value_before_type_cast)
 end
end
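
The effect of the quoted expression can be reproduced in isolation. The first part below runs exactly that expression on a multi-line string; the `normalize` method after it is only one possible direction (an assumption, not the gem's fix) in which preformatted text bypasses the collapsing step.

```ruby
code = "def original_value\n  if assigned?\n    original_attribute.original_value\n  end\nend"

# The expression quoted above: newlines and tabs become spaces,
# then every run of spaces collapses into a single space.
flattened = code.tr("\n\t", ' ').squeeze(' ')
p flattened

# Hypothetical alternative: leave text alone when it is known to come
# from a <pre> or <code> node.
def normalize(text, preformatted: false)
  preformatted ? text : text.tr("\n\t", ' ').squeeze(' ')
end

p normalize(code, preformatted: true) == code
```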

Links to ids don't produce Markdown links

When the href is to an HTML id, no link is generated

link_to_external_site = '<a href="https://example.com#hallo">Hallo!</a>'
link_to_id = '<a href="#hallo">Hallo!</a>'

ReverseMarkdown.convert(link_to_external_site, github_flavored: true, tag_border: '')
=> "[Hallo!](https://example.com#hallo)"

ReverseMarkdown.convert(link_to_id, github_flavored: true, tag_border: '')
=> "Hallo!"

In the second case, I expected [Hallo!](#hallo), not Hallo!.

Indenting seems off around code blocks

Input:

If you don't already have a <code>php.ini</code> file you'll need to create one by copying the default:
<code>
sh-3.2# if ( ! test -e /private/etc/php.ini ) ; then cp /private/etc/php.ini.default /private/etc/php.ini; fi
</code>

Restart Apache:

Output:

If you don't already have a  `php.ini`  file you'll need to create one by copying the default:  `sh-3.2# if ( ! test -e /private/etc/php.ini ) ; then cp /private/etc/php.ini.default /private/etc/php.ini; fi`  Restart Apache:  `sh-3.2# apachectl restart`

Expected either:

If you don't already have a `php.ini` file you'll need to create one by copying the default:  

    sh-3.2# if ( ! test -e /private/etc/php.ini ) ; then cp /private/etc/php.ini.default /private/etc/php.ini; fi

Restart Apache:

    sh-3.2# apachectl restart

or one using GitHub style comment blocks that I can't seem to figure out how to format using markdown.

[Question] How to change \n by ?

The generated markdown contains \n, but not all markdown renderers understand \n. Is it possible to replace \n with something else? Is there a config option for this?

Hyphen(-) issue

A hyphen (-) at the start of a line is rendered as a list item in markdown. I think we should escape the hyphen (as "\-"); otherwise, after conversion, a plain hyphen is shown as a list item, whereas I expected only a normal hyphen.

Too much blankspace gets stripped

Coming from forem/forem#8457, I think that ReverseMarkdown strips too much blankspace in some scenarios. Here's a failing test:

  it 'keeps whitespace surrounding links' do
    result = ReverseMarkdown.convert("a\n<a href='1'>link</a>\nis good\nbut blankspace is better")
    expect(result).to eq "a [link](1) is good but blankspace is better\n\n"
  end

The output is a[link](1)is good but blankspace is better\n\n. This happens because the text converter calls remove_border_newlines: since the middle line is a link, it becomes its own Nokogiri node, and the three nodes are joined with no whitespace. I tried changing remove_border_newlines to squeeze the whitespace instead of removing all of it, but this doesn't work either: it keeps whitespace in scenarios where it shouldn't:

firstsecondthird becomes first\n\nsecond\n\n third\n\n.

I still want to investigate this further, but I decided to post this now to share my findings.

Improve handling of new lines

Using this HTML:

<p><strong>Some text<br><br>other text</strong></p>

I end up with:

**Some text

other text**

New lines inside emphasis tags should be handled specifically to avoid this problem.

uninitialized constant ReverseMarkdown::Mapper::Digest (NameError)

As a follow-up to #19: I'd done a little regexing to convert my <code> tags into GitHub-style code blocks and then ran into the following error:

.rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/reverse_markdown-0.4.6/lib/reverse_markdown/mapper.rb:23:in `block in process_root': uninitialized constant ReverseMarkdown::Mapper::Digest (NameError)
    from .rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/reverse_markdown-0.4.6/lib/reverse_markdown/mapper.rb:22:in `gsub!'
    from .rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/reverse_markdown-0.4.6/lib/reverse_markdown/mapper.rb:22:in `process_root'
    from .rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/reverse_markdown-0.4.6/lib/reverse_markdown.rb:15:in `parse'
    from ../fixmarkdown.rb:12:in `block in <main>'
    from ../fixmarkdown.rb:5:in `glob'
    from ../fixmarkdown.rb:5:in `<main>'

Seems like a require 'digest' is missing some place.
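
A NameError of this kind typically appears when a standard-library constant is referenced before its library has been loaded. A minimal sketch of the remedy; where exactly the require belongs inside the gem is not shown in the trace and is left open here:

```ruby
require 'digest' # loads the standard library that defines Digest

# Once required, the constant resolves from any namespace.
module Example
  def self.fingerprint(text)
    Digest::MD5.hexdigest(text)
  end
end

puts Example.fingerprint('reverse_markdown').length
```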

Should the .config settings be "sticky"?

require 'reverse_markdown'

ReverseMarkdown.config do |config|
  config.unknown_tags     = :bypass
  config.github_flavored  = true
end
p ReverseMarkdown.convert("<pre>a = 5</pre>")
p ReverseMarkdown.convert("<pre>a = 5</pre>")

Produces this output:

"```\na = 5\n```\n"
"    a = 5\n\n"

Only the first one is fenced.

It strikes me that the .config settings should be sticky until changed.
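
The behaviour described as "sticky" can be modelled with a toy class: configured values persist, and per-call options are layered on top only for the duration of one call. `ToyConfig` is purely illustrative and is not the gem's configuration class.

```ruby
# Toy model (not the gem's code): settings persist until changed, and
# per-call overrides are restored once the block has finished.
class ToyConfig
  attr_accessor :github_flavored, :unknown_tags

  def initialize
    @github_flavored = false
    @unknown_tags = :pass_through
  end

  # Temporarily apply `overrides`, then put the previous values back.
  def with(overrides = {})
    previous = overrides.keys.map { |key| [key, public_send(key)] }.to_h
    overrides.each { |key, value| public_send("#{key}=", value) }
    yield self
  ensure
    (previous || {}).each { |key, value| public_send("#{key}=", value) }
  end
end

config = ToyConfig.new
config.github_flavored = true # sticky setting
config.with(github_flavored: false) { |c| p c.github_flavored }
p config.github_flavored # the sticky value is still in place
```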

img tag is not implemented correctly?

>> require 'reverse_markdown'
=> true

test 1

>> s = '<img src="./images/1.jpg">'
=> "<img src=\"./images/1.jpg\">"
>> ReverseMarkdown.parse s
=> "![./images/1.jpg] "

test 2

>> ss = '<img src="./images/1.jpg" alt="some pic">'
=> "<img src=\"./images/1.jpg\" alt=\"some pic\">"
>> ReverseMarkdown.parse ss
=> "![some pic][./images/1.jpg] "

test 3, copied from the test spec that comes with reverse_markdown

>> a = '<p><img src="http://foo.bar/dog.png" alt="My Dog" title="Ralph"></p>'
=> "<p><img src=\"http://foo.bar/dog.png\" alt=\"My Dog\" title=\"Ralph\"></p>"
>> ReverseMarkdown.parse a
=> "\n\n![My Dog][http://foo.bar/dog.png] "

Command-line HTML to Markdown now available!

I've added markdown support to my html_massage gem, of course using reverse_markdown.

So you can do:

gem install html_massage
html_massage markdown http://en.wikipedia.org/wiki/Hylos

And you will get back the markdown version of the given URL. It works pretty well, better on some sites than others. The results of the command above, for example, are here: https://gist.github.com/31baaa60d7354a72dab1

Just thought people in this project would want to know!

I'm also happy to add this info to this project's README (or wiki) -- what do you think @xijo?

Strongs add whitespace

Loving the new 5.x features. One thing I noticed in word-to-markdown's unit tests after bumping:

Converting the following:

<p class="P1">This word is <strong class="T1">bold</strong>.</p></body>

Resulted in (4.x top, 5.x bottom):

<"This word is **bold**."> expected but was
<"This word is **bold** .">.

(Note the extra space before the period)

Slack flavored Markdown

Hi, I was recently working with slack integration and had to convert HTML into Slack Flavored Markdown, which is a bit different than GitHub Flavored.

For the urgent need, I forked this project and made the required changes to make it work with Slack. I thought this might be good starting point for me to contribute to open source. If you guys think that there is a use for Slack Flavored Markdown, then I'll be more than happy to open a PR for it.

Slack only allows a small subset of Markdown features. It includes things like Bold, Italic, Ordered List, Unordered List, Quote, Code Block.

Ruby verbose mode warnings

If run with "ruby -w" (or "$VERBOSE = true"), reverse_markdown causes the following warnings:

reverse_markdown/cleaner.rb:13: warning: ambiguous first argument; put parentheses or even spaces
reverse_markdown/cleaner.rb:29: warning: ambiguous first argument; put parentheses or even spaces
reverse_markdown/cleaner.rb:35: warning: ambiguous first argument; put parentheses or even spaces
reverse_markdown/cleaner.rb:41: warning: ambiguous first argument; put parentheses or even spaces
reverse_markdown/cleaner.rb:47: warning: ambiguous first argument; put parentheses or even spaces

Maybe those could be eliminated for cleanliness? These warnings pop up if you are developing a script that requires reverse_markdown and run it with "ruby -w".

Newlines whitespace is stripped instead of replaced by a single space

It seems that whitespace consisting of newlines only is stripped, instead of being replaced by a single space as expected.

Actual:

irb(main):002:0> ReverseMarkdown.convert "hello\nworld\n"
=> "helloworld\n\n"

Expected:

"hello world\n\n"

no new line char in code inside <pre> tag

Input

<p>I ran into a weird situation today. Active Record objects stored in vars are removed when I switched from one tenant
    to another on the fly. This will create a weird test-failing scenario and you never know why its happening.</p>
<pre>
  def switch(name)
    yield(name)
  end
  
  @customers = [1,2,3]
  switch('shiva') do |name|
    puts name
    puts @customers
  end
  
  # Gives output
  # shiva
  # 1
  # 2
  # 3
</pre>

<p>&nbsp;</p>
<p>but when I do this</p>

<pre>
  @roles     = Role.all
  @customers = org.branches
  @list      = [1, 2]
  
  puts 'Role count' + @roles.count.to_s
  puts 'Customer count' + @customers.count.to_s
  puts 'list count' + @list.count.to_s
  
  Apartment::Tenant.switch!(org.database_name)
  puts '-------'
  puts 'Role count' + @roles.count.to_s
  puts 'Customer count' + @customers.count.to_s
  puts 'list count' + @list.count.to_s
</pre>

<p>output is</p>

<pre>
  Role count2
  Customer count2
  list count2
  -------
  Role count0
  Customer count0
  list count2
</pre>

<p>&nbsp;</p>
<h2>Reason</h2>

<pre>
  # apartment-2.2.0/lib/apartment/adapters/abstract_adapter.rb
  #   Switch to a new tenant
  #
  #   @param {String} tenant name
  #
  def switch!(tenant = nil)
    run_callbacks :switch do
      return reset if tenant.nil?
  
      connect_to_new(tenant).tap do
        <strong>Apartment.connection.clear_query_cache</strong>
      end
    end
  end
</pre>

is rendered into

I ran into a weird situation today. Active Record objects stored in vars are removed when I switched from one tenant to another on the fly. This will create a weird test-failing scenario and you never know why its happening.

    def switch(name) yield(name) end @customers = [1,2,3] switch('shiva') do |name| puts name puts @customers end # Gives output # shiva # 1 # 2 # 3

&nbsp;

but when I do this

    @roles = Role.all @customers = org.branches @list = [1, 2] puts 'Role count' + @roles.count.to\_s puts 'Customer count' + @customers.count.to\_s puts 'list count' + @list.count.to\_s Apartment::Tenant.switch!(org.database\_name) puts '-------' puts 'Role count' + @roles.count.to\_s puts 'Customer count' + @customers.count.to\_s puts 'list count' + @list.count.to\_s

output is

    Role count2 Customer count2 list count2 ------- Role count0 Customer count0 list count2

&nbsp;

## Reason

    # apartment-2.2.0/lib/apartment/adapters/abstract\_adapter.rb # Switch to a new tenant # # @param {String} tenant name # def switch!(tenant = nil) run\_callbacks :switch do return reset if tenant.nil? connect\_to\_new(tenant).tap do **Apartment.connection.clear\_query\_cache** end end end

Expected

It should have rendered newline chars \n inside <pre> tag where there is no <code> tag inside.
The HTML is extracted from my blog at https://cbabhusal.wordpress.com . WordPress does not put a <code> tag inside the <pre> tag.

Code

ReverseMarkdown.convert(post['content'])

Improper nested ol/ul parsing

The Gem (although awesome) seems to struggle a bit with nested lists, both numbered and unnumbered.

Take this example HTML:

  <ol>
    <li>One
      <ol>
        <li>Sub one
          <ol>
            <li>Sub sub one</li>
            <li>Sub sub two</li>
          </ol>
        </li>
      </ol>
    </li>
  </ol>

Line breaks add whitespace

The above is output as:

2. Two
  1. Sub one
    1. Sub sub one
     2. Sub sub two

Note the 5 spaces before the "sub sub two" list item. It boggled my mind how it could get a number of spaces that wasn't a multiple of 2 (the indent function), but it turns out the line break after the preceding </li> inserts an extra space before the subsequent <li> at some point in the pipeline. Completely stripping new lines before passing to reverse markdown solves the problem.

Numbering

The numbering of numbered lists properly resets when a subsequent ol is nested deeper than the preceding ol, but seems to continue numbering if it is shallower.

1. One
  1. Sub one
  2. Sub two

3. Two
  1. Sub one
    1. Sub sub one
    2. Sub sub two

  3. Sub two

4. Three

I tried to implement the same logic in my own Gem before realizing that reverse markdown should handle it. Perhaps it'd make more sense to store the current_li value in an array, where each element represented one level of nesting (the value being the current number). Upon discovering a less indented list item, you'd reset all elements in the array after the current nesting level. Not a show stopper, because still valid markdown, but still an odd behavior.
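
The array-per-nesting-level idea described above might look like this. `ListCounter` is a hypothetical class written for this sketch; it assumes the converter can tell it the depth of each list item (0 for the top level).

```ruby
# Hypothetical counter for ordered-list numbering: one slot per depth.
class ListCounter
  def initialize
    @counts = []
  end

  # Return the next number for an item at `depth` (0 = top level).
  def next_number(depth)
    @counts = @counts.first(depth + 1) # forget all deeper levels
    @counts[depth] = (@counts[depth] || 0) + 1
  end
end

counter = ListCounter.new
depths = [0, 1, 1, 0, 1, 2, 2, 1, 0]
p depths.map { |depth| counter.next_number(depth) }
```

The depths in the usage line follow the example list (One, Sub one, Sub two, Two, Sub one, Sub sub one, Sub sub two, Sub two, Three), so each level restarts at 1 under a new parent while shallower levels keep counting.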

Using the Gem to convert Word to Markdown. Awesome stuff!

Are you willing to license this under the MIT license or similar?

Hi, thanks for your work on this project! I am interested in contributing, and wondering if you would be willing to license this under an open source license. If yes, I am happy to create a pull request if helpful with the standard MIT (or other) license verbiage. Thanks for considering!

ReverseMarkdown::Cleaner#clean_tag_borders aggressive with cleanup

Found a bug caused by the clean_tag_borders method which does this:

>> require 'reverse_markdown'
=> true
>> ReverseMarkdown.convert('<html><body><a href="http://blog.99.co/wp-content/uploads/2014/04/suomayamuseum_mexico_city.jpg__1072x0_q85_upscale.jpg">test</a></body></html>')
=> " [test](http://blog.99.co/wp-content/uploads/2014/04/suomayamuseum_mexico_city.jpg__1072x0_q85_upscale.jpg)"
>> ReverseMarkdown.convert('<a href="http://blog.99.co/wp-content/uploads/2014/04/tohoku_earthquake_tsunami_japan.jpg__1072x0_q85_upscale.jpg"><img class="alignnone size-full wp-image-160" alt="tohoku_earthquake_tsunami_japan.jpg__1072x0_q85_upscale" src="http://blog.99.co/wp-content/uploads/2014/04/tohoku_earthquake_tsunami_japan.jpg__1072x0_q85_upscale.jpg" width="1072" height="603" /></a>')
=> " [![tohoku_earthquake_tsunami_japan.jpg __1072x0_q85_upscale](http://blog.99.co/wp-content/uploads/2014/04/tohoku_earthquake_tsunami_japan.jpg__ 1072x0_q85_upscale.jpg)](http://blog.99.co/wp-content/uploads/2014/04/tohoku_earthquake_tsunami_japan.jpg__1072x0_q85_upscale.jpg)"

Notice the extra space added between the two __ in the links and hence the link is broken. Looking at a fix now, will submit a pull request if I find something quick.