docx2gfm's People

Contributors

spier

Forkers

activeliang, matkm

docx2gfm's Issues

Cannot handle filenames containing spaces

When passing a filename that contains spaces, docx2gfm shows a warning and creates an empty markdown file.

For example:

$ docx2gfm -f my_\ file\ with\ spaces.docx
pandoc: my_: openBinaryFile: does not exist (No such file or directory)

---
layout: post
title: "YOUR POST TITLE"
comments: true
categories: [tag1, tag2, multi word tag]
author: <a href="link to twitter, personal blog, linkedin, etc">YOUR NAME</a>
image: "/images/own/post_directory/logo_image_NOT_SVG_FORMAT.png"
---

This can likely be fixed by making sure that the filename is quoted correctly when passing it along to pandoc.
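
A minimal sketch of one way to avoid the quoting problem, assuming docx2gfm shells out to pandoc (the actual call is not shown in this issue): pass the arguments as an array so that no shell ever splits the filename. The helper name `pandoc_argv` is hypothetical.

```ruby
require "open3"
require "shellwords"

# Hypothetical helper: build the pandoc argument list for a docx file.
# Each array element reaches pandoc as exactly one argument, spaces included.
def pandoc_argv(docx_path)
  ["pandoc", "--from", "docx", "--to", "gfm", docx_path]
end

argv = pandoc_argv("my_ file with spaces.docx")

# Running it without a shell (requires pandoc on the PATH):
#   markdown, warnings, status = Open3.capture3(*argv)

# If a single command string is unavoidable, escape the filename first.
command = "pandoc --from docx --to gfm #{Shellwords.escape(argv.last)}"
puts command
```

The array form is usually the safer choice, because it also covers quotes and other shell metacharacters in filenames, not only spaces.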

Make it easier to add the images from the docx

pandoc comes with an option --extract-media=DIR that extracts the images from the docx to a local folder. Could this be used to make it easier to create the final markdown with the images in the correct folder?
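
As a rough sketch of how the option could be wired in, the helper below builds a pandoc argument list that includes `--extract-media`. The helper name and the `media_dir` parameter are hypothetical, and the target directory is only an example.

```ruby
# Hypothetical helper: pandoc arguments that also extract embedded images.
# media_dir is an assumed parameter; pandoc writes the images below it and
# adjusts the image references in its output to point at the extracted files.
def pandoc_argv_with_media(docx_path, media_dir)
  ["pandoc", "--from", "docx", "--to", "gfm",
   "--extract-media=#{media_dir}", docx_path]
end

puts pandoc_argv_with_media("assets/sample.docx", "images/own/post_directory").inspect
```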

Escaping of markdown special characters

Some markdown special characters get escaped in the step convert_to_reference_style_links(content).

For example:
Goodbye\!

Currently we clean those up manually, but that is quite error-prone.

I filed an issue with commonmarker to see if this behavior is actually correct:
gjtorikian/commonmarker#91
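
Until that is resolved, the manual cleanup could be narrowed to an explicit list of characters. The sketch below is only an illustration: the function name and the default list are assumptions, and it does not distinguish prose from code blocks, where a backslash may be intentional.

```ruby
# Hypothetical cleanup step: drop the backslash in front of characters that
# do not need escaping in running text. The default list is an assumption.
def unescape_chars(content, chars = ["!"])
  pattern = /\\([#{Regexp.escape(chars.join)}])/
  content.gsub(pattern, '\1')
end

puts unescape_chars("Goodbye\\!")
puts unescape_chars("a \\* b")
```

Escapes that are not on the list, such as `\*`, are left alone, so characters that do carry markdown meaning stay escaped.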

Gemify this tool

As a user I want to use this tool without cloning the repo. Publishing this tool as a gem (on https://rubygems.org/) would achieve this.

However, publishing this tool as a gem is pretty much the same as open sourcing it, as the source inside a gem is publicly available. Hence I should first make it public, then gemify it.
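
For orientation, a gemspec for this tool might start out roughly like the sketch below. The version, summary and file list are placeholders rather than the project's actual values, and a real gem would also need an executable so that `docx2gfm` ends up on the PATH.

```ruby
require "rubygems"

# docx2gfm.gemspec (sketch; version, summary and file list are placeholders)
spec = Gem::Specification.new do |s|
  s.name    = "docx2gfm"
  s.version = "0.1.0"                 # placeholder version
  s.summary = "Convert docx files to GitHub Flavored Markdown" # placeholder
  s.authors = ["spier"]               # contributor listed for the project
  s.files   = ["docx2gfm.rb"]         # script name used in the issues
  s.add_runtime_dependency "commonmarker" # library mentioned in the issues
end
```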

Bug in link extraction?

I think there might be a bug in the link extraction when the link contains a closing bracket at the end.
e.g. https://en.wikipedia.org/wiki/Beirut_(disambiguation)

The link extraction that we do seems to swallow the last ).
Need to reproduce this in a test.
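
A possible starting point for such a reproduction is sketched below. Both patterns are assumptions made for illustration and are not taken from the tool's code: the naive one stops at the first closing bracket, which would produce exactly this symptom, and the second accepts one level of balanced brackets inside the URL.

```ruby
# Assumed naive pattern: the URL stops at the first ")".
NAIVE_LINK = /\[([^\]]*)\]\(([^)]*)\)/

# Variant that accepts one level of balanced parentheses inside the URL.
BALANCED_LINK = /\[([^\]]*)\]\(((?:[^()\s]|\([^()\s]*\))+)\)/

text = "See [Beirut](https://en.wikipedia.org/wiki/Beirut_(disambiguation)) for details."

naive_url = text[NAIVE_LINK, 2]       # loses the trailing ")"
balanced_url = text[BALANCED_LINK, 2] # keeps the full URL

puts naive_url
puts balanced_url
```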

beginning of docx is removed

See assets/sample.docx

When running ruby docx2gfm.rb assets/sample.docx the following happens:

  • the title of the doc is removed (this happens for any docx I think)
  • the first paragraph and image are removed (this only happens for some docx files)

Is this a bug in pandoc?

Reduce the cleanup steps that are using regexps

Can I move the cleanup steps that are regexps into pandoc itself?
See function cleanup_content(content).

This would assume that pandoc is making mistakes in the docx to gfm conversion.

option for jekyll front-matter generation

docx2gfm currently generates a jekyll YAML front matter and puts it at the very beginning of the markdown output.

We could add an option that toggles this behavior on/off, for people who really just need gfm and not jekyll output. That behavior also seems more consistent with the name of this tool, as it is docx2gfm and not docx2jekyll :)
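
One way such a toggle could look, sketched with Ruby's standard `OptionParser`. The flag name `--[no-]jekyll` and the helper `parse_options` are assumptions, and the default keeps the current behavior.

```ruby
require "optparse"

# Parse CLI arguments; front matter stays on by default so current behavior
# is unchanged. The flag name --[no-]jekyll is an assumption.
def parse_options(argv)
  options = { jekyll: true }
  OptionParser.new do |opts|
    opts.on("--[no-]jekyll", "Emit Jekyll YAML front matter (default: on)") do |value|
      options[:jekyll] = value
    end
  end.parse!(argv)
  options
end

p parse_options([])
p parse_options(["--no-jekyll"])
```

The rest of the tool would then only prepend the front matter when `options[:jekyll]` is true.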

"end list" marker

Pandoc adds a funny marker sometimes.
<!-- end list -->

We are removing it manually right now, but then it gets re-introduced by commonmarker. So I suspect it is part of the markdown spec?

Add some tests

No tests so far. Yeah :)

Things to test:

  • the extra cleanup that I am performing to the pandoc output
  • behavior of the CLI options
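
A first test for the cleanup could look like the Minitest sketch below. Since the body of `cleanup_content` is not shown in these issues, `strip_end_list_markers` is a hypothetical stand-in for a single cleanup step.

```ruby
require "minitest/autorun"

# Stand-in for one cleanup step; the real cleanup_content is not shown in
# the issues, so this body is an assumption made for the example.
def strip_end_list_markers(content)
  content.gsub(/^<!-- end list -->\n/, "")
end

class CleanupTest < Minitest::Test
  def test_removes_end_list_marker
    input = "- a\n\n<!-- end list -->\n\n1. b\n"
    refute_includes strip_end_list_markers(input), "<!-- end list -->"
  end

  def test_leaves_other_content_untouched
    assert_equal "Hello\n", strip_end_list_markers("Hello\n")
  end
end
```

Tests for the CLI options could follow the same pattern by calling the option parsing code with different argument arrays.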

Quotation marks are escaped.

Single quotation marks are escaped with a backslash.

For example, the text Let's link turns into Let\'s link.

This is not really a problem, as the backslash should not get rendered by markdown, as it just escapes the next character. I don't know why single quotation marks need escaping.
