This is an issue to discuss whether / how to implement a CommonMark + directives parser for Sphinx, as @chrisjsewell and I had discussed earlier.
The problem
The recommonmark project piggy-backs on the commonmark-py project to parse markdown. It defines a Sphinx parser that sub-classes the docutils parser and adds methods that convert the commonmark-py AST into a docutils AST (https://github.com/readthedocs/recommonmark/blob/master/recommonmark/parser.py#L21).
Because it sub-classes the docutils parser, it still uses docutils methods under the hood, and as a result there is some weird behavior (like nested_parse expecting rST in the content blocks).
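To make the "AST to AST" step concrete, here is a minimal sketch of that kind of conversion. It is not recommonmark's actual code: the `(node_type, entering, literal)` tuples are a hypothetical stand-in for the enter/exit events an AST walker yields, and plain dicts stand in for docutils nodes so the example needs nothing beyond the standard library.

```python
def convert(events):
    """Build a nested output tree from (node_type, entering, literal) events.

    The event tuples are a stand-in for what an AST walker would yield;
    the dicts are stand-ins for docutils nodes.
    """
    root = {"type": "document", "children": []}
    stack = [root]  # the innermost open container is always stack[-1]
    for node_type, entering, literal in events:
        if node_type == "document":
            continue  # the root already exists
        if node_type == "text":
            # Leaf nodes are visited once and carry their content.
            stack[-1]["children"].append({"type": "Text", "text": literal})
        elif entering:
            # Entering a container: attach it, then make it the current parent.
            node = {"type": node_type, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Leaving a container: go back to its parent.
            stack.pop()
    return root


events = [
    ("document", True, None),
    ("paragraph", True, None),
    ("text", True, "Hello"),
    ("emph", True, None),
    ("text", True, " world"),
    ("emph", False, None),
    ("paragraph", False, None),
    ("document", False, None),
]
tree = convert(events)
```

The point of the sketch is that every markdown construct passes through two trees, the intermediate one and the output one, before Sphinx sees it.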
One solution
@chrisjsewell proposed writing our own CommonMark -> docutils AST parser, and then adding on the syntax for roles and directives. This would be two things:
- A Sphinx parser that reads in markdown, and uses:
- Our own Parser that behaves like a docutils parser, but under the hood uses a more modern state-machine library (https://github.com/pytransitions/transitions) to parse markdown.
The hope is that this parser would be easier to maintain, understand, and grow as we add support for new syntax. It would be a collection of "markdown -> docutils AST" rules, rather than relying on an intermediate AST like the one commonmark-py produces.
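As a rough illustration of what "a collection of markdown -> output rules driven by a state machine" could look like, here is a standard-library sketch. It does not use the transitions library, it covers only headings, paragraphs and fenced blocks, it ignores most of the CommonMark spec, and the dict nodes and all names (`State`, `parse_blocks`, `FENCE`) are made up for the example.

```python
from enum import Enum, auto

FENCE = "`" * 3  # fence marker, built this way to keep the example readable


class State(Enum):
    BODY = auto()       # between blocks
    PARAGRAPH = auto()  # accumulating paragraph lines
    FENCE = auto()      # inside a fenced block; lines are taken literally


def parse_blocks(text):
    """Map a tiny markdown subset directly to stand-in output nodes."""
    nodes, buffer = [], []
    state = State.BODY

    def close_paragraph():
        if buffer:
            nodes.append({"type": "paragraph", "text": " ".join(buffer)})
            buffer.clear()

    for line in text.splitlines():
        if state is State.FENCE:
            if line.strip() == FENCE:
                nodes.append({"type": "literal_block", "text": "\n".join(buffer)})
                buffer.clear()
                state = State.BODY
            else:
                buffer.append(line)
            continue
        stripped = line.strip()
        if stripped.startswith(FENCE):
            close_paragraph()
            state = State.FENCE
        elif not stripped:
            close_paragraph()
            state = State.BODY
        elif stripped.startswith("#"):
            close_paragraph()
            level = len(stripped) - len(stripped.lstrip("#"))
            nodes.append({"type": "title", "level": level,
                          "text": stripped[level:].strip()})
            state = State.BODY
        else:
            buffer.append(stripped)
            state = State.PARAGRAPH

    # End of input: an unclosed fence runs to the end of the document.
    if state is State.FENCE:
        nodes.append({"type": "literal_block", "text": "\n".join(buffer)})
    else:
        close_paragraph()
    return nodes


sample = "# Title\n\nfirst line\nsecond line\n\n" + FENCE + "\ncode  here\n" + FENCE + "\n"
blocks = parse_blocks(sample)
```

Even this toy version shows where the edge cases pile up: each new block type adds a state and changes how the existing rules have to close whatever is currently open.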
A question - could we continue using commonmark-py?
As I was looking through the documentation, I started wondering whether we could still use the commonmark-py machinery to parse basic commonmark syntax, and then use our own state-machine parser to handle the "extra" grammar elements like roles and directives.
Basically, I'm wondering whether we could do the same thing that recommonmark does, but instead of sub-classing a docutils Parser, we sub-class a parser that only needs to know how to parse the extra syntax that commonmark doesn't cover.
If this were possible, I feel like we wouldn't need to worry about re-writing the test suite of commonmark-py, and we could focus only on the extra syntax needed for things like roles and directives. We would then also have a markdown parser under the hood for the nested_parse sections.
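Here is a small sketch of how that split might work, under loudly stated assumptions: the directive syntax (a fenced block whose info string is `{name} arguments`) is purely hypothetical, since no syntax has been decided in this issue, and `as_directive` and `parse_markdown` are names invented for the example. The idea is that the base parser hands over a fenced block's info string and literal content, and the extra layer decides whether it is really a directive.

```python
import re

# Hypothetical syntax: a fence whose info string is "{name} arguments".
DIRECTIVE_INFO = re.compile(r"^\{(?P<name>[a-zA-Z][\w-]*)\}\s*(?P<args>.*)$")


def as_directive(info, literal, parse_markdown):
    """Reinterpret a fenced block as a directive, if it looks like one.

    info:           the fence's info string, as produced by the base parser
    literal:        the raw text inside the fence
    parse_markdown: callable used to parse the nested content
    """
    match = DIRECTIVE_INFO.match(info.strip())
    if match is None:
        return None  # an ordinary code block; leave it to the base parser
    return {
        "name": match.group("name"),
        "arguments": match.group("args").strip(),
        # Nested content goes back through the markdown parser, not rST.
        "children": parse_markdown(literal),
    }


# A trivial stand-in for the real nested parser.
result = as_directive("{note} Heads up", "Some *body* text.\n", lambda s: [s.strip()])
plain = as_directive("python", "print(1)\n", lambda s: [s])
```

With this shape, the CommonMark edge cases stay in the base parser, and the only new grammar to test is the directive and role layer.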
Note - it may also be illustrative to look at how the commonmark-py parser does its parsing. I believe that code starts here: https://github.com/readthedocs/commonmark.py/blob/c4c5b0df72961663060c65ed0858840b5e031b10/commonmark/blocks.py#L881
The blocks.py module in general defines how they parse markdown... maybe we could re-use (or explicitly use) some of it.
I'm curious what @chrisjsewell thinks about this - mostly I am trying to find ways to avoid writing our own from-scratch markdown parser, as I'm a bit worried about all the edge-cases we'd have to consider :-)
Pros and cons of each markdown reader in Python
Find it here: https://github.com/ExecutableBookProject/meta/wiki/Resources:-Markdown-(MD)#markdown-parsers-in-python