org-mode-hs

This repository provides a parser and exporters for Org Mode documents. The Org document is parsed into an AST similar to org-element’s, and the exporters are highly configurable using HTML, Markdown or LaTeX templates.

org-cli (horg)

You can find compiled binaries for Linux and macOS on the releases page.

Usage

The horg CLI tool has some basic help pages. To see them, type:

# Global help
horg -h
# Exporter help
horg export -h

To convert to HTML, for instance, you can type:

# Save to out.html
horg export html -i myfile.org -o out.html

# Print to stdout
horg export html -i myfile.org --stdout
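
To convert several files at once, the same documented flags can be wrapped in a small shell loop. This is a minimal sketch, not part of horg: the function name export_all_html is hypothetical, and it assumes horg is on your PATH and that each output file should sit next to its source file.

```shell
# Hypothetical helper (not part of horg): export every .org file in a
# directory to HTML, writing foo.html next to foo.org.
export_all_html() {
  dir="${1:-.}"
  for f in "$dir"/*.org; do
    # Skip the unexpanded pattern when the directory has no .org files.
    [ -e "$f" ] || continue
    # Uses only the flags shown above: -i for input, -o for output.
    horg export html -i "$f" -o "${f%.org}.html" || return 1
  done
}
```

For example, `export_all_html notes` would run the exporter once per .org file under notes/ and stop at the first failure.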

Using Pandoc

The Pandoc exporter does not output Pandoc formats directly, but rather, it generates a JSON AST that can be fed into Pandoc. You can pipe this JSON into Pandoc to convert to Markdown or any other supported format:

horg export pandoc -i myfile.org --stdout | pandoc --from json --to markdown

It can happen that the JSON API version is incompatible with your installed version of Pandoc. The CI compiles horg with the latest Pandoc API available at build time, so upgrading to the latest released Pandoc binary has a good chance of fixing the problem.
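
When Pandoc rejects the JSON, it can help to see which API version horg wrote into it. The sketch below assumes the output is standard Pandoc JSON, which records that version in a top-level "pandoc-api-version" array. The function name pandoc_api_version is hypothetical, and a JSON-aware tool would be more robust than this text match.

```shell
# Hypothetical helper: read Pandoc JSON on stdin and print the value of
# its "pandoc-api-version" array as a dotted version string.
pandoc_api_version() {
  grep -o '"pandoc-api-version"[[:space:]]*:[[:space:]]*\[[0-9,[:space:]]*]' \
    | head -n 1 \
    | sed -e 's/.*\[//' -e 's/].*//' -e 's/[[:space:]]//g' \
    | tr ',' '.'
}
```

For example: `horg export pandoc -i myfile.org --stdout | pandoc_api_version`. If Pandoc reports an incompatible API version, the number printed here is the one it could not decode.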

Installation from source

Using cabal

git clone https://github.com/lucasvreis/org-mode-hs.git
cd org-mode-hs
cabal install horg

Please note that this library and some of its dependencies are not on Hackage yet, so you need to clone this repository first.

Customizing templates

You can use the horg init-templates command to populate a .horg directory in the current directory with the default templates, which you can then modify.

Detailed documentation on how the templates work is TODO.

org-parser library

How to test and play with it

Testing the parser in ghci

This assumes you have cabal installed.

Clone the parser repository, cd into org-parser and run cabal repl test inside it. Cabal will be busy downloading dependencies and building for a while. You can then call the convenience function prettyParse like so:

prettyParse orgDocument [text|
This is a test.
|]

You can write the contents to be parsed between [text| and |]. More generally, you can call

prettyParse [parser to parse] [string to parse]

Here, [parser to parse] can be basically any of the functions from Org.Parser.Document, Org.Parser.Elements or Org.Parser.Objects whose types are wrapped by the OrgParser or Marked OrgParser monads. You don’t need to import those modules yourself, as they are already imported in the test namespace.

Unit tests

You can view the unit tests under org-parser/test. They aim to cover as many corner cases as possible against org-element, so you can look there to see what already works, and how well it works.

Progress

In terms of the spec (see below the table for other features), the following components are implemented:

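| Component           | Type              | Parse |
|---------------------+-------------------+-------|
| Heading             | X                 | X     |
| Section             | X                 | X     |
| Affiliated Keywords | X                 | X     |
| GreaterBlock        | X                 | X     |
| Drawer              | X                 | X     |
| FootnoteDefinition  | X                 | X     |
| Item                | X                 | X     |
| List                | X                 | X     |
| PropertyDrawer      | X                 | X     |
| Table               | X                 | X     |
| BabelCall           | parsed as keyword |       |
| Comment Block       | X                 | X     |
| Clock               | X                 | X     |
| Example Block       | X                 | X     |
| Export Block        | X                 | X     |
| Src Block           | X                 | X     |
| Verse Block         | X                 |       |
| Planning            | X                 | X     |
| Comment             | X                 | X     |
| FixedWidth          | X (ExampleBlock)  | X     |
| HorizontalRule      | X                 | X     |
| Keyword             | X                 | X     |
| LaTeXEnvironment    | X                 | X     |
| NodeProperty        | X                 | X     |
| Paragraph           | X                 | X     |
| TableRow            | X                 | X     |
| TableHRule          | X                 | X     |
| OrgEntity           | X                 | X     |
| LaTeXFragment       | X                 | X     |
| ExportSnippet       | X                 | X     |
| FootnoteReference   | X                 | X     |
| InlineBabelCall     | X                 | X     |
| InlineSrcBlock      | X                 | X     |
| RadioLink           | wontfix           |       |
| PlainLink           | wontfix           |       |
| AngleLink           | X (Link)          | X     |
| RegularLink         | X (Link)          | X     |
| Image               | X                 | X     |
| LineBreak           | X                 | X     |
| Macro               | X                 | X     |
| Citation            | X                 | X     |
| RadioTarget         | wontfix           |       |
| Target              | X                 | X     |
| StatisticsCookie    | X                 | X     |
| Subscript           | X                 | X     |
| Superscript         | X                 | X     |
| TableCell           | X                 | X     |
| Timestamp           | X                 | X     |
| Plain               | X                 | X     |
| Markup              | X                 | X     |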

(Thanks @tecosaur for the table)

Going beyond what is listed in the spec

org-element-parse-buffer does not parse everything that is eventually parsed or processed when exporting an Org document. Org features that the parser alone does not handle (and that the spec therefore does not describe) include: content from keywords like #+title:, which is parsed “later” by the exporter itself; references in lines of src or example blocks and link resolution, which happen in a post-processing step; and #+include: keywords, TODO keywords and radio links, which are handled in a pre-processing step.

Since the aspects listed above are genuine Org features, and not optional extensions, it is preferable that they be resolved in the AST output by this parser. Below is a table of further Org features that are not listed in the spec but are planned to be supported:

| Feature                                   | Implemented?                                                                      |
|-------------------------------------------+-----------------------------------------------------------------------------------|
| #+include: keywords                       | not yet                                                                           |
| Src/example block switches and references | yes                                                                               |
| Resolving all inner links                 | some                                                                              |
| Parsing image links into Images           | yes                                                                               |
| Processing radio links                    | no; conformant implementation requires parsing twice. May be added under a flag.  |
| Per-file TODO keywords                    | not yet (on the way, some work is done)                                           |
| Macro definitions and substitution        | not yet (on the way, some work is done)                                           |

Comparison to Pandoc

The main difference between org-parser and the Pandoc Org Reader is that org-parser parses into an AST that is more similar to org-element’s AST, while Pandoc’s reader parses into the Pandoc AST, which cannot express all Org elements directly. As a result, some Org features are either unsupported by the reader or “projected” onto Pandoc in ways that carry less information about the Org source. In contrast, this parser aims to represent Org documents more faithfully before “projecting” them into formats like HTML or the Pandoc AST itself. So you can expect more Org-specific features to be parsed, and hopefully more accurate parsing in general.

Also, if you are a developer mainly interested in rendering Org documents to HTML, Pandoc is a very big library to depend upon, with very long build times (at least on my computer, sadly).

Indeed, my initial plan was to fork the Org Reader and make it a standalone package, but this quickly proved unfeasible as the reader is very tangled with the rest of Pandoc. Also, some accuracy improvements to the reader were hard to make without deeper changes to the parser. For example, consider the following Org snippet:

This is a single paragraph. Because this single paragraph
#+should not be ended by this funny line, because this funny
line is not a keyword. Not even this incomplete
\begin{LaTeX}
environment should break this paragraph apart.

Pandoc breaks this single paragraph into three, because it looks for a new “block start” (the start of a new Org element) on each line. If there is a block start, it aborts the current element (block) and starts the new one. Only later does the parser decide whether the started block actually parses correctly to its end, which is not the case for the \begin{LaTeX} in this example.

Another noteworthy difference is that haskell-org-parser uses a different parsing library, megaparsec. Pandoc uses the older parsec, but also bundles many parsing features in its own library.

org-exporters library

This library provides functions for post-processing of the Org AST and exporting to various formats with ondim.

Defining a new export backend

Basically:

  • Use the ondim library to create an Ondim template system for the desired format, if it does not already exist.
  • Import Org.Exporters.Common and create an ExportBackend for your format.
  • Create auxiliary functions for loading templates and rendering the document.

Contributors

lucasvreis, srid

org-mode-hs's Issues

Standard properties annotations

Annotate the character position of elements and objects in the AST (like Org does). I'm constantly reminded of this since I think this would be extremely useful. Some applications:

  • Paragraph-accurate "synctex" in organon (perhaps even split AST into words like pandoc does)
  • Support for org-remark (marginalia) annotations via the templates
  • Support for Org line-number links (not a priority, I don't like this sort of link)

"Functored" AST: back to pandoc-types?

Pandoc is considering adapting its types to be polymorphic over Inline and Block (jgm/pandoc-types#99, jgm/pandoc-types#98). This is very interesting as it may allow other libraries to extend the Pandoc AST. It would be nice if org-parser could drop part of its AST and use an extended version of Pandoc. That would contribute to a more unified package ecosystem and facilitate conversion between the libraries.

Unfortunately, I think most types & constructors in Org.Types would have to remain anyway, or be turned into patterns, else we lose the "org-element alikeness and expressivity".

One interesting example is uniorg in the context of unified. I will have a look at how uniorg reuses the types of unified, but JS is much more flexible in this aspect.

Cannot build from source

Hi! I'm trying to build the library by following the instructions in the README.

Version information:

➤ ghc --version
The Glorious Glasgow Haskell Compilation System, version 9.2.7

➤ cabal --version
cabal-install version 3.6.2.0
compiled using version 3.6.2.0 of the Cabal library

Installed via ghcup on Arch Linux. These are the errors I get:

Errors

When first compiling, an error appears in org-exporter/src/Org/Exporters/Pandoc.hs. To try to resolve this I just commented out the line.

src/Org/Exporters/Pandoc.hs:30:16: error:
    Not in scope: data constructor ‘P.Null’
    Perhaps you meant ‘P.Cell’ (imported from Text.Pandoc.Definition)
    Module ‘Text.Pandoc.Definition’ does not export ‘Null’.
   |
30 |       nullEl = P.Null
   |                ^^^^^^
Error: cabal: Failed to build org-exporters-0.1 (which is required by exe:horg

After commenting out the offending line, that error goes away, but another one appears in org-cli:

app/org-cli.hs:81:40: error:
    • Couldn't match expected type ‘Text.Pandoc.Format.FlavoredFormat’
                  with actual type ‘Text’
    • In the first argument of ‘TP.getWriter’, namely ‘fmt’
      In a stmt of a 'do' block: (w, ext) <- TP.getWriter fmt
      In the first argument of ‘TP.runIOorExplode’, namely
        ‘do (w, ext) <- TP.getWriter fmt
            tpl <- TP.compileDefaultTemplate fmt
            utpl <- case tplo of
                      Nothing -> pure Nothing
                      Just tfp -> do ...
            let tpl' = fromMaybe tpl utpl
                wopt = ...
            ....’
   |
81 |               (w, ext) <- TP.getWriter fmt
   |                                        ^^^

The problem seems to be related to pandoc in both cases. If there's any other information you need then please let me know!

Internal links (cross references)

It's not clear to me what the behavior should be, given that Org mode itself seems confusing when exporting cross references to HTML.

For instance, Org will substitute links to internal references with a counter for most elements, but with \eqref for equations. We shouldn't do that, since it depends on the export format. The Id of InternalTarget should be used for \eqref or something else on export. Still, proper numbering for equations is too complicated. For other elements you can reference anything with #+name, but this counter is only updated in the presence of #+caption. I will have to figure out how numbering for <<targets>> works.

Rename?

With #6, this is becoming more than just a parser. org-mode and orgmode are already taken on Hackage. Some ideas:

  • org-hs
  • Just org?
  • Leave it as it is :(

Detach most Future things from the parser

Move them to a new module, Transforms, and keep them modular. What can be done outside the parser should be detached.

The list includes:

  • #11
  • orgStateMacros
  • orgStateInternalTargets
  • orgStateKnownAnchors
  • processSpecialStrings
  • documentFootnotes, orgStateFootnotes: remove the field from OrgDocument
  • orgStateKeywords, documentKeywords: remove the field too
  • orgStateExcludeTags, orgStateExcludeTagsChanged
  • orgStateSrcLineNumber

Interpret Ondim templates inside Org documents

It should be fairly simple to parse Org export blocks as templates to be interpreted by Ondim.

To separate them from ordinary blocks they could use custom languages like ondim-html and ondim-pandoc, this way they won't interfere with emacs' org exporters either.

Edit: perhaps it's better to use source blocks with :exports none attribute, with the advantage that org-mode would do syntax highlighting by default and we could use custom attributes.
