stencila / encoda

↔️ A format converter for Stencila documents

Home Page: https://stencila.github.io/encoda/

License: Apache License 2.0

JavaScript 0.13% Makefile 0.01% Dockerfile 0.02% TypeScript 5.30% Shell 0.04% HTML 9.55% Jupyter Notebook 83.87% TeX 0.06% XSLT 1.03%

converter document nodejs pandoc remark semantic

encoda's Introduction

Programmable, reproducible, interactive documents

👋 Intro • 🚴 Roadmap • 📜 Docs • 📥 Install • 🛠️ Develop

🙏 Acknowledgements • 💖 Supporters • 🙌 Contributors

👋 Introduction

Stencila is a platform for creating and publishing dynamic, data-driven content. Our aim is to lower the barriers to creating truly programmable documents, and to make it easier to publish them as beautiful, interactive, and semantically rich articles and applications. Our roots are in scientific communication, but our tools are useful beyond it.

This is v2 of Stencila, a rewrite in Rust focussed on the synergies between three recent and impactful innovations and trends:

Conflict-free replicated data types (CRDTs) for de-centralized collaboration and version control.
Large language models (LLMs) for assisting in writing and editing, prose and code.
The blurring of the lines between documents and applications as seen in tools such as Notion and Coda.

We are embarking on a rewrite because CRDTs will now be the foundational synchronization and storage layer for Stencila documents. This requires fundamental changes to most other parts of the platform. Furthermore, a rewrite allows us to bake in, rather than bolt on, new modes of interaction between authors and LLM assistants and to add mechanisms that mitigate the risks associated with using LLMs (e.g. by recording the actor, human or LLM, that made a change to a document). Much of the code in the v1 branch will be reused (after some tidy-ups and refactoring), so v2 is not a complete rewrite.

🚴 Roadmap

Our general strategy is to iterate horizontally across the feature set, rather than fully developing features sequentially. This will better enable early user testing of workflows and reduce the risk of finding ourselves painted into an architectural corner. So expect initial iterations to have limited functionality and be buggy.

We'll be making alpha and beta releases of v2 early and often across all products (e.g. CLI, desktop, SDKs). We're aiming for a 2.0.0 release by the end of Q3 2024.

🟢 Stable • 🔶 Beta • ⚠️ Alpha • 🚧 Under development • 🧪 Experimental • 🧭 Planned • ❔ Maybe

Schema

The Stencila Schema is the data model for Stencila documents (definition here, generated reference documentation here). Most of the schema is well defined but some document node types are still marked as under development. A summary by category:

Category	Description	Status
Works	Types of creative works (e.g. `Article`, `Figure`, `Review`)	🟢 Stable (mostly based on schema.org)
Prose	Types used in prose (e.g. `Paragraph`, `List`, `Heading`)	🟢 Stable (mostly based on HTML, JATS, Pandoc etc)
Code	Types for executable (e.g. `CodeChunk`) and non-executable code (e.g. `CodeBlock`)	🔶 Beta (may change)
Math	Types for math symbols and equations (e.g. `MathBlock`)	🔶 Beta (may change)
Data	Fundamental data types (e.g. `Number`) and validators (e.g. `NumberValidator`)	🔶 Beta (may change)
Flow	Types for control flow within a document (e.g. `If`, `For`, `Call`)	🔶 Beta (may change)
Style	Types for styling parts of a document (`Span` and `Division`)	🔶 Beta (may change)
Edits	Types related to editing a document (e.g. `InstructionBlock`, `DeleteInline`)	🔶 Beta (may change)

Storage and synchronization

In v2, documents can be stored as binary Automerge CRDT files, which can be branched and merged, and can be imported from and exported to various formats. Collaboration, including real-time collaboration, is made possible by exchanging fine-grained changes to the CRDT over the network. In addition, we want to enable interoperability with a Git-based workflow.

Functionality	Description	Status
Documents read/write-able	Able to write a Stencila document to an Automerge binary file and read it back in	⚠️ Alpha
Documents import/export-able	Able to import or export document as alternative formats, using tree diffing to generate CRDT changes	⚠️ Alpha
Documents fork/merge-able	Able to create a fork of a document in another file and then later merge with the original	🧭 Planned
Documents diff-able	Able to view a diff, in any of the supported formats, between versions of a document and between a document and another file	🧭 Planned
Git merge driver	CLI can act as a custom Git merge driver	🧭 Planned
Relay server	Documents can be synchronized by exchanging changes via a relay server	🧭 Planned
Rendezvous server	Documents can be synchronized by exchanging changes peer-to-peer using TCP or UDP hole punching	❔ Maybe

Formats

Interoperability with existing formats has always been a key feature of Stencila. We are bringing over codecs (a.k.a. converters) from the v1 branch and porting other functionality from encoda to Rust.

Format	Encoding	Decoding	Notes
JSON	🟢	🟢
JSON5	🟢	🟢
JSON-LD	🟢	🟢
CBOR	🟢	🟢
CBOR+Zstandard	🟢	🟢
YAML	🟢	🟢
Plain text	🔶	-
HTML	🚧	🧭
JATS	🚧	🚧	Planned for completion. Port decoding and tests from `encoda`.
Markdown	⚠️	⚠️
R Markdown	🧭	🧭	Relies on Markdown; `v1`
Myst Markdown	🚧	🚧	In progress; PR
Jupyter Notebook	🧭	🧭	Relies on Markdown; `v1`
Scripts	🧭	🧭	Relies on Markdown; `v1`
Pandoc	🧭	🧭	Planned. `v1`
LaTeX	🧭	🧭	Relies on Pandoc; `v1`; discussion
Org	🧭	🧭	Relies on Pandoc; PR
Microsoft Word	🧭	🧭	Relies on Pandoc; `v1`
ODT	🧭	🧭	Relies on Pandoc
Google Docs	🧭	🧭	Planned `v1`
PDF	🧭	🧭	Planned, relies on HTML; `v1`
Codec Plugin API	🧭	🧭	An API allowing codecs to be developed as plugins in Python, Node.js, and other languages

Kernels

Kernels are what executes the code in Stencila CodeChunks and CodeExpressions, as well as in control flow document nodes such as ForBlock and IfBlock. In addition, there are kernels for rendering math (e.g. MathBlock) and styling (e.g. StyledInline) nodes.

Kernel	Purpose	Status
Bash	Execute Bash code	🔶 Beta
Zsh	Execute Zsh code	❔ Maybe; `v1`
Python	Execute Python code	🔶 Beta
R	Execute R code	⚠️ Alpha
QuickJs	Execute JavaScript in embedded sandbox	🔶 Beta
Node.js	Execute JavaScript in a Node.js env	🔶 Beta
Deno	Execute TypeScript code	❔ Maybe; `v1`
SQLite	Execute SQL code	🧭 Planned; `v1`
Jupyter kernels	Execute code in Jupyter kernels	🚧 In progress; PR
Rhai	Execute a sandboxed, embedded language	🔶 Beta
AsciiMath	Render AsciiMath symbols and equations	🔶 Beta
TeX	Render TeX math symbols and equations	🔶 Beta
Graphviz	Render Graphviz DOT to SVG images	⚠️ Beta
Jinja	Interpolate document variables into styling and other code	⚠️ Beta
Style	Transpile Tailwind and CSS for styling	🔶 Beta
HTTP	Interact with RESTful APIs	❔ Maybe; `v1`

[TIP] Run stencila kernels (or cargo run -p cli kernels in development) for an up to date list of kernels, including those available through plugins.

Tools

Tools are what we call the self-contained Stencila products you can download and use locally on your machine to interact with Stencila documents.

Environments	Purpose	Status
CLI	Manage documents from the command line and read and edit them using a web browser	⚠️ Alpha
Desktop	Manage, read and edit documents from a desktop app	⚠️ Alpha repo
VSCode extension	Manage, read and edit documents from within VSCode	⚠️ Alpha

SDKs

Stencila's software development kits (SDKs) enable developers to create plugins that extend Stencila's core functionality, or to build other tools on top of it. At this stage we are planning to support Python, Node.js and R, but more languages may be added if there is demand.

Language	Description	Status
Python	Types and function bindings for using Stencila from Python	⚠️ Alpha PyPI
TypeScript	JavaScript classes and TypeScript types for the Stencila Schema	⚠️ Alpha NPM
Node.js	Types and function bindings for using Stencila from Node.js	⚠️ Alpha NPM

Testing and auditing

Making sure Stencila v2 is well tested, fast, secure, and accessible is important. Here's what we're doing towards that:

What	Description	Status
Property-based testing	Establish property-based (a.k.a generative) testing for Stencila documents	🟢 Done
Round-trip testing	Establish property-based tests of round-trip conversion to/from supported formats and reading/writing from/to Automerge CRDTs	🟢 Done here and here
Coverage reporting	Report coverage by feature (e.g. by codec) to give developers better insight into the status of each	🟢 Done Codecov
Dependency audits	Add dependency audits to continuous integration workflow.	🟢 Done
Accessibility testing	Add accessibility testing to continuous integration workflow.	🟢 Done here
Performance monitoring	Establish continuous benchmarking	🟢 Done here
Security audit	External security audit sponsored by NLNet.	🧭 Planned Q2 2023 (after most `v2` functionality added and before public beta)
Accessibility audit	External accessibility audit sponsored by NLNet.	🧭 Planned Q3 2023 (before `v2.0.0` release)

📜 Documentation

At this stage, documentation for v2 is mainly reference material, much of it generated:

More reference docs, as well as guides and tutorials, will be added over the coming months. We will be bootstrapping the publishing of all docs (i.e. using Stencila itself to publish HTML pages) and expect to have an initial published set in.

📥 Install

Although v2 is in early stages of development, and functionality may be limited or buggy, we are releasing alpha versions of the Stencila CLI and SDKs. Doing so allows us to get early feedback and monitor what impact the addition of features has on build times and distribution sizes.

CLI

Windows

To install the latest release download stencila-<version>-x86_64-pc-windows-msvc.zip from the latest release and place it somewhere on your PATH.

MacOS

To install the latest release in /usr/local/bin,

curl --proto '=https' --tlsv1.2 -sSf https://stencila.dev/install.sh | sh

To install a specific version, append -s vX.X.X. Or, if you'd prefer to do it manually, download stencila-<version>-x86_64-apple-darwin.tar.gz from one of the releases and then,

tar xvf stencila-*.tar.gz
cd stencila-*/
sudo mv -f stencila /usr/local/bin # or wherever you prefer

Linux

To install the latest release in ~/.local/bin/,

curl --proto '=https' --tlsv1.2 -sSf https://stencila.dev/install.sh | sh

To install a specific version, append -s vX.X.X. Or, if you'd prefer to do it manually, download stencila-<version>-x86_64-unknown-linux-gnu.tar.gz from one of the releases and then,

tar xvf stencila-*.tar.gz
mv -f stencila ~/.local/bin/ # or wherever you prefer

Docker

The CLI is also available in a Docker image you can pull from Docker Hub,

docker pull stencila/stencila

and use locally like this for example,

docker run -it --rm -v "$PWD":/work -w /work --network host stencila/stencila --help

The same image is also published to the Github Container Registry if you'd prefer to use that,

docker pull ghcr.io/stencila/stencila

SDKs

Python

Use your favorite package manager to install Stencila's SDK for Python:

python -m pip install stencila

[!NOTE] If you encounter problems with the above command, you may need to upgrade Pip using pip install --upgrade pip.

poetry add stencila

Node

Use your favorite package manager to install @stencila/node:

npm install @stencila/node

yarn add @stencila/node

pnpm add @stencila/node

TypeScript

Use your favorite package manager to install @stencila/types:

npm install @stencila/types

yarn add @stencila/types

pnpm add @stencila/types

🛠️ Develop

Code organization

This repository is organized into the following modules. Please see their respective READMEs, where available, for guides to contributing to each.

schema: YAML files which define the Stencila Schema, an implementation of, and extensions to, schema.org, for programmable documents.
json: A JSON Schema and JSON LD @context, generated from Stencila Schema, which can be used to validate Stencila documents and transform them to other vocabularies
rust: Several Rust crates implementing core functionality and a CLI for working with Stencila documents.
python: A Python package, with classes generated from Stencila Schema and bindings to Rust functions, so you can work with Stencila documents from within Python.
ts: A package of TypeScript types generated from Stencila Schema so you can create type-safe Stencila documents in the browser, Node.js, Deno etc.
node: A Node.js package, using the generated TypeScript types and bindings to Rust functions, so you can work with Stencila documents from within Node.js.
prompts: Prompts used to instruct AI assistants in different contexts and for different purposes.
docs: Documentation, including reference documentation generated from schema and CLI tool.
examples: Examples of documents conforming to Stencila Schema, mostly for testing purposes.
scripts: Scripts used for making releases and during continuous integration.

Continuous integration and deployment

Several GitHub Actions workflows are used for testing and releases. All products (i.e. CLI, Docker image, SDKs) are released at the same time with the same version number. To create and release a new version:

bash scripts/bump-version.sh <VERSION>
git push && git push --tags

Workflow	Purpose	Status
`test.yml`	Run linting, tests and other checks. Commit changes to any generated files.
`pages.yml`	Publish docs, JSON-LD, JSON Schema, etc to https://stencila.dev hosted on GitHub Pages
`version.yml`	Trigger the `release.yml` workflow when a version tag is created.
`release.yml`	Create a release, including building and publishing CLI, Docker image and SDKs.
`install.yml`	Test installation and usage of CLI, Docker image and SDKs across various operating systems and language versions.

🙏 Acknowledgements

Stencila is built on the shoulders of many open source projects. Our sincere thanks to all the maintainers and contributors of those projects for their vision, enthusiasm and dedication, and most of all for their hard work! The following open source projects in particular have an important role in the current version of Stencila. We sponsor these projects where, and to the extent, possible through GitHub Sponsors and Open Collective.

	Link	Summary
	Automerge	A Rust library of data structures for building collaborative applications.
	Clap	A Command Line Argument Parser for Rust.
	NAPI-RS	A framework for building pre-compiled Node.js addons in Rust.
	PyO₃	Rust bindings for Python, including tools for creating native Python extension modules.
	Rust	A multi-paradigm, high-level, general-purpose programming language which emphasizes performance, type safety, and concurrency.
	Serde	A framework for serializing and deserializing Rust data structures efficiently and generically.
	Similar	A Rust library of diffing algorithms including Patience and Hunt–McIlroy / Hunt–Szymanski LCS.
	Tokio	An asynchronous runtime for Rust which provides the building blocks needed for writing network applications without compromising speed.

💖 Supporters

We wouldn’t be doing this without the support of these forward looking organizations.

🙌 Contributors

Thank you to all our contributors (not just the ones that submitted code!). If you made a contribution but are not listed here please create an issue, or PR, like this.

encoda's People

Contributors

Stargazers

Watchers

Forkers

arinbasu jeffslofish rgaiacs 0xflotus articlehosting aswinawien rgieseke cosformula scottaubrey

encoda's Issues

Improve start-up time

The start up time of the CLI is curiously high. On Linux, using the binary it takes about 1s:

bash -c "time for i in {1..10}; do bin/stencila-convert --version; done"
0.33.0
0.33.0
0.33.0
0.33.0
0.33.0
0.33.0
0.33.0
0.33.0
0.33.0
0.33.0

real    0m9.936s
user    0m12.604s
sys     0m0.868s

In comparison, nixster takes 0.1s and dockter takes 0.25s (both are built using pkg).

I experimented with commenting out import './boot' from cli.ts and the time (with a binary) was almost exactly the same. However when I commented out import { convert } from './index' it fell to 0.15s. So it may simply be the amount of code that needs to be interpreted by Node at startup time that is causing the slowness. Would using dynamic require() calls be a way around this?

Not sure if we should address this now; that could be premature optimisation.
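
As a rough illustration of the dynamic loading idea raised above, the sketch below defers each command's code behind a loader function so that a cheap path such as --version never pays for it. The loader table, the `run` function and the `loadCount` helper are hypothetical names for illustration; in the real CLI each loader would wrap a dynamic require() of a heavy module such as './index'.

```typescript
type Command = (args: string[]) => string

let loads = 0 // how many command "modules" have actually been loaded
const loadCount = (): number => loads

// Hypothetical loader table: each entry stands in for a dynamic require().
const loaders = new Map<string, () => Command>([
  [
    'convert',
    () => {
      loads += 1 // stands in for the cost of evaluating the converter code
      return (args) => `converted ${args.join(' ')}`
    },
  ],
])

function run(argv: string[]): string {
  const [name, ...rest] = argv
  // Cheap paths return before any loader is touched.
  if (name === '--version') return '0.33.0'
  const load = name === undefined ? undefined : loaders.get(name)
  if (load === undefined) return `unknown command: ${String(name)}`
  return load()(rest)
}
```

With this shape, `run(['--version'])` leaves `loadCount()` at zero, and only `run(['convert', ...])` triggers the load. Whether this recovers the 0.15s figure measured above would need to be checked against the real binary.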

Ability to package CLI as a standalone binary

To make it easy to serve and use the Convert CLI without requiring an installation of Node, we'd like to package it up as a standalone binary using PKG.

This will be needed fairly soon to be used from Stencila Hub.

DSVConverter: implement import

DSV (Delimiter Separated Values)
The most common and popular are comma and tab separated values. However, the converter should support other delimiters as well.

Converting cite2c citations/references from Jupyter

Currently live citations in Jupyter are done via the extension.

Reduce brand tone on default output themes

The Word and HTML styling is too garish for my tastes and as a user the very first thing I would do is go and change the styles - particularly of the headings. But I think we should discuss that in another PR.
#59 (review)
:)

The current default themes are based on the Stencila website which, while great for marketing materials, is a little too harsh and distracting for consumption of prose/long-form content.

We'd like to turn down the volume on the branding, the objective being for generated documents to remain identifiable as coming from Stencila without distracting from the content.

Only build and bundle for the host platform

Currently we are building MacOS, Windows and Linux binaries all on Linux (Travis CI). We shouldn't be doing this because we are now using native binaries (Chromium and Pandoc at present).

FWIW, the change to pkg options needed is:

pkg --targets=host --out-path=bin .

Type safety & Exception handling

This is a placeholder issue to discuss points raised in #65 (comment):

Re. Type safety and compile time guarantees:

Type: It's impossible to correctly define a type signature for a function which throws an error in TS. Return type of true | never is as close as we can get to give a hint to users who're reading this code, but when used the type signature hint gets reduced to just true by the compiler, so it's not tremendously helpful.

Re: Where to throw exceptions:

Really depends on where we assign value, but for me an important consideration for composable/functional paradigms is being able to trust a function to execute and pass along values, or lack thereof.
Exceptions at a utility/low-level function go against that dependability, and undermine the compile time guarantees provided by a type system.
Some ways of designing within these constraints include using things like Either and a "Functional Core, Reactive Shell" architecture.
But this is a broad topic unrelated to this specific PR, so I will go ahead and merge, and create a separate issue to discuss.

Originally posted by @alex-ketch in #65
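
To make the Either idea above concrete, here is a minimal sketch in which failure is part of the return type instead of a thrown exception. The `Either` shape and the `parseBoolean` and `report` functions are illustrative assumptions, not code from this repository.

```typescript
// A discriminated union: callers must check `ok` before reading `value`.
type Either<L, R> = { ok: false; error: L } | { ok: true; value: R }

// Hypothetical low-level utility that reports failure instead of throwing.
function parseBoolean(text: string): Either<string, boolean> {
  if (text === 'true') return { ok: true, value: true }
  if (text === 'false') return { ok: true, value: false }
  return { ok: false, error: `not a boolean: ${text}` }
}

// The caller decides what a failure means; nothing unwinds the stack.
function report(text: string): string {
  const result = parseBoolean(text)
  return result.ok ? `parsed ${String(result.value)}` : `failed (${result.error})`
}
```

Unlike a `true | never` signature, the failure case here survives in the type that callers see, so the compiler can insist that it is handled.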

Tests: Add more fixtures (test cases)

🐶 CSS styles for textual nodes

Related to #43

When converting from plain text formats (Markdown) to rich text formats (Word/etc), we would like to style the output for better readability and aesthetics.
Initially we will have a Stencila branded style by default, but should eventually allow selecting from various themes and publication formatting templates.

DSVConverter: implement export

DSV (Delimiter Separated Values)
The most common and popular are comma and tab separated values. However, the converter should support other delimiters as well.

Publish artifacts to NPM & Github Releases

Once #45 is completed we should publish the build artifacts, both the standalone binary and the JS packages, to NPM and GitHub releases.

For automating the release process we'd like to use Semantic Release.

rpng: fails for characters that are not Latin-1

For example:

node dist/cli "<p>An emoji: 🎉</p>" --from html --to rpng ./temp.png

Throws the exception:

Error: Only Latin-1 characters are permitted in PNG tEXt chunks. You might want to consider base64 encoding and/or zEXt compression

I suggest we go with zEXt since it has the additional advantage of keeping images smaller.
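
For illustration, the failing condition can be detected up front with a check like the one below. The second function shows a different workaround from the two the error message names: it escapes every non-ASCII character in the JSON as \uXXXX, so the text fits in a Latin-1-only chunk and still round-trips through JSON.parse. Both function names are hypothetical, and this sketch does not touch the PNG encoding itself.

```typescript
// True when every UTF-16 code unit fits in Latin-1 (ISO 8859-1), i.e. <= 0xFF.
function isLatin1(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0xff) return false
  }
  return true
}

// Serialise to JSON, then escape anything outside printable ASCII as \uXXXX.
// Astral characters such as emoji become a pair of surrogate escapes, which
// JSON.parse turns back into the original character.
function toAsciiJson(value: string | number | boolean | null | object): string {
  return JSON.stringify(value).replace(/[^\x20-\x7e]/g, (ch) => {
    const hex = ch.charCodeAt(0).toString(16)
    return '\\u' + ('0000' + hex).slice(-4)
  })
}
```

Escaping makes the payload larger, not smaller, so it does not deliver the size advantage that motivates the compression suggestion above.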

Remove deprecated schema properties from `xlsx` converter

👍 This will require a change in the xlsx converter. https://github.com/stencila/convert/blob/master/src/xlsx.ts#L121-L161

Originally posted by @nokome in stencila/schema#59

The following properties have been removed from Schemas:

Table: cells
TableCell: position

Spruce up the CLI

Currently the CLI is pretty bare bones, which is fine for now. But we may want to spruce it up a little using Pastel (React on the command line!).

🐶 CSS styles for rPNGs: Code Block

When we generate an rPNG, we use a headless browser to render an HTML representation of the contents and then capture the result as a screenshot.

For better UX, readability, and branding, we should style the generated images to match the corresponding styles in Stencila Hub.

MarkdownConverter: Add "success" message after converting plain Markdown

At the moment there is no feedback after converting plain Markdown, whereas ipynb conversion, for example, reports:
Success: import from "hello-world.ipynb" to "hello-world.dar"

GDoc to Markdown incorrectly generates ordered list instead of unordered

The attached Google Docs JSON has an unordered list in it that gets converted to an ordered list when converting to markdown.

Output is:

---
title: real converter test
authors: []
---

# This is a file created in the HUB

A para inserted by Nokome

## It will upload to GOOGLE DOCS

1.  test
2.  test
3.  and final test

Expected output is:

---
title: real converter test
authors: []
---

# This is a file created in the HUB

A para inserted by Nokome

## It will upload to GOOGLE DOCS

-  test
-  test
-  and final test

gdoc.json.txt

CSVYConverter: implement import

MarkdownConverter: fix blockquote-lazy conversion

Use generative testing

One of the current testing approaches is to have many fixtures and automatically iterate through them, doing conversions and comparisons between formats as implied by file name extensions. That's OK, but it's cumbersome to maintain an ad hoc set of fixtures, particularly when there are changes to encodings.

We have recently added a custom jest matcher which tests for invert-ability: https://github.com/stencila/convert/blob/90586a9820598ebb36c45169377c24777a001b89/tests/matchers.ts#L9-L30

Currently we are just passing that some sample node trees. However, given we have a schema, we could use json-schema-faker to generate many node trees and check that the compiler can invert them (i.e. unparse and then parse them and get the same node back), e.g.

await expect(jats).toInvert(sampleNode)
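
The invert-ability check can be sketched without any test framework. The example below is an illustration under stated assumptions: it uses a toy codec (emphasis written as *text*) and a small hand-written, seeded generator in place of json-schema-faker and the real converters. All names in it are hypothetical.

```typescript
type Inline = string | { type: 'Emphasis'; content: string }

// Toy codec standing in for a real one: emphasis is wrapped in asterisks.
const encode = (nodes: Inline[]): string =>
  nodes.map((n) => (typeof n === 'string' ? n : `*${n.content}*`)).join('')

const decode = (text: string): Inline[] =>
  text
    .split(/(\*[^*]+\*)/)
    .filter((part) => part !== '')
    .map((part) =>
      part.charAt(0) === '*' ? { type: 'Emphasis' as const, content: part.slice(1, -1) } : part
    )

// Small seeded generator (a linear congruential generator) so runs are repeatable.
function makeRng(seed: number): () => number {
  let state = seed >>> 0
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0
    return state / 4294967296
  }
}

function randomWord(rng: () => number): string {
  const letters = 'abcdefghij '
  const length = 1 + Math.floor(rng() * 6)
  let word = ''
  for (let i = 0; i < length; i++) word += letters.charAt(Math.floor(rng() * letters.length))
  return word
}

function randomInlines(rng: () => number): Inline[] {
  const nodes: Inline[] = []
  const count = 1 + Math.floor(rng() * 5)
  for (let i = 0; i < count; i++) {
    const previous = nodes[nodes.length - 1]
    // Two adjacent strings would merge when encoded, so never generate them.
    const emphasis = typeof previous === 'string' || rng() < 0.5
    nodes.push(emphasis ? { type: 'Emphasis', content: randomWord(rng) } : randomWord(rng))
  }
  return nodes
}

// Returns the first generated tree that fails to round-trip, if any.
function findCounterexample(runs: number, seed = 1): Inline[] | undefined {
  const rng = makeRng(seed)
  for (let i = 0; i < runs; i++) {
    const nodes = randomInlines(rng)
    if (JSON.stringify(decode(encode(nodes))) !== JSON.stringify(nodes)) return nodes
  }
  return undefined
}
```

The generator deliberately avoids trees the codec cannot represent uniquely (adjacent strings, empty content). Deciding which such constraints belong in the generator and which are codec bugs is most of the work in this style of testing.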

LaTeXConverter: tex_math_dollars, enable $ and $$

As requested by Jean

One vote for tex_math_dollars, so that we can use $ and $$, which is a widely supported syntax.

SheetScriptConverter: implement export

hello-world.md converted to output.docx on MacOS High Sierra can't be opened on MS Word

Downloaded the latest stencila CLI and attempted a convert to docx on the example hello-world.md file. The resulting file is rejected by MS Word for Mac 16.14.1:

Importing in Pages also fails with a "Pages can't read the file" error.

Here is the actual file:
output.docx

Markdown: support content blocks?

One of the things we have been thinking about recently is how to model transclusion of various types of content into documents. In JSON, a transclusion node within a document might look something like:

{
    "type": "Include",
    "source": "my-code-example.py"
}

My initial thought on how to represent an Include node in Markdown was to use Commonmark's generic extension (using remark-generic-extensions) i.e.

!include[my-code-example.py]

However, @alex-ketch pointed out that iA Writer takes an elegant approach (inspired by a John Gruber tweet) of simply parsing URLs and filesystem paths and including their content, e.g.

./my-code-example.py

Or, using a URL

http://github.com/.../my-code-example.py

Of course both could be supported (one explicit, the other more implicit).

Converting docx to HTML fails with EPIPE

$ npx ts-node --files src/cli --from docx --to html upload-test-new.docx

Error: write EPIPE
    at WriteWrap.afterWrite (net.js:799:14)

upload-test-new.docx

DocumentRmdConverter: translate cells with a fig.cap option to a repro-fig

An R Markdown cell with the fig.cap option set should be converted into a JATS4M fig[fig-type=repro-fig]

e.g.

```{r fig.width=7,fig.height=6,fig.cap="Figure caption"}
hist(rnorm(min(1000, 1e5)), breaks=40, col=hsv(0.6, 0.9, 1), xlab="Value", main="")
```

<fig id="f1" fig-type="repro-fig">
  <caption>
    <title>Figure caption</title>
  </caption>
  <alternatives>
    <code specific-use="source" language="r">#: fig.width=7,fig.height=6
hist(rnorm(min(1000, 1e5)), breaks=40, col=hsv(0.6, 0.9, 1), xlab="Value", main="")</code>
    <code specific-use="output" language="json">{}</code>
  </alternatives>
</fig>

Document JATS exporter: ensure <abstract> element exists

It seems that Texture now requires an <abstract> element e.g.

<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <abstract>
        <p>Abstract paragraph.</p>
      </abstract>
    </article-meta>
  </front>

IPYNB: support importing of title and authors

Currently, the R Markdown converter (DocumentRmdConverter) supports importing document title and authors from YAML metadata headers. It would be great to have similar functionality for Jupyter documents.

The Jupyter spec mentions an authors field in the notebook's metadata but seems to be permissive with respect to how that is represented (i.e. string v more complex objects). Other relevant discussions include:

At this stage, propose to add simple support for importing title and authors from the notebook metadata by generating a Pandoc YAML header. Once that is implemented as a 'proof-of-concept' consult with Jupyter community on long term preferred options for more detailed structure e.g. affiliations.

FunctionJsDocConverter: add parsing of Markdown

Currently the <description> tags in the FunctionSchema.rng over in stencila/stencila only allow for plain text. It would be useful to allow richer content and to introduce Markdown parsing to FunctionJsDocConverter when that is done.

JsDocConverter: implement

Why?

Currently, Function definitions need to be authored in XML which is unwieldy and could discourage developers from writing them.

JsDoc is a convenient, widely used format for writing function documentation. Its @tag format is used for documentation in other languages (e.g. roxygen for R).

This is a high priority because we need to build up the libcore function library.

Approach

Import : The doctrine package can parse JsDoc strings. We then convert that to XML.
Export: No plans to support export to JsDoc at this time

CSVY: implement codec

Fix Windows tests

Currently failing because of carriage return character \r: https://ci.appveyor.com/project/nokome/convert/build/1.0.31#L75

SheetDSVConverter: empty cells ignored during conversion

Empty cells are ignored during the conversion which breaks the converted Stencila Sheet (Invalid dimension error). Empty cells should be converted to <cell> </cell>.

DOCXConverter: implement

SheetScriptConverter: implement import

DocumentMdConverter: store language and indicate an executable cell

Code cells in md files aren't converted to executable cells (like is done for Rmd files by DocumentXmdConverter) but just to <code language=....> </code>

Since we may use md files as a source for conversion (esp. containing cells in multiple languages) it would be good to support conversion to code cells in Stencila and indicate that they are executable (where relevant?).

docx: reinstate converter

Our .docx converter uses Pandoc. It needs to be refactored to our new API and the Pandoc JSON to Stencila JSON conversion reimplemented based on our new schema.

Fix Appveyor builds

The tests on AppVeyor (Windows CI) fail due to line endings (i.e. \r\n versus \n):

https://ci.appveyor.com/project/nokome/convert
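
One common way to make such comparisons platform independent is to normalise line endings on both sides before comparing. The helper below is a sketch, assuming the failing tests compare converter output with fixture files as strings; `normalizeLineEndings` is a hypothetical name, not an existing function in this repository.

```typescript
// Convert Windows (\r\n) and old Mac (\r) line endings to Unix (\n).
function normalizeLineEndings(text: string): string {
  return text.replace(/\r\n?/g, '\n')
}
```

Applied to both the expected and the actual text, this makes `'# Heading\r\n'` and `'# Heading\n'` compare equal. An alternative worth weighing is a `.gitattributes` rule that checks fixtures out with \n on every platform.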

Change binary name from stencila-convert to convert

For ergonomics it would be nicer to have a more succinct name.

xlsx: parse workbook properties

When parsing xlsx (and ods) files, the workbook.Props should be used to populate properties of the CreativeWork created in:

https://github.com/stencila/convert/blob/3934898f58be39a91478fe21739f4726d6e28006/src/xlsx.ts#L45

Handle escaped string while parsing CSVs

https://github.com/stencila/convert/blob/ff0d789081b46788ac5df1043505300fda90c806/src/util.ts#L171-L176

Unfortunately this does not account for escaped text like:
header1, header2, header3
text, "some text, it contains a comma", another text 

Originally posted by @alex-ketch in #65
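
A minimal sketch of a line parser that copes with the quoted cell above is shown below. It is an illustration, not the code in util.ts: `parseDsvLine` is a hypothetical name, the delimiter is a parameter so the same routine also serves tab or other separators, empty cells are kept, and surrounding whitespace is trimmed on the assumption that padding like that in the sample is not significant.

```typescript
function parseDsvLine(line: string, delimiter = ','): string[] {
  const cells: string[] = []
  let cell = ''
  let quoted = false
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i)
    if (quoted) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        cell += '"' // a doubled quote inside a quoted cell is a literal quote
        i++
      } else if (ch === '"') {
        quoted = false
      } else {
        cell += ch // delimiters inside quotes are ordinary characters
      }
    } else if (ch === '"') {
      quoted = true
    } else if (ch === delimiter) {
      cells.push(cell.trim())
      cell = ''
    } else {
      cell += ch
    }
  }
  cells.push(cell.trim()) // the last cell has no trailing delimiter
  return cells
}
```

This handles one line at a time, so quoted cells that contain newlines would need the quoting state carried across lines.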

Use `-` as placeholders for stdin and stdout

Currently, we are using -- (two dashes) to signify input from stdin, or output to stdout. e.g.

echo "# Heading" | stencila-convert --from md -- ./myheading.docx

The -- is necessary to indicate that ./myheading.docx is the desired output filename, not the input filename.

However, other CLIs, notably pandoc and tar, use a single dash - to indicate this. To be consistent, we'll change to a single dash.
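
A sketch of how the positional arguments might be resolved under the new convention follows. The `Endpoint` type and both function names are hypothetical, and treating a missing argument the same as - is an assumption made for the sketch, not something decided in this issue.

```typescript
type Endpoint = { kind: 'stdio' } | { kind: 'file'; path: string }

// A single dash (or, by assumption, a missing argument) means stdin/stdout.
const toEndpoint = (arg: string | undefined): Endpoint =>
  arg === undefined || arg === '-' ? { kind: 'stdio' } : { kind: 'file', path: arg }

function resolveEndpoints(positional: string[]): { input: Endpoint; output: Endpoint } {
  const [input, output] = positional
  return { input: toEndpoint(input), output: toEndpoint(output) }
}
```

For the example above, `resolveEndpoints(['-', './myheading.docx'])` reads from stdin and writes to the named file, so the output filename is no longer mistaken for the input.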

Flatten the fixtures

Improve speed of rPNG generation

Currently rPNG generation is quite slow. This is not a huge issue but could become one with documents that contain many reproducible elements (e.g. lots of in-text code expressions providing analysis results). The pandoc.test.ts, which involves generation of six rPNGs, is noticeably slow: https://travis-ci.org/stencila/convert/jobs/532638335#L144

A little test

Here is a Markdown document with some non-textual nodes:

A !number(1) and a !boolean(true) and an !array(1,2,3,4,5) and an !object("a":1, "b":2)

When converted to Word bin/stencila-convert temp.md ./temp.docx the "non-textual" nodes get converted to rPNGs:

Which takes an average of 3.5s (and shows spike in network traffic associated with fetching highlight.js presumably):

bash -c "time for i in {1..10}; do bin/stencila-convert temp.md ./temp.docx; done"

real    0m35.710s
user    0m21.296s
sys     0m4.856s

If you remove the exclamation marks (i.e. remove the Commonmark inline extension syntax):

A number(1) and a boolean(true) and an array(1,2,3,4,5) and an object("a":1, "b":2)

Then there are no rPNGs and the time drops to 1.1 seconds (only slightly above the base startup time, #60).

bash -c "time for i in {1..10}; do bin/stencila-convert temp.md ./temp.docx; done"

real    0m11.092s
user    0m13.652s
sys     0m1.008s

Potential solution

It seems that the first approach would be to lazily, and only once, initialise a rendering page. That is, do this https://github.com/stencila/convert/blob/99975bbdcd3526d98041d98bf0f098b867d770d4/src/rpng.ts#L208-L216 outside of the unparse function, and only as needed.
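
The "lazily, and only once" part can be sketched as a small memoising wrapper. The example below is a sketch under assumptions: `once`, `getPage` and the `Page` shape are hypothetical, and the factory stands in for launching the browser and opening a page, which the linked code currently does inside unparse.

```typescript
// Wrap a factory so it runs on first use only; later calls reuse the result.
function once<T>(factory: () => T): () => T {
  let cached: { value: T } | undefined
  return () => {
    if (cached === undefined) cached = { value: factory() }
    return cached.value
  }
}

// Hypothetical stand-in for the expensive browser/page setup.
interface Page {
  id: number
}
let launches = 0
const launchCount = (): number => launches

const getPage = once((): Promise<Page> => {
  launches += 1
  return Promise.resolve({ id: launches })
})
```

Because the promise itself is cached, concurrent callers share a single launch instead of racing to start several. One caveat of this simple form is that a rejected promise is cached too, so a failed launch would need the cache to be reset before retrying.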

List formats recognized by the converter in help

Provide an install script and instructions

I think we should provide an install script and README instructions like we have for Dockter. See the README there and https://github.com/stencila/dockter/blob/master/install.sh (we'd just need to do some renaming and add a windows case to that script and add it here).

FunctionJsDocConverter: add @see and @related tag

The FunctionSchema.rng allows for listing the name of related functions:

  <define name="function:relateds">
    <element name="relateds">
      <zeroOrMore>
        <ref name="function:related"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="function:related">
    <element name="related">
      <text/>
    </element>
  </define>

JsDoc function documentation should support this using @related, and alias @see, tags.

Error decoding CodeBlocks in Markdown

When converting this Markdown with four consecutive code blocks...

```
date
```

```bash
date
```

```sh
date --utc
```

```bash pause=2
date --help
```

using...

npx ts-node --files src/cli convert ./test.md - --to json

I got...

{
  "type": "Article",
  "authors": [],
  "content": [
    {
      "type": "CodeBlock",
      "language": null,
      "value": "date"
    },
    {
      "type": "CodeBlock",
      "language": "bash",
      "meta": {
        "--utc": ""
      },
      "value": "date"
    },
    {
      "type": "CodeBlock",
      "language": "sh",
      "meta": {
        "--utc": ""
      },
      "value": "date --utc"
    },
    {
      "type": "CodeBlock",
      "language": "bash",
      "meta": {
        "pause": "2"
      },
      "value": "date --help"
    }
  ]
}

There are several issues:

the first block has "language": null instead of not having that property at all
the second and third blocks have meta that has 'leaked' from the code of the third block
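
Both problems go away if the language and meta are derived only from the fence's info string and never from the code itself. The sketch below illustrates that behaviour. The `codeBlock` function and its signature are hypothetical, and how the actual decoder obtains the info string is not shown.

```typescript
interface CodeBlock {
  type: 'CodeBlock'
  language?: string
  meta?: Record<string, string>
  value: string
}

// `info` is the text after the opening fence, e.g. "bash pause=2".
function codeBlock(info: string, value: string): CodeBlock {
  const [language, ...rest] = info
    .trim()
    .split(/\s+/)
    .filter((word) => word !== '')
  const block: CodeBlock = { type: 'CodeBlock', value }
  // Omit the property entirely when no language was given.
  if (language !== undefined) block.language = language
  if (rest.length > 0) {
    const meta: Record<string, string> = {}
    for (const item of rest) {
      const eq = item.indexOf('=')
      if (eq === -1) meta[item] = ''
      else meta[item.slice(0, eq)] = item.slice(eq + 1)
    }
    block.meta = meta
  }
  return block
}
```

With this, `codeBlock('', 'date')` has no language property, `codeBlock('sh', 'date --utc')` has no meta, and only `codeBlock('bash pause=2', 'date --help')` carries a pause entry.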

xlsx to PDF conversion fails with

Reported by burque505 - Winter Laite

Even a spreadsheet with only 6 cells, each containing only text, fails with the error mentioned. The json created, however, contains all the data.(...)
EDIT:
Code problem solved with bbcode tag (trying spoiler out also). By the way, I get a usable conversion by first converting xlsx -> md and then md -> pdf, but no charts so far.

EDIT: problematic file attached

TM.xlsx

Generate PDFs using Puppeteer and Paged.js

Given that we integrate Puppeteer, there is an opportunity to generate PDFs using that, rather than requiring users to have a LaTeX distribution like TeX Live installed. This could be combined with Paged.js to create beautiful paginated PDFs of documents.

Relevant links:
