mstade / markette

Deliciously minimalistic markup.

License: MIT License

markette's Issues

Selection language for profiles

In #5 there's a brief discussion on creating a language for document selections, kind of like CSS. This could be very useful for defining processor clues in human readable profile documents. This probably benefits from being a separate spec altogether, but I'm leaving a note in the backlog just to keep track of it anyway. It's not planned for the first draft.

Metadata and/or preface

It's common for documents – particularly larger ones, such as books – to include metadata. Such metadata might include a list of authors, date published, addresses of different kinds (URLs, email, physical) etc. However, even smaller documents often include metadata. Consider something like a project description, i.e. README, which may contain links to licenses or other projects; an example of this would be RFCs.

What might be a good way to represent this in a structured manner such that it's easily extracted, yet not restrictive in authoring? Some considerations to make:

- Should be possible to read before any other part of a document; particularly important to streaming parsers where the metadata block might contain functionality modifiers (e.g. profile links).
  - However, some content (such as README files) may actually be less accessible if this is a hard requirement (i.e. a required initial block, if too large, might deter a reader from the meat of the document).
- Might include multiple disparate types of metadata, yet still grouped because they are in fact metadata; thus it might be necessary to allow any type of content, not just simplify to key/value pairs.
  - Key/value pairs are common for technical data which may or may not be important for rendering, but is important for machine consumption. This would allow accurate representation of HTTP messages, for instance. It might be useful to consider this a common special case.

Internationalization and bi-directional text

Consider whether the syntactical elements of this format should take into account non-English languages. Is this format too tied to the Germanic family of languages? For instance, how do you define lists in Arabic? Will the primitives have to change depending on language, or is there such a thing as a universal format for these elements?

So many questions... Much more research needs to be done, preferably with input from writers of languages outside the Germanic family.

Document outline

Sectioning of a document should be implicitly defined. If not, there would have to be some sort of way of explicitly marking the beginning and end of a section. This is, I think, markup for the sake of markup which goes against the goals of this project.

HTML5 has some pretty good ideas on how to do document outlines; it's worth looking at the spec for inspiration.

Character encoding

It should be specified what character encoding should be used for parsing and saving documents. For instance, HTML has a fairly extensive set of rules for determining the character encoding. There are some lessons that could be learned from this, such as expressing the detected encoding with confidence levels. However, because we're trying not to tie this format to HTTP, it's probably wise not to include assumptions about headers and such. HTTP also regulates these things anyway, so I think there's little point in being particularly detailed in this regard.

I reckon the simplest possible, and probably most portable, suggestion is to say that one should always assume the document is UTF-8 unless otherwise specified. This is essentially what the HTML spec does, albeit in many more words. Now the question is, what does "unless otherwise specified" actually mean? Possibly, this relates to #1 in that the document itself may include metadata suggesting that the encoding differs from the assumed UTF-8. This has the ironic property that in order to parse the document, one must first parse the document. However, since UTF-8 is ASCII-compatible, as are many other encodings, it's possible that starting off with UTF-8 is enough, with the parser then allowed to adjust and possibly re-parse the document using some other encoding if it's told to. I'm not sure this is important though, since this is largely a technical detail and the mission here is to provide a format that allows for simple human authoring and reading, while still providing enough structure for machines to be fairly smart about documents.

My thoughts boil down to the following spec:

Parsers should assume that documents are encoded using UTF-8. Parsers may pre-scan the document to determine whether a different encoding is used; however, they are not required to do so. It is recommended that a parser never pre-scan more than the first 1024 bytes of a document, since this should be more than enough data to confidently determine the encoding.

Serializers should encode documents using UTF-8. Serializers may use a different encoding, but it is not recommended since this specification does not detail a pre-scanning algorithm for determining any encodings other than UTF-8.

Is this good enough or should there be more specific instructions for parsing/serializing documents? I'd like to avoid encoding metadata if possible, whether through a BOM or included in the actual document text, since it makes things simpler and since UTF-8 is more or less ubiquitous anyway. (This last statement probably needs citation.)
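To make the proposed parser rule concrete, here is a minimal Python sketch. It assumes UTF-8 and optionally pre-scans at most the first 1024 bytes. The function names and the `fallback` parameter are hypothetical: the draft above deliberately does not define a detection algorithm for other encodings, so the fallback is only a stand-in for "whatever the parser is told to use".

```python
import codecs

PRESCAN_LIMIT = 1024  # upper bound on pre-scanned bytes, as recommended above


def looks_like_utf8(data: bytes, limit: int = PRESCAN_LIMIT) -> bool:
    """Return True if the first `limit` bytes are valid UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False tolerates a multi-byte sequence cut off by the limit.
        decoder.decode(data[:limit], final=False)
    except UnicodeDecodeError:
        return False
    return True


def decode_document(data: bytes, fallback: str = "latin-1") -> str:
    """Decode as UTF-8 by default; use `fallback` only if the pre-scan fails.

    `fallback` is an assumption for illustration, not part of the draft.
    Decoding may still raise UnicodeDecodeError if invalid bytes appear
    after the pre-scanned window.
    """
    if looks_like_utf8(data):
        return data.decode("utf-8")
    return data.decode(fallback)
```

A parser that skips the optional pre-scan would simply call `data.decode("utf-8")` and report an error on failure.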

Profiling (was extensions)

This format needs the ability to be extended through document profiles, so that people can define additional structures on top of the primitives. The need for this is manifested by the existence of a gazillion markdown clones; most of which define new semantics rather than new syntax. Additionally, there could be some overlap between this and vocabularies like <schema.org> which could make this format able to use existing vocabularies with ease. Not sure yet how this might look without markup-for-markup's-sake syntax.

This is arguably the most important feature of this format.

Some thoughts:

- Extensions define structure and semantics, not new primitives
- Be flexible; allow any kind of extension that doesn't change the underlying format

Examples of where the format could be extended through profiles to provide additional semantics:

What's in a name?

This format needs a name that doesn't suck. Some considerations:

- Should be memorable; implies that it's short and uncomplicated
- Google friendly, which likely excludes cute word overloading
  - However it doesn't exclude cute word subclassing (markdown being the canonical example)
- Human friendly; implies non-technical

Inspirational resources and prior art

It's important to consider prior art in this process, not just for inspiration but also in order to learn from the successes and failures of other formats. Let this issue be a collection of links and discussions for future reference.

Content classes

There are six classes of content:

- Metadata (#1)
- Flow
- Sectioning
- Heading
- Phrasing
- Hypermedia

All content is described by at least one class, but most belong to two or more. For instance, all content except for document metadata (see #1) is considered flow content.

Metadata content (#1)

Metadata content describes additional information about other content; that is, it's data about data. It can provide additional information about the document as a whole, in which case it is classified solely as metadata and not as any other type of content. Alternatively, it can provide additional information about flow content and phrasing content, in which case it's classified as metadata and flow or phrasing content. (Phrasing content is also flow content, which means that phrasing metadata is by definition also flow content.)

Flow content

All content except for document metadata is considered flow content. In HTTP parlance one can say that flow content is the body of a document whereas non-flow metadata content is the document headers.

Sectioning content

Sectioning content defines the scope of other content. A section relates to other sections either by being a sibling or a child. Only heading sections can describe a parent-child relationship, whereas other kinds of sectioning content describe sibling relationships. All sectioning content is also flow content; however, this is the only other content class it belongs to.

Heading content

Heading content defines the scope of content following it; thus, it is also sectioning content. There are six priority levels, all of which create a new section. The heading explicitly marks the start of a section; however, there is no way to explicitly mark its end. If the priority of a heading is lower than that of the previous heading, the section created should be considered a subsection of the previous heading's section. The exception to this rule is if there is no previous heading, in which case the heading is considered top level even if there are headings of higher priority following it.

Heading sections are different from other kinds of sectioning content in that they describe a parent-child relationship between their content and their parent's content.
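The heading rules above can be sketched as a small stack-based outline builder. This is only an illustration in Python, under a few assumptions not stated in the issue: level 1 is the highest priority and level 6 the lowest, a heading of equal priority starts a sibling section, and a heading of higher priority implicitly closes the open sections below it. The names `Section` and `build_outline` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    level: int  # assumed: 1 is the highest priority, 6 the lowest
    title: str
    children: list[Section] = field(default_factory=list)


def build_outline(headings: list[tuple[int, str]]) -> list[Section]:
    """Turn a flat sequence of (level, title) headings into nested sections."""
    top_level: list[Section] = []
    open_sections: list[Section] = []  # chain from outermost to innermost
    for level, title in headings:
        if not 1 <= level <= 6:
            raise ValueError(f"heading level must be 1-6, got {level}")
        section = Section(level, title)
        # A new heading implicitly ends every open section whose priority
        # is equal to or lower than its own; there is no explicit end marker.
        while open_sections and open_sections[-1].level >= level:
            open_sections.pop()
        if open_sections:
            # Lower priority than the enclosing heading: a subsection.
            open_sections[-1].children.append(section)
        else:
            # No enclosing heading left (or no previous heading): top level.
            top_level.append(section)
        open_sections.append(section)
    return top_level
```

With this sketch, a document that opens with a level 3 heading followed by a level 1 heading yields two top-level sections, matching the exception described above.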

Phrasing content

Any content that is not sectioning or metadata-only content is also phrasing content. The majority of all primitives are phrasing content.
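As a rough illustration of how the class rules so far fit together, the following Python sketch derives the full set of classes from the ones declared for a piece of content. It is one reading of the text, not a definitive model: `ContentClass` and `implied_classes` are hypothetical names, and the statement that sectioning content belongs to no class other than flow is interpreted here as "sectioning content is never phrasing content".

```python
from enum import Flag, auto


class ContentClass(Flag):
    METADATA = auto()
    FLOW = auto()
    SECTIONING = auto()
    HEADING = auto()
    PHRASING = auto()
    HYPERMEDIA = auto()


def implied_classes(declared: ContentClass) -> ContentClass:
    """Add the classes that the rules above imply for `declared`."""
    if not declared:
        raise ValueError("content must belong to at least one class")
    # Document-level metadata is metadata only, never flow content.
    if declared == ContentClass.METADATA:
        return declared
    # Everything else is flow content.
    result = declared | ContentClass.FLOW
    # Heading content is also sectioning content.
    if ContentClass.HEADING in result:
        result |= ContentClass.SECTIONING
    # Content that is neither sectioning nor metadata-only is phrasing content.
    if ContentClass.SECTIONING not in result:
        result |= ContentClass.PHRASING
    return result
```

For example, `implied_classes(ContentClass.HEADING)` yields heading, sectioning and flow, while `implied_classes(ContentClass.METADATA)` stays metadata only.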

Hypermedia content

Hypermedia is content that is linked to other content in some way. The links may be internal, i.e. content linking to other content within the document; or external, i.e. content linking to content external to the current document.

Because all hypermedia defines relationships in some way, it is by definition interactive. It describes the process by which a reader of the document should interact with the content. This process may or may not be the same for different readers. For instance, a human reader might read a document and see that there's an image embedded. The image is defined elsewhere, referred to by the use of a URL, and the reader may look at it by using a web browser. A machine reader, however, might download the image automatically and generate a different representation of the document altogether, such as a web page or printed document.

Another example of hypermedia might be technical documentation which has machine code embedded. For a human reader this might be instructional in the sense that they might use that code in some other context, such as implementing a program. A machine, however, might automatically read and execute the embedded code. Some hypermedia controls, such as machine executable instructions, raise obvious security concerns. Others, such as footnotes, might not.

Hypermedia affordances can be further classified using one or more H-factors.