Document structure¶

import pandoc
from pandoc.types import *

Meta-model¶

Pandoc models every document as a tree of elements. Each element has a well-defined type such as paragraph, image, note link, etc. and of course the document type. These elements are combined using a well-defined set of rules which defines the document meta-model¹.

Pandoc can be used as a converter between different document formats; this usage requires very little knowledge about the document structure. However, if one wishes to analyze, create or transform documents, some working knowledge of this meta-model becomes necessary.

Haskell & Python¶

The primary source of information about pandoc's meta-model is the hierarchy of types defined by the pandoc-types Haskell package. The meta-model, represented by a collection of Haskell types, is described in the documentation of the Text.Pandoc.Definition module.

However, this source of information requires some understanding of the Haskell programming language. The pandoc Python library brings to Python this hierarchy of types; it also offers an alternate and interactive way to become familiar with the meta-model. This is what we describe in the following sections.

Documents¶

Explore¶

The basic idea here is that you can create markdown documents that feature exactly the kind of document constructs that you are interested in, and then read them as pandoc documents to see how they look. By construction, these documents converted from markdown will be valid, i.e. consistent with the pandoc meta-model. And since you can display them, it's a great way to build some understanding of how things work.

For example, the plain text "Hello World!" is represented in the following manner:

>>> text = "Hello, World!"
>>> doc = pandoc.read(text)
>>> doc
Pandoc(Meta({}), [Para([Str('Hello,'), Space(), Str('World!')])])

We can see that this document is an instance of the Pandoc type, which contains some (empty) metadata and whose contents are a single paragraph which contains strings and spaces.

It's possible to explore interactively this document in a more precise manner:

>>> doc
Pandoc(Meta({}), [Para([Str('Hello,'), Space(), Str('World!')])])
>>> meta = doc[0]
>>> meta
Meta({})
>>> meta[0]
{}
>>> contents = doc[1]
>>> contents
[Para([Str('Hello,'), Space(), Str('World!')])]
>>> paragraph = contents[0]
>>> paragraph
Para([Str('Hello,'), Space(), Str('World!')])
>>> paragraph[0]
[Str('Hello,'), Space(), Str('World!')]
>>> world = paragraph[0][2]
>>> world
Str('World!')

I recommend that you try to reproduce the process above for small documents that feature titles, headers, emphasized text, lists, etc. to become familiar with the way that these constructs are described in pandoc documents.

Create¶

At this stage, even if we have not yet described formally the meta-model, we have already gathered enough knowledge to build a simple plain text document from scratch.

>>> text = [Str("Python"), Space(), Str("&"), Space(), Str("Pandoc")]
>>> paragraph = Para(text)
>>> metadata = Meta({})
>>> doc = Pandoc(metadata, [paragraph])

We can check that our document is valid and describes what we are expecting by converting it to markdown and displaying the result:

>>> print(pandoc.write(doc)) # doctest: +NORMALIZE_WHITESPACE
Python & Pandoc

Types¶

Explore¶

The insights gathered in the previous sections were a good starting point to get a feel for the possible document structure. Now, to be certain that we always deal with valid documents, we need to explore the document meta-model itself, i.e. the hierarchy of pandoc types, such as Pandoc, Meta, Para, Str, Space, etc. Luckily for us, these types are self-documented: in the Python interpreter they are represented by a type signature. This signature described how they can be constructed.

For example, the top-level type Pandoc is represented as:

>>> Pandoc
Pandoc(Meta, [Block])

which means that a Pandoc instance is defined by an instance of Meta (the document metadata) and a list of blocks. In our exemple above, the metadata was not very interesting: Meta({}). Still, we can make sure that this fragment is valid: the Meta type signature is

>>> Meta
Meta({Text: MetaValue})

which reads as: metadata instances contain a dictionary of Text keys and MetaValue values. In our example, this dictionary was empty, hence we don't need to explore the structure of Text and MetaValue any further to conclude that the fragment is valid.

Now, let's explore the content of the document which is defined as a list of blocks. The Block type signature is

>>> Block
Block = Plain([Inline])
      | Para([Inline])
      | LineBlock([[Inline]])
      | CodeBlock(Attr, Text)
      | RawBlock(Format, Text)
      | BlockQuote([Block])
      | OrderedList(ListAttributes, [[Block]])
      | BulletList([[Block]])
      | DefinitionList([([Inline], [[Block]])])
      | Header(Int, Attr, [Inline])
      | HorizontalRule()
      | Table(Attr, Caption, [ColSpec], TableHead, [TableBody], TableFoot)
      | Figure(Attr, Caption, [Block])
      | Div(Attr, [Block])

Each "|" symbol in the signature represents an alternative: blocks are either instances of Plain or Para or LineBlock, etc. In our example document, the only type of block that was used is the paragraph type Para, whose signature is:

>>> Para
Para([Inline])

Paragraphs contain a list of inlines. An inline is

>>> Inline
Inline = Str(Text)
       | Emph([Inline])
       | Underline([Inline])
       | Strong([Inline])
       | Strikeout([Inline])
       | Superscript([Inline])
       | Subscript([Inline])
       | SmallCaps([Inline])
       | Quoted(QuoteType, [Inline])
       | Cite([Citation], [Inline])
       | Code(Attr, Text)
       | Space()
       | SoftBreak()
       | LineBreak()
       | Math(MathType, Text)
       | RawInline(Format, Text)
       | Link(Attr, [Inline], Target)
       | Image(Attr, [Inline], Target)
       | Note([Block])
       | Span(Attr, [Inline])

In our plain text example, only two types of inlines where used: strings Str and white space Space. Since

>>> Str
Str(Text)
>>> Text
<class 'str'>

we see that Str merely wraps an instance of Text which is simply a synonym for the Python string type. On the other hand, the white space is a type without any content:

>>> Space
Space()

We now have successfully discovered all pandoc types used in our simple "Hello world!" document. Again, I recommend that you reproduce this process for all document constructs that you are interested in.

Kinds of Types¶

The types defined in pandoc.types are either data types, typedefs or aliases for Python built-ins.

>>> from pandoc.types import *

The Pandoc type is an example of data type:

>>> issubclass(Pandoc, Type)
True
>>> issubclass(Pandoc, Data)
True

Data types come in two flavors: abstract or concrete. The signature of abstract data types lists the collection of concrete types they correspond to:

>>> Inline # doctest: +ELLIPSIS
Inline = Str(Text)
       | Emph([Inline])
       | Underline([Inline])
       | Strong([Inline])
...
>>> issubclass(Inline, Type)
True
>>> issubclass(Inline, Data)
True

The concrete types on the right-hand side of this signature are constructor (concrete) types. The abstract type itself is not a constructor ; it cannot be instantiated:

>>> issubclass(Inline, Constructor)
False
>>> Inline()
Traceback (most recent call last):
...
TypeError: Can't instantiate abstract class Inline

The constructors associated to some abstract data type are concrete:

>>> issubclass(Str, Type)
True
>>> issubclass(Str, Data)
True
>>> issubclass(Str, Constructor)
True

They can be instantiated and the classic inheritance test apply:

>>> string = Str("Hello")
>>> isinstance(string, Str)
True

Constructor types inherit from the corresponding abstract data type:

>>> issubclass(Str, Inline)
True
>>> isinstance(string, Inline)
True

Typedefs are also another kind of abstract type. They are merely introduced so that we can name some constructs in the type hierarchy, but no instance of such types exist in documents. For example, consider the Attr and Target types:

>>> Attr
Attr = (Text, [Text], [(Text, Text)])
>>> Target
Target = (Text, Text)

They are pandoc types which are not data types but typedefs:

>>> issubclass(Attr, Type)
True
>>> issubclass(Attr, Data)
False
>>> issubclass(Attr, TypeDef)
True
>>> issubclass(Target, Type)
True
>>> issubclass(Target, Data)
False
>>> issubclass(Target, TypeDef)
True

They enable more compact and readable types signatures. For example, with typedefs, the Link signature is:

>>> Link
Link(Attr, [Inline], Target)

instead of Link((Text, [Text], [(Text, Text)]), [Inline], (Text, Text)) without them.

To mimick closely the original Haskell type hierarchy, we also define aliases for some Python primitive types. For example, the Text type used in the Str data constructor is not a custom Pandoc type:

>>> Str
Str(Text)
>>> issubclass(Text, Type)
False

Instead, it's a mere alias for the builtin Python string:

>>> Text
<class 'str'>

A document model represents a given document. The document meta-model represents the document model itself, i.e. the set of all valid documents. ↩