
impresso-schemas's Introduction

Impresso website

Where are pages stored on GitHub?

| Jekyll collection | Folder in this GitHub repo |
| --- | --- |
| current home | index |
| new home | index_ |
| pages | /pages |
| lab post | /_labs |
| blog posts | /_posts |
| people | /_posts |
| partners | /_posts |
| events | /_posts |

Setting up development environment with Jekyll

This website uses Jekyll 3.9.3. In a terminal, change to the local site directory, then run the following commands:

bundle install
bundle exec jekyll serve

The local site is now served at localhost:4000

Details on setting up your GitHub Pages site locally with Jekyll

Setting up development environment with Docker

How to add an item to the Timeline

To add a new event to the timeline on the homepage, follow these steps:

  1. Create a new Markdown file in the _events directory with a filename of the form YYYY-MM-DD-short-slug.md, for example 2027-01-01-new-event.md. Use the earliest date of the event in the filename; it is only used to sort the files in the directory. Note: Jekyll does not render a file whose filename date is in the future, so you can safely create the file even if the event is not scheduled yet. If you want the event to be displayed as a separate page, don't forget to add the date field to the front matter!

  2. In the newly created Markdown file, add the following front matter at the beginning of the file:

---
title: 'Event Title'
date: YYYY-MM-DD # Publication date of the event, optional if date in the filename refers to past date
start_date: YYYY-MM-DD # Start date of the event
end_date: YYYY-MM-DD # End date of the event
human_date: Month Year # Human-readable date, e.g., January 2027
---

Replace 'Event Title' with the actual title of the event, and set start_date and end_date to the appropriate dates in the format YYYY-MM-DD. human_date is a free-text, human-readable date (e.g., January 2027); it is the only label displayed in the timeline, so adjust it to reflect the month and year of the event. For example, for an event titled "New Event" scheduled for a not yet specified day in January 2027, the front matter would look like:

---
title: 'New Event'
date: 2023-11-29
start_date: 2027-01-01
end_date: 2027-01-30
human_date: Second or third week of January 2027
---
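The date fields are easy to get wrong by hand (for instance, swapping month and day). A minimal sketch of a sanity check, assuming the front matter has already been loaded into a dictionary; the function name `check_event_dates` is hypothetical and not part of this repository:

```python
from datetime import date


def check_event_dates(front_matter: dict) -> list[str]:
    """Return a list of problems found in an event's date fields."""
    problems = []
    parsed = {}
    for key in ("date", "start_date", "end_date"):
        value = front_matter.get(key)
        if value is None:
            continue  # missing fields are not checked in this sketch
        try:
            # Accepts YYYY-MM-DD strings; rejects values such as 2027-30-01.
            parsed[key] = date.fromisoformat(str(value))
        except ValueError:
            problems.append(f"{key}: {value!r} is not a valid YYYY-MM-DD date")
    if "start_date" in parsed and "end_date" in parsed:
        if parsed["end_date"] < parsed["start_date"]:
            problems.append("end_date is earlier than start_date")
    return problems


if __name__ == "__main__":
    event = {"title": "New Event", "start_date": "2027-01-01", "end_date": "2027-30-01"}
    for problem in check_event_dates(event):
        print(problem)
```

An empty list means the checked fields parse as dates and that the event does not end before it starts.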

If there is a blogpost associated with the event, add its relative URL to the blogpost field in the front matter. The URL follows the permalink pattern /:categories/:year/:month/:day/:title:output_ext. For example, if the blogpost Markdown file is located at _posts/2027-01-01-new-event.md, the front matter would look like this:

---
title: 'New Event'
date: 2023-11-29
start_date: 2027-01-01
end_date: 2027-01-30
human_date: Second or third week of January 2027
+blogpost: /news/2027/01/01/new-event.html
---
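The mapping from a post filename to the value of the blogpost field can be sketched as follows. This is an illustration only: the helper name `blogpost_url` is hypothetical, the default category `news` is inferred from the example URL above, and the sketch assumes the post does not override its date in the front matter.

```python
import re

# Jekyll post filenames: YYYY-MM-DD-short-slug.md
POST_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+)\.(?:md|markdown)$")


def blogpost_url(post_filename: str, category: str = "news") -> str:
    """Build the URL for the pattern /:categories/:year/:month/:day/:title:output_ext."""
    match = POST_NAME.match(post_filename)
    if match is None:
        raise ValueError(f"unexpected post filename: {post_filename!r}")
    year, month, day, slug = match.groups()
    return f"/{category}/{year}/{month}/{day}/{slug}.html"


if __name__ == "__main__":
    print(blogpost_url("2027-01-01-new-event.md"))
```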

How to add an announcement and link to a blogpost

Announcements are displayed on the homepage. To add a new announcement, create a new Markdown file in the _announcements directory with a filename of the form YYYY-MM-DD-short-slug.md, for example 2027-01-01-new-announcement.md. Use the date of the announcement in the filename; it is only used to sort the files in the directory. Note: Jekyll does not render a file whose filename date is in the future, so you can safely create the file even if the announcement is not scheduled yet. If you want to link the announcement directly to a blogpost, add the relative URL of the blogpost to the blogpost field in the front matter. The URL follows the permalink pattern /:categories/:year/:month/:day/:title:output_ext. For example, if the blogpost Markdown file is located at _posts/2027-01-01-new-event.md, the front matter would look like this:

---
title: New announcement
blogpost: /news/2027/01/01/new-event.html
---

How to add a page, its list of seealso pages, and link to it from the menu

Create the page in the pages directory. The filename should match the title of the page, in lowercase and with dashes instead of spaces. For example, if the page title is "About the Project", the filename should be about-the-project.md. Add the permalink to the front matter of the page:

---
title: 'About the Project'
permalink: /about-the-project/
---

Then add an entry to the menu in the _data/navigation.yml file:

- title: About the Project
  url: /about-the-project/

The page front matter can contain a seealso list of links, each link being the exact permalink of the page to link to:

---
title: 'About the Project'
permalink: /about-the-project/
+ seealso:
+   - /project/objectives/
+   - /project/design/
---

If you need to add a page inside a subdirectory, for example, `/project/objectives/`, you need to add the `parentUrl` to the front matter of the page:

```diff
---
title: 'Objectives'
permalink: /project/objectives/
+ parentUrl: /project/
---
```

The folder structure of the pages directory should in principle reflect the menu structure. For example, the page /project/objectives/ should be located in the pages/project/objectives.md file. Note: the permalink is what ultimately tells Jekyll where to generate the page, regardless of the subdirectory in which the source file is located.
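As an illustration of this convention, the expected source file and the parentUrl of a nested page can both be derived from its permalink. The two helper names below are hypothetical, and returning `None` for top-level pages is an assumption of this sketch.

```python
from pathlib import PurePosixPath
from typing import Optional


def page_source_path(permalink: str) -> str:
    """Map a permalink such as /project/objectives/ to its file under pages/."""
    return str(PurePosixPath("pages") / (permalink.strip("/") + ".md"))


def parent_url(permalink: str) -> Optional[str]:
    """Return the parentUrl of a nested page, or None for a top-level page."""
    parts = permalink.strip("/").split("/")
    if len(parts) < 2:
        return None
    return "/" + "/".join(parts[:-1]) + "/"


if __name__ == "__main__":
    print(page_source_path("/project/objectives/"))
    print(parent_url("/project/objectives/"))
```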

How to add a blogpost

Create a new Markdown file in the _posts directory with a filename of the form YYYY-MM-DD-short-slug.md, for example 2027-01-01-new-event.md. Use the date of the post in the filename; it is only used to sort the files in the directory. Note: Jekyll does not render a file whose filename date is in the future, so you can safely create the file ahead of time. If the filename date is in the future and you want the post to be displayed, don't forget to add the date field to the front matter!

In the newly created Markdown file, add the following front matter at the beginning of the file:

---
title: 'Blogpost Title'
date: YYYY-MM-DD # Publication date of the blogpost, optional if date in the filename refers to past date
---

To add a cover figure, add the images to the /assets/images directory and add the figure field to the front matter:

---
title: 'Blogpost Title'
date: YYYY-MM-DD # Publication date of the blogpost, optional if date in the filename refers to past date
+ figure:
+   - src: figure1.png
+     alt: 'Figure 1'
+     caption: 'Figure 1: Caption of the figure'
---

impresso-schemas's Issues

accommodate relative coordinates in page and content item schemas

problem

  • our schemas currently assume that coordinates are always absolute (integers), expressed in terms of pixels;
  • however, we now have one use case for relative coordinates, expressed as percentages (in IIIF .../pct:41.6,7.5,40,70/full/...)

proposed solution

  • add a c_type field to the page (a.k.a. canonical) and content item (a.k.a. rebuilt) schemas
  • accepted values for c_type: pixel or pct

What do you think? (ping @simon-clematide)
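To make the proposal concrete, here is a minimal sketch of how a consumer could normalise both kinds of coordinates to pixels. It assumes that coordinates are ordered [x, y, width, height] and that the page dimensions in pixels are known; the function name `to_pixels` is hypothetical, and only the c_type values pixel and pct come from the proposal.

```python
def to_pixels(coords, c_type, page_width, page_height):
    """Return [x, y, w, h] in pixels for coordinates of type 'pixel' or 'pct'."""
    if c_type == "pixel":
        return [int(v) for v in coords]
    if c_type == "pct":
        x, y, w, h = coords
        # Percentages are relative to the page width (x, w) and height (y, h).
        return [
            round(x * page_width / 100),
            round(y * page_height / 100),
            round(w * page_width / 100),
            round(h * page_height / 100),
        ]
    raise ValueError(f"unknown c_type: {c_type!r}")


if __name__ == "__main__":
    # The pct example from the IIIF URL above, on a hypothetical 1000x2000 page.
    print(to_pixels([41.6, 7.5, 40, 70], "pct", 1000, 2000))
```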

schema naming

how about introducing some sub-directories in json so as to keep schema file names more compact?

e.g.:

json/topic_model/topic_assignment.schema.json

instead of topic_model_topic_assignment.schema.json ?

Plus we will need json/canonical, json/rebuilt, etc.

extension of schemas for classical commentaries

Hello @e-maud @simon-clematide 🙂 (and ping @sven-nm)

In our Ajax Multi-Commentary project we are starting to use the impresso text-importer + JSON schemas to convert the OCR of our commentaries into a canonical format, from which various other formats can then be rebuilt.

The very first thing that needs to be adapted are the JSON schemas. We will need an equivalent of the newspaper issue schema for each commentary (our atomic document unit), while the commentary page schema will require only minor adaptations (compared to the newspaper page schema). Overall, this extension will help us think how to make these impresso schemas more generic in the future.

We will make our changes (additions) in a branch of this repository, and then later on we can discuss with a PR how to better integrate our schemas with the others.

Update schema to match canonical formats updates

As part of impresso-text-acquisition's issue #117, it was decided to update some of the properties of the Page and Issue canonical schemas.
Additionally, based on impresso-text-acquisition's issue #74, another change might also be made to the schema.

Those changes are:

  • Add an optional string property iiif_manifest_uri to the Issue Canonical Schema
  • Add the optional c (coordinates) property to the Issue Canonical Schema (under item).
  • Add the optional property iiif_img_base_uri to the Page Canonical Schema
  • Document that the iiif property of the Page Canonical Schema is deprecated, and that iiif_img_base_uri should be used instead whenever it is present.
  • Add an optional reading order property to the Issue Canonical Schema (most probably under item.metadata). It will be needed for all issues for which the CI Id ordering does not follow a logical page-to-page order. Its placement might change after some discussions.

rethink content item types

Content item types are currently found in two schemas:

  • issue schema
  • content item schema

The controlled vocabulary currently contains the following values (i.e. types):

  • article
  • ad
  • image
  • table
  • death_notice
  • weather
  • page

The problem (from a modelling perspective) is that structural types (e.g. page, image) are mixed with semantic ones (e.g. article, weather). A cleaner solution would be to separate the two by a more precise modelling.
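One way to picture the separation is to tag each value of the vocabulary with a family. The sketch below classifies only the four types named as examples above; the remaining values are deliberately left unclassified, since how to assign them is exactly what needs discussing. All names in the code are hypothetical.

```python
# Current controlled vocabulary, as listed above.
ALL_TYPES = {"article", "ad", "image", "table", "death_notice", "weather", "page"}

# Only the examples given in the text are classified here.
STRUCTURAL_TYPES = {"page", "image"}
SEMANTIC_TYPES = {"article", "weather"}


def type_family(tp: str) -> str:
    """Return 'structural', 'semantic' or 'unclassified' for a content item type."""
    if tp not in ALL_TYPES:
        raise ValueError(f"not in the controlled vocabulary: {tp!r}")
    if tp in STRUCTURAL_TYPES:
        return "structural"
    if tp in SEMANTIC_TYPES:
        return "semantic"
    return "unclassified"


if __name__ == "__main__":
    for tp in sorted(ALL_TYPES):
        print(tp, "->", type_family(tp))
```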

schema for iconographic material

Objective: JSON schema to represent iconographic material of impresso archives

First sketch:

requirements:
- keep the link between illustration and associated article if present
- keep correct coordinates relative to IIIF

which structure:
- same as per text rebuilt: one file per iconographic content item ('ici') ?
- one file per issue, summarizing all ici per page ?

basic info
- ici id
- issue date
- coordinates
- versioning info
- title or caption if available
- link to related article

[topic schema] topic id

@simon-clematide thanks for the updated topic schema.
We still need a unique id for topics such as tm001_tp001_fr, as previously. Shall I build it from the available information, or will you provide it? If so, two padding zeros should be enough for the topic number, I guess?
I am building a small example in sandbox and had already added such topics descriptors to articles, but can easily change them.

schema for linguistic annotations

Hello @pstroe, @simon-clematide , @mromanello,

Many thanks @pstroe for this schema! Here come a few comments:

  • the schema could be in a folder other than newspapers, e.g. ling_annotations. As I understand it, the newspapers folder gathers the schemas that describe this object, while additional layers are kept apart (like topic_model).

  • the JSON schema is not valid JSON:
    - there is a problem with a missing } at the end
    - and with the timestamp pattern. This one works and is shorter:
    "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$"

  • the title of the P schema should be part of speech (copy-paste from previous)

  • would it be possible to have an example?

  • overall we should think of the future usage of this annotation. If for internal sharing purposes, I think it perfectly does the job. However, if we want to represent more spanning annotations and confidence levels, it will not be sufficient I think. E.g. for named entities we might have several system outputs.
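The timestamp pattern suggested above can be tried out quickly with Python's `re` module. This is only a sketch of what the pattern accepts; the helper name is hypothetical. Note that the pattern checks the shape of the string, not the value ranges, so an impossible date such as month 99 still matches.

```python
import re

# Pattern copied from the comment above.
TIMESTAMP = re.compile(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$")


def looks_like_timestamp(value: str) -> bool:
    """True if the string has the form YYYY-MM-DDThh:mm:ssZ."""
    return TIMESTAMP.match(value) is not None


if __name__ == "__main__":
    for candidate in ("2019-05-14T09:30:00Z", "2019-05-14 09:30:00", "2019-99-99T99:99:99Z"):
        print(candidate, looks_like_timestamp(candidate))
```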

Topic Assignments

@e-maud @mromanello

Not sure whether the topic assignments per document should use the "composed id keys" => ("tm001_tp01_fr") or simple "numeric topic keys".
Pro composed keys:

  • "globally" (across all topic models) unique
  • can be directly linked to current id format in topic_model_topic_description

Pro numeric topic keys

  • a lot less verbose (even a sparse topic model with only a handful of topics per document will lead to quite some data if we have millions of articles)
  • direct connection to numeric ids of topic model
  • easy to turn them into a vector format (topic id => vector index)

Bridging:

  • We could add a format string in the JSON that makes the connection between the numeric topic id and the global topic name explicit:
  • "tmtpid_format":"f"{topic_model}_tp{topic:0{math.ceil(math.log(number_of_topics,10))}d}_{lang}"
  • which would produce "tm001_tp02_fr" given, for instance, number_of_topics = 50, lang = "fr", topic_model = "tm001", topic = 2
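The bridging idea can be sketched in Python as below. The formula for the padding width is taken from the format string above; the function name `composed_topic_key` is hypothetical. With the formula as written, 50 topics give a padding width of 2 digits.

```python
import math


def composed_topic_key(topic_model: str, topic: int, number_of_topics: int, lang: str) -> str:
    """Build a globally unique topic key from a numeric topic id."""
    # Number of digits used to zero-pad the topic number.
    width = math.ceil(math.log(number_of_topics, 10))
    return f"{topic_model}_tp{topic:0{width}d}_{lang}"


if __name__ == "__main__":
    print(composed_topic_key("tm001", 2, 50, "fr"))
```

Keeping only numeric keys in the per-document assignments and deriving the composed key on demand is one way to combine the advantages of both options listed above.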

accommodating IIIF info in Page and Issue schemas

We need to make explicit in the Issue/Page schemas how to construct IIIF links to page images.

In the Issue schema, we currently treat images as yet another content item type (here below an example):

  • t (title) and l (language) refer to the caption/title of the image, if provided
  • images have type tp == image
  • iiif_link points to the image manifest (the fair assumption is that images never span multiple pages)
  • (the rest has same semantics as per schema)
{
  "m": {
    "id": "luxzeit1858-1858-12-1-a-i0018",
    "t": null,
    "l": "n/a",
    "tp": "image",
    "pp": [
      1
    ],
    "iiif_link": "https://iiif.eluxemburgensia.lu/iiif/2/ark:%2f70795%2fv1tvx5%2fpages%2f1/info.json"
  },
  "l": {
    "id": "MODSMD_PICT1"
  },
  "c": [
    1410,
    292,
    205,
    349
  ]
}

In the Page schema, I propose to add an iiif field for the manifest link of the page image as a top-level property, in addition to cdt, id and r, which we currently have.

Any thoughts/comments ?
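For illustration, this is how a link to the image region could be built from the two fields in the example above. The sketch assumes that c is ordered [x, y, width, height] in pixels and that the server follows the IIIF Image API 2 URL layout ({region}/{size}/{rotation}/{quality}.{format}); the function name is hypothetical.

```python
def iiif_region_url(info_json_url: str, coords, size: str = "full", fmt: str = "jpg") -> str:
    """Build an IIIF Image API URL for a rectangular region of a page image."""
    suffix = "/info.json"
    # The image base URI is the info.json URL without its last segment.
    base = info_json_url[: -len(suffix)] if info_json_url.endswith(suffix) else info_json_url
    region = ",".join(str(int(v)) for v in coords)
    return f"{base}/{region}/{size}/0/default.{fmt}"


if __name__ == "__main__":
    link = "https://iiif.eluxemburgensia.lu/iiif/2/ark:%2f70795%2fv1tvx5%2fpages%2f1/info.json"
    print(iiif_region_url(link, [1410, 292, 205, 349]))
```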

license is needed

...before we can share this with others. What about MIT? I'd slightly prefer it to a viral open source license like GPL, as it gives us more flexibility.
