glosario's Introduction

glosario

glosario is an open source glossary of terms used in data science that is available online and also as a library in both R and Python. By adding glossary keys to a lesson's metadata, authors can indicate what the lesson teaches, what learners ought to know before they start, and where they can go to find that knowledge. Authors can also use the library's functions to insert consistent hyperlinks for terms and definitions in their lessons in any of several (human) languages.

Contributing

You do not need to know any particular programming language to contribute to Glosario: anyone with a basic familiarity with the GitHub web interface can get involved! We have prepared a detailed and accessible guide for contributing, which has been translated into several languages. Contributions are welcome in any language, not only those represented in that document. If you need help with your contribution, feel free to ask questions on the #glosario Slack channel (if you are not a member of The Carpentries Slack, you can join by filling in this form).

Lessons

R Markdown and Jupyter notebooks allow authors to place structured metadata in files. We propose the following metadata (written as YAML):

glossary:
  sources:
  - http://some_glossary.org/something/
  language: fr
  requires:
  - aggregation_function
  - call_stack
  defines:
  - closure
  - name_collision
  1. The sources key is required.
    • It must introduce a list containing at least one URL.
    • Those URLs must resolve to glossaries as described in the next section.
    • Those glossaries are searched in order from first to last to find definitions.
  2. The language key is required, and must be a single ISO 639 language code (e.g., fr for French).
  3. The keys requires and defines are optional.
    • Either may introduce an empty list.
    • The values under these keys are keys into a shared glossary (discussed in the next section).
  4. We expect the terms identified under requires to be used without being defined in this lesson (i.e., the lesson author assumes users already know them).
  5. All of the terms identified under defines must be hyperlinked in the lesson.
    • The target of the hyperlink for the term's definition must be GLOSSARY_SITE#glossary_key, where GLOSSARY_SITE is one of the sites listed under the sources key and glossary_key is an exact match for one of the defines keys.
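The rules above can be sketched as a small check. This is only an illustration, not part of the glosario packages: the function name validate_lesson_metadata is hypothetical, the metadata is assumed to have been parsed into a Python dict already, and the check only looks at the shape of the values (it does not confirm that a URL resolves or that a language code is registered).

```python
from urllib.parse import urlparse


def validate_lesson_metadata(glossary):
    """Return a list of problems found in a lesson's parsed `glossary` mapping.

    Hypothetical helper: it checks shape only, not whether URLs resolve.
    """
    problems = []

    # Rule 1: `sources` is required and must list at least one URL.
    sources = glossary.get("sources")
    if not isinstance(sources, list) or not sources:
        problems.append("sources must be a list containing at least one URL")
    else:
        for url in sources:
            parts = urlparse(str(url))
            if parts.scheme not in ("http", "https") or not parts.netloc:
                problems.append(f"not an http(s) URL: {url!r}")

    # Rule 2: `language` is required and must be a single code such as "fr".
    language = glossary.get("language")
    looks_like_code = (
        isinstance(language, str)
        and language.isascii()
        and language.isalpha()
        and language.islower()
        and len(language) in (2, 3)
    )
    if not looks_like_code:
        problems.append("language must be a single ISO 639 code such as 'fr'")

    # Rule 3: `requires` and `defines` are optional and may be empty lists.
    for key in ("requires", "defines"):
        value = glossary.get(key)
        if value is not None and not isinstance(value, list):
            problems.append(f"{key} must be a list (it may be empty)")

    return problems


example = {
    "sources": ["http://some_glossary.org/something/"],
    "language": "fr",
    "requires": ["aggregation_function", "call_stack"],
    "defines": ["closure", "name_collision"],
}
print(validate_lesson_metadata(example))
```

Returning a list of messages rather than raising on the first problem lets an author see every issue in one pass.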

We will provide simple tools to check that all of the terms listed in a lesson's metadata are linked correctly in its body. We will also provide shortcuts to make it easy to create correctly formatted links, so that authors can write things like:

The computer uses a `r link('call stack', 'call_stack')` to keep track of function calls.
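For Python-based lessons, a comparable shortcut might look like the sketch below. The name glossary_link and its third argument are assumptions made for illustration; the only thing taken from the text above is the target format GLOSSARY_SITE#glossary_key.

```python
def glossary_link(text, key, site):
    """Return a Markdown link whose target is SITE#key (hypothetical helper)."""
    return f"[{text}]({site}#{key})"


sentence = (
    "The computer uses a "
    + glossary_link("call stack", "call_stack", "http://some_glossary.org/something/")
    + " to keep track of function calls."
)
print(sentence)
```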

Glossaries

Any site where glossary URLs resolve can be used as a glossary. As a working model, this project implements a glossary of terms used in data science and data engineering.

  1. The master copy of the glossary lives in glossary.yml. Its format is described below.
  2. This file is turned into a single-page GitHub Pages site using Jekyll.
  3. It is also turned into a Python package called glosario and an R package with the same name.

A glossary entry is structured like this:

- slug: cran
  ref:
    - base_r
    - tidyverse
  en:
    term: "Comprehensive R Archive Network"
    acronym: "CRAN"
    def: >
      A public repository of R [packages](#package).
  • The value associated with the slug key identifies the entry.
    • It must be unique within the glossary.
    • It must be in lower case and use only letters, digits, and the underscore (to be compatible with Jekyll's automatic slug creation).
    • It becomes the fragment identifier in the online version of the glossary.
  • The entry may have a ref key. If it is present, its value must be a list of identifiers of related terms in this glossary.
  • Every other top-level key must be an ISO 639 language code such as en or fr.
    • Every entry must have at least one such language section.
  • Within each language section for each term:
    • The value of term is the term being defined. This key must be present.
    • The key acronym is optional. If present, its value is the acronym for this term.
    • The value of def is the definition. This key must be present, and the value may contain local links to other terms in this glossary (i.e., links starting with #) and/or links to outside sources.
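As a rough illustration of these rules, the sketch below checks a single entry that has already been parsed into a Python dict. The function name validate_entry and the known_slugs argument are hypothetical and not part of the glosario packages; language keys are not checked against the ISO 639 registry here.

```python
import re

# Lower-case letters, digits, and underscores only.
SLUG_PATTERN = re.compile(r"[a-z0-9_]+")
NON_LANGUAGE_KEYS = {"slug", "ref"}


def validate_entry(entry, known_slugs):
    """Return a list of problems for one parsed glossary entry (hypothetical helper)."""
    problems = []

    slug = entry.get("slug")
    if not isinstance(slug, str) or not SLUG_PATTERN.fullmatch(slug):
        problems.append(f"bad slug: {slug!r}")

    # `ref` is optional; when present it must name other entries in the glossary.
    for ref in entry.get("ref") or []:
        if ref not in known_slugs:
            problems.append(f"{slug}: unknown ref {ref!r}")

    # Every other top-level key is treated as a language section.
    languages = [key for key in entry if key not in NON_LANGUAGE_KEYS]
    if not languages:
        problems.append(f"{slug}: no language section")
    for lang in languages:
        section = entry[lang]
        if not isinstance(section, dict):
            problems.append(f"{slug}/{lang}: section is not a mapping")
            continue
        for required in ("term", "def"):
            if not section.get(required):
                problems.append(f"{slug}/{lang}: missing {required}")

    return problems


cran = {
    "slug": "cran",
    "ref": ["base_r", "tidyverse"],
    "en": {
        "term": "Comprehensive R Archive Network",
        "acronym": "CRAN",
        "def": "A public repository of R [packages](#package).\n",
    },
}
print(validate_entry(cran, known_slugs={"cran", "base_r", "tidyverse"}))
```

Uniqueness of slugs has to be checked across the whole glossary, so it is left out of this per-entry check.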

Open issues

  1. Should we provide one function for interactive definition lookup that searches keys and terms, a separate function for each, or some kind of keyword arguments to control the scope of search?

  2. Should we integrate definition lookup with existing help systems? For example, should define('something') in RStudio put the definition in the help pane (and if so, should it hyperlink to terms that the definition depends on)?

Use Cases

  1. Linking to a definition.

    1. Amari writes a lesson in R Markdown that introduces some new terms.
    2. She has defined the language to be Spanish using the glossary/language key in the YAML header, but has not changed any other settings.
    3. She adds an inline code block `r gdef('linear-model', 'Linear models')` to her lesson.
    4. When she knits her document, the inline code block produces the HTML <a href="http://carpentries.org/glossary/es/#linear-model" class="glossary-definition">Linear Models</a>
  2. Checking a lesson.

    1. Beatriz has made some changes to a lesson she inherited from Amari, and wants to check that it is still consistent.
    2. She runs a command-line script that:
      1. Reads the R Markdown file.
      2. Extracts the terms under the glossary/defines key.
      3. Searches the body of the document for calls to gdef(...).
      4. Checks that every term listed in glossary/defines is referenced in the document body, and that every term referenced in the document body is mentioned in glossary/defines.
  3. Finding lessons.

    1. Amari writes a lesson in R Markdown. She adds the glossary key to its YAML metadata and indicates that the lesson requires the term correlation and defines the term regression.
    2. Beatriz is writing a lesson on linear models. She adds YAML metadata indicating that the lesson requires the term regression.
    3. To find prerequisite lessons she can recommend to her students, Beatriz runs a command-line script that:
      1. Uses rmarkdown::yaml_front_matter(filename) to read metadata from all of the lessons she has archived.
      2. Lists all of the lessons that state they define the term regression.
  4. Summarizing a lesson.

    1. Amari has written a lesson in R Markdown that includes YAML metadata stating that it defines correlation and causation.
    2. She adds a code chunk to the end of her lesson that includes a call to glosario::summarize_terms().
    3. When she knits the document to HTML, this code chunk inserts a definition list dl at that point. Its entries are the definitions of all of the terms listed under the glossary/defines key in the page's YAML header in alphabetical order by term according to the rules for glossary/language.
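The consistency check in the "Checking a lesson" use case could be sketched as follows. This is an assumption-laden illustration: the function check_lesson is hypothetical, the list under glossary/defines is assumed to have been extracted from the YAML header already, and gdef(...) calls are found with a simple regular expression rather than by parsing R code.

```python
import re

# Matches gdef('slug', ...) or gdef("slug", ...) and captures the slug.
GDEF_CALL = re.compile(r"""gdef\(\s*(['"])(?P<slug>[^'"]+)\1""")


def check_lesson(defines, body):
    """Compare the terms under glossary/defines with the gdef() calls in the body."""
    referenced = {match.group("slug") for match in GDEF_CALL.finditer(body)}
    declared = set(defines)
    return {
        "defined_but_not_linked": sorted(declared - referenced),
        "linked_but_not_defined": sorted(referenced - declared),
    }


body = "We fit `r gdef('linear-model', 'Linear models')` to the data."
report = check_lesson(["linear-model", "regression"], body)
print(report)
```

A lesson is consistent when both lists in the report are empty.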

FAQ

  • Why not just link to Wikipedia? We expect that many glossary definitions will do so, but Wikipedia articles are explanations, not definitions.

  • YAML is hard for people to edit—why not use something else for the glossary file? Because other formats are just as hard to edit (e.g., JSON) or make one-to-many relationships hard to express (e.g., CSV).

  • Why use Jekyll for the online version? It is the default for GitHub Pages.

Collaborators

SADiLaR is one of the collaborators in the finalisation and expansion of the Glosario Project to African Languages. SADiLaR is a research infrastructure established by the Department of Science and Innovation of the South African government as part of the South African Research Infrastructure Roadmap (SARIR).

Credits

glosario's People

Contributors

annajiat, baileythegreen, batoolmm, beatrizmilz, blacktack2, callumrollo, demar01, dsmits, elletjies, feddelegrand7, fmichonneau, froggleston, gvwilson, ian-flores, jduckles, jsteyn, konrad, leticiadasilva, marcosvital, mariehouillon, mounabelaid, nicoguaro, nongeso, npalopoli, timtomch, tobyhodges, tomkellygenetics, villares, vscharf7, zkamvar


glosario's Issues

2020-10-28: Using existing definitions

I was wondering whether there is any precedent for using existing definitions and citing where they come from, or whether all definitions must be completely new and written from your own knowledge.

26-11-20: Arabic terms for merged PR #218 and PR #223 aren't rendered in the website.

This is to report a bug in the glosario repo.

It seems that the recently merged PRs for the Arabic translation (PR #218 and PR #223) aren't rendered on the website. To examine the issue, I tried to build the website locally, but the build fails with the following error:

GitHub Metadata: No GitHub API authentication could be found. Some fields may be missing or have incorrect data.
  Liquid Exception: "<USERNAME>/<PROJECT>" is invalid as a repository identifier. Use the user/repo (String) format, or the repository ID (Integer), or a hash containing :repo and :user keys. in /_layouts/page.html
jekyll 3.9.0 | Error:  "<USERNAME>/<PROJECT>" is invalid as a repository identifier. Use the user/repo (String) format, or the repository ID (Integer), or a hash containing :repo and :user keys.

This may be related to the changes introduced in PR #207 for letter links. I also can't see the letter links in my Chrome browser.

My apologies if this report does not give enough details.

15/08/2020: Idea for Template - languages other than English

Hi!
I would like to share an idea.
I think it would be good for the Glosario pages in languages other than English to also show the term in English.

An example to illustrate: Gabriela speaks Portuguese. She is reading a tutorial about R in English and wants to know what a current working directory is. She opens the Glosario in Portuguese ( https://carpentries.github.io/glosario/pt/ ) and tries to find it. But in Portuguese this term is diretório de trabalho, so she does not find the definition. If Glosario also showed the term in English (because it is the language most used in tutorials, courses, and books), it would be easier to find.


I'm not sure if this is outside the scope of Glosario; it is just an idea that I think could make it easier for people to consult Glosario while studying DS.

Thank you!

Internal links not formatted by Jekyll

  1. make site
  2. View _site/index.html.

Internal links like [parameter](#paramter) are reproduced as-is rather than being turned into HTML links by the markdownify filter in Jekyll.

2020-10-31: Error vs Exception

The terms error and exception are both used in the glossary—which is fine—but they seem to be conflated and used interchangeably, probably due to a mix of authors.

Because the two terms are not equal, I propose either:

  • only including one, so terms like error_handling and exception_handler are consistent, or
  • clearly delineating the differences between the terms.

2020-09-28: DefinedTerm & DefinedTermSet for metadata

What's the rationale behind your suggestion of glossary metadata in the README vs. using existing standards such as DefinedTermSet and DefinedTerm from Schema.org? Or is the intention that tools would convert this YAML metadata description into HTML itemscope/itemprop/meta tags or a JSON-LD blob, which would then use these specifications, in the page created from the source file? (e.g. When the RMarkdown file is knitted to HTML.)

Local sync of glossary.yml

In carpentries/glosario-r#11, the mechanism for updating the glossary is governed by a github action that will update the internal glossary daily.

In carpentries/glosario-py#1, it is suggested to remove the glossary.yml file from the repo and have it dynamically built.

I think it would be a good idea to think about how we go about allowing people to synchronize the glossary locally so that we can decouple the data from the API.

My thoughts on how to go about this are largely centered around the patterns I see from R users and reproducibility (I admit that I do not know much about the python side of things):

  1. people don't update their packages that often or don't know how to update their packages.
  2. there is no clear indicator of what version of the glossary people have on their machines, so if it defaults to the one in the package, then definitions that exist in the global glossary may be missing in their local version.
  3. if {glosario} is released on CRAN, it will be updated every two months (as per CRAN's policy) at most, but the main glosario repository will be constantly updating.

These situations mean that if Belle installs the package on March 4th and Sebastian installs the package on July 17th, they will have two different versions of the glossary on their machines. Let's say they contribute a few new definitions to the main glossary on July 16th, but neither of them sees these definitions in their lessons because the package was last pushed to CRAN on July 1st.

I think it would be good to consider these situations before we release this to CRAN and coordinate with the python implementation so that we can reduce the friction that users see.

Modified from a comment originally posted by @zkamvar in carpentries/glosario-r#11 (comment)

2020-10-20: Syntax issues with glossary.yml

My son and I are developing a program to help me with the translation. However, in analysing the glossary.yml file programmatically we have come across some issues:

  1. Some def: > lines miss the >
  2. Some terms are not enclosed in quotes
  3. Some definitions are empty

In the case of the first two issues, the program will flag an error which I fix manually. In the third case, we discovered that the program omits the record so I have now added them back and inserted the text <fixit>. We can amend the program to do that automatically.
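A line-based check for the three problems listed above might look like the sketch below. It is not the program described in this issue; the function lint_glossary_text is hypothetical, it works on the raw text rather than on parsed YAML, and it assumes that a definition body is indented more deeply than its def: key.

```python
import re

TERM_LINE = re.compile(r"^\s*term:\s*(?P<rest>.*)$")
DEF_LINE = re.compile(r"^\s*def:\s*(?P<rest>.*)$")


def _indent(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def _block_is_empty(lines, index):
    """True if no non-blank, more deeply indented line follows lines[index]."""
    base = _indent(lines[index])
    for following in lines[index + 1:]:
        if following.strip():
            return _indent(following) <= base
    return True


def lint_glossary_text(text):
    """Return (line_number, message) pairs for the three problems (hypothetical helper)."""
    lines = text.splitlines()
    findings = []
    for number, line in enumerate(lines, start=1):
        term = TERM_LINE.match(line)
        if term:
            value = term.group("rest").strip()
            quoted = len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'"
            if not quoted:
                findings.append((number, "term is not enclosed in quotes"))
            continue
        definition = DEF_LINE.match(line)
        if definition:
            if not definition.group("rest").strip().startswith(">"):
                findings.append((number, "def: is not followed by >"))
            elif _block_is_empty(lines, number - 1):
                findings.append((number, "definition is empty"))
    return findings


# Illustrative input; demo_a and demo_b are made-up entries.
SAMPLE = """\
- slug: cran
  en:
    term: "Comprehensive R Archive Network"
    def: >
      A public repository of R packages.
- slug: demo_a
  en:
    term: Unquoted term
    def: Text on the same line.
- slug: demo_b
  en:
    term: "Empty"
    def: >
"""
for finding in lint_glossary_text(SAMPLE):
    print(finding)
```

Reporting line numbers instead of rewriting the file keeps the manual review step described above in place.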

Would anyone have a problem with us doing this? Or could anyone foresee any problems that these changes might cause?

Your input will be greatly appreciated :-)

Symlink ./glossary.yml to ./_data/glossary.yml to avoid duplication

This repo currently stores two versions of the glossary: one in ./glossary.yml (for editing) and one in ./_data/glossary.yml (because Jekyll only reads data files from _data, and we need that for rendering the online GitHub Pages version of the glossary).

  1. Should we replace the duplicate in ./_data/glossary.yml with a symlink to ../glossary.yml?
  2. Will this be OK on Windows? (Or does this not matter?)
  3. Should we rename the file glosario.yml for consistency's sake?
  4. What impact will this have on the build?

2020-10-31: Some glossary terms are overloaded terms (have other meanings in this context)

What is the appropriate way to address terms that have more than one meaning within the context represented by the scope of this glossary?

Element, for instance, is currently defined in relation to HTML/XML, but is also used to refer to the items inside of lists, sets, tuples, et cetera in different languages. Moreover, the definition for empty_vector currently reflects this second usage:

A vector that contains no elements. Empty vectors have a type such as logical or character, and are not the same as null.

What is the appropriate way to add multiple definitions? Can more than one def: > key be used per entry in the glossary?

This is not the only example where more than one meaning is possible, either.

Capitalize the first letter? ; Space between related terms

I'm not sure if I should ask this in an issue.

I have 2 small questions:

  • Is there a standard for whether the first letter of a term should be capitalized? (For example, should we write Git or git?) I would like to know so that my next contributions are correct.

  • Could we add a space after each related term? Here it looks like it's all together. Example: gitgithub

Thank you :)

[2020-12-01]: [Add Amharic as a language]

Hi!

Our colleagues in Ethiopia would like to contribute to the glossary! I am currently waiting for the blurb to be translated - will share as soon as I have received it.

Thank you!

2020-10-19: No bidirectional (BIDI) text support

As @BatoolMM brought up in #136:

I noticed that adding eng words within any Arabic sentence changes the order of the words when it appears on the site. The order of the words changes in comparison to the one in the glossary.yml. I tried to inspect the site (HTML), found it the same as glossary.yml. So, I tried removing the eng words from the translation. Personally, I think it's better to add some eng terms when explaining or defining the term. For example, when I mention python, I'd write the Arabic translation and between round bracket in English
بايثون (python)

The issue is well described in the w3 page on i18n.

There are resources for HTML and CSS solutions in https://w3c.github.io/typography/#bidi_text and https://drafts.csswg.org/css-writing-modes-3/#text-direction

From a little digging, we might be able to modify the template to use a yaml flag in the language pages to specify which direction the text should go using the <bdo> element.

From the first test, which works in all browsers, using the above text, we can see the difference:

This HTML code:

<blockquote>
<p dir="rtl">
  <bdo dir="ltr">
        التعبير المنطقي عبارة عن تعبير يُستخدم لإنشاء جمل إما
        <b>(true or false)</b> تحمل القيمة صح أو القية خطأ
        تُتستخدم التعبيرات المنطقية مع العبارات الشرطية في محركات البحث والخوارزميات
        وتُسمى التعبيرات المنطقية أيضا تعبيرات المقارنة والتعبيرات الشرطية والتعبيرات العلائقية
  </bdo>
</p>
<p dir="ltr">
  <bdo dir="rtl">
        التعبير المنطقي عبارة عن تعبير يُستخدم لإنشاء جمل إما
        <b>(true or false)</b> تحمل القيمة صح أو القية خطأ
        تُتستخدم التعبيرات المنطقية مع العبارات الشرطية في محركات البحث والخوارزميات
        وتُسمى التعبيرات المنطقية أيضا تعبيرات المقارنة والتعبيرات الشرطية والتعبيرات العلائقية
  </bdo>
</p>
</blockquote>

produces this text:

التعبير المنطقي عبارة عن تعبير يُستخدم لإنشاء جمل إما (true or false) تحمل القيمة صح أو القية خطأ تُتستخدم التعبيرات المنطقية مع العبارات الشرطية في محركات البحث والخوارزميات وتُسمى التعبيرات المنطقية أيضا تعبيرات المقارنة والتعبيرات الشرطية والتعبيرات العلائقية

التعبير المنطقي عبارة عن تعبير يُستخدم لإنشاء جمل إما (true or false) تحمل القيمة صح أو القية خطأ تُتستخدم التعبيرات المنطقية مع العبارات الشرطية في محركات البحث والخوارزميات وتُسمى التعبيرات المنطقية أيضا تعبيرات المقارنة والتعبيرات الشرطية والتعبيرات العلائقية

2020-09-10: "source" | "sources"

The README states

glossary:
  sources:
  - http://some_glossary.org/something/
  language: fr
  requires:
  - aggregation_function
  - call_stack
  defines:
  - closure
  - name_collision
  1. The source key is required.

which seems internally inconsistent. Should the lesson metadata have key source or sources?

spell checking

We can't run glossary.yml through a conventional spell check because it is multilingual. We therefore need something that will pull the content in a specific language and run the bodies through spell checking in that language (preferably after stripping Markdown interlinks).
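The extraction step could be sketched roughly as follows. The function text_for_language is hypothetical, the glossary is assumed to be parsed into a list of dicts already, and the spell checker itself is left out because the issue does not name one.

```python
import re

# [text](target) -> text; covers local (#slug) and external links alike.
MARKDOWN_LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")


def text_for_language(entries, lang):
    """Yield (slug, term, definition) for each entry that has a `lang` section."""
    for entry in entries:
        section = entry.get(lang)
        if not section:
            continue
        definition = MARKDOWN_LINK.sub(r"\1", section.get("def", ""))
        # Collapse the line breaks left by YAML folded scalars.
        yield entry["slug"], section.get("term", ""), " ".join(definition.split())


entries = [
    {
        "slug": "cran",
        "en": {
            "term": "Comprehensive R Archive Network",
            "def": "A public repository of R [packages](#package).\n",
        },
    }
]
for slug, term, definition in text_for_language(entries, "en"):
    print(slug, "|", term, "|", definition)
```

Keeping the slug with each piece of text makes it possible to report a misspelling against the entry it came from.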

Automating the translation workflows with Crowdin

I'm very grateful for the work y'all have done with glosario. I hope to find some fun uses for it.

Ultimately, it would be really nice if glosario could be used with GNU gettext language services. Over in python land this would be a dependency free solution for localization services; tools like sphinx and nikola provide translation services that glosario could improve.

I've noticed a few projects using Crowdin for translation services. jupyterlab is currently racing across a few languages. Crowdin provides a UI that eases managing translations and has GitHub Actions to integrate with a project. They handle the conventional .pot, .po, and .mo file extensions used by gettext.

2020-08-28: Standardizing adding maintainers

I think we need to standardize the process of adding maintainers by language. I suggest asking the first person who makes a contribution in that language, and if they decline, moving on to the second person, and so on. I'm open to suggestions or other structures, but at the moment I think this is the easiest and quickest.

2020-11-28: Ending of verbs in Japanese translations

I wonder if there is any rule on which ending style of verbs, either so-called 'desu-masu' or 'dearu', should be used in Japanese translations. It seems that translations of the lesson materials are in 'desu-masu' style. So should translations in Glosario also be in 'desu-masu' style? Thank you in advance for your advice.

[Py] Write tests

Currently, none of the methods implemented in glossary are tested.

2020-09-20: tooling needs for books and websites

I'm working on two book projects right now where I'm integrating Glosario. They have identical use cases:

  1. We extract glossary keys from our source files (regular expressions for the win).
  2. We combine the Glosario glossary with project-specific definitions.
  3. We use those keys to select entries from the combined glossary.
  4. We get the transitive closure using those entries as a starting set.
  5. We save the resulting set of entries in a project-specific format.

I think steps 1 and 5 will always be project specific, but we should provide tools for doing steps 2-4. More specifically:

  1. glosario should come with command-line tools that merge glossary files (step 2) and that select-and-expand a set of glossary entries given a set of slugs (steps 3-4). These can be written in any portable language; I vote for Python because it's preinstalled on MacOS and Linux, but R isn't.
  2. glosario-r and glosario-py should include functions that do steps 2-4 given either/both files and in-memory YAML.
  3. We should provide examples of %include files for common web templating systems like Jekyll that people can use as starting points for turning .yml data files in our glossary format into web pages. These should probably (?) live in the main glosario repository in an examples directory.
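Steps 2-4 could be sketched as below. Everything here is an assumption made for illustration: the function names are hypothetical, later glossaries are assumed to override earlier ones when slugs clash, and the transitive closure is assumed to follow both ref lists and local links (#slug) inside definitions.

```python
import re

# Local links look like [text](#slug).
LOCAL_LINK = re.compile(r"\]\(#([a-z0-9_]+)\)")


def merge_glossaries(*glossaries):
    """Step 2: combine lists of entries; later glossaries override earlier ones."""
    merged = {}
    for glossary in glossaries:
        for entry in glossary:
            merged[entry["slug"]] = entry
    return merged


def linked_slugs(entry):
    """Slugs an entry points at, via `ref` and via local links in its definitions."""
    found = set(entry.get("ref") or [])
    for section in entry.values():
        if isinstance(section, dict):
            found.update(LOCAL_LINK.findall(section.get("def", "")))
    return found


def select_and_expand(merged, slugs):
    """Steps 3-4: entries reachable from `slugs`; unknown slugs are skipped."""
    selected = {}
    pending = list(slugs)
    while pending:
        slug = pending.pop()
        if slug in selected or slug not in merged:
            continue
        selected[slug] = merged[slug]
        pending.extend(linked_slugs(merged[slug]))
    return selected


# Placeholder data: only the cran entry mirrors the README example.
shared = [
    {"slug": "cran", "ref": ["base_r", "tidyverse"],
     "en": {"term": "Comprehensive R Archive Network",
            "def": "A public repository of R [packages](#package)."}},
    {"slug": "package", "en": {"term": "package", "def": "placeholder"}},
    {"slug": "base_r", "en": {"term": "base R", "def": "placeholder"}},
]
project = [{"slug": "local_term", "en": {"term": "local term", "def": "placeholder"}}]

merged = merge_glossaries(shared, project)
print(sorted(select_and_expand(merged, ["cran"])))
```

Skipping slugs that are missing from the merged glossary keeps the expansion running; a real tool might prefer to report them instead.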

7-10-2020: Adding a definition in Arabic to the existing terms

It would be amazing if Arabic were also included in glosario. However, Arabic uses unique characters that might not be supported in glosario. If it's alright with you, I would be happy to add definitions in Arabic (ar) to the existing terms in the yml file.

2020-11-4: When building the Jekyll site locally, how can I get pages other than the main one to work?

Note: this is a problem I have seen with all Jekyll sites I have worked with (not a large number, but bigger than one), so it may be that I have done something wrong.

I was hoping to make and try out some changes, but the changes I wanted to try are at the individual language level, and those pages render on GitHub, but not locally. Whenever I google this issue, I can only find threads where people had the opposite problem, so I really don't know where to begin.

Any ideas?

2020-09-17: Markdown not processed in terms or cross-references

Modify glossary.yml to use code font and italics in a term:

- slug: abandonware
  en:
    term: "`abandonware` *is great*"
    def: >
      Software that is no longer being maintained.

The formatting characters are shown literally, i.e., Markdown is not translated in term names.
