

Schemasheets - make datamodels using spreadsheets



Create a data dictionary / schema for your data using simple spreadsheets - no coding required.

About

Schemasheets is a framework for managing your schema using spreadsheets (Google Sheets, Excel). It works by compiling down to LinkML, which can itself be compiled to a variety of formalisms or used for other purposes such as data validation.

Documentation

See the Schema Sheets Manual

Quick Start

pip install schemasheets

You should then be able to run the following commands:

  • sheets2linkml - Convert schemasheets to a LinkML schema
  • linkml2sheets - Convert a LinkML schema to schemasheets
  • sheets2project - Generate an entire set of schema files (JSON-Schema, SHACL, SQL, ...) from Schemasheets

As an example, take a look at the different tabs in the Google Sheet with ID 1wVoaiFg47aT9YWNeRfTZ8tYHN8s8PAuDx5i2HUcDpvQ.

The personinfo tab contains the bulk of the metadata elements:

| record | field | key | multiplicity | range | desc | schema.org |
|---|---|---|---|---|---|---|
| > class | slot | identifier | cardinality | range | description | exact_mappings: {curie_prefix: sdo} |
| > |  |  |  |  |  |  |
|  | id | yes | 1 | string | any identifier | identifier |
|  | description | no | 0..1 | string | a textual description | description |
| Person |  | n/a | n/a | n/a | a person, living or dead | Person |
| Person | id | yes | 1 | string | identifier for a person | identifier |
| Person, Organization | name | no | 1 | string | full name | name |
| Person | age | no | 0..1 | decimal | age in years |  |
| Person | gender | no | 0..1 | decimal | age in years |  |
| Person | has medical history | no | 0..* | MedicalEvent | medical history |  |
| Event |  |  |  |  | grouping class for events |  |
| MedicalEvent |  | n/a | n/a | n/a | a medical encounter |  |
| ForProfit |  |  |  |  |  |  |
| NonProfit |  |  |  |  |  |  |

This demonstration schema contains both record types (e.g. Person, MedicalEvent) and fields (e.g. id, age, gender).
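The multiplicity column above uses LinkML-style cardinality strings (1, 0..1, 0..*). As a rough illustration of how such strings map onto LinkML's required/multivalued slot flags, here is a hedged sketch; the function name is illustrative and not part of the schemasheets API:

```python
def parse_multiplicity(card: str):
    """Map a cardinality string such as '1', '0..1' or '0..*'
    to (required, multivalued) flags as LinkML uses them.

    Illustrative sketch only, not schemasheets' actual parser.
    """
    card = card.strip()
    if '..' in card:
        lo, hi = card.split('..', 1)
    else:
        lo, hi = card, card
    required = lo not in ('0', '')       # lower bound 0 means optional
    multivalued = hi in ('*', 'n')       # open upper bound means many
    return required, multivalued
```

For example, `parse_multiplicity('0..*')` yields an optional, multivalued slot.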

You can convert this to LinkML as follows:

sheets2linkml --gsheet-id 1wVoaiFg47aT9YWNeRfTZ8tYHN8s8PAuDx5i2HUcDpvQ personinfo types prefixes -o personinfo.yaml

This will generate a LinkML YAML file, personinfo.yaml, from 3 of the tabs in the Google Sheet.

You can also work directly with TSVs:

wget https://raw.githubusercontent.com/linkml/schemasheets/main/tests/input/personinfo.tsv 
sheets2linkml personinfo.tsv  -o personinfo.yaml

We recommend using COGS to synchronize your Google Sheets with local files using a git-like mechanism.

Finding out more


Issues

problematic urllib3 or chardet versions for new verbatim stuff

When running schemasheets/get_metaclass_slotvals.py or schemasheets/verbatim_sheets.py:

/Users/MAM/Library/Caches/pypoetry/virtualenvs/schemasheets-FMUhH2LU-py3.9/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
warnings.warn(

linkml2sheets not picking up on annotation's inner_keys

even with an inner_key specification:

slot display_hint
> slot annotations
> inner_key: display_hint
poetry run linkml2sheets specification.tsv \
  		--schema path/to/nmdc.yaml  \
  		--output-directory output \
  		--overwrite

generates

slot display_hint
> slot annotations
> inner_key: display_hint
ess dive datasets  
has credit associations {'display_hint': Annotation(tag='display_hint', value='Other researchers associated with this study.', extensions={}, annotations={})}
study image  
relevant protocols  
funding sources  
applied role  
applied roles {'display_hint': Annotation(tag='display_hint', value='Identify all CRediT roles associated with this contributor. CRediT Information: https://info.orcid.org/credit-for-research-contribution ; CRediT: https://credit.niso.org/', extensions={}, annotations={})}
applies to person  

etc.

linkml2sheets doesn't work when given a directory of templates

@putmantime and I have observed that running linkml2sheets on a directory of templates doesn't work, even when each individual template works on its own.

the linkml2sheets help gives this example:

linkml2sheets -s my_schema.yaml sheets/*.tsv -d sheets --overwrite

In the nmdc-schema repo, the following two work

schemasheets/tsv_output/slots.tsv: clean_schemasheets
	linkml2sheets \
		--schema src/schema/nmdc.yaml \
		--output-directory schemasheets/tsv_output/ \
		schemasheets/templates/slots.tsv

schemasheets/tsv_output/classes.tsv: clean_schemasheets
	linkml2sheets \
		--schema src/schema/nmdc.yaml \
		--output-directory schemasheets/tsv_output/ \
		schemasheets/templates/classes.tsv

but this doesn't work

schemasheets/tsv_output/all.tsv: clean_schemasheets
	linkml2sheets \
		--schema src/schema/nmdc.yaml \
		--output-directory schemasheets/tsv_output/ \
		schemasheets/templates/*.tsv

Even though

ls -l schemasheets/templates 

-rw-r--r--@ 1 MAM staff 71 Aug 16 17:22 classes.tsv
-rw-r--r--@ 1 MAM staff 58 Aug 16 17:26 prefixes.tsv
-rw-r--r--@ 1 MAM staff 2005 Aug 16 18:01 slots.tsv

The error is

Traceback (most recent call last):
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MTtWF7zd-py3.9/bin/linkml2sheets", line 8, in <module>
sys.exit(export_schema())
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MTtWF7zd-py3.9/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MTtWF7zd-py3.9/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MTtWF7zd-py3.9/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MTtWF7zd-py3.9/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MTtWF7zd-py3.9/lib/python3.9/site-packages/schemasheets/schema_exporter.py", line 297, in export_schema
exporter.export(sv, specification=f, to_file=outpath)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MTtWF7zd-py3.9/lib/python3.9/site-packages/schemasheets/schema_exporter.py", line 90, in export
writer.writerow(row)
File "/usr/local/Cellar/python@3.9/3.9.13_2/Frameworks/Python.framework/Versions/3.9/lib/python3.9/csv.py", line 154, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "/usr/local/Cellar/python@3.9/3.9.13_2/Frameworks/Python.framework/Versions/3.9/lib/python3.9/csv.py", line 149, in _dict_to_list
raise ValueError("dict contains fields not in fieldnames: "
ValueError: dict contains fields not in fieldnames: 'class'
make: *** [schemasheets/tsv_output/all.tsv] Error 1
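The ValueError arises because csv.DictWriter is strict by default: writerow raises when a row dict contains a key (here 'class') that is missing from the template's fieldnames. A minimal sketch of the failure mode and a possible workaround using the standard library's extrasaction='ignore' option (a demonstration of the mechanism, not a patch to schemasheets):

```python
import csv
import io

# Columns declared in one template; the row carries an extra 'class' key,
# mirroring what happens when templates with different columns are mixed.
fieldnames = ['slot', 'description']
row = {'slot': 'id', 'description': 'an id', 'class': 'Person'}

buf = io.StringIO()
# extrasaction='ignore' silently drops keys not in fieldnames,
# instead of raising "dict contains fields not in fieldnames".
writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter='\t',
                        extrasaction='ignore')
writer.writeheader()
writer.writerow(row)
```

With the default extrasaction='raise', the same writerow call reproduces the traceback above.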

add a column for directing rows to different YAML files

The NMDC and MIxS models (and presumably many more) consist of several YAML files.

In order to support faithful round-tripping, we could add a column that specifies that a row from a template should go to a particular YAML file when running sheets2linkml
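One possible shape for this feature: each row carries a column naming its destination file, and rows are grouped by that value before emission. A minimal sketch, assuming a hypothetical yaml_file column (both the column name and the function are illustrative, not existing schemasheets behavior):

```python
from collections import defaultdict

def partition_rows(rows, file_column='yaml_file', default='main.yaml'):
    """Group template rows by a (hypothetical) column naming the
    destination YAML file, so each group can be written separately."""
    groups = defaultdict(list)
    for row in rows:
        # Rows without an explicit target fall back to the default file.
        target = row.get(file_column) or default
        groups[target].append(row)
    return dict(groups)
```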

Three tests failing on Mark's laptop but not in GH Actions or on several other people's computers

FAILED                               [ 30%]
test_schema_exporter.py:174 (test_types)
self = SchemaMaker(schema=SchemaDefinition(name='TEMP', id_prefixes=[], definition_uri=None, local_names={}, conforms_to=None...), element_map=None, metamodel=None, cardinality_vocabulary=None, default_name=None, unique_slots=None, gsheet_id=None)
file_name = '/Users/MAM/Documents/gitrepos/schemasheets/tests/output/mini.tsv'
delimiter = '\t'

    def merge_sheet(self, file_name: str, delimiter='\t') -> None:
        """
        Merge information from the given schema sheet into the current schema
    
        :param file_name: schema sheet
        :param delimiter: default is tab
        :return:
        """
        logging.info(f'READING {file_name} D={delimiter}')
        #with self.ensure_file(file_name) as tsv_file:
        #    reader = csv.DictReader(tsv_file, delimiter=delimiter)
        with self.ensure_csvreader(file_name, delimiter=delimiter) as reader:
            schemasheet = SchemaSheet.from_dictreader(reader)
            line_num = schemasheet.start_line_number
            # TODO: check why this doesn't work
            #while rows and all(x for x in rows[-1] if not x):
            #    print(f'TRIMMING: {rows[-1]}')
            #    rows.pop()
            logging.info(f'ROWS={len(schemasheet.rows)}')
            for row in schemasheet.rows:
                try:
>                   self.add_row(row, schemasheet.table_config)

../schemasheets/schemamaker.py:105: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = SchemaMaker(schema=SchemaDefinition(name='TEMP', id_prefixes=[], definition_uri=None, local_names={}, conforms_to=None...), element_map=None, metamodel=None, cardinality_vocabulary=None, default_name=None, unique_slots=None, gsheet_id=None)
row = {'Desc': 'my string', 'Extends': 'string', 'Type': '', 'base': '', ...}
table_config = TableConfig(name=None, columns={'Type': ColumnConfig(name='Type', maps_to='type', settings=ColumnSettings(curie_prefix...], all_of=[]), is_element_type=None)}, column_by_element_type={'type': 'Type'}, metatype_column=None, name_column=None)

    def add_row(self, row: Dict[str, Any], table_config: TableConfig):
>       for element in self.row_focal_element(row, table_config):

../schemasheets/schemamaker.py:111: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = SchemaMaker(schema=SchemaDefinition(name='TEMP', id_prefixes=[], definition_uri=None, local_names={}, conforms_to=None...), element_map=None, metamodel=None, cardinality_vocabulary=None, default_name=None, unique_slots=None, gsheet_id=None)
row = {'Desc': 'my string', 'Extends': 'string', 'Type': '', 'base': '', ...}
table_config = TableConfig(name=None, columns={'Type': ColumnConfig(name='Type', maps_to='type', settings=ColumnSettings(curie_prefix...], all_of=[]), is_element_type=None)}, column_by_element_type={'type': 'Type'}, metatype_column=None, name_column=None)
column = None

    def row_focal_element(self, row: Dict[str, Any], table_config: TableConfig,
                          column: COL_NAME = None) -> Generator[None, Element, None]:
        """
        Each row must have a single focal element, i.e the row is about a class, a slot, an enum, ...
    
        :param row:
        :param table_config:
        :return:
        """
        vmap = {}
        main_elt = None
        if table_config.metatype_column:
            tc = table_config.metatype_column
            if tc in row:
                typ = self.normalize_value(row[tc], table_config.columns[tc])
                if not table_config.name_column:
                    raise ValueError(f'name column must be set when type column ({tc}) is set; row={row}')
                name_val = row[table_config.name_column]
                if not name_val:
                    raise ValueError(f'name column must be set when type column ({tc}) is set')
                if typ == 'class':
                    vmap[T_CLASS] = [self.get_current_element(ClassDefinition(name_val))]
                elif typ == 'slot':
                    vmap[T_SLOT] = [self.get_current_element(SlotDefinition(name_val))]
                else:
                    raise ValueError(f'Unknown metatype: {typ}')
        if table_config.column_by_element_type is None:
            raise ValueError(f'No table_config.column_by_element_type')
        for k, elt_cls in tmap.items():
            if k in table_config.column_by_element_type:
                col = table_config.column_by_element_type[k]
                if col in row:
                    v = self.normalize_value(row[col])
                    if v:
                        if '|' in v:
                            vs = v.split('|')
                        else:
                            vs = [v]
                        if elt_cls == Prefix:
                            if len(vs) != 1:
                                raise ValueError(f'Cardinality of prefix col must be 1; got: {vs}')
                            pfx = Prefix(vs[0], 'TODO')
                            self.schema.prefixes[pfx.prefix_prefix] = pfx
                            vmap[k] = [pfx]
                        elif elt_cls == SchemaDefinition:
                            if len(vs) != 1:
                                raise ValueError(f'Cardinality of schema col must be 1; got: {vs}')
                            self.schema.name = vs[0]
                            vmap[k] = [self.schema]
                        else:
                            vmap[k] = [self.get_current_element(elt_cls(v)) for v in vs]
        def check_excess(descriptors):
            diff = set(vmap.keys()) - set(descriptors + [T_SCHEMA])
            if len(diff) > 0:
                raise ValueError(f'Excess slots: {diff}')
        if column:
            cc = table_config.columns[column]
            if cc.settings.applies_to_class:
                if T_CLASS in vmap and vmap[T_CLASS]:
                    raise ValueError(f'Cannot use applies_to_class in class-focused row')
                else:
                    cls = self.get_current_element(ClassDefinition(cc.settings.applies_to_class))
                    vmap[T_CLASS] = [cls]
        if T_SLOT in vmap:
            check_excess([T_SLOT, T_CLASS])
            if len(vmap[T_SLOT]) != 1:
                raise ValueError(f'Cardinality of slot field must be 1; got {vmap[T_SLOT]}')
            main_elt = vmap[T_SLOT][0]
            if T_CLASS in vmap:
                # TODO: attributes
                c: ClassDefinition
                for c in vmap[T_CLASS]:
                    #c: ClassDefinition = vmap[T_CLASS]
                    if main_elt.name not in c.slots:
                        c.slots.append(main_elt.name)
                    if self.unique_slots:
                        yield main_elt
                    else:
                        c.slot_usage[main_elt.name] = SlotDefinition(main_elt.name)
                        main_elt = c.slot_usage[main_elt.name]
                        yield main_elt
            else:
                yield main_elt
        elif T_CLASS in vmap:
            check_excess([T_CLASS])
            for main_elt in vmap[T_CLASS]:
                yield main_elt
        elif T_ENUM in vmap:
            check_excess([T_ENUM, T_PV])
            if len(vmap[T_ENUM]) != 1:
                raise ValueError(f'Cardinality of enum field must be 1; got {vmap[T_ENUM]}')
            this_enum: EnumDefinition = vmap[T_ENUM][0]
            if T_PV in vmap:
                for pv in vmap[T_PV]:
                    #pv = PermissibleValue(text=v)
                    this_enum.permissible_values[pv.text] = pv
                    yield pv
            else:
                yield this_enum
        elif T_PREFIX in vmap:
            for main_elt in vmap[T_PREFIX]:
                yield main_elt
        elif T_TYPE in vmap:
            for main_elt in vmap[T_TYPE]:
                yield main_elt
        elif T_SUBSET in vmap:
            for main_elt in vmap[T_SUBSET]:
                yield main_elt
        elif T_SCHEMA in vmap:
            for main_elt in vmap[T_SCHEMA]:
                yield main_elt
        else:
>           raise ValueError(f'Could not find a focal element for {row}')
E           ValueError: Could not find a focal element for {'Type': '', 'base': '', 'uri': '', 'Desc': 'my string', 'Extends': 'string'}

../schemasheets/schemamaker.py:318: ValueError

The above exception was the direct cause of the following exception:

    def test_types():
        """
        tests a specification that is dedicated to types
        """
        sb = SchemaBuilder()
        schema = sb.schema
        # TODO: add this functionality to SchemaBuilder
        t = TypeDefinition('MyString', description='my string', typeof='string')
        schema.types[t.name] = t
>       _roundtrip(schema, TYPES_SPEC)

test_schema_exporter.py:184: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_schema_exporter.py:94: in _roundtrip
    schema2 = sm.create_schema(MINISHEET)
../schemasheets/schemamaker.py:61: in create_schema
    self.merge_sheet(f, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = SchemaMaker(schema=SchemaDefinition(name='TEMP', id_prefixes=[], definition_uri=None, local_names={}, conforms_to=None...), element_map=None, metamodel=None, cardinality_vocabulary=None, default_name=None, unique_slots=None, gsheet_id=None)
file_name = '/Users/MAM/Documents/gitrepos/schemasheets/tests/output/mini.tsv'
delimiter = '\t'

    def merge_sheet(self, file_name: str, delimiter='\t') -> None:
        """
        Merge information from the given schema sheet into the current schema
    
        :param file_name: schema sheet
        :param delimiter: default is tab
        :return:
        """
        logging.info(f'READING {file_name} D={delimiter}')
        #with self.ensure_file(file_name) as tsv_file:
        #    reader = csv.DictReader(tsv_file, delimiter=delimiter)
        with self.ensure_csvreader(file_name, delimiter=delimiter) as reader:
            schemasheet = SchemaSheet.from_dictreader(reader)
            line_num = schemasheet.start_line_number
            # TODO: check why this doesn't work
            #while rows and all(x for x in rows[-1] if not x):
            #    print(f'TRIMMING: {rows[-1]}')
            #    rows.pop()
            logging.info(f'ROWS={len(schemasheet.rows)}')
            for row in schemasheet.rows:
                try:
                    self.add_row(row, schemasheet.table_config)
                    line_num += 1
                except ValueError as e:
>                   raise SchemaSheetRowException(f'Error in line {line_num}, row={row}') from e
E                   schemasheets.schemamaker.SchemaSheetRowException: Error in line 2, row={'Type': '', 'base': '', 'uri': '', 'Desc': 'my string', 'Extends': 'string'}

../schemasheets/schemamaker.py:108: SchemaSheetRowException

Generated linkml schema has title in range instead of name

While generating the LinkML schema for GA4GH VA in the ga4gh-va repo, we noticed (see the source schema) that the range for some induced slots was populated with the element's title, i.e., a space-separated name, rather than its name.

For example: for induced slot variability in the schema, range gets populated as Data Item instead of DataItem.

CC: @gaurav

Improve #23 with a test

Improve #23 (modelling of example values) with a test (as opposed to the current illustration in the Makefile)

cast minimum_value and maximum_value to int

When we add numeric values for the minimum_value and maximum_value properties, say 0 and 999, they are written to the LinkML file as strings rather than numbers.
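A sketch of the kind of coercion being requested: treat cells that parse as numbers as numbers, and leave everything else as strings (the function name is illustrative, not part of schemasheets):

```python
def coerce_numeric(value):
    """Coerce a spreadsheet cell that looks numeric into int/float,
    leaving everything else as a string.

    Illustrative sketch of the requested behavior, not the fix itself.
    """
    s = str(value).strip()
    try:
        return int(s)          # '0' -> 0, '999' -> 999
    except ValueError:
        try:
            return float(s)    # '1.5' -> 1.5
        except ValueError:
            return value       # non-numeric cells pass through unchanged
```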

Feature Requests for GA4GH-VA schema Web Docs

Summarizing requests related to Web documentation content and format in this ticket. Providing as one long list for now, but happy to break out into tickets for specific feature requests as needed. @sujaypatil96 @sierra-moxon hope we can coordinate soon on these!

Content/Sections I’d like to see in each Class page (in the following order):

  1. Definition:
    a. already provided, and looks fine
    b. content comes from the s/s "description" field.

  2. UML-style diagram:
    a. already provided using YUML, but I find these YUML diagrams hard to read and not all that useful.
    b. It sounds like a new framework will be used to generate diagrams in the near future, so I will hold off on any requests here until I see how the new diagrams look.

  3. Parents:
    a. already present as a section on the page, and looks fine

  4. Description:
    a. A new section with the title "Description".
    b. This should contain content in the 'comments' field of the s/s. Ideally as a bulleted list of sentences rather than one long paragraph/block of text, for improved readability.
    c. At present, text from the 'comments' column is in a table at the end of each Class page - but I’d like it front and center, directly under the Definition.

  5. Implementation and Use:
    a. A new section with the title "Implementation and Use"
    b. content would ideally be derived from the s/s - but I'm not sure how to do this in practice . . . I hear that the Annotations feature might let me just create a new 'Implementation and Use' column and give it whatever name I want. Not sure what tooling would be needed to generate a proper section in the Class web page that holds the content.
    c. I'd also want this presented as a bulleted list of sentences/short paragraphs, rather than one long blob of text.

  6. Own Attributes:
    a. This section already exists in each Class page
    b. content of course comes from the s/s
    c. prefer 'expanded' form - not tables - as this better accommodates the types and amount of text I want to provide in describing each attribute. (see below)
    d. don't think we need the class -> attribute pattern for 'own' attributes (no need for class context when you are already on the class page and the section says 'own')

  7. Inherited Attributes
    a. This section already exists in each Class page
    b. content generated from s/s but pulling in all attributes from parents of a given class

  8. Data Examples:
    a. A new section called "Data Examples"
    b. content would be nicely formatted YAML or JSON data examples - e.g. like those in the VRS RTD docs - ideally with some lead-in text that describes what is being represented (but this could be part of the data example text block, as a # comment preceding the data itself)
    c. Chris suggested housing these in a 'Data Examples' directory in the repo - and pulling relevant examples into a Class web page from these example files automatically. These data examples could then serve multiple purposes (documentation, testing/validation, etc.)

Content/Fields I’d like to see in for each Attribute of a class, as shown in a Class page

  1. The attribute name, description, cardinality, and range are already provided and look good as is.
  2. I’d also like to include a 'Comments:' field that holds text from the ‘comments’ column in the s/s - to provide additional clarification on meaning and usage of an attribute.

Add default template and ability to derive templates

schemasheets is powerful and flexible with its template mechanism

It would be useful to have some standard templates:

  • for people to start filling in de-novo schemas
  • for use as a starting point for linkml2sheets

These could be standard TSVs distributed along with the PyPI package, with convenient commands for seeding files from them.

When going from an existing linkml schema it might be useful to also autogenerate a template that includes all used metaslots

See also

no such option: -d

% poetry run sheets2linkml --help

gives the output below, but the script is called sheets2linkml, not schemasheets, and the -d option doesn't seem to be implemented:

/usr/local/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
Usage: sheets2linkml [OPTIONS] [TSV_FILES]...

Convert schemasheets to a LinkML schema

schemasheets -d . my_schema/*tsv

Options:
-o, --output FILENAME output file
-v, --verbose
--help Show this message and exit.

Add ignore rows feature

I love the ignore column specification. Is there some way to ignore rows? That would help illustrate content from an upstream provider that is being excluded from the model.

Could the metatype specification be repurposed to allow for ignoring rows? (If it doesn't support that already?)
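One possible shape for this feature, sketched here with a hypothetical row-level ignore marker in the first cell (the marker and the function name are illustrative, not existing schemasheets behavior):

```python
def filter_rows(rows, ignore_marker='#'):
    """Skip rows whose first cell starts with a (hypothetical) ignore
    marker, mirroring the existing column-level 'ignore' feature."""
    kept = []
    for row in rows:
        # Inspect the first cell of the row (dicts preserve column order).
        first = str(next(iter(row.values()), ''))
        if first.startswith(ignore_marker):
            continue
        kept.append(row)
    return kept
```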

Issue locating module 'fairstructure'

I installed the latest using pip install schemasheets and when attempting to execute sheets2project as per the README, I get the following:

ModuleNotFoundError: No module named 'fairstructure'

This was using the example TSVs from the repo, sheets2project -d . examples/input/*.tsv, but the same behavior is seen when just calling sheets2project.

Column designator "type" conflicts with LinkML 1.3 metaslot "type"

Given

Type
> type

"type" is a reserved word for schemasheets, a shorthand for stating that this is the name of a TypeDefinition.

In LinkML 1.3, "type" is introduced as a metaslot, which could cause ambiguity.

Proposal: in the rare cases where disambiguation is required, use metaslot.type.

See #74

put structured_pattern values from sheets2linkml in the syntax sub-slot

sheets2linkml does honor structured_pattern column specifications

slot structured_pattern
name {firstname} {lastname}

but just serializes them like this:

structured_pattern: {firstname} {lastname}

I also tried using syntax as the column specification but got an error.

My minimal desired outcome is that the cell contents are placed in a structured pattern's syntax slot:

structured_pattern:
  syntax: {firstname} {lastname}

It would be nice to allow specifications for interpolated and partial_match, too
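The desired transformation could be sketched as a small post-processing step that moves a flat structured_pattern string into its syntax sub-slot (illustrative only, not the schemasheets implementation):

```python
def wrap_structured_pattern(slot: dict):
    """If a slot carries a flat structured_pattern string, nest it
    under the 'syntax' key, as the LinkML metamodel expects.

    Sketch of the desired outcome; interpolated/partial_match could
    be added to the nested dict the same way.
    """
    val = slot.get('structured_pattern')
    if isinstance(val, str):
        slot['structured_pattern'] = {'syntax': val}
    return slot
```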

linkml2sheets alternative?

I have found that the experimental linkml2sheets can't sheetify several of the elements and attributes I care about, and it can't seem to do even a minimal dump of complex/large schemas like MIxS. I have written some code that approximates a LinkML/sheets round trip on the following metaclasses:

See the Makefile in the turbomam/linkml-abuse project.

  • annotations
  • class_definitions
  • enum_definitions
  • prefixes
  • schema_definitions
  • slot_definitions
  • subset_definitions
  • type_definitions

The prefixes and subsets don't seem to include the content of imports

I'm not using a template to determine what gets written to the sheets. I'm iterating over all slots, except for the skipped slots listed below.

If my code were going to be included in any LinkML repo, it would need refactoring for performance and readability. I can do some of that. Even as it is, I have already used this for QC'ing the MIxS schema and plan to use it for round-tripping the NMDC submission portal schema (within sheets_and_friends).

There are some minor systematic changes between the before and after schemas. That's crudely reported in target/roundtrip.yaml

skipped slots:

  • all_of
  • alt_descriptions
  • annotations
  • any_of
  • attributes
  • classes
  • classification_rules
  • default_curi_maps
  • enum_range
  • enums
  • exactly_one_of
  • extensions
  • from_schema
  • implicit_prefix
  • imports
  • local_names
  • name
  • none_of
  • prefixes
  • rules
  • slot_definitions
  • slot_usage
  • slots
  • structured_aliases
  • subsets
  • unique_keys
  • type_uri

Option to not write "from_schema" slots in sheets2linkml rendered yaml

It doesn't appear to be possible to silence the "from_schema" slots in the sheets2linkml command.
They are very redundant and cause the LinkML YAML to balloon.
It would be great to have a parameter where this could be toggled on or off depending on how suitable it is for the given model.

Example:

  laboratory_procedure:
    name: laboratory_procedure
    from_schema: https://w3id.org/include_portal_v1_schema
  parent_sample_id:
    name: parent_sample_id
    from_schema: https://w3id.org/include_portal_v1_schema
  parent_sample_type:
    name: parent_sample_type
    from_schema: https://w3id.org/include_portal_v1_schema
  sample_availability:
    name: sample_availability
    from_schema: https://w3id.org/include_portal_v1_schema
  sample_id:
    name: sample_id
    from_schema: https://w3id.org/include_portal_v1_schema
  sample_type:
    name: sample_type
    from_schema: https://w3id.org/include_portal_v1_schema
  volume:
    name: volume
    from_schema: https://w3id.org/include_portal_v1_schema
  volume_unit:
    name: volume_unit
    from_schema: https://w3id.org/include_portal_v1_schema
  access_url:
    name: access_url
    from_schema: https://w3id.org/include_portal_v1_schema
  data_access:
    name: data_access
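One possible shape for this option, sketched as a post-processing step over the slot dictionaries (illustrative; prune_from_schema is a hypothetical helper, not an existing CLI flag):

```python
def prune_from_schema(slots: dict, schema_id: str):
    """Drop from_schema entries that merely repeat the schema's own id,
    the redundant case shown in the example above.

    Hypothetical post-processing sketch, not a schemasheets feature.
    """
    for slot in slots.values():
        if slot.get('from_schema') == schema_id:
            slot.pop('from_schema')
    return slots
```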

unintuitively, non-string values require protection by a leading `'` in sheets2linkml gsheet-id mode

This works

sheets2linkml \
		--output $@ \
		--gsheet-id 1zsxvjvifDcmkt72v9m1_VKa2m73_THDJapJYK6dqidw core 

In that sheet, I protected numerical values and Booleans in the examples column by preceding them with '. I think the same is required for dates, and the affirmative Boolean value must be represented as 'true, not the magic value TRUE.

But switch term MIXS:0000001's example to 555, and you get

sheets2linkml \
		--output $@ \
		--gsheet-id 1zsxvjvifDcmkt72v9m1_VKa2m73_THDJapJYK6dqidw core_example_555_num 

Traceback (most recent call last):
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/schemasheets/schemamaker.py", line 105, in merge_sheet
self.add_row(row, schemasheet.table_config)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/schemasheets/schemamaker.py", line 111, in add_row
for element in self.row_focal_element(row, table_config):
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/schemasheets/schemamaker.py", line 233, in row_focal_element
raise ValueError(f'No table_config.column_by_element_type')
ValueError: No table_config.column_by_element_type

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/bin/sheets2linkml", line 8, in <module>
sys.exit(convert())
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/schemasheets/schemamaker.py", line 578, in convert
schema = sm.create_schema(list(tsv_files))
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/schemasheets/schemamaker.py", line 61, in create_schema
self.merge_sheet(f, **kwargs)
File "/Users/MAM/Library/Caches/pypoetry/virtualenvs/mixs-linkml-GchukLmP-py3.9/lib/python3.9/site-packages/schemasheets/schemamaker.py", line 108, in merge_sheet
raise SchemaSheetRowException(f'Error in line {line_num}, row={row}') from e
schemasheets.schemamaker.SchemaSheetRowException: Error in line 1, row={'Structured comment name > slot > >': 'samp_size', 'Item (rdfs:label) title ': 'amount or size of sample collected', 'Definition description ': 'The total amount or size (volume (ml), mass (g) or area (m2) ) of sample collected.', 'Expected value annotations inner_key: expected_value': 'measurement value', 'Value syntax structured_pattern ': '{float} {unit}', 'Example examples internal_separator: "|"': '555', 'Section slot_group ': 'nucleic acid sequence source', 'migs_eu annotations applies_to_class: migs_eu inner_key: cardinality': 'X', 'migs_ba annotations applies_to_class: migs_ba inner_key: cardinality': 'X', 'migs_pl annotations applies_to_class: migs_pl inner_key: cardinality': 'X', 'migs_vi annotations applies_to_class: migs_vi inner_key: cardinality': 'X', 'migs_org annotations applies_to_class: migs_org inner_key: cardinality': 'X', 'mims annotations applies_to_class: mims inner_key: cardinality': 'C', 'mimarks_s annotations applies_to_class: mimarks_s inner_key: cardinality': 'C', 'mimarks_c annotations applies_to_class: mimarks_c inner_key: cardinality': 'X', 'misag annotations applies_to_class: misag inner_key: cardinality': 'C', 'mimag annotations applies_to_class: mimag inner_key: cardinality': 'C', 'miuvig annotations applies_to_class: miuvig inner_key: cardinality': 'C', 'Preferred unit annotations inner_key: preferred_unit': 'millliter, gram, milligram, liter', 'Occurrence multivalued vmap: {s: false, m: true}': 's', 'MIXS ID slot_uri ': 'MIXS:0000001', 'MIGS ID (mapping to GOLD) annotations inner_key: gold_migs_id': ''}
make: *** [generated/MIxS6_from_gsheet_templates_bad.yaml] Error 1

schemasheets export functionality missing linkml column descriptors

The schemasheets functionality that exports a provided LinkML schema to a schemasheets-specific TSV file, based on a specification TSV file, is incomplete: it does not write the second row containing the LinkML column descriptors to the output TSV file as expected.

export_spec.tsv:

```
Class	Field	Description	Key	Range
>class	slot	description	identifier	range
```

Run the export command as follows:

```bash
linkml2sheets ~/path/to/export_spec.tsv -s tests/input/personinfo.yaml -o personinfo.tsv
```

The output file is missing the second row with the LinkML column descriptors.
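For reference, the expected output would repeat both header rows from the spec before the data rows; an illustrative fragment (data-row values taken from the personinfo example, exact formatting may differ):

```
Class	Field	Description	Key	Range
>class	slot	description	identifier	range
Person	id	identifier for a person	yes	string
```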

make all -> No such file or directory: ~/edirect/pytest

% make all

poetry run pytest
/usr/local/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
Creating virtualenv schemasheets-FMUhH2LU-py3.9 in /Users/MAM/Library/Caches/pypoetry/virtualenvs

FileNotFoundError

[Errno 2] No such file or directory: b'/Users/MAM/edirect/pytest'

at /usr/local/Cellar/[email protected]/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/os.py:607 in _execvpe
603│ path_list = map(fsencode, path_list)
604│ for dir in path_list:
605│ fullname = path.join(dir, file)
606│ try:
→ 607│ exec_func(fullname, *argrest)
608│ except (FileNotFoundError, NotADirectoryError) as e:
609│ last_exc = e
610│ except OSError as e:
611│ last_exc = e
make: *** [test] Error 1

three tests failing in main

FAILED tests/test_schema_exporter.py::test_types - schemasheets.schemamaker.SchemaSheetRowException: Error in line 2, row={'Type': '', 'base': '', 'uri': '', 'Desc': 'my string', 'Extends': 'string'}
FAILED tests/test_schemamaker.py::test_types - AttributeError: 'TypeDefinition' object has no attribute 'type'
FAILED tests/test_schemamaker.py::test_combined - AttributeError: 'TypeDefinition' object has no attribute 'type'

allow for roundtripping of structured_patterns using inner keys

given a schema:

```yaml
classes:
  Person:
    attributes:
      first:
      last:
      full:
        structured_pattern:
          syntax: "{token} {token}"
```

we'd like a header column of:

```
structured_pattern
> inner_key: syntax
```

such that flat values like `{token} {token}` can be used in the data file.

This should work, but an exception is currently thrown.

discovered by @turbomam
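A sketch of the desired sheet (tab-separated), following the descriptor-row convention used in the other schemasheets examples; column names and the data row are illustrative:

```
record	field	pattern
> class	slot	structured_pattern
>		inner_key: syntax
Person	full	{token} {token}
```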

invoke with sheets2linkml?

As opposed to schemasheets? See #10

% sheets2linkml --help

Traceback (most recent call last):
File "/Users/MAM/my_first_ss/venv/bin/sheets2linkml", line 5, in <module>
from fairstructure.schemamaker import convert
ModuleNotFoundError: No module named 'fairstructure'

Reorganize package structure

We have a flat list of things under

https://github.com/linkml/schemasheets/tree/main/schemasheets

  • schema_exporter: linkml2sheets
  • schemamaker: sheets2linkml
  • schemasheet_datamodel
  • sheets_to_project: primarily CLI
  • schemaview_vs_example: ???

This is a little ad hoc. For other projects we subdivide into packages, e.g.

  • import
  • export
  • datamodel
  • cli

This may be overkill here, but we should at least have consistent naming conventions; e.g., if schema_exporter goes from the metamodel to sheets, then the opposite should be called schema_importer.

Add better documentation for when to use metatype

Notes from @cmungall:

metatype is useful for cases where you want a single column to always represent the element name, with the element type switching depending on the row. If you use this style, you always need a "name" column. Further up in the stack trace it was complaining about the missing name field.
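For illustration, a metatype-style sheet keeps one name column while the element type varies per row (column titles are hypothetical; the element names come from the personinfo example):

```
name	kind	desc
> name	metatype	description
Person	class	a person
age	slot	age in years
```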

Examples should roundtrip

When converting from sheets to LinkML, a column that maps to examples is hard-wired to generate `Example(value=v)` for each value `v` in the cell.

When this is reversed back from LinkML to sheets, this causes an error.

The behavior should be symmetric.

More broadly, there should be a well-documented solution for mapping complex values to flat spreadsheet cells, rather than relying on hardwiring.
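A minimal sketch of what a symmetric mapping could look like, using hypothetical helper functions and a stand-in `Example` dataclass (not the schemasheets or linkml-runtime API), with an internal separator like the one in the sheet headers above:

```python
# Hypothetical helpers sketching a symmetric round trip between a list of
# example objects and a single flat spreadsheet cell.
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Stand-in for a LinkML Example (value only)."""
    value: str


SEP = "|"  # internal_separator


def examples_to_cell(examples: List[Example]) -> str:
    """LinkML -> sheets: join example values into one flat cell."""
    return SEP.join(e.value for e in examples)


def cell_to_examples(cell: str) -> List[Example]:
    """Sheets -> LinkML: split the cell back into Example objects."""
    return [Example(value=v) for v in cell.split(SEP) if v]
```

Because the two functions are exact inverses (for values not containing the separator), `cell_to_examples(examples_to_cell(xs)) == xs`, which is the symmetry the issue asks for.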
