Symptoms JSON Type Definitions generated code is not consistent ac

Inconsistent Python property names with JSON Type Definition, about jsontypedef/json-typedef-codegen

Comments (9)

ucarion commented on August 18, 2024

Thanks for opening an issue! Do I understand correctly that your question is why the Python codegen is doing the "opinionated" thing of renaming a JSON property like _tid to a Python property named tid?

If so, then yes, this is intentional. In fact, all of the code generation targets do this, except for TypeScript. Basically, TypeScript is a special case. This is because in JavaScript-land, the conventional object model (that is, the data structures that code usually manipulates) is exactly the JSON model. In other words, it's very common to write most of your JavaScript code to manipulate data that comes straight out of JSON.parse. It's not very conventional to do a conversion step.

By contrast, in Python it's more conventional to mostly manipulate instances of custom classes, such as those jtd-codegen generates. It's not so common, or at least it's usually considered "not ideal", to write code that manipulates the data straight out of json.loads everywhere in one's code. So a conversion step is preferred. Most languages work like Python in this regard; TypeScript is an exception.

Because it would be un-idiomatic for TypeScript code to require a conversion step, jtd-codegen has no choice but to always use the property names straight from the schema. But the Python codegen, like the codegen for other languages that work the way Python does, has to generate a new class anyway. So while we're at it, we might as well choose idiomatic names for properties. It typically doesn't make it much harder to tell what's going on, and it makes the code look cleaner -- especially, for instance, if you use camelCase in your JSON property names, because then you can still manipulate familiar Python snake_case properties.
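To make the conversion step concrete, here is a minimal hand-written sketch of the kind of class described above. It is not jtd-codegen's actual output: the class name Event, the userName property, and the method bodies are assumptions made for illustration. Only the _tid to tid renaming and the from_json_data / to_json_data method names come from this thread.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Event:
    # Hypothetical class for illustration; not actual jtd-codegen output.
    tid: str        # JSON property "_tid"
    user_name: str  # JSON property "userName"

    @classmethod
    def from_json_data(cls, data: Dict[str, Any]) -> "Event":
        # Read the JSON property names and store them under Python-style names.
        return cls(tid=data["_tid"], user_name=data["userName"])

    def to_json_data(self) -> Dict[str, Any]:
        # Write the original JSON property names back out.
        return {"_tid": self.tid, "userName": self.user_name}


event = Event.from_json_data({"_tid": "abc123", "userName": "ada"})
print(event.tid)             # abc123
print(event.to_json_data())  # {'_tid': 'abc123', 'userName': 'ada'}
```

The renaming only exists inside the Python process; anything that goes back through to_json_data carries the schema's names again.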

Further, in Python an underscore-prefixed property name is conventionally considered to be "private". But the notion of privacy doesn't make a ton of sense for JSON conversion types like the ones jtd-codegen generates. Also, the "opinionated" name changing allows the generated code to do "keyword dodging" -- for instance, if a JSON property name is on this list of Python keywords, then jtd-codegen will choose a different name for the property, and to_json_data and from_json_data will do the conversion. This lets us make sure that jtd-codegen always generates valid code, even if a schema uses Python keywords or has names that aren't valid Python identifiers.
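As a rough illustration of "keyword dodging", the sketch below uses Python's standard keyword module to detect reserved words. The helper name dodge_keyword and the trailing-underscore rule are assumptions for this example; the thread does not say which replacement name jtd-codegen actually picks.

```python
import keyword


def dodge_keyword(json_name: str) -> str:
    """Return a name that is safe to use as a Python attribute.

    Hypothetical helper: the trailing-underscore rule is an assumption,
    not necessarily the rule jtd-codegen applies.
    """
    if keyword.iskeyword(json_name):
        return json_name + "_"
    return json_name


for name in ["class", "from", "name"]:
    print(name, "->", dodge_keyword(name))
```

A schema property called class could not be written as a dataclass field at all, so some renaming of this kind is needed before valid Python can be generated.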

Does that make sense? Happy to discuss this further, especially if jtd-codegen's behavior is a source of surprise, confusion, or problems.

from json-typedef-codegen.

bolbken commented on August 18, 2024

Thank you for the detailed answer. I really like the idea behind this library and will continue to use it in a current project.

I've done my best to think about all of your points, and believe I understand a lot of the reasoning you've supplied.

My concerns primarily revolve around the opinions of the naming scheme and how "locked in" a user is to them.

To clarify, I 100% agree that keywords need to be dodged, but I don't fully agree that users should be forced to accept the code generator changing their casing, underscores, or other naming choices if they don't want the tool to.

Schemas are all about detailed/fine-grained control for back-end engineers IMO. 80% probably won't mind the "lock in" of snake case in python or camel case somewhere else, but my guess is that anyone that is building a huge project and leaning on this tool may fall into that 20% who don't want all/any/some of the "opinions" the library has on what will be generated.

This project is early, so I don't intend to be too ask-y or pushy, but I think a CLI flag of some sort that selects which "opinions" should or shouldn't be enforced when the code is generated would widen the possible use cases to >90% satisfied.

As an example, compare this tool to code formatters or linters: they let you specify options and make your code as beautiful/ugly/broken as you or your team think it should be. That's the ethos of engineering IMO; if we don't share that belief, it's OK.

This project is a great idea regardless of whether this idea is ever implemented. If I knew anything about Ruby I'd offer to implement such a feature. Since I'm already behind on my current work and decently multi-lingual, I'm having a hard time justifying the effort. That being said, I'm always open to convincing if you're convinced the idea has value in this project.

Thanks again for your kind, respectful explanation of the reasoning behind the generator's logic and for the time you've taken to read my thoughts. Keep up the good work on jtd-codegen.

bolbken commented on August 18, 2024

@ucarion, another thought here...

I've been thinking about this a bunch... should the tool be dodging keywords??? What is the actual use case? A developer would have to want their hand-crafted schema to (potentially unknowingly) be manipulated in the generated code with no knowledge at generation time???

Wouldn't an error/warning during code generation solve the problem? Would a developer really prefer for their schema terms to be manipulated without their knowledge (only to struggle later on in confusion), or should the tool have the option/default to warn/error and help the developer/automation know that something is not great or is broken with their schema terms?

What is the actual use case in your mind? I know the latter makes more sense to me, but it might not be how you have envisioned the project.

ucarion commented on August 18, 2024

Hi @bolbken,

I should have made this clearer, but I am thinking about your suggestions, and whether and how some variant of your ideas might be possible to implement.

One use-case that may make the problem space clearer is to consider the fact that many systems have to work with an existing schema, and do not have the liberty to change their schema. Consider, for instance, the fact that type is a very common name for a JSON property in many systems, but it's also a keyword in many programming languages. If jtd-codegen's behavior was to fail to generate valid code (such as by instead returning an error) for schemas using a property named type, then jtd-codegen would fail to be useful to a large class of existing systems. The set of commonly-used JSON property names overlaps considerably with the set of keywords used in one or more popular programming languages.

It is an important use-case for jtd-codegen that it work for systems that have an existing schema that cannot be modified. For instance, it's intended to be easy to "modernize" an existing system by first running jtd-infer to generate a schema from the existing data, and then start working with the data safely using jtd-codegen-generated data structures.

Moreover, in languages like Java or C#, there are mutually-incompatible naming conventions for properties that are very unusual to deviate from. In languages like Go, the programming language itself imposes requirements on property names. In these and other similar cases, it's essentially unavoidable that some amount of name mangling is required.

ucarion commented on August 18, 2024

Perhaps one thing that I'd like to understand more of your perspective on is what you mean by:

Would a developer really prefer for their schema terms to be manipulated without their knowledge (only to struggle later on in confusion)

How does such a struggle or confusion arise? In my experience, since the generated types have type annotations, a Python code editor typically gives you autocompletions that make it fairly obvious right away that this sort of renaming is going on. And even if you don't have autocompletions, upon hitting the runtime error on first invocation, is the first instinct not to look at the generated class, and figure out that it doesn't have the properties you might have expected?

Do I perhaps misunderstand the workflow? I would love it if you could illustrate your user experience a bit. Out-and-out frustration is definitely valid here -- ideally jtd-codegen would be a tool that works well and makes sense even when used intemperately or in moments of impatience.

bolbken commented on August 18, 2024

My use case involves reading and writing data to/from an API and a database.

I use this tool as the single source of truth for schema property names and their basic types.

My project utilizes multiple languages, which is why I decided to try out this tool.

Currently I have a repository that only defines the schema in JTD JSON, and I have automation that deploys the schema library packages as separate language-specific artifacts (PyPI, npm, etc.). This is the meat of why this repository is great: I have a versioned schema that is similar across multiple languages and is the single source of truth for basic types. One note here is that I only use schemas whose properties and optional properties have basic types (string, int32, and arrays of basic types). This is because I'd rather not define things twice (like enums, refs) in multiple files, which is a great way to encourage human error.

Example:

{
  "properties": {
    "_id": {},
    "name": { "type": "string" },
    "email": { "type": "string" }
  },
  "optionalProperties": {
    "createdAt": { "type": "timestamp" },
    "creator": {},
    "modifiedAt": { "type": "timestamp" },
    "modifier": {},
    "description": { "type": "string" },
    "phoneNumber": { "type": "string" },
    "location": {},
    "roles": { "elements": {} }
  }
}

NOTE: creator, modifier, and roles in my database will be stored as reference ids for a User or Role respectively. I've chosen to not define this in JTD because of the repeat definition in other files.

This results in the automation producing a Python dataclass as expected, except for the renaming of property names such as _id to id and createdAt to created_at:

@dataclass
class User:
    id: 'Any'
    email: 'str'
    name: 'str'
    created_at: 'Optional[datetime]'
    creator: 'Any'
    description: 'Optional[str]'
    location: 'Any'
    modified_at: 'Optional[datetime]'
    modifier: 'Any'
    phone_number: 'Optional[str]'
    roles: 'Optional[List[Any]]'

Because I have chosen not to duplicate ref, enum, and other definitions in the user schema JTD for maintainability reasons, I must extend the dataclass type definitions to be representative of what the database/API will actually provide. I do this in another package, which we can call my 'types' package. I'm using a NoSQL document DB (Mongo) under the hood and decided to add the prefix Document for clarity. You'll notice here that UserDocument also inherits from an abstract base class, Document. This abstract class defines introspective class methods like required_props or optional_props, additional creational factory methods like from_json_data_with_subdocs, and fixes the type hints for the from_json_data and to_json_data methods that were defined on the User dataclass.

@dataclass
class UserDocument(User, Document):
    id: str
    description: Optional[str] = None
    creator: Optional[Ref["UserDocument"]] = None
    created_at: Optional[datetime] = None
    modifier: Optional[Ref["UserDocument"]] = None
    modified_at: Optional[datetime] = None
    phone_number: Optional[str] = None
    location: Optional[Location] = None
    roles: List[Ref["RoleDocument"]] = field(default_factory=list)

At this point in time the UserDocument type is complete and can be used by database, api client apps, etc.

That completes the implementation context. Let's move on to why the property name mutation is problematic.

Here is my concern: once an app developer instantiates a UserDocument, they have to dance around the mutated property names.

For example, I'm developing a Python app that uses a Document (which uses a JTD-generated dataclass under the hood) with a property named createdAt in the schema (and the database) but created_at in Python. This app handles an error and publishes a message to a collective log aggregator that includes the property name text 'created_at'. Say another application, written in TypeScript, consumes and reports on these logs. Now every time the TypeScript app parses the logs, it must convert the name 'created_at' to 'createdAt' before it can correlate the log entry with any data at rest. That means the code needs to undo what the JTD code has done AND it must be aware of all the (potentially undocumented) property name mutations in another, unrelated library. This is redundant, and the source of the duplicate effort is that JTD has decided that Python must have snake case, no leading underscores, etc.

I hope this description helps you better understand my thoughts. I agree that type hints in IDEs are an immediate benefit of what this tool provides, but over the longer term and in larger projects, forced opinions have more nuanced (potentially negative) effects. That being said, every good module/library/abstraction chooses logical defaults which make it a useful abstraction (IMO, this tool does this well currently), but truly great extensible software allows the tuning of those defaults (IMO, jtd-codegen has room to grow here).

Lastly I'd like to share some (probably overquoted) wisdom I've heard that seems quite relevant here:

“There are only two hard things in computer science: cache invalidation and naming things.” — Phil Karlton

“Don’t repeat yourself. Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.” — Andy Hunt and Dave Thomas

TLDR: Naming things is hard enough in software, so why meddle with my names if I don't want you to? Then it's your problem too.

ucarion commented on August 18, 2024

If I understand correctly, the crux of the issue is that logs contain the Python attribute names, and these are different from the JSON property names. If that's the case, why not instead log the JSON representation of the Python class, i.e. the result of to_json_data?
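A minimal sketch of that suggestion follows. The User class here is a trimmed-down, hand-written stand-in for the generated one (created_at is kept as a plain string for brevity), and the logger name and the describe_for_log helper are assumptions made for illustration.

```python
import json
import logging
from dataclasses import dataclass
from typing import Any, Dict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("user-service")  # hypothetical logger name


@dataclass
class User:
    # Trimmed-down, hand-written stand-in for the generated class.
    id: str
    created_at: str

    def to_json_data(self) -> Dict[str, Any]:
        return {"_id": self.id, "createdAt": self.created_at}


def describe_for_log(user: User) -> str:
    # Serialize via to_json_data so logs carry the schema's property names.
    return json.dumps(user.to_json_data(), sort_keys=True)


user = User(id="u1", created_at="2024-08-18T00:00:00Z")
logger.error("failed to update user: %s", describe_for_log(user))
```

With this approach a log consumer written in another language sees _id and createdAt, the same names it already knows from the schema.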

bolbken commented on August 18, 2024

The point here isn't the example I came up with. The point is flexibility in the design of this tool. Currently it forces users to abide by the strict rules it has in place, which are not absolutely necessary. I'm attempting to paint the broader picture that the tool should have specifiable CLI options with sensible defaults rather than a single opinionated mode of operation.

Examples:
$ jtd-codegen <args and options> --no-case-modify
$ jtd-codegen <args and options> --allow-leading-underscore
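To show what the two proposed switches might mean in practice, here is a purely hypothetical naming function. Neither the flags above nor this function exist in jtd-codegen; the function name python_name, its parameters, and the conversion regex are all assumptions. The defaults are chosen to match the renamings reported earlier in this thread (_id to id, createdAt to created_at).

```python
import re


def python_name(json_name: str, case_modify: bool = True,
                allow_leading_underscore: bool = False) -> str:
    """Hypothetical naming rule with two switches; not jtd-codegen behavior."""
    name = json_name
    if not allow_leading_underscore:
        name = name.lstrip("_")
    if case_modify:
        # Insert "_" before each interior capital, then lowercase.
        name = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()
    return name


print(python_name("createdAt"))                           # created_at
print(python_name("createdAt", case_modify=False))        # createdAt
print(python_name("_id"))                                 # id
print(python_name("_id", allow_leading_underscore=True))  # _id
```

Keyword dodging is deliberately left out of this sketch, since both sides of the discussion agree it has to stay regardless of any such options.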

ucarion commented on August 18, 2024

While I can understand your point of view, I don't think it's very feasible today to support alternative name-mangling strategies. The space of possible solutions is too great, so for now sticking to a relatively opinionated default seems like the most maintainable solution.

That said, if someone can bring to bear a working implementation of the proposal in this ticket, I am open to reviewing it.

Inconsistent Python property names with JSON Type Definition about json-typedef-codegen HOT 9 CLOSED
