project-open-data.github.io's Issues

Clarify data.json creation guidance

The guidance for creating the data.json file itself still references RDFa Lite, and does not mention including an entry for the data.json file itself within the data.json file, as discussed in earlier workgroups.
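As a rough sketch of such a self-referencing entry (the title and URL here are hypothetical, and the field selection just reuses the common core fields discussed elsewhere in these issues):

{
  "title": "Agency Data Catalog (data.json)",
  "accessURL": "http://www.agency.gov/data.json",
  "format": "JSON",
  "accessLevel": "public"
}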

Is data.gov *really* planning to cut over to /data inventories on 2013-11-09?

Perhaps I am taking this too literally, but http://next.data.gov/announcement/indexing-datasets-on-next-data-gov/ seems to indicate that data.gov plans to switch over to harvesting from agency.gov/data inventories instead of the current data.gov inventory on Nov 9, and "that every agency catalog will include all of the datasets originally listed on Data.gov as well as newly-listed data." This worries me because:

  • Some things in data.gov were individually registered there and may not be listed in any other internal inventory, so we need to be sure to include them;
  • We sometimes provide more complete metadata to data.gov than what the DCAT JSON calls for, so there may be loss of information;
  • It's not clear how to ensure that at least all of the existing data.gov entries are retained other than by obtaining and converting to JSON a one-time bulk listing of each agency's known datasets (not sure if data.gov CKAN offers that), and the future maintenance of that list is then unclear;
  • If all we're doing is replacing one "manually-updated central database" [as they put it] with multiple manually-updated agency-specific databases, then this is merely an under-the-hood change in the architecture which may be helpful to data.gov but does not actually improve public data discoverability.

Are my concerns valid, or am I reading too much into the document at next.data.gov?

Public Inventory

Section 2 of the implementation guide:

http://project-open-data.github.io/implementation-guide/

...lays out procedures for creating the public index of agency datasets. As it is currently written, agencies "are only required to list datasets with an 'Access Level' value of 'public.'"

The schema defines "public" and "restricted" in the following manner:

http://project-open-data.github.io/schema/

"Choices: Public (is or could be made publicly available), Restricted (available under certain conditions),"

These definitions do not adequately define "public" or "restricted."

If restricted data can be made available under certain conditions, it should be able to be listed publicly. It may be that "restricted" means "able to be released to subsets of the public, like a qualified research community, under certain conditions." If this is the intent behind the "restricted" category, that should be made clear (it's unclear what "under certain conditions" means). Even if that's the intent of the "restricted" definition, that's a form of being made public (the "could" from the public category), and should result in those datasets being listed in the public data index.

In other words, "public" and "restricted" should be better defined, and the requirement that agencies list all of their data that "could" be made public should be applied to both "public" and "restricted" datasets, as they've been defined.

The guidance should provide details in the following form:

"Datasets are considered datasets that "could" be made publicly available if:
*certain information would need to be removed from the dataset before release
*significant resources would need to be allocated to digitize or prepare the information for release
*the data can only be released to a limited community due to privacy concerns
*an extraction process can create a new dataset on top of the current dataset to provide public value
etc"

Additionally, data that is affirmatively marked as "private" should not be automatically withheld from public listing; even if an agency determines that a dataset cannot be released publicly, that is a different determination from deciding whether to publicly acknowledge the dataset's existence.

Ultimately, the labels that determine whether datasets get publicly listed should be designed based on whether it's possible to acknowledge the dataset's existence publicly, which is a different decision from whether it's possible to release it, regardless of how much extraction, transformation, digitization, or anonymization is necessary to do it. Since the current OMB directive says that the public data index should include all data that "could" be made public, that "could" should be defined clearly, and empower public oversight of agency information policy decisions.

Travis CI Builds Failing

It looks like all Travis CI builds are failing:

$ jekyll build
/home/travis/build.sh: line 165: jekyll: command not found
The command "jekyll build" exited with 127.

Versioning

Hey folks. HHS is trying to implement code according to the specification. But the specification is changing. (E.g. c9aefbe) This poses some obvious problems when trying to consume arbitrary data.json files.

A few things would make this easier:

  • Applying version numbers to schema changes.
  • Including a version number within the data.json file so consumers know what version the file conforms to (see the sketch after this list).
  • Having a changelog so we know what's changed in the schema.
  • Batching changes so that we don't have as many schemas in the wild as data catalogs.
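As a sketch of the second point, a catalog-level version marker might look like the following; note that both the "version" field name and the "dataset" wrapper are hypothetical here, since the project has not defined either:

{
  "version": "1.0",
  "dataset": [
    {"title": "Example dataset", "accessLevel": "public"}
  ]
}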

Data directories/metadata should identify copyright status of the data

Agencies publishing data directories should take the responsibility of indicating what the copyright status of the data is, namely:

  • Is the data a work of the fed govt and in the public domain.
  • Is it (or any part of it) copyrighted? By who?
  • And if so, are any copyright licenses applied to it? are any of those licenses 'certified' open licenses?

The memorandum should also have addressed the "poison pill" problem of copyright works appearing within government data.

Ideas here are riffed from James Jacobs, see this email: https://groups.google.com/d/msg/openhouseproject/x65WlQ_PLfU/v_hHoGAj0-8J

building open data communities

Where might I add a description of how to build open data communities?

Frequently, at the open data meetings, folks ask about how/why HHS created its health data consortium (a public-private partnership). I would like to provide a brief overview in the style of an implementation guide.

Additionally, it might be valuable to provide an overview of other types of activity that encourage culture change and support open data communities within the federal government. We have an innovation council, data council, health data 'leads', federal advisory committees, interagency councils, etc.

On 'Metadata Resources', Data Quality is listed with an invalid RDF property name

On the Metadata Resources page http://project-open-data.github.io/metadata-resources/, the "RDFa Lite 1.1" property listed for the Data Quality field is not a possible property.

It says "xsd:boolean". That's the name for a data type, not a property name.

There's a note about this field on the schema page ("With respect to dcat:dataQuality, we intentionally did not use this field and instead chose a boolean."), so I get that dcat:dataQuality was specifically not intended, but the field should either be listed as "n/a" or some other property should be listed instead.

distribution vs accessURL in data.json metadata

The distribution property is not part of the core metadata list, but is defined in the expanded description of properties.

It's not clear how multiple representations of the same dataset (e.g., JSON, XML, CSV) should be represented in data.json:
a) {"accessURL":"http://www.agency.gov/vegetables/listofvegetables.csv"} (allows only a single link)
b) "distribution": [{"accessURL": "http://data.mcc.gov/example_resource/data.json", "format":"JSON", "size":"22mb"},{"accessURL":"http://data.mcc.gov/example_/data.xml", "format":"XML", "size":"24mb"}] (an example is provided, but it's not clear how to use it)

Single-file /data catalog not good--optional alternative suggested

Current guidance is that each agency's "/data" inventory must be a single list in a file containing multiple lines of Javascript Object Notation (JSON) summary metadata per dataset, even if our agency has tens of thousands of datasets distributed across multiple facilities and servers. I believe the single list will pose problems of inventory creation, maintenance, and usability. I enumerate my concerns below, but first I propose a specific solution.

PROPOSAL:

I recommend the single-list approach be made optional. Specifically, I suggest that the top-level JSON file be permitted to include either a list of datasets or a list of child nodes. Each node would at minimum have 'title' and 'accessURL' elements from your JSON schema (http://project-open-data.github.io/schema/), an agreed-upon value of 'format' such as "inventory_node" to indicate the destination is not a data file, and optionally some useful elements (e.g., person, mbox, modified, accessLevel, etc) describing that node. Each node could likewise include either a list of datasets or a list of children.
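For illustration only (the title, URL, and the "inventory_node" format value are all part of this proposal, not the current schema), a child node entry might look like:

{
  "title": "Science Data Inventory Node",
  "accessURL": "http://www.agency.gov/data/science/data.json",
  "format": "inventory_node",
  "modified": "2013-08-01",
  "accessLevel": "public"
}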

CONCERNS REGARDING THE SINGLE-LIST APPROACH:

(1) We should not build these inventories only to support data.gov. We want to leverage this for other efforts internal to our agencies, for PARR, and to support other external portals such as the Global Earth Observing System of Systems (GEOSS) or the Global Change Master Directory (GCMD). A distributed organization will be more useful for them (even if data.gov itself could handle a single long unsorted list.)

(2) The inventory will need to be compiled from many different sources, including multiple web-accessible folders (WAFs) of geospatial metadata, existing catalog servers, or other databases or lists. Each type of input will need some code to produce the required subset of the list, and then some software will need to merge everything into a giant list. Failure at any step in the process may cause the inventory to be out of date, incomplete, or broken entirely. A distributed organization will more easily allow most of the inventory to be correct or complete even if one sub-part is not, and will require much less code for fault-handling.

(3) Some of our data changes very frequently, on timescales of minutes or hours, while other data are only modified yearly or less frequently. A distributed organization will more easily allow partial updates and the addition (or removal) of new collections of data without having to regenerate the entire list.

(4) The inventory is supposed to include both our scientific observations and "business" data, and both public and non-public data. That alone suggests a top-level division into (for example) /data/science, /data/business, and /data/internal. The latter may need to be on a separate machine with different access control.

(5) It would be easier to create usable parallel versions of the inventory in formats other than JSON (e.g., HTML with schema.org tags) if the organization were distributed.

(6) I understand that the data.gov harvester has successfully parsed very long JSON files. However, recursive traversing of a web-based directory tree-like structure would be trivial to implement by data.gov and would be more scalable and solve many problems for the agencies and the users. data.gov's own harvesting could even be helped if the last-modified date on each node is checked to determine whether you can skip it.

Add operatingUnit Field

While datasets are ultimately owned by an agency, they are really collected and maintained on an operating unit basis. While contact names and emails may change, a dataset's associated operating unit probably will not. Making this a new, required field makes it clearer where to go with questions for the public consuming the data, the agency officials responsible for updating the metadata, and other agencies looking to access the data. It can also help agencies assess internal compliance with publishing data, and is likely to be part of an agency internal data management system for workflow purposes.

Different agencies call their sub-units different things: departments, POCs, bureaus, etc. In asking around, "operating unit" was most generic, but I'm open to an even more generic term.
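For example (the value is invented, and the field name is just the suggestion above):

{"operatingUnit": "Bureau of Labor Statistics"}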

What do you all think?

overlap between schema.md and metadata-resources.md

Both schema.md and metadata-resources.md include a listing of the various JSON fields. I suggest merging these two documents to provide a single definition of the DCAT JSON structure and its mappings to other metadata structures.

Fork and Pull Request Workflow Broken

When logged in: Uncaught TypeError: Cannot call method 'get' of undefined. When logged out: POST https://api.github.com/repos/project-open-data/project-open-data.github.io/forks 404 (Not Found), followed by Uncaught SyntaxError: Unexpected token u.

#104

Remove the uppercase field names

On the schema page (http://project-open-data.github.io/schema/), why is every field listed with an uppercase "Field" name (e.g. "Release Date") and a lowercase "JSON" name (e.g. for Release Date, the corresponding JSON name is "issued")?

The only place the uppercase names seem to be used is the Metadata page (http://project-open-data.github.io/metadata-resources/), where they link to the cross-walk table to the other vocabularies.

Do they have some other purpose? If they're not supposed to ever appear in any data file, they should be removed from the specification.

Format became an array at top level and a string inside distribution

Not sure what to make of this change. The "format" field is now supposed to be an array of MIME types. If the top-level format field is supposed to correspond to the top-level accessURL, then it would only ever be a single MIME type since accessURL necessarily provides a single downloadable file. (Except in unusual and advanced use of HTTP.)

Or if it repeats the formats found in the distributions, then it's redundant.

It's also at odds with the example in the documentation for "distribution" which shows the use of distribution's "format" field as a string, and not an array.

So... just trying to figure out what I'm supposed to put in it....

Congrats on posting the updated spec though!

DCAT JSON pagination

Is the expectation that all of an agency's public datasets are listed in a single data.json (xml, html) file? This might result in huge files that are not easy to navigate for developers who are looking for a specific dataset.

Has it been considered to implement a filter and pagination approach using OpenSearch (http://www.opensearch.org), or to break big catalogs into chunks similar to what is done in the sitemap protocol (http://www.sitemaps.org/)?
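As one possible sketch of the sitemap-style chunking (nothing here is part of the current schema; the "catalogIndex" name and the URLs are invented), a small top-level index could point to catalog chunks and their last-modified dates:

{
  "catalogIndex": [
    {"accessURL": "http://www.agency.gov/data-1.json", "modified": "2013-08-01"},
    {"accessURL": "http://www.agency.gov/data-2.json", "modified": "2013-08-05"}
  ]
}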

Quality over Quantity

@seanherron commented in #33 that "...we also have a deadline of November 9th for agencies to publish a ton of data..." The Implementation Guide says that we should...

"Conduct a zero-based review effort of all existing data. Give this effort a very short timeframe and the very specific goal of producing a simple list of all data assets within the agency. Stop at the due date rather than stopping at the 100 percent marker, which is very difficult to reach in a single pass. Repeat at regular intervals."

So, are we publishing a "ton of data," or are we working to produce a more usable resource of accessible and usable data assets?

@gbinal commented in #105 that "...we all know that NOAA and USGS are the two 900 lbs. gorillas when it comes to number of entries." Back when we did Geospatial One-Stop, in the earliest days of Data.gov, and now apparently in the current instantiation of what's much the same concept, we've had times where there are tons of records in whatever catalog is being aggregated from all this work. But how useful have those resources been to date? What kinds of use cases have we enabled from discovery to access to use leading to beneficial outcomes for society? How did the big aggregations of tons of records do something in and of themselves that no other pathway to the actual underlying data could accomplish?

Don't get me wrong - I fully believe in what we are doing. I do believe that we should end up with every data asset that the taxpayers (including me) have funded over all these years easily discoverable and accessible through aggregations like Data.gov and lots of other creative applications arising from this effort. We should put them out there and let smart technology and technologists come up with creative ways to narrow in on the useful from the stuff not appropriate for a particular inquiry.

However, on the data provider end, we have to make some choices about what "zero-based review effort" means. At least from our perspective in the USGS, we are trying to do something sustainable that will make "Repeat at regular intervals" mean, "as soon as data are released." We are committed to scientific integrity, validity, and excellence at every level of our organization. So, is it better for us to continue pushing every single metadata record that we can get to line up with the high-level and relatively scant POD implementation of DCAT JSON? Or should we put some amount of energy into trying to make sure that the records we do put into one or more (ref. #105) data.json files have viable and accessible distribution links, appropriate contact information, references to deeper metadata and associated scholarly publications, and the few other details such that when someone finds them they will have a pretty good shot at being able to put the data to use?

It's a balance between quality and quantity, for sure. My opinion is that balance could be characterized in a ratio like 3-1, quality over quantity. But that's my opinion. What's yours?

Just for fun, I posted a little survey (on my own time and not any official part of Project Open Data or in any way affiliated with the USGS) - http://www.surveymonkey.com/s/JGWV3NK (I'll post survey results somewhere and reference in a comment.) I'm interested in comments here and input to the survey. If you have anecdotes or artifacts that answer any of my use case questions on success (or failure) of big comprehensive cataloging efforts - past or current, I'd love to hear about them here.

Use contactPoint, not person

A group discussion nixed changing person to contactPoint. If you think this was wrong, please re-submit a new issue just for this item for broader discussion.

@MarinaMartin Could you share more information about the group's reasons?

DCAT uses contactPoint. What will this project use in its RDF serialization? Every RDF consumer would expect it to use dcat:contactPoint given that it adopts DCAT. Will the documentation explicitly describe this difference between the RDF serialization (which should use contactPoint) and other serializations (which will use person according to the above decision)? It seems simpler to have consistent terms across all serializations.

Additional Clarity on Size Field

The size field, which is intended to represent the file size of the resource, has many potential areas for improvement.

First off, we should either decide on a standard unit of measurement (like dcat:byteSize) or break out the unit of measurement from the numeric value (e.g., two fields: size (numeric value) and sizeUnit (unit of measurement)). This would allow for machine reading of the value, enabling users to sort and filter by the size of the resource, as well as reducing confusion when multiple standards of measurement are used.
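To illustrate the two-field option (these field names are just the suggestion above, not part of the schema), a distribution entry might read:

{"accessURL": "http://data.mcc.gov/example_resource/data.json", "format": "JSON", "size": 22, "sizeUnit": "MB"}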

Secondly, what is the rationale behind the cardinality enabling multiple values? Is this in relation to the possibility of having multiple accessURLs? If so, how do we draw a link between a specific accessURL and its size? We could specify that they be represented in order? I'm not a huge fan of that approach but I can't think of a better way to do it.

Finally, I think this should be renamed either bytesize (if it is represented in bytes) or filesize (if represented in other units of measurement). The rationale for this is that size can be interpreted to mean a variety of things (e.g., size of geographic area covered, number of rows of data, etc.). Bytesize or filesize clarify this.

DCAT JSON structure

I've been looking at implementing DCAT support in Geoportal Server (https://github.com/Esri/geoportal-server) and have a question on the expected JSON structure.

The examples and templates for DCAT JSON suggest the following structure:

{ [ catalog info, catalog item 1, catalog item 2, … ] }

All in one JSON array.

This structure seems to be what is intended (based on the RDF and W3C specs for DCAT):

{ catalog info, [catalog item 1, catalog item 2, ...] }

Subtle but important difference. Can someone clarify?

Also, the array appears as an unnamed JSON element. I'd suggest assigning it a name to further clarify the role of the different JSON elements.

{
"info" : {catalog info here},
"items" : [catalog items here]
}

This follows typical approaches in Atom/RSS formats.

Pull requests sitting in the queue

There is a growing number of pull requests that have been sitting in the queue for over a month. What is the process of reviewing/updating/including these requests?

Paging or filtering for data.json file

Is there a paging or filtering guideline for the data.json file? Catalogs could easily get up into millions. Returning all the results at once could be a performance issue for service providers and clients. Sorry in advance if this is covered somewhere else.

Add recordAccessLevel

We have accessLevel to indicate whether the data a listing refers to is public (currently or potentially publicly available), restricted (available under certain circumstances), or private (never publicly available).

But what about indicating whether the metadata entry itself is public or private?

Agencies are required to maintain a complete data inventory listing -- of all public, restricted, and private datasets -- in an Enterprise Data Inventory. However, they are only required to make public those entries where accessLevel = public.

Some agencies are interested in including a select number of restricted and private datasets in their public data inventories. But, for a variety of good reasons, they may not want to include 100% of these listings in their public data inventory.

So under this proposal, all listings where accessLevel = public would ALWAYS have recordAccessLevel = public.

But for datasets where accessLevel = restricted || private, the metadata editor would have the option of making recordAccessLevel = public to include that listing in the public data inventory. Otherwise, it would be recordAccessLevel = private.
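A sketch of one such listing under this proposal (the title is invented; recordAccessLevel is the proposed field, not part of the current schema):

{
  "title": "Example restricted research dataset",
  "accessLevel": "restricted",
  "recordAccessLevel": "public"
}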

Agencies using a DMS could theoretically already have this option available with published or unpublished entries. This proposal would simply make it part of the common core metadata.

This is a proposed required field. (Yes, we could use logic on the Data.gov harvester side to set defaults, but since part of the purpose of having data.json files in a consistent location is for other developers and data sites to crawl these inventories, the JSON files should be as complete as possible in and of themselves.)

Scalability with "Last Update" as a required field

I’m working on the data catalog using the schema published on the project open data site. There’s one thing that’s just bugging me: the “last update” field (required). If data is updated periodically, this value changes. Does this mean we have to update our catalog file weekly, daily, ad-hoc? That’s a lot of manual care and feeding. Is there an alternate interpretation of how to use this field that I’m not aware of? As you can imagine, with our BLS and unemployment insurance weekly initial claims data, our mountain of open data is updated very frequently.

Right now, I’m populating an Excel spreadsheet which, when complete, I’ll export to CSV and use that tool that’s listed on the Project Open Data site. With our list of data tables (we’ll blow past 200 for API-supported data tables this FY alone - there's also the datasets that haven't been incorporated in our API yet), even monthly updates of this file would be a lot of overhead. We also have a lot of tables that are updated weekly and at somewhat irregular intervals, so even monthly updates of the catalog will quickly get out of date.

I see the value of the field. I really do. However, I have a concern with the ability to continue to scale our open data efforts and maintain this catalog.

Unique Identifiers should probably be globally unique rather than only unique for an org

Looking at the description for "identifier" and the guidance, the suggestions are to make the identifiers unique for an Agency or Catalog.

I think we should actually change the guidance to be globally unique. This could be something as easy as adding agency name before the agency identifier, e.g. gov.hhs.aspe.1234567 instead of 1234567. Or it could simply be a URI.
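To make that concrete, the agency-prefixed form would look like:

{"identifier": "gov.hhs.aspe.1234567"}

or, as a URI (this particular URL is invented for illustration):

{"identifier": "http://aspe.hhs.gov/datasets/1234567"}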

There is no guarantee that catalogs won't aggregate or pull from other catalogs (and it is probably even likely). If this happens a single catalog could get two items from different agencies that have the same identifier.

One solution to this would be to make it a function of the catalog to add something to these datasets to give them unique IDs. The problem with this is that the "uniqueness" is created by an intermediate party (not the source of truth).

So, now imagine that I set up a catalog that is going to aggregate a lot of health data. I pull from New York State's catalog, as well as a catalog from HHS. I then get the exact same dataset (since HHS pulls from New York State's catalog unbeknownst to me), but it has two different identifiers in my system. One is NY State's original ID. The other is the ID that HHS's catalog had to create for me (to make sure it was unique within their catalog).

Seems like we could protect ourselves from a lot of identification problems down the line if we ask for global uniqueness for identifiers.

I would also suggest that as guidance, if an Agency already has an agency unique identifier, they should probably make it unique by appending their agency to it (rather than using something opaque like UUID).

webService

Can someone clarify the definition and use of the webService element?

definition: "This field will serve to delineate the web services offered by an agency and will be used to aggregate cross-government API catalogs."

From this definition it is not clear if the value should be a web service URL like this one from the USGS National Map program:
http://services.nationalmap.gov/ArcGIS/services/US_Topo/MapServer/WMSServer?request=GetCapabilities&service=WMS

or if the value should point to a DCAT formatted json file (as in the provided example value) that in itself lists references (what is an 'API Catalog'?) to web services.

In the first case, I suggest updating the example in the schema description. If the second purpose/use is intended, there still needs to be a place in the DCAT schema itself to point to web services.

possible bug with prose integration

I haven't had a chance to investigate further but have replicated one user's inability to complete the 'save' stage when offering a pull request through the 'Help Improve This Content' link on the top of the implementation guidance page. The in-progress spinner kept going indefinitely, regardless of whether the user was logged in and authenticated with GitHub or not.

Feedback on common core schema

In general, great work! A few questions/comments:

  • contactPoint is being added to DCAT. To maintain conformance, consider renaming the person property to contactPoint.
  • I prefer schema:email to foaf:mbox, but either is fine. Just letting you know that you can call that property email as the two properties are owl:sameAs, I believe.
  • Did systemOfRecords come from an existing vocabulary, or did you need to mint it for your use case? If it is new, it would be good practice to mention it, as you've done for accessLevel.
  • dcat:size has been renamed to dcat:byteSize, which is a clearer term. As mentioned in #32, byte size is much less prone to error than various abbreviations for byte size, like GB, GiB, etc.
    • size has been removed #114
  • Dublin Core has a different, more flexible way of expressing dct:temporal. You may want to consider making it an object with "start" and "end" fields. This would also solve #43
  • FYI, granularity was dropped from DCAT because it was underspecified and publishers simply didn't know how/when/why to use it. You may want to drop it as well unless you have a very clear use case in mind (that others share).
  • dataDictionary was also dropped due to inconsistent usage. dataQuality was dropped from DCAT for similar reasons, but I see you have a clear definition of what it means, so it makes sense to keep that one. Just flagging these to let you know that you can very likely get rid of them if you were already on the fence about them.
  • WebService and Feed are classes in DCAT (subclasses of Distribution to be specific), not properties. To be consistent, you can have a @type field (see #23 for why this term should be used) on each Distribution which would be used to say whether the distribution is a WebService, Feed or Download. This also solves the issue of having multiple APIs and feeds for the same dataset, which is not handled by the current spec (see the sketch at the end of this issue).
  • As mentioned in #39, please use arrays and not comma-separated values.
  • It would be cool to add the three properties mentioned in #23 (@context, @id and @type)

Ref.: the latest version of DCAT
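To illustrate the @type suggestion above, a distribution array might look like the following; the class names follow the DCAT subclasses mentioned in that bullet, and the URLs are invented:

"distribution": [
   {"@type": "dcat:Download", "accessURL": "http://data.mcc.gov/example_resource/data.csv", "format": "CSV"},
   {"@type": "dcat:WebService", "accessURL": "http://data.mcc.gov/example_resource/api"},
   {"@type": "dcat:Feed", "accessURL": "http://data.mcc.gov/example_resource/feed.rss"}
]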

Publish a vocabulary/ontology at a stable URL

For "pod" namespace, once final 1.0 schema proposed in #44 is accepted:

pod:systemOfRecords
pod:webService
pod:accessLevel
pod:accessLevelComment
pod:primaryITInvestmentUII
pod:bureauCode
pod:programOffice
pod:dataQuality

Based on #44, am I missing anything?

Weird formatting on tables in schema.md

Having trouble figuring out why the formatting for the tables in /schema.md is showing up inconsistently. For example, look at the contactPoint field in "Further Metadata Field Guidance..."

Does someone smarter than me have any thoughts/fixes for this?

Serialization for distribution is confusing

I don't understand what the phrase "Distribution is a concatenation" means in the usage notes. Does this mean that a Distribution is supposed to be an array of strings that can then be parsed as JSON, or that it is a proper JSON structure? It doesn't help that the JSON provided is not valid JSON. It has some backticks in it that are not correct.

Can I assume the backticks are incorrect markdown and that it should be standard JSON like the below, or is this incorrect?

"distribution": [
   { "accessURL": "http://data.mcc.gov/example_resource/data.json",  "format":"JSON",    "size":"22mb"},
   { "accessURL":"http://data.mcc.gov/example_/data.xml",  "format":"XML", "size":"24mb"}
]  

keyword element: string or array?

The use of the keyword element is described as "Separate keywords with commas". However, the cardinality is set to (1,n).

Is the intent to have one keyword element with a string of comma-separated keywords (as in the example in schema.md):

{"keyword":"squash,vegetables,veggies,greens,leafy,spinach,kale,nutrition,tomatoes,tomatos"}

or have 1 keyword element with an array of keywords?

{"keyword": ["squash","vegetables","veggies","greens","leafy","spinach","kale","nutrition","tomatoes","tomatos"]}

Add "review" to the list of public access level (accessLevel)

Add to the list of possible access levels (public, restricted, private) a fourth level, review. Review would mean that the dataset is being reviewed by the agency to determine if it should be public, restricted, or private. I think this would allow agencies to include more datasets in their enterprise inventories; otherwise, they might choose to exclude datasets for which this determination has not yet been made.

Policy/process for accepting contributions?

The Contributions section includes how people can make contributions, but doesn't explain the process/timeline for how contributions are reviewed, evaluated for inclusion, and accepted (or under what circumstances they won't be accepted).

Providing greater transparency into this Open Source process would help contributors know what to expect and better prepare their contributions for inclusion.

An example is the Linux kernel patch review process.

accessLevel attribute - data.json metadata

Source - http://project-open-data.github.io/schema/

Are "accessLevel" values case sensitive? It is defined as restricted dictionary of Public, Restricted, Private. But in JSON example value is "public" lowercase.

Field: accessLevel
Cardinality: (1,1)
Required: Yes, always
Accepted Values: Must be one of the following: Public, Restricted, Private
Usage Notes: This field refers to the degree to which this dataset could be made available to the public, regardless of whether it is currently available to the public. For example, if a member of the public can walk into your agency and obtain a dataset, that entry is public even if there are no files online. A restricted dataset is one only available under certain conditions or to certain audiences (such as researchers who sign a waiver). A private dataset is one that could never be made available to the public for privacy, security, or other reasons as determined by your agency.
Example: {"accessLevel":"public"}

Replace systemOfRecords with Exhibit 53 UII

It was suggested that an Exhibit 53 Unique Investment Identifier be required for each dataset, in place of the existing systemOfRecords field, which suggests linking to a dataset's System of Records notice.

Thoughts?

Clarify and consolidate ISO 8601 documentation in the schema

ISO 8601 formatted dates are used for several fields (issued, modified, temporal), but the description and examples of ISO 8601 are somewhat inconsistent. Granted, this is in part because ISO 8601 allows for a wide degree of flexibility in how it's used, both in terms of precision and time zones. I would recommend that we consolidate a description of ISO 8601 usage, give it its own paragraph (perhaps under Rationale for Metadata Nomenclature), and use a named anchor to link to it in the descriptions for issued, modified, and temporal.

I think it's also worth suggesting/encouraging a more specific use of ISO 8601, such as a full datetime down to the second in UTC like the example shown for the 'temporal' field. At the very least it would be good to format all example ISO 8601 dates the same way. I tend to prefer a date with all zeros for hours, minutes, and seconds over one without hours, minutes, and seconds at all. This is technically less accurate while being more precise, since it could be interpreted as citing a specific time rather than citing known ambiguity (a zero rather than a null), but I think the distinction is pretty obvious when you see a date with the time fields zeroed out, and I think the extra consistency is worth the slight pseudo-accuracy.
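For instance, assuming full UTC datetimes with zeroed time parts as suggested above, and assuming temporal is expressed as an ISO 8601 interval as in the schema's example (the dates themselves are invented), the three fields would look like:

{
  "issued": "2013-05-09T00:00:00Z",
  "modified": "2013-08-01T00:00:00Z",
  "temporal": "2000-01-01T00:00:00Z/2010-12-31T00:00:00Z"
}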

clarification on "feed" field in metadata - data.json

Page - http://project-open-data.github.io/schema/

The feed field is defined as:
"URL for an RSS feed that provides access to the dataset." - in the beginning of document
"These RSS feeds will be used to create a cross-agency RSS feed search tool." - at the bottom of the document.

What exactly does it mean? Is it an RSS feed with news about the dataset? Or is the dataset itself just an RSS feed (say, from an agency blog)?

How is cardinality expressed for this field? E.g., how do I reference multiple RSS feeds? (They can't be comma-separated, since the comma character is valid in URLs.)

Example files are out-of-date

When changes are made like renaming keyword to keywords, removing size and granularity, etc. shouldn't the example files be updated at the same time?

Need consistent JSON serialization for properties with (0,n) + (1,n) cardinality

There are a number of fields (and I've noticed a few other more specific issues in this genre) where it's unclear what to do when the field can have multiple values.

Examples are: accessURL, format, keyword, category, references, distribution, and feed.

Some of these give guidance. "keyword" and "references" both give guidance to use commas to separate values, but many of the other ones don't.

I would propose that for the serialization, all fields that can have an arbitrary number of values should use arrays. Not commas.

That way we can be consistent, and callers don't need to parse strings to get multiple values. They can use the built-in JSON constructs.

In addition, the "references" property suggests that URLs should be separated by commas. However, commas are valid characters in URLs. This means that in order to make the mechanism reliable, you need to add another layer of encoding on it, which is an unnecessary burden on developers.
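Under the array proposal, those two fields would be serialized like this (the keyword values reuse the schema's example terms; the reference URLs are invented):

{
  "keyword": ["squash", "vegetables", "nutrition"],
  "references": ["http://www.agency.gov/vegetables/documentation.html", "http://www.agency.gov/vegetables/methodology.html"]
}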

Distribution File Size

In the distribution section of the core metadata specification, the accessURL is accompanied by a format and size element. What if the dataset (e.g. Landsat Imagery from USGS) is accessed in a web service where the format and size (of the response) depend on the request sent from the client application?

see this example service:
http://imagery.arcgisonline.com/arcgis/rest/services/LandsatGLSChange/NDVI_Change_1975_2005/ImageServer/exportImage?bbox=-179,-61,179,80

The Landsat imagery runs into terabytes of data. An individual raster dataset extracted from this collection may result in an (almost) global .png file of 64 kB (the request above) or in a multi-MB TIFF for a 1x1 degree portion of the world:

http://imagery.arcgisonline.com/arcgis/rest/services/LandsatGLSChange/NDVI_Change_1975_2005/ImageServer/exportImage?bbox=4%2C52%2C5%2C53&bboxSR=&size=&imageSR=&time=&format=tiff&pixelType=C128&noData=&restrictUTurns=esriNoDataMatchAny&interpolation=+RSP_BilinearInterpolation&compressionQuality=&bandIds=&mosaicRule=&renderingRule=&f=html

Question: What should format and size contain in this case?

FYI: New POD file validator

I know there's a validator already (https://github.com/project-open-data/json-validator), but I wrote a new one updated for the latest schema:

http://23.21.40.82/pod/validate
(This is a development machine for hub.HealthData.gov since the validator isn't deployed yet. When it's deployed to hub.HealthData.gov I'll update the issue.)

It's part of the ckanext-datajson CKAN extension: https://github.com/HHS/ckanext-datajson/blob/master/ckanext/datajson/datajsonvalidator.py

@seanherron: I used your bureauCode list, pulling live from raw.github.com.

cc @gbinal

(I guess we can close this issue right away since there's no actual issue, but I thought this was a convenient way to reach people watching the repo.)

Add accessDetails field

For a dataset with accessLevel = restricted -- that is, a dataset that is available only to certain people and/or under certain circumstances -- agencies need a way to describe how to access the data. Maybe you call Molly at 555-DATA, maybe you sign a contract, maybe you must be a data scientist associated with a university, etc.

An agency could potentially save time, while increasing transparency, by putting instructions in this accessDetails field instead of fielding numerous individual email inquiries.

Additionally, some datasets with accessLevel = private are still available for use by other government agencies. A data owner could choose to put an explanation for these potential use cases in this field.

Some data sets have such nuanced restrictions that the documents governing access on a per-row and per-column basis number in the hundreds of pages. Creating a whole new schema to deal with this was initially tempting, but today I think being able to drop a URL to said document into this accessDetails field would do the trick.
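A sketch of the proposed optional field on a restricted dataset (the wording and URL are invented):

{
  "accessLevel": "restricted",
  "accessDetails": "Available to researchers affiliated with an accredited university who sign a data use agreement. See http://www.agency.gov/data/access-policy.pdf or contact the data steward."
}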

I am proposing this as an extensible (optional) field.

Add reasonForNonRelease to schema

For datasets with accessLevel = private, an agency has to document with its Office of General Counsel (or other designated entity) why it can't be released.

While this field would not be surfaced in the public data inventory, it should be captured in an additional metadata field for the Enterprise Data Inventory, and required for datasets where accessLevel = private.

The rationale is that simply documenting a reason for not releasing a dataset with OGC does not guarantee that any identifier or other information is collected alongside that reason. This information rightly belongs in the (private) Enterprise Data Inventory.

Agencies could selectively use the recordAccessLevel metadata field proposed earlier to surface private datasets and their reason for not being released.

Not thrilled with my suggested name, feel free to propose a better one!
