Comments (14)
Thanks @VIGS25, @natashafn, and @danbri!
Let's try it on MNIST first. Here's something to start you off. You can get started in a Colab notebook and share it here, or send over a PR.
```python
import tensorflow_datasets as tfds

def dataset_schema_from_builder(builder):
  """Builds JSON-LD from DatasetBuilder."""
  info = builder.info
  ...

mnist_builder = tfds.builder("mnist")
dataset_schema_from_builder(mnist_builder)
```
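One possible fleshing-out of that stub, sketched on plain values so it runs without tensorflow_datasets installed. The mapping from DatasetInfo-style fields to schema.org properties here is an assumption, not the tfds API:

```python
import json

def dataset_schema_from_info(name, description, url, version):
    """Builds a schema.org Dataset JSON-LD string.

    Sketch only: the parameters mirror fields commonly found on
    tfds.core.DatasetInfo, but the exact attribute names there are
    an assumption.
    """
    schema = {
        "@context": "http://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "version": version,
    }
    return json.dumps(schema, indent=2)

# MNIST-like example values:
print(dataset_schema_from_info(
    "mnist",
    "The MNIST database of handwritten digits.",
    "http://yann.lecun.com/exdb/mnist/",
    "1.0.0",
))
```

Inside `dataset_schema_from_builder` one would pull these values off `builder.info` instead of passing them explicitly.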
Thanks @danbri!
Yeah, let's keep building on this. Here's an editable Colab, or send a PR to add functionality to the schema_org function.
I don't have experience with schema.org, but would definitely be interested in learning about it and picking up this issue.
Take a look at the markup helper: https://www.google.com/webmasters/markup-helper/?hl=en
Select datasets, point to your page, highlight various parts of the description, and get JSON-LD generated for you. You can then embed it in your HTML for the page.
Then we will need to check if those pages get indexed: do you have an example page? We can just search for it on google.com -- if it appears there, then the answer is yes, and the schema.org markup will be picked up on the next crawl.
Happy to help too. Perhaps we could collaborate in a Colab notebook if someone could start us off with a few instantiated DatasetBuilder objects?
@rsepassi: I went through the code for DatasetInfo, and only a small fraction of the properties mentioned in the schema.org or Google Dataset type docs are covered there.
Is it alright if only that subset is used for generating the schema from the builder? As a starting point, it does look alright.
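To make that subset concrete, here is a hedged sketch of which DatasetInfo-style fields could map to schema.org Dataset properties. The attribute names on the left are assumptions about DatasetInfo, not confirmed tfds API, and the mapping is illustrative:

```python
# Hypothetical mapping: DatasetInfo-style attribute -> schema.org property.
# The left-hand attribute names are assumptions, not confirmed tfds API.
INFO_TO_SCHEMA = {
    "name": "name",
    "description": "description",
    "homepage": "url",           # dataset homepage
    "version": "version",
    "citation": "citation",
    "size_in_bytes": "distribution.contentSize",
}

for attr, prop in INFO_TO_SCHEMA.items():
    print(f"{attr} -> {prop}")
```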
I made a quick start here: https://colab.research.google.com/drive/1b5h_9L8JMbZDasbgnPe6ZonTdoSwG6wn#scrollTo=eshIB6OyrVH5 ... with links to your API, our markup specs, etc. It emits a basic schema.org description; maybe we can add more.
What we want is a nice page like this one.
The associated JSON-LD from the source is below. Note that we can put the contents of that nice page in here:
<script type="application/ld+json">{"@context":"http://schema.org/","@type":"Dataset","name":"IMDB Movie Reviews Dataset","description":"### Context\n\nThis is the IMDB dataset that contains the movie reviews.\n\n### Content\n\nLarge Movie Review Dataset v1.0\n\nOverview\n\nThis dataset contains movie reviews along with their associated binary\nsentiment polarity labels. It is intended to serve as a benchmark for\nsentiment classification. This document outlines how the dataset was\ngathered, and how to use the files provided. \n\nDataset \n\nThe core dataset contains 50,000 reviews split evenly into 25k train\nand 25k test sets. The overall distribution of labels is balanced (25k\npos and 25k neg). We also include an additional 50,000 unlabeled\ndocuments for unsupervised learning. \n\nIn the entire collection, no more than 30 reviews are allowed for any\ngiven movie because reviews for the same movie tend to have correlated\nratings. Further, the train and test sets contain a disjoint set of\nmovies, so no significant performance is obtained by memorizing\nmovie-unique terms and their associated with observed labels. In the\nlabeled train/test sets, a negative review has a score <= 4 out of 10,\nand a positive review has a score >= 7 out of 10. Thus reviews with\nmore neutral ratings are not included in the train/test sets. In the\nunsupervised set, reviews of any rating are included and there are an\neven number of reviews > 5 and <= 5.\n\nFiles\n\nThere are two top-level directories [train/, test/] corresponding to\nthe training and test sets. Each contains [pos/, neg/] directories for\nthe reviews with binary labels positive and negative. Within these\ndirectories, reviews are stored in text files named following the\nconvention [[id]_[rating].txt] where [id] is a unique id and [rating] is\nthe star rating for that review on a 1-10 scale. 
For example, the file\n[test/pos/200_8.txt] is the text for a positive-labeled test set\nexample with unique id 200 and star rating 8/10 from IMDb. The\n[train/unsup/] directory has 0 for all ratings because the ratings are\nomitted for this portion of the dataset.\n\nWe also include the IMDb URLs for each review in a separate\n[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will\nhave its URL on line 200 of this file. Due the ever-changing IMDb, we\nare unable to link directly to the review, but only to the movie's\nreview page.\n\nIn addition to the review text files, we include already-tokenized bag\nof words (BoW) features that were used in our experiments. These \nare stored in .feat files in the train/test directories. Each .feat\nfile is in LIBSVM format, an ascii sparse-vector format for labeled\ndata. The feature indices in these files start from 0, and the text\ntokens corresponding to a feature index is found in [imdb.vocab]. So a\nline with 0:7 in a .feat file means the first word in [imdb.vocab]\n(the) appears 7 times in that review.\n\nLIBSVM page for details on .feat file format:\nhttp://www.csie.ntu.edu.tw/~cjlin/libsvm/\n\nWe also include [imdbEr.txt] which contains the expected rating for\neach token in [imdb.vocab] as computed by (Potts, 2011). The expected\nrating is a good way to get a sense for the average polarity of a word\nin the dataset.\n\nCiting the dataset\n\nWhen using this dataset please cite our ACL 2011 paper which\nintroduces it. This paper also contains classification results which\nyou may want to compare against.\n\n\n@InProceedings{maas-EtAl:2011:ACL-HLT2011,\n author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. 
and Potts, Christopher},\n title = {Learning Word Vectors for Sentiment Analysis},\n booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},\n month = {June},\n year = {2011},\n address = {Portland, Oregon, USA},\n publisher = {Association for Computational Linguistics},\n pages = {142--150},\n url = {http://www.aclweb.org/anthology/P11-1015}\n}\n\nReferences\n\nPotts, Christopher. 2011. On the negativity of negation. In Nan Li and\nDavid Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,\n636-659.\n\nContact\n\nFor questions/comments/corrections please contact Andrew Maas\[email protected]\n\n\n### Acknowledgements\n\nThe Cover Photo is by Krists Luhaers on Unsplash.\nLink to image: https://unsplash.com/photos/AtPWnYNDJnM\n\n\n### Inspiration\n\nUnable to find a IMDB movie reviews dataset in a proper format. I uploaded this. Hope this helps you!","url":"https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset","sameAs":"https://staging.kaggle.com/iarunava/imdb-movie-reviews-dataset","version":1,"keywords":["analysis > nlp","data type > text data"],"license":{"@type":"CreativeWork","name":"Unknown","url":""},"identifier":"38712","includedInDataCatalog":{"@type":"DataCatalog","name":"Kaggle","url":"https://www.kaggle.com"},"creator":{"@type":"Person","name":"Arunava","url":"https://www.kaggle.com/iarunava","image":"https://storage.googleapis.com/kaggle-avatars/thumbnails/1687181-kg.jpg"},"distribution":[{"@type":"DataDownload","requiresSubscription":true,"encodingFormat":"zip","fileFormat":"zip","contentUrl":"https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset/downloads/imdb-movie-reviews-dataset.zip/1","contentSize":"119502810 bytes"},{"@type":"DataDownload","requiresSubscription":true,"encodingFormat":"zip","fileFormat":"zip","contentUrl":"https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset/downloads/aclImdb.zip/1","contentSize":"119502810 
bytes"}],"commentCount":0,"dateModified":"2018-07-25T08:11:18.9","discussionUrl":"https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset/discussion","alternateName":"Perform Sentiment Analysis and Text Classification using this Dataset","isAccessibleForFree":true,"thumbnailUrl":"https://storage.googleapis.com/kaggle-datasets-images/38712/58955/49146b29c6738b9e9efe2cb3d25aee59/dataset-card.jpg?t=2018-07-25-08-52-24","interactionStatistic":[{"@type":"InteractionCounter","interactionType":"http://schema.org/CommentAction","userInteractionCount":0},{"@type":"InteractionCounter","interactionType":"http://schema.org/DownloadAction","userInteractionCount":1791},{"@type":"InteractionCounter","interactionType":"http://schema.org/ViewAction","userInteractionCount":10412},{"@type":"InteractionCounter","interactionType":"http://schema.org/LikeAction","userInteractionCount":29}]}</script>
Maybe we put exactly what's in datasets.md for each dataset into the description.
That would work! Description can have markdown in it (which is exactly what the Kaggle datasets do).
Another suggestion that would be easy to implement:
includedInDataCatalog....
This way, you control what shows up on the link button (the catalog name).
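For illustration, a minimal includedInDataCatalog block. The catalog name and URL below are assumptions, chosen only to show which field controls the link-button label:

```python
import json

# The "name" inside includedInDataCatalog is what shows up on the
# link button; the values below are illustrative assumptions.
snippet = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "mnist",
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "TensorFlow Datasets",
        "url": "https://www.tensorflow.org/datasets",
    },
}
print(json.dumps(snippet, indent=2))
```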
@rsepassi: One glitch in your Colab version: "url": http://yann.lecun.com/exdb/mnist/, ... that URL needs double (not single) quotes around it.
Done, thanks!
Anything more I can help with here?
I think we're good @danbri, but we need to find the time to do this.
We need a new markdown page per dataset that includes the JSON-LD <script> tag.
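A minimal sketch of generating such a page, assuming a hypothetical helper that takes the dataset name, its markdown description, and the JSON-LD dict (raw HTML like a <script> tag is legal inside markdown):

```python
import json

def dataset_markdown_page(name, description, jsonld):
    # Hypothetical helper: emits a per-dataset markdown page that embeds
    # the JSON-LD in a <script type="application/ld+json"> tag.
    return (
        f"# {name}\n\n"
        f"{description}\n\n"
        '<script type="application/ld+json">'
        + json.dumps(jsonld)
        + "</script>\n"
    )

page = dataset_markdown_page(
    "mnist",
    "The MNIST database of handwritten digits.",
    {"@context": "http://schema.org/", "@type": "Dataset", "name": "mnist"},
)
print(page)
```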