Comments (14)

rsepassi commented on May 12, 2024

Thanks @VIGS25 @natashafn and @danbri!

Let's try it on MNIST first. Here's something to start you off. You can get started in a Colab notebook and share it here, or send over a PR.

import tensorflow_datasets as tfds

def dataset_schema_from_builder(builder):
  """Builds JSON-LD from DatasetBuilder."""
  info = builder.info
  ...

mnist_builder = tfds.builder("mnist")
dataset_schema_from_builder(mnist_builder)
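One way the stub above could be filled in is sketched below. It is only a sketch: it assumes `builder.info` exposes `name`, `description`, and `homepage` attributes, and it uses a `SimpleNamespace` stand-in instead of a real `tfds.builder("mnist")` so that it runs without downloading anything. The description text in the stand-in is made up for illustration.

```python
import json
from types import SimpleNamespace


def dataset_schema_from_builder(builder):
    """Builds a schema.org Dataset description (JSON-LD) as a dict."""
    info = builder.info
    return {
        "@context": "http://schema.org/",
        "@type": "Dataset",
        # The attribute names below are assumptions about DatasetInfo.
        "name": info.name,
        "description": info.description,
        "url": info.homepage,
    }


# Hypothetical stand-in for tfds.builder("mnist"); values are illustrative.
fake_builder = SimpleNamespace(
    info=SimpleNamespace(
        name="mnist",
        description="The MNIST database of handwritten digits.",
        homepage="http://yann.lecun.com/exdb/mnist/",
    )
)

schema = dataset_schema_from_builder(fake_builder)
print(json.dumps(schema, indent=2))
```

Returning a plain dict and leaving serialization to `json.dumps` keeps the quoting and escaping out of the function itself.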

from datasets.

rsepassi commented on May 12, 2024

Thanks @danbri!

Yeah, let's keep building on this. Here's an editable Colab, or you can send a PR to add functionality to the schema_org function.

vsomnath commented on May 12, 2024

I don't have experience with schema.org, but would definitely be interested in learning about it and picking up this issue.

natashafn commented on May 12, 2024

Take a look at the markup helper: https://www.google.com/webmasters/markup-helper/?hl=en

Select datasets, point to your page, highlight various parts of the description, and get JSON-LD generated for you. You can then embed it in your HTML for the page.

Then we will need to check if those pages get indexed: do you have an example page? We can just search for it on google.com -- if it appears there, then the answer is yes, and the schema.org markup will be picked up on the next crawl.

danbri commented on May 12, 2024

Happy to help too. Perhaps we could collaborate in a Colab notebook if someone could start us off with a few instantiated DatasetBuilder objects?

vsomnath commented on May 12, 2024

@rsepassi: I went through the code for DatasetInfo, and it covers only a small fraction of the properties listed in the schema.org or Google Dataset type docs.

Is it alright if only that subset is used for generating the schema from the builder? As a starting point, it does look alright though!
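Using only the subset that DatasetInfo provides could be sketched as a mapping that skips anything missing. The attribute names on the left (`name`, `description`, `homepage`, `citation`) are assumptions about DatasetInfo, and the stand-in object is hypothetical.

```python
from types import SimpleNamespace

# Assumed DatasetInfo attribute -> schema.org Dataset property.
FIELD_MAP = {
    "name": "name",
    "description": "description",
    "homepage": "url",
    "citation": "citation",
}


def schema_subset(info):
    """Emits only the properties that the info object actually has set."""
    schema = {"@context": "http://schema.org/", "@type": "Dataset"}
    for attr, prop in FIELD_MAP.items():
        value = getattr(info, attr, None)
        if value:  # skip missing or empty values
            schema[prop] = value
    return schema


# Hypothetical info object with no homepage and an empty citation.
info = SimpleNamespace(name="mnist", description="Handwritten digits.", citation="")
subset = schema_subset(info)
print(sorted(subset))
```

More properties can be added later by extending `FIELD_MAP`, without touching the function.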

danbri commented on May 12, 2024

I made a quick start here, https://colab.research.google.com/drive/1b5h_9L8JMbZDasbgnPe6ZonTdoSwG6wn#scrollTo=eshIB6OyrVH5 ... with links to your API, our markup specs etc. It emits a basic schema.org description, maybe we can add more.

rsepassi commented on May 12, 2024

What we want is a nice page like this one.

The associated JSON-LD from the source is below. Note that we can put the contents of that nice page in here:

        <script type="application/ld+json">{"@context":"http://schema.org/","@type":"Dataset","name":"IMDB Movie Reviews Dataset","description":"### Context\n\nThis is the IMDB dataset that contains the movie reviews.\n\n### Content\n\nLarge Movie Review Dataset v1.0\n\nOverview\n\nThis dataset contains movie reviews along with their associated binary\nsentiment polarity labels. It is intended to serve as a benchmark for\nsentiment classification. This document outlines how the dataset was\ngathered, and how to use the files provided. \n\nDataset \n\nThe core dataset contains 50,000 reviews split evenly into 25k train\nand 25k test sets. The overall distribution of labels is balanced (25k\npos and 25k neg). We also include an additional 50,000 unlabeled\ndocuments for unsupervised learning. \n\nIn the entire collection, no more than 30 reviews are allowed for any\ngiven movie because reviews for the same movie tend to have correlated\nratings. Further, the train and test sets contain a disjoint set of\nmovies, so no significant performance is obtained by memorizing\nmovie-unique terms and their associated with observed labels.  In the\nlabeled train/test sets, a negative review has a score &lt;= 4 out of 10,\nand a positive review has a score &gt;= 7 out of 10. Thus reviews with\nmore neutral ratings are not included in the train/test sets. In the\nunsupervised set, reviews of any rating are included and there are an\neven number of reviews &gt; 5 and &lt;= 5.\n\nFiles\n\nThere are two top-level directories [train/, test/] corresponding to\nthe training and test sets. Each contains [pos/, neg/] directories for\nthe reviews with binary labels positive and negative. Within these\ndirectories, reviews are stored in text files named following the\nconvention [[id]_[rating].txt] where [id] is a unique id and [rating] is\nthe star rating for that review on a 1-10 scale. 
For example, the file\n[test/pos/200_8.txt] is the text for a positive-labeled test set\nexample with unique id 200 and star rating 8/10 from IMDb. The\n[train/unsup/] directory has 0 for all ratings because the ratings are\nomitted for this portion of the dataset.\n\nWe also include the IMDb URLs for each review in a separate\n[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will\nhave its URL on line 200 of this file. Due the ever-changing IMDb, we\nare unable to link directly to the review, but only to the movie's\nreview page.\n\nIn addition to the review text files, we include already-tokenized bag\nof words (BoW) features that were used in our experiments. These \nare stored in .feat files in the train/test directories. Each .feat\nfile is in LIBSVM format, an ascii sparse-vector format for labeled\ndata.  The feature indices in these files start from 0, and the text\ntokens corresponding to a feature index is found in [imdb.vocab]. So a\nline with 0:7 in a .feat file means the first word in [imdb.vocab]\n(the) appears 7 times in that review.\n\nLIBSVM page for details on .feat file format:\nhttp://www.csie.ntu.edu.tw/~cjlin/libsvm/\n\nWe also include [imdbEr.txt] which contains the expected rating for\neach token in [imdb.vocab] as computed by (Potts, 2011). The expected\nrating is a good way to get a sense for the average polarity of a word\nin the dataset.\n\nCiting the dataset\n\nWhen using this dataset please cite our ACL 2011 paper which\nintroduces it. This paper also contains classification results which\nyou may want to compare against.\n\n\n@InProceedings{maas-EtAl:2011:ACL-HLT2011,\n  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  
and  Potts, Christopher},\n  title     = {Learning Word Vectors for Sentiment Analysis},\n  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},\n  month     = {June},\n  year      = {2011},\n  address   = {Portland, Oregon, USA},\n  publisher = {Association for Computational Linguistics},\n  pages     = {142--150},\n  url       = {http://www.aclweb.org/anthology/P11-1015}\n}\n\nReferences\n\nPotts, Christopher. 2011. On the negativity of negation. In Nan Li and\nDavid Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,\n636-659.\n\nContact\n\nFor questions/comments/corrections please contact Andrew Maas\[email protected]\n\n\n### Acknowledgements\n\nThe Cover Photo is by Krists Luhaers on Unsplash.\nLink to image: https://unsplash.com/photos/AtPWnYNDJnM\n\n\n### Inspiration\n\nUnable to find a IMDB movie reviews dataset in a proper format. I uploaded this. Hope this helps you!","url":"https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset","sameAs":"https://staging.kaggle.com/iarunava/imdb-movie-reviews-dataset","version":1,"keywords":["analysis &gt; nlp","data type &gt; text data"],"license":{"@type":"CreativeWork","name":"Unknown","url":""},"identifier":"38712","includedInDataCatalog":{"@type":"DataCatalog","name":"Kaggle","url":"https://www.kaggle.com"},"creator":{"@type":"Person","name":"Arunava","url":"https://www.kaggle.com/iarunava","image":"https://storage.googleapis.com/kaggle-avatars/thumbnails/1687181-kg.jpg"},"distribution":[{"@type":"DataDownload","requiresSubscription":true,"encodingFormat":"zip","fileFormat":"zip","contentUrl":"https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset/downloads/imdb-movie-reviews-dataset.zip/1","contentSize":"119502810 
bytes"},{"@type":"DataDownload","requiresSubscription":true,"encodingFormat":"zip","fileFormat":"zip","contentUrl":"https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset/downloads/aclImdb.zip/1","contentSize":"119502810 bytes"}],"commentCount":0,"dateModified":"2018-07-25T08:11:18.9","discussionUrl":"https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset/discussion","alternateName":"Perform Sentiment Analysis and Text Classification using this Dataset","isAccessibleForFree":true,"thumbnailUrl":"https://storage.googleapis.com/kaggle-datasets-images/38712/58955/49146b29c6738b9e9efe2cb3d25aee59/dataset-card.jpg?t=2018-07-25-08-52-24","interactionStatistic":[{"@type":"InteractionCounter","interactionType":"http://schema.org/CommentAction","userInteractionCount":0},{"@type":"InteractionCounter","interactionType":"http://schema.org/DownloadAction","userInteractionCount":1791},{"@type":"InteractionCounter","interactionType":"http://schema.org/ViewAction","userInteractionCount":10412},{"@type":"InteractionCounter","interactionType":"http://schema.org/LikeAction","userInteractionCount":29}]}</script>

rsepassi commented on May 12, 2024

Maybe we put exactly what's in datasets.md for each dataset into the description.

natashafn commented on May 12, 2024

That would work! Description can have markdown in it (which is exactly what the Kaggle datasets do).

Another suggestion that would be easy to implement:
includedInDataCatalog....
This way, you control what shows up on the link button (the catalog name)
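A sketch of what that could look like, mirroring the `includedInDataCatalog` block in the Kaggle JSON-LD earlier in the thread. The helper name and the catalog name and URL are placeholders, not something decided here.

```python
import json


def with_catalog(schema, catalog_name, catalog_url):
    """Returns a copy of schema with an includedInDataCatalog entry added."""
    result = dict(schema)
    result["includedInDataCatalog"] = {
        "@type": "DataCatalog",
        "name": catalog_name,
        "url": catalog_url,
    }
    return result


base = {"@context": "http://schema.org/", "@type": "Dataset", "name": "mnist"}
# Placeholder catalog values, chosen only for illustration.
marked_up = with_catalog(base, "Example Catalog", "https://example.com/catalog")
print(json.dumps(marked_up, indent=2))
```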

danbri commented on May 12, 2024

@rsepassi One glitch in your Colab version, "url": http://yann.lecun.com/exdb/mnist/, ... that URL needs double (and not 'single') quotes around it.
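This kind of quoting glitch is easy to hit when JSON is assembled by hand or printed via Python's `str()`. A small illustration, not taken from the Colab itself, of why serializing with `json.dumps` avoids it:

```python
import json

schema = {"@type": "Dataset", "url": "http://yann.lecun.com/exdb/mnist/"}

as_repr = str(schema)         # Python repr: single-quoted, not valid JSON
as_json = json.dumps(schema)  # JSON: strings are always double-quoted

print(as_repr)
print(as_json)

# Only the json.dumps output parses back as JSON.
try:
    json.loads(as_repr)
    repr_is_valid = True
except json.JSONDecodeError:
    repr_is_valid = False
```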

rsepassi commented on May 12, 2024

Done, thanks!

danbri commented on May 12, 2024

Anything more I can help with here?

rsepassi commented on May 12, 2024

I think we're good @danbri. But need to find the time to do this.

Need a new markdown page per dataset that includes the JSON-LD html tag.
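A rough sketch of generating such a page follows. The function name, page layout, and example values are all assumptions for illustration; the only part taken from the thread is the `<script type="application/ld+json">` tag seen in the Kaggle example.

```python
import json


def dataset_page(title, body_markdown, schema):
    """Returns markdown for one dataset page with embedded JSON-LD."""
    payload = json.dumps(schema, indent=2)
    # Escape "</" so the payload cannot close the script tag early.
    payload = payload.replace("</", "<\\/")
    return (
        f"# {title}\n\n"
        f"{body_markdown}\n\n"
        f'<script type="application/ld+json">\n{payload}\n</script>\n'
    )


schema = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "mnist",
    "url": "http://yann.lecun.com/exdb/mnist/",
}
page = dataset_page("mnist", "Handwritten digit images.", schema)
print(page)
```

The `body_markdown` argument is where the per-dataset text from datasets.md could go, and the same text could also be reused as the schema's `description`.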
