ud_hindi-hdtb's Introduction

Summary

The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB), created at IIIT Hyderabad, India.

Introduction

The Hindi Universal Dependency Treebank was automatically converted from the Hindi Dependency Treebank (HDTB), which is part of an ongoing effort to create multi-layered treebanks for Hindi and Urdu. HDTB is developed at IIIT-H, India.

Acknowledgments

The project is supported by an NSF grant (Award Number: CNS 0751202; CFDA Number: 47.070).

Any publication reporting the work done using this data should cite the following references:

Riyaz Ahmad Bhat, Rajesh Bhatt, Annahita Farudi, Prescott Klassen, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, Ashwini Vaidya, Sri Ramagurumurthy Vishnu, and Fei Xia. The Hindi/Urdu Treebank Project. In the Handbook of Linguistic Annotation (edited by Nancy Ide and James Pustejovsky), Springer Press

@InCollection{bhathindi,
  Title                    = {The Hindi/Urdu Treebank Project},
  Author                   = {Bhat, Riyaz Ahmad and Bhatt, Rajesh and Farudi, Annahita and Klassen, Prescott and Narasimhan, Bhuvana and Palmer, Martha and Rambow, Owen and Sharma, Dipti Misra and Vaidya, Ashwini and Vishnu, Sri Ramagurumurthy and others},
  Booktitle                = {Handbook of Linguistic Annotation},
  Publisher                = {Springer Press}
}

Martha Palmer, Rajesh Bhatt, Bhuvana Narasimhan, Owen Rambow, Dipti Misra Sharma, Fei Xia. Hindi Syntax: Annotating Dependency, Lexical Predicate-Argument Structure, and Phrase Structure. In the Proceedings of the 7th International Conference on Natural Language Processing, ICON-2009, Hyderabad, India, Dec 14-17, 2009.

@inproceedings{palmer2009hindi,
  title={Hindi syntax: Annotating dependency, lexical predicate-argument structure, and phrase structure},
  author={Palmer, Martha and Bhatt, Rajesh and Narasimhan, Bhuvana and Rambow, Owen and Sharma, Dipti Misra and Xia, Fei},
  booktitle={The 7th International Conference on Natural Language Processing},
  pages={14--17},
  year={2009}
}

Changelog

  • 2024-05-15 v2.14
    • Added transliteration of lemmas and sentences.
    • Verbal lemma is infinitive instead of stem.
  • 2023-05-15 v2.12
    • Fixed: Finite verbs head clauses, hence ccomp instead of obj.
    • Two sentences split after exclamation mark.
  • 2022-11-15 v2.11
    • Fixed a number of various validation errors.
  • 2021-05-15 v2.8
    • Normalized lemmatization of punctuation symbols: LEMMA=FORM.
  • 2019-05-15 v2.4
    • Fixed some violations of the guidelines reported by the new validator.
  • 2018-04-15 v2.2
    • Repository renamed from UD_Hindi to UD_Hindi-HDTB.
  • 2017-03-01 v2.0
    • Converted to UD v2 guidelines (Dan Zeman).
  • 2015-11-01 v1.2
    • Initial release (Riyaz Bhat and Dan Zeman).
=== Machine-readable metadata =================================================
Data available since: UD v1.2
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: news
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Bhat, Riyaz Ahmad; Zeman, Daniel
Contributing: here
Contact: [email protected]
===============================================================================

ud_hindi-hdtb's Issues

`ccomp` should be used instead of `obj` when the target is a finite verb

I briefly looked at this corpus and something looked off to me about the use of obj. Disclaimer: I don't speak Hindi.

When a finite verb depends on another word, we should expect a clausal relation, like ccomp, xcomp, csubj, acl or advcl. See https://universaldependencies.org/u/overview/complex-syntax.html#subordination

I see that there are a lot of finite verbs governed by obj in HDTB (762 occurrences):
http://universal.grew.fr/?custom=6418627bd7d40

In other corpora (e.g. the GSD treebanks of other languages), it is quite rare (around 10 occurrences in total). There it corresponds to reported segments, titles, parentheses or "nominalized" verbs (perhaps like the English gerundive). I doubt that this is the case for all of these occurrences.

Even if this corpus is based on Pāṇinian grammar, I would suggest sticking to the UD guidelines and replacing most obj with ccomp or xcomp when the target is a finite verb.
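
The kind of check described in this issue can also be run offline on a CoNLL-U file. The following is a minimal sketch, assuming that "finite verb" means a token carrying `VerbForm=Fin` in the FEATS column; it is not the grew-match query linked above. The helper names and the toy sentence are invented for illustration.

```python
from typing import Dict, List, Tuple


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn a FEATS string such as 'Mood=Sub|VerbForm=Fin' into a dict."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|") if "=" in item)


def finite_verbs_with_deprel(conllu_text: str, deprel: str = "obj") -> List[Tuple[str, str, str]]:
    """Return (sent_id, token_id, form) for finite verbs attached via `deprel`."""
    hits = []
    sent_id = ""
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id ="):
            sent_id = line.split("=", 1)[1].strip()
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if parse_feats(cols[5]).get("VerbForm") == "Fin" and cols[7] == deprel:
            hits.append((sent_id, cols[0], cols[1]))
    return hits


# Toy input, invented for illustration (not taken from HDTB).
TOY = "\n".join([
    "# sent_id = toy-1",
    "\t".join(["1", "she", "she", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "said", "say", "VERB", "_", "Tense=Past|VerbForm=Fin", "0", "root", "_", "_"]),
    "\t".join(["3", "he", "he", "PRON", "_", "_", "4", "nsubj", "_", "_"]),
    "\t".join(["4", "left", "leave", "VERB", "_", "Tense=Past|VerbForm=Fin", "2", "obj", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for hit in finite_verbs_with_deprel(TOY):
        print(hit)
```

Running the same function with `deprel="ccomp"` before and after a fix would give a rough picture of how many attachments were relabelled.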

Highly non-projective trees in HDTB

Recently I've been reviewing cases of extremely non-projective sentences in UD, as detected by my automated scripts. A number of these have turned out to be annotation errors (at least in other languages).

I'd like to share my findings for HDTB, in case they might be useful for improving the treebank. Unfortunately I don't speak Hindi; it's quite possible that these are all false positives. Feel free to ignore/close this issue if it is not helpful.

The following 7 sentences in HDTB have highly non-projective structure: train-s5175, train-s5931, train-s6774, train-s6895, train-s9724, train-s10635, test-s125.

Visualizations below:


train-s5175

चूंकि राजपुरा पटियाला में पड़ता है, इसलिए स्थायी तौर पर पटियाला पुलिस इसकी सुरक्षा में लग जाती है, जो इस बस को राज्य के अंतिम छोर यानि बाघा बार्डर तक छोड़कर आती है, जबकि एस्कार्ट जिप्सी प्रत्येक जिले में बदलती रहती है ।

[Figure: dependency tree visualization of train-s5175]

train-s5931

इसके अलावा पाकिस्तानी विदेश मंत्री खुर्शीद कसूरी ने पाकिस्तान में भारतीय उच्चायुक्त शिवशंकर मेनन से मुलाकात के दौरान बुधवार को कहा कि ईरान से भारत तक आने वाली गैस पाइपलाइन के पाकिस्तान में पड़ रहे हिस्से की सुरक्षा की जिम्मेदारी लेने में उन्हें बहुत खुशी होगी ।

[Figure: dependency tree visualization of train-s5931]

train-s6774

ललित सूरी द्वारा २९ जुलाई को रखे गये सांसदों के वेतन, भत्ते और पेंशन अधिनियम २००४ में संशोधन संबंधी निजी विधेयक पर हुई चर्चा के दौरान पक्ष - विपक्ष के सांसदों ने इस बात का खूब रोना रोया कि उनका मासिक वेतन महज १२ हजार रुपये है जो सरकारी क्लर्क की तनख्वाह से भी कम है ।

[Figure: dependency tree visualization of train-s6774]

train-s6895

उन्होंने बोर्ड के मुख्य कार्यकारी अधिकारी (सीईओ) को निर्देश दिया कि वे केंद्र व राज्य सरकार के संबंधित मंत्रालयों और यूपी सुन्नी सेंट्रल वक्फ बोर्ड के अधिकारियों की बैठक बुलाकर इस वक्फ संपत्ति ताजमहल की प्रबंध योजना तय करें ।

[Figure: dependency tree visualization of train-s6895]

train-s9724

भाजपा ने केंद्र और केरल सरकार पर भारतीय ड्राइवर एम. आर. कुट्टी की हत्या के लिए जिम्मेदार तालिबान के साथ निपटने में ढिलाई बरतने का आरोप लगाया है ।

[Figure: dependency tree visualization of train-s9724]

train-s10635

प्रधानमंत्री ने कहा कि नागरिक और सैन्य कार्यक्रम को अलग - अलग करने की योजना परमाणु सिद्धांत के अनुरूप होगी, जिसमें विश्वसनीय न्यूनतम परमाणु प्रतिरोधक क्षमता की बात कही गई है ।

[Figure: dependency tree visualization of train-s10635]

test-s125

सिन्हा ने कहा कि उनकी योजना है कि केंद्र या राज्य सरकार की मदद के बिना हिमलिंग की संरक्षा और तीर्थ - यात्रियों की सुविधा पर २० करोड़ रुपये खर्च किए जाएँ ।

[Figure: dependency tree visualization of test-s125]
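
One rough way to screen a treebank for sentences like these is to count pairs of crossing arcs in each tree. The sketch below is an assumption about how such a screen could look, not the script used for this issue; `crossing_arcs` is a hypothetical helper, the arc from the artificial root is ignored, and the head lists are invented toy input.

```python
from typing import List, Tuple

Arc = Tuple[int, int]


def crossing_arcs(heads: List[int]) -> List[Tuple[Arc, Arc]]:
    """Return all pairs of crossing arcs in one sentence.

    heads[i] is the HEAD value of token i + 1 (1-based ids, 0 = root).
    Each arc is stored as (left_position, right_position).
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1) if h != 0]
    crossings = []
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            # Two arcs cross when exactly one endpoint of one arc lies
            # strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                crossings.append(((a, b), (c, d)))
    return crossings


if __name__ == "__main__":
    projective = [2, 0, 2]         # toy tree without crossing arcs
    non_projective = [3, 4, 0, 3]  # toy tree: arcs 1-3 and 2-4 cross
    print(len(crossing_arcs(projective)))
    print(crossing_arcs(non_projective))
```

Sorting sentences by the number of crossing pairs would put the most suspicious trees first; the threshold for "highly non-projective" is left open here.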

? labeled with - lemma in train-s592

# sent_id = train-s592
# text = कैसे पहुँचें?
1       कैसे      कैसे      PRON    WQ      PronType=Int    2       obl     _       ChunkId=RBP|ChunkType=head|Translit=kaise
2       पहुँचें     पहुँच     VERB    VM      Mood=Sub|VerbForm=Fin|Voice=Act 0       root    _       SpaceAfter=No|Vib=एं|Tam=eM|ChunkId=VGF|ChunkType=head|Stype=interrogative|Translit=pahum̃ceṁ
3       ?       -       PUNCT   SYM     _       2       punct   _       ChunkId=BLK|ChunkType=head|Translit=?

The - lemma looks incorrect to me, although I don't actually know anything about Hindi.
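
Cases like this can be listed automatically. The sketch below assumes the convention LEMMA=FORM for punctuation (the one named in the v2.8 changelog entry) and reports every PUNCT token that deviates from it. The function name is hypothetical, and the sample is adapted from the train-s592 excerpt above with the XPOS, FEATS and MISC columns blanked.

```python
from typing import List, Tuple


def punct_lemma_mismatches(conllu_text: str) -> List[Tuple[str, str, str, str]]:
    """Return (sent_id, token_id, form, lemma) for PUNCT tokens with LEMMA != FORM."""
    hits = []
    sent_id = ""
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id ="):
            sent_id = line.split("=", 1)[1].strip()
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword token ranges and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, lemma, upos = cols[1], cols[2], cols[3]
        if upos == "PUNCT" and lemma != form:
            hits.append((sent_id, cols[0], form, lemma))
    return hits


# Shortened version of the excerpt quoted in this issue.
SAMPLE = "\n".join([
    "# sent_id = train-s592",
    "\t".join(["1", "कैसे", "कैसे", "PRON", "_", "_", "2", "obl", "_", "_"]),
    "\t".join(["2", "पहुँचें", "पहुँच", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "?", "-", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for hit in punct_lemma_mismatches(SAMPLE):
        print(hit)
```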

Exclamation points not splitting sentences

Not sure if this is an intentional feature of the treebank or a feature of the Hindi language, but there are two exclamation points that do not split sentences (and none that do).

# sent_id = train-s842
# text = चोरी और उस पर सीनाजोरी! लक्मे इंडिया फैशन वीक में इस बार कुछ ऐसा ही चल रहा है ।
# sent_id = train-s852
# text = उन्होंने कहा! दरअसल मैंने इसे एक पत्रिका में रानी मुखर्जी को एक फिल्म के प्रचार के सिलसिले में पहने देखा था तो उन्होंने आरोप साबित करने के लिए प्रेसकांफ्रेंस बुलाई थी ।

There are no ! in dev or test.
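
A check along these lines can be scripted against the `# text` comments. The sketch below is a heuristic under stated assumptions: a sentence is flagged when an exclamation mark is followed, after optional whitespace, by a word character. The function name and the toy input are invented for illustration.

```python
import re
from typing import List, Tuple

# An exclamation mark followed (after optional whitespace) by a word character.
INTERNAL_MARK = re.compile(r"!\s*\w")


def sentences_with_internal_mark(conllu_text: str) -> List[Tuple[str, str]]:
    """Return (sent_id, text) for sentences with '!' before further words."""
    hits = []
    sent_id = ""
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id ="):
            sent_id = line.split("=", 1)[1].strip()
        elif line.startswith("# text ="):
            text = line.split("=", 1)[1].strip()
            if INTERNAL_MARK.search(text):
                hits.append((sent_id, text))
    return hits


# Toy input, invented for illustration (token lines omitted).
TOY = "\n".join([
    "# sent_id = toy-1",
    "# text = Stop! Then go on .",
    "",
    "# sent_id = toy-2",
    "# text = Stop!",
    "",
])

if __name__ == "__main__":
    for sent_id, text in sentences_with_internal_mark(TOY):
        print(sent_id, text)
```

Only the comment lines are inspected, so the result says nothing about where the tree itself should be split.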

Excessive use of dep label

While referencing HDTB for some other work, I've noticed several potential points of improvement. One is excessive use of the dep label, which should only be used for very strange or basically unanalysable grammatical relations. A grew-match query for this label indicates "more than 1000 results found in 20.71% of the corpus". This is probably way too high!

I will try to categorise these instances in this meta-issue and work towards picking better labels for them over time (in some cases this might need issues in the multilingual repo). Some of the major categories are below (these will be expanded into sub-issues as I try to make sense of them):

  • Emphatic postpositional particles such as भी, ही, तो. Probably best described by advmod:emph, except for when तो is used to introduce a clause (like English so) which can just be advmod.
  • Approximators like क़रीब "near", लगभग "around". Best is advmod.
  • भर as in दिन-भर "all day". This is a really neat and strange construction because भर is postposed! Perhaps amod or det or case but needs more analysis.
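
To support this kind of categorisation, the word forms attached via dep can be tallied from a CoNLL-U file. The following is a minimal sketch with a hypothetical `dep_stats` helper and invented toy input; the sentence share it reports may not be computed the same way as the grew-match percentage quoted above.

```python
from collections import Counter
from typing import Tuple


def dep_stats(conllu_text: str) -> Tuple[int, int, Counter]:
    """Return (sentence count, sentences with at least one dep, form counts)."""
    sentences = 0
    with_dep = 0
    forms: Counter = Counter()
    in_sentence = False
    has_dep = False
    # The extra empty string flushes the last sentence if the file
    # does not end with a blank line.
    for line in conllu_text.splitlines() + [""]:
        if not line.strip():
            if in_sentence:
                sentences += 1
                with_dep += int(has_dep)
            in_sentence = False
            has_dep = False
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword token ranges and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        in_sentence = True
        if cols[7] == "dep":
            has_dep = True
            forms[cols[1]] += 1
    return sentences, with_dep, forms


# Toy input, invented for illustration (not taken from HDTB).
TOY = "\n".join([
    "# sent_id = toy-1",
    "\t".join(["1", "वह", "वह", "PRON", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "भी", "भी", "PART", "_", "_", "1", "dep", "_", "_"]),
    "\t".join(["3", "आया", "आना", "VERB", "_", "_", "0", "root", "_", "_"]),
    "",
    "# sent_id = toy-2",
    "\t".join(["1", "ठीक", "ठीक", "ADJ", "_", "_", "0", "root", "_", "_"]),
    "",
])

if __name__ == "__main__":
    total, flagged, forms = dep_stats(TOY)
    print(total, flagged, forms.most_common(5))
```

Sorting the counter by frequency would surface the largest categories first, such as the particles and approximators listed above.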
