fd-dictionaries's People

Contributors

axet, bansp, bendman, denez, fredmaranhao, grimpy101, hjpotter92, humenda, ivan-pan, jimregan, joedalton2, karlb, micha137, s-leroux, tmgreen, vocabulista, vthorey

fd-dictionaries's Issues

XML test suite?

I wonder if it makes sense to have an XML test suite, with examples of what needs to be supported by each potential modification of the Freedict schema, and of what is not legal.

Common ontology for part of speech

Some dictionaries already have some kind of local ontology to reliably identify
part of speech (and potentially gender, etc.). Examples are the WikDict
dictionaries or eng-pol. Most other dictionaries lack this information; there,
the <pos/> tag may contain arbitrary text. For machine-friendly
postprocessing, this should be mapped to an ontology valid for all FreeDict
dictionaries.

Things that need to happen (a sketch of the target markup follows the list):

  • provide common ontology
  • mention in documentation that newly imported / created dictionaries need to use the ontology
  • convert existing dictionaries
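
For illustration, the dictionary text could be kept while a normalized ontology value is carried in @norm, as some dictionaries already do (a sketch; "noun" stands in for whatever identifier the common ontology would define):

    <gramGrp>
       <pos norm="noun">n.</pos>
    </gramGrp>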

tur-eng: no whitespace in equivalents

Example showcasing the issue:

         <entry>
            <form>
               <orth>acıkmak</orth>
            </form>
            <sense n="1">
               <cit type="trans">
                  <quote>besorryabout</quote>
               </cit>
               <cit type="trans">
                  <quote>regret</quote>
               </cit>
            </sense>
            <sense n="2">
               <cit type="trans">
                  <quote>behungry</quote>
               </cit>
            </sense>
         </entry>

The dictionary would also benefit from having all entries ID-ed with xml:id (but that is not a bug report, just an enhancement suggestion).
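
For illustration, a fixed and ID-ed version of the entry above might read (a sketch; the xml:id naming scheme is invented):

    <entry xml:id="acikmak">
       <form>
          <orth>acıkmak</orth>
       </form>
       <sense n="1">
          <cit type="trans">
             <quote>be sorry about</quote>
          </cit>
          <cit type="trans">
             <quote>regret</quote>
          </cit>
       </sense>
       <sense n="2">
          <cit type="trans">
             <quote>be hungry</quote>
          </cit>
       </sense>
    </entry>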

implement <appInfo> to document importer script for imported dictionary

The TEI specification lists an <appInfo/> element, see
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html#HDAPP.
It would be good to use such an element to document the script or program with which
the dictionary was imported. It should contain:

  • application name
  • application version
  • path to script (or its directory) relative to the tools/ directory

The TEI Guidelines give this example:

<appInfo>
 <application version="1.5"
  ident="ImageMarkupTool">
  <label>Image Markup Tool</label>
  <ptr target="#P1"/>
  <ptr target="#P2"/>
 </application>
</appInfo>

The notAfter attribute (referenced at the URL above) could be omitted: the application version is given in the example above, and everything else is tracked by Git.
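
Applied to FreeDict, the header of an imported dictionary could then carry something like the following (a sketch; the ident value, version, and path convention are invented to match the bullet points above):

    <appInfo>
     <application version="0.3"
      ident="wikdict2tei">
      <label>WikDict importer</label>
      <ptr target="importers/wikdict/"/>
     </application>
    </appInfo>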

The following would need to be adapted:

  • the shared/freedict-P5.dtd
  • the shared/freedict-P5.rng
  • tools/xsl/inc/(?)

Please document any progress here.

extend the Freedict schema to handle new demands

This issue rides on the back of #62 (update of the ODD).

One issue is to introduce a fixed list of types for <pron>, with @type="broad" set as the default (see this message for some background and links; the follow-ups provide some further explanation).

Another idea, which came to my mind at some point, is to import some lists of values for the @type attribute from TEI Lex-0, for <usg>, for example. That may involve modifying the existing databases, or the FreeDict-local type values can simply be added to that fixed list.
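
For illustration (hedged; the exact value lists would need to be checked against the TEI Lex-0 specification), a fixed list for <usg> could admit values such as:

    <usg type="domain">medicine</usg>
    <usg type="register">colloquial</usg>

and <pron> could default to the broad transcription:

    <pron type="broad">/fəˈnɛtɪk/</pron>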

I'm guessing that this is all up to the demands of the current databases and to the general standardization practice. So perhaps this ticket can serve to gather a list of demands, or at least a list of references to other issues.

Articles missing in Dutch dictionary

The Dutch dictionary is missing the noun articles. I may contribute to filling in this data, but an editor is needed. Proposed solution: add a POS field (word class), separate the nouns from the database, add the "het" articles, and autofill the remainder with the "de" article. I would start with the "het" nouns, since they make up only around 20% of all Dutch nouns.
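
A sketch of what an enriched noun entry could look like, using the TEI <gen> element (illustrative only; "het" words are neuter, "de" words are common gender):

    <entry>
       <form><orth>huis</orth></form>
       <gramGrp>
          <pos>noun</pos>
          <gen>n</gen><!-- neuter: takes "het" -->
       </gramGrp>
    </entry>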

nld-deu|afr-deu: remove empty translations

nld-deu contains repeated <cit…><gramGrp>…</gramGrp></cit> elements, which are
empty, broken translations. It is fairly easy to remove them programmatically;
help is appreciated. Things to watch out for (a sketch follows the list):

  • removed <cit>s may leave empty <sense>s behind
  • when removing a <sense> node, the explicit numbering can break
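
A minimal XSLT sketch of such a clean-up (assuming the usual TEI namespace; the emptiness tests would need checking against the actual data, and renumbering the remaining senses is left out):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <!-- identity transform: copy everything by default -->
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
      <!-- drop translation cits that contain no quote text -->
      <xsl:template match="tei:cit[not(normalize-space(tei:quote))]"/>
      <!-- drop senses left with no content of their own -->
      <xsl:template match="tei:sense[not(tei:sense) and not(tei:def)
          and not(tei:cit[normalize-space(tei:quote)])]"/>
    </xsl:stylesheet>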

hun-eng/eng-hun: remove binary blob

We currently ship eng-hun and hun-eng as a binary file. Upstream is no longer active, so it would be better to maintain a fork in TEI, keeping the dictionary open for improvements.

Implications of GPL as a license for dictionaries

Hi,

At least some of the dictionaries are licensed under the GPL, a free, non-permissive license initially created for software but also usable for art/text (even if not always encouraged by GNU). Any derivative work has to be licensed under the GPL, as stated in the license:

You may convey a work based on the Program [...] provided that you also meet all of these conditions:

c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply [...] to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.

I understand that one of its implications is that any distributed improvement (fixing translations, adding some) must be distributed under the GPL, but it is less clear to me how this interacts with non-free software using the dictionaries as data to perform some task (e.g., a spelling checker, automated translation, a word generator, ...). I couldn't find any related information in the GitHub wiki or on the website.

The closest entry in the GNU GPL FAQ would be related to plugins and states:

Can I release a nonfree program that's designed to load a GPL-covered plug-in?

If they form a single combined program then the main program must be released under the GPL or a GPL-compatible free software license, and the terms of the GPL must be followed when the main program is distributed for use with these plug-ins.
However, if they are separate works then the license of the plug-in makes no requirements about the main program.

While open to interpretation, my guess is that the GPL would allow the use cases I described earlier (spelling checker, etc.).

A related question can be found on SE, but the most upvoted answer has only one upvote, which makes it unreliable.

Another SE related question also states that GPL does not apply to programs using datasets under GPL.

Licenses are a difficult topic and it's easy to get something wrong. Do you have any input about what can and cannot be done with the dictionaries, and whether there are restrictions attached? I think it would be nice to have a short section about licensing in the documentation, so as not to discourage use of this resource.

Thanks for your great work!

Ergane dictionaries: fix doubled translations

Dictionaries imported from Ergane seem to list many senses with the same
content, for instance:

ge
ge /xə/
1. du, Sie
2. ihr, Sie
3. Sie
4. Sie
5. du

It should be straightforward to remove some of the doubled translations; a rough sketch follows.
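
In XSLT, the simplest case would be to drop a single-translation sense whose quote already appears in an earlier sense of the same entry (assumptions: the usual TEI namespace and the entry structure shown in the tur-eng ticket above; senses with several translations would need more careful handling, and the sense numbering would have to be fixed afterwards):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <!-- identity transform: copy everything by default -->
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
      <!-- drop a sense whose single translation an earlier sense already lists -->
      <xsl:template match="tei:sense[count(tei:cit) = 1 and
          tei:cit/tei:quote = preceding-sibling::tei:sense/tei:cit/tei:quote]"/>
    </xsl:stylesheet>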

LFS needed

I made an eng-rus dictionary; it's a compilation of WikDict and StarDict dictionaries with a lot of manual edits. It has 526,876 headwords, since the StarDict source is comprehensive, so the eng-rus.tei file exceeds 160 MB. How can I push it to the repository? It seems that there is no LFS support in the target repository.
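
If LFS were enabled on the hosting side, the client-side setup would be roughly the following (standard git-lfs commands; whether the FreeDict remote accepts LFS objects is exactly the open question here):

    git lfs install
    git lfs track "*.tei"
    git add .gitattributes eng-rus.tei
    git commit -m "eng-rus: add dictionary via LFS"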

alternative handling of schema and ODD files

This is a potential enhancement for handling schemas and ODD, if the current situation is seen as suboptimal.
Sebastian mentions elsewhere that symlinking the ODDs and schemas in each dictionary directory may still cause problems on some systems (if I understand it correctly).

One way to handle that would be for the source distribution packages to always contain two directories: the directory of the dictionary and the shared/ directory. So, for example, the ara-eng dictionary would be packed as follows:

ara-eng.tgz
        ara-eng/
            ara-eng.tei
            README
            Makefile
            COPYING
            INSTALL
            ...
        shared/
            Freedict-P5.rng
            Freedict-P5.xml

At the same time, the top of each dictionary would have to contain the following lines:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../shared/freedict-P5.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="../shared/freedict-P5.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
...

The <?xml-model?> processing instruction is by now so standard that it should suffice to state the association between the dictionary document and its schema. The INSTALL file would have to contain the command for validating with xmllint, which I think is still unable to read the xml-model instruction (though I may be wrong):

xmllint --noout --relaxng ../shared/freedict-P5.rng lg1-lg2.tei

(The archive listing contains the minimal number of necessary files; some dictionaries would also need the Freedict-ontology; maybe Freedict-P5.dtd would have to be included under shared/ as well, in case some users for some unknown reason needed to use that.)

The above is only relevant if the current setup is suboptimal, of course.

only keep image descriptions with > 3 words

Enhance the JSON AST parser so that image descriptions with fewer than three
words are ignored. The background is that images are often not properly described;
only a replacement word is inserted. E.g. "infobox" does not really describe an
image and therefore does not provide any context to the AI algorithm.

Descriptions with more than three words are likely to be proper descriptions.

Add support for the colloc tag

At the moment, the colloc tag is not supported by the style sheets. It seems as if it were already present as an attribute, though.
I could have a look at changing the style sheets over the weekend, but I would be happy if an XSL expert had time for this little task.

Ah yes, justification: dan-eng :)
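
For reference, a hypothetical dan-eng entry using <colloc> might look like this (a sketch only; TEI's <colloc> marks words that typically co-occur with the headword):

    <entry>
       <form><orth>interessere</orth></form>
       <gramGrp>
          <pos>vb</pos>
          <colloc>sig for</colloc><!-- "interessere sig for": to take an interest in -->
       </gramGrp>
    </entry>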

wikipedia: remove disambiguations

Disambiguation pages contain a lot of phrases, but no sentences. It's better to
remove them to reduce the noise for the AI algorithm.

How to indicate examples inside sense

I'd like to know the standard (or best) way to indicate examples of a term inside a sense.

Here is what I'm doing currently:

<sense n="1" xml:lang="fa">
	<gramGrp>
		<pos norm="adjective">صفت</pos>
	</gramGrp>
	<sense xml:lang="fa">
		<def>
			definition...
			<p class="example">
				<div xml:lang="en">example 1 ...</div>
				<div xml:lang="fa">tranlation of example 1 ...</div>
				<div xml:lang="en">example 2 ...</div>
				<div xml:lang="fa">tranlation of example 2 ...</div>
			</p>
		</def>
	</sense>
</sense>

Or maybe I should use <p type="example"> instead of <p class="example">.

I'd also consider this:

<sense n="1" xml:lang="fa">
	<gramGrp>
		<pos norm="adjective">صفت</pos>
	</gramGrp>
	<sense xml:lang="fa">
		<def>
			definition...
			<br/>
			<spanGrp type="example">
				<span xml:lang="en">example 1 ...</span>
				<br/>
				<span xml:lang="fa">tranlation of example 1 ...</span>
			</spanGrp>
			<br/>
			<spanGrp type="example">
				<span xml:lang="en">example 2 ...</span>
				<br/>
				<span xml:lang="fa">tranlation of example 2 ...</span>
			</spanGrp>
		</def>
	</sense>
</sense>

I suppose <span> is like its HTML counterpart and does not add newlines; that's why I added these <br/> elements.

The xml:lang attributes are specified in order to set the text direction, and possibly the style (text color, font, etc.), when rendering to HTML.

I looked at more than 20 existing FreeDict dictionaries and didn't find anything like this.
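
I did come across <cit type="example"> with nested <quote> elements in the TEI Guidelines, though; would something like the following be the preferred pattern?

<sense n="1" xml:lang="fa">
	<gramGrp>
		<pos norm="adjective">صفت</pos>
	</gramGrp>
	<def>definition...</def>
	<cit type="example">
		<quote xml:lang="en">example 1 ...</quote>
		<cit type="trans">
			<quote xml:lang="fa">translation of example 1 ...</quote>
		</cit>
	</cit>
</sense>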

Thanks in advance

contact Stefani and/or devise an update strategy wrt Apertium

On 06/12/17 10:15, Sebastian Humenda wrote:

Hi Piotr,

isl-eng tells me that Stefani Stoyanova converted the Apertium translation rules
to TEI P5. Is there a chance that you could either dig out the script or, even
better, contact Stefani? It'd be great to import a newer version.

refreshing the schemas: freeze the p5subset, add it to our vc, update the syntax in the ODD

I would like to update the existing ODD in two steps, and this ticket is meant for the first and gentler of them: a rewrite of the current ODD into the current TEI idiom. Ideally this means just a cosmetic change without affecting the extension (i.e., the patterns/grammars defined by the RNG, XSD, and DTD), but in practice the extension is going to be affected due to the changes in the TEI that have happened over the years, so some tinkering may be in order, and a lot of test runs across all the databases.

In doing that, I would like to add two files to our version control, for strictly internal purposes, so that we can trace the changes in the TEI internals without having to investigate the git history of the TEI itself each time.

Let me sketch some background:

  • the TEI ODD mechanism is in essence a customization / documentation mechanism that targets a set of all the definitions encoded by the TEI Guidelines.
  • that set is not present in a cloned TEI repository, but rather gets derived by the make system (via TEI Stylesheets, which is a set of tools that accompanies the TEI Guidelines) and resides in a cryptically named document called p5subset. It is called an 'integrated ODD'.
  • any typical ODD document created with the appropriate TEI tools is meant to tailor the integrated ODD down to a particular purpose: manuscript description, corpus encoding, dictionary encoding, etc.
  • the application of the Freedict ODD to the integrated ODD (p5subset) silently creates something that can be called Freedict integrated ODD; it is not visible to the outside eyes, because it is regenerated each time that the Freedict ODD is manipulated by the TEI Stylesheets.
  • the 'Freedict integrated ODD' is used (or rather: was used) to derive the schema documents: RNG (of primary use for us), but also XSD and DTD (which we provide more or less out of courtesy -- but I can imagine us not providing these two, to avoid having to address the potential issues if someone decides to use those instead of the RNG)
  • I stress the "was used" because, simplifying the history slightly, that happened once, years ago: I ran the TEI tools on the current Freedict ODD and created the three schema documents. Note the crucial issue: they were run against the p5subset as it was defined by the TEI years ago. So while the Freedict ODD hasn't been modified since then, the result of its application to the current p5subset is going to be extensionally different from what was used years ago. I don't think it's a major issue (because we only use a very small subset of the TEI), but it's definitely something to be aware of.
  • one more relevant issue and an argument for 'freezing' the p5subset in our version control is that, if one doesn't have full control of the TEI environment, their ODDs may reference the current 'blessed' TEI ODD, recreated after each release in the TEI Vault, or the current snapshot of the TEI under control of their Jenkins environment, or the local p5subset on the user's hard drive; what I propose reduces this potential complexity and adds a lot of transparency.

A hopefully minor complication is that our RNG was edited by hand since it got derived. Since it is version-controlled, I can extract the modifications and reapply them at the ODD level.

Another hopefully minor issue (but actually part of a larger issue suitable for a separate task in a separate ticket) is the way to make sure that the newly derived RNG is still valid for all the dictionary databases. I seem to recall that the Freedict make system had a 'validate' target, so I imagine that, after regenerating the RNG, I would only have to run make with the specific parameter, and watch for error messages. @humenda , do you sense any trouble in this regard, please?
EDIT: this is now the topic of freedict/tools#28 and I have an interim solution

I mentioned adding two files to the version control. I meant the current p5subset and the Freedict integrated ODD (call it... freedict_p5subset?). The first one freezes the current state of the TEI, so that, in the future, we can diff that. The second is to expose the Freedict integrated ODD for similar comparisons. I could probably live without the latter, since it depends on the former, but it also depends on the TEI stylesheets, and those are under constant development as well. Bottom line: it's far more convenient in case one has to investigate some schema-related issue across time, to have both these files handy, because both of them can only be recreated in the future after tinkering with two very dynamic repositories (TEI Guidelines and TEI Stylesheets).


Envisioned action sequence:

  1. derive the current p5subset (on my disk, against the current snapshot of the TEI and TEI Stylesheets)
  2. freeze the p5subset by adding it to Freedict version control (where? under shared/ or elsewhere?)
  3. derive the current freedict_p5subset by using the current Freedict ODD, with one change: its @source attribute will now point at the p5subset frozen at step (2)
  4. derive the RNG and check if all the databases validate against the RNG
  5. freeze the newly derived freedict_p5subset next to the p5subset; this one should be regenerated by hand after each modification of the Freedict ODD (one has to remember about that); recall: it's frozen for convenience, to shield it from any ensuing modifications in the TEI Stylesheets
  6. rewrite the current Freedict ODD, just for the syntactic sugar
  7. (recurring step) derive the RNG and check if all the databases validate against the RNG
  8. commit the newly created freedict_p5subset just to document any modifications that could have crept in at step (6)
  9. check our RNG version history for potential modifications introduced by hand, and see if they need to be handled at the ODD level (it might be that the underlying TEI has caught up with them, during the years that passed), if an ODD rewrite is necessary, then repeat steps (7) and (8)

At this point, after all the above actions, we should be still at the status quo, except with (a) 2 new files, kept for reproducibility checks and (b) a newer Freedict ODD, ready to be modified further.

Project Description

Is this the new home of the project? If so, where can I find info on the project, for example:

  • Project goals
  • Dictionary format
  • Dictionary reading/writing tools
  • Installation instructions
  • Usage
  • History

I'm not sure if all of these points apply or would be useful, but at least some form of README and project description would be useful and would help potential contributors. If I can find any of this myself (or be pointed in the right direction) then I can get some form of description started.

Evaluate upstream sources for freedict-eng-hin

FreeDict currently provides freedict-eng-hin for users speaking the Hindi language.

This bug report is to track the following:

  • Evaluate current eng-hin data
  • Evaluate currently available upstream projects providing data in this domain. Many have a permissive license.
  • Evaluate other possible Indic language dictionaries.

The current options are:

Please drop a note on this bug report if you are interested in helping with any of the regional Indic languages, Hindi included.

modify the ODD to allow for multiple refs

Just a note for now: it's a rather bad idea to edit the .rng directly, because it's regenerated after each change of the ODD. So when our schema is tightened, there will be a new .rng.

So we simply need a new ODD, more relaxed in this respect.

On 29/11/17 22:16, Sebastian Humenda wrote:

Branch: refs/heads/master
Home: https://github.com/freedict/fd-dictionaries
Commit: d922e50
Author: Sebastian Humenda
Date: 2017-11-29 (Wed, 29 Nov 2017)

Changed paths:
M shared/freedict-P5.rng

Log Message:

freedict-P5.rng: allow multi-licencing

Previously, a licence reference (<ref target…>) was mandatory, but did
not allow multiple licences.
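
The desired target state would allow several licence <ref>s side by side, along these lines (a sketch; element layout follows the existing headers, the URL pair is just an example):

    <availability>
       <p>Available under either of the following licences:</p>
       <ref target="https://www.gnu.org/licenses/gpl-3.0.html">GNU General Public License 3.0</ref>
       <ref target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-Share Alike 3.0</ref>
    </availability>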

Add support for the DictionaryForMID format

DictionaryForMID is a dictionary program featuring a custom format, importer
scripts and also an Android client.
Documentation seems good and both authors of the desktop / mobile version are
active. Their architecture model is described here:
http://dictionarymid.sourceforge.net/development.html.

For the short term, we could easily leverage their dictd2dictionaryformid
conversion process, see
http://dictionarymid.sourceforge.net/DfM-Creator/index.html and the GUI
http://dictionarymid.sourceforge.net/DfM-Creator/gui-DictdToDictionaryForMIDs.html.

For the longer term, I'd like to create template overrides for our style sheets
which would format certain parts of our format differently, so that we could
make use of the formatting features of the DictionaryForMID format. For
instance, example sentences can be formatted separately. That should ideally not be too
much effort, since the format in use is quite close to the dictd format, see
http://dictionarymid.sourceforge.net/faq.html.

Last but not least, the project features its own API to inform about new
dictionaries and more importantly, to push the dictionaries to mobile devices.
See http://dictionarymid.sourceforge.net/ota.php?p=1.

I would like to see this format supported and would love to integrate our
FreeDict API into DictionaryForMID, so that we don't replicate efforts.

jpn-(eng|fra|rus|deu): correct copyright information

According to http://www.edrdg.org/edrdg/licence.html, the copyright holders are:

Copyright over the documents covered by this statement is held by James William BREEN and The Electronic Dictionary
Research and Development Group.

The header of e.g. jpn-eng says:

      <p>Copyright (C) 1994-2016 by various authors listed below.</p>
      <p>Available under the terms of the <ref target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-Share Alike Licence (V3.0)</ref>.</p>
    </availability>

The correct year range would be 2000-2016. Please correct this information with the next import.

eng-hun: check for missing spaces

I don't know Hungarian, but some of the words seem too long. It should be
checked whether these terms lack spaces. Breaking up some of them gives
results on Google; the long forms don't.

Adding words

Hi there, great to see this repository and initiative!

I was hoping to build a CLI tool which would benefit from such translations. However, rather than just taking translations, I'd also hope to add them upstream when they are missing.

I saw the following commit 0f1aa58 and was wondering: what do I need to do to submit a pull request with new words? Just add the words and bump the version?

Do you have an existing command-line tool which can manipulate (add words to) the XML? (EDIT: Oh, I saw in this wiki page that you are working on it!?)

Thanks!

add support for the Stardict format as output format

Stardict is a widely used format and hence worth supporting. There's also a
mobile client, QDict, available which understands this format. Since we don't have a client of our own, it'd be great to make our dictionaries available this way.

jpn-(deu|eng|fra|rus): provide parsed part-of-speech information

For all mentioned dictionaries, part of speech information is annotated like this:

  <note type="pos">adjectival nouns or quasi-adjectives (keiyodoshi)</note>

It would be great if this could be converted to <pos/> elements and if possible, linked against an ontology. This way, the part of speech would be parseable by machines and could be localized for humans.
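
For instance, the note above might become something like this (a sketch; the @norm value is a hypothetical identifier from the common ontology discussed in another ticket):

    <gramGrp>
       <pos norm="adjectivalNoun">adjectival nouns or quasi-adjectives (keiyodoshi)</pos>
    </gramGrp>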
