fd-dictionaries's People

Contributors

axet, bansp, bendman, denez, fredmaranhao, grimpy101, hjpotter92, humenda, ivan-pan, jimregan, joedalton2, karlb, micha137, s-leroux, tmgreen, vocabulista, vthorey

fd-dictionaries's Issues

XML test suite?

I wonder if it makes sense to have an XML test suite, with examples of what needs to be supported by each potential modification of the Freedict schema, and of what is not legal.

Common ontology for part of speech

Some dictionaries already have some kind of local ontology to reliably identify
part of speech (and potentially gender, etc.). Examples are the WikDict
dictionaries or eng-pol. Most other dictionaries lack this information; there,
the <pos/> tag may contain arbitrary text. For machine-friendly
postprocessing, this should be mapped to an ontology valid for all FreeDict
dictionaries.

Things that need to happen (a sketch of the target markup follows the list):

  • provide common ontology
  • mention in documentation that newly imported / created dictionaries need to use the ontology
  • convert existing dictionaries
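
For illustration, the dictionary text could be kept while a normalized ontology value is carried in @norm, as some dictionaries already do (a sketch; "noun" stands in for whatever identifier the common ontology would define):

    <gramGrp>
       <pos norm="noun">n.</pos>
    </gramGrp>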

tur-eng: no whitespace in equivalents

Example showcasing the issue:

         <entry>
            <form>
               <orth>acıkmak</orth>
            </form>
            <sense n="1">
               <cit type="trans">
                  <quote>besorryabout</quote>
               </cit>
               <cit type="trans">
                  <quote>regret</quote>
               </cit>
            </sense>
            <sense n="2">
               <cit type="trans">
                  <quote>behungry</quote>
               </cit>
            </sense>
         </entry>

The dictionary would also benefit from having all entries ID-ed with xml:id (but that is not a bug report, just an enhancement suggestion).
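
For illustration, a fixed and ID-ed version of the entry above might read (a sketch; the xml:id naming scheme is invented):

    <entry xml:id="acikmak">
       <form>
          <orth>acıkmak</orth>
       </form>
       <sense n="1">
          <cit type="trans">
             <quote>be sorry about</quote>
          </cit>
          <cit type="trans">
             <quote>regret</quote>
          </cit>
       </sense>
       <sense n="2">
          <cit type="trans">
             <quote>be hungry</quote>
          </cit>
       </sense>
    </entry>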

implement <appInfo> to document importer script for imported dictionary

The TEI specification lists an <appInfo/> element, see
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html#HDAPP.
It would be good to use such an element to document the script or program with which
the dictionary was imported. It should contain:

  • application name
  • application version
  • path to script (or its directory) relative to the tools/ directory

The TEI Guidelines give this example:

<appInfo>
 <application version="1.5"
  ident="ImageMarkupTool">
  <label>Image Markup Tool</label>
  <ptr target="#P1"/>
  <ptr target="#P2"/>
 </application>
</appInfo>

The notAfter attribute (referenced at the URL above) could be omitted: the application version is given in the example above, and everything else is tracked by Git.
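
Applied to FreeDict, the header of an imported dictionary could then carry something like the following (a sketch; the ident value, version, and path convention are invented to match the bullet points above):

    <appInfo>
     <application version="0.3"
      ident="wikdict2tei">
      <label>WikDict importer</label>
      <ptr target="importers/wikdict/"/>
     </application>
    </appInfo>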

The following would need to be adapted:

  • the shared/freedict-P5.dtd
  • the shared/freedict-P5.rng
  • tools/xsl/inc/(?)

Please document any progress here.

extend the Freedict schema to handle new demands

This issue rides on the back of #62 (update of the ODD).

One issue is to introduce a fixed list of types for <pron>, with @type="broad" set as the default (see this message for some background and links; the follow-ups provide some further explanation).

Another idea, which came to my mind at some point, is to import some lists of values for the @type attribute from TEI Lex-0, for <usg>, for example. That may involve modifying the existing databases, or the FreeDict-local type values can simply be added to that fixed list.
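
For illustration (hedged; the exact value lists would need to be checked against the TEI Lex-0 specification), a fixed list for <usg> could admit values such as:

    <usg type="domain">medicine</usg>
    <usg type="register">colloquial</usg>

and <pron> could default to the broad transcription:

    <pron type="broad">/fəˈnɛtɪk/</pron>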

I'm guessing that this is all up to the demands of the current databases and to the general standardization practice. So perhaps this ticket can serve to gather a list of demands, or at least a list of references to other issues.

Articles missing in Dutch dictionary

The Dutch dictionary is missing the noun articles. I may contribute to filling in this data, but an editor is needed. Proposed solution: add a POS field (word class), separate the nouns from the database, add the "het" articles, and autofill the remainder with the "de" article. I would start with the "het" nouns, since they make up only around 20% of all Dutch nouns.
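
A sketch of what an enriched noun entry could look like, using the TEI <gen> element (illustrative only; "het" words are neuter, "de" words are common gender):

    <entry>
       <form><orth>huis</orth></form>
       <gramGrp>
          <pos>noun</pos>
          <gen>n</gen><!-- neuter: takes "het" -->
       </gramGrp>
    </entry>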

nld-deu|afr-deu: remove empty translations

nld-deu contains repeated <cit…><gramGrp>…</gramGrp></cit> elements, which are
empty, broken translations. It is fairly easy to remove them programmatically;
help is appreciated. Things to watch out for (a sketch follows the list):

  • removed <cit>s may leave empty <sense>s behind
  • when removing a <sense> node, the explicit numbering can break
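
A minimal XSLT sketch of such a clean-up (assuming the usual TEI namespace; the emptiness tests would need checking against the actual data, and renumbering the remaining senses is left out):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <!-- identity transform: copy everything by default -->
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
      <!-- drop translation cits that contain no quote text -->
      <xsl:template match="tei:cit[not(normalize-space(tei:quote))]"/>
      <!-- drop senses left with no content of their own -->
      <xsl:template match="tei:sense[not(tei:sense) and not(tei:def)
          and not(tei:cit[normalize-space(tei:quote)])]"/>
    </xsl:stylesheet>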

hun-eng/eng-hun: remove binary blob

We currently ship eng-hun and hun-eng as a binary file. Upstream is no longer active, so it would be better to maintain a fork in TEI, keeping the dictionary open for improvements.

Implications of GPL as a license for dictionaries

Hi,

At least some of the dictionaries are licensed under the GPL, a free, non-permissive license initially created for software but also usable for art/text (even if not always encouraged by GNU). Any derivative work has to be licensed under the GPL, as stated in the license:

You may convey a work based on the Program [...] provided that you also meet all of these conditions:

c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply [...] to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.

I understand that one of its implications is that any distributed improvement (fixing translations, adding some) must be distributed under the GPL, but it is less clear to me how this interacts with non-free software using the dictionaries as data to perform some task (e.g., a spelling checker, automated translation, a word generator, ...). I couldn't find any related information in the GitHub wiki or on the website.

The closest entry in the GNU GPL FAQ would be related to plugins and states:

Can I release a nonfree program that's designed to load a GPL-covered plug-in?

If they form a single combined program then the main program must be released under the GPL or a GPL-compatible free software license, and the terms of the GPL must be followed when the main program is distributed for use with these plug-ins.
However, if they are separate works then the license of the plug-in makes no requirements about the main program.

While open to interpretation, my guess is that the GPL would allow the use cases I described earlier (spelling checker, etc.).

A related question can be found on SE, but the most upvoted answer has only one upvote, which makes it unreliable.

Another SE related question also states that GPL does not apply to programs using datasets under GPL.

Licenses are a difficult topic and it's easy to get something wrong. Do you have any input about what can and cannot be done with the dictionaries, and whether there are restrictions attached? I think it would be nice to have a short section about licensing in the documentation, so as not to discourage use of this resource.

Thanks for your great work!

Ergane dictionaries: fix doubled translations

Dictionaries imported from Ergane seem to list many senses with the same
content, for instance:

ge
ge /xə/
1. du, Sie
2. ihr, Sie
3. Sie
4. Sie
5. du

It should be straightforward to remove some of the doubled translations; a rough sketch follows.
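
In XSLT, the simplest case would be to drop a single-translation sense whose quote already appears in an earlier sense of the same entry (assumptions: the usual TEI namespace and the entry structure shown in the tur-eng ticket above; senses with several translations would need more careful handling, and the sense numbering would have to be fixed afterwards):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <!-- identity transform: copy everything by default -->
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
      <!-- drop a sense whose single translation an earlier sense already lists -->
      <xsl:template match="tei:sense[count(tei:cit) = 1 and
          tei:cit/tei:quote = preceding-sibling::tei:sense/tei:cit/tei:quote]"/>
    </xsl:stylesheet>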

LFS needed

I made an eng-rus dictionary; it's a compilation of WikDict and StarDict dictionaries with a lot of manual edits. It has 526,876 headwords, since the StarDict source is comprehensive, so the eng-rus.tei file exceeds 160 MB. How can I push it to the repository? It seems that there is no LFS support in the target repository.
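
If LFS were enabled on the hosting side, the client-side setup would be roughly the following (standard git-lfs commands; whether the FreeDict remote accepts LFS objects is exactly the open question here):

    git lfs install
    git lfs track "*.tei"
    git add .gitattributes eng-rus.tei
    git commit -m "eng-rus: add dictionary via LFS"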

alternative handling of schema and ODD files

This is a potential enhancement for handling schemas and ODD, if the current situation is seen as suboptimal.
Sebastian mentions elsewhere that symlinking the ODDs and schemas in each dictionary directory may still cause problems on some systems (if I understand it correctly).

One way to handle that would be for the source distribution packages to always contain two directories: the directory of the dictionary and the shared/ directory. So, for example, the ara-eng dictionary would be packed as follows:

ara-eng.tgz
        ara-eng/
            ara-eng.tei
            README
            Makefile
            COPYING
            INSTALL
            ...
        shared/
            Freedict-P5.rng
            Freedict-P5.xml

At the same time, the top of each dictionary would have to contain the following lines:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../shared/freedict-P5.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="../shared/freedict-P5.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
...

The <?xml-model?> processing instruction is by now so standard that it should suffice to state the association between the dictionary document and its schema. The INSTALL file would have to contain the command for validating with xmllint, which I think is still unable to read the xml-model instruction (though I may be wrong):

xmllint --noout --relaxng ../shared/freedict-P5.rng lg1-lg2.tei

(The archive listing contains the minimal number of necessary files; some dictionaries would also need the Freedict-ontology; maybe Freedict-P5.dtd would have to be included under shared/ as well, in case some users for some unknown reason needed to use that.)

The above is only relevant if the current setup is suboptimal, of course.

only keep image descriptions with > 3 words

Enhance the JSON AST parser so that image descriptions with fewer than three
words are ignored. The background is that images are often not properly described;
only a replacement word is inserted. E.g. "infobox" does not really describe an
image and therefore does not provide any context to the AI algorithm.

Descriptions with more than three words are likely to be proper descriptions.

Add support for the colloc tag

At the moment, the colloc tag is not supported by the style sheets. It seems as if it were already present as an attribute, though.
I could have a look at changing the style sheets over the weekend, but I would be happy if an XSL expert had time for this little task.

Ah yes, justification: dan-eng :)
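
For reference, a hypothetical dan-eng entry using <colloc> might look like this (a sketch only; TEI's <colloc> marks words that typically co-occur with the headword):

    <entry>
       <form><orth>interessere</orth></form>
       <gramGrp>
          <pos>vb</pos>
          <colloc>sig for</colloc><!-- "interessere sig for": to take an interest in -->
       </gramGrp>
    </entry>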

wikipedia: remove disambiguations

Disambiguation pages contain a lot of phrases, but no sentences. It's better to
remove them to reduce the noise for the AI algorithm.

How to indicate examples inside sense

I'd like to know the standard (or best) way to indicate examples of a term inside a sense.

Here is what I'm doing currently:

<sense n="1" xml:lang="fa">
	<gramGrp>
		<pos norm="adjective">صفت</pos>
	</gramGrp>
	<sense xml:lang="fa">
		<def>
			definition...
			<p class="example">
				<div xml:lang="en">example 1 ...</div>
				<div xml:lang="fa">tranlation of example 1 ...</div>
				<div xml:lang="en">example 2 ...</div>
				<div xml:lang="fa">tranlation of example 2 ...</div>
			</p>
		</def>
	</sense>
</sense>

Or maybe I should use <p type="example"> instead of <p class="example">.

I'd also consider this:

<sense n="1" xml:lang="fa">
	<gramGrp>
		<pos norm="adjective">صفت</pos>
	</gramGrp>
	<sense xml:lang="fa">
		<def>
			definition...
			<br/>
			<spanGrp type="example">
				<span xml:lang="en">example 1 ...</span>
				<br/>
				<span xml:lang="fa">tranlation of example 1 ...</span>
			</spanGrp>
			<br/>
			<spanGrp type="example">
				<span xml:lang="en">example 2 ...</span>
				<br/>
				<span xml:lang="fa">tranlation of example 2 ...</span>
			</spanGrp>
		</def>
	</sense>
</sense>

I suppose <span> is like its HTML counterpart and does not add newlines; that's why I added these <br/> elements.

The xml:lang attributes are specified in order to set the text direction, and possibly the style (text color, font, etc.), when rendering to HTML.

I looked at more than 20 existing FreeDict dictionaries and didn't find anything like this.
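
I did come across <cit type="example"> with nested <quote> elements in the TEI Guidelines, though; would something like the following be the preferred pattern?

<sense n="1" xml:lang="fa">
	<gramGrp>
		<pos norm="adjective">صفت</pos>
	</gramGrp>
	<def>definition...</def>
	<cit type="example">
		<quote xml:lang="en">example 1 ...</quote>
		<cit type="trans">
			<quote xml:lang="fa">translation of example 1 ...</quote>
		</cit>
	</cit>
</sense>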

Thanks in advance

contact Stefani and/or devise an update strategy wrt Apertium

On 06/12/17 10:15, Sebastian Humenda wrote:

Hi Piotr,

isl-eng tells me that Stefani Stoyanova converted the Apertium translation rules
to TEI P5. Is there a chance that you could either dig out the script or, even
better, contact Stefani? It'd be great to import a newer version.

refreshing the schemas: freeze the p5subset, add it to our vc, update the syntax in the ODD

I would like to update the existing ODD in two steps, and this ticket is meant for the first and gentler of them: a rewrite of the current ODD into the current TEI idiom. Ideally this means just a cosmetic change without affecting the extension (i.e., the patterns/grammars defined by the RNG, XSD, and DTD), but in practice the extension is going to be affected due to the changes in the TEI that have happened over the years, so some tinkering may be in order, and a lot of test runs across all the databases.

In doing that, I would like to add two files to our version control, for strictly internal purposes, so that we can trace the changes in the TEI internals without having to investigate the git history of the TEI itself each time.

Let me sketch some background:

  • the TEI ODD mechanism is in essence a customization / documentation mechanism that targets a set of all the definitions encoded by the TEI Guidelines.
  • that set is not present in a cloned TEI repository, but rather gets derived by the make system (via TEI Stylesheets, which is a set of tools that accompanies the TEI Guidelines) and resides in a cryptically named document called p5subset. It is called an 'integrated ODD'.
  • any typical ODD document created with the appropriate TEI tools is meant to tailor the integrated ODD down to a particular purpose: manuscript description, corpus encoding, dictionary encoding, etc.
  • the application of the Freedict ODD to the integrated ODD (p5subset) silently creates something that can be called Freedict integrated ODD; it is not visible to the outside eyes, because it is regenerated each time that the Freedict ODD is manipulated by the TEI Stylesheets.
  • the 'Freedict integrated ODD' is used (or rather: was used) to derive the schema documents: RNG (of primary use for us), but also XSD and DTD (which we provide more or less out of courtesy -- but I can imagine us not providing these two, to avoid having to address the potential issues if someone decides to use those instead of the RNG)
  • I stress the "was used" because, simplifying the history slightly, that happened once, years ago: I ran the TEI tools on the current Freedict ODD and created the three schema documents. Note the crucial issue: they were run against the p5subset as it was defined by the TEI years ago. So while the Freedict ODD hasn't been modified since then, the result of its application to the current p5subset is going to be extensionally different from what was used years ago. I don't think it's a major issue (because we only use a very small subset of the TEI), but it's definitely something to be aware of.
  • one more relevant issue and an argument for 'freezing' the p5subset in our version control is that, if one doesn't have full control of the TEI environment, their ODDs may reference the current 'blessed' TEI ODD, recreated after each release in the TEI Vault, or the current snapshot of the TEI under control of their Jenkins environment, or the local p5subset on the user's hard drive; what I propose reduces this potential complexity and adds a lot of transparency.

A hopefully minor complication is that our RNG was edited by hand since it got derived. Since it is version-controlled, I can extract the modifications and reapply them at the ODD level.

Another hopefully minor issue (but actually part of a larger issue suitable for a separate task in a separate ticket) is the way to make sure that the newly derived RNG is still valid for all the dictionary databases. I seem to recall that the Freedict make system had a 'validate' target, so I imagine that, after regenerating the RNG, I would only have to run make with the specific parameter, and watch for error messages. @humenda , do you sense any trouble in this regard, please?
EDIT: this is now the topic of freedict/tools#28 and I have an interim solution

I mentioned adding two files to the version control. I meant the current p5subset and the Freedict integrated ODD (call it... freedict_p5subset?). The first one freezes the current state of the TEI, so that, in the future, we can diff that. The second is to expose the Freedict integrated ODD for similar comparisons. I could probably live without the latter, since it depends on the former, but it also depends on the TEI stylesheets, and those are under constant development as well. Bottom line: it's far more convenient in case one has to investigate some schema-related issue across time, to have both these files handy, because both of them can only be recreated in the future after tinkering with two very dynamic repositories (TEI Guidelines and TEI Stylesheets).


Envisioned action sequence:

  1. derive the current p5subset (on my disk, against the current snapshot of the TEI and TEI Stylesheets)
  2. freeze the p5subset by adding it to Freedict version control (where? under shared/ or elsewhere?)
  3. derive the current freedict_p5subset by using the current Freedict ODD, with one change: its @source attribute will now point at the p5subset frozen at step (2)
  4. derive the RNG and check if all the databases validate against the RNG
  5. freeze the newly derived freedict_p5subset next to the p5subset; this one should be regenerated by hand after each modification of the Freedict ODD (one has to remember about that); recall: it's frozen for convenience, to shield it from any ensuing modifications in the TEI Stylesheets
  6. rewrite the current Freedict ODD, just for the syntactic sugar
  7. (recurring step) derive the RNG and check if all the databases validate against the RNG
  8. commit the newly created freedict_p5subset just to document any modifications that could have crept in at step (6)
  9. check our RNG version history for potential modifications introduced by hand, and see if they need to be handled at the ODD level (it might be that the underlying TEI has caught up with them, during the years that passed), if an ODD rewrite is necessary, then repeat steps (7) and (8)

At this point, after all the above actions, we should be still at the status quo, except with (a) 2 new files, kept for reproducibility checks and (b) a newer Freedict ODD, ready to be modified further.

Project Description

Is this the new home of the project? If so, where can I find info on the project, for example:

  • Project goals
  • Dictionary format
  • Dictionary reading/writing tools
  • Installation instructions
  • Usage
  • History

I'm not sure if all of these points apply or would be useful, but at least some form of README and project description would be useful and would help potential contributors. If I can find any of this myself (or be pointed in the right direction) then I can get some form of description started.

Evaluate upstream sources for freedict-eng-hin

FreeDict currently provides freedict-eng-hin for users speaking the Hindi language.

This bug report is to track the following:

  • Evaluate current eng-hin data
  • Evaluate currently available upstream projects providing data in this domain. Many have a permissive license.
  • Evaluate other possible Indic language dictionaries.

The current options are:

Please drop a note on this bug report if you are interested in helping with any of the regional Indic languages, Hindi included.

modify the ODD to allow for multiple refs

Just a note for now: it's a rather bad idea to edit the .rng directly, because it's regenerated after each change of the ODD. So when our schema is tightened, there will be a new .rng.

So we simply need a new ODD, more relaxed in this respect.

On 29/11/17 22:16, Sebastian Humenda wrote:

Branch: refs/heads/master
Home: https://github.com/freedict/fd-dictionaries
Commit: d922e50
Author: Sebastian Humenda
Date: 2017-11-29 (Wed, 29 Nov 2017)

Changed paths:
M shared/freedict-P5.rng

Log Message:

freedict-P5.rng: allow multi-licencing

Previously, a licence reference (<ref target…>) was mandatory, but did
not allow multiple licences.
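
The desired target state would allow several licence <ref>s side by side, along these lines (a sketch; element layout follows the existing headers, the URL pair is just an example):

    <availability>
       <p>Available under either of the following licences:</p>
       <ref target="https://www.gnu.org/licenses/gpl-3.0.html">GNU General Public License 3.0</ref>
       <ref target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-Share Alike 3.0</ref>
    </availability>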

Add support for the DictionaryForMID format

DictionaryForMID is a dictionary program featuring a custom format, importer
scripts and also an Android client.
Documentation seems good and both authors of the desktop / mobile version are
active. Their architecture model is described here:
http://dictionarymid.sourceforge.net/development.html.

For the short term, we could easily leverage their dictd2dictionaryformid
conversion process, see
http://dictionarymid.sourceforge.net/DfM-Creator/index.html and the GUI
http://dictionarymid.sourceforge.net/DfM-Creator/gui-DictdToDictionaryForMIDs.html.

For the longer term, I'd like to create template overrides for our style sheets
which would format certain parts of our format differently, so that we could
make use of the formatting features of the DictionaryForMID format. For
instance, example sentences can be formatted separately. That should ideally not be too
much effort, since the format in use is quite close to the dictd format, see
http://dictionarymid.sourceforge.net/faq.html.

Last but not least, the project features its own API to inform about new
dictionaries and more importantly, to push the dictionaries to mobile devices.
See http://dictionarymid.sourceforge.net/ota.php?p=1.

I would like to see this format supported and would love to integrate our
FreeDict API into DictionaryForMID, so that we don't replicate efforts.

jpn-(eng|fra|rus|deu): correct copyright information

According to http://www.edrdg.org/edrdg/licence.html, the copyright holders are:

Copyright over the documents covered by this statement is held by James William BREEN and The Electronic Dictionary
Research and Development Group.

The header of e.g. jpn-eng says:

      <p>Copyright (C) 1994-2016 by various authors listed below.</p>
      <p>Available under the terms of the <ref target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-Share Alike Licence (V3.0)</ref>.</p>
    </availability>

The correct year range would be 2000-2016. Please correct this information with the next import.

eng-hun: check for missing spaces

I don't know Hungarian, but some of the words seem too long. It should be
checked whether these terms lack spaces. Breaking up some of them gives
results on Google; the long forms don't.

Adding words

Hi there, great to see this repository and initiative!

I was hoping to build a CLI tool which would benefit from such translations. However, rather than just taking translations, I'd also hope to add them upstream when they are missing.

I saw the following commit 0f1aa58 and was wondering: what do I need to do to submit a pull request with new words? Just add the words and bump the version?

Do you have an existing command-line tool which can manipulate (add words to) the XML? (EDIT: Oh, I saw in this wiki page that you are working on it!?)

Thanks!

add support for the Stardict format as output format

Stardict is a widely used format and hence worth supporting. There's also a
mobile client, QDict, available which understands this format. Since we don't have a client of our own, it'd be great to make our dictionaries available this way.

jpn-(deu|eng|fra|rus): provide parsed part-of-speech information

For all mentioned dictionaries, part of speech information is annotated like this:

  <note type="pos">adjectival nouns or quasi-adjectives (keiyodoshi)</note>

It would be great if this could be converted to <pos/> elements and if possible, linked against an ontology. This way, the part of speech would be parseable by machines and could be localized for humans.
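
For instance, the note above might become something like this (a sketch; the @norm value is a hypothetical identifier from the common ontology discussed in another ticket):

    <gramGrp>
       <pos norm="adjectivalNoun">adjectival nouns or quasi-adjectives (keiyodoshi)</pos>
    </gramGrp>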
