freedict / fd-dictionaries
hand-written dictionaries from the FreeDict project
Home Page: http://freedict.org/
I wonder if it makes sense to have an XML test suite, with examples of what needs to be supported by each potential modification of the Freedict schema, and of what is not legal.
Some dictionaries already have some kind of local ontology to reliably identify
part of speech (and potentially gender, etc.). Examples are the WikDict
dictionaries or eng-pol. Most other dictionaries lack this information; there,
the <pos/> tag may contain arbitrary text. For machine-friendly
postprocessing, this should be mapped to an ontology valid for all FreeDict
dictionaries.
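A rough sketch of what such a mapping step could look like. The table entries and the target tag names (borrowed from the Universal POS inventory) are purely illustrative, not an agreed FreeDict ontology:

```python
# Sketch: normalize free-text <pos> values to a shared tag set.
# The mapping table and target names are illustrative assumptions.
POS_MAP = {
    "noun": "NOUN", "n": "NOUN",
    "verb": "VERB", "v": "VERB", "vt": "VERB", "vi": "VERB",
    "adjective": "ADJ", "adj": "ADJ",
    "adverb": "ADV", "adv": "ADV",
}

def normalize_pos(raw):
    """Return a canonical tag, or None if the text is not recognized.

    Unrecognized values would be collected for manual review rather
    than guessed at.
    """
    key = (raw or "").strip().lower().rstrip(".")
    return POS_MAP.get(key)
```

Entries whose `<pos>` text does not match the table would be logged for a human to extend the mapping, so the ontology grows dictionary by dictionary.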
Things to happen:
Hello,
Please, can you add the ESPDIC (Esperanto-English Dictionary) from Paul Denisowski? The dictionary uses the CC BY 3.0 licence, has 63k words and is available in a machine friendly format. See http://www.denisowski.org/Esperanto/ESPDIC/espdic_readme.htm .
Thanks for your cool project!
Regards, Andy
Example showcasing the issue:
<entry>
<form>
<orth>acıkmak</orth>
</form>
<sense n="1">
<cit type="trans">
<quote>besorryabout</quote>
</cit>
<cit type="trans">
<quote>regret</quote>
</cit>
</sense>
<sense n="2">
<cit type="trans">
<quote>behungry</quote>
</cit>
</sense>
</entry>
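Entries like this could be flagged automatically. A minimal sketch, assuming the usual TEI namespace; the length threshold is an arbitrary heuristic for "a single token this long without spaces is probably fused":

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def suspicious_quotes(tei_path, max_len=10):
    """Yield <quote> texts that consist of one very long alphabetic
    token -- a cheap heuristic for translations whose spaces were
    lost (e.g. 'besorryabout'). max_len is an arbitrary threshold."""
    root = ET.parse(tei_path).getroot()
    for quote in root.iter(TEI + "quote"):
        text = (quote.text or "").strip()
        if " " not in text and len(text) > max_len and text.isalpha():
            yield text
```

This only finds candidates; restoring the missing spaces would still need a wordlist or manual review.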
The dictionary would also benefit from having all entries ID-ed with xml:id
(but that is not a bug report, just an enhancement suggestion).
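For the enhancement part, adding ids is mechanical. A sketch with Python's standard ElementTree; the `entry-N` id scheme is a placeholder (a real run would probably derive ids from the headword):

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialized as xml:id

def add_entry_ids(tree):
    """Assign a sequential xml:id to every <entry> that lacks one."""
    ET.register_namespace("", TEI_NS)
    for n, entry in enumerate(tree.getroot().iter(f"{{{TEI_NS}}}entry"), 1):
        entry.attrib.setdefault(XML_ID, f"entry-{n}")
    return tree
```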
The TEI specification lists a <appInfo/>
tag, see
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html#HDAPP.
It would be good to have such a tag to document the script or program with which
the dictionary was imported. It should contain:
<appInfo>
<application version="1.5"
ident="ImageMarkupTool">
<label>Image Markup Tool</label>
<ptr target="#P1"/>
<ptr target="#P2"/>
</application>
</appInfo>
The notAfter attribute (referenced in the URL above) could be omitted: the application version is given in the above example, and everything else is tracked by Git.
The following would need to be adapted:
Please document any progress here.
https://github.com/dtolpin/harkavy
No licensing statement there, but I think the data is in the public domain.
This issue rides on the back of #62 (update of the ODD).
One issue is to introduce a fixed list of types for <pron>, with @type="broad" set as the default (see this message for some background and links; the follow-ups provide some further explanation).
Yet another idea, which came to my mind at some point, is to import some lists of values of the @type attribute from TEI Lex-0, for "usg", for example. That may involve modifying the existing databases, or the FreeDict-local type values could simply be added to that fixed list.
I'm guessing that this is all up to the demands of the current databases and to the general standardization practice. So perhaps this ticket can serve to gather a list of demands, or at least a list of references to other issues.
The Dutch dictionary is missing the noun articles. I may contribute to filling in this data, though an editor is needed. Proposed solution: add a POS (part of speech) field, separate the nouns from the database, add the "het" articles, and autofill the remainder with the "de" article. I would start with the "het" nouns, since they make up only around 20% of all Dutch nouns.
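The proposed autofill could be as simple as the sketch below: every noun gets "de" unless it is in a curated "het" list. `HET_NOUNS` here is a tiny stand-in; a real run would load the full, reviewed list from a data file:

```python
# Illustrative subset only -- the real list would be curated separately.
HET_NOUNS = {"huis", "boek", "kind", "meisje", "water"}

def dutch_article(noun):
    """Return the definite article for a Dutch noun, defaulting to
    'de' for anything not in the curated 'het' list."""
    return "het" if noun.lower() in HET_NOUNS else "de"
```

Defaulting to "de" matches the proposal: since "het" nouns are the minority (~20%), only they need to be listed explicitly.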
Apparently upstream is still active (www.ferheng.org, or for the German version www.ferheng.org/de). A new import would be appropriate.
nld-deu contains repeated <cit…><gramGrp>…</gramGrp></cit> elements, which are empty
and broken translations. It is fairly easy to remove them programmatically; help
is appreciated. Things to watch out for:
ATM, only <cit type="trans"/>
is supported.
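A sketch of the removal, assuming "empty" means a trans-cit without any <quote> text; ElementTree needs a parent map because elements do not know their parent:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def drop_empty_cits(tree):
    """Remove <cit type="trans"> elements that carry no <quote> text."""
    root = tree.getroot()
    parents = {child: parent for parent in root.iter() for child in parent}
    for cit in list(root.iter(TEI + "cit")):
        if cit.get("type") != "trans":
            continue  # only trans-cits are in scope, per the note above
        quote = cit.find(TEI + "quote")
        if quote is None or not (quote.text or "").strip():
            parents[cit].remove(cit)
    return tree
```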
We currently ship eng-hun and hun-eng in a binary file. Upstream is not active anymore and hence it would be better to maintain a fork in TEI, so that the dictionary is open for improvements.
Hi,
At least some of the dictionaries are licensed under GPL, a free non-permissive license initially created for software but that can also be used for art/text (even if not always encouraged by GNU). Any derivative work has to be licensed under GPL, as stated in the license:
You may convey a work based on the Program [...] provided that you also meet all of these conditions:
c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply [...] to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.
I understand that one of its implications is that any distributed improvement (fixing translations, adding some) shall be distributed under the GPL, but it is less clear to me how this interacts with non-free software using the dictionaries as data to perform some task (e.g., spell checking, automated translation, word generation, ...). I couldn't find any related information in the GitHub wiki or on the website.
The closest entry in the GNU GPL FAQ would be related to plugins and states:
Can I release a nonfree program that's designed to load a GPL-covered plug-in?
If they form a single combined program then the main program must be released under the GPL or a GPL-compatible free software license, and the terms of the GPL must be followed when the main program is distributed for use with these plug-ins.
However, if they are separate works then the license of the plug-in makes no requirements about the main program.
While open to interpretation, my guess is that the GPL would allow the use cases I described earlier (spell checking, etc.).
A related question can be found on SE, but the most upvoted answer has only one upvote, which makes it unreliable.
Another SE related question also states that GPL does not apply to programs using datasets under GPL.
Licenses are a difficult topic and it's easy to get something wrong, do you have any input about what can and cannot be done with the dictionaries, and if there are restrictions attached? I think it would be nice to have a short section about licenses in the documentation so as not to discourage use of this resource.
Thanks for your great work!
Dictionaries imported from Ergane seem to list many senses with the same
content, as for instance:
ge
ge /xə/
1. du, Sie
2. ihr, Sie
3. Sie
4. Sie
5. du
It should be straightforward to remove some of the doubled translations.
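For instance, a case-insensitive deduplication that keeps the first occurrence and the original order would already collapse the repeated "Sie" senses above:

```python
def dedup_translations(translations):
    """Collapse repeated translations while keeping original order,
    comparing case-insensitively and ignoring surrounding whitespace."""
    seen = set()
    out = []
    for t in translations:
        key = t.strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out
```

Whether near-duplicates such as "du, Sie" vs. "Sie" should also be merged is a separate editorial decision; this only removes exact repeats.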
The teiaddphonetics script is heavily outdated (it uses P4) and depends on MBROLA, which is non-free. eSpeak could be used instead, which is also capable of generating IPA.
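A minimal sketch of calling eSpeak NG for transcriptions. The flags used (`-v` for the voice, `-q` for quiet, `--ipa` for IPA output) exist in current espeak-ng builds, but older espeak versions may differ, so treat the exact invocation as an assumption to verify:

```python
import shutil
import subprocess

def ipa_command(word, voice):
    """Build an espeak-ng invocation that prints an IPA transcription
    of `word` without playing audio."""
    return ["espeak-ng", "-v", voice, "-q", "--ipa", word]

def transcribe(word, voice="en"):
    """Return the IPA transcription, or None if espeak-ng is not
    installed (the caller decides how to handle that)."""
    if shutil.which("espeak-ng") is None:
        return None
    result = subprocess.run(ipa_command(word, voice),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```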
https://clarin.oeaw.ac.at/ccv/vle
Initial steps: I was able to install and run the most recent version of the client (Version 2.1.1utf8) under Kubuntu 14.04.
"To get a working environment you will need a server. The server-side scripts (php + mysql) are also available and easy to setup". This sounds like we should look for server space, probably also hosting a BaseX instance.
I will ask about some manual, etc.
Hi,
[1] shows how to create bilingual dictionaries using OmegaWiki as a source; [2], for example, uses these dictionaries. Could you add them?
Regards, Andy
[1] http://wiki.apertium.org/wiki/Getting_bilingual_dictionaries_from_OmegaWiki
[2] http://dictionarymid.sourceforge.net/ http://dictionarymid.sourceforge.net/dictionaries/dictsBinlingualsOmegaWiki.html
I made an eng-rus dictionary; it's a compilation of WikDict and StarDict dictionaries with a lot of manual edits. It has 526876 headwords because the StarDict dictionary is comprehensive, so the eng-rus.tei file exceeds 160 MB. How can I push it to the repository? It seems that there is no LFS support in the target repository.
This is a potential enhancement for handling schemas and ODD, if the current situation is seen as suboptimal.
Sebastian mentions elsewhere that symlinking the ODDs and schemas in each dictionary directory may still cause problems on some systems (if I understand it correctly).
One way to handle that would be for the source distribution packages to always contain two directories: the directory of the dictionary and the shared/ directory. So, for example, the ara-eng dictionary would be packed as follows:
ara-eng.tgz
ara-eng/
ara-eng.tei
README
Makefile
COPYING
INSTALL
...
shared/
Freedict-P5.rng
Freedict-P5.xml
At the same time, the top of each dictionary would have to contain the following lines:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../shared/freedict-P5.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="../shared/freedict-P5.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
...
The <?xml-model>
processing instruction is by now so standard that it should suffice to state the association between the dictionary document and its schema. And the INSTALL would have to contain the command for validating with xmllint, which I think is still unable to read the xml-model instruction (though I may be wrong):
xmllint --noout --relaxng ../shared/freedict-P5.rng lg1-lg2.tei
(The archive listing contains the minimal number of necessary files; some dictionaries would also need the Freedict-ontology; maybe Freedict-P5.dtd would have to be included under shared/ as well, in case some users for some unknown reason needed to use that.)
The above is only relevant if the current setup is suboptimal, of course.
Enhance the JSON AST parser so that image descriptions with fewer than
three words are ignored. The background is that images are often not properly
described; only a replacement word is inserted. E.g. "infobox" does not really
describe an image and therefore does not provide any context to the AI algorithm.
Descriptions with more than three words are likely to be proper descriptions.
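The filter itself is a one-liner; the sketch below uses "at least three words" as the cutoff, since the issue text leaves the exact boundary open:

```python
def keep_description(text, min_words=3):
    """Heuristic from the issue: alt texts with fewer than min_words
    words (e.g. the placeholder 'infobox') carry no usable context
    and should be dropped before feeding the AST to the model."""
    return len(text.split()) >= min_words

# Example: only the real description survives the filter.
captions = ["infobox", "thumb", "a red fox crossing a snowy field"]
kept = [c for c in captions if keep_description(c)]
```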
I've just noticed this thread at Corpora-L: "IPA conversion tool" https://mailman.uib.no/public/corpora/2019-April/030032.html
People mention:
So I thought I'd make a note of that here, just in case.
At the moment, the colloc tag is not supported by the style sheets, even though it already seems to be present as an attribute.
I could have a look at the weekend at changing the style sheets, but would be happy if an XSLT expert had time for this little task.
Ah yes, justification: dan-eng :)
Disambiguation pages contain a lot of phrases, but no sentences. It's better to
remove them to reduce the noise for the AI algorithm.
I'd like to know what's the standard (or the best) way to indicate examples of a term, inside a sense.
Here is what I'm doing currently:
<sense n="1" xml:lang="fa">
<gramGrp>
<pos norm="adjective">صفت</pos>
</gramGrp>
<sense xml:lang="fa">
<def>
definition...
<p class="example">
<div xml:lang="en">example 1 ...</div>
<div xml:lang="fa">translation of example 1 ...</div>
<div xml:lang="en">example 2 ...</div>
<div xml:lang="fa">translation of example 2 ...</div>
</p>
</def>
</sense>
</sense>
Or maybe I should use <p type="example"> instead of <p class="example">?
I'd also consider this:
<sense n="1" xml:lang="fa">
<gramGrp>
<pos norm="adjective">صفت</pos>
</gramGrp>
<sense xml:lang="fa">
<def>
definition...
<br/>
<spanGrp type="example">
<span xml:lang="en">example 1 ...</span>
<br/>
<span xml:lang="fa">translation of example 1 ...</span>
</spanGrp>
<br/>
<spanGrp type="example">
<span xml:lang="en">example 2 ...</span>
<br/>
<span xml:lang="fa">translation of example 2 ...</span>
</spanGrp>
</def>
</sense>
</sense>
I suppose <span> is like HTML and does not add newlines; that's why I added these <br/> elements.
The reason the xml:lang= attributes are specified is to change the direction of the text, and possibly the style (text color, font, etc.) when rendering to HTML.
I looked at more than 20 existing FreeDict dictionaries and didn't find anything like this.
Thanks in advance
On 06/12/17 10:15, Sebastian Humenda wrote:
Hi Piotr,
isl-eng tells me that Stefani Stoyanova converted the Apertium translation rules
to TEI P5. Is there a chance that you could either dig out the script or, even
better, contact Stefani? It'd be great to import a newer version.
I would like to update the existing ODD in two steps, and this ticket is meant for the first and gentler of them: a rewrite of the current ODD into the current TEI idiom. Ideally that should be just a cosmetic change without affecting the extension (i.e., the patterns/grammars defined by RNG, XSD, DTD). In practice, the extension is going to be affected due to the changes in the TEI that have happened over the years, so some tinkering may be in order, along with a lot of test runs across all the databases.
In doing that, I would like to add two files to our version control, for strictly internal purposes, so that we can trace changes in the TEI internals without having to investigate the git history of the TEI itself each time.
Let me sketch some background:
- The TEI Guidelines are built from a single source, the p5subset. It is called an 'integrated ODD'.
- Processing the Freedict ODD (which references the p5subset) silently creates something that can be called the Freedict integrated ODD; it is not visible to outside eyes, because it is regenerated each time the Freedict ODD is manipulated by the TEI Stylesheets.
- Our schemas were derived from the p5subset as it was defined by the TEI years ago. So while the Freedict ODD hasn't been modified since then, the result of its application on the current p5subset is going to be extensionally different from what was used years ago. I don't think it's a major issue (because we only use a very small subset of the TEI), but it's definitely something to be aware of.
- Another argument for keeping the p5subset in our version control is that, if one doesn't have full control of the TEI environment, their ODDs may reference the current 'blessed' TEI ODD, recreated after each release in the TEI Vault, or the current snapshot of the TEI under control of their Jenkins environment, or the local p5subset on the user's hard drive; what I propose reduces this potential complexity and adds a lot of transparency.

A hopefully minor complication is that our RNG was edited by hand since it got derived. Since it is version-controlled, I can extract the modifications and reapply them at the ODD level.
Another hopefully minor issue (but actually part of a larger issue suitable for a separate task in a separate ticket) is the way to make sure that the newly derived RNG is still valid for all the dictionary databases. I seem to recall that the Freedict make system had a 'validate' target, so I imagine that, after regenerating the RNG, I would only have to run make with the specific parameter, and watch for error messages. @humenda , do you sense any trouble in this regard, please?
EDIT: this is now the topic of freedict/tools#28 and I have an interim solution
I mentioned adding two files to the version control. I meant the current p5subset and the Freedict integrated ODD (call it... freedict_p5subset?). The first one freezes the current state of the TEI, so that, in the future, we can diff against it. The second is to expose the Freedict integrated ODD for similar comparisons. I could probably live without the latter, since it depends on the former, but it also depends on the TEI Stylesheets, and those are under constant development as well. Bottom line: it's far more convenient, in case one has to investigate some schema-related issue across time, to have both these files handy, because both of them can only be recreated in the future after tinkering with two very dynamic repositories (TEI Guidelines and TEI Stylesheets).
Envisioned action sequence:
- Freeze the current p5subset by adding it to Freedict version control (where? under shared/ or elsewhere?)
- Generate the freedict_p5subset by using the current Freedict ODD, with one change: its @source attribute will now point at the p5subset frozen at step (2)
- Keep the freedict_p5subset next to the p5subset; this one should be regenerated by hand after each modification of the Freedict ODD (one has to remember about that); recall: it's frozen for convenience, to shield it from any ensuing modifications in the TEI Stylesheets
- Diff the freedict_p5subset just to document any modifications that could have crept in at step (6)

At this point, after all the above actions, we should still be at the status quo, except with (a) two new files, kept for reproducibility checks, and (b) a newer Freedict ODD, ready to be modified further.
There has been a bug report in Debian, #771289, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=771289, asking for correction of the definition of "lust".
It would be great if we could investigate whether upstream is still alive. The README file at least references a web site, which is under maintenance (2016-09). If upstream is dead, I'd ask for the correction of the bug.
Is this the new home of the project? If so, where can I find info on the project, for example:
I'm not sure if all of these points apply or would be useful, but at least some form of README and project description would be useful and would help potential contributors. If I can find any of this myself (or be pointed in the right direction), then I can get some form of description started.
Freedict currently provides freedict-eng-hin for Indic users speaking the Hindi language.
This bug report is to track the following:
The current options are:
Please drop a note on this bug report if you are interested in helping with many of the regional Indic languages, other than and including Hindi.
eng-ell from http://www.freelang.net/dictionary/ might be updated. Research this
site for a newer version and more (free and open) dictionaries.
The following dictionary is open source and has fairly good metadata.
https://www-user.tu-chemnitz.de/~fri/ding/
You can download the text dictionary here:
https://ftp.tu-chemnitz.de/pub/Local/urz/ding/de-en/
It would be great if the German dictionary could put this to use!
fra-bre contains empty orth elements; such entries should be filled in or removed.
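Finding the affected entries could look like the sketch below (assuming the standard TEI namespace and the usual form/orth nesting); it reports entry positions so they can be reviewed and either filled or deleted:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def entries_with_empty_orth(tei_path):
    """Return the 0-based positions of entries whose <orth> is
    missing or has no text."""
    root = ET.parse(tei_path).getroot()
    bad = []
    for i, entry in enumerate(root.iter(TEI + "entry")):
        orth = entry.find(f"{TEI}form/{TEI}orth")
        if orth is None or not (orth.text or "").strip():
            bad.append(i)
    return bad
```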
Our (old) Serbian databases are really small. We should consider merging them
with http://serbdict.sourceforge.net. It's in a database, so the conversion
should be rather easy. Help is appreciated. We should also try to contact
upstream.
In one of the Makefiles, SHELL=bash is set. It's questionable whether depending on a particular shell, especially from a Makefile, is a good idea.
Just a note for now: it's a rather bad idea to edit the .rng directly, because it's regenerated after each change of the ODD. So when our schema is tightened, there will be a new .rng.
So we simply need a new ODD, more relaxed in this respect.
On 29/11/17 22:16, Sebastian Humenda wrote:
Branch: refs/heads/master
Home: https://github.com/freedict/fd-dictionaries
Commit: d922e50
d922e50
Author: Sebastian Humenda
Date: 2017-11-29 (Wed, 29 Nov 2017)
Changed paths:
M shared/freedict-P5.rng
Log Message:
freedict-P5.rng: allow multi-licencing
Previously, a licence reference (<ref target…>) was mandatory, but did not allow multiple licences.
DictionaryForMID is a dictionary program featuring a custom format, importer
scripts and also an Android client.
Documentation seems good and both authors of the desktop / mobile version are
active. Their architecture model is described here:
http://dictionarymid.sourceforge.net/development.html.
For the short term, we could easily leverage their dictd2dictionaryformid
conversion process, see
http://dictionarymid.sourceforge.net/DfM-Creator/index.html and the GUI
http://dictionarymid.sourceforge.net/DfM-Creator/gui-DictdToDictionaryForMIDs.html.
For the longer term, I'd like to create template overrides for our style sheets
which would format certain parts of our format differently, so that we could
make use of the formatting features of the DictionaryForMID format. For
instance, example sentences can be formatted separately. That is ideally not too
much effort, since the format in use is quite close to the dictd format, see
http://dictionarymid.sourceforge.net/faq.html.
Last but not least, the project features its own API to inform about new
dictionaries and more importantly, to push the dictionaries to mobile devices.
See http://dictionarymid.sourceforge.net/ota.php?p=1.
I would like to see this format supported and would love to integrate our
FreeDict API into DictionaryForMID, so that we don't replicate efforts.
According to http://www.edrdg.org/edrdg/licence.html, the copyright holders are:
Copyright over the documents covered by this statement is held by James William BREEN and The Electronic Dictionary
Research and Development Group.
The header of e.g. jpn-eng says:
<p>Copyright (C) 1994-2016 by various authors listed below.</p>
<p>Available under the terms of the <ref target="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-Share Alike Licence (V3.0)</ref>.</p>
</availability>
The correct year range would be 2000-2016. Please correct this information with the next import.
Hi,
Let's add a free dictionary: https://handedict.zydeo.net/en/ https://handedict.zydeo.net/en/download
Regards, Andy
The dictionary is licenced under CC-BY-SA and keeps improving. Its project
home is at http://folkets-lexikon.csc.kth.se/folkets/om.en.html. It's in an XML
format, and a converter can easily be written.
I'm lacking knowledge of Hungarian, but some of the words seem too long. It
should be checked whether these terms lack spaces. Breaking some of them up
gives results on Google; the long forms don't.
Hi there, great to see this repository and initiative!
I was hoping to build a CLI tool which would benefit from such translations. However, rather than just take translations, I'd also hope to add them upstream when they are missing.
I saw the following commit 0f1aa58 and was wondering what do I need to do to submit a pull request with new words? Just add the words and bump the version?
Do you have existing command line tool which can manipulate (add words to) the XML? (EDIT: Oh, I saw in this wiki page you are working on it!?)
Thanks!
Stardict is a widely used format and is hence worth supporting. There's also a
mobile client QDict available which understands this format. Since we don't have a client, it'd be great to make our dictionaries available this way.
For all mentioned dictionaries, part of speech information is annotated like this:
<note type="pos">adjectival nouns or quasi-adjectives (keiyodoshi)</note>
It would be great if this could be converted to <pos/>
elements and if possible, linked against an ontology. This way, the part of speech would be parseable by machines and could be localized for humans.
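The retagging half of this is mechanical; a sketch with ElementTree, which can rename an element in place. Mapping the free text ("adjectival nouns or quasi-adjectives (keiyodoshi)") onto ontology values would be the separate, harder step:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def notes_to_pos(tree):
    """Turn <note type="pos"> into a proper <pos> element in place,
    keeping the text content for later ontology mapping."""
    for note in tree.getroot().iter(TEI + "note"):
        if note.get("type") == "pos":
            note.tag = TEI + "pos"
            del note.attrib["type"]  # no longer needed after retagging
    return tree
```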
http://ferheng.org/ offers Kurdish dictionaries which are already part of
FreeDict. It would be good to (re)import them if they have changed at all.
Home page: http://arabeyes.org
Chapter 4 instructs the reader to install the DTDs. The more modern validation approach uses RELAX NG, and in fact the DTDs are not required anymore, though still usable. The chapter should be rewritten.