paracrawl / domain_adaptation Goto Github PK


InDomain detection is a tool designed to extract in-domain data from a large collection of data.

License: GNU General Public License v3.0

Python 100.00%

domain_adaptation's People

zjaume

domain_adaptation's Issues

Integrate into bitextor

To be clear, the deliverable is that it should be integrated into the pipeline and Omniscien is responsible for this deliverable. A dump of scripts for us to integrate is not enough.

Be more flexible about python version

Don't assume everybody has Python 3.6 installed. They might have 3.7 and no 3.6. Currently the code calls python3.6.
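
One possible way to avoid naming a specific interpreter is to reuse whichever Python is already running the parent script. The sketch below is only an illustration: the helper name `python_command` is hypothetical, and the script name and flags are taken from the call quoted elsewhere in these issues.

```python
import sys


def python_command(script, *args):
    """Build an argv list that reuses the interpreter running this process."""
    # sys.executable is the path of the current Python interpreter, so the
    # child runs under the same version (3.6, 3.7, ...) as the parent.
    return [sys.executable, script, *args]


cmd = python_command("TokenizeDomainSampleData.py", "-sl", "en")
```

The resulting list can be passed directly to `subprocess.run` without `shell=True`.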

Package as a Python package

Could this software please be packaged as a python package using setuptools?

Documentation

https://github.com/paracrawl/Domain_Adaptation/blob/master/Read.me doesn't actually say what the format of any of the files should be or have a worked example.

Table of Contents in README.md

The README.md is fairly long. Perhaps it can be separated into different documents (with links between them). A good way to start this would be with a table of contents. See also #11 as this can help with orientation and knowing what to expect from the document.

Incorrect data: output corpus twice the size of input corpus

Amir writes:

I manage to run it but the resulting corpus is almost twice the size of
actual corpus. So definitely not generating the correct data.

Stop adding files via upload

Use git properly: commit changes with descriptions of what was changed and preserve revision history.

Log path might as well be stderr

It's not clear to me what value is added by asking the user to specify a log path when they could simply redirect stderr with 2>/path/to/log/file.
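
A minimal sketch of what this could look like, assuming a small hypothetical helper named `log` replaces the log-path option. Diagnostics go to stderr and the user decides where they end up.

```python
import sys


def log(message, stream=None):
    """Write one diagnostic line to stderr (or to the given stream)."""
    # Defaulting to sys.stderr at call time lets the shell redirect it freely.
    print(message, file=stream if stream is not None else sys.stderr)
```

With this in place, a user who wants a log file would append `2>/path/to/log/file` to the command line, and no log-path argument is needed.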

Start with a worked example

After the introduction, there is only a brief explanation of the fact that there are different tools. The diagram at the beginning of Processes and Tools is helpful, but it is not introduced or explained very much. Immediately afterwards there is thorough documentation of the different tools and all of their command line arguments. This is also useful, but it would be better placed later, once I know what each piece does and have seen a simple example. There are examples, but they are interspersed with the detailed documentation. It's too easy to get lost without more orientation at the beginning.

Running FullProcess

The documentation says to run FullProcess.sh, but that file doesn't exist.
I ran FullProcess.py and got "command not found". I shouldn't have to be in a particular directory to run software; if I need to give the path to it, that's OK, but it isn't documented. Also, even if I were in the correct directory, . is not in my PATH.
Then I went into the lib directory (if it's meant to be run by the user it's probably not a lib...) and tried FullProcess.py, but it doesn't have a shebang such as #!/usr/bin/python3 at the top.

Implement Moore-Lewis for selection

Amir writes:

The selection is based on sentence length normalized language model
scoring of the crawl data using a monolingual source in-domain corpus. Well
in my opinion that is not a good model for data selection. We can at least
implement modified Moore-Lewis, which is kind of the baseline method for
data selection.
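
For orientation, here is a rough sketch of the scoring rule being proposed. Moore-Lewis scores a sentence by the difference between its per-token cross-entropy under an in-domain language model and under a general-domain one; the modified (bilingual) variant adds the source-side and target-side differences. The `UnigramLM` class is a toy stand-in invented for this example so that it runs on its own; a real setup would presumably use proper n-gram models, and nothing here reflects the tool's current code.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model, a stand-in for a real n-gram LM."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def cross_entropy(self, sentence):
        """Per-token cross-entropy (in nats) of the sentence under this model."""
        tokens = sentence.split()
        if not tokens:
            return 0.0
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab))
            for t in tokens
        )
        return -log_prob / len(tokens)


def moore_lewis(sentence, in_domain_lm, general_lm):
    """Cross-entropy difference H_in(s) - H_gen(s); lower means more in-domain."""
    return in_domain_lm.cross_entropy(sentence) - general_lm.cross_entropy(sentence)


def modified_moore_lewis(src, tgt, in_src, gen_src, in_tgt, gen_tgt):
    """Bilingual variant: sum the source-side and target-side differences."""
    return moore_lewis(src, in_src, gen_src) + moore_lewis(tgt, in_tgt, gen_tgt)


# Tiny made-up corpora, purely for illustration.
in_lm = UnigramLM(["the patient received the drug", "the drug dose was low"])
gen_lm = UnigramLM(["click here to view the profile", "public profile page"])
```

Sentences are then ranked by score and the lowest-scoring ones are kept. Unlike plain length-normalized in-domain scoring, the subtraction penalizes sentences that are simply common everywhere.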

Why catch an informative exception to throw a generic one?

Domain_Adaptation/scripts/FullProcess.py

Line 92 in 432916d

raise Exception("There was a problem with Full Running ")

Just makes it harder to know what the error is.
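
If extra context is wanted, one option is to chain the new exception to the original so that nothing is lost. This is a sketch only; `run_step` is a hypothetical helper, not code from the repository.

```python
def run_step(step):
    """Run one pipeline step, adding context without hiding the original error."""
    try:
        return step()
    except OSError as err:
        # "raise ... from err" stores the original exception as __cause__,
        # so the traceback still shows what actually went wrong.
        raise RuntimeError(f"Full process failed in {step.__name__}") from err
```

The simplest alternative is to not catch the exception at all and let the informative one propagate.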

Reduce dependency from Moses to preprocess

It looks like you're using the Moses preprocessor only. Here's a lighter version already in ParaCrawl projects: https://github.com/kpu/preprocess/

Sleeping?

Do I want to know why there are sleeps between commands? Does this suggest a larger issue around ensuring files are properly synced/closed?

Domain_Adaptation/lib/FullProcess.py

Line 80 in cbd4c91

time.sleep(1)

User-specified tokenizer

The standard in ParaCrawl is that the user can specify a tokenizer rather than it being hard-coded into the packages. This way some user can handle Chinese etc.
I suspect this will be moot once integrated into bitextor since it should be tokenizing for you anyway.

Paths with spaces

Scripts are not robust to paths containing spaces. Please write a test for this.
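
A test along these lines might look like the sketch below. It assumes the usual cause, which is building a shell command by string concatenation, and shows the alternative of passing arguments as a list. The function names and the line-counting child process are invented for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def count_lines_in_child(path):
    """Ask a child Python process to count lines in `path`, passing argv as a list."""
    code = "import sys; print(sum(1 for _ in open(sys.argv[1], encoding='utf-8')))"
    # No shell=True and no string concatenation: each list element reaches
    # the child as exactly one argument, spaces included.
    result = subprocess.run(
        [sys.executable, "-c", code, str(path)],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


def test_path_with_spaces():
    # Both the directory and the file name contain spaces.
    with tempfile.TemporaryDirectory(prefix="pool data ") as tmp:
        corpus = Path(tmp) / "my corpus.txt"
        corpus.write_text("one\ntwo\n", encoding="utf-8")
        assert count_lines_in_child(corpus) == 2
```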

Could the output be untokenized?

With SentencePiece being widely used in NMT, tokenized output is more of an obstacle than a benefit, I think. Could the output be untokenized, or could there be an option to choose between tokenized and untokenized? I think the tokenized pool data should be used only for LM scoring, and the untokenized sentences and the scores should be used to produce the output.

The output should not be parallel?

Hi,

I ran the FullProcess with Paracrawl as pool data and some data from news-crawl as domain data for English-Spanish and the output I'm getting is not parallel.
The command was:

python scripts/FullProcess.py -dn news -sl en -tl es -domain news -pool pooldata -out out -working_dir work -threshold 0.5

And the result was:

$ cat out/en/paracrawl6.txt
Kristine - Public profile - Catena Cycling
$ cat out/es/paracrawl6.txt
Increíble , pero cierto .
$ grep -n 'Kristine - Public' pooldata/en/paracrawl6.txt
41336:Kristine - Public profile - Catena Cycling
$ grep -n 'Increíble, pero' pooldata/es/paracrawl6.txt
20668:Increíble, pero cierto.

As far as I understand, the output should be a bilingual parallel subset of the pool data that matches the threshold or ratio criteria.
Maybe I've misunderstood something.
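
To make the expectation concrete, here is a sketch of selection that keeps the two sides aligned. It assumes one score per sentence pair and that a higher score means "more in-domain", which may not match how the tool defines its threshold; `select_parallel` is a hypothetical name.

```python
def select_parallel(src_lines, tgt_lines, scores, threshold):
    """Keep (source, target) pairs whose score passes the threshold, in order."""
    if not (len(src_lines) == len(tgt_lines) == len(scores)):
        raise ValueError("source, target and scores must have the same length")
    # Filtering the zipped triples keeps both sides aligned line by line.
    return [
        (src, tgt)
        for src, tgt, score in zip(src_lines, tgt_lines, scores)
        if score >= threshold
    ]
```

Because each decision is made per pair, line N of the source output always corresponds to line N of the target output.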

Order documentation for first use with example

1. What is this tool (which you have here already; good)
2. Dependencies
3. Installation including specifying paths to dependencies
4. Setup parallel corpora. Example of downloading ParaCrawl data.
5. Specify my domain. Pick I dunno WMT biomedical or something.
6. Running the top-level tool.

Then after I've done my first run successfully you can tell me about how the tool is implemented internally with various scripts.

Find scripts relative to self

Rather than requiring user to specify a command line argument with the path to scripts, use __file__ to find itself, then other scripts relative to that.
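
A minimal sketch of the idea, using `pathlib`. The constant `SCRIPT_DIR` and the helper `sibling_script` are illustrative names, and the fallback to the current directory exists only so the snippet also runs where `__file__` is undefined, such as an interactive session.

```python
from pathlib import Path

# Directory containing this file, resolved to an absolute path.
# Falls back to the current directory where __file__ is undefined (e.g. a REPL).
SCRIPT_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()


def sibling_script(name):
    """Return the absolute path of a script that sits next to this one."""
    return SCRIPT_DIR / name
```

For example, `sibling_script("ScorePoolData.py")` would locate that script regardless of the directory the user runs the tool from.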

Language codes

What are the language codes used for? I think they're just for the tokenizer in which case it would be best to say so. Also, ISO 639-1?

Run out of RAM?

Does LadderXMLfile parse lazily or is this going to load the entire XML into RAM?

Domain_Adaptation/P3_DD_Extract.py

Line 51 in cf06a0e

all_items = root.findall("sent")
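
For comparison, a lazy alternative could look like the sketch below, using `xml.etree.ElementTree.iterparse` from the standard library. It assumes the sentences are stored as the text of `sent` elements, as the quoted line suggests; the rest of the file layout is unknown, and `iter_sentences` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def iter_sentences(source):
    """Yield the text of each <sent> element without building the whole tree first."""
    # An "end" event fires once an element has been fully read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "sent":
            yield elem.text or ""
            # Release the element's text and children; the emptied element
            # itself still hangs off its parent.
            elem.clear()
```

`source` can be a file name or a binary file object, so the caller can iterate over a large file one sentence at a time.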

I'm confused by the logger

Sometimes you print to stdout, sometimes you log. Can we just print to stderr consistently and remove the logger entirely? There's already a standard place to put logs: stderr. It will make configuration easier.

Domain_Adaptation/lib/ScorePoolData.py

Line 142 in cbd4c91

print("Loding the model ======> ")

Domain_Adaptation/lib/ScorePoolData.py

Line 153 in cbd4c91

print("Ladder File IsReady !")

Domain_Adaptation/lib/ScorePoolData.py

Line 149 in cbd4c91

logger.info("Start- Prepare XML")

Aside: it's weird for a for loop to have the same log message every iteration of the loop.

Corpus unnecessarily in RAM

Domain_Adaptation/scripts/ScorePoolData.py

Line 153 in 432916d

 create_xml(get_list_of_sent(full_item_to_score),get_list_of_scores(full_item_to_score,model),get_Soft_scores(full_item_to_score),full_item_score) 

You don't need the whole corpus in RAM. Stream it.
This appears to be done so you can produce XML or something, which is overkill for one extra column of data.
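
As an illustration of the streaming approach, the sketch below writes the score as one extra tab-separated column, one line at a time. `score_stream` and `score_fn` are hypothetical names; `score_fn` stands in for whatever the language model scoring call is.

```python
def score_stream(lines, score_fn, out):
    """Write 'score<TAB>sentence' for each input line, one line at a time."""
    for line in lines:
        sentence = line.rstrip("\n")
        # Only the current sentence is held in memory.
        out.write(f"{score_fn(sentence):.4f}\t{sentence}\n")
```

Called as `score_stream(sys.stdin, score_fn, sys.stdout)`, this would process a corpus of any size in constant memory.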

Eliminate hard-coded paths.

We can't assume that we have root on the machine or access to /opt/.

Installation instructions

Installation instructions would be helpful, say in INSTALL.md. Some fragments from README.md such as the config file and explanations of the dependencies and required directory structure could usefully be moved there. See also #13.

Error detection and handling

I don't see any error checking here. The program will just continue.

Domain_Adaptation/lib/FullProcess.py

Line 79 in cbd4c91

 subprocess.call("python3.6 TokenizeDomainSampleData.py -dsd "+str(args.dsd)+" -sl "+str(args.sl)+" -c "+str(args.c), shell=True) 
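
One way to add the missing check, sketched with the standard library only. `run_checked` is a hypothetical wrapper; the point is that `check=True` turns a non-zero exit status into an exception instead of letting the pipeline continue.

```python
import subprocess
import sys


def run_checked(argv):
    """Run a command and stop the pipeline if it exits with a non-zero status."""
    # check=True makes subprocess.run raise CalledProcessError on failure;
    # passing a list avoids shell=True and string concatenation.
    subprocess.run(argv, check=True)
```

A call such as `run_checked([sys.executable, "TokenizeDomainSampleData.py", "-sl", "en"])` would then abort the run as soon as that step fails.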

Config file default doesn't make sense outside Omniscien

The default config file is a bunch of custom paths that only exist at Omniscien and are useless to general users. Why is it optional? Reading sequentially from the beginning, the user is not aware of this problem until near the end.

The config file is really more of system setup with paths where tools are installed. Perhaps the install instructions should say to create such a config file:
https://github.com/paracrawl/Domain_Adaptation/blob/master/INSTALL.md

Spelling

Domain_Adaptation/lib/FullProcess.py

Line 31 in cbd4c91

print('Full Runining ')

Install instructions need work

https://github.com/paracrawl/Domain_Adaptation/blob/master/INSTALL.md doesn't render properly on github.
Do I need to compile Moses or just download it for the tokenizer? Why Moses and not just the Moses tokenizer extract like https://github.com/kpu/preprocess/ or the one Marcin made?
Do I need to compile kenlm? Does it need the python bindings?
Correct the spelling of download.
Perhaps make a paths configuration file part of the installation instructions.

More accessible introduction

When I read the beginning of README.md, if I don't already know what Domain Adaptation is, the introduction isn't super helpful. Who is the audience? What is the software good for? Why would I want it? What can I expect to be able to do or learn if I read the rest of the document? These are questions that should be answered in the first few sentences of the introduction.

Code documentation

The files don't have comments saying what they do.
