domain_adaptation's People

Contributors

dalisola, dionwiggins, kpu, phikoehn, zjaume

Forkers

zjaume

domain_adaptation's Issues

Integrate into bitextor

To be clear, the deliverable is that it should be integrated into the pipeline and Omniscien is responsible for this deliverable. A dump of scripts for us to integrate is not enough.

Table of Contents in README.md

The README.md is fairly long. Perhaps it can be separated into different documents (with links between them). A good way to start this would be with a table of contents. See also #11 as this can help with orientation and knowing what to expect from the document.

Log path might as well be stderr

It's not clear to me what value is added by asking the user to specify a log path when they could simply redirect stderr with 2>/path/to/log/file.
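A minimal sketch of what this suggestion could look like in Python, assuming a hypothetical `process` function standing in for the real work: diagnostics go to stderr and results stay on stdout, so the caller chooses the log destination with a shell redirect.

```python
import sys


def log(message):
    """Write a diagnostic message to stderr, leaving stdout free for data."""
    print(message, file=sys.stderr, flush=True)


def process(lines):
    """Hypothetical stand-in for the real work: upper-case each line."""
    log("Loading input")
    result = [line.upper() for line in lines]
    log(f"Processed {len(result)} lines")
    return result


# Illustrative invocation (tool.py is a made-up script name):
#     python tool.py < in.txt > out.txt 2> /path/to/log/file
```

With this layout no log-path argument is needed, and a user who wants no log file simply leaves stderr attached to the terminal.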

Start with a worked example

After the introduction, there should be a brief explanation of the fact that there are different tools. The diagram at the beginning of Processes and Tools is helpful, but it is not introduced or explained very much. Immediately afterwards there is thorough documentation of the different tools and all of their command line arguments. This is also useful, but it would be better placed later, once I know what each piece does and have seen a simple example. There are examples, but they are interspersed with the detailed documentation. It's too easy to get lost without some more orientation at the beginning.

Running FullProcess

  1. The Documentation says to run FullProcess.sh but that doesn't exist.
  2. I run FullProcess.py and that's command not found. I shouldn't have to be in a particular directory to run software; if I need to give the path to it that's ok but not documented. Also, even if I was in the correct directory, . is not in my path.
  3. Then I go into the lib directory (if it's meant to be run by the user it's probably not a lib...) and try FullProcess.py but it doesn't have a shebang #!/usr/bin/python3 at the top.
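For illustration, a minimal skeleton of an executable Python entry point. This is a sketch and not the real FullProcess.py; the `-out` option and its default are used only as an example. The `#!/usr/bin/env python3` line is a common alternative to the `#!/usr/bin/python3` form mentioned above, and the file also needs its executable bit set (for example with `chmod +x`) before it can be run by path.

```python
#!/usr/bin/env python3
"""Illustrative skeleton of an executable entry point (not the real FullProcess.py)."""
import argparse
import sys


def parse_args(argv):
    """Parse command line arguments; the single option here is only an example."""
    parser = argparse.ArgumentParser(description="Hypothetical entry point")
    parser.add_argument("-out", default="out", help="output directory")
    return parser.parse_args(argv)


def main(argv=None):
    """Run the tool and return a process exit status."""
    args = parse_args(sys.argv[1:] if argv is None else argv)
    print(f"Would write output to {args.out}", file=sys.stderr)
    return 0


if __name__ == "__main__":
    main()
```

A real script would usually pass the return value of `main()` to `sys.exit` so that failures are visible to the calling shell.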

Implement Moore-Lewis for selection

Amir writes:

The selection is based on sentence length normalized language model
scoring of the crawl data using a monolingual source in-domain corpus. Well
in my opinion that is not a good model for data selection. We can at least
implement modified Moore-Lewis, which is kind of the baseline method for
data selection.
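As a rough sketch of the idea, Moore-Lewis selection scores a sentence by the difference between its cross-entropy under an in-domain language model and under a general-domain one, and the modified variant adds the same difference computed on the target side. The code below assumes toy add-one-smoothed unigram models purely for illustration; `UnigramLM` and the function names are hypothetical, and a real setup would use proper n-gram language models.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model (illustrative stand-in for a real LM)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen tokens

    def cross_entropy(self, sentence):
        """Per-token cross-entropy in bits, i.e. normalized by sentence length."""
        tokens = sentence.split()
        if not tokens:
            return 0.0
        log_prob = sum(
            math.log2((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )
        return -log_prob / len(tokens)


def moore_lewis(sentence, in_domain_lm, general_lm):
    """Cross-entropy difference H_in(s) - H_gen(s); lower means more in-domain."""
    return in_domain_lm.cross_entropy(sentence) - general_lm.cross_entropy(sentence)


def modified_moore_lewis(src, tgt, in_src, gen_src, in_tgt, gen_tgt):
    """Sum of the source-side and target-side cross-entropy differences."""
    return moore_lewis(src, in_src, gen_src) + moore_lewis(tgt, in_tgt, gen_tgt)


# Tiny made-up corpora, only to show how the pieces fit together.
in_domain = ["the patient received a dose", "the dose was reduced"]
general = ["the cat sat on the mat", "click here to log in"]
vocab = {tok for s in in_domain + general for tok in s.split()}
in_lm = UnigramLM(in_domain, vocab)
gen_lm = UnigramLM(general, vocab)
```

Sentences are then ranked by score and the lowest-scoring ones are kept, either up to a threshold or up to a fixed fraction of the pool.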

User-specified tokenizer

The standard in ParaCrawl is that the user can specify a tokenizer rather than it being hard-coded into the packages. This way some user can handle Chinese etc.
I suspect this will be moot once integrated into bitextor since it should be tokenizing for you anyway.
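One possible shape for this, sketched under the assumption that the tokenizer is any external command that reads sentences on stdin and writes one tokenized line per input line on stdout. The `tokenize` function and its `tokenizer_cmd` parameter are hypothetical names, not part of the existing scripts.

```python
import shlex
import subprocess


def tokenize(lines, tokenizer_cmd):
    """Pipe lines through a user-supplied tokenizer command, one sentence per line."""
    if not lines:
        return []
    result = subprocess.run(
        shlex.split(tokenizer_cmd),  # no shell involved; arguments are split safely
        input="\n".join(lines) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if the tokenizer exits with an error
    )
    tokenized = result.stdout.splitlines()
    if len(tokenized) != len(lines):
        raise ValueError("tokenizer must emit exactly one line per input line")
    return tokenized
```

The line-count check matters because any downstream scoring assumes the tokenized text stays aligned with the original sentences.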

Paths with spaces

Scripts are not robust to paths containing spaces. Please write a test for this.
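A sketch of what such a test might look like, assuming the scripts call other programs through `subprocess`. The helper `count_lines_via_subprocess` is a made-up stand-in for a real script invocation; the key point is that the path is passed as a single list element rather than interpolated into a shell string.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def count_lines_via_subprocess(path):
    """Invoke a child process on `path`, passing it as one list element (no shell)."""
    code = (
        "import sys\n"
        "with open(sys.argv[1], encoding='utf-8') as f:\n"
        "    print(sum(1 for _ in f))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def test_path_with_spaces():
    """Both the directory and the file name contain spaces."""
    with tempfile.TemporaryDirectory(prefix="pool data ") as tmp:
        corpus = Path(tmp) / "para crawl.txt"
        corpus.write_text("one\ntwo\n", encoding="utf-8")
        assert count_lines_via_subprocess(corpus) == 2
```

The same pattern applies to shell scripts, where the usual fix is to quote every variable expansion that holds a path.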

Could the output be untokenized?

With SentencePiece being widely used in NMT, tokenized output is more of an obstacle than a benefit, I think. Could the output be untokenized, or could there be an option to choose between tokenized and untokenized? I think the tokenized pool data should be used only for LM scoring, and the untokenized sentences and the scores should be used to produce the output.
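A minimal sketch of this proposal, assuming the raw and tokenized files are line-aligned and that a higher score means a better match (the actual direction of the threshold is not stated here). `select_untokenized` and `score_fn` are hypothetical names.

```python
def select_untokenized(raw_lines, tokenized_lines, score_fn, threshold):
    """Score the tokenized text but return the matching untokenized lines."""
    if len(raw_lines) != len(tokenized_lines):
        raise ValueError("raw and tokenized inputs must be line-aligned")
    selected = []
    for raw, tok in zip(raw_lines, tokenized_lines):
        score = score_fn(tok)  # the LM only ever sees the tokenized form
        if score >= threshold:
            selected.append((score, raw))  # the output keeps the original text
    return selected
```

The tokenized text then becomes an internal detail of scoring and never has to appear in the output files.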

The output should not be parallel?

Hi,

I ran the FullProcess with Paracrawl as pool data and some data from news-crawl as domain data for English-Spanish and the output I'm getting is not parallel.
The command was:

python scripts/FullProcess.py -dn news -sl en -tl es -domain news -pool pooldata -out out -working_dir work -threshold 0.5

And the result was:

$ cat out/en/paracrawl6.txt
Kristine - Public profile - Catena Cycling
$ cat out/es/paracrawl6.txt
Increíble , pero cierto .
$ grep -n 'Kristine - Public' pooldata/en/paracrawl6.txt
41336:Kristine - Public profile - Catena Cycling
$ grep -n 'Increíble, pero' pooldata/es/paracrawl6.txt
20668:Increíble, pero cierto.

As far as I understand, the output should be a bilingual parallel subset of the pool data that matches the threshold or ratio criteria.
Maybe I've misunderstood something.
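If a parallel subset is indeed the intended output, one way to guarantee alignment is to decide per line index and apply that decision to both sides. The sketch below assumes scores are computed on the source side only and that a higher score is better; both are assumptions, and `select_parallel` is a hypothetical name.

```python
def select_parallel(src_lines, tgt_lines, src_scores, threshold):
    """Keep sentence pairs whose source-side score passes the threshold."""
    if not (len(src_lines) == len(tgt_lines) == len(src_scores)):
        raise ValueError("source, target and scores must have the same length")
    kept = [
        (src, tgt)
        for src, tgt, score in zip(src_lines, tgt_lines, src_scores)
        if score >= threshold
    ]
    # Split the surviving pairs back into two aligned lists.
    return [src for src, _ in kept], [tgt for _, tgt in kept]
```

Because each pair is kept or dropped as a unit, line N of the source output always corresponds to line N of the target output.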

Order documentation for first use with example

  1. What is this tool (which you have here already; good)
  2. Dependencies
  3. Installation including specifying paths to dependencies
  4. Setup parallel corpora. Example of downloading ParaCrawl data.
  5. Specify my domain. Pick I dunno WMT biomedical or something.
  6. Running the top-level tool.

Then after I've done my first run successfully you can tell me about how the tool is implemented internally with various scripts.

Find scripts relative to self

Rather than requiring the user to specify a command line argument with the path to scripts, use __file__ to find itself, then other scripts relative to that.
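A short sketch of that approach using `pathlib`. The helper name `sibling_script` and the file name in the usage comment are made up for illustration.

```python
from pathlib import Path


def sibling_script(anchor_file, name):
    """Return the absolute path of `name` in the same directory as `anchor_file`."""
    return Path(anchor_file).resolve().parent / name


# Inside a script, pass __file__ as the anchor, for example:
#     scorer = sibling_script(__file__, "score.py")  # "score.py" is a made-up name
```

Calling `resolve()` first means the lookup also works when the script is reached through a symlink or a relative path.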

Language codes

What are the language codes used for? I think they're just for the tokenizer in which case it would be best to say so. Also, ISO 639-1?

I'm confused by the logger

Sometimes you print to stdout, sometimes you log. Can we just print to stderr consistently and remove the logger entirely? There's already a standard place to put logs: stderr. It will make configuration easier.

print("Loding the model ======> ")

print("Ladder File IsReady !")

logger.info("Start- Prepare XML")

Aside: it's weird for a for loop to have the same log message every iteration of the loop.

Corpus unnecessarily in RAM

create_xml(get_list_of_sent(full_item_to_score),get_list_of_scores(full_item_to_score,model),get_Soft_scores(full_item_to_score),full_item_score)

You don't need the whole corpus in RAM. Stream it.
This appears to be done so you can do XML or something. Which is overkill for one extra column of data.
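A sketch of the streaming alternative, assuming the score can be computed one sentence at a time and that a tab-separated `score<TAB>sentence` line is an acceptable replacement for the XML. `stream_scores` and `score_fn` are hypothetical names, not functions from the existing code.

```python
def stream_scores(lines, score_fn, out):
    """Write `score<TAB>sentence` for each input line without buffering the corpus."""
    count = 0
    for line in lines:  # a file object yields one line at a time
        sentence = line.rstrip("\n")
        out.write(f"{score_fn(sentence):.4f}\t{sentence}\n")
        count += 1
    return count


# Illustrative use with real files (paths are placeholders):
#     with open("pool.txt", encoding="utf-8") as src:
#         stream_scores(src, my_score, sys.stdout)
```

Memory use stays constant regardless of corpus size, since only the current line is held at any moment.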

Installation instructions

Installation instructions would be helpful, say in INSTALL.md. Some fragments from README.md such as the config file and explanations of the dependencies and required directory structure could usefully be moved there. See also #13.

Config file default doesn't make sense outside Omniscien

The default config file is a bunch of custom paths that only exist at Omniscien and are useless to general users. Why is it optional? Reading sequentially from the beginning, the user is not aware of this problem until near the end.

The config file is really more of system setup with paths where tools are installed. Perhaps the install instructions should say to create such a config file:
https://github.com/paracrawl/Domain_Adaptation/blob/master/INSTALL.md

More accessible introduction

When I read the beginning of README.md, if I don't already know what Domain Adaptation is, the introduction isn't super helpful. Who is the audience? What is the software good for? Why would I want it? What can I expect to be able to do or learn if I read the rest of the document? These are questions that should be answered in the first few sentences of the introduction.
