nlp100 / nlp100.github.io

NLP 100 Exercises

License: MIT License

HTML 21.19% Ruby 0.72% JavaScript 51.95% Python 0.93% Shell 1.99% SCSS 23.22%

nlp100.github.io's Issues

Environment setup for macOS

Hi, I noticed that there are still no instructions for setting up this project on macOS. I have completed the environment setup (it works fine on macOS 10.15.4), and the essential steps are as follows:

Environment Setup (macOS)

If you have not installed the basic developer toolkit (the Xcode Command Line Tools), run:

xcode-select --install

(For macOS versions < 10.15) Ruby 2.6 is bundled with macOS 10.15, so on an earlier version of macOS you may have to install a recent version of Ruby manually. Run the following commands to install it through Homebrew:

(Skip the next line if Homebrew is already installed)
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

brew install ruby

If you are using the Ruby built into macOS, run the following commands to install the gems into your user directory. (A system-wide install may be risky because of macOS SIP and potential permission problems.)

gem install --user-install bundler jekyll
bundle install

Finally, remember to add the gem binary directory to your shell's PATH. Adjust the PATH= value and the startup file (.zshrc here) to match your environment.

(For built-in Ruby and zsh)
echo 'export PATH="$HOME/.gem/ruby/X.X.0/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc

The remaining steps are the same as in the Ubuntu/Debian part.

I hope this helps to complement the documentation.

Thanks.

English Chapter 4 suggests using a POS tagger, but no information on what

The English version of Chapter 4 suggests using a POS tagger to solve the problems provided, but gives the reader no information on which one to use. The Japanese version suggests MeCab, but that probably won't apply here. I guess NLTK might be good/universal enough?

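Chapter 4: inconsistent terminology between 基本形 (base form) and 原形 (original form)

First of all, thank you for publishing this material. I am enjoying studying with it.

This is a minor point, but "30. Reading morphological analysis results" says

Store each morpheme in a mapping type whose keys are the surface form (表層形, surface), the base form (基本形, base), the part of speech (品詞, pos), and part-of-speech subcategory 1 (品詞細分類1, pos1),

whereas "32. Base forms of verbs" says

Extract the base forms (原形) of all verbs.

so two different Japanese terms, 基本形 and 原形, are used for base, which caught my attention.
If the two mean the same thing, I think it would be better to settle on one of them (apologies if they are actually different 🙇).

Incidentally, Chapter 5 consistently uses 基本形.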

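On the wording of the problem statement for "69. Visualization with t-SNE"

In Chapter 7, "69. Visualization with t-SNE" says

Visualize the vector space of the word vectors for country names with t-SNE.

I think this wording is inaccurate, and that either

Visualize the word vectors for country names with t-SNE.

or

Visualize the word vectors for country names in the vector space with t-SNE.

would be more appropriate. What do you think?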

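On the definition of level in Chapter 3, Problem 23

I am sorry to raise such a small detail, but

Problem 23 says

Display the section names contained in the article together with their levels (for example, 1 for "== Section name ==").

so "== Section name ==" is treated as level 1. However, the MediaWiki documentation (https://www.mediawiki.org/wiki/Help:Formatting) gives

== Level 2 ==

so the two notations are inconsistent. If possible, could you follow the MediaWiki documentation and use the == Level 2 == convention?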
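The difference between the two conventions can be made concrete with a short sketch. It assumes headings are written on their own line with two to six equals signs on each side; the function name `extract_sections` and the sample text are hypothetical and not taken from the exercise.

```python
import re

# Matches a MediaWiki heading line such as "== Name ==" or "===Name===".
# The same run of "=" must open and close the heading (backreference \1).
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)


def extract_sections(wikitext):
    """Return (name, exercise_level, mediawiki_level) for each heading."""
    sections = []
    for match in HEADING.finditer(wikitext):
        equals = len(match.group(1))
        # Problem 23 convention: "==" counts as level 1.
        # MediaWiki Help:Formatting convention: "==" counts as level 2.
        sections.append((match.group(2), equals - 1, equals))
    return sections


# Hypothetical sample text, for illustration only.
sample = "== History ==\ntext\n=== Early period ===\nmore text\n"
for name, exercise_level, mediawiki_level in extract_sections(sample):
    print(name, exercise_level, mediawiki_level)
```

Under these assumptions the two numbers always differ by exactly one, so switching conventions would only change the expected output, not the extraction logic.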

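CaboCha parsing errors from "41. Reading dependency parsing results" onward

When 「吾輩はここで始めて人間というものを見た」 is parsed with CaboCha, the dependency 「始めて -> 見た」 is not extracted, so even a correct program produces odd results. For example, judging from the actual dependency structure, "48. Extracting paths from nouns to the root" should yield

吾輩は -> 見た
ここで -> 始めて -> 見た
人間という -> ものを -> 見た
ものを -> 見た

If possible, please also consider comparing against ginza -f cabocha and KNP.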
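As a rough illustration of the expected paths, the sketch below encodes the dependency structure by hand and follows each noun-bearing chunk to the root. The chunk dictionaries, the `has_noun` flags, and the `dst` indices are assumptions written out manually to match the paths listed in this issue; they are not CaboCha output.

```python
def noun_paths(chunks):
    """Follow each chunk that contains a noun up to the root chunk."""
    paths = []
    for chunk in chunks:
        if not chunk["has_noun"]:
            continue
        path = [chunk["text"]]
        dst = chunk["dst"]
        while dst != -1:  # -1 marks the root, which depends on nothing
            path.append(chunks[dst]["text"])
            dst = chunks[dst]["dst"]
        paths.append(" -> ".join(path))
    return paths


# Hand-labelled structure (assumption): "dst" is the index of the head chunk.
chunks = [
    {"text": "吾輩は", "dst": 5, "has_noun": True},
    {"text": "ここで", "dst": 2, "has_noun": True},
    {"text": "始めて", "dst": 5, "has_noun": False},
    {"text": "人間という", "dst": 4, "has_noun": True},
    {"text": "ものを", "dst": 5, "has_noun": True},
    {"text": "見た", "dst": -1, "has_noun": False},
]

for line in noun_paths(chunks):
    print(line)
```

If a parser fails to attach 「始めて」 to 「見た」, the `dst` value of that chunk changes and the second path differs, even though the traversal code itself is unchanged.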

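On the output file format in "51. Feature extraction"

The problem statement for "51. Feature extraction" contains the following sentence about the output file format. However, the contents of the file created here depend on the features each person designs, so isn't this sentence inappropriate?

Write one example per line to the file, as the category name and the article headline separated by a space.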

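On "21. Extract lines containing category names"

Chapter 3 says

Each line stores the article title under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format

However, the JSON file that can be downloaded from https://nlp100.github.io/data/jawiki-country.json.gz does not seem to contain any lines that declare category names within the articles; instead, there appears to be a "category" key. Is this what the problem intends?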

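Chapter 1, Problem 04 (Element symbols): adding a processing condition

Processing the sentence under the conditions given in the problem statement extracts Mi from Might.
I assume Mg is what was intended, so
shouldn't the condition "for the 12th word, take the 1st and 3rd characters"
be added?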
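The behaviour described here can be reproduced with a small sketch. The sentence and the list of single-character positions are quoted from memory of Problem 04 and should be treated as assumptions; the `special` parameter is a hypothetical way to express the proposed extra condition.

```python
# Sentence and positions recalled from Problem 04 (assumption, not in this issue).
SENTENCE = (
    "Hi He Lied Because Boron Could Not Oxidize Fluorine. "
    "New Nations Might Also Sign Peace Security Clause. Arthur King Can."
)
ONE_CHAR = {1, 5, 6, 7, 8, 9, 15, 16, 19}


def element_symbols(text, special=None):
    """Map each extracted string to its 1-based word position.

    `special` maps a word position to the 0-based character indices to
    take from that word, overriding the default rule.
    """
    special = special or {}
    symbols = {}
    for position, word in enumerate(text.split(), start=1):
        if position in special:
            symbol = "".join(word[i] for i in special[position])
        elif position in ONE_CHAR:
            symbol = word[0]      # first character only
        else:
            symbol = word[:2]     # first two characters
        symbols[symbol] = position
    return symbols


as_stated = element_symbols(SENTENCE)
with_fix = element_symbols(SENTENCE, special={12: (0, 2)})
print(as_stated)
print(with_fix)
```

With the rule as stated, position 12 is keyed by "Mi"; with the proposed condition (1st and 3rd characters of the 12th word) it is keyed by "Mg".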

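On the domain of the predicted y in "71. Prediction with a single-layer neural network"

Problem 71 says to perform the following computation on the training data,

and after that it says

Here, [...] is a vector representing the probability of belonging to each category when example x1 is classified with the untrained matrix W. Similarly, [...] represents, as a matrix, the probability of belonging to each category for the training examples x1, x2, x3, x4.

However, since the values returned by softmax lie in [0,1], I think it would be correct to write [...] and [...].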
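The point about the range of softmax can be checked numerically. The following is a minimal pure-Python sketch; the matrix shapes and the concrete numbers are made up for illustration, and `predict` is a hypothetical helper that applies softmax to each row of XW.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(X, W):
    """Apply softmax to each row of the matrix product XW."""
    columns = list(zip(*W))  # columns of W, one per category
    rows = []
    for x in X:
        logits = [sum(a * b for a, b in zip(x, col)) for col in columns]
        rows.append(softmax(logits))
    return rows


# Made-up values: 4 examples, 3 features, 4 categories.
X = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [0.0, 0.0, 0.0], [-2.0, 3.0, 1.0]]
W = [[0.1, -0.2, 0.3, 0.0], [0.4, 0.1, -0.3, 0.2], [-0.5, 0.2, 0.1, 0.3]]

Y_hat = predict(X, W)
for row in Y_hat:
    print(row, sum(row))
```

Every entry falls in [0, 1] and each row sums to 1 up to floating-point error, which is the property the issue refers to.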

English Chapter 4, Q.38 is hard to understand

This is the clarification given in the problem statement:

($x$-axis: frequency of occurrences of words in x-axiz; $y$-axis: the number of distinct words occurring $x$ times)

This is extremely hard to comprehend.

I've looked at the Japanese version, and the intent seems to be:

Where the x-axis is a scalar range representing a frequency, ranging from 1 to the largest frequency of a given word in the entire corpus, and the y-axis is the count of unique words that fall into the count of the x value.
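One way to read that description is as a count of counts. The sketch below assumes the corpus is already tokenized into a list of words; the token list and the function name `frequency_histogram` are hypothetical.

```python
from collections import Counter


def frequency_histogram(tokens):
    """Map each frequency x to the number of distinct words occurring x times."""
    word_counts = Counter(tokens)          # word -> number of occurrences
    return Counter(word_counts.values())   # frequency -> number of distinct words


# Toy token list, for illustration only.
tokens = "the cat sat on the mat the end".split()
histogram = frequency_histogram(tokens)

# The x-axis runs from 1 to the largest frequency of any word;
# a Counter returns 0 for frequencies that no word has.
for x in range(1, max(histogram) + 1):
    print(x, histogram[x])
```

In the toy data, five words occur once and one word occurs three times, so the value at x = 1 is 5, at x = 2 is 0, and at x = 3 is 1.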
