- Go to the data directory (https://github.com/zhanymkanov/russian_reviews_dataset/tree/master/data)
- Download the raw or cleaned data
or
git clone https://github.com/zhanymkanov/russian_reviews_dataset
cd russian_reviews_dataset/data
- Parser code: https://github.com/zhanymkanov/reviews_parser
- Cleaning code: https://github.com/zhanymkanov/reviews_tazalau
- 143k Russian reviews
- 8.7k Kazakh reviews
- 256 reviews with an undetected language (e.g. numbers like "10/10", English, or Kazakh written in Latin script)
- Review text might be in `text` and/or in `plus` and/or in `minus`
- At least one of the columns is filled
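The "at least one column is filled" rule can be checked row by row. A minimal sketch, assuming rows are available as dictionaries keyed by the column names `text`, `plus`, and `minus`, and that an empty field is either `None` or a whitespace-only string (both are assumptions about the files; the helper name `has_review_text` is hypothetical):

```python
TEXT_COLUMNS = ("text", "plus", "minus")


def has_review_text(row):
    """Return True if at least one of the text columns holds non-blank text."""
    return any((row.get(col) or "").strip() for col in TEXT_COLUMNS)


# Made-up sample rows, not taken from the dataset files.
rows = [
    {"text": "Телефон хороший.", "plus": "", "minus": None},
    {"text": "", "plus": "   ", "minus": ""},
]

# Keep only rows that satisfy the rule.
valid_rows = [row for row in rows if has_review_text(row)]
```

A check like this is mainly useful as a sanity test after loading the data, since the dataset is described as already satisfying the rule.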
Number of categories: 17
- computers
- smartphones
- perfumes
- watches
- wearables
- etc.
Rating:
- Integers from 1 to 5
Language:
- russian
- kazakh
- other
text | plus | minus | language | rating | category |
---|---|---|---|---|---|
Парфюм оригинальный. Обожаю их! | | | russian | 5 | parfumes |
Иісі қатты ұнады. | | | kazakh | 5 | parfumes |
Телефон хороший. Все устраивает, покупкой довольна. | | | russian | 5 | smartphones |
- Stopwords are removed (find them in `reviews_tazalau/code/constants.py`)
- Tags are removed (e.g. `<br>`, `</br>`, `<p>`)
- Russian words are lemmatized (via pymystem3)
- The `text`, `plus`, and `minus` columns are concatenated into one `combined_text` column
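
The cleaning steps above can be sketched roughly as follows. This is not the code from reviews_tazalau. The stopword set is a tiny placeholder (the real list lives in `reviews_tazalau/code/constants.py`), the function names are hypothetical, and the tag regex, the lowercasing, and the order of the steps are assumptions. Lemmatization is left out so that the sketch needs only the standard library.

```python
import re

# Placeholder stopwords (assumption); not the list from constants.py.
STOPWORDS = {"и", "в", "на", "не"}

# Matches simple HTML-like tags such as <br>, </br>, <p>.
TAG_RE = re.compile(r"</?\w+[^>]*>")


def strip_tags(text):
    """Replace each HTML-like tag with a single space."""
    return TAG_RE.sub(" ", text)


def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stopword tokens."""
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


def build_combined_text(row):
    """Concatenate the text, plus and minus fields, then clean the result."""
    parts = [row.get(col) or "" for col in ("text", "plus", "minus")]
    combined = " ".join(part for part in parts if part.strip())
    # Lemmatization of Russian words (pymystem3 in the original pipeline)
    # would go here; it is omitted in this sketch.
    return remove_stopwords(strip_tags(combined))


# Made-up sample row, not taken from the dataset files.
row = {"text": "Телефон хороший<br>", "plus": "Экран и камера", "minus": ""}
row["combined_text"] = build_combined_text(row)
```

Replacing tags with a space rather than deleting them keeps words on either side of a tag from being glued together before tokenization.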