
dongyuanxin / news-emotion

329.0 20.0 129.0 3.71 MB

📉 Financial text sentiment analysis model

Home Page: https://yuanxin.me/

Python 100.00%

python nlp finance comments machine-learning

news-emotion's Introduction

Hi there 👋

🔭 I’m currently working for Bytedance
🌱 I’m currently learning System & Algorithm Design
📫 How to reach me: yuanxin.me
💬 Ask me about Serverless/Cloud/Frontend


news-emotion's Issues

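How to reduce the time complexity of TF-IDF

Help wanted: I am looking for a way to reduce the time complexity of the TF-IDF computation.

The TF-IDF model is implemented in the words2vec method of news-emotion/operate_data.py.

Compared with the other methods, the implementation clearly contains one extra loop, which makes the time complexity N times the original.
We currently have no cluster available, and even 1000 training samples run slowly on the server, so the tf-idf word-vector experiment is disabled for now and will be added back later.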
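One common way to avoid an extra loop of this kind is to count document frequencies once for the whole corpus and then reuse them for every document. The sketch below is only an illustration under that assumption: it is not the repository's words2vec code, the function names build_idf and tfidf_vector are hypothetical, and it assumes each document has already been segmented into a list of tokens.

```python
import math
from collections import Counter


def build_idf(docs):
    """Compute IDF for every term in one pass over the corpus.

    docs is a list of token lists (hypothetical input format).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # set() ensures each term is counted once per document
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Return sparse TF-IDF weights for one document as {term: weight}."""
    if not tokens:
        return {}
    term_freq = Counter(tokens)
    total = len(tokens)
    # Terms missing from the IDF table get weight 0.0
    return {t: (c / total) * idf.get(t, 0.0) for t, c in term_freq.items()}


# Toy corpus, invented purely for illustration
docs = [
    ["stock", "rises", "profit"],
    ["stock", "falls", "loss"],
    ["profit", "warning", "stock"],
]
idf = build_idf(docs)
vectors = [tfidf_vector(d, idf) for d in docs]
```

With the IDF table built up front, each document costs time proportional to its own length, so the total work grows with the number of tokens in the corpus rather than with documents times vocabulary.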

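About the training samples

Many people have emailed me to ask about the training samples, so here is a single explanation.

Source: the wisenews website.
Category: Hong Kong stock news; the database currently holds more than 800,000 news texts.
Training samples: the 1000 most recent items selected from those 800,000+ news texts, manually labeled and then given to the model for training.

Because of project requirements, the labeled texts were not uploaded to the public repository. I will consider uploading the full set of training texts later for anyone interested.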

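Questions about accuracy

Provided the model is not overfitting, the accuracy on the labeled samples is probably the result people care about most. Here is the accuracy, after leave-one-out validation, of the models trained on the 1000 labeled samples.

Binary classification

News is labeled only as positive or negative, which is the common setup in the literature.

Three-class classification

News texts are divided into three classes: positive, negative, and neutral. Nearly all papers try to avoid the neutral class, but it does exist in practice. Labeling neutral items also requires careful judgment. Judging by the current results, the three-class performance is acceptable.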
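For readers unfamiliar with the leave-one-out validation mentioned above, here is a minimal sketch of the procedure: each sample is held out once, a model is trained on the rest, and the held-out sample is predicted. The function names, the toy 1-nearest-neighbour classifier, and the numeric feature tuples are all assumptions made for illustration; they are not the models or features used in this repository.

```python
def leave_one_out_accuracy(samples, labels, train):
    """Hold out each sample once, train on the rest, and score the prediction.

    train(train_x, train_y) must return a predict(x) function.
    """
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predict = train(train_x, train_y)
        if predict(samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_nearest_neighbor(train_x, train_y):
    """Toy 1-nearest-neighbour classifier (illustrative stand-in only)."""
    def predict(x):
        def squared_distance(a):
            return sum((p - q) ** 2 for p, q in zip(a, x))
        best = min(range(len(train_x)), key=lambda j: squared_distance(train_x[j]))
        return train_y[best]
    return predict


# Invented two-cluster data, not taken from the news corpus
samples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (1.0, 0.9), (0.9, 1.1), (1.1, 1.0)]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
accuracy = leave_one_out_accuracy(samples, labels, train_nearest_neighbor)
```

The same loop works for two or three classes, since it only compares predicted and true labels. With 1000 samples it trains 1000 models, so the cost of a single training run matters.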

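Some notes

Because of some problems, the combinations of tf-idf with svm and related models are disabled for now; see the bug issue for the specific reasons. (In the results above, one row and one column are all zeros.)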

Please add a requirements.txt

Nice work, but I hope a requirements.txt can be added.

Can the training dataset be added?

Can the training dataset be added now?
