bhargavvader / personal
Contains Jupyter Notebooks of stuff I am working on.
Solution, download the English dictionary:
python -m spacy.en.download all
This was suggested by someone in the room; I am posting it here so it doesn't get lost.
https://pypi.python.org/pypi/Unidecode
So instead of doc = nlp(clean(text)),
doc = nlp(unidecode(text))
can be used.
This should keep the result as close as possible to the original text.
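A minimal sketch of the transliteration step described above. It uses the `unidecode` package linked above when it is installed; the fallback built on the standard library's `unicodedata` is my own assumption and only approximates it (it drops characters it cannot decompose instead of transliterating them). The sample string is made up, and `nlp` is assumed to be an already loaded spaCy pipeline, so that call is left as a comment.

```python
import unicodedata

try:
    # Third-party package from the PyPI link above.
    from unidecode import unidecode
except ImportError:
    def unidecode(text):
        # Stdlib approximation (assumption, not the real package):
        # split accented characters into base letter + accent,
        # then drop everything that is not plain ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

text = "Café Zoë naïve"          # hypothetical sample input
ascii_text = unidecode(text)
print(ascii_text)

# With a loaded spaCy pipeline this would then be:
# doc = nlp(ascii_text)
```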
My script couldn't find matplotlib until I figured out that it was missing:
#!/usr/bin/python
Hi bhargav
It was an informative notebook about topic modeling and spaCy.
I have a question about how to do bigram and trigram topic modeling.
texts = metadata['cleandata']
bigram = gensim.models.Phrases(texts)
For example, this gives an LDA output of: India, car, license, india, visit, visa
I want output such as: India car license, Visit visa, indian hotel
This code gives bigrams using TF-IDF:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def display_topics(model, feature_names, no_top_words):
    # Print the highest-weighted terms of every topic.
    for topic_idx, topic in enumerate(model.components_):
        print("Topic:", topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

def tfidf_vectorizer(documents, total_features):
    # TF-IDF vectorizer restricted to bigrams (ngram_range must be a tuple).
    vectorizer = TfidfVectorizer(max_features=total_features, stop_words='english', ngram_range=(2, 2))
    tfidf = vectorizer.fit_transform(documents)
    # get_feature_names() was replaced by get_feature_names_out() in recent scikit-learn releases.
    tfidf_feature_names = vectorizer.get_feature_names_out()
    return vectorizer, tfidf, tfidf_feature_names

def count_vectorizer(documents, total_features):
    # Count vectorizer over single words.
    tf_vectorizer = CountVectorizer(max_features=total_features, stop_words='english')
    tf = tf_vectorizer.fit_transform(documents)
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    return tf_vectorizer, tf, tf_feature_names
My question is: how do I do bigram and trigram topic modeling in gensim?
Thanks in advance
Topic Modelling with scikit-learn
Let us now use NMF and LDA, both of which are available in sklearn, to see how these topic models work.
In [20]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF, LatentDirichletAllocation
In [21]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
In [22]:
documents
Out[22]:
[u"Well i'm not sure about the story nad it did seem biased. What\nI disagr
Following up on PyData Amsterdam: we noticed that stopwords like "the" were not removed from the corpus. This seems to happen in the notebook text_analysis_tutorial_unrun as well as ..._run.
Also, the word '-PRON-' appears in the clusters, but it is unclear where it comes from.
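As far as I know, '-PRON-' is the placeholder lemma that older spaCy versions (1.x and 2.x) assign to every pronoun, so it ends up in the corpus whenever lemmas are collected without filtering. The sketch below shows one way such tokens and stopwords could be dropped before topic modeling. It works on plain lists of lemma strings; the function name `clean_lemmas` and the tiny `STOP_WORDS` set are hypothetical, and this is not the notebook's own preprocessing.

```python
# Hypothetical stop word list; in practice the list shipped with
# spaCy or sklearn would be used instead.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

# Placeholder lemma used for pronouns by older spaCy versions.
PRONOUN_LEMMA = "-PRON-"

def clean_lemmas(lemmas, stop_words=STOP_WORDS):
    # Keep a lemma only if it is neither the pronoun placeholder
    # nor (case-insensitively) a stop word.
    return [
        lemma for lemma in lemmas
        if lemma != PRONOUN_LEMMA and lemma.lower() not in stop_words
    ]

lemmas = ["-PRON-", "read", "The", "report", "of", "the", "committee"]
cleaned = clean_lemmas(lemmas)
print(cleaned)
```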