This notebook started from Kaggle's NIPS 2015 Papers dataset, which was released to enable and showcase analytical exploration of the NIPS data. I then scraped the web for the number of citations received by each paper published at NIPS 2015, using scholar, which extracts citation information from Google Scholar. Combining the citation information with topic modeling techniques, I could identify the most influential papers and topics from NIPS 2015.
NIPS is one of the top machine learning conferences in the world. According to Wikipedia, some of the main topics covered at NIPS are machine learning, statistics, artificial intelligence, and computational neuroscience. However, these topics all lie within the same domain and overlap, which makes it more challenging to distinguish them as separate topics. In this kernel I try to extract topics using Latent Dirichlet Allocation (LDA). This repository features an end-to-end natural language processing pipeline, starting with raw data (the full text of the papers) and running through preprocessing, modeling, and clustering papers based on their topics. I wrap up by visualizing clusters of papers and their relative similarity using t-SNE. We'll touch on the following points:
- Topic modeling with LDA
- Visualizing topic models with pyLDAvis
- Visualizing LDA results with t-SNE and Bokeh
- Citation analysis
Note: The two major visualizations of this notebook are interactive graphs, which do not load on GitHub. To see them, go to this link: Topic Modeling
Network analysis: Scientific collaboration network
It's known that different scientists have different working styles: some enjoy mentoring many pupils, while others prefer working with fewer. Moreover, even for a given scientist, the pattern of interaction with the scientific community may change over the course of a career. In this kernel, I try to see whether we can classify authors at NIPS'15 based on their collaborations and the number and type of their publications. Here is the outline:
- EDA
- Network measures (NetworkX)
- KMeans clustering (scikit-learn)
- KMeans visualization using PCA for dimensionality reduction
- Network visualization (Plotly and NetworkX)
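A rough sketch of how the outline above could fit together is shown below. It is an assumption-laden illustration, not the notebook's code: the `papers` mapping is hypothetical toy data, and the feature set (degree centrality, betweenness centrality, clustering coefficient, paper count) and the choice of two clusters are picked only for demonstration.

```python
from itertools import combinations

import networkx as nx
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical paper -> author list mapping (NOT the NIPS'15 data).
papers = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "D"],
    "p3": ["E", "F"],
    "p4": ["A", "E"],
    "p5": ["G"],
}

# Co-authorship graph: authors are nodes, an edge links two co-authors.
G = nx.Graph()
for authors in papers.values():
    G.add_nodes_from(authors)  # keeps single-author papers in the graph
    G.add_edges_from(combinations(authors, 2))

# Per-author network measures plus a simple publication count.
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
clustering = nx.clustering(G)
n_papers = {a: sum(a in auths for auths in papers.values()) for a in G}

authors = sorted(G.nodes)
features = np.array(
    [[degree[a], betweenness[a], clustering[a], n_papers[a]] for a in authors]
)

# Standardize so no single measure dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(features)

# Group authors by collaboration profile; k=2 is an arbitrary choice here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Reduce to two principal components for a 2-D scatter plot of the clusters.
coords = PCA(n_components=2).fit_transform(scaled)

for author, label, (x, y) in zip(authors, labels, coords):
    print(f"{author}: cluster {label}, PCA=({x:.2f}, {y:.2f})")
```

On real data the number of clusters would normally be chosen with a diagnostic such as the elbow method or silhouette scores rather than fixed in advance.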
Note: The two major visualizations of this notebook are interactive graphs, which do not load on GitHub. To see them, go to this link: Network analysis