adolenc / topotext
Text analysis using persistence diagrams
Use bottleneck distance to compute distance between two diagrams (in all dimensions).
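A minimal sketch of the bottleneck distance for small diagrams, assuming each diagram is a list of finite (birth, death) pairs and that diagrams for several dimensions are kept in a dict keyed by dimension. The names bottleneck_distance and bottleneck_all_dims are hypothetical, and this brute-force version (binary search over candidate thresholds plus augmenting-path matching) is only an illustration, not the project's or Dionysus's implementation.

```python
import numpy as np


def bottleneck_distance(dgm_a, dgm_b):
    """Bottleneck distance between two diagrams of finite (birth, death) pairs."""
    a = np.asarray(dgm_a, dtype=float).reshape(-1, 2)
    b = np.asarray(dgm_b, dtype=float).reshape(-1, 2)
    n, m = len(a), len(b)
    size = n + m
    if size == 0:
        return 0.0

    # Rows: points of A, then diagonal slots for B.
    # Columns: points of B, then diagonal slots for A.
    cost = np.zeros((size, size))
    if n and m:
        # L-infinity distance between off-diagonal points.
        cost[:n, :m] = np.abs(a[:, None, :] - b[None, :, :]).max(axis=2)
    if n:
        # Cost of sending a point of A to the diagonal.
        cost[:n, m:] = ((a[:, 1] - a[:, 0]) / 2.0)[:, None]
    if m:
        # Cost of sending a point of B to the diagonal.
        cost[n:, :m] = ((b[:, 1] - b[:, 0]) / 2.0)[None, :]
    # Diagonal-to-diagonal pairs stay at cost 0.

    def has_perfect_matching(threshold):
        match_col = [-1] * size

        def try_row(row, seen):
            for col in range(size):
                if cost[row, col] <= threshold and not seen[col]:
                    seen[col] = True
                    if match_col[col] == -1 or try_row(match_col[col], seen):
                        match_col[col] = row
                        return True
            return False

        return all(try_row(row, [False] * size) for row in range(size))

    # The answer is one of the entries of the cost matrix.
    thresholds = np.unique(cost)
    lo, hi = 0, len(thresholds) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if has_perfect_matching(thresholds[mid]):
            hi = mid
        else:
            lo = mid + 1
    return float(thresholds[lo])


def bottleneck_all_dims(dgms_a, dgms_b):
    """Per-dimension distances for dicts of the form {dimension: diagram}."""
    dims = sorted(set(dgms_a) | set(dgms_b))
    return {d: bottleneck_distance(dgms_a.get(d, []), dgms_b.get(d, []))
            for d in dims}
```

Pairs with infinite death would need to be filtered or capped before calling this; how to treat them is left open here.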
Also apply clustering to check if persistence diagrams make sense?
Put the data in folder ./data/<folder_name>
Names specified below:
Perhaps by using http://www.netlib.org/voronoi/hull.html
Meaning it takes texts, converts to feature matrix, splits matrices and generates diagrams appropriately.
I am having serious trouble figuring out how the .data[0] component for 3d alpha_shapes is computed and what its maximum value could be. This then messes up our division of the interval into 10 equal parts and produces a handful of erroneously large birth/death pairs for each domain.
You can play around with this by changing line 14 in main.py to cx_method = alpha_shapes, which will then print out the matrix of points (the PCA-projected X) on which it builds alpha_shapes, and all the "bad" simplices along with the square root of their data[0] and the estimated maximum R for this domain. Note that the number of such simplices is very small (so in the worst case we could just ignore them). Note also that changing line 15 to dims = 2 now works correctly for alpha_shapes.
Any help is appreciated.
Reference: http://www.mrzv.org/software/dionysus/python/alphashapes.html#alphashapes
Our method doesn't work even on the linearly separable iris classes. But that doesn't necessarily mean anything is wrong, as we are comparing the structure of the examples of each class individually. We should create a custom set of points with a clear difference in structure between classes (e.g. samples from one class would lie on a circle, and samples from the other on a straight line).
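One way such a custom set of points could be generated is sketched below, assuming 2D points with a little Gaussian noise. The function name make_circle_line_dataset and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np


def make_circle_line_dataset(n_per_class=50, noise=0.02, seed=0):
    """Return (X, y): class 0 lies on the unit circle, class 1 on a line segment."""
    rng = np.random.default_rng(seed)

    # Class 0: points on the unit circle (one clear 1-dimensional hole).
    theta = rng.uniform(0.0, 2.0 * np.pi, n_per_class)
    circle = np.column_stack([np.cos(theta), np.sin(theta)])

    # Class 1: points on the segment from (-1, 0) to (1, 0) (no hole).
    t = rng.uniform(-1.0, 1.0, n_per_class)
    line = np.column_stack([t, np.zeros(n_per_class)])

    X = np.vstack([circle, line])
    X += rng.normal(scale=noise, size=X.shape)  # small perturbation
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y
```

If the diagrams are working as intended, the dimension-1 diagram of the circle class should contain one long-lived pair, while the line class should show only short-lived ones.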
Write code to draw plots of both persistence diagrams and bar diagrams.
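A possible starting point for the plotting code, assuming matplotlib is available and that a diagram is a list of finite (birth, death) pairs. The function name plot_diagram_and_barcode is hypothetical, and the Agg backend is selected only so that the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt


def plot_diagram_and_barcode(diagram, title=""):
    """Draw a persistence diagram (left) and the matching bar diagram (right)."""
    pairs = sorted(diagram)
    births = [b for b, _ in pairs]
    deaths = [d for _, d in pairs]
    lo = min(births, default=0.0)
    hi = max(deaths, default=1.0)

    fig, (ax_pd, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

    # Persistence diagram: one dot per pair, plus the diagonal birth = death.
    ax_pd.scatter(births, deaths)
    ax_pd.plot([lo, hi], [lo, hi], linestyle="--")
    ax_pd.set_xlabel("birth")
    ax_pd.set_ylabel("death")
    ax_pd.set_title("Persistence diagram " + title)

    # Bar diagram: one horizontal bar per pair, from birth to death.
    for i, (b, d) in enumerate(pairs):
        ax_bar.hlines(i, b, d)
    ax_bar.set_xlabel("filtration value")
    ax_bar.set_ylabel("pair index")
    ax_bar.set_title("Bar diagram " + title)

    fig.tight_layout()
    return fig
```

Calling fig.savefig("diagram.png") on the returned figure writes it to disk; the file name is only an example.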
Since implementations of alpha shapes in more than 3 dimensions are rare or non-existent, implement a PCA step to project the text features into 2 (or 3) dimensions and then build alpha shapes on top of that.
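The projection step might look like the following sketch, which assumes sklearn's PCA and a feature matrix with one row per sample. The helper name project_features is hypothetical, and the alpha-shape construction itself is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA


def project_features(X, n_components=3):
    """Project an (m x n) feature matrix to n_components dimensions with PCA."""
    if hasattr(X, "toarray"):
        # Assumption: sparse matrices (e.g. tf-idf output) are small enough to densify.
        X = X.toarray()
    X = np.asarray(X, dtype=float)
    return PCA(n_components=n_components).fit_transform(X)
```

The result has shape (m, 2) or (m, 3), which is the kind of point set a 2D or 3D alpha-shape filtration expects.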
Use sklearn, Dionysus...
Instead of defining R as max distance between two points in domain, define it as max distance in filtration.
Preprocessor should accept:
and combines all of the features into matrices X and y.
Change and upload the code for calculating the tf-idf matrix.
It should accept an array of string arrays, where one row represents one text. Obviously rows will be of different lengths.
It will return a tf-idf matrix (m x n), where m is the number of samples and n is the number of words selected as features.
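A sketch of that interface, assuming sklearn's TfidfVectorizer is acceptable and that the texts arrive already tokenized. The function name tfidf_matrix and the n_features parameter are hypothetical. Passing a callable analyzer makes the vectorizer use each row's tokens as they are, and max_features keeps the most frequent terms across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_matrix(texts, n_features=None):
    """Build an (m x n) tf-idf matrix from a list of token lists.

    texts: one list of string tokens per text (rows may differ in length).
    n_features: keep only this many of the most frequent terms (None keeps all).
    """
    vectorizer = TfidfVectorizer(
        analyzer=lambda tokens: tokens,  # rows are already tokenized
        max_features=n_features,
    )
    X = vectorizer.fit_transform(texts).toarray()
    return X, vectorizer.vocabulary_


texts = [
    ["a", "b", "a"],
    ["b", "c"],
    ["a", "c", "d", "d"],
]
X, vocabulary = tfidf_matrix(texts)
```

Here X has one row per text and one column per selected word, and vocabulary maps each word to its column index.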