percyliang / brown-cluster

C++ implementation of the Brown word clustering algorithm.

C++ 83.80% C 11.48% Shell 0.53% Python 2.59% CSS 0.30% Makefile 0.44% HTML 0.86%

brown-cluster's Issues

Question

Hello, I would like to use your algorithm to categorize job titles. Do you still update and maintain the library?

Best regards,
Evangelia

Running The code

Hi,

Can you please explain how I can pass multiple text files and generate output files for them?

How to choose the optimal number of clusters for a specific input corpus?

Is there any limit for the vocab size (#types)?

The code fails with a segmentation fault (core dumped) when I run it on a huge text file (about 20M types, 14 GB). I have already used wcluster on other files with far fewer types, and it worked well.

Is there any limit for the vocabulary size (#types)?

How is paths2map used?

Hello! I was browsing the code and saw this option:
opt_define_bool(paths2map, "paths2map", false, "Take the paths file and generate a map file.");
Can it be used? What is the output? Is it something like the tree presented in the Brown algorithm paper?
Thank you very much.

Speed up with compiler optimization

In case anyone is clustering large datasets:

In my experiments (40M corpus and NofClusters=1000), turning on compiler optimization with "-O3" yields a speed-up of around 3x.

I changed the following lines in my Makefile:

wcluster: $(files)
    g++ -Wall -g -O3 -o wcluster $(files)

%.o: %.cc
    g++ -Wall -g -O3 -o $@ -c $<

basic/prob-utils.cc:8:37: error: ‘M_PI’ was not declared in this scope

I am using Cygwin on Windows and trying to run this code. At the first step, running the "make" command, I get the following error.

basic/prob-utils.cc: In function ‘double rand_gaussian(double, double)’:
basic/prob-utils.cc:8:37: error: ‘M_PI’ was not declared in this scope
double z = sqrt(-2*log(x1))*cos(2*M_PI*x2);
^~~~
make: *** [Makefile:13: basic/prob-utils.o] Error 1

Can you guide me in this regard?
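A possible workaround, offered as a minimal sketch rather than the project's official fix: M_PI is a POSIX extension and not part of standard C++, so some toolchains hide it when compiling in a strict standard mode. Defining a fallback after including <cmath> avoids the error. The function name box_muller and its parameters are hypothetical and exist only to illustrate the expression from the error message.

```cpp
#include <cmath>

// M_PI is not guaranteed by the C++ standard; provide it only if the
// toolchain's <cmath> did not define it.
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

// Hypothetical stand-alone helper: Box-Muller transform of two uniform
// samples x1, x2 (x1 in (0, 1]) into one Gaussian sample with the given
// mean and variance. It reuses the expression from the failing line.
double box_muller(double x1, double x2, double mean, double var) {
  double z = std::sqrt(-2 * std::log(x1)) * std::cos(2 * M_PI * x2);
  return z * std::sqrt(var) + mean;
}
```

Placing the same #ifndef guard near the top of basic/prob-utils.cc, after its includes, should have the same effect. Compiling with a GNU dialect flag such as -std=gnu++11 may also expose M_PI on Cygwin, though that is an assumption about the local setup.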

Broken link to thesis

http://www.cs.berkeley.edu/~pliang/papers/meng-thesis.pdf is broken.

Clustering perplexity measure

Does the package return (or write in the log file) the perplexity or any other goodness of fit measure? If yes, would it be a good idea to run a BayesOpt optimizer to find the best clustering this way? Or is it ill-posed?

Thanks

A library for brown clustering?

I was wondering if it's possible to make a library out of this code in order to be able to include it into other projects?

Is it possible to cluster new documents without relearning everything?

I'm looking for some way to run the clustering algorithm while using previously learned collocs, map, and paths. I tried pointing to the paths file with the --paths flag, but this just overwrote it with a newly learned one.

I don't have time to relearn everything from scratch: it takes days!
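One partial workaround, sketched under assumptions: this does not re-run or extend the clustering; it only looks up words of a new document in an existing paths file. It assumes each line of the paths file has the form "<bit string> <word> <count>" separated by whitespace. The names PathTable, load_paths, and lookup_clusters are hypothetical and are not part of wcluster.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Maps a word to its cluster bit string, e.g. "dog" -> "0110".
using PathTable = std::unordered_map<std::string, std::string>;

// Reads lines assumed to look like "<bit string> <word> <count>".
// The count column is ignored; malformed lines are skipped.
PathTable load_paths(std::istream& in) {
  PathTable table;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string path, word;
    if (fields >> path >> word) table[word] = path;
  }
  return table;
}

// Replaces each whitespace-separated token of `text` with its cluster
// bit string. Words absent from the table get the `unknown` marker.
std::vector<std::string> lookup_clusters(const PathTable& table,
                                         const std::string& text,
                                         const std::string& unknown = "<unk>") {
  std::vector<std::string> result;
  std::istringstream tokens(text);
  std::string word;
  while (tokens >> word) {
    PathTable::const_iterator it = table.find(word);
    result.push_back(it == table.end() ? unknown : it->second);
  }
  return result;
}
```

Words that never appeared in the training corpus cannot be placed this way, so whether the approach is good enough depends on how much new vocabulary the new documents contain.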

Problem compiling on Windows 7

I'm trying to compile on Windows 7 using g++ 4.7.2 and GNU Make 3.8.1. When I do I get the following errors:

C:\Users\ameasure\brown-cluster-master>make
g++ -Wall -g -o wcluster.o -c wcluster.cc
wcluster.cc: In function 'void repcheck()':
wcluster.cc:431:3: error: '__STRING' was not declared in this scope
wcluster.cc:432:3: error: '__STRING' was not declared in this scope
wcluster.cc: In function 'int main(int, char**)':
wcluster.cc:1072:3: error: '__STRING' was not declared in this scope
make: *** [wcluster.o] Error 1

Any idea what's going on?
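A likely explanation, stated as an assumption: __STRING is a stringification macro supplied by <sys/cdefs.h> on glibc and BSD systems, and Windows toolchains such as MinGW may not provide it. A guarded fallback definition, as in this minimal sketch, is a common workaround. The helper stringified_expression is hypothetical and only demonstrates what the macro does.

```cpp
#include <string>

// Define __STRING only when the platform headers did not.
// Note: names beginning with a double underscore are reserved for the
// implementation, so this is a pragmatic patch, not strictly portable.
#ifndef __STRING
#define __STRING(x) #x
#endif

// Hypothetical demo: the macro turns its argument into a string literal.
std::string stringified_expression() {
  return __STRING(a + b);
}
```

Adding the same guard near the top of wcluster.cc, after its includes, may be enough to get past these three errors, although other platform differences could still surface later in the build.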

What happens if the length of the text is bigger than INT_MAX?

The length of the text is declared as an int in the source, so what happens if the text is longer than INT_MAX?
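As general background, not a statement about what wcluster actually does in this case: incrementing a signed int past INT_MAX is undefined behavior in C++, so the result cannot be relied on. A minimal sketch of a pre-check counts tokens with a 64-bit counter and compares the total against INT_MAX before running the tool. All function names here are hypothetical.

```cpp
#include <climits>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Counts whitespace-separated tokens with a 64-bit counter, so the
// count itself cannot overflow on any realistic corpus.
std::uint64_t count_tokens(std::istream& in) {
  std::uint64_t n = 0;
  std::string token;
  while (in >> token) ++n;
  return n;
}

// Convenience wrapper for in-memory text.
std::uint64_t count_tokens_in_string(const std::string& text) {
  std::istringstream in(text);
  return count_tokens(in);
}

// True if a length of n can be stored in a plain int without overflow.
bool fits_in_int(std::uint64_t n) {
  return n <= static_cast<std::uint64_t>(INT_MAX);
}
```

If fits_in_int reports false for a corpus, splitting or subsampling the input would be the cautious choice until the int-typed length in the source is widened.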

what are these results?

I'm not sure whether this is an issue or a matter of understanding. I ran the clustering on Persian text, and after a couple of hours I got these results in the map output:
بینبریج 00111111-L 5.54361 00111111-R 2.82232 00111111-freq 1
گروهان 00111111-L 5.20714 00111111-R 2.7586 00111111-freq 1
می‌دهده 00111111-L 4.15732 00111111-R 6.05444 00111111-freq 1
...
I'm not sure what each column means, or which one exactly is the cluster group.