juliatext / textanalysis.jl
Julia package for text analysis
License: Other
On a fresh installation of Julia on Ubuntu 14.04.5 LTS, I'm receiving the following error from Pkg.add("TextAnalysis"):
$ sudo apt-get install julia
$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.1 (2014-02-11 06:30 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  x86_64-linux-gnu
julia> Pkg.add("TextAnalysis")
INFO: Initializing package repository /home/ltan/.julia/v0.2
INFO: Cloning METADATA from git://github.com/JuliaLang/METADATA.jl
INFO: Cloning cache of BinDeps from git://github.com/JuliaLang/BinDeps.jl.git
INFO: Cloning cache of Blocks from git://github.com/JuliaParallel/Blocks.jl.git
INFO: Cloning cache of Compat from git://github.com/JuliaLang/Compat.jl.git
INFO: Cloning cache of DataArrays from git://github.com/JuliaStats/DataArrays.jl.git
INFO: Cloning cache of DataFrames from git://github.com/JuliaStats/DataFrames.jl.git
INFO: Updating cache of DataFrames...
INFO: Cloning cache of GZip from git://github.com/JuliaIO/GZip.jl.git
INFO: Cloning cache of Languages from git://github.com/johnmyleswhite/Languages.jl.git
INFO: Cloning cache of SHA from git://github.com/staticfloat/SHA.jl.git
INFO: Cloning cache of SortingAlgorithms from git://github.com/JuliaLang/SortingAlgorithms.jl.git
INFO: Cloning cache of StatsBase from git://github.com/JuliaStats/StatsBase.jl.git
INFO: Cloning cache of TextAnalysis from git://github.com/johnmyleswhite/TextAnalysis.jl.git
INFO: Cloning cache of URIParser from git://github.com/JuliaWeb/URIParser.jl.git
ERROR: Missing package versions (possible metadata misconfiguration): DataFrames v(nothing,v"0.4.3") [84523937447b336dfcbf1747b123e9d1da1308c6[1:10]]
in error at error.jl:21
in resolve at pkg/entry.jl:350
in resolve at pkg/entry.jl:316
in edit at pkg/entry.jl:24
in add at pkg/entry.jl:44
in add at pkg/entry.jl:48
in anonymous at pkg/dir.jl:28
in cd at file.jl:22
in cd at pkg/dir.jl:28
in add at pkg.jl:19
This issue is being filed by a script, but if you reply, I will see it.
PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3).
The results of this script are used to generate a package listing enhanced with testing results.
The status of this package, TextAnalysis, on...
'No tests, but package loads.' can be due to there being no tests (you should write some if you can!) but can also be due to PackageEvaluator not being able to find your tests. Consider adding a test/runtests.jl file.
'Package doesn't load.' is the worst-case scenario. Sometimes this arises because your package doesn't have BinDeps support, or needs something that can't be installed with BinDeps. If this is the case for your package, please file an issue and an exception can be made so your package will not be tested.
This automatically filed issue is a one-off message. Starting soon, issues will only be filed when the testing status of your package changes in a negative direction (gets worse). If you'd like to opt-out of these status-change messages, reply to this message.
The function remove_corrupt_utf8() does not work under Julia v0.4.6. The problem is the line zeros(Char, endof(s)+1), where it complains that zero is not defined for type Char. When using UInt8 instead I could make it run without error, but please check whether this does what it is supposed to do.
function remove_corrupt_utf8(s::AbstractString)
r = zeros(UInt8, endof(s)+1)
i = 0
for chr in s
i += 1
r[i] = (chr != 0xfffd) ? chr : ' '
end
return utf8(r)
end
Note that on the return statement I got rid of the CharString()
too.
If this is ok I can make another pull request.
Cheers,
Andre
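For reference, on current Julia the same cleanup can be sketched without a byte buffer at all. This is an illustrative sketch, not the package's actual implementation:

```julia
# Sketch: replace the Unicode replacement character U+FFFD with a
# space, character by character, and rebuild the string.
remove_corrupt_utf8(s::AbstractString) =
    String([c == '\ufffd' ? ' ' : c for c in s])
```

Iterating a String yields Chars, so no manual indexing into the byte representation is needed.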
Julia should have a pluggable stemmer interface supporting POS tagging, or at least lemmatizer support like WordNet.
A native Julia Porter2 stemmer is already available at: https://github.com/mguzmann/CorpusTools/blob/master/src/PortStemmer.jl
If I create a text file which consists of the string "1 2" (not including the quotes), my lexicon always contains the empty token (""). Even if I try to remove the word "", it is still there. Is this the intended behaviour? It causes problems for me because in my LDA implementation I get the word "" in my topics.
Example:
$ cat docs/Sample1.txt
1 2
$ cat text_analysis_etoken.jl
using TextAnalysis
crps = DirectoryCorpus("docs")
standardize!(crps, StringDocument)
remove_words!(crps, [""])
update_lexicon!(crps)
println("Dictionary is: " * string(lexicon(crps)))
println("Corpus contains: " * string(length(crps)) * " documents")
$ julia text_analysis_etoken.jl
Corpus's lexicon contains 3 tokens
Corpus's index contains 3 tokens
Dictionary is: ["1"=>1,"2"=>1,""=>2]
Corpus contains: 1 documents
Best Regards
-Leif
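For what it's worth, the empty token can be avoided at tokenization time by dropping empty strings after the split; a minimal sketch (not the package's tokenizer):

```julia
# Splitting "1 2\n" on whitespace yields a trailing empty string;
# filtering it out keeps "" from entering the lexicon.
tokens = filter(!isempty, split("1 2\n", r"\s+"))
```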
From travis-ci:
Info: Calculated hash 5dcb031eccf01bb0b2d074281140679683f73603a54caa79941a1df1c8a6d70d for file /home/travis/.julia/v0.6/TextAnalysis/deps/usr/downloads/sentiment.tar.gz
============================[ ERROR: TextAnalysis ]=============================
LoadError: Hash Mismatch!
Expected sha256: f237378f3f866c7e697ed893b4208878a6c5dd111eddcebc84ac55dab3885004
Calculated sha256: 5dcb031eccf01bb0b2d074281140679683f73603a54caa79941a1df1c8a6d70d
while loading /home/travis/.julia/v0.6/TextAnalysis/deps/build.jl, in expression starting on line 44
================================================================================
We need some kind of facility where you input a dictionary or list and have words get replaced. This is important for standardizing synonyms, dates, and contractions. I think code like this makes it easier to tokenize text, especially since you eliminate apostrophes.
In Python I had code that replaced contractions using regexes:
'''
replacement_patterns = [
    (r"(?i)won't", "will not"),
    (r"((?i)can't|(?i)can not)", "cannot"),
    (r"(?i)i'm", "i am"),
    (r"(?i)ain't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+t)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
    (r"'cause", "because"),
]
'''
Not sure whether it is possible to do this in Julia.
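A rough Julia equivalent is possible with replace and Regex => SubstitutionString pairs; the pattern list below mirrors part of the Python one and is only a sketch:

```julia
# Sketch: expand contractions by applying regex substitutions in order.
const CONTRACTION_PATTERNS = [
    r"(?i)won't" => "will not",
    r"(?i)can't" => "cannot",
    r"(?i)i'm"   => "i am",
    r"(\w+)'ll"  => s"\1 will",
    r"(\w+)n't"  => s"\1 not",
    r"(\w+)'ve"  => s"\1 have",
    r"(\w+)'re"  => s"\1 are",
    r"(\w+)'d"   => s"\1 would",
]

function expand_contractions(text::AbstractString)
    for (pattern, replacement) in CONTRACTION_PATTERNS
        text = replace(text, pattern => replacement)
    end
    return text
end
```

Note that pattern order matters: the irregular forms (won't, can't) must run before the generic (\w+)n't rule.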
Is there a reason why documents are represented as rows in the dtm?
--> dtm is a (nDocs x nFeatures) sized matrix.
Having docs as columns would be computationally more efficient, right?
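For context, Julia's SparseMatrixCSC is compressed sparse column, so a docs-as-columns layout would make per-document access a contiguous-range lookup; a small illustration (not the package's dtm code):

```julia
using SparseArrays

# CSC stores each column's nonzeros contiguously: a column slice is a
# cheap range read, while a row slice must probe every column.
A = sparse([1, 2, 1], [1, 1, 2], [1.0, 2.0, 3.0], 3, 3)
col = A[:, 1]   # nonzeros (1,1) and (2,1), one contiguous range
row = A[1, :]   # nonzeros (1,1) and (1,2), found by scanning columns
```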
PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3). The results of this script are used to generate a package listing enhanced with testing results.
'Tests fail, but package loads.' means that PackageEvaluator found the tests for your package, executed them, and they didn't pass. However, trying to load your package with using worked.
'Package doesn't load.' means that PackageEvaluator did not find tests for your package. Additionally, trying to load your package with using failed.
This issue was filed because your testing status became worse. No additional issues will be filed if your package remains in this state, and no issue will be filed if it improves. If you'd like to opt-out of these status-change messages, reply to this message saying you'd like to and @IainNZ will add an exception. If you'd like to discuss PackageEvaluator.jl please file an issue at the repository. For example, your package may be untestable on the test machine due to a dependency - an exception can be added.
Test log:
INFO: Installing Languages v0.0.1
INFO: Installing TextAnalysis v0.0.1
INFO: Package database updated
signal (11): Segmentation fault
??? at ???:1982132950
??? at ???:1982132515
??? at ???:1982135282
??? at ???:1982138981
??? at ???:1982054748
??? at ???:1981703818
??? at ???:1981705293
??? at ???:1981708965
??? at ???:1981710070
??? at ???:1981712156
??? at ???:1981793071
typeinf at ./inference.jl:1240
??? at ???:1947850708
??? at ???:1981766906
abstract_call_gf at ./inference.jl:724
..truncated..
??? at ???:1981766906
??? at ???:1982047992
??? at ???:1982044265
??? at ???:1982110975
??? at ???:1981799125
reload_path at loading.jl:152
_require at loading.jl:67
??? at ???:1981767045
require at loading.jl:51
??? at ???:1954597347
??? at ???:1981767045
??? at ???:1982106925
??? at ???:1982111579
??? at ???:1982113648
??? at ???:1982114080
include at ./boot.jl:245
??? at ???:1981766906
include_from_node1 at loading.jl:128
??? at ???:1981767045
process_options at ./client.jl:285
_start at ./client.jl:354
??? at ???:1949235993
??? at ???:1981766906
??? at ???:4200466
??? at ???:1982069575
??? at ???:4199453
??? at ???:1962160128
??? at ???:4199507
??? at ???:0
timeout: the monitored command dumped core
INFO: Package database updated
The URL of this package does not match that stored in METADATA.jl.
cc: @aviks
The stem! method is currently a stub. It would be very useful to have a stemmer integrated.
What is the direction to be taken to add a stemmer? Would wrapping the snowball stemmer from http://snowball.tartarus.org/ be a good approach? What are the other good alternatives?
I would argue that the following is unintuitive, and would like for broadcasted functions over a corpus to return an array with the result of the function applied to each document in the corpus. If this is deemed a good idea, I would be willing to submit a PR if desired.
julia> length(crps)
12
julia> length(crps[1])
160427
First, thanks for producing a Julia implementation of LDA.
The implementation currently returns P(w|z), the distribution over words w for each topic z. However, it does not return the other distribution, P(z|d), the distribution over topics z for each document d.
It would be great if this could be included.
Thanks.
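If the sampler's per-document topic counts are available, P(z|d) is recoverable by normalizing them with the Dirichlet prior; a generic sketch (the variable names here are mine, not the package's):

```julia
# theta[k] = (n_dk[k] + alpha) / (n_d + K * alpha): the smoothed
# per-document topic distribution from topic-assignment counts.
theta(n_dk::Vector{Int}, alpha::Float64) =
    (n_dk .+ alpha) ./ (sum(n_dk) + length(n_dk) * alpha)
```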
Because remove_words! uses regex matching even for string input, it fails on actually-present terms if those terms are larger than the maximum pattern size accepted by PCRE. Actually-present terms also fail if they contain regex-like punctuation. This produces an error message that doesn't specify the failed pattern, and furthermore aborts remove_words! entirely.
The same problem occurs in remove_sparse_terms! and remove_frequent_terms!, since these also boil down to a call to remove_pattern.
Would it be possible to force only string-literal substitution in the case where an array of type String is passed (and only use regex if the items passed are actually typed as regular expressions)?
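For plain-string input, literal substitution needs no regex at all: replace accepts a String pair and matches it verbatim. A sketch of the idea (not the package's code):

```julia
# Literal removal: a String => String pair never goes through PCRE,
# so regex metacharacters and very long terms are safe.
remove_literal(text::AbstractString, terms) =
    foldl((t, w) -> replace(t, w => ""), terms; init = text)
```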
Hi all ! Thanks for this nice package :)
The preprocessing function prepare!(::TextAnalysis.Corpus, ::UInt32) does not work (on my machine) because of the stem! function.
I get the following error
ERROR: error compiling #prepare!#10: error compiling stem!: error compiling release: could not load library "/Users/EmileMathieu/.julia/v0.6/TextAnalysis/src/../deps/usr/lib/libstemmer.dylib"
dlopen(/Users/EmileMathieu/.julia/v0.6/TextAnalysis/src/../deps/usr/lib/libstemmer.dylib, 1): image not found
Stacktrace:
[1] prepare!(::TextAnalysis.Corpus, ::UInt32) at /Users/EmileMathieu/.julia/v0.6/TextAnalysis/src/preprocessing.jl:268
Best,
Emile
The documentation does not reflect the deprecations of the remove_xxx!() functions. These functions also use a nonstandard deprecation mechanism, which makes it hard for users to understand how to transition to the new API. Consider using @deprecate.
The first time I executed it I got the following error:
julia> TextAnalysis.prepare!(corpus, prepare_flags)
ERROR: MethodError: no method matching zero(::Type{Char})
Closest candidates are:
zero(::Type{Base.LibGit2.GitHash}) at libgit2/oid.jl:106
zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:82
zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:124
...
Stacktrace:
[1] remove_corrupt_utf8(::String) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:46
[2] remove_corrupt_utf8!(::TextAnalysis.StringDocument) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:58
[3] remove_corrupt_utf8!(::TextAnalysis.Corpus) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:83
[4] #prepare!#10(::Set{AbstractString}, ::Set{AbstractString}, ::Function, ::TextAnalysis.Corpus, ::UInt32) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:271
[5] prepare!(::TextAnalysis.Corpus, ::UInt32) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:268
I'm using Julia 0.6, and I don't have any definition of zero(::Type{Char}). I'm not sure if Char used to be a Number and got zero from there, or if I'm missing an import. In any case, if I define it:
julia> import Base.zero
julia> zero(Char) = ' '
zero (generic function with 17 methods)
I then get
julia> TextAnalysis.prepare!(corpus, prepare_flags)
ERROR: UndefVarError: CharString not defined
Stacktrace:
[1] remove_corrupt_utf8(::String) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:52
[2] remove_corrupt_utf8!(::TextAnalysis.StringDocument) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:58
[3] remove_corrupt_utf8!(::TextAnalysis.Corpus) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:83
[4] #prepare!#10(::Set{AbstractString}, ::Set{AbstractString}, ::Function, ::TextAnalysis.Corpus, ::UInt32) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:271
[5] prepare!(::TextAnalysis.Corpus, ::UInt32) at /Users/mariosangiorgio/.julia/v0.6/TextAnalysis/src/preprocessing.jl:268
and this suggests that it has been deprecated since Julia 0.3.
Am I doing something wrong, or is remove_corrupt_utf8 broken on recent versions of Julia?
ERROR: /
has no method matching /(::Int64, ::Array{Int64,2})
in tf_idf at /home/steven/.julia/v0.3/TextAnalysis/src/tf_idf.jl:52
in tf_idf at /home/steven/.julia/v0.3/TextAnalysis/src/tf_idf.jl:74
here is a correct version:
function tf_idf{T <: Real}(dtm::Matrix{T})
    n, p = size(dtm)
    # TF tells us what proportion of a document is defined by a term
    tf = Array(Float64, n, p)
    # IDF tells us how rare a term is in the corpus
    idf = Array(Float64, p)
    for j in 1:p
        idf[j] = log(n / sum(dtm[:, j]))
    end
    for i in 1:n
        words_in_document = sum(dtm[i, :])
        tf[i, :] = dtm[i, :] / words_in_document
    end
    # TF-IDF is the product of TF and IDF
    # We store it in the TF matrix to save space
    for i in 1:n
        for j in 1:p
            tf[i, j] = tf[i, j] * idf[j]
        end
    end
    return tf
end
function tf_idf{T <: Real}(dtm::SparseMatrixCSC{T})
    # Same computation, for a sparse document-term matrix
    n, p = size(dtm)
    tf = Array(Float64, n, p)
    idf = Array(Float64, p)
    for j in 1:p
        idf[j] = log(n / sum(dtm[:, j]))
    end
    for i in 1:n
        words_in_document = sum(dtm[i, :])
        tf[i, :] = dtm[i, :] / words_in_document
    end
    for i in 1:n
        for j in 1:p
            tf[i, j] = tf[i, j] * idf[j]
        end
    end
    return tf
end
function tf_idf!{T <: Real}(dtm::Matrix{T})
error("not yet implemented")
end
function tf_idf!{T <: Real}(dtm::SparseMatrixCSC{T})
error("not yet implemented")
end
function tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm.dtm)
end
function tf_idf!(dtm::DocumentTermMatrix)
tf_idf!(dtm.dtm)
end
I've written a Julia package containing the Aho-Corasick algorithm for fast multi-string searching:
https://github.com/gilesc/AhoCorasick.jl
I haven't yet submitted it as an independent package to Julia's METADATA, though. I was thinking that it might be better suited as a submodule in this TextAnalysis package than as its own package. If you're amenable, I'd be happy to write up a PR.
I can't use prepare!, and other functions are deprecated. Could someone update the documentation?
This would be great, the R packages are slow and haven't been updated for years...
After adding the TextAnalysis package in Julia with
julia> Pkg.add("Distributions")
I am getting following error while using the package:
using TextAnalysis
ERROR: Stats not found
in require at loading.jl:39
in include at boot.jl:238
in include_from_node1 at loading.jl:114
in reload_path at loading.jl:140
in _require at loading.jl:58
in require at loading.jl:46
in include at boot.jl:238
in include_from_node1 at loading.jl:114
in reload_path at loading.jl:140
in _require at loading.jl:58
in require at loading.jl:43
at /Users/user/.julia/DataFrames/src/DataFrames.jl:5
at /Users/user/.julia/TextAnalysis/src/TextAnalysis.jl:1
(On Mac OS X 10.9.1)
I'm using the LDA module and wondering if it's possible to generate top words within each topic.
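If the fitted topic-word probability matrix is available (topics in rows, vocabulary in columns), the top words fall out of a sort; a generic sketch with made-up variable names:

```julia
# For each topic t, sort its word probabilities descending and keep
# the k highest-probability vocabulary entries.
function top_words(phi::Matrix{Float64}, vocab::Vector{String}, k::Int)
    [vocab[sortperm(phi[t, :]; rev = true)[1:k]] for t in 1:size(phi, 1)]
end
```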
After the update I have the following error:
ERROR: LoadError: MethodError: no method matching start(::TextAnalysis.StringDocument)
Closest candidates are:
start(!Matched::SimpleVector) at essentials.jl:170
start(!Matched::Base.MethodList) at reflection.jl:258
start(!Matched::IntSet) at intset.jl:184
...
in _append!(::Array{String,1}, ::Base.HasLength, ::TextAnalysis.StringDocument) at ./collections.jl:26
in readData(::String) at /home/pmargreff/github/am/naive_bayes/naive_bayes.jl:35
in include_from_node1(::String) at ./loading.jl:488
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
This seems to be a strange error. I looked at the preprocessing.jl file, but I could not find any reference to a zero function. Instead, the code clearly uses zeros. Is it some kind of compatibility/installation issue?
https://github.com/JuliaText/TextAnalysis.jl/blob/v0.2.1/src/preprocessing.jl#L46
How to reproduce:
julia> sd = StringDocument("With the market providing strong extensions over the last several months, we had to review our shorter-term targets for the bull market which began in 2009. During this past week, almost every bounce we saw was corrective in nature, which mainly kept me viewing the market as being in a weak posture, signaling that we will likely test the 2796 level on the S&P 500(^GSPC). While it seems I may have wrongly given the bull market the benefit of the doubt last weekend, as the pullback we expected has now broken below the 2800 region support we cited last weekend, this break of support makes the much stronger immediate bullish expectation much less likely.")
A TextAnalysis.StringDocument
julia> remove_corrupt_utf8!(sd)
ERROR: MethodError: no method matching zero(::Type{Char})
Closest candidates are:
zero(::Type{Base.LibGit2.GitHash}) at libgit2/oid.jl:106
zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:82
zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:124
...
Stacktrace:
[1] remove_corrupt_utf8(::String) at /home/ec2-user/.julia/v0.6/TextAnalysis/src/preprocessing.jl:46
[2] remove_corrupt_utf8!(::TextAnalysis.StringDocument) at /home/ec2-user/.julia/v0.6/TextAnalysis/src/preprocessing.jl:58
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
It would be nice to have a sentence tokenizer in this package.
Hi, I was trying to use the lsa function, but it gives an error because svd doesn't seem to support sparse matrices:
corpus = Corpus(docs)
update_lexicon!(corpus)
dtmat = DocumentTermMatrix(corpus)
lda(dtmat)
ERROR: svdfact
has no method matching svdfact(::SparseMatrixCSC{Float64,Int64})
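Until sparse SVD support exists, one workaround on current Julia is to densify before factorizing (fine for small vocabularies, memory-hungry for large ones); a sketch:

```julia
using SparseArrays, LinearAlgebra

# Densify the sparse matrix, then take the ordinary dense SVD.
S = sparse([1, 2], [1, 2], [3.0, 4.0], 2, 2)
F = svd(Matrix(S))
```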
this code (on julia v0.3 on Ubuntu 14.04):
using TextAnalysis
sd=StringDocument("dit is een test of dit, of dat kan werken met Nederlandse documenten.")
sd2=StringDocument("mooie badkamer met bad, douche en WC")
cp=Corpus({sd,sd2})
remove_punctuation!(cp)
remove_case!(cp)
remove_numbers!(cp)
remove_words!(cp,["de","het","een","en","of","geen","niet"])
update_lexicon!(cp)
dm=DocumentTermMatrix(cp)
gives the following error:
ERROR: getindex
has no method matching getindex(::Corpus, ::Int64)
in DocumentTermMatrix at /home/steven/.julia/v0.3/TextAnalysis/src/dtm.jl:31
in include at ./boot.jl:245
in include_from_node1 at loading.jl:128
in process_options at ./client.jl:285
in _start at ./client.jl:354
in _start_3B_1716 at /usr/bin/../lib/x86_64-linux-gnu/julia/sys.so
while loading /media/sf_Documents/julia_workspace/VMRecommender/src/testTextAnalysis.jl, in expression starting on line 17
Is there a reason why remove_patterns!
is not exported?
Hi,
I am getting this error when attempting to create a DocumentTermMatrix from a corpus. This can be replicated with the snippet below.
docs = {}
push!(docs, StringDocument("abc def xyz"))
push!(docs, StringDocument("xyz 123"))
crps = Corpus(docs)
DocumentTermMatrix(crps)
I'm using the latest 0.3; is there anything I'm missing here?
I am trying functions like remove_punctuation!(), remove_numbers!(sd), remove_pronouns!(sd), stem!(sd), etc., on version 0.3, which generates the following error:
julia> stem!(sd)
ERROR: error compiling Stemmer: error compiling Stemmer: could not load module /home/abhijith/.julia/TextAnalysis/deps/usr/lib/libstemmer.so: /home/abhijith/.julia/TextAnalysis/deps/usr/lib/libstemmer.so: cannot open shared object file: No such file or directory
in stemmer_for_document at /home/abhijith/.julia/TextAnalysis.jl/src/stemmer.jl:76
in stem! at /home/abhijith/.julia/TextAnalysis.jl/src/stemmer.jl:80
I am still a bit new to Julia so perhaps I am missing something basic here. When I try to create a Corpus following the example in the documentation (after correcting for the change in syntax) using:
Corpus([StringDocument("Document 1"), StringDocument("Document 2")])
I get the error:
MethodError: Cannot convert
an object of type Array{TextAnalysis.StringDocument,1} to an object of type TextAnalysis.Corpus
I am using Julia 0.5 and the latest tag version of TextAnalysis
I am currently using the version on http://pkg.julialang.org, which is at commit f022c37. There are several new features in the current commit (prepare, and an implementation of remove_words without regexes) that I, and certainly others, would like to use.
The ngram document constructor for the dictionary of n-grams does not work correctly in Julia 0.5. This appears to be due to the change in Dict-creation syntax rather than anything wrong with the classes themselves.
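For reference, the syntax change in question: the pre-0.4 bracket form for Dict literals was removed, so any constructor still using it fails on 0.5. This is my guess at the root cause, not a confirmed diagnosis:

```julia
# Old (removed): ngrams = ["to be" => 1, "be or" => 1]
# Current syntax:
ngrams = Dict("to be" => 1, "be or" => 1)
```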
PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3). The results of this script are used to generate a package listing enhanced with testing results.
'Tests fail, but package loads.' means that PackageEvaluator found the tests for your package, executed them, and they didn't pass. However, trying to load your package with using worked.
'Package doesn't load.' means that PackageEvaluator did not find tests for your package. Additionally, trying to load your package with using failed.
This issue was filed because your testing status became worse. No additional issues will be filed if your package remains in this state, and no issue will be filed if it improves. If you'd like to opt-out of these status-change messages, reply to this message saying you'd like to and @IainNZ will add an exception. If you'd like to discuss PackageEvaluator.jl please file an issue at the repository. For example, your package may be untestable on the test machine due to a dependency - an exception can be added.
Test log:
INFO: Installing ArrayViews v0.4.6
INFO: Installing DataArrays v0.1.12
INFO: Installing DataFrames v0.5.6
INFO: Installing GZip v0.2.13
INFO: Installing Languages v0.0.1
INFO: Installing Reexport v0.0.1
INFO: Installing SortingAlgorithms v0.0.1
INFO: Installing StatsBase v0.5.3
INFO: Installing TextAnalysis v0.0.1
INFO: Package database updated
Warning: could not import Sort.sortby into DataFrames
Warning: could not import Sort.sortby! into DataFrames
ERROR: repl_show not defined
in include at ./boot.jl:245
in include_from_node1 at ./loading.jl:128
in include at ./boot.jl:245
in include_from_node1 at ./loading.jl:128
in reload_path at loading.jl:152
in _require at loading.jl:67
in require at loading.jl:54
in include at ./boot.jl:245
in include_from_node1 at ./loading.jl:128
in reload_path at loading.jl:152
in _require at loading.jl:67
in require at loading.jl:51
in include at ./boot.jl:245
in include_from_node1 at loading.jl:128
in process_options at ./client.jl:285
in _start at ./client.jl:354
while loading /home/idunning/pkgtest/.julia/v0.3/DataFrames/src/dataframe/reshape.jl, in expression starting on line 163
while loading /home/idunning/pkgtest/.julia/v0.3/DataFrames/src/DataFrames.jl, in expression starting on line 110
while loading /home/idunning/pkgtest/.julia/v0.3/TextAnalysis/src/TextAnalysis.jl, in expression starting on line 1
while loading /home/idunning/pkgtest/.julia/v0.3/TextAnalysis/testusing.jl, in expression starting on line 1
INFO: Package database updated
Note this is possibly due to removal of deprecated functions in Julia 0.3-rc1: JuliaLang/julia#7609
Would be nice to provide word2vec support within Julia. IMO its natural home would be here.
One way, and hopefully both the easiest and fastest, would be to interface with gensim
via PyCall rather than to wrap the original C lib.
https://github.com/JuliaStats/MultivariateStats.jl replaces https://github.com/JuliaStats/DimensionalityReduction.jl and the Extended Usage example should be corrected, as it no longer works on newer versions of Julia.
Hello,
I am fairly new but very enthusiastic about text mining in Julia. I have managed to compute some of the basic metrics from the introductory documentation; however, I am running into lots of error messages now that I am trying to work with corpora. Every time I try to standardize or run other commands I get this message:
MethodError(getindex,(A Corpus,1))
It's probably something straightforward, but documentation is scarce. Any hint would be really appreciated.
It seems likely that the CharString
constructor will be changing in Julia 0.3 (JuliaLang/julia#7016), but it looks like there is an easy fix to make your code compatible with both Julia 0.2 and Julia 0.3. Just change:
return utf8(CharString(r[1:i]))
to use utf8(r[1:i])
. The intermediate construction of a CharString
is not necessary even in Julia 0.2.
Do you plan to add a GROBID interface?
Are you aware of a Julia package that works with GROBID?
Some words are not stemmed properly. Probably a libstemmer issue, but that repo doesn't seem to be active, so I'm posting here :-)
julia> sm = TextAnalysis.stemmer_for_document(StringDocument("hello"))
Stemmer algorithm:english encoding:UTF_8
julia> stem(sm, "coming")
"come"
julia> stem(sm, "coding")
"code"
julia> stem(sm, "providing")
"provid"
julia> stem(sm, "improvising")
"improvis"
julia> stem(sm, "pursuing")
"pursu"
I am working with a corpus of 100k+ documents, so the number of features in the lexicon is extremely high, and I'm running into memory issues and the like. I know that with scikit-learn's TfidfVectorizer and similar approaches you can limit the number of features so that you only deal with the top N features when working with the dtm and tf_idf matrices. Is there some way to do that with this package, or are there plans to add such a feature?
Thanks!
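Absent built-in support, a max-features cut can be done by trimming the lexicon before building the dtm; a sketch over a term => count Dict (the names are illustrative, not the package's API):

```julia
# Keep only the n most frequent terms from a lexicon.
function top_terms(lexicon::Dict{String,Int}, n::Int)
    kv = sort!(collect(lexicon); by = last, rev = true)
    Set(first.(kv[1:min(n, length(kv))]))
end
```

The resulting Set could then be used to filter tokens before computing the dtm and tf_idf matrices.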
Hey,
When I tried adding this package to Julia, I got an error with the connection request:
2015-09-18 11:54:25-- http://snowball.tartarus.org/dist/snowball_code.tgz Resolving snowball.tartarus.org (snowball.tartarus.org)... 80.252.125.10 Connecting to snowball.tartarus.org (snowball.tartarus.org)|80.252.125.10|:80... connected. HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
So I downloaded the .tar.gz file, extracted its contents, and placed them in .julia/v0.03/TextAnalysis/deps/download. Then I tried to build using Pkg.build(), but I am still getting the same error. Please help.
@johnmyleswhite this one please as well?
After JuliaLang/julia#19449, soon to be merged for Julia 0.6, you will no longer be able to access the raw bytes of a string via string.data
; instead, do Vector{UInt8}(string)
. For length(string.data)
, just do sizeof(string)
. For example, this affects:
using TextAnalysis
model = TextAnalysis.SentimentAnalyzer()
model(TextAnalysis.Document("some sense and some nonSense"))
leads to
ERROR: KeyError: key "nonSense" not found
I am not familiar with sentiment analysis, but would it be possible to simply ignore words that do not have a value assigned?
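A tolerant lookup that skips out-of-vocabulary words is straightforward with get; this is an illustrative sketch, not the analyzer's actual scoring code:

```julia
# Sum sentiment values, treating unknown (or differently cased) words
# as neutral instead of throwing a KeyError.
score(lexicon::Dict{String,Float64}, words) =
    sum(get(lexicon, lowercase(w), 0.0) for w in words)
```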
Hi,
when using the package I receive these error messages and dozen of warnings:
using TextAnalysis
....................................
ERROR: LoadError: LoadError: UndefVarError: SimpleVector not defined.
ERROR: LoadError: LoadError: Failed to precompile BSON [fbb218c0-5317-5bc6-957e-2ee96dd4b1f0] to C:\Users\emirzayev\.julia\compiled\v0.7\BSON\3tVCZ.ji.
ERROR: Failed to precompile TextAnalysis [a2db99b7-8b79-58f8-94bf-bbc811eef33d] to C:\Users\emirzayev\.julia\compiled\v0.7\TextAnalysis\5Mwet.ji.
Julia version:
Version 0.7.0 (2018-08-08 06:46 UTC)
Install method:
Downloading exe file
Hi,
yesterday I tried to install the package and it failed because the website snowball.tartarus.org was down. The admin there kindly fixed the issue but pointed out that the version there is outdated and the follow-up project at http://snowballstem.org should be used.
Cheers.