
Comments (9)

bararchy commented on May 31, 2024

Btw, it's really easy to recreate the issue:

require 'classifier-reborn'
ai_overlord = ClassifierReborn::Bayes.new('normal_activity','suspicious_activity')
random = File.read("/dev/urandom", 2048)
## Testing encoding for random
random.encoding 
=> #<Encoding:ASCII-8BIT>

ai_overlord.train_normal_activity(random)
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)
from /home/unshadow/.gem/ruby/2.3.0/gems/classifier-reborn-2.0.4/lib/classifier-reborn/bayes.rb:178:in

## Fixing 

random_new = random.encode(Encoding.find('UTF-8'), invalid: :replace, undef: :replace, replace: '')
ai_overlord.train_normal_activity(random_new)
=> ["Uy\v:\u000E|j$J8^y_%\u0012='4+\v(h(\u0010F,C\b\u0017\u001DqL]-\u007F;~H^\u00065F vE>?@`\v)6s\n65`\u0010x#:\u001C\e<\u0010$\u0005\u0005.\u0006q\a>w,*m~GV`d SYe!|l[\u001Db\f\"\u0011C(#z[\u001D8*P\u0010&m\u00144\f\u0015*\t\u001AIj\u001DV.^y\u0002\u001D-\\}pNav\r\u0015\u0000lrmg{t\u001Dc=-\u001CV{q\u0005-6P\u001DoOT\u0004|F3h\"\\\u001DWj7%rtcJ\u001C~9\u0012\u001CclZg>L\e\v+/\u0003\t\u0014#'<4\ty\u0001Wyh\u0000^N!Kyg`\tNyhi\u00004|\u007F\u007F\u0017D#o&Bx\u0004 '/W\tq\u007F\u0015MOb>9T*Y-+]=f.\u0018N]TM4E\aV\u0019k\u007F?1yFIT?!}\u0018^]JVM(\rp\t\u0002HuzNLOc\u0010o\rbZ\u0001\u001CW]yQu;^o \u0016cFug,sF;Te]W(;$+SreO\u0017t4~}_\u0006sp$\fCH{+)DG\n9\u0011\u001A\u007Fo\u001Am\u0011])PzB^{g\u00024zh\u0013E\u001ED's^?n!xz\t?\u0018\u0018\u0018Z^8M+)e\u0011ID\u0013<\u0002n8*\\\u0014\u0004\tK\u000E`\u0005q\aS=\u0000\n\a\u0015\u0004}u\u0018 \u0010(\u0011nb\u0015\u001D\u001A\u0016S\u0003\u0003|5F8V>w\u0013\u0004o]I?\b.+FeY\u0015[;\u0019lr\u0012#\u0016C\u0011_a~\u0003\u001E7\u000F\u00149\u001C\bs\\oLT\u0005wU\u0006\u000E<\u0017\u0004;~)\u001ALT\f\fs\b)50wvOGu)\au\u001A,+5'[\fD~m-\u001FJ~2Kt4rju'\u007F\u0017A1\r)A\a\f`\u0015H\u0003q*\u0016S\u0012\t&{1E\nFsN\e\u0019bI3$!yl$#0]\u0010X\u001C\u001Fx-\u007FcDKaUm\u0018en~U\b5\u0004\u0006\u00162!J\u0011P\u0018\u0005FI\u007FrX::$z\r}\u0013\r\u0000\u0016Tro\b\u0010/E\u001FL\u001AKF\u0002C\v?H?\f\u001F[.d\u0012\u0004p^q\u0019O\u0004j\u000F],1\u001F\v9v\u0000qceg#\t\u0019\u0018T\e g]3.N8\u0017k-!8,;vs(`/>w$8qu%\b\u000F\u001C\u0017\u0004\u0014o[EvY\u000EP|\\(Uj\u001CWxfA\u001D5\u0014*.\u000F*\eWr\u001Fqt\u00018Fw(}g\t\b@]a w9-G\u001CNy|p^,\u000E\u001EQ\a\\}3\v\u0016JA~$0\u0011\u0003\u0012%{7h/.a;\u000219]q\eg\u0011\nN|\vP <,\bGJVO9P?f=\vD87;Ml=9\u001A?a]>\u0001+z\u0003`lyLk\u0002c]`#}\u00000S\u001EV/@YYw`\u0005\u001E\a\u0012Ybo:\u0016YF3^/bm<E\t\u0016aJ\u001A\u0014\u0016fI3jqDB\vi!\f**\u0006qk@}pLT?J<t7c3?1poI0qI;\u0000%KdGWa-A@}T\u000F\u0014\u0004\u001D6\fx9\tw7km\v:M\u0001\n</>P>*<)@tq<Ph,cAx}#r'd*\u0011\aJ9\u0013Q\"\u001F1\u001F\u0018 
\u001EceY)\bZKCQ8\t$\u001D7]/\av\u001E\u0010:\nk\u0013f;xCaXcZbcMVfj9!\f@\u0004|Ek0<l"]

# Works without issues

from classifier-reborn.

Ch4s3 commented on May 31, 2024

Interesting. Maybe? If you want to put in a PR, I'll take a look.


bararchy commented on May 31, 2024

It seems that calling .encode(Encoding.find('UTF-8'), invalid: :replace, undef: :replace, replace: '') on the data before passing it to the train method makes the problem go away. I think we need to discuss whether adding this to the gem itself, for all trained data, is the right approach.
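
The workaround described here can be wrapped in a small helper so the cleanup happens in one place before training. The following is a minimal sketch, not part of classifier-reborn: the helper name `to_utf8_text` is hypothetical, and dropping undecodable bytes (rather than substituting a placeholder character) is an assumption carried over from the `replace: ''` option used above. It relies only on `String#encode` and `String#scrub` from Ruby core (`scrub` needs Ruby 2.1 or later).

```ruby
# Hypothetical helper (not part of classifier-reborn): returns a valid UTF-8
# copy of +data+, dropping any bytes that cannot be represented.
def to_utf8_text(data)
  if data.encoding == Encoding::UTF_8
    # Already tagged as UTF-8: just remove invalid byte sequences.
    data.scrub('')
  else
    # Transcode (e.g. from ASCII-8BIT), dropping invalid or undefined bytes.
    data.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '')
  end
end

binary = "abc\xFF\xFEdef".b        # ASCII-8BIT, like data read from /dev/urandom
clean  = to_utf8_text(binary)
clean.encoding                     # => #<Encoding:UTF-8>
clean                              # => "abcdef"
```

The cleaned string could then be passed to the train method, for example `ai_overlord.train_normal_activity(to_utf8_text(random))`. Dropping bytes changes the data, so whether this is acceptable depends on what is being classified.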


Ch4s3 commented on May 31, 2024

Agreed. @parkr What do you think?


parkr commented on May 31, 2024

@parkr What do you think?

I think the user should be responsible for encoding the input properly. We could easily break things if we do it ourselves.


Ch4s3 commented on May 31, 2024

I agree. I think we should throw a note in the readme. I doubt we'll ever have the resources to properly maintain encoding functionality.


bararchy commented on May 31, 2024

I think a note in the README specifying that the classifier's train method only supports UTF-8 would be great. Also, maybe add a check like

fail 'Trained string must be encoded as UTF-8' unless data.encoding == Encoding::UTF_8

This may help others avoid encoding issues and make maintenance easier.
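
A guard along these lines might look like the sketch below. It is an illustration only, not code from classifier-reborn: the method name `ensure_utf8!` and the choice of `ArgumentError` are assumptions. Note that `String#encoding` returns an `Encoding` object, so the comparison is against `Encoding::UTF_8` rather than a string, and `valid_encoding?` additionally catches strings that are tagged as UTF-8 but contain invalid bytes.

```ruby
# Hypothetical guard (not part of classifier-reborn): rejects input that is
# not valid UTF-8 before it reaches the classifier.
def ensure_utf8!(data)
  unless data.encoding == Encoding::UTF_8 && data.valid_encoding?
    raise ArgumentError,
          "Trained string must be valid UTF-8 (got #{data.encoding})"
  end
  data
end

ensure_utf8!('hello'.encode(Encoding::UTF_8))  # returns the string unchanged

begin
  ensure_utf8!("\xFF\xFE".b)                   # binary (ASCII-8BIT) input
rescue ArgumentError => e
  message = e.message                          # explains which encoding was seen
end
```

Failing early like this keeps the encoding decision with the caller, which matches the view expressed earlier in the thread that users should be responsible for encoding their input.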


Ch4s3 commented on May 31, 2024

That seems reasonable.


Ch4s3 commented on May 31, 2024

I made a note in the README. I'm not sure an error is needed, so I'm closing this for now.
