Comments (11)

juliasilge avatar juliasilge commented on August 20, 2024

@Double-y Thanks for your interest in the package! To clarify, are you saying there are existing tokenizers for Japanese that you think it would be nice to implement as an option in the way that the tokenizers package is used, or that you think it would be good to structure unnest_tokens in a more general way so that users can supply their own tokenizers?

from tidytext.

dgrtwo avatar dgrtwo commented on August 20, 2024

@juliasilge What if we allowed users to pass a function (or a string like we already do) to the token argument? I could set that up!

dgrtwo avatar dgrtwo commented on August 20, 2024

Here's one implementation (along with test cases). You can pass a custom function to token = (and any extra arguments to ...). Please try it out @juliasilge and @Double-y!
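As a minimal sketch of what this usage might look like: the example assumes the custom function takes a character vector and returns a list of character vectors of the same length (one element per input string). The `split_on_hyphen` helper and the sample data are made up for illustration and are not part of the package.

```r
library(dplyr)
library(tidytext)

# Made-up sample data: one document per row
d <- tibble(id = 1:2, text = c("red-green-blue", "cyan-magenta"))

# Hypothetical tokenizer: character vector in, list of character
# vectors out (one list element per input string)
split_on_hyphen <- function(x) strsplit(x, "-", fixed = TRUE)

# Pass the function itself, rather than a string, to `token`
tokens <- d %>%
  unnest_tokens(word, text, token = split_on_hyphen)

print(tokens)
```

Each token ends up on its own row, with the `id` column repeated so the token can still be traced back to its source document.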

juliasilge avatar juliasilge commented on August 20, 2024

@dgrtwo This looks really good after my initial walkthrough, and I like the way the user interacts with the function here, the flexibility, etc. Really nice! @Double-y, let us know what you think as you try it out.

yosuke-yasuda avatar yosuke-yasuda commented on August 20, 2024

Thanks for considering this feature @juliasilge, @dgrtwo! Yes, I know of a tokenizer for Japanese called RMeCab. I'll try it on 338cc6f.

yosuke-yasuda avatar yosuke-yasuda commented on August 20, 2024

Yeah, it works :)

This might be a matter very specific to Japanese, but for efficiency the text is tokenised and tagged with parts of speech at the same time. In the current unnest_tokens framework, I can't preserve the part-of-speech information, right? The output of the tokenizer looks like this.

[Screenshot of the tokenizer output, 2016-05-17]

If you have a good idea for preserving this information as well, that would be amazing.

juliasilge avatar juliasilge commented on August 20, 2024

That's great that it works! Thanks for bringing this up; I think this improves flexibility/usability for the package.

In English, we have been doing part-of-speech detection separately (after unnesting) using a join with a data set that's in the package; so far this has been kept separate from the unnest_tokens function because we have been working within tidy data principles, etc. Let me/us think about that.

yosuke-yasuda avatar yosuke-yasuda commented on August 20, 2024

Yeah, as for English (and probably for most alphabetical languages too), separating those steps seems to be better.

That demand will be very specific to Japanese, so I don't think it should be implemented in this package either.

I just wanted to know if there's a good way to do it. Don't worry too much about it.

Thank you!

dgrtwo avatar dgrtwo commented on August 20, 2024

One suggestion for keeping parts of speech (I agree it unfortunately wouldn't fit in the tidytext package since it's a very specific need) is to use unnest manually. I haven't been able to get RMeCab to work on my computer so this is a rough guess I haven't tested:

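RMeCabWrapper <- function(text, ...) {
    # RMeCabC() takes one string at a time, so tokenize each row separately
    lapply(text, function(s) {
        ret <- unlist(RMeCab::RMeCabC(s, ...))
        # names are the parts of speech, values are the words
        tibble::tibble(pos = names(ret), word = unname(ret))
    })
}
d %>%
  dplyr::mutate(tokens = RMeCabWrapper(text)) %>%
  tidyr::unnest(tokens)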

This (or something similar to it) should be able to create two columns: pos (with the parts of speech) and word (with the words).

yosuke-yasuda avatar yosuke-yasuda commented on August 20, 2024

@dgrtwo Thank you for the suggestion! It looks good. I'll try it.

github-actions avatar github-actions commented on August 20, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
