Hi, thanks for developing this helpful package! unnest_tokens seems

Custom tokenizer in unnest_tokens about tidytext HOT 11 CLOSED

juliasilge commented on August 20, 2024

Custom tokenizer in unnest_tokens

from tidytext.

Comments (11)

juliasilge commented on August 20, 2024

@Double-y Thanks for your interest in the package! To clarify, are you saying there are existing tokenizers for Japanese that you think it would be nice to implement as an option in the way that the tokenizers package is used, or that you think it would be good to structure unnest_tokens in a more general way so that users can supply their own tokenizers?

dgrtwo commented on August 20, 2024

@juliasilge What if we allowed users to pass a function (or a string like we already do) to the token argument? I could set that up!

dgrtwo commented on August 20, 2024

Here's one implementation (along with test cases). You can pass a custom function to token = (and any extra arguments to ...). Please try it out @juliasilge and @Double-y!
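As a rough illustration of what passing a function to token can look like: the sketch below assumes the custom tokenizer takes a character vector and returns a list of character vectors of the same length. The tokenizer split_on_hyphen and the sample data are hypothetical, made up for this example and not taken from the linked implementation.

```r
library(dplyr)
library(tidytext)

# Hypothetical custom tokenizer: takes a character vector and returns a
# list with one character vector of tokens per input element.
split_on_hyphen <- function(x) {
  strsplit(x, "-", fixed = TRUE)
}

# Made-up sample data for illustration only.
d <- tibble(id = 1:2, text = c("state-of-the-art", "well-known"))

# Pass the function itself (not a string) to the token argument.
tokens <- d %>%
  unnest_tokens(word, text, token = split_on_hyphen)

print(tokens)
```

Each token ends up on its own row in the word column, alongside the id of the row it came from.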

juliasilge commented on August 20, 2024

@dgrtwo This looks really good on an initial work through on my part, and I like the way the user interacts with the function here, the flexibility, etc. Really nice! @Double-y, let us know what you think as you try it out.

yosuke-yasuda commented on August 20, 2024

Thanks for considering this feature @juliasilge, @dgrtwo! Yes, I know a tokenizer for Japanese called RMeCab. I'll try it on 338cc6f

yosuke-yasuda commented on August 20, 2024

Yeah, it works :)

This might be a very specific matter for Japanese, but the text is tokenised with part-of-speech detection at the same time for efficiency. In the current unnest_tokens framework, though, I can't preserve the part-of-speech information, right? The output of the tokenizer is like this.

If you have a good idea for preserving this information as well, that would be amazing.

juliasilge commented on August 20, 2024

That's great that it works! Thanks for bringing this up; I think this improves flexibility/usability for the package.

In English, we have been doing part-of-speech detection separately (after unnesting) using a join with a data set that's in the package. So far this has been kept separate from the unnest_tokens function because we have been working within tidy data principles, etc. Let me/us think about that.
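A minimal sketch of that join-after-unnesting approach, assuming the data set referred to is tidytext's parts_of_speech table (columns word and pos). The sample sentence is made up, and a word that has several parts of speech in that table will produce several rows.

```r
library(dplyr)
library(tidytext)

# Made-up sample data for illustration only.
d <- tibble(id = 1, text = "the dog runs")

# Step 1: unnest into one word per row.
# Step 2: look up parts of speech with a join on the word column.
tagged <- d %>%
  unnest_tokens(word, text) %>%
  inner_join(parts_of_speech, by = "word")

print(tagged)
```

Keeping the lookup as a separate join means the tokenizing step stays one token per row, and the part-of-speech information is added like any other column.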

yosuke-yasuda commented on August 20, 2024

Yeah, for English (and probably for most alphabetic languages too), separating those steps seems to be better.

That demand will be very specific to Japanese, so I don't think it should be implemented in this package either.

I just wanted to know if there's a good way to do it. Don't worry too much about it.

Thank you!

dgrtwo commented on August 20, 2024

One suggestion for keeping parts of speech (I agree it unfortunately wouldn't fit in the tidytext package since it's a very specific need) is to use unnest manually. I haven't been able to get RMeCab to work on my computer so this is a rough guess I haven't tested:

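library(dplyr)
# RMeCabC() tokenizes one string at a time, so apply it to each row's text
RMeCabWrapper <- function(text, ...) {
    lapply(text, function(s) {
        # unlist() gives a character vector of words named by part of speech
        tokens <- unlist(RMeCab::RMeCabC(s, ...))
        tibble::tibble(pos = names(tokens), word = unname(tokens))
    })
}
d %>%
  mutate(tokens = RMeCabWrapper(text)) %>%
  tidyr::unnest(tokens)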

This (or something similar to it) should be able to create two columns: pos (with the parts of speech) and word (with the words).
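For anyone without a working MeCab installation, the same list-column pattern can be tried with a stand-in tagger. In the sketch below, fake_tagger and tag_column are hypothetical helpers invented for this example: fake_tagger only mimics the assumed shape of the RMeCabC output (a list of single words, each named with its tag) and uses made-up "short"/"long" tags rather than real parts of speech.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for RMeCab::RMeCabC(): returns a list of single
# words, each named with a made-up tag ("short" or "long").
fake_tagger <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  tags <- ifelse(nchar(words) <= 3, "short", "long")
  lapply(seq_along(words), function(i) setNames(words[i], tags[i]))
}

# Build one small data frame per row of text, with a pos and a word column.
tag_column <- function(text) {
  lapply(text, function(s) {
    tokens <- unlist(fake_tagger(s))
    tibble(pos = names(tokens), word = unname(tokens))
  })
}

# Made-up sample data for illustration only.
d <- tibble(id = 1:2, text = c("the quick fox", "jumps over"))

# Store the per-row data frames in a list column, then unnest them.
tagged <- d %>%
  mutate(tokens = tag_column(text)) %>%
  unnest(tokens)

print(tagged)
```

Swapping fake_tagger for a real tagger should leave the rest unchanged, as long as the tagger returns words named by their tags.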

yosuke-yasuda commented on August 20, 2024

@dgrtwo Thank you for the suggestion! It looks good. I'll try it.

github-actions commented on August 20, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
