
Comments (3)

jsvine avatar jsvine commented on June 10, 2024

Hi @mooseyboots, and thanks for your interest in this library. I'm having a bit of trouble, however, understanding the specifics of your inquiry. Could you provide some code, inputs, and outputs that demonstrate the issue?

from markovify.

mooseyboots avatar mooseyboots commented on June 10, 2024

here is my subclass modifying split_into_sentences():

import re
import markovify
from markovify.splitters import is_sentence_ender


class NoInitCaps(markovify.Text):
    """
    An attempt to subclass markovify.Text to allow for sentences to not begin with an initial capital letter.
    """

    def split_into_sentences(self, text):
        potential_end_pat = re.compile(
            r"".join(
                [
                    r"([-\w\.\"'’~”&\]\)]+[…(\.){1,4}\?!])",  # A word that ends with punctuation, including ellipsis, possibly separated by white space
                    r"([‘’“”'~\"\)\]]*)",  # Followed by optional quote/parens/etc
                    r"\s+(?=[-•\w‘’“”'*\|/~\",])",  # followed by whitespace. then a lookahead to the next char, which can be alphanumeric or initial punctuation
                ]
            ),
            re.U,  # U for Unicode!
        )
        dot_iter = re.finditer(potential_end_pat, text)
        end_indices = [
            (x.start() + len(x.group(1)) + len(x.group(2)))
            for x in dot_iter
            if is_sentence_ender(x.group(1))
        ]
        spans = zip([None] + end_indices, end_indices + [None])
        sentences = [text[start:end].strip() for start, end in spans]
        return sentences

    def sentence_split(self, text):
        return self.split_into_sentences(text)

a selection of input from one of my files:

    • error, which makes things swollen, gives them that look of filling out just a little more space than is theirs, so that they bump into other swollen things, seek room.

renege.

‘empty’ words, imagine!

who among us not embalmed.
walk up to wall and kick it, once, twice, there.

, lying in wait / for the neonate. 

1 incorrect 'sentence' from the sample output using my subclass:

who among us not embalmed. walk up to the rhythm of beer and coffee, on the verge of nothing here.

so the word "embalmed." is not counting as an end.

but what confused me is that if i use the subclass manually to generate a corpus, such as something like:

  from markovify import Chain, Text
  from mkv_this.noinitcaps import NoInitCaps

  text = "/PATH/TO/INPUT/scrapbook.txt"

  with open(text, "r") as t:
      txt = t.read()
      text_obj = NoInitCaps(txt) # my subclass
      corpus = text_obj.generate_corpus(txt)
      clist = list(corpus)
  with open("/PATH/TO/OUTPUT/markov-corpus-no-init-caps.txt", "w") as c:
      c.write(str(clist))

the word "embalmed." will actually be the last item in its sentence's list:

 ['renege.'], ['‘empty’', 'words,', 'imagine!'], ['who', 'among', 'us', 'not', 'embalmed.'], ['walk', 'up', 'to', 'wall', 'and', 'kick', 'it,', 'once,', 'twice,', 'there.'], [',', 'lying', 'in', 'wait', '/', 'for', 'the', 'neonate.']

which to me suggested that the regex sentence splitter was working correctly.
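
One way to repeat that check without building a model is to run the pattern on its own. The sketch below copies the regex from the subclass above and applies it to part of the sample input. `ends_sentence` is a hypothetical, simplified stand-in for `markovify.splitters.is_sentence_ender` (which may apply further checks), so treat the result as indicative only. Note that inside a character class `(\.){1,4}` is read as a set of literal characters, so words ending in a comma also come up as candidates and are only discarded by the ender check.

```python
import re

# Pattern copied from the NoInitCaps subclass above.
POTENTIAL_END = re.compile(
    r"".join(
        [
            r"([-\w\.\"'’~”&\]\)]+[…(\.){1,4}\?!])",
            r"([‘’“”'~\"\)\]]*)",
            r"\s+(?=[-•\w‘’“”'*\|/~\",])",
        ]
    ),
    re.U,
)

SAMPLE = (
    "renege.\n\n"
    "‘empty’ words, imagine!\n\n"
    "who among us not embalmed.\n"
    "walk up to wall and kick it, once, twice, there.\n\n"
    ", lying in wait / for the neonate."
)


def ends_sentence(word):
    # ASSUMPTION: simplified stand-in for markovify's is_sentence_ender;
    # it only looks at the final character of the candidate word.
    return word[-1] in ".?!…"


def split_into_sentences(text):
    # Same index arithmetic as the subclass: cut right after the
    # punctuated word plus any trailing quotes/brackets.
    end_indices = [
        m.start() + len(m.group(1)) + len(m.group(2))
        for m in POTENTIAL_END.finditer(text)
        if ends_sentence(m.group(1))
    ]
    spans = zip([None] + end_indices, end_indices + [None])
    return [text[start:end].strip() for start, end in spans]


# Every word the regex flags as a *potential* sentence end.
candidates = [m.group(1) for m in POTENTIAL_END.finditer(SAMPLE)]
sentences = split_into_sentences(SAMPLE)
for sentence in sentences:
    print(repr(sentence))
```

Under these assumptions "embalmed." is flagged as a candidate and the text is cut right after it, which agrees with the corpus listing above.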

my query is, if i want to change how markovify understands what constitutes the end of a sentence, is that all i need to do or are there other things to modify?


jsvine avatar jsvine commented on June 10, 2024

Hi @mooseyboots, and thanks for the additional details. Judging from the sample of the corpus you shared, which seems to place each sentence on a new line, the easiest solution may just be to use the already-defined markovify.NewlineText class.

And if that doesn't quite fit your use-case, you can use that subclass's definition as a perhaps-simpler starting place (i.e., swapping out the regular expression below for the regular expression of your choosing):

markovify/markovify/text.py

Lines 287 to 293 in 16b9367

class NewlineText(Text):
    """
    A (usable) example of subclassing markovify.Text. This one lets you markovify
    text where the sentences are separated by newlines instead of ". "
    """
    def sentence_split(self, text):
        return re.split(r"\s*\n\s*", text)
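
As a rough illustration of that suggestion, the sketch below applies the same `re.split` pattern to part of the sample input from earlier in the thread, outside of markovify. The helper name `split_on_newlines` and the filtering of empty pieces are additions for this example, not part of the library. In a `markovify.Text` subclass the same expression would be the body of `sentence_split`.

```python
import re

SAMPLE = (
    "renege.\n\n"
    "‘empty’ words, imagine!\n\n"
    "who among us not embalmed.\n"
    "walk up to wall and kick it, once, twice, there.\n\n"
    ", lying in wait / for the neonate. \n"
)


def split_on_newlines(text):
    # Same pattern as the NewlineText.sentence_split quoted above:
    # one or more newlines, plus any surrounding whitespace, act as the separator.
    parts = re.split(r"\s*\n\s*", text)
    # ASSUMPTION (added for this example): drop the empty string that
    # a trailing newline leaves at the end of the list.
    return [part for part in parts if part]


sentences = split_on_newlines(SAMPLE)
for sentence in sentences:
    print(repr(sentence))
```

Here capitalization and punctuation play no part in the split, so "who among us not embalmed." and the line after it end up as separate sentences simply because they sit on separate lines. Trying a different separator only means changing the pattern passed to `re.split`.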
