hi and thx for yr great library. i made a cli program to run it on m

subclassing markovify.Text to allow for different types of 'sentences'

mooseyboots commented on June 10, 2024

subclassing markovify.Text to allow for different types of 'sentences'

Comments (3)

jsvine commented on June 10, 2024

Hi @mooseyboots, and thanks for your interest in this library. I'm having a bit of trouble, however, understanding the specifics of your inquiry. Could you provide some code, inputs, and outputs that demonstrate the issue?

mooseyboots commented on June 10, 2024

here is my subclass modifying split_into_sentences():

import re
import markovify
from markovify.splitters import is_sentence_ender


class NoInitCaps(markovify.Text):
    """
    An attempt to subclass markovify.Text to allow for sentences to not begin with an initial capital letter.
    """

    def split_into_sentences(self, text):
        potential_end_pat = re.compile(
            r"".join(
                [
                    r"([-\w\.\"'’~”&\]\)]+[…(\.){1,4}\?!])",  # A word that ends with punctuation, including ellipsis, possibly separated by white space
                    r"([‘’“”'~\"\)\]]*)",  # Followed by optional quote/parens/etc
                    r"\s+(?=[-•\w‘’“”'*\|/~\",])",  # followed by whitespace. then a lookahead to the next char, which can be alphanumeric or initial punctuation
                ]
            ),
            re.U,  # U for Unicode!
        )
        dot_iter = re.finditer(potential_end_pat, text)
        end_indices = [
            (x.start() + len(x.group(1)) + len(x.group(2)))
            for x in dot_iter
            if is_sentence_ender(x.group(1))
        ]
        spans = zip([None] + end_indices, end_indices + [None])
        sentences = [text[start:end].strip() for start, end in spans]
        return sentences

    def sentence_split(self, text):
        return self.split_into_sentences(text)
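
A side note on the first pattern fragment above, offered as an observation rather than a diagnosis of the reported problem: inside square brackets, `(\.){1,4}` is not a group with a quantifier but a set of literal characters. The sketch below uses only the standard `re` module; `as_intended` is a hypothetical rewrite based on the comment in the posted code, not something taken from markovify.

```python
import re

# The trailing-punctuation part of the pattern, exactly as posted.
as_posted = re.compile(r"[…(\.){1,4}\?!]")

# Inside [...] the characters ( ) { } 1 , 4 are literals, so each of these
# single characters is accepted as "ending punctuation".
literal_hits = [ch for ch in "(){}1,4" if as_posted.fullmatch(ch)]

# Hypothetical alternative spelling out the stated intent:
# an ellipsis character, one to four dots, or a single ? or !
as_intended = re.compile(r"(?:…|\.{1,4}|[?!])")

print(literal_hits)
print(bool(as_posted.fullmatch("...")), bool(as_intended.fullmatch("...")))
```

Because the preceding character class in the posted pattern already includes `.`, words ending in several dots may still be matched overall, so this may or may not matter for a given corpus.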

a selection of input from one of my files:

    • error, which makes things swollen, gives them that look of filling out just a little more space than is theirs, so that they bump into other swollen things, seek room.

renege.

‘empty’ words, imagine!

who among us not embalmed.
walk up to wall and kick it, once, twice, there.

, lying in wait / for the neonate.

1 incorrect 'sentence' from the sample output using my subclass:

who among us not embalmed. walk up to the rhythm of beer and coffee, on the verge of nothing here.

so the word "embalmed." is not counting as an end.

but what confused me is that if i use the subclass manually to generate a corpus, such as something like:

  from markovify import Chain, Text
  from mkv_this.noinitcaps import NoInitCaps

  text = "/PATH/TO/INPUT/scrapbook.txt"

  with open(text, "r") as t:
      txt = t.read()
      text_obj = NoInitCaps(txt) # my subclass
      corpus = text_obj.generate_corpus(txt)
      clist = list(corpus)
  with open("/PATH/TO/OUTPUT/markov-corpus-no-init-caps.txt", "w") as c:
      c.write(str(clist))

the word "embalmed." will actually be the last item in its sentence's list:

 ['renege.'], ['‘empty’', 'words,', 'imagine!'], ['who', 'among', 'us', 'not', 'embalmed.'], ['walk', 'up', 'to', 'wall', 'and', 'kick', 'it,', 'once,', 'twice,', 'there.'], [',', 'lying', 'in', 'wait', '/', 'for', 'the', 'neonate.']

which to me suggested that the regex sentence splitter was working correctly.
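
One way to double-check that reading is to list every pair of adjacent words that occurs inside a single sentence of the corpus and see whether "embalmed." is ever followed by "walk". This is a minimal sketch using the excerpt above; `adjacent_pairs` is a hypothetical helper, not part of markovify.

```python
# Corpus excerpt copied from the thread: one list of words per sentence.
corpus = [
    ["renege."],
    ["‘empty’", "words,", "imagine!"],
    ["who", "among", "us", "not", "embalmed."],
    ["walk", "up", "to", "wall", "and", "kick", "it,", "once,", "twice,", "there."],
    [",", "lying", "in", "wait", "/", "for", "the", "neonate."],
]


def adjacent_pairs(sentences):
    """Collect every (word, next_word) pair that occurs inside one sentence."""
    pairs = set()
    for words in sentences:
        pairs.update(zip(words, words[1:]))
    return pairs


pairs = adjacent_pairs(corpus)

# True would mean some sentence in the corpus runs straight from one into the other.
crosses_boundary = ("embalmed.", "walk") in pairs
print(crosses_boundary)
```

If the pair is absent here but still shows up in generated output, it may be worth checking whether the model that produced the output was built from this same corpus and with this same subclass.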

my query is, if i want to change how markovify understands what constitutes the end of a sentence, is that all i need to do or are there other things to modify?

jsvine commented on June 10, 2024

Hi @mooseyboots, and thanks for the additional details. Judging from the sample of the corpus you shared, which seems to place each sentence on a new line, the easiest solution may just be to use the already-defined markovify.NewlineText class.

And if that doesn't quite fit your use-case, you can use that subclass's definition as a perhaps-simpler starting place (i.e., swapping out the regular expression below for the regular expression of your choosing):

markovify/markovify/text.py

Lines 287 to 293 in 16b9367

class NewlineText(Text):
    """
    A (usable) example of subclassing markovify.Text. This one lets you markovify
    text where the sentences are separated by newlines instead of ". "
    """

    def sentence_split(self, text):
        return re.split(r"\s*\n\s*", text)
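
To see what that regular expression does with input like the sample posted earlier in the thread, here is a small sketch using only the standard `re` module. It applies the same expression directly rather than going through markovify, and the `sample` string is an excerpt retyped from the thread.

```python
import re

# Excerpt of the input posted earlier in the thread.
sample = (
    "renege.\n"
    "\n"
    "‘empty’ words, imagine!\n"
    "\n"
    "who among us not embalmed.\n"
    "walk up to wall and kick it, once, twice, there.\n"
    "\n"
    ", lying in wait / for the neonate."
)

# Same expression as in NewlineText.sentence_split: any run of whitespace
# that contains at least one newline separates two sentences.
sentences = re.split(r"\s*\n\s*", sample)

for sentence in sentences:
    print(repr(sentence))
```

With this split, a single newline is enough to separate "who among us not embalmed." from the line that follows it, and blank lines do not produce empty sentences in the middle of the text.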

