Comments (3)
Hi @mooseyboots, and thanks for your interest in this library. I'm having a bit of trouble, however, understanding the specifics of your inquiry. Could you provide some code, inputs, and outputs that demonstrate the issue?
from markovify.
here is my subclass modifying split_into_sentences():
import re

import markovify
from markovify.splitters import is_sentence_ender


class NoInitCaps(markovify.Text):
    """
    An attempt to subclass markovify.Text to allow for sentences to not begin with an initial capital letter.
    """

    def split_into_sentences(self, text):
        potential_end_pat = re.compile(
            r"".join(
                [
                    r"([-\w\.\"'’~”&\]\)]+[…(\.){1,4}\?!])",  # A word that ends with punctuation, including ellipsis, possibly separated by white space
                    r"([‘’“”'~\"\)\]]*)",  # Followed by optional quote/parens/etc
                    r"\s+(?=[-•\w‘’“”'*\|/~\",])",  # followed by whitespace. then a lookahead to the next char, which can be alphanumeric or initial punctuation
                ]
            ),
            re.U,  # U for Unicode!
        )
        dot_iter = re.finditer(potential_end_pat, text)
        end_indices = [
            (x.start() + len(x.group(1)) + len(x.group(2)))
            for x in dot_iter
            if is_sentence_ender(x.group(1))
        ]
        spans = zip([None] + end_indices, end_indices + [None])
        sentences = [text[start:end].strip() for start, end in spans]
        return sentences

    def sentence_split(self, text):
        return self.split_into_sentences(text)
a selection of input from one of my files:
• error, which makes things swollen, gives them that look of filling out just a little more space than is theirs, so that they bump into other swollen things, seek room.
renege.
‘empty’ words, imagine!
who among us not embalmed.
walk up to wall and kick it, once, twice, there.
, lying in wait / for the neonate.
1 incorrect 'sentence' from the sample output using my subclass:
who among us not embalmed. walk up to the rhythm of beer and coffee, on the verge of nothing here.
so the word "embalmed." is not counting as an end.
but what confused me is that if i use the subclass manually to generate a corpus, such as something like:
from markovify import Chain, Text
from mkv_this.noinitcaps import NoInitCaps

text = "/PATH/TO/INPUT/scrapbook.txt"
with open(text, "r") as t:
    txt = t.read()

text_obj = NoInitCaps(txt)  # my subclass
corpus = text_obj.generate_corpus(txt)
clist = list(corpus)
with open("/PATH/TO/OUTPUT/markov-corpus-no-init-caps.txt", "w") as c:
    c.write(str(clist))
the word "embalmed." will actually be the last item in its sentence's list:
['renege.'], ['‘empty’', 'words,', 'imagine!'], ['who', 'among', 'us', 'not', 'embalmed.'], ['walk', 'up', 'to', 'wall', 'and', 'kick', 'it,', 'once,', 'twice,', 'there.'], [',', 'lying', 'in', 'wait', '/', 'for', 'the', 'neonate.']
which to me suggested that the regex sentence splitter was working correctly.
my query is, if i want to change how markovify understands what constitutes the end of a sentence, is that all i need to do or are there other things to modify?
Hi @mooseyboots, and thanks for the additional details. Judging from the sample of the corpus you shared, which seems to place each sentence on a new line, the easiest solution may just be to use the already-defined markovify.NewlineText class.
And if that doesn't quite fit your use-case, you can use that subclass's definition as a perhaps-simpler starting place (i.e., swapping out the regular expression below for the regular expression of your choosing):
Lines 287 to 293 in 16b9367
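As a rough sketch of that approach (not a copy of the referenced lines), a newline-based splitter could look like the following. It assumes the input really does hold one sentence per line; the names `NEWLINE_PAT` and `split_on_newlines` are hypothetical and not part of markovify.

```python
import re

# Hypothetical pattern: a newline plus any surrounding whitespace marks a sentence break.
# Swap this for whatever regular expression matches your own sentence boundaries.
NEWLINE_PAT = re.compile(r"\s*\n\s*")


def split_on_newlines(text):
    """Return the non-empty lines of `text`, each treated as one sentence."""
    return [s for s in NEWLINE_PAT.split(text.strip()) if s]


sample = (
    "renege.\n"
    "‘empty’ words, imagine!\n"
    "who among us not embalmed.\n"
    "walk up to wall and kick it, once, twice, there.\n"
)
sentences = split_on_newlines(sample)
for sentence in sentences:
    print(sentence)
```

In a markovify.Text subclass, a sentence_split(self, text) override that returns split_on_newlines(text) should be enough to plug this in, since sentence_split is the method the corpus-building step calls to obtain sentences.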