kblixt / subcleaner

removes ads from subtitle files cleanly.

Python 100.00%

subcleaner's Issues

Issue with regex (i think)

Hello, I have a problem using regex properly; hopefully you can help me out.

First, I tried everything on my Windows machine, and during a couple of dry runs it removed lots of Dutch ads (I added these manually as regex8).

Then I transferred the same files to my Synology NAS (Linux), gave them all the permissions they needed, and added the script to Bazarr.

Then I watched Portainer's logs for Bazarr. They said the script ran successfully but removed 0 blocks. When I looked in some files the ads were still there, so I copied those .srt files back to my Windows machine, ran the same script against them, and it removed those lines.

I have been away, so I can't reproduce this exactly anymore. I have also been messing around with the logs, because they still contained old entries that were not from the Linux machine, so I removed the logs.

I think the best thing I can do is show my global.conf file.

[PURGE_REGEX]

regex1: ([^Ã]|^)©|™|tvsubtitle|\b(YTS|YIFY)\b|opensub|sub(scene|rip)|podnapisi|addic7ed|Camikaze
regex2: bozxphd|sazu489|psagmeno|normita|anoxmous|9unshofl|BLACKdoor|titlovi|Danishbits|hound\.org|hunddawgs
regex3: jodix|LESAIGNEUR|HighCode|explosiveskull|GoldenBeard|nessundorma|Fingal61|dawaith|MoSub|srjanapala
regex4: FilthyRichFutures|celebritysex|shareuniversity|AmericasCardroom|saveanilluminati|MCH2022|ALLIN1BOX
regex5: admitme|argenteam
regex6: \.(tv|tk|xyz|io|sex|porn|xxx|link)\b|https?[:\.\/\\ ]
regex7: (Someone(\b.\b)?needs(\b.\b)?to(\b.\b)?stop(\b.\b)?Clearway(\b.\b)?Law)|(Public(\b.\b)?shouldn't(\b.\b)?leave(\b.\b)?reviews(\b.\b)?for(\b.\b)?lawyers) 
regex8: bierdopje|nlsubs|subtitles searcher|ondertiteling:|verbetering:|gedownload:|vertaling & sync:|vertaling:|== sync|==sync|ondertiteld door:|sync:|synced:|sync and corrections by|sync &|aangeboden door:|ondertiteling swell|captioning sponsored by|rip en sync
regex9: BluRay Rip:|Bdzzld

I might have made some mistakes here, as I have only just begun learning regex. I also noticed that it wanted to delete a normal line in Dutch:

Een week voor de noodtoestand in Californië

That is just a normal line in the .srt, but I suspect the issue comes from my own regex8.

In short: I started on Windows with good results, then added it to Bazarr on Linux, where it seems to run but does not remove all the blocks I configured manually. I have been away since, and I haven't had a chance to find a good sample to post.

I think the best course of action is to fix the regex line first and go from there.

I know this isn't a very well-written issue report, and I would have liked it to be cleaner, but it's at least a start. I'll provide anything you need.
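One way to narrow down which entry is responsible for an unexpected match is to test each pattern of the [PURGE_REGEX] section against the suspicious line separately. The sketch below is only an illustration: `matching_keys` and `SAMPLE_CONF` are made-up names, the sample config is a shortened stand-in for the real global.conf, and case-insensitive matching is an assumption about how the patterns are applied.

```python
import configparser
import re

# Shortened, hypothetical stand-in for the [PURGE_REGEX] section shown above.
SAMPLE_CONF = """
[PURGE_REGEX]
regex1: opensub|addic7ed
regex2: ondertiteling:|vertaling:|sync:
"""

def matching_keys(conf_text, line):
    """Return the config keys whose pattern matches the given subtitle line."""
    # interpolation=None keeps any '%' inside a regex from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return [
        key
        for key, pattern in parser["PURGE_REGEX"].items()
        # Assumption: patterns are applied case-insensitively.
        if re.search(pattern, line, flags=re.IGNORECASE)
    ]

print(matching_keys(SAMPLE_CONF, "Vertaling: iemand"))
print(matching_keys(SAMPLE_CONF, "Een week voor de noodtoestand in Californië"))
```

Running the same check with the full config on both the Windows and the Linux machine would show whether the two environments disagree about a particular pattern.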

Bug Report: Subtitle Cleaning Script Timing Issue

Summary:

The subtitle cleaning script incorrectly adjusts the end time of subtitle lines when two consecutive lines have identical text but different timings. This results in the first subtitle line’s end time being changed to match the end time of the subsequent line.

Steps to Reproduce:

Run the subtitle cleaning script with a subtitle file in SRT format that contains two lines with identical text but different timings.
Example:

Original Subtitle File:

1
00:00:01,000 --> 00:00:03,000
Hello!!!

2
00:00:08,000 --> 00:00:10,000
Hello!!!

After Running the Script:

1
00:00:01,000 --> 00:00:10,000
Hello!!!

2
00:00:08,000 --> 00:00:10,000
Hello!!!

Expected Behavior:

The subtitle lines should retain their original timing:
Line 1: 00:00:01,000 --> 00:00:03,000
Line 2: 00:00:08,000 --> 00:00:10,000

Actual Behavior:

The script incorrectly changes the end time of the first subtitle line to match the end time of the second subtitle line.
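If the end-time change comes from a step that merges repeated text (an assumption; the report does not say which part of the script is responsible), a guard on the time gap would keep distant duplicates untouched. The following is a minimal sketch with hypothetical names (`Block`, `merge_duplicates`, `max_gap`), not the script's actual code.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class Block:
    start: timedelta
    end: timedelta
    text: str

def merge_duplicates(blocks, max_gap=timedelta(milliseconds=100)):
    """Merge consecutive blocks with identical text only when they nearly touch."""
    merged = []
    for block in blocks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.text == block.text
            and block.start - prev.end <= max_gap
        ):
            # Back-to-back duplicate: extend the previous block instead of keeping both.
            prev.end = max(prev.end, block.end)
        else:
            # Different text or a real pause in between: keep the block unchanged.
            merged.append(Block(block.start, block.end, block.text))
    return merged

blocks = [
    Block(timedelta(seconds=1), timedelta(seconds=3), "Hello!!!"),
    Block(timedelta(seconds=8), timedelta(seconds=10), "Hello!!!"),
]
print(merge_duplicates(blocks))
```

With the example from this report, the five-second gap between the two "Hello!!!" lines exceeds `max_gap`, so both blocks keep their original timings.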

Inconsistent removal of ads with same text, even within same file

Came across an ad that is making it through somehow. It does get removed sometimes, but very inconsistently.

2022-12-05 07:43:43: SUBTITLE: "/movies/Amityville Horror, The (2005)/The Amityville Horror (2005) [Bluray-720p].en.srt"
2022-12-05 07:43:43:     [INFO]: Didn't run language detection.
2022-12-05 07:43:43:     [INFO]: Removed 2 subtitle blocks:
2022-12-05 07:43:43:             [---------Removed Blocks----------]
2022-12-05 07:43:43:             733
2022-12-05 07:43:43:             01:22:43,006 --> 01:28:32,805
2022-12-05 07:43:43:             Subtitles by ARAVIND B
2022-12-05 07:43:43:             [[email protected]]
2022-12-05 07:43:43: 
2022-12-05 07:43:43:             734
2022-12-05 07:43:43:             01:28:33,305 --> 01:28:39,479
2022-12-05 07:43:43:             Shop this show's fashion on LookLive.com
2022-12-05 07:43:43:             [---------------------------------]
2022-12-05 07:43:43:     [WARNING]: Potential ads in 1 subtitle blocks, please verify:
2022-12-05 07:43:43:                [---------Warning Blocks----------]
2022-12-05 07:43:43:                1
2022-12-05 07:43:43:                00:00:06,000 --> 00:00:12,073
2022-12-05 07:43:43:                Shop this show's fashion on LookLive.com
2022-12-05 07:43:43:                [---------------------------------]
2022-12-05 07:43:43:     [INFO] To remove all these blocks use: 
2022-12-05 07:43:43: subcleaner '/movies/Amityville Horror, The (2005)/The Amityville Horror (2005) [Bluray-720p].en.srt' -d 1

One gets removed, the other doesn't, even though they appear to be exactly the same.

support a --backup option

Would be good if it supported an option to make a backup copy of the file that it processed. I'm running the script against thousands of files, and so if/when I find something that it removed that it shouldn't have, it'd be nice to have a backup copy to restore.
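Until such an option exists, a small wrapper could copy each file before it is cleaned. This is only a sketch of the idea: `backup_subtitle` and the `.bak` suffix are assumptions for illustration, not part of subcleaner.

```python
import shutil
from pathlib import Path

def backup_subtitle(path):
    """Copy e.g. 'movie.en.srt' to 'movie.en.srt.bak' and return the backup path."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    # copy2 also preserves timestamps, which helps when comparing files later.
    shutil.copy2(path, backup)
    return backup
```

Calling `backup_subtitle(file)` right before a file is processed leaves a copy next to the original that can be restored if a block was removed by mistake.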

Found a loophole that can be exploited

By adding this to the subtitle file, SubCleaner completely skips cleaning the file.

633
00:50:37,826 --> 00:50:48,248
<font color="#ec14bd">Sync & corrections by honeybunny</font>
<font color="#ec14bd">www.addic7ed.com</font>

999900:00:0,500 --> 00:00:2,00<font color="#ffff00" size=14>www.tvsubtitles.net</font>

9999
00:00:0,500 --> 00:00:2,00
<font color="#ffff00" size=14>www.tvsubtitles.net</font>

Game of Thrones - 5x01 - The Wars to Come.1080i.HDTV.RARBG.en.txt

Feature: Package as a library

I would like to integrate this cleaning into one of my own projects. To do so, it would be beneficial to package and release this tool onto PyPi and provide a standardized API for other tools to clean/check subtitles. The command line entry point can still be maintained using entry points, though the invocation may need to change for users, meaning this could be considered a breaking change.

".." breaks script

Hey, first off thanks for making this script.

I am processing a large number of srt files and I got several errors. The common thing with all the blocks that cause errors is that they start and end with "..", like so:

387
00:23:08,086 --> 00:23:10,680

..but the way he
describes me...

This gives me the following error:

➜  subcleaner git:(master) python3 ./subcleaner.py --dry-run --library ~/Downloads/Star\ Trek\ -\ Voyager\ -\ Subtitles -r ~/Downloads/Star\ Trek\ -\ Voyager\ -\ Subtitles -e
   ERROR: subcleaner was unable to decode the file. reason:
   ERROR: Parsing error at block ..but the way he in file /Users/alan/Downloads/Star Trek - Voyager - Subtitles/Star_Trek_Voyager - season 7.en/Star Trek Voyager - 7x17 - Workforce  Part 2.DVD.en.srt.

False Positives

Just wanted to share a few false positives that have come up, and how they might be captured in the regex (just using global and english regex configs).

This came from Nightcrawler; I believe they are radio call signs.

      |     [---------Removed Blocks----------]
      |     812
      |     01:04:28,679 --> 01:04:30,530
      |     3X21, go ahead
      |     
      |     813
      |     01:04:31,280 --> 01:04:34,610
      |     3X21, confirm address on the 211 at Bonhill Road.
      |     [---------------------------------]
      | 
      |                                         [---------Warning Blocks----------]
      |                                         130
      |                                         00:12:53,500 --> 00:12:58,502
      |                                         7-X-76 Roger this is David 1099965
      |                                         [---------------------------------]

and this from Wreck-It Ralph

      |     [---------Removed Blocks----------]
      |     1566
      |     01:39:59,576 --> 01:40:00,827
      |     You fixed it!
      |     
      |     1580
      |     01:40:56,175 --> 01:40:57,382
      |     You fixed it!
      |     [---------------------------------]

ModuleNotFoundError: No module named 'six'

Hey. I'm trying to run the script on Unraid but I just get:

$ python3 ./subcleaner.py
Traceback (most recent call last):
  File "/mnt/user/scripts/subcleaner/./subcleaner.py", line 3, in <module>
    from subcleaner.main import main
  File "/mnt/user/scripts/subcleaner/subcleaner/main.py", line 6, in <module>
    from .cleaner import Cleaner
  File "/mnt/user/scripts/subcleaner/subcleaner/cleaner.py", line 1, in <module>
    from .subtitle import Subtitle
  File "/mnt/user/scripts/subcleaner/subcleaner/subtitle.py", line 2, in <module>
    from libs.langdetect import detect_langs
  File "/mnt/user/scripts/subcleaner/libs/langdetect/__init__.py", line 1, in <module>
    from .detector_factory import DetectorFactory, PROFILES_DIRECTORY, detect, detect_langs
  File "/mnt/user/scripts/subcleaner/libs/langdetect/detector_factory.py", line 10, in <module>
    from .detector import Detector
  File "/mnt/user/scripts/subcleaner/libs/langdetect/detector.py", line 4, in <module>
    import six
ModuleNotFoundError: No module named 'six'

Is there some dependency I'm missing?

Opportunity to add groups

Saw these as warnings today, maybe can be added as specific groups to block (or refine the regex?)

      |                                         [---------Warning Blocks----------]
      |                                         1380
      |                                         02:40:47,009 --> 02:40:55,211
      |                                         Resynced for Media Player by SourGrass :-))
      |                                         [---------------------------------]
	  
      |                                         [---------Warning Blocks----------]
      |                                         7
      |                                         00:00:24,980 --> 00:00:31,980
      |                                         Ripped By mstoll
      |                                         reasons: (en_warn1, en_warn2)
      |                                         [---------------------------------]
	  
      |                                         [---------Warning Blocks----------]
      |                                         32
      |                                         00:02:05,799 --> 00:02:09,931
      |                                         transcripted by alire2a
      |                                         [email protected]
      |                                         reasons: (global_warn1, global_warn1)
      |                                         [---------------------------------]

Need to be a bit more strict

The first block is valid. (It's from the show Devs). I am running with --sensitive and it still removed it

INFO: now cleaning subtitle: /mnt/tv/Devs/Devs.S01E03.1080p.AMZN.WEBRip.DDP5.1.x264-TEPES[rarbg]/Devs.S01E03.1080p.AMZN.WEB-DL.DDP5.1.H.264-TEPES.en.srt
INFO: Done. Cleaning report:
| 2 deleted blocks and 7 warnings remaining.
|
| [---------Removed Blocks----------]
| 12
| 00:03:56,470 --> 00:03:59,573
| It's the celebrity sex tape
| to end all celebrity sex tapes.
|
| 513
| 00:42:54,072 --> 00:42:56,007
| Captioned by
| Media Access Group at WGBH
| [---------------------------------]

Tons of false positives on Catastrophe S01E04

catastrophe.2015.s01e04.hdtv.x264-river-eng.srt.zip

I'm guessing this is because of the leading ~.

          | 75 deleted blocks and 20 warnings remaining.
          | 
          |     [---------Removed Blocks----------]
          |     2
          |     00:00:13,700 --> 00:00:15,890
          |     ~ Thanks.
          |     ~ Thank you.
          |     
          |     3
          |     00:00:15,940 --> 00:00:18,490
          |     You won't be able to
          |     do that for much longer.
          |     
          |     4
          |     00:00:18,540 --> 00:00:22,540
          |     ~ Which part?
          |     ~ Toss me about like a sack of radishes.
          |     ~ Yeah, I will.
          |     
          |     8
          |     00:00:30,660 --> 00:00:34,410
          |     ~ What will I do up there?
          |     ~ I don't care.
          |     
          |     20
          |     00:01:06,340 --> 00:01:10,170
          |     ~ We can't not christen the baby, Rob!
          |     ~ Why? It's bullshit.
          |     
          |     21
          |     00:01:10,220 --> 00:01:13,930
          |     ~ You know that godparents
          |     have to renounce the devil?
          |     ~ So what?
          |     
          |     25
          |     00:01:21,540 --> 00:01:25,540
          |     ~ The baby won't know.
          |     ~ Yeah, but we'll know.
          |     ~ We don't care.
          |     ~ I care!
          |     
          |     45
          |     00:02:25,340 --> 00:02:28,730
          |     ~ What about your mom?
          |     ~ She's magical.
          |     
          |     46
          |     00:02:28,780 --> 00:02:31,170
          |     ~ You'll like her better than you like me.
          |     ~ Great.
          |     
          |     51
          |     00:02:50,140 --> 00:02:54,140
          |     ~ Terrible flight. Turbulence.
          |     Fuckin' Ryanair.
          |     ~ Des.
          |     
          |     52
          |     00:02:55,300 --> 00:02:58,210
          |     ~ They sat me next to a Turkish.
          |     ~ A Turkish what?
          |     
          |     53
          |     00:02:58,260 --> 00:03:00,450
          |     ~ A Turkish man.
          |     ~ Right.
          |     
          |     54
          |     00:03:00,500 --> 00:03:04,090
          |     ~ From Turkey.
          |     ~ Right, right.
          |     
          |     57
          |     00:03:10,260 --> 00:03:13,330
          |     ~ Your brother's place?
          |     ~ Oh, yeah, it's lovely.
          |     
          |     58
          |     00:03:13,380 --> 00:03:16,330
          |     ~ He's doubled his money on this place.
          |     ~ He hasn't doubled his money.
          |     
          |     60
          |     00:03:21,460 --> 00:03:25,460
          |     ~ I'll go and help with the tea.
          |     ~ Where are you going?
          |     
          |     61
          |     00:03:31,420 --> 00:03:33,970
          |     ~ SIMULTANEOUSLY: You're an American?
          |     ~ I love Ireland.
          |     
          |     91
          |     00:05:14,340 --> 00:05:17,770
          |     ~ when Tottenham scored that goal.
          |     ~ Everybody laughed at that.
          |     
          |     95
          |     00:05:26,820 --> 00:05:29,250
          |     ~ What?
          |     ~ He missed his confirmation
          |     cos he had the measles
          |     
          |     100
          |     00:05:38,940 --> 00:05:42,940
          |     ~ What did you say?
          |     ~ Saint Mary's?
          |     
          |     101
          |     00:05:45,100 --> 00:05:49,100
          |     ~ How are we feeling?
          |     ~ Fine, I guess. Not terrible.
          |     
          |     103
          |     00:05:53,340 --> 00:05:57,340
          |     ~ A dumpster fire?
          |     ~ A bit up and down.
          |     ~ Good.
          |     
          |     105
          |     00:06:02,580 --> 00:06:04,810
          |     ~ It's full cancer now?
          |     ~ No, no. No, it's not that.
          |     
          |     109
          |     00:06:14,220 --> 00:06:16,730
          |     ~ What does that mean?
          |     ~ It would mean that there's a higher risk
          |     
          |     110
          |     00:06:16,780 --> 00:06:19,930
          |     of your baby having Down's Syndrome.
          |     
          |     111
          |     00:06:19,980 --> 00:06:22,610
          |     ~ Because in a geriatric pregnancy...
          |     ~ Geriatric!
          |     
          |     116
          |     00:06:38,540 --> 00:06:42,130
          |     ~ What are you? A sadist?
          |     ~ Sharon!
          |     ~ No, why do I need to know that?
          |     
          |     122
          |     00:07:11,140 --> 00:07:15,140
          |     ~ Why...
          |     ~ Where are my knickers? Er...
          |     
          |     124
          |     00:07:26,420 --> 00:07:30,420
          |     ~ for the next time we come in?
          |     ~ Erm, OK.
          |     ~ Thank you.
          |     
          |     126
          |     00:07:42,540 --> 00:07:46,540
          |     ~ Where the hell have you been?
          |     ~ I walked to Croydon.
          |     
          |     127
          |     00:07:46,860 --> 00:07:50,010
          |     ~ OK. Is that far?
          |     ~ About 13 miles.
          |     
          |     128
          |     00:07:50,060 --> 00:07:52,330
          |     Well, if you ever need
          |     to do that again,
          |     
          |     129
          |     00:07:52,380 --> 00:07:56,380
          |     ~ please let me shadow you from
          |     a block back or something.
          |     ~ OK.
          |     
          |     132
          |     00:08:06,980 --> 00:08:09,450
          |     ~ What the hell were you up to then?
          |     ~ Nothing that cool.
          |     
          |     151
          |     00:09:08,660 --> 00:09:11,570
          |     ~ to be able to look after a disabled child.
          |     ~ Neither am I, but...
          |     
          |     160
          |     00:09:43,220 --> 00:09:47,220
          |     ~ What do you want to do?
          |     ~ Do you mind if we don't
          |     talk about it any more?
          |     
          |     161
          |     00:09:49,860 --> 00:09:52,530
          |     ~ It's not going to happen.
          |     ~ But if it did...
          |     ~ It's not.
          |     
          |     164
          |     00:09:57,500 --> 00:09:58,690
          |     ~ It's one in 50.
          |     ~ Fine.
          |     
          |     165
          |     00:09:58,740 --> 00:10:01,810
          |     One in 50 chance of winning the
          |     lottery if you buy a ticket.
          |     
          |     166
          |     00:10:01,860 --> 00:10:04,530
          |     ~ Feeling good about your chances?
          |     ~ No, but...
          |     
          |     167
          |     00:10:04,580 --> 00:10:06,570
          |     There you go, if you're
          |     not going to win that,
          |     
          |     168
          |     00:10:06,620 --> 00:10:09,010
          |     ~ why would you win this one?
          |     ~ That's a good point.
          |     
          |     177
          |     00:10:34,460 --> 00:10:38,460
          |     ~ What?
          |     ~ It's all right. She wants
          |     to meet you and sort it out.
          |     ~ Why?
          |     
          |     181
          |     00:10:52,740 --> 00:10:56,740
          |     ~ Spleeny? That's not word.
          |     ~ Yeah, it is. It's a word.
          |     
          |     182
          |     00:10:59,140 --> 00:11:01,130
          |     ~ Spleeny?
          |     ~ Yeah, spleeny.
          |     
          |     186
          |     00:11:10,300 --> 00:11:14,300
          |     ~ Rob, don't!
          |     ~ PHONE RINGS
          |     
          |     187
          |     00:11:15,180 --> 00:11:18,170
          |     ~ Who's calling you?
          |     ~ Who's calling me?
          |     
          |     188
          |     00:11:18,220 --> 00:11:22,220
          |     ~ Yeah.
          |     ~ Er... Oh, Chris.
          |     
          |     191
          |     00:11:29,780 --> 00:11:33,780
          |     ~ You bellend.
          |     ~ Spleeny works. That's... you win.
          |     
          |     198
          |     00:11:55,340 --> 00:11:58,290
          |     ~ Hello?
          |     ~ Hello? Is that Ms Morris?
          |     ~ Yeah. It is.
          |     
          |     201
          |     00:12:03,540 --> 00:12:05,730
          |     ~ Yes.
          |     ~ Well, there was an error in the results.
          |     
          |     202
          |     00:12:05,780 --> 00:12:08,570
          |     It turns out the chances
          |     of chromosomal abnormality
          |     
          |     203
          |     00:12:08,620 --> 00:12:12,620
          |     ~ aren't one in 50, as you were told.
          |     ~ Oh. Well, that's... is that...
          |     
          |     208
          |     00:12:24,380 --> 00:12:28,380
          |     ~ Let us know. OK? Sorry again. Bye, now.
          |     ~ Bye, now.
          |     
          |     213
          |     00:12:57,220 --> 00:13:00,370
          |     ~ Is she drunk?
          |     ~ She was drunk at the Christmas
          |     concert last year.
          |     
          |     229
          |     00:14:05,900 --> 00:14:08,370
          |     ~ Is Mummy here?
          |     ~ Mummy?
          |     
          |     233
          |     00:14:20,260 --> 00:14:23,410
          |     ~ He had to take one of
          |     his pills and lie down.
          |     ~ Oh, please!
          |     
          |     236
          |     00:14:26,780 --> 00:14:29,090
          |     ~ And pregnant.
          |     ~ Yeah, with a baby! Not a...
          |     
          |     237
          |     00:14:29,140 --> 00:14:31,490
          |     wombat!
          |     
          |     238
          |     00:14:31,540 --> 00:14:35,540
          |     ~ Where have you come from?
          |     ~ I was at the doctor.
          |     ~ Why?
          |     
          |     247
          |     00:14:51,300 --> 00:14:54,650
          |     ~ God forgive me.
          |     ~ Mum?!
          |     
          |     252
          |     00:15:07,060 --> 00:15:10,690
          |     ~ Des! Mind your heart!
          |     ~ Yes, yes...
          |     
          |     263
          |     00:15:39,860 --> 00:15:42,250
          |     ~ So how's Jeffrey?
          |     ~ Jeffrey's great.
          |     
          |     271
          |     00:16:00,460 --> 00:16:04,460
          |     ~ Yeah, but he's...
          |     ~ Very tricky.
          |     ~ Mm-hmm.
          |     
          |     294
          |     00:17:34,340 --> 00:17:38,340
          |     ~ Hey.
          |     ~ Hi.
          |     
          |     295
          |     00:17:40,460 --> 00:17:43,290
          |     ~ There's something...
          |     ~ I've been to the doctor without telling you.
          |     
          |     333
          |     00:19:21,580 --> 00:19:25,580
          |     ~ Hi.
          |     ~ Hey! How are you? Come in. Come in.
          |     
          |     334
          |     00:19:28,020 --> 00:19:32,020
          |     ~ Is Rob home?
          |     ~ No. No, don't worry. I know all about it.
          |     ~ Oh.
          |     
          |     341
          |     00:19:49,860 --> 00:19:53,860
          |     ~ Sorry, what?
          |     ~ I think Rob and I both regret what happened.
          |     
          |     349
          |     00:20:19,500 --> 00:20:22,490
          |     ~ Oh, well...
          |     ~ Did you sleep with Fran?
          |     
          |     369
          |     00:21:41,900 --> 00:21:44,930
          |     ~ You OK, Miss?
          |     ~ Yeah, I'm great.
          |     
          |     372
          |     00:21:50,260 --> 00:21:54,260
          |     ~ Are you OK?
          |     ~ Yeah.
          |     
          |     375
          |     00:22:16,420 --> 00:22:19,850
          |     ~ I'll see you in a few months.
          |     ~ All right, love.
          |     
          |     376
          |     00:22:19,900 --> 00:22:22,490
          |     Now don't forget to eat.
          |     
          |     377
          |     00:22:22,540 --> 00:22:26,210
          |     ~ Bye, Dad.
          |     ~ All right, dear. Listen, just...
          |     ~ Oh, no, don't be silly.
          |     [---------------------------------]

Not removing ads?

Ran on a sub but nothing was processed and these lines were not removed:

Support us and become VIP member to remove all ads from www.OpenSubtitles.org

Please rate this subtitle at www.osdb.link/xyz Help other users choose the best subtitles.

hello from similar project - opensubtitles_adblocker.py

hey : )

i started my opensubtitles_adblocker.py about a month ago
as a part of my opensubtitles-scraper project
since i could not find any adblocker for subtitles...
not sure why i did not find this project earlier, i just found it via reddit

one difference:
my adblocker works on raw bytes, because that is faster
and because sub files can have broken encoding
for example utf8 and latin1 can appear in one file

for my opensubtitles_adblocker_add.py
i have forked pysubs2 to pysubs2bytes
so i can parse subtitle files into raw bytestrings

lets collaborate?
we could build a test corpus for ads in subtitles
based on the 6 million subs from opensubtitles.org
i would use one large srt file to store all the unwanted subs
and from that srt file, we could derive other subtitle formats
that srt would have a mix of many different encodings
so it would not be possible to read it as text file
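a rough illustration of the raw-bytes idea, using python's `re` module with bytes patterns. the names `AD_PATTERN` and `ad_lines` are made up for this sketch, and the two patterns are just examples taken from the opensubtitles ads quoted in the issue above:

```python
import re

# Bytes pattern: matches ASCII ad text no matter how the rest of the file is encoded.
AD_PATTERN = re.compile(rb"opensubtitles\.org|osdb\.link", re.IGNORECASE)

def ad_lines(raw):
    """Return the raw lines that contain a known ad marker."""
    return [line for line in raw.splitlines() if AD_PATTERN.search(line)]

# One latin1 line, one utf8 line and one ad line in the same "file".
sample = (
    "Café au lait".encode("latin-1") + b"\n"
    + "Californië".encode("utf-8") + b"\n"
    + b"Support us at www.OpenSubtitles.org\n"
)
print(ad_lines(sample))
```

the sample cannot be decoded as utf8 as a whole, but the ad line is still found because the match never leaves the byte level.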

i would use one large srt file to store all the unwanted subs

why srt?
srt is the most popular format = 90% of all subs
frequency of subtitle formats in opensubtitles.org

$ sqlite3  subtitles_all.txt.gz.db "select SubFormat, count(1) FROM subz_metadata GROUP BY SubFormat" | sort -n -k2 -t'|' -r

srt|5774302
sub|317934
ssa|179597
mpl|40845
tmp|17883
smi|7547
txt|6545
vtt|1998

SRT file starts at zero index

Hey, first thanks for making/sharing this script! I recently tried it out for the first time, and it worked great, but a few movie SRT files came back with parsing errors.

When I looked into the files it seems like they all have the first block starting with an index of zero while SRT files which were read correctly all start at one.

Error from one of the runs:

INFO: subcleaner finished successfully partly. 31/36 files cleaned successfully.
    INFO: failed to clean following files:
    INFO:   - 'Harry Potter and the Deathly Hallows - Part 1 (2010) WEBDL-1080p.en.srt' reason: subcleaner was unable to decode the file: Parsing error at block 1 in file "\\file\path\redacted\Harry Potter Complete Collection (2001-2011)\Harry Potter and the Deathly Hallows - Part 1 (2010)\Harry Potter and the Deathly Hallows - Part 1 (2010) WEBDL-1080p.en.srt" line None. reason: incorrectly formatted subtitle block

SRT file attached (but saved as txt so it could upload).

My guess is the zero index, since it is the only difference I noticed between a good file and a bad file. I guess the SRT convention is to start the index at 1? If so, maybe there could be a simple sanity check when the file is first read: if the first block is indexed at zero, change it to 1 and renumber all following blocks (similar to the renumbering after a block deletion?).
Harry Potter and the Deathly Hallows - Part 2 (2011) WEBDL-1080p.en.srt.txt
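
The renumbering suggested above could look roughly like the sketch below. It assumes well-formed blocks separated by blank lines and `\n` line endings; `renumber_srt` is a hypothetical helper, not a function from subcleaner, and whether the zero index is really what triggers the parsing error is still a guess.

```python
import re

def renumber_srt(text):
    """Renumber SRT blocks so that the first index is 1."""
    # Blocks are assumed to be separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    renumbered = []
    for number, block in enumerate(blocks, start=1):
        lines = block.split("\n")
        # Only overwrite the first line if it really is a numeric index.
        if lines[0].strip().isdigit():
            lines[0] = str(number)
        renumbered.append("\n".join(lines))
    return "\n\n".join(renumbered) + "\n"

sample = (
    "0\n00:00:01,000 --> 00:00:02,000\nHi\n\n"
    "1\n00:00:03,000 --> 00:00:04,000\nBye\n"
)
print(renumber_srt(sample))
```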

Fatal error in manual execution

Hi, I encountered a full stop of the script once more.

IndexError: list index out of range

This works great! But I'm getting some errors (maybe around 2% of the time):

Traceback (most recent call last):
  File "/opt/subcleaner/subcleaner.py", line 8, in <module>
    main.main(Path(__file__).absolute().parent)
  File "/opt/subcleaner/libs/subcleaner/main.py", line 34, in main
    clean_file(file)
  File "/opt/subcleaner/libs/subcleaner/main.py", line 69, in clean_file
    cleaner.run_regex(subtitle)
  File "/opt/subcleaner/libs/subcleaner/cleaner.py", line 22, in run_regex
    if subtitle.blocks[0].start_time < timedelta(seconds=2):
IndexError: list index out of range
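
The traceback shows `subtitle.blocks[0]` being read when the list of blocks is empty. A guard along the following lines would avoid the crash. This is a sketch under assumptions: `first_block_starts_early` is a hypothetical helper, and only the attribute name `start_time` and the two-second threshold are taken from the traceback.

```python
from datetime import timedelta
from types import SimpleNamespace

def first_block_starts_early(blocks, threshold=timedelta(seconds=2)):
    """Return True if the first block starts before the threshold; False if there are no blocks."""
    if not blocks:
        # An empty subtitle has no first block, so there is nothing to check.
        return False
    return blocks[0].start_time < threshold

# Stand-in objects for illustration; real blocks would come from the parser.
print(first_block_starts_early([]))
print(first_block_starts_early([SimpleNamespace(start_time=timedelta(seconds=1))]))
```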

Provide an option to remove id #s X, Y, Z

Let's say I do a dry run and I like 2 of the 3 changes it would do. How can I say only do #1 and #3 but not number 2?

New Regex to add to english (or others) config

Found the following ads in some subs that Bazarr was pulling in

          |     [---------Removed Blocks----------]
          |     33
          |     00:01:19,000 --> 00:01:25,074
          |     ENJOY ALL VOD IN HIGH QUALITY @ 4KVOD.TV
          |     GET LIVE TV,MOVIES, SHOWS IN ONE PACKAGE
          |     reasons: (en_purge6, global_purge2, global_warn1)
          |     
          |     1223
          |     01:36:55,305 --> 01:37:55,664
          |     Watch Movies, TV Series and Live Sports
          |     Signup Here -> WWW.ADMIT1.APP
          |     reasons: (en_purge6, global_warn1, global_warn1, close_to_end)
          |     [---------------------------------]

I created the following regex and tested that it removes them. Probably a good idea to add it to the English config.
en_purge6: \b(admit1\.app|4kvod\.tv)\b

AttributeError: 'NoneType' object has no attribute 'replace'

Hey, I get an error when trying to run your script on my library. Not sure if the fault is on my end or if it's something in the script. I'm running python 3.9.5.

Traceback (most recent call last):
  File "/config/subcleaner/subcleaner.py", line 8, in <module>
    main.main(Path(__file__).absolute().parent)
  File "/config/subcleaner/libs/subcleaner/main.py", line 29, in main
    parse_config()
  File "/config/subcleaner/libs/subcleaner/main.py", line 217, in parse_config
    cleaner = Cleaner(package_dir.joinpath("regex"), regex_defaults)
  File "/config/subcleaner/libs/subcleaner/cleaner.py", line 17, in __init__
    self._build_regex(regex_dir, use_default_regex)
  File "/config/subcleaner/libs/subcleaner/cleaner.py", line 156, in _build_regex
    self._add_exclusive_configs()
  File "/config/subcleaner/libs/subcleaner/cleaner.py", line 187, in _add_exclusive_configs
    excluded_languages = parser["META"].get("excluded_language_codes").replace(" ", "").split(",")
AttributeError: 'NoneType' object has no attribute 'replace'
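
The error means the [META] section of one regex config has no `excluded_language_codes` entry, so `.get()` returns None. One possible way to tolerate that is to use configparser's `fallback` argument, as in the sketch below. The function name `excluded_languages` is hypothetical; only the section and option names come from the traceback.

```python
import configparser

def excluded_languages(parser):
    """Read the excluded language codes, treating a missing option as an empty list."""
    # fallback="" covers both a missing option and a missing [META] section.
    raw = parser.get("META", "excluded_language_codes", fallback="")
    return [code for code in raw.replace(" ", "").split(",") if code]

without_option = configparser.ConfigParser()
without_option.read_string("[META]\nlanguage_codes = en\n")

with_option = configparser.ConfigParser()
with_option.read_string("[META]\nexcluded_language_codes = en, sv\n")

print(excluded_languages(without_option))
print(excluded_languages(with_option))
```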

Problematic SRT files

I'm not sure what's wrong with these subs, but I'm getting this error for the entire series. I attached a txt version and a zip of all the show's subtitles. I'm using the REGEX that's in github that has caused no issues other than some Chinese TV show subs like this one.

Wrong text encoding?

[ERROR]: Exiting, There might be an issue with the regex, because everything in the subtitle would have gotten deleted. Nothing was changed.

If I Can Love You So (2019) - S01E01 - Episode 1 - (HDTV-720p-x265).en.zip
If I Can Love You So (2019) - S01E01 - Episode 1 - (HDTV-720p-x265).en.txt

How to add a regex profile for hi-type subs

Hi, thanks for this subcleaner.

Can you please explain how to add hearing-impaired subs to the regex language profile?
Bazarr puts them like ".en.hi.srt".
Running subcleaner give a warning: "WARNING: language 'hi' have no regex profile associated with it."
Should/can it be added as a language code en.hi in the English regex profile conf?
I get the impression it treats it as a separate language...
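
For illustration, language detection from the file name could skip trailing flags such as "hi" when a language code precedes them. The sketch below is not how subcleaner works; `FLAG_TAGS`, `looks_like_language` and `language_code` are hypothetical names, and the rule is deliberately simple because "hi" is also the ISO 639-1 code for Hindi.

```python
# Hypothetical set of non-language tags that Bazarr-style names may carry.
FLAG_TAGS = {"hi", "sdh", "cc", "forced"}

def looks_like_language(tag):
    """Very rough check: two or three letters, like 'en' or 'eng'."""
    return tag.isalpha() and len(tag) in (2, 3)

def language_code(filename):
    """Guess the language code from names like 'Movie.en.hi.srt'."""
    parts = filename.lower().split(".")
    tags = parts[1:-1]  # everything between the first name part and the extension
    if not tags:
        return None
    # 'en.hi' -> 'hi' is read as a hearing-impaired flag because a language precedes it.
    if len(tags) >= 2 and tags[-1] in FLAG_TAGS and looks_like_language(tags[-2]):
        return tags[-2]
    # A lone 'hi' is still read as a language (Hindi).
    return tags[-1] if looks_like_language(tags[-1]) else None

print(language_code("Movie.en.hi.srt"))
print(language_code("Movie.hi.srt"))
```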

How to run the script on all previously downloaded subtitles?

I have just added the script to bazarr but if it only runs when a subtitle is grabbed, all previously downloaded subtitles will still contain their ads. How can it be made to run on all downloaded subtitles?

False Positive on Season X Episode X

Hi, I'm trying to build a Portuguese regex file, and in my tests I stumbled upon this regex in global.conf:
global_purge3: s(eason)?\W*\d+[^,]\We(pisode)?\W\d+[^,]

This Regex is removing the following line from my file:

    00:21:19,280 --> 00:21:21,760
    casos famosos,
    especialmente nos anos 70 e 80,

Which translates to
"famous cases,
especially in the 70s and 80s,"

It is matching this part: anos 70 e 80.

I think a good solution would be to check the character before the 's(eason)'. Is this possible somehow?
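
One way to check the character before the `s` is a word boundary (`\b`), which only lets the pattern start at the beginning of a word. The sketch below compares the quoted pattern with and without it. Applying the pattern case-insensitively is an assumption about how subcleaner uses it, and the sample episode tag is made up for illustration:

```python
import re

# Pattern quoted in the report; re.IGNORECASE is an assumption about how it is applied.
ORIGINAL = re.compile(r"s(eason)?\W*\d+[^,]\We(pisode)?\W\d+[^,]", re.IGNORECASE)
# Same pattern with a leading word boundary, so the "s" must start a word.
ANCHORED = re.compile(r"\bs(eason)?\W*\d+[^,]\We(pisode)?\W\d+[^,]", re.IGNORECASE)

dialogue = "especialmente nos anos 70 e 80,"
episode_tag = "Season 01 Episode 02"

print(bool(ORIGINAL.search(dialogue)))     # True: matches "s 70 e 80"
print(bool(ANCHORED.search(dialogue)))     # False: the "s" in "anos" does not start a word
print(bool(ANCHORED.search(episode_tag)))  # True
```

The boundary version no longer matches the Portuguese line because every `s` in it sits inside or at the end of a word.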

Division By Zero

Just been running subcleaner manually on my existing subtitles. Currently on Battlestar Galactica. I've done a first pass using
python3 config/subcleaner/subcleaner.py -r "tv/Battlestar Galactica (2003)"
and then I was doing a dry run to check the remaining warnings
python3 config/subcleaner/subcleaner.py -r "tv/Battlestar Galactica (2003)" -n
which when it gets to tv/Battlestar Galactica (2003)/Season 01/Battlestar Galactica (2003) - S01E13 - Kobols Last Gleaming 2 [Bluray-1080p Remux][AAC 2.0][VC1].en.srt crashes with the following Traceback:

    INFO: [---------------------------------------------------------------------------------]
    INFO: loading subtitle: tv/Battlestar Galactica (2003)/Season 01/Battlestar Galactica (2003) - S01E13 - Kobols Last Gleaming 2 [Bluray-1080p Remux][AAC 2.0][VC1].en.srt
    INFO: now cleaning subtitle: tv/Battlestar Galactica (2003)/Season 01/Battlestar Galactica (2003) - S01E13 - Kobols Last Gleaming 2 [Bluray-1080p Remux][AAC 2.0][VC1].en.srt
 WARNING: the language within the file does not match language: 'en'
Traceback (most recent call last):
  File "//config/subcleaner/subcleaner.py", line 11, in <module>
    main.main()
  File "/config/subcleaner/libs/subcleaner/main.py", line 24, in main
    clean_directory(library)
  File "/config/subcleaner/libs/subcleaner/main.py", line 125, in clean_directory
    clean_directory(file)
  File "/config/subcleaner/libs/subcleaner/main.py", line 132, in clean_directory
    clean_file(file)
  File "/config/subcleaner/libs/subcleaner/main.py", line 89, in clean_file
    cleaner.fix_overlap(subtitle)
  File "/config/subcleaner/libs/subcleaner/cleaner/cleaner.py", line 92, in fix_overlap
    content_ratio = block.duration_seconds / (block.duration_seconds + previous_block.duration_seconds)
                    ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ZeroDivisionError: float division by zero

I've attached the file in question, although I've just had a look, and apparently it's in French despite stating it is in English!
Battlestar Galactica (2003) - S01E13 - Kobols Last Gleaming 2 [Bluray-1080p Remux][AAC 2.0][VC1].en.srt.txt
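
The division fails when both the current and the previous block have a duration of zero. A minimal sketch of a guarded version of that calculation follows. The function name `content_ratio` and the choice of `0.5` as the fallback are assumptions made for illustration, not subcleaner's actual behaviour:

```python
def content_ratio(duration: float, previous_duration: float) -> float:
    """Share of the combined duration that belongs to the current block."""
    total = duration + previous_duration
    if total <= 0:
        # Both blocks have zero length; split evenly instead of dividing by zero.
        return 0.5
    return duration / total


print(content_ratio(1.0, 3.0))  # 0.25
print(content_ratio(0.0, 0.0))  # 0.5
```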

Avoid detecting "lyrics"

I have noticed that lyrics in a show or movie trigger the similar-content warning, among others, and have also resulted in incorrect removals. These blocks typically contain a music symbol (♪). Could this symbol be used to ignore those sections?

    [---------Warning Blocks----------]
    6
    00:00:22,435 --> 00:00:30,515
    ♪ It's Adventure Time ♪
    Ripped By mstoll
    reasons: (en_warn1, en_warn2)
    [---------------------------------]
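
Note that the example block mixes a lyric line with a real ad line, so skipping the whole block would keep the ad. One possible approach, sketched here under the assumption that lyric lines start or end with a music symbol, is to drop only those lines before the remaining text is checked. The names `MUSIC_SYMBOLS`, `is_lyric_line`, and `non_lyric_text` are hypothetical:

```python
MUSIC_SYMBOLS = ("♪", "♫")


def is_lyric_line(line: str) -> bool:
    """True when the line starts or ends with a music symbol."""
    stripped = line.strip()
    return stripped.startswith(MUSIC_SYMBOLS) or stripped.endswith(MUSIC_SYMBOLS)


def non_lyric_text(block_text: str) -> str:
    """Return the block text with lyric lines removed."""
    return "\n".join(
        line for line in block_text.splitlines() if not is_lyric_line(line)
    )


block = "♪ It's Adventure Time ♪\nRipped By mstoll"
print(non_lyric_text(block))  # Ripped By mstoll
```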

False positives

I just ran this excellent script on all my media and here are the false positives I encountered:

[INFO]: Removed 1 subtitle blocks:
        [---------Removed Blocks----------]
        1150
        01:03:41,519 --> 01:03:43,650
        ...sexually explosive.
        [---------------------------------]

[INFO]: Removed 1 subtitle blocks:
        [---------Removed Blocks----------]
        43
        00:02:49,711 --> 00:02:52,921
        It's a great set.
        Full HD. 1080p,
        240 hertz, TrueMotion.
        [---------------------------------]

[INFO]: Removed 2 subtitle blocks:
        [---------Removed Blocks----------]
        333
        00:17:56,568 --> 00:17:58,768
        You want me to text my professor?

        334
        00:17:58,771 --> 00:18:00,771
        Yeah. Text... text him, text him.
        [---------------------------------]

[INFO]: Removed 6 subtitle blocks:
        [---------Removed Blocks----------]
        674
        00:21:17,817 --> 00:21:20,737
        Over and over,
        rickandmortyadventures.com.

        675
        00:21:20,820 --> 00:21:23,740
        www.rickandmorty.com.

        676
        00:21:23,824 --> 00:21:25,991
        www. rickandmortyadventures.

        677
        00:21:26,076 --> 00:21:27,661  
        All 100 years.                <--- Why did this block match at all? Is it just because it's between other matched blocks?

        678
        00:21:27,744 --> 00:21:30,329
        Every minute, rickandmorty.com.

        679
        00:21:30,413 --> 00:21:34,125
        www.100timesrickandmorty.com.
        [---------------------------------]

        ^This is from Rick and Morty - S01E01. Here it seems that everything resembling a URL gets removed, but the following was just a warning:

[WARNING]: Potential ads in 1 subtitle blocks, please verify:
            [---------Warning Blocks----------]
            550
            00:56:03,304 --> 00:56:09,175
            Craving big poker? Feast your eyes on Venom.
            $5 million GTD. AmericasCardroom.com

Here's a curious case:

[INFO]: Removed 1 subtitle blocks:
        [---------Removed Blocks----------]
        698
        00:51:27,500 --> 00:51:35,500
        Ripped By mstoll
        [---------------------------------]
[WARNING]: Potential ads in 1 subtitle blocks, please verify:
            [---------Warning Blocks----------]
            7
            00:00:46,000 --> 00:00:54,000
            Ripped By mstoll
            [---------------------------------]

How come the exact same pattern is only a warning the second time it's encountered?

All in all, not bad considering how many files it ran on 👍

Installing in Docker container

This is not a bug, but a request for help. I am trying to install this script within the LS docker container. Tried "git clone https://github.com/KBlixt/subcleaner.git", but this didn't do the trick as git doesn't seem to be installed. Any thoughts?

Many false positives

subcleaner/libs/subcleaner/cleaner.py

Line 23 in 3f76291

subtitle.blocks[0].regex_matches = 3

Hi!

Love the work you've done with this project. I've been using it for a few months now and I love it!

Line 23 in cleaner.py gives me a lot of false positives. If I change the 3 to a 1, the false positives become warnings instead, so that's what I have done locally for now.

subtitle.blocks[0].regex_matches = 1

With the original value, it removes the first block of many TV show subtitles that start with something like: "Previously on tv show".

List index out of range

When performing a dry run on Scrubs (2001)
python3 config/subcleaner/subcleaner.py -r "tv/Scrubs (2001)"/ -n
episode tv/Scrubs (2001)/Season 04/Scrubs (2001) - S04E16 - My Quarantine [SDTV][AC3 2.0][MPEG2].en.srt
causes a crash, with the resulting Traceback:

    INFO: [---------------------------------------------------------------------------------]
    INFO: loading subtitle: tv/Scrubs (2001)/Season 04/Scrubs (2001) - S04E16 - My Quarantine [SDTV][AC3 2.0][MPEG2].en.srt
    INFO: now cleaning subtitle: tv/Scrubs (2001)/Season 04/Scrubs (2001) - S04E16 - My Quarantine [SDTV][AC3 2.0][MPEG2].en.srt
Traceback (most recent call last):
  File "//config/subcleaner/subcleaner.py", line 11, in <module>
    main.main()
  File "/config/subcleaner/libs/subcleaner/main.py", line 24, in main
    clean_directory(library)
  File "/config/subcleaner/libs/subcleaner/main.py", line 125, in clean_directory
    clean_directory(file)
  File "/config/subcleaner/libs/subcleaner/main.py", line 132, in clean_directory
    clean_file(file)
  File "/config/subcleaner/libs/subcleaner/main.py", line 86, in clean_file
    cleaner.find_ads(subtitle)
  File "/config/subcleaner/libs/subcleaner/cleaner/cleaner.py", line 24, in find_ads
    punishers.punish_quick_first_block(subtitle)
  File "/config/subcleaner/libs/subcleaner/cleaner/punishers/time.py", line 8, in punish_quick_first_block
    block = subtitle.blocks[0]
            ~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Subtitles are attached
Scrubs (2001) - S04E16 - My Quarantine [SDTV][AC3 2.0][MPEG2].en.srt.txt

After checking the file, it appears that all the blocks are at 00:00:00,000 --> 00:00:00,000, and it's actually empty of any content. Obviously a bad file.
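
The crash comes from indexing `subtitle.blocks[0]` on a subtitle that ended up with no blocks. A small sketch of a guard is shown below. The `Subtitle` class is a stand-in with only a `blocks` list, and `first_block_or_none` is a hypothetical helper, not subcleaner code:

```python
from dataclasses import dataclass, field


@dataclass
class Subtitle:
    """Stand-in for the real subtitle object; only `blocks` matters here."""
    blocks: list = field(default_factory=list)


def first_block_or_none(subtitle: Subtitle):
    """Return the first block, or None when the file parsed to zero blocks."""
    if not subtitle.blocks:
        return None
    return subtitle.blocks[0]


print(first_block_or_none(Subtitle()))  # None
```

A caller could then skip the file and report it as empty instead of raising `IndexError`.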

Can't run it through Bazarr

Hi, I've followed all the instructions, but somehow I can't get it to work under Bazarr. I always seem to get this:

BAZARR Post-processing result for file /mnt/data/shows/Constellation (2024) [imdb-tt19395018]/Season 01/Constellation (2024) [imdb-tt19395018] - S01E07 - Through The Looking Glass [ATVP WEBDL-2160p][DV][EAC3 Atmos 5.1][h265]-FLUX.mkv:
Traceback (most recent call last):
  File "/opt/subcleaner/subcleaner.py", line 10, in <module>
    from libs.subcleaner import main
  File "/opt/subcleaner/libs/subcleaner/main.py", line 4, in <module>
    from .subtitle import Subtitle, ParsingException, FileContentException
  File "/opt/subcleaner/libs/subcleaner/subtitle.py", line 6, in <module>
    from .settings import args, config
  File "/opt/subcleaner/libs/subcleaner/settings/__init__.py", line 3, in <module>
    from . import log_config
  File "/opt/subcleaner/libs/subcleaner/settings/log_config.py", line 15, in <module>
    file_handler = logging.handlers.RotatingFileHandler(config.log_file, maxBytes=10_000_000, backupCount=10, encoding='utf8')
  File "/usr/lib/python3.9/logging/handlers.py", line 153, in __init__
    BaseRotatingHandler.__init__(self, filename, mode, encoding=encoding,
  File "/usr/lib/python3.9/logging/handlers.py", line 58, in __init__
    logging.FileHandler.__init__(self, filename, mode=mode,
  File "/usr/lib/python3.9/logging/__init__.py", line 1142, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python3.9/logging/__init__.py", line 1171, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding,
OSError: [Errno 30] Read-only file system: '/opt/subcleaner/logs/subcleaner.log'

It looks like it can never open the log file. I already tried assigning ownership of subcleaner to the bazarr user, without success. Running it directly from the CLI works perfectly, though.
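
The `OSError` is raised while the rotating file handler opens the log file at import time, so the script stops before it can do any work. One way to make that step tolerant, sketched here with a hypothetical helper `make_log_handler`, is to fall back to logging on stderr when the file cannot be opened. The handler arguments are copied from the traceback:

```python
import logging
import logging.handlers
import sys


def make_log_handler(log_file: str) -> logging.Handler:
    """Return a rotating file handler, or a stderr handler if the file is unusable."""
    try:
        return logging.handlers.RotatingFileHandler(
            log_file, maxBytes=10_000_000, backupCount=10, encoding="utf8"
        )
    except OSError as error:
        # Covers read-only file systems, missing directories, and denied permissions.
        print(f"cannot open log file {log_file!r}: {error}", file=sys.stderr)
        return logging.StreamHandler(sys.stderr)
```

This only hides the symptom. The underlying cause here is that `/opt/subcleaner/logs` is on a read-only file system from the point of view of the Bazarr process.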

Mr Robot issue

Subcleaner fails to clean files whose names contain another extension before the language code and .srt.
It's a very specific issue, but Mr. Robot has episode names with file extensions in them (as may some other shows), so subcleaner misreads the language.

  INFO: failed to clean following files:
  INFO:   - 'Mr. Robot - S01E10 - eps1.9_zer0-day.avi.en.srt' reason: language 'avi' have no regex profile associated with it.
  INFO:   - 'Mr. Robot - S01E06 - eps1.5_br4ve-trave1er.asf.en.srt' reason: language 'asf' have no regex profile associated with it.
  INFO:   - 'Mr. Robot - S01E03 - eps1.2_d3bug.mkv.en.srt' reason: language 'mkv' have no regex profile associated with it.
  INFO:   - 'Mr. Robot - S01E01 - eps1.0_hellofriend.mov.en.srt' reason: language 'mov' have no regex profile associated with it.
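
How subcleaner extracts the language code is not shown here, so the following is only a sketch of one possible approach: scan the dot-separated parts from the right, skip the subtitle extension and known flag tags, and take the first remaining part. The function name and the `MODIFIERS` set are hypothetical. Note that `hi` is also the ISO 639-1 code for Hindi, so treating it as a hearing-impaired flag is a judgment call:

```python
from pathlib import Path
from typing import Optional

# Hypothetical set of non-language tags that may sit between language and extension.
MODIFIERS = {"hi", "sdh", "cc", "forced"}


def language_from_filename(name: str) -> Optional[str]:
    """Guess the language code from the right-hand side of a subtitle filename."""
    parts = Path(name).name.split(".")
    # Ignore the first part (title start) and the last part (subtitle extension),
    # then scan the rest from right to left.
    for part in reversed(parts[1:-1]):
        token = part.lower()
        if token in MODIFIERS:
            continue
        return token
    return None


print(language_from_filename("Mr. Robot - S01E10 - eps1.9_zer0-day.avi.en.srt"))  # en
```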

Issue with --end-report

I already sent the specifics about this issue, and much more, on Discord. Please delete this after you've read it.

Invalid Start Byte

Hi,

I'm trying to run the script against one of my srt files, but it threw the following error:

root@bazaar:/movies/abc# python3 /config/script/subcleaner/subcleaner.py --dry-run File.en.srt 
subcleaner was unable to decode file: "Files.en.srt
" reason: "invalid start byte"
subcleaner completed successfully.

I have attached the srt file for your reference.

Thanks
Files.en.srt.zip
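
"Invalid start byte" is the message Python gives when a file that is not valid UTF-8 is decoded as UTF-8. A common workaround, sketched below with a hypothetical helper `decode_subtitle`, is to try a short list of encodings in order. The candidate list is an assumption; the right legacy encoding depends on the subtitle's language:

```python
def decode_subtitle(raw: bytes, encodings=("utf-8-sig", "cp1252")) -> tuple:
    """Return (text, encoding_used), trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last step never raises.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_subtitle("café".encode("cp1252"))
print(used)  # cp1252
```

Re-saving the file as UTF-8 in a text editor should have the same effect for a single file.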