Comments (14)
Well, it was simpler than I thought. It was mostly that I have had little time these last months. Early this winter tvgids.nl renewed their site and I hadn't found time to do more than the basics. Now the genres and the detail pages are working again as well.
from tvgrabpyapi.
Added ul and li. It is a set of re.sub statements:
"sub": ["<html>", "", "</html>", "",
"\\s*</p><p>\\s*", " ", "<p>", "", "</p>", "",
"\\s*<ul>", " (", "</ul>\\s*", ") ",
"\\s*</li><li>\\s*", "; ", "<li>\\s*", "", "\\s*</li>", ""]
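A minimal sketch of how such a list could be applied, assuming the entries are read as consecutive (pattern, replacement) pairs and passed to re.sub in order. The helper name apply_subs and the shortened sample description are hypothetical; how tvgrabpyAPI itself processes the "sub" list may differ.

```python
import re

# Pattern/replacement pairs copied from the "sub" list in the datadef.
# The JSON escape "\\s" corresponds to the regex \s.
SUB = [
    r"<html>", "", r"</html>", "",
    r"\s*</p><p>\s*", " ", r"<p>", "", r"</p>", "",
    r"\s*<ul>", " (", r"</ul>\s*", ") ",
    r"\s*</li><li>\s*", "; ", r"<li>\s*", "", r"\s*</li>", "",
]


def apply_subs(text, subs=SUB):
    """Apply the pairs in order: subs[0] -> subs[1], subs[2] -> subs[3], ..."""
    for pattern, replacement in zip(subs[0::2], subs[1::2]):
        text = re.sub(pattern, replacement, text)
    return text


# Shortened, hypothetical version of a tvgids.nl description with a list.
desc = ("Met het laatste nieuws. <ul><li> Dode bij aanslag </li>"
        "<li> Het weer</li></ul>")
print(repr(apply_subs(desc)))
```

The order matters: the combined "</li><li>" pattern has to run before the single "<li>" and "</li>" patterns, otherwise the "; " separators are never inserted. The "</ul>" replacement leaves a trailing space, which a caller may want to strip.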
from tvgrabpyapi.
Oh, and one tip. With tv_grab_nl3.py --clear-source 3 you remove all data from tvgids.nl from your database. The current day is always freshly fetched, but days further in the future are retrieved from your database, so without clearing it will take 2 weeks for all HTML-tagged descriptions to disappear.
After clearing, it will take several fetches to get back up to 14 days, as a maximum of 3 or 4 days is fetched each time. But that is still faster than 14 days.
from tvgrabpyapi.
This is known, but I haven't had time yet to look into it. tvgids.nl is the source with the highest priority, so unless you set prefered_description for a channel to another source, it will be used. I have them mostly set to 7 (vpro.nl). That way you only get the tvgids.nl description if vpro.nl does not offer one.
from tvgrabpyapi.
Yeah! That would work fine for a workaround for now. Thanks 😉.
However, the information provided by tvgids.nl is the most detailed information I've encountered yet. So a fix would be nice of course!
from tvgrabpyapi.
The problem, if I remember well, is that those tags are not from their website but are embedded in the text. This possibly means I have to create extra functionality in tvgrabpyAPI to catch them, which takes more time.
from tvgrabpyapi.
It needs an extra html decoding pass after decoding the page and grabbing the data or a very good regex for search and replace.
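A minimal sketch of what such an extra decoding pass could look like, using Python's standard html.parser module. The class name TextOnly and the function strip_tags are hypothetical and not part of tvgrabpyAPI; the sample text is made up.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text content of a fragment and drop every tag."""

    def __init__(self):
        # convert_charrefs=True also decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    # Collapse the whitespace runs left behind where tags used to be.
    return " ".join("".join(parser.parts).split())


print(strip_tags("<p>Nieuws &amp; weer.</p><ul><li> Sport </li><li> Het weer</li></ul>"))
```

Note that this drops the list structure completely, and adjacent items only stay separated if the source text has whitespace around them. Keeping a readable separator would need extra handling in handle_starttag or handle_endtag.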
from tvgrabpyapi.
A regex could be placed in the datadef, so perhaps you can think one up?
from tvgrabpyapi.
I understand. I've tried removing HTML with regex before, but there is always a case where it doesn't work...
The simplest regex I can think of is this: <[^>]*>, which removes everything between < and >. This is maybe a bit too aggressive, as it could also remove text that is not an HTML tag.
A more restrictive approach would be to specify common HTML tags. For example: <(?:html|\/html|div|\/div|br|p|\/p|ul|\/ul|li|\/li)[^>]*>. However, in this case we'll most likely miss some tags.
The best way to tackle this issue for good would indeed be to implement an HTML decoding mechanism.
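To illustrate the trade-off between the two patterns, here is a small comparison on a made-up description that contains a literal < and > as well as a tag outside the restricted list. The sample text is hypothetical.

```python
import re

# The two patterns proposed in this comment, unchanged.
AGGRESSIVE = re.compile(r"<[^>]*>")
RESTRICTIVE = re.compile(
    r"<(?:html|\/html|div|\/div|br|p|\/p|ul|\/ul|li|\/li)[^>]*>")

text = "Kans <10% of >5%. <p>Nieuws</p><span>x</span>"

# The aggressive pattern also eats the plain text between "<" and ">".
aggressive_result = AGGRESSIVE.sub("", text)
# The restrictive pattern keeps that text, but leaves <span> in place.
restrictive_result = RESTRICTIVE.sub("", text)

print(aggressive_result)
print(restrictive_result)
```

One more caveat with the restrictive pattern as written: because the tag name is followed directly by [^>]*, the alternative p also matches tags such as <pre>, and li also matches <link>. Adding a word boundary (\b) after the group would avoid that.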
from tvgrabpyapi.
I think you should also include ul and li tags for tvgids.nl (https://github.com/tvgrabbers/sourcematching/blob/master/sources/source-tvgids.nl.json#L62).
I've just tested the new release, but these tags aren't removed:
<programme start="20190429022500 +0200" stop="20190429023000 +0200" channel="0-1">
<title lang="nl">NOS Journaal</title>
<desc lang="nl">Met het laatste nieuws, gebeurtenissen van nationaal en internationaal belang en de weersverwachting voor vandaag. <ul><li> Dode bij aanslag op synagoge in de VS </li><li> Eerste toeristen terug uit Sri Lanka </li><li> Agent Maastricht aangereden </li><li> Doden bij kraanongeval in Seattle </li><li> Het weer</li></ul></desc>
<date>2017</date>
<category>News</category>
<previously-shown/>
</programme>
from tvgrabpyapi.
Thanks, and now to determine what to replace them with to keep it readable. I can't use new lines, so I guess it will become a ;-separated list, maybe enclosed in brackets.
from tvgrabpyapi.
Yeah, would probably be best. Maybe add some other common tags as well (https://www.w3schools.com/tags/)?
One other consideration. Right now tags with attributes wouldn't be removed. For example:
<p style="xxxx"></p>
I've never encountered a situation with tvgids.nl yet where this would be necessary, but it might become necessary in the future?
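A small sketch of the difference, assuming the tags are removed with literal patterns such as "<p>". The attribute-tolerant pattern shown here is only a suggestion and is not taken from the datadef.

```python
import re

text = 'Intro <p style="xxxx">Body</p>'

# A literal "<p>" pattern does not match an opening tag with attributes.
literal_result = re.sub(r"</p>", "", re.sub(r"<p>", "", text))

# Allowing anything up to the closing ">" after the tag name covers both
# cases; \b keeps the pattern from matching other tags such as <pre>.
tolerant_result = re.sub(r"</?p\b[^>]*>", "", text)

print(literal_result)
print(tolerant_result)
```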
from tvgrabpyapi.
These are only basic layout tags. They come from a text field inside a JSON data page, so there should not be more than that, as otherwise it could interfere with the frontend using the data. So definitely no style data.
from tvgrabpyapi.
Yeah, that's true! Forgot about that.
from tvgrabpyapi.