Hi, I've noticed that the tvgids.nl source includes HTML tags in the

Tvgids.nl HTML in description output about tvgrabpyapi HOT 14 CLOSED

tvgrabbers commented on September 25, 2024

Tvgids.nl HTML in description output

from tvgrabpyapi.

Comments (14)

hikavdh commented on September 25, 2024 1

Well, it was simpler than I thought. It was mostly that I have had little time over the last few months. Early this winter tvgids.nl renewed their sites and I hadn't found time to do more than the basics. Now the genres and the detail pages are working again as well.

hikavdh commented on September 25, 2024 1

Added ul and li. It is a set of re.sub statements:

"sub": ["<html>", "", "</html>", "", 
	"\\s*</p><p>\\s*", " ", "<p>", "", "</p>", "", 
	"\\s*<ul>", " (", "</ul>\\s*", ") ", 
	"\\s*</li><li>\\s*", "; ", "<li>\\s*", "", "\\s*</li>", ""]
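
To see what these pairs do to a description, here is a minimal Python sketch. It assumes the list is read as alternating pattern/replacement pairs applied in order with re.sub; how tvgrabpyAPI applies them internally is not shown in this thread. The helper name apply_subs, the shortened sample text, and the final strip() are illustrative assumptions.

```python
import re

# Pattern/replacement pairs copied from the "sub" list above
# (the JSON "\\s" becomes \s in a Python raw string).
SUB = ["<html>", "", "</html>", "",
       r"\s*</p><p>\s*", " ", "<p>", "", "</p>", "",
       r"\s*<ul>", " (", r"</ul>\s*", ") ",
       r"\s*</li><li>\s*", "; ", r"<li>\s*", "", r"\s*</li>", ""]


def apply_subs(text, subs=SUB):
    """Apply each (pattern, replacement) pair in order (hypothetical helper)."""
    for pattern, replacement in zip(subs[0::2], subs[1::2]):
        text = re.sub(pattern, replacement, text)
    # Assumption: trim the trailing space left by the "</ul>" replacement.
    return text.strip()


desc = ("Met het laatste nieuws. <ul><li> Dode bij aanslag </li>"
        "<li> Het weer</li></ul>")
print(apply_subs(desc))
# Met het laatste nieuws. (Dode bij aanslag; Het weer)
```

The order matters: the "</li><li>" pair has to run before the single "<li>" and "</li>" pairs, otherwise the separators between items would be lost.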

hikavdh commented on September 25, 2024 1

Oh, and one tip: with tv_grab_nl3.py --clear-source 3 you remove all tvgids.nl data from your database. The current day is always freshly fetched, but days further in the future are retrieved from your database, so otherwise it would take 2 weeks for all HTML-tagged descriptions to disappear.
After clearing, it will take several fetches to get back up to 14 days, as a maximum of 3 or 4 days is fetched each time. But that is still faster than 14 days.

hikavdh commented on September 25, 2024

This is known, but I haven't had time yet to look into it. tvgids.nl is the source with the highest priority, so unless you set prefered_description for a channel to another source, its description will be used. I have mine mostly set to 7 (vpro.nl). That way you only get the tvgids.nl description if vpro.nl does not offer one.

mitchellklijs commented on September 25, 2024

Yeah! That would work fine for a workaround for now. Thanks 😉.

However, the information provided by tvgids.nl is the most detailed information I've encountered yet. So a fix would be nice of course!

hikavdh commented on September 25, 2024

The problem, if I remember correctly, is that those tags do not come from their website markup but are embedded in the text itself. This possibly means I have to create extra functionality in tvgrabpyAPI to catch them, which takes more time.

hikavdh commented on September 25, 2024

It needs either an extra HTML decoding pass after decoding the page and grabbing the data, or a very good regex for search and replace.
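
As an illustration of what such an extra decoding pass could look like, here is a minimal sketch using Python's standard html.parser module. It is not how tvgrabpyAPI works; the class TextOnly and the function strip_html are hypothetical names, and joining text chunks with a space is an assumption that suits list items but would also split a word wrapped in an inline tag.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and drop all tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True also decodes entities such as &amp;.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    parser = TextOnly()
    parser.feed(text)
    parser.close()
    # Join the chunks with a space, then collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())


print(strip_html("<p>Nieuws &amp; weer.</p><ul><li> Het weer</li></ul>"))
# Nieuws & weer. Het weer
```

Unlike a fixed list of substitutions, this drops every tag, known or unknown, but it also loses the list structure unless extra handling for li is added.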

hikavdh commented on September 25, 2024

A regex could be placed in the datadef, if you can think one up?

mitchellklijs commented on September 25, 2024

I understand. I've tried removing HTML with regex before, but there is always a case where it doesn't work...

The simplest regex I can think of is <[^>]*>, which removes everything between < and >. This is maybe a bit too aggressive, as it could also remove text that is not an HTML tag.
A more restrictive approach would be to specify common HTML tags, for example: <(?:html|\/html|div|\/div|br|p|\/p|ul|\/ul|li|\/li)[^>]*>. However, in this case we'll most likely miss some tags.
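
The trade-off between the two patterns can be checked with a small Python sketch. The two sample strings are made up for illustration and are not taken from tvgids.nl.

```python
import re

# The two patterns from the comment above.
ANY_TAG = re.compile(r"<[^>]*>")
KNOWN_TAGS = re.compile(
    r"<(?:html|\/html|div|\/div|br|p|\/p|ul|\/ul|li|\/li)[^>]*>")

html_text = "<p>Het weer</p><span>morgen</span>"
plain_text = "Temperatuur <10 graden, wind >5 Bft"

# The broad pattern strips every tag, including the unlisted <span>.
print(ANY_TAG.sub("", html_text))      # Het weermorgen
# The restrictive pattern leaves tags it does not know about.
print(KNOWN_TAGS.sub("", html_text))   # Het weer<span>morgen</span>
# The broad pattern also eats plain text between "<" and ">".
print(ANY_TAG.sub("", plain_text))     # Temperatuur 5 Bft
# The restrictive pattern leaves that text alone.
print(KNOWN_TAGS.sub("", plain_text))  # Temperatuur <10 graden, wind >5 Bft
```

Note that the restrictive pattern only checks how a tag name starts, so the p alternative followed by [^>]* would also match a tag such as <pre>.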

The best way to tackle this issue for good would indeed be to implement an HTML decoding mechanism.

mitchellklijs commented on September 25, 2024

I think you should also include ul and li tags for tvgids.nl (https://github.com/tvgrabbers/sourcematching/blob/master/sources/source-tvgids.nl.json#L62).

I've just tested the new release, but these tags aren't removed:

  <programme  start="20190429022500 +0200" stop="20190429023000 +0200" channel="0-1">
    <title  lang="nl">NOS Journaal</title>
    <desc  lang="nl">Met het laatste nieuws, gebeurtenissen van nationaal en internationaal belang en de weersverwachting voor vandaag. &lt;ul&gt;&lt;li&gt; Dode bij aanslag op synagoge in de VS &lt;/li&gt;&lt;li&gt; Eerste toeristen terug uit Sri Lanka &lt;/li&gt;&lt;li&gt; Agent Maastricht aangereden &lt;/li&gt;&lt;li&gt; Doden bij kraanongeval in Seattle &lt;/li&gt;&lt;li&gt; Het weer&lt;/li&gt;&lt;/ul&gt;</desc>
    <date>2017</date>
    <category>News</category>
    <previously-shown/>
  </programme>

hikavdh commented on September 25, 2024

Thanks. Now to determine what to replace them with to keep it readable. I can't use newlines, so I guess it will become a semicolon-separated list, maybe enclosed in brackets.

mitchellklijs commented on September 25, 2024

Yeah, would probably be best. Maybe add some other common tags as well (https://www.w3schools.com/tags/)?

One other consideration. Right now tags with attributes wouldn't be removed. For example:

<p style="xxxx"></p>

I've never encountered a situation with tvgids.nl yet where this would be necessary, but it might become necessary in the future?
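
A short Python sketch of the concern: a literal <p> pattern does not match a tag that carries attributes, while a pattern that allows optional attributes does. The second pattern is a suggestion for illustration only and is not part of the current datadef.

```python
import re

text = '<p style="xxxx">Het weer</p>'

# Literal pattern, as in the current substitutions: the attribute defeats it.
print(re.sub(r"<p>", "", text))
# <p style="xxxx">Het weer</p>

# Allow optional attributes; \b keeps <pre> or <param> from matching.
print(re.sub(r"<p\b[^>]*>", "", text))
# Het weer</p>
```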

hikavdh commented on September 25, 2024

These are only basic layout tags. The text comes from a text field inside a JSON data page, so it should not contain more than that, as it could otherwise interfere with the frontend using the data. So definitely no style data.

mitchellklijs commented on September 25, 2024

Yeah, that's true! Forgot about that.
