Comments (14)

hikavdh commented on September 25, 2024

Well, it was simpler than I thought. Mostly I have had little time these last months. Early this winter tvgids.nl renewed their sites, and I hadn't found time to do more than the basics. Now the genres and the detail pages are working again too.

from tvgrabpyapi.

hikavdh commented on September 25, 2024

Added ul and li. It is a set of re.sub statements:

"sub": ["<html>", "", "</html>", "", 
	"\\s*</p><p>\\s*", " ", "<p>", "", "</p>", "", 
	"\\s*<ul>", " (", "</ul>\\s*", ") ", 
	"\\s*</li><li>\\s*", "; ", "<li>\\s*", "", "\\s*</li>", ""]
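
As a rough illustration of what that list does, here is a minimal sketch that applies the pairs in order with re.sub. How tvgrabpyAPI actually consumes the "sub" entry is not shown in this thread, so the apply_subs helper and the in-order, pairwise application are assumptions; the sample text is made up.

```python
import re

# Flat list copied from the "sub" entry above (after JSON decoding):
# pattern, replacement, pattern, replacement, ...
SUB = ["<html>", "", "</html>", "",
       r"\s*</p><p>\s*", " ", "<p>", "", "</p>", "",
       r"\s*<ul>", " (", r"</ul>\s*", ") ",
       r"\s*</li><li>\s*", "; ", r"<li>\s*", "", r"\s*</li>", ""]


def apply_subs(text, subs=SUB):
    """Hypothetical helper: run each (pattern, replacement) pair in list order."""
    for pattern, replacement in zip(subs[0::2], subs[1::2]):
        text = re.sub(pattern, replacement, text)
    return text


sample = ("Met het laatste nieuws. <ul><li> Eerste item </li>"
          "<li> Tweede item </li><li> Het weer</li></ul>")
result = apply_subs(sample)
print(repr(result))
# 'Met het laatste nieuws. (Eerste item; Tweede item; Het weer) '
```

The order matters: the combined </li><li> rule has to run before the single <li> and </li> rules, otherwise the "; " separators would never be inserted.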

hikavdh commented on September 25, 2024

Oh, and one tip: with tv_grab_nl3.py --clear-source 3 you remove all tvgids.nl data from your database. The current day is always freshly fetched, but days further in the future are retrieved from your database, so without clearing it will take 2 weeks for all HTML-tagged descriptions to disappear.
After clearing, it will take several fetches to get back up to 14 days, as a maximum of 3 or 4 days is fetched each time, but that is still faster than 14 days.

hikavdh commented on September 25, 2024

This is known, but I haven't had time yet to look into it. tvgids.nl is the source with the highest priority, so unless you set prefered_description for a channel to another source, its description will be used. I have them mostly set to 7 (vpro.nl). That way you only get the tvgids.nl description if vpro.nl does not offer one.

mitchellklijs commented on September 25, 2024

Yeah! That would work fine as a workaround for now. Thanks 😉.

However, the information provided by tvgids.nl is the most detailed information I've encountered yet. So a fix would be nice of course!

hikavdh commented on September 25, 2024

The problem, if I remember correctly, is that those tags are not part of their website's markup but are embedded in the text itself. This possibly means I have to create extra functionality in tvgrabpyAPI to catch them, which takes more time.

hikavdh commented on September 25, 2024

It needs either an extra HTML decoding pass after decoding the page and grabbing the data, or a very good regex for search and replace.
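
For illustration, such an extra decoding pass could look roughly like the sketch below, using only the Python standard library. This is not how tvgrabpyAPI does it; the strip_markup helper, the _TextOnly class, and the sample string are all hypothetical, and it assumes the grabbed text may arrive either with literal tags or with entity-escaped ones.

```python
import html
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text nodes and drop every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(text):
    """Hypothetical second pass: unescape entities, then keep text nodes only."""
    parser = _TextOnly()
    parser.feed(html.unescape(text))
    parser.close()
    # Collapse the whitespace runs left behind where tags were removed.
    return " ".join("".join(parser.parts).split())


raw = "Nieuws. &lt;ul&gt;&lt;li&gt; Item A &lt;/li&gt;&lt;li&gt; Item B&lt;/li&gt;&lt;/ul&gt;"
plain = strip_markup(raw)
print(plain)  # Nieuws. Item A Item B
```

Note that this variant throws the list structure away entirely, whereas a search-and-replace approach can turn list items into a readable separated list.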

hikavdh commented on September 25, 2024

A regex could be placed in the datadef, if you can think one up?

mitchellklijs commented on September 25, 2024

I understand. I've tried removing HTML with regex before, but there is always a case where it doesn't work...

The simplest regex I can think of is <[^>]*>, which removes everything between < and >. This is maybe a bit too aggressive, as it could also remove text that is not an HTML tag.
A more restrictive approach would be to list common HTML tags, for example: <(?:html|\/html|div|\/div|br|p|\/p|ul|\/ul|li|\/li)[^>]*>. However, in this case we'll most likely miss some tags.
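
To make the trade-off concrete, here is a small sketch that runs both patterns over one made-up string. The sample text and the variable names are assumptions for illustration only.

```python
import re

# The two patterns quoted above, unchanged.
ANY_TAG = re.compile(r"<[^>]*>")
KNOWN_TAGS = re.compile(r"<(?:html|\/html|div|\/div|br|p|\/p|ul|\/ul|li|\/li)[^>]*>")

# Made-up sample: real tags, an unlisted tag (<b>), and a plain "<" ... ">" in the text.
text = "<p>Film <b>tip</b>: rating < 5 or > 8</p><pre>x</pre>"

aggressive = ANY_TAG.sub("", text)
restrictive = KNOWN_TAGS.sub("", text)

print(repr(aggressive))   # 'Film tip: rating  8x'
print(repr(restrictive))  # 'Film <b>tip</b>: rating < 5 or > 8x'
```

The aggressive pattern eats the "< 5 or >" part of the sentence. The restrictive one keeps it but leaves <b> behind, and because nothing forces the tag name to end after "p", it also removes <pre> and </pre>.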

The best way to tackle this issue for good would indeed be to implement an HTML decoding mechanism.

mitchellklijs commented on September 25, 2024

I think you should also include ul and li tags for tvgids.nl (https://github.com/tvgrabbers/sourcematching/blob/master/sources/source-tvgids.nl.json#L62).

I've just tested the new release, but these tags aren't removed:

  <programme  start="20190429022500 +0200" stop="20190429023000 +0200" channel="0-1">
    <title  lang="nl">NOS Journaal</title>
    <desc  lang="nl">Met het laatste nieuws, gebeurtenissen van nationaal en internationaal belang en de weersverwachting voor vandaag. &lt;ul&gt;&lt;li&gt; Dode bij aanslag op synagoge in de VS &lt;/li&gt;&lt;li&gt; Eerste toeristen terug uit Sri Lanka &lt;/li&gt;&lt;li&gt; Agent Maastricht aangereden &lt;/li&gt;&lt;li&gt; Doden bij kraanongeval in Seattle &lt;/li&gt;&lt;li&gt; Het weer&lt;/li&gt;&lt;/ul&gt;</desc>
    <date>2017</date>
    <category>News</category>
    <previously-shown/>
  </programme>
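
A quick way to check a grabbed file for leftovers like these could be a sketch along the following lines. The leftover_tags helper, the TAG pattern, and the shortened fragment (wrapped in a <tv> root element so it parses on its own) are assumptions for illustration, not part of the grabber.

```python
import re
import xml.etree.ElementTree as ET

# Shortened version of the programme above, wrapped in a <tv> root element.
fragment = """<tv>
  <programme start="20190429022500 +0200" stop="20190429023000 +0200" channel="0-1">
    <title lang="nl">NOS Journaal</title>
    <desc lang="nl">Het laatste nieuws. &lt;ul&gt;&lt;li&gt; Het weer&lt;/li&gt;&lt;/ul&gt;</desc>
  </programme>
</tv>"""

# Anything that looks like an opening or closing tag inside the description text.
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def leftover_tags(xmltv_text):
    """Return (title, [tags]) for every programme whose desc still holds markup."""
    hits = []
    for prog in ET.fromstring(xmltv_text).iter("programme"):
        # ElementTree already decodes &lt; and &gt;, so desc contains literal tags.
        desc = prog.findtext("desc", default="")
        tags = TAG.findall(desc)
        if tags:
            hits.append((prog.findtext("title", default=""), tags))
    return hits


report = leftover_tags(fragment)
print(report)  # [('NOS Journaal', ['<ul>', '<li>', '</li>', '</ul>'])]
```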

hikavdh commented on September 25, 2024

Thanks. Now to determine what to replace them with to keep it readable. I can't use newlines, so I guess it will become a ;-separated list, maybe enclosed in brackets.

mitchellklijs commented on September 25, 2024

Yeah, would probably be best. Maybe add some other common tags as well (https://www.w3schools.com/tags/)?

One other consideration. Right now tags with attributes wouldn't be removed. For example:

<p style="xxxx"></p>

I've never encountered a situation with Tvgids.nl yet where this would be necessary, but it might become necessary in the future?
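
A minimal sketch of the difference, in case it ever becomes relevant. The attribute-tolerant pattern below is only a suggestion, not something taken from the datadef, and the sample string is made up.

```python
import re

LITERAL = "<p>"            # matches the bare tag only
TOLERANT = r"<p\b[^>]*>"   # hypothetical variant: tag name, then any attributes

sample = '<p style="color:red">Tekst</p><pre>x</pre>'

with_literal = re.sub(LITERAL, "", sample)
with_tolerant = re.sub(TOLERANT, "", sample)

print(with_literal)   # <p style="color:red">Tekst</p><pre>x</pre>
print(with_tolerant)  # Tekst</p><pre>x</pre>
```

The \b word boundary keeps the tolerant pattern from also matching longer tag names that start with "p", such as <pre>.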

hikavdh commented on September 25, 2024

These are only basic layout tags. The text comes from a text field inside a JSON data page, so it should not contain more than that, or else it could interfere with the frontend using the data. So definitely no style data.

mitchellklijs commented on September 25, 2024

Yeah, that's true! Forgot about that.
