coleifer / micawber


a small library for extracting rich content from urls

Home Page: http://micawber.readthedocs.org/

License: MIT License

Python 99.61% HTML 0.39%

oembed python


micawber's Issues

New release?

The most recent commit is a valuable addition and it would be great to see it rolled into an "official" release. In the meantime, deploying via git is working.

performance suggestion

I'm considering migrating to micawber from a custom oembed consumer, and wanted to suggest a performance improvement that I am willing to generate a PR for.

I'd like to extend the ProviderRegistry with a secondary internal registry that nests providers under domain names.

This would let users optionally skip the regex match against every provider and test only the providers registered for the URL's domain.

Some light tests on a quick mockup showed lookups running in about 30% of the time, including the overhead of parsing the domain name from the URL, and in about 5% of the time if you already have the domain.

We would be using this on a high-volume indexer, so we need this performance.
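A minimal sketch of the idea, not micawber's actual implementation: the class name `DomainIndexedRegistry` and the `domain` argument are hypothetical, and providers are represented by plain values.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


class DomainIndexedRegistry:
    """Hypothetical registry that buckets (regex, provider) pairs by domain."""

    def __init__(self):
        self._by_domain = defaultdict(list)  # domain -> [(compiled regex, provider)]
        self._fallback = []                  # patterns registered without a domain

    def register(self, regex, provider, domain=None):
        entry = (re.compile(regex), provider)
        if domain:
            self._by_domain[domain.lower()].append(entry)
        else:
            self._fallback.append(entry)

    def provider_for_url(self, url, domain=None):
        # Parse the host only when the caller has not supplied it already.
        if domain is None:
            domain = urlparse(url).hostname or ''
        parts = domain.lower().split('.')
        # Try the full host first, then parent domains (www.youtube.com -> youtube.com).
        for i in range(max(1, len(parts) - 1)):
            candidate = '.'.join(parts[i:])
            for pattern, provider in self._by_domain.get(candidate, ()):
                if pattern.match(url):
                    return provider
        # Fall back to the linear scan for patterns that have no domain bucket.
        for pattern, provider in self._fallback:
            if pattern.match(url):
                return provider
        return None
```

Only the patterns in the matching bucket are tested, and passing `domain=` skips the URL parse entirely, which corresponds to the faster of the two cases described above.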

feature request: add media.ccc.de integration

Hi!
I first reported this to nikola by mistake (as a request for more features), so I'm now reporting it here as a feature request:
It would be great to integrate videos/ streams from https://media.ccc.de into this library.

The service is run by the German hacker association Chaos Computer Club (CCC), which hosts annual events itself and lends streaming expertise to many external events via its Video Operation Center (VOC).

The streaming service is a valuable source of information on many different topics and I think it would be an awesome addition!

If you have pointers on where I can add it (I assume somewhere in providers.py), I might be able to do a pull request myself. I wouldn't call myself a Python expert though :-)

Limit number of rendered links

Hello,
There is a security concern that oEmbed solutions generally do not address: if you use them to render media from user input, a malicious user can fill their input with dozens or hundreds of links in order to clutter the pages other viewers see.
Is there a simple way to limit the number of links micawber parses?
Thanks
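One possible workaround is to cap the links before the text reaches micawber. This is a sketch only: `limit_urls` and the simplified `URL_RE` pattern are assumptions for illustration, not part of the library.

```python
import re

# Simplified URL pattern (an assumption); micawber's own URL regex may differ.
URL_RE = re.compile(r'https?://\S+')


def limit_urls(text, max_links, replacement='[link removed]'):
    """Keep the first `max_links` URLs in `text` and replace any further ones."""
    seen = 0

    def _sub(match):
        nonlocal seen
        seen += 1
        return match.group(0) if seen <= max_links else replacement

    return URL_RE.sub(_sub, text)
```

The result can then be handed to `micawber.parse_text` as usual, so at most `max_links` embeds are rendered per input.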

Media with https

I am trying to embed a Dailymotion video in my nikola blog post.

The video is embedded over http no matter what configuration I use.
I tried both http and https with the following syntax:

# With http
.. media:: http://www.dailymotion.com/video/x1apjif_une-arbalete-de-poche-fabriquee-manuellement_tv

# With https
.. media:: https://www.dailymotion.com/video/x1apjif_une-arbalete-de-poche-fabriquee-manuellement_tv

The problem is that the video is hidden by Firefox when I use the https version of my blog.

Is this a bug in micawber or, as @RAISINA mentioned here, is it a dailymotion issue?

If it can be of any help, I am currently using:

nikola 7.7.7
micawber 0.3.3

Here is my original issue

ModuleNotFoundError: No module named 'micawber.contrib.mcdjango.mcdjango_tests'

Since 0.3.7 I have troubles running the tests during packaging.

running test
running egg_info
writing micawber.egg-info/PKG-INFO
writing dependency_links to micawber.egg-info/dependency_links.txt
writing top-level names to micawber.egg-info/top_level.txt
reading manifest file 'micawber.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'micawber.egg-info/SOURCES.txt'
running build_ext
test_extract (micawber.tests.ParserTestCase) ... ok
test_html_entities (micawber.tests.ParserTestCase) ... ok
test_multiline (micawber.tests.ParserTestCase) ... ok
test_multiline_full (micawber.tests.ParserTestCase) ... ok
test_outside_of_markup (micawber.tests.ParserTestCase) ... ok
test_parse_text (micawber.tests.ParserTestCase) ... ok
test_parse_text_full (micawber.tests.ParserTestCase) ... ok
test_urlize (micawber.tests.ParserTestCase) ... ok
test_caching (micawber.tests.ProviderTestCase) ... ok
test_caching_params (micawber.tests.ProviderTestCase) ... ok
test_invalid_json (micawber.tests.ProviderTestCase) ... ok
test_multiple_matches (micawber.tests.ProviderTestCase) ... ok
test_provider (micawber.tests.ProviderTestCase) ... ok
test_provider_matching (micawber.tests.ProviderTestCase) ... ok
test_register_unregister (micawber.tests.ProviderTestCase) ... ok

----------------------------------------------------------------------
Ran 15 tests in 0.082s

OK
Running micawber tests
All micawber tests passed
Running django integration tests
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/django/apps/config.py", line 118, in create
    cls = getattr(mod, cls_name)
AttributeError: module 'micawber.contrib.mcdjango' has no attribute 'mcdjango_tests'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "setup.py", line 35, in <module>
    test_suite='runtests.runtests',
  File "/usr/lib/python3.7/site-packages/setuptools/__init__.py", line 140, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/lib/python3.7/site-packages/setuptools/command/test.py", line 228, in run
    self.run_tests()
  File "/usr/lib/python3.7/site-packages/setuptools/command/test.py", line 250, in run_tests
    exit=False,
  File "/usr/lib/python3.7/unittest/main.py", line 100, in __init__
    self.parseArgs(argv)
  File "/usr/lib/python3.7/unittest/main.py", line 147, in parseArgs
    self.createTests()
  File "/usr/lib/python3.7/unittest/main.py", line 159, in createTests
    self.module)
  File "/usr/lib/python3.7/unittest/loader.py", line 220, in loadTestsFromNames
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/usr/lib/python3.7/unittest/loader.py", line 220, in <listcomp>
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/usr/lib/python3.7/unittest/loader.py", line 205, in loadTestsFromName
    test = obj()
  File "/build/python-micawber/src/python-micawber-0.3.7/runtests.py", line 80, in runtests
    dj_failures = run_django_tests()
  File "/build/python-micawber/src/python-micawber-0.3.7/runtests.py", line 60, in run_django_tests
    setup()
  File "/usr/lib/python3.7/site-packages/django/__init__.py", line 24, in setup
    apps.populate(settings.INSTALLED_APPS)
  File "/usr/lib/python3.7/site-packages/django/apps/registry.py", line 89, in populate
    app_config = AppConfig.create(entry)
  File "/usr/lib/python3.7/site-packages/django/apps/config.py", line 123, in create
    import_module(entry)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'micawber.contrib.mcdjango.mcdjango_tests'

It seems currently only the templates of mcdjango are packaged.

Upload next version to PyPI

The last version on PyPI is from 2015. Could a more recent version be uploaded?

Option to prevent converting inline links

I have a Markdown RichText field in my Django app, and I'm using micawber to convert video links in it into embedded videos. However, I only want micawber to convert links that are on their own line into embedded media. I don't want it to convert my Markdown links; the Markdown converter will take care of those.

So far the text is first run through an oembed_no_urlize function as described in your documentation:

from micawber.contrib.mcdjango import extension

oembed_no_urlize = extension('oembed', urlize_all=False)

Inline YouTube links are still oEmbed converted though, so a Markdown link like

[5 minutter og ti sekunder](http://www.youtube.com/watch?v=chbOViRudAg&t=5m10s)

is converted into

<a href='a href="http://www.youtube.com/watch?v=chbOViRudAg&amp;t=5m10s" title="Joo Sae Hyuk Vs Chuang Chih Yuan: WTTC 2014: 1/4 Final AMAZING MATCH">Joo Sae Hyuk Vs Chuang Chih Yuan: WTTC 2014: 1/4 Final AMAZING MATCH</a'>5 minutter og ti sekunder</a>

when first converted by micawber and then markdown.

Is it possible to disable all inline conversion?
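A sketch of one way to get this behaviour outside the library: pre-process the text so that only URLs occupying a whole line are converted. The function name `convert_standalone_urls` and the `render` callback are hypothetical; `render` could wrap a registry lookup that returns the embed HTML.

```python
import re

# A URL counts as "standalone" when it is the only thing on its line.
STANDALONE_URL_RE = re.compile(r'^[ \t]*(https?://\S+)[ \t]*$', re.MULTILINE)


def convert_standalone_urls(text, render):
    """Replace whole-line URLs with render(url); leave inline URLs untouched."""
    return STANDALONE_URL_RE.sub(lambda m: render(m.group(1)), text)
```

Markdown links such as `[text](http://...)` never match because the line does not start with the URL, so the Markdown converter still sees them unchanged.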

Django 1.10 support

Hello! Which versions of Django does micawber support?
With Django 1.9.x I now get RemovedInDjango110Warning warnings in the log.

.../site-packages/django/template/loader.py:97: RemovedInDjango110Warning: render() must be called with a dict, not a Context.
  return template.render(context, request)

It's because of the render_to_string function. I looked through the Django 1.8-1.10 docs, and it looks like this function really does expect a dict.

Option to customize fallback behavior if provider not available

At present if no provider is found for a URL, and urlize_all is True, the urlize function appears to always be called which renders a simple link. There doesn't appear to be a way to change this.

I'd like to be able to customize this fallback behavior - perhaps by passing in a function as is done with the handlers - for example if I want to render the link with target="_blank", or use the domain instead of the full URL in the title, etc.

Is there a way to do this at present?
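To make the request concrete, here is a sketch of the kind of fallback function one might want to pass in. `urlize_new_tab` is a hypothetical name, and whether micawber accepts such a callback is exactly the open question.

```python
from html import escape
from urllib.parse import urlparse


def urlize_new_tab(url):
    """Hypothetical fallback renderer: show the domain and open in a new tab."""
    domain = urlparse(url).hostname or url
    return '<a href="%s" target="_blank" rel="noopener">%s</a>' % (
        escape(url, quote=True),  # escape &, <, > and quotes for the attribute
        escape(domain),
    )
```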

How to bootstrap Iframely?

Hi Charles, this is a separate ticket to continue this discussion.

We added the description of Iframely's approach to providers here: https://iframely.com/docs/providers.

Though our preference would be to bootstrap for all URLs as Iframely can generate summary cards, handle link shorteners, detect direct image links, etc.

Another issue is the API endpoint address:

cloud oEmbed API is at http://iframe.ly/api/oembed?url=...&api_key=....
however, Iframely can also be self-hosted and developers can have their own API endpoint address as the Iframely is open-source

Any suggestions?

bootstrap_embedly() is not python 3 compatible

Using Python 3.4.2 in ubuntu

In [1]: import micawber

In [2]: micawber.bootstrap_embedly()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-2419c7664bfa> in <module>()
----> 1 micawber.bootstrap_embedly()

/home/tin/.virtualenvs/waliki/lib/python3.4/site-packages/micawber/providers.py in bootstrap_embedly(cache, **params)
    203     resp.close()
    204 
--> 205     json_data = json.loads(contents)
    206 
    207     for provider_meta in json_data:

/usr/lib/python3.4/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    310     if not isinstance(s, str):
    311         raise TypeError('the JSON object must be str, not {!r}'.format(
--> 312                             s.__class__.__name__))
    313     if s.startswith(u'\ufeff'):
    314         raise ValueError("Unexpected UTF-8 BOM (decode using utf-8-sig)")

TypeError: the JSON object must be str, not 'bytes'
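The traceback shows `json.loads` receiving bytes, which Python 3.4 rejects. A minimal sketch of a fix, assuming the response body is UTF-8 encoded (`load_json_response` is a hypothetical helper, not the library's code):

```python
import json


def load_json_response(contents, encoding='utf-8'):
    """Decode a bytes payload to str before handing it to json.loads."""
    if isinstance(contents, bytes):
        contents = contents.decode(encoding)
    return json.loads(contents)
```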

Bootstrap provider results in mixed protocol content in HTTPS sites

Using the bootstrap providers results in mixed protocol content in HTTPS sites. I believe using protocol relative urls for providers would fix this...
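Until the providers themselves change, one workaround is to post-process the returned HTML. This sketch rewrites `src="http://..."` attributes to protocol-relative form; the helper name is hypothetical, and it only helps when the provider also serves the resource over https.

```python
import re

# Match the opening of a src attribute that points at a plain-http URL.
SRC_HTTP_RE = re.compile(r'''(\bsrc=["'])http://''')


def make_protocol_relative(html):
    """Rewrite src="http://host/..." to src="//host/..." in embed HTML."""
    return SRC_HTTP_RE.sub(r'\1//', html)
```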

Vimeo and bootstrap_oembedio

In [5]: bootstrap_oembedio().request('http://vimeo.com/111410510')
---------------------------------------------------------------------------
ProviderNotFoundException                 Traceback (most recent call last)
<ipython-input-5-cdc3e12d26dd> in <module>()
----> 1 bootstrap_oembedio().request('http://vimeo.com/111410510')

/home/adas/.virtualenvs/rownosc-info/local/lib/python2.7/site-packages/micawber/providers.pyc in inner(self, url, **params)
     91                 self.cache.set(key, data)
     92             return data
---> 93         return fn(self, url, **params)
     94     return inner
     95 

/home/adas/.virtualenvs/rownosc-info/local/lib/python2.7/site-packages/micawber/providers.pyc in request(self, url, **params)
    132         if provider:
    133             return provider.request(url, **params)
--> 134         raise ProviderNotFoundException('Provider not found for "%s"' % url)
    135 
    136 

ProviderNotFoundException: Provider not found for "http://vimeo.com/111410510"
In [12]: [(k,v) for k,v in bootstrap_oembedio()._registry.items() if 'vimeo' in k]
Out[12]: [(u'vimeo\\.com', <micawber.providers.Provider at 0xb5ebfd0c>)]

Am I doing something wrong?

The other bootstrap functions work:

In [17]: bootstrap_basic().request('http://vimeo.com/111410510')
Out[17]: 
{u'author_name': u'Fundacja Picture Doc',
 u'author_url': u'http://vimeo.com/user8938954',
 u'description': u'Copyright by Fundacja Picture Doc\nCopyright by Fundacja Dialog-Pheniben',
 u'duration': 310,
 u'height': 720,
 u'html': u'<iframe src="//player.vimeo.com/video/111410510" width="1280" height="720" frameborder="0" title="Romowie w Europie. Zag\u0142ada" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>',
 u'is_plus': u'1',
 u'provider_name': u'Vimeo',
 u'provider_url': u'https://vimeo.com/',
 u'thumbnail_height': 720,
 u'thumbnail_url': u'http://i.vimeocdn.com/video/496100635_1280.jpg',
 u'thumbnail_width': 1280,
 u'title': u'Romowie w Europie. Zag\u0142ada',
 u'type': u'video',
 u'uri': u'/videos/111410510',
 'url': 'http://vimeo.com/111410510',
 u'version': u'1.0',
 u'video_id': 111410510,
 u'width': 1280}
In [18]: bootstrap_embedly().request('http://vimeo.com/111410510')
Out[18]: 
{u'author_name': u'Fundacja Picture Doc',
 u'author_url': u'http://vimeo.com/user8938954',
 u'description': u'Copyright by Fundacja Picture Doc Copyright by Fundacja Dialog-Pheniben',
 u'height': 720,
 u'html': u'<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=http%3A%2F%2Fplayer.vimeo.com%2Fvideo%2F111410510&src_secure=1&url=http%3A%2F%2Fvimeo.com%2F111410510&image=http%3A%2F%2Fi.vimeocdn.com%2Fvideo%2F496100635_1280.jpg&type=text%2Fhtml&schema=vimeo" width="1280" height="720" scrolling="no" frameborder="0" allowfullscreen></iframe>',
 u'provider_name': u'Vimeo',
 u'provider_url': u'https://vimeo.com/',
 u'thumbnail_height': 720,
 u'thumbnail_url': u'http://i.vimeocdn.com/video/496100635_1280.jpg',
 u'thumbnail_width': 1280,
 u'title': u'Romowie w Europie. Zag\u0142ada',
 u'type': u'video',
 'url': 'http://vimeo.com/111410510',
 u'version': u'1.0',
 u'width': 1280}
In [19]: bootstrap_noembed().request('http://vimeo.com/111410510')
Out[19]: 
{u'author_name': u'Fundacja Picture Doc',
 u'author_url': u'http://vimeo.com/user8938954',
 u'description': u'Copyright by Fundacja Picture Doc\nCopyright by Fundacja Dialog-Pheniben',
 u'duration': 310,
 u'height': 720,
 u'html': u'\n<div class="noembed-embed ">\n  <div class="noembed-wrapper">\n    \n<div class="noembed-embed-inner noembed-vimeo">\n  <iframe src="//player.vimeo.com/video/111410510" width="1280" height="720" frameborder="0" title="Romowie w Europie. Zag\u0142ada" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>\n</div>\n\n    <table class="noembed-meta-info">\n      <tr>\n        <td class="favicon"><img src="https://noembed.com/favicon/Vimeo.png"></td>\n        <td>Vimeo</td>\n        <td align="right">\n          <a title="http://vimeo.com/111410510" href="http://vimeo.com/111410510">http://vimeo.com/111410510</a>\n        </td>\n      </tr>\n    </table>\n  </div>\n</div>\n',
 u'is_plus': u'1',
 u'provider_name': u'Vimeo',
 u'provider_url': u'https://vimeo.com/',
 u'thumbnail_height': 720,
 u'thumbnail_url': u'http://i.vimeocdn.com/video/496100635_1280.jpg',
 u'thumbnail_width': 1280,
 u'title': u'Romowie w Europie. Zag\u0142ada',
 u'type': u'video',
 u'uri': u'/videos/111410510',
 u'url': u'http://vimeo.com/111410510',
 u'version': u'1.0',
 u'video_id': 111410510,
 u'width': 1280}

oembed.io supports it too:

In [23]: requests.get('http://oembed.io/api?url=http://vimeo.com/111410510').json()
Out[23]: 
{u'author': u'Fundacja Picture Doc',
 u'author_url': u'http://vimeo.com/user8938954',
 u'canonical': u'http://vimeo.com/111410510',
 u'description': u'Copyright by Fundacja Picture Doc\nCopyright by Fundacja Dialog-Pheniben',
 u'duration': 310,
 u'html': u'<div class="oembed-widget-container" style="left: 0px; width: 100%; height: 0px; position: relative; padding-bottom: 56%;"><iframe class="oembed-widget oembed-iframe" src="//player.vimeo.com/video/111410510" frameborder="0" style="top: 0px; left: 0px; width: 100%; height: 100%; position: absolute;"></iframe></div>',
 u'provider_name': u'Vimeo',
 u'thumbnail_height': 720,
 u'thumbnail_url': u'http://i.vimeocdn.com/video/496100635_1280.jpg',
 u'thumbnail_width': 1280,
 u'title': u'Romowie w Europie. Zag\u0142ada',
 u'type': u'rich',
 u'version': u'1.0'}

500px and bootstrap_embedly

In [3]: requests.get('http://api.embed.ly/1/oembed?url=https%3A%2F%2Fiso.500px.com%2Fguest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following%2F&maxwidth=500').json()
Out[3]: 
{u'author_name': u'DL Cade',
 u'author_url': u'https://iso.500px.com/author/dl/',
 u'description': u"One of December's talented 500px Guest Curators was photographer Joel (Julius) Tjintjelaar , and he fully embraced the real purpose of the Editors' Choice section: to unveil photos and photographers that might not have made the Popular page for one reason or another... but probably should have.",
 u'provider_name': u'500px',
 u'provider_url': u'https://iso.500px.com',
 u'thumbnail_height': 1000,
 u'thumbnail_url': u'https://isocdn.500px.org/wp-content/uploads/2014/12/julius-1500x1000.jpg',
 u'thumbnail_width': 1500,
 u'title': u'Guest Curator Joel (Julius) Tjintjelaar Reveals Three Photographers that Should Have a Larger Following',
 u'type': u'link',
 u'url': u'https://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/',
 u'version': u'1.0'}

In [4]: bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')
---------------------------------------------------------------------------
ProviderNotFoundException                 Traceback (most recent call last)
<ipython-input-4-aca3a4c8cf6f> in <module>()
----> 1 bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')

/tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in inner(self, url, **params)
     91                 self.cache.set(key, data)
     92             return data
---> 93         return fn(self, url, **params)
     94     return inner
     95 

/tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in request(self, url, **params)
    132         if provider:
    133             return provider.request(url, **params)
--> 134         raise ProviderNotFoundException('Provider not found for "%s"' % url)
    135 
    136 

ProviderNotFoundException: Provider not found for "http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/"

HTML parser doesn't deal with &amp

Suppose you've got the following content:

Testing

http://picasaweb.google.com/lh/sredir?uname=test&target=ALBUM&id=123&authkey=abc

(Note: the link itself is not valid due to mangled IDs (it was a private album))

Rendering this content as follows will not work:

{{post.body|linebreaksbr|oembed_html}}

The reason is that the "&" has been escaped and turned into "&amp;". The HTML parser at https://github.com/coleifer/micawber/blob/master/micawber/parsers.py#L144 does recognize and extract the URL, but it does not unescape "&amp;". Hence "&amp;" is sent to embed.ly, resulting in a 404 there.
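A sketch of the missing step, using the standard library's `html.unescape` on the extracted URL before it is sent to the provider. Where exactly this would go in the parser is an assumption; `clean_extracted_url` is a hypothetical name.

```python
from html import unescape


def clean_extracted_url(url):
    """Turn HTML-escaped entities in an extracted URL back into plain characters."""
    return unescape(url)
```

Note that `unescape` handles every HTML entity, so a query parameter whose name happens to look like an entity could also be rewritten.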

bootstrap_basic

Hi,
I was trying to include Facebook into the basic list of providers. E.g.,
pr.register('https://www.facebook.com/\S*?/posts/\S*', Provider('https://www.facebook.com/plugins/post/oembed.json'))

pr.register('https://www.facebook.com/\S*/photos/\S*', Provider('https://www.facebook.com/plugins/post/oembed.json'))

work perfectly fine. However, when I try

pr.register('https://www.facebook.com/photo.php?fbid=\S*', Provider('https://www.facebook.com/plugins/post/oembed.json'))

for a URL like
https://www.facebook.com/photo.php?fbid=10204669368414661&set=a.10201344709340262.1073741826.1849311083&type=3&theater
it always comes back with the message "Provider not found for ..."
What am I doing wrong? Is it the regular expression? Or is it an issue with the endpoints?

Many thanks for any feedback.
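The regular expression is the likely culprit: an unescaped `?` makes the preceding `p` optional instead of matching a literal question mark. A small check with the standard `re` module (the escaped pattern is a suggested rewrite, not something from the library):

```python
import re

url = ('https://www.facebook.com/photo.php?fbid=10204669368414661'
       '&set=a.10201344709340262.1073741826.1849311083&type=3&theater')

# As registered: "php?" means "ph" plus an optional "p", so "?fbid" never matches.
unescaped = r'https://www.facebook.com/photo.php?fbid=\S*'
# With the metacharacters escaped, "?" and "." are matched literally.
escaped = r'https://www\.facebook\.com/photo\.php\?fbid=\S*'

unescaped_matches = re.match(unescaped, url) is not None
escaped_matches = re.match(escaped, url) is not None
```

Here `unescaped_matches` is `False` and `escaped_matches` is `True`, which is consistent with the "Provider not found" message.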

html5lib incompatibility

Getting this error when building a Wagtail project: AttributeError: module 'html5lib.treebuilders' has no attribute '_base'

Looks like it's being picked up elsewhere so hopefully a fix will be released soon..

Packaging: examples conflict with flasgger

There is a file conflict between flasgger and micawber, because both install files into the overly generic path name examples.
For reference, please see this Arch Linux bug.

As a solution, micawber and flasgger should either not install these examples at all, or if required into a unique directory (e.g. micawber-examples) or another system directory (e.g. on Linux: /usr/share/doc/python-micawber/examples, which is usually done by the packagers).

I will remove them for now to resolve the file conflict.

To parse HTML, install BeautifulSoup

We get this error for some yet to be clarified reason. Is there a hidden dependency on the BeautifulSoup package?

File "/app/.heroku/python/lib/python3.9/site-packages/micawber/contrib/mcflask.py", line 21, in _oembed
2020-10-25T09:55:22.161763+00:00 app[web.1]:     return oembed(s, providers, urlize_all, html, **params)
2020-10-25T09:55:22.161763+00:00 app[web.1]:   File "/app/.heroku/python/lib/python3.9/site-packages/micawber/contrib/mcflask.py", line 10, in oembed
2020-10-25T09:55:22.161763+00:00 app[web.1]:     return Markup(fn(s, providers, urlize_all, **params))
2020-10-25T09:55:22.161764+00:00 app[web.1]:   File "/app/.heroku/python/lib/python3.9/site-packages/micawber/parsers.py", line 137, in parse_html
2020-10-25T09:55:22.161764+00:00 app[web.1]:     raise Exception('Unable to parse HTML, please install BeautifulSoup '
2020-10-25T09:55:22.161764+00:00 app[web.1]: Exception: Unable to parse HTML, please install BeautifulSoup or beautifulsoup4, or use the text parser

Possible unintended behavior with parse_html?

If you have encoded HTML characters such as &lt; and &gt; inside the same HTML tag as an untagged link, parse_html decodes the encoded characters instead of leaving them alone. This is inconsistent with the behavior when the encoded characters are not inside the same tag as the untagged link, or when the link is already tagged.

Encoded characters next to an untagged link:

from micawber import ProviderRegistry
from micawber import parse_html
text = u'<p>http://www.google.com &lt;script&gt; alert("foo"); &lt;/script&gt;</p>'
parse_html(text, ProviderRegistry())

Output:

u'<p><a href="http://www.google.com">http://www.google.com</a> <script> alert("foo"); </script></p>'

Here the encoded characters are decoded.

Encoded characters next to a tagged link:

text = u'<p><a href="http://www.google.com">http://www.google.com</a> &lt;script&gt; alert("foo"); &lt;/script&gt;</p>'
parse_html(text, ProviderRegistry())

Output:

u'<p><a href="http://www.google.com">http://www.google.com</a> &lt;script&gt; alert("foo"); &lt;/script&gt;</p>'

Here the encoded characters are not decoded.

Encoded characters alone:

text = u'<p>&lt;script&gt; alert("foo"); &lt;/script&gt;</p>'
parse_html(text, ProviderRegistry())

Output:

u'<p>&lt;script&gt; alert("foo"); &lt;/script&gt;</p>'

Here the encoded characters are not decoded.

Environment:

python2.7
Package                       Version
----------------------------- -------
backports.functools-lru-cache 1.5
beautifulsoup4                4.8.1
micawber                      0.5.0
pip                           19.2.3
pkg-resources                 0.0.0
setuptools                    41.4.0
soupsieve                     1.9.4
wheel                         0.33.6

What is the intended behavior for parse_html?

include runtests.py in PyPI source

I packaged micawber for NixOS. (NixOS/nixpkgs#34948)

I noticed that runtests.py is not included in the PyPI source, therefore the install tests don't work.

Could you include them, so we can test the builds properly?

live demo is down

got an error from google:

Support for Python 2.5 has turned off. Please refer to https://goo.gl/aESk5L for more information

bootstrap_basic raw strings / escapes

I noticed that a lot of the regular expression patterns in bootstrap_basic don't escape dots, so each dot matches any character. This means that a fair number of these patterns will match more than intended.

In addition most patterns aren't marked as raw strings and therefore contain invalid escape sequences. This isn't noticeable directly, but could cause issues in a future python version.

For an example of the latter:

python -W always -c '"https://\S*?soundcloud.com/\S+"'
<string>:1: DeprecationWarning: invalid escape sequence \S
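A short illustration of both points, using the soundcloud pattern from the example above. The `strict` variant and the lookalike URL are made up for the demonstration.

```python
import re

# Raw strings keep "\S" as a regex escape without a DeprecationWarning.
loose = r'https://\S*?soundcloud.com/\S+'    # "." matches any character
strict = r'https://\S*?soundcloud\.com/\S+'  # "\." matches only a literal dot

real = 'https://soundcloud.com/artist/track'
lookalike = 'https://soundcloudXcom/track'   # hypothetical URL, not a real host

loose_hits = [bool(re.match(loose, u)) for u in (real, lookalike)]
strict_hits = [bool(re.match(strict, u)) for u in (real, lookalike)]
```

The loose pattern accepts both URLs, while the strict one accepts only the real soundcloud.com address.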

Youtube Playlists

I'm not quite sure where the fault for this lies, but here seems a good start.

Embedding a youtube playlist using embed.ly directly works okay:
http://embed.ly/code?url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPLE2714DC8F2BA092D (literally an example playlist heh)

Running it through micawber with the URL https://www.youtube.com/playlist?list=PLE2714DC8F2BA092D doesn't embed anything. Using the embed URL https://www.youtube.com/embed/videoseries?list=PLE2714DC8F2BA092D results in the first video in the series being embedded, but without playlist controls.

suggestion: requests and responses

I know that requests is a bit of a resource hit (and it's been brought up before), but I wanted to suggest using it in the Provider (or as an ancillary option) because it could improve testing.

The responses library (https://github.com/getsentry/responses) lets you easily intercept calls to the requests library to quickly write integration tests. For example:

expected_payload = {'author_name': 
                    }
as_json = json.dumps(expected_payload)
with responses.RequestsMock() as rsps:
    rsps.add(responses.GET,
             "http://www.youtube.com/oembed",
             body=as_json,
             status=200,
             content_type='text/html',
             )
    result = providers.request('http://www.youtube.com/watch?v=54XHDUOHuzU')
    for (k, v) in expected_payload.items():
        assert result[k] == v

This was a big benefit to us for testing and simulations (and incredibly easy to implement via subclassing), so I wanted to suggest it upstream.

LICENSE file?

Hi, what's the license for the micawber project?
Would you mind adding a license file for it?
We'd prefer an MIT license if you're open to suggestions.
Thanks

Django filters doesn't work as expected if 'width' (only) parameter specified

For example:
{{ object.body|oembed:"600" }}

I think the problem is in the fix_width_height function. If only a width is passed, it sets maxwidth to the first digit only:
...
params['maxwidth'] = int(width_height[0])
...
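The symptom is consistent with indexing into the string itself: `"600"[0]` is `"6"`. A sketch of a parser that handles both a bare width and a `WIDTHxHEIGHT` pair; `parse_dimensions` is a hypothetical helper, not the library's `fix_width_height`.

```python
def parse_dimensions(value):
    """Turn "600" or "600x400" into oEmbed maxwidth/maxheight params."""
    params = {}
    if not value:
        return params
    # Split first, so a bare "600" stays one whole token rather than characters.
    parts = str(value).lower().split('x')
    params['maxwidth'] = int(parts[0])
    if len(parts) > 1 and parts[1]:
        params['maxheight'] = int(parts[1])
    return params
```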

Dead Links

Some of the links are dead.

I'd like more granular exceptions

I'd like more granular exceptions so I can distinguish between exception cases in my calling code. Specifically, I'd like to differentiate when a call to ProviderRegistry.request fails due to a provider not being found for a URL versus an error fetching a particular endpoint URL.

Let me know what you think about this. I'm happy to fork and make a pull request if you're willing to go this direction.
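A sketch of what such a hierarchy could look like. All class names here are hypothetical and chosen for illustration; they are not micawber's existing exceptions.

```python
class EmbedError(Exception):
    """Hypothetical base class, so existing broad handlers keep working."""


class NoProviderError(EmbedError):
    """No registered provider pattern matched the URL."""


class EndpointFetchError(EmbedError):
    """A provider matched, but fetching its endpoint URL failed."""


def describe_failure(exc):
    """Example of calling code that reacts differently to each case."""
    if isinstance(exc, NoProviderError):
        return 'unsupported-url'
    if isinstance(exc, EndpointFetchError):
        return 'retry-later'
    return 'unknown'
```

Sharing one base class means callers that catch the broad exception today would not need to change.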

CSP headers

Hi!
I'm using Flask, but this would be useful for Django and others too.
It would be great to have a feature that records, in a per-request cache or similar, which services have been used, and then adjusts the Content-Security-Policy header to include those services as accepted origins.

Otherwise the browser blocks the embedded object from loading, and allowing any origin rather than only the ones needed is not acceptable.

Thanks a lot!
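A rough sketch of the collection step, assuming the origins can be read from `src` attributes in the embed HTML that micawber returns. Both function names are hypothetical, and wiring the result into a Flask or Django response is left out.

```python
import re
from urllib.parse import urlparse

SRC_RE = re.compile(r'''\bsrc=["']([^"']+)["']''')


def frame_origins(embed_html, default_scheme='https'):
    """Collect scheme://host origins from src attributes in embed HTML."""
    origins = set()
    for src in SRC_RE.findall(embed_html):
        if src.startswith('//'):  # protocol-relative URL
            src = default_scheme + ':' + src
        parsed = urlparse(src)
        if parsed.scheme and parsed.netloc:
            origins.add('%s://%s' % (parsed.scheme, parsed.netloc))
    return origins


def csp_frame_src(origins):
    """Build a frame-src directive that allows only the collected origins."""
    return ' '.join(['frame-src', "'self'"] + sorted(origins))
```

Accumulating the sets from every embed rendered during a request gives the list of origins to allow for that response.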

Add form inputs to skip elements EOM

Provider bootstrap_oembed broken

The bootstrap_oembed provider appears to be broken. The following works fine with other providers (Python 3.7, micawber 0.4.0).

from micawber.providers import bootstrap_oembed
r = bootstrap_oembed()
result = r.provider_for_url("https://i.imgur.com/CZX7D64.jpg")

/usr/local/lib/python3.7/site-packages/micawber/providers.py in provider_for_url(self, url)
    136     def provider_for_url(self, url):
    137         for regex, provider in self:
--> 138             if re.match(regex, url):
    139                 return provider
    140 

/usr/local/lib/python3.7/re.py in match(pattern, string, flags)
    171     """Try to apply the pattern at the start of the string, returning
    172     a Match object, or None if no match was found."""
--> 173     return _compile(pattern, flags).match(string)
    174 
    175 def fullmatch(pattern, string, flags=0):

/usr/local/lib/python3.7/re.py in _compile(pattern, flags)
    284     if not sre_compile.isstring(pattern):
    285         raise TypeError("first argument must be string or compiled pattern")
--> 286     p = sre_compile.compile(pattern, flags)
    287     if not (flags & DEBUG):
    288         if len(_cache) >= _MAXCACHE:

/usr/local/lib/python3.7/sre_compile.py in compile(p, flags)
    762     if isstring(p):
    763         pattern = p
--> 764         p = sre_parse.parse(p, flags)
    765     else:
    766         pattern = None

/usr/local/lib/python3.7/sre_parse.py in parse(str, flags, pattern)
    928 
    929     try:
--> 930         p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    931     except Verbose:
    932         # the VERBOSE flag was switched on inside the pattern.  to be

/usr/local/lib/python3.7/sre_parse.py in _parse_sub(source, state, verbose, nested)
    424     while True:
    425         itemsappend(_parse(source, state, verbose, nested + 1,
--> 426                            not nested and not items))
    427         if not sourcematch("|"):
    428             break

/usr/local/lib/python3.7/sre_parse.py in _parse(source, state, verbose, nested, first)
    652             if item[0][0] in _REPEATCODES:
    653                 raise source.error("multiple repeat",
--> 654                                    source.tell() - here + len(this))
    655             if item[0][0] is SUBPATTERN:
    656                 group, add_flags, del_flags, p = item[0][1]

error: multiple repeat at position 44
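"multiple repeat" is the error `re` raises for sequences such as `**`. One plausible cause, which is an assumption and not confirmed by the traceback, is a wildcard-style URL scheme reaching `re.match` without being fully translated. A sketch of a safe translation (`scheme_to_regex` is a hypothetical helper):

```python
import re


def scheme_to_regex(scheme):
    """Convert a wildcard scheme such as 'https://*.imgur.com/*' to a regex."""
    # Escape everything first, then turn each escaped "*" into "\S*".
    return '^' + re.escape(scheme).replace(r'\*', r'\S*') + '$'


pattern = scheme_to_regex('https://*.imgur.com/*')
matched = re.match(pattern, 'https://i.imgur.com/CZX7D64.jpg') is not None
```

Escaping before substituting guarantees that no raw `*` or `.` from the scheme is left for the regex compiler to interpret.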

Django 1.8 warning

It would be nice to see this fixed in the next release:

/micawber/contrib/mcdjango/__init__.py:4: RemovedInDjango19Warning: django.utils.importlib will be removed in Django 1.9. from django.utils.importlib import import_module
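
The warning points at an import from `django.utils.importlib`. Since `import_module` is also available from the standard library's `importlib`, one possible fix is sketched below; this is an illustration, not necessarily the change the maintainer made:

```python
try:
    # Standard library location (Python 2.7+ and 3.x).
    from importlib import import_module
except ImportError:
    # Fallback for very old interpreters on Django releases before 1.9.
    from django.utils.importlib import import_module

# Example: load a module by its dotted path.
json_module = import_module('json')
```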

In case you want to support more services:

from micawber.providers import Provider, ProviderRegistry

pr = ProviderRegistry()

# Raw strings keep regex escapes such as \S and \w intact.
pr.register(r'http://qik.com/\S*',
            Provider('http://qik.com/api/oembed.json'))
pr.register(r'http://www.polleverywhere.com/\w+/\S+',
            Provider('http://www.polleverywhere.com/services/oembed/'))
pr.register(r'http://www.slideshare.net/\w+/\S+',
            Provider('http://www.slideshare.net/api/oembed/2'))
pr.register(r'http://\w+.wordpress.com/\S+',
            Provider('http://public-api.wordpress.com/oembed/'))
# '\S*' replaces the glob-style '*' used for the subdomain in the original.
pr.register(r'http://\S*.revision3.com/\S+',
            Provider('http://revision3.com/api/oembed/'))
# The original repeated the slideshare pattern here; a smugmug pattern is assumed.
pr.register(r'http://\S*.smugmug.com/\S*',
            Provider('http://api.smugmug.com/services/oembed/'))
pr.register(r'http://\w+.viddler.com/\S+',
            Provider('http://lab.viddler.com/services/oembed/'))

Can micawber return HTTPS content for YouTube?

I tried to parse a YouTube HTTPS URL with the following steps:

import micawber
providers = micawber.bootstrap_basic()
url = "https://www.youtube.com/watch?v=5BbSe_pI_eo"
micawber.parse_text(url, providers)

Output:

<iframe width="480" height="270" src="http://www.youtube.com/embed/5BbSe_pI_eo?feature=oembed" frameborder="0" allowfullscreen></iframe>

The result still uses an http URL instead of https. Is this due to the design of micawber or a limitation of YouTube?

Duration of videos not showing?

Hi, I am following your example. When I pull the JSON with

micawber.bootstrap_basic().request('https://www.youtube.com/watch?v=M9taeyvPQzg')

it returns all the info except, for some reason, the duration.

oembed.io does not resolve

It seems that the oembed.io domain is no longer registered. This means that bootstrap_oembedio is dead code for all intents and purposes.

It could possibly be replaced by https://oembed.com/ (https://oembed.com/providers.json)
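
A rough sketch of how the providers.json listing could be turned into regex/endpoint pairs. It assumes the file is a list of provider objects, each with an `endpoints` list whose entries carry a `url` and optional `schemes` using `*` wildcards; the helper names and the sample entry are hypothetical, and no network request is made here:

```python
import re


def scheme_to_regex(scheme):
    """Convert an oEmbed URL scheme with '*' wildcards into a regex string."""
    return r'\S*'.join(re.escape(part) for part in scheme.split('*'))


def iter_provider_patterns(providers):
    """Yield (regex, endpoint_url) pairs from parsed providers.json data."""
    for provider in providers:
        for endpoint in provider.get('endpoints', []):
            # Some endpoint URLs contain a '{format}' placeholder.
            url = endpoint['url'].replace('{format}', 'json')
            for scheme in endpoint.get('schemes', []):
                yield scheme_to_regex(scheme), url


# Sample data shaped like one entry of providers.json (illustrative only).
sample = [{
    'provider_name': 'Example',
    'provider_url': 'https://example.com/',
    'endpoints': [{
        'schemes': ['https://*.example.com/watch*'],
        'url': 'https://example.com/oembed.{format}',
    }],
}]

patterns = list(iter_provider_patterns(sample))
```

Each resulting pair could then be passed to a registry's `register(regex, Provider(url))` call, in the same way as the registration snippets elsewhere on this page.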

'IOError: [Errno 11] Resource temporarily unavailable' with Peewee sample blog app

I get the error shown below when I run the Peewee sample blog app from here: https://github.com/coleifer/peewee/tree/master/examples/blog

Specifically this happens when Micawber tries to display a post with links that need converting to embeds (e.g. a YouTube video link).

I've been able to reproduce this reliably with different links (e.g. Vimeo links instead of YouTube) and different browsers. It doesn't always happen immediately, but if you click around to view the posts with embeds, then return to the index page, then view posts again, the error appears and the page is either unavailable or shows the page with no CSS. Errors in the console show that files failed to load: Failed to load resource: net::ERR_SOCKET_NOT_CONNECTED

This is in a Python 2.7.10 virtualenv on Ubuntu 15.10 running the Flask dev server.

Interestingly, running it in a Python 3.4 virtualenv works without issues. But it would be great to have a fix for Python 2.

Exception happened during processing of request from ('127.0.0.1', 33044)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
    self.handle()
  File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 216, in handle
    rv = BaseHTTPRequestHandler.handle(self)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 247, in handle_one_request
    self.raw_requestline = self.rfile.readline()
IOError: [Errno 11] Resource temporarily unavailable

Allow compositing Providers

We need to grab oembed data and would prefer to do it ourselves using the providers in micawber.providers.bootstrap_basic.

There are some services that don't provide endpoints (thinking of Facebook and Vine, in particular) or aren't defined in bootstrap_basic. We want to compose a ProviderRegistry instance which tries providers from bootstrap_basic first, falling back to oembedio or Embedly if nothing is found.

Our current (proposed) solution:

from micawber import bootstrap_basic, bootstrap_oembedio

# oembedio first so that basic providers overwrite oembedio providers
# a bit icky since it relies on internal registry implementation
providers = bootstrap_oembedio()
# iterating a registry yields (regex, provider) pairs
for regex, provider in bootstrap_basic():
    providers.register(regex, provider)

That seems a bit ... circuitous. So, here are a couple of ways to provide composited ProviderRegistrys that I can think of:

use our proposed solution above, and note it in the docs,
allow the various bootstrap_* funcs to take an optional registry argument that defaults to None, but is used if passed,

def bootstrap_basic(pr=None, cache=None):
    pr = pr or ProviderRegistry(cache)
    ...
    return pr

Extract the hard-coded endpoints in bootstrap_basic so that they're available to library users.

PROVIDERS = {
    r'http://blip.tv/\S+': 'http://blip.tv/oembed',
    ...
}

def bootstrap_basic(cache=None):
    pr = ProviderRegistry(cache)

    for regex, endpoint in PROVIDERS.items():
        pr.register(regex, Provider(endpoint))

    return pr

Thoughts?
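
For comparison, the fallback behaviour described in this issue can also be sketched without merging registries at all. The example below assumes only that a registry iterates as (regex, provider) pairs; `first_provider` is a hypothetical helper, plain lists stand in for registries, and strings stand in for Provider objects:

```python
import re


def first_provider(url, *registries):
    """Return the provider of the first registry entry matching `url`.

    Registries are tried in the order given, so earlier ones take priority.
    """
    for registry in registries:
        for regex, provider in registry:
            if re.match(regex, url):
                return provider
    return None


# Stand-in data: lists of (regex, provider) pairs instead of real registries.
basic = [(r'https?://\S*\.youtube\.com/watch\S+', 'youtube-endpoint')]
fallback = [(r'https?://\S+', 'catch-all-endpoint')]

print(first_provider('https://www.youtube.com/watch?v=abc', basic, fallback))
print(first_provider('https://example.org/page', basic, fallback))
```

This keeps each registry untouched and makes the priority order explicit, at the cost of an extra lookup helper.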

eMonitor provider

If you want to add my web service http://monitor.eibriel.com as a provider:

providers.register('http://monitor.eibriel.com/\S*', Provider('http://monitor.eibriel.com/api/job/oembed'))

Example: http://monitor.eibriel.com/54f317ef7ff6a915d864496a

Bests!

parse_html overhead

>>> import micawber
>>> providers = micawber.bootstrap_basic()
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<html><body><p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p></body></html>'

What are the html and body tags for? I do not need them.

>>> micawber.parse_text('http://www.youtube.com/watch?v=54XHDUOHuzU', providers)
u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" frameborder="0" allowfullscreen></iframe>'
>>> micawber.parse_text('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<p><a href="http://www.youtube.com/watch?v=54XHDUOHuzU" title="Future Crew - Second Reality demo - HD">Future Crew - Second Reality demo - HD</a></p>'

I don't want a link; I want the iframe, etc., as in the docs, even when there are other tags in the text.

I use bs4, but why is it not listed in the docs as a dependency?

P.S. Python 2.7.3 (default, Mar 13 2014, 11:03:55)

Google Maps with HTTPS not working.

I'm not 100% certain what the issue is, but I'll give you as many details as possible.

My site is served over HTTPS only. When I try to embed a map using an https link, the map does not embed.

When I switch the URL to http, the map embeds.

Poking around in the source, I found: https://github.com/coleifer/micawber/blob/master/micawber/contrib/providers.py#L34

It seems to accept only http as a valid URL for Google Maps?

I think the problem might be that Google redirects to https if you just go to maps.google.com.

So when I try to embed a Google map with https, does it fail the regex match?
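
A small sketch of the suspected mismatch. Both patterns are hypothetical and written only for illustration; the actual regex at the linked line may differ:

```python
import re

# Hypothetical patterns; the real one in micawber may be written differently.
http_only = r'http://maps\.google\.com/maps\?\S+'
http_or_https = r'https?://maps\.google\.com/maps\?\S+'

url = 'https://maps.google.com/maps?q=Lawrence,+KS'

print(bool(re.match(http_only, url)))
print(bool(re.match(http_or_https, url)))
```

Adding `s?` after `http` makes the scheme optional, so the same pattern accepts both forms of the URL.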

bootstrap_embedly performance question

Hi, I'm new to micawber, and I'm reading the docs. I have a question about bootstrap_embedly.

If I want to use embed.ly, does bootstrap_embedly need to be called every time? For example, if I call it in a Django web app's initialization code, is it going to cause a delay at startup? Or does it cache results for future use?

From the docs:

>>> import micawber
>>> providers = micawber.bootstrap_embedly() # may take a second
>>> print micawber.parse_text('this is a test:\nhttp://www.youtube.com/watch?v=54XHDUOHuzU', providers)
this is a test:
<iframe width="640" height="360" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>

A bit more detail in the docs regarding this issue would be appreciated.
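
Independently of what micawber does internally, an application can make sure the bootstrap call runs only once per process. A minimal sketch, where `slow_bootstrap` is a stand-in for a bootstrap function that performs a network request and `get_providers` is a hypothetical accessor:

```python
import functools

calls = []


def slow_bootstrap():
    """Stand-in for a bootstrap function that performs a network request."""
    calls.append(1)
    return {'providers': 'loaded'}


@functools.lru_cache(maxsize=None)
def get_providers():
    """Build the registry on first use and reuse it afterwards."""
    return slow_bootstrap()


first = get_providers()
second = get_providers()
```

With this approach the cost is paid on the first call rather than at import time, and later calls return the same object.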

youtube unpredictable parsing

Hi Coleifer,

Thank you for your nice project. :)

Sorry for disturbing you

I found the answer to my issue, so I deleted the issue text, since I can't delete the issue record itself.

Best regards
and thanks again!

Igor

Short link for youtube

Hello,

I would like to report that short YouTube URLs, e.g. http://youtu.be/tS3FDpAiy3k, raise ProviderNotFoundException.
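
A sketch of a pattern that would accept both URL forms. The pattern is hypothetical and only illustrates the matching; it is not the one shipped with micawber:

```python
import re

# Hypothetical pattern covering both the long and the short URL form.
youtube_pattern = r'https?://(www\.youtube\.com/watch\?|youtu\.be/)\S+'

short_url = 'http://youtu.be/tS3FDpAiy3k'
long_url = 'http://www.youtube.com/watch?v=54XHDUOHuzU'

print(bool(re.match(youtube_pattern, short_url)))
print(bool(re.match(youtube_pattern, long_url)))
```

A pattern like this could be registered against YouTube's oEmbed endpoint with the same `register(regex, Provider(endpoint))` call used in the other snippets on this page.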

Greetings,
Adam Dobrawy

Custom fetcher

Do you plan to add a way to use a custom data retrieval method, by any chance?

It would be nice because then different methods could be used, like the requests library or asyncio in Python 3.4.

oauthlib (https://oauthlib.readthedocs.org/en/latest/oauth1/client.html) takes this approach and it works pretty nicely. Example:

client = oauthlib.oauth1.Client('client_key', client_secret='your_secret')
uri, headers, body = client.sign('http://example.com/request_token')
# Here you do a request
# and next you can grab data from response

So in this case it could be:

provider = micawber.bootstrap_basic()
url, headers, body = provider.prepare_request(URL)
# Do a request
provider.parse_response(resp)

Or maybe it would be easier to allow overriding the fetch method with a callback?

# my_callback(url, headers, body) would perform the request
provider.request(URL, fetch_callback=my_callback)

I know I can monkeypatch the fetch method, but that doesn't seem like a good approach in the long run.

Release 0.2.3 on pypi missing templates directory

When I try to use the template filters:

{% load micawber_tags %}
{{ "http://www.youtube.com/watch?v=mQEWI1cn7HY"|oembed }}

I get an error:

TemplateDoesNotExist at ...
    micawber/video.html

This is due to a missing templates folder in the released 0.2.3 version: http://pypi.python.org/pypi/micawber/0.2.3. The same tagged version on GitHub seems to be fine: https://github.com/coleifer/micawber/tree/0.2.3/micawber/contrib/mcdjango

Define (test) requirements

Hi!
I'm currently trying to package this module for Arch Linux [community].
While doing so, I realized that there is no definition of the required dependencies.
When grepping for imports, I see that the tests definitely require beautifulsoup, because they import it (and they fail hard when run without it installed). There also seems to be a conditional dependency on redis, django and flask. Can you please add them to a requirements.txt or add a Pipfile (and explain why they are needed), so I can add proper (optional, runtime and test) dependencies for the package and people will have an easier time using micawber?
Thanks for your work!

responsive width for iframe in django

It would be super helpful if you could pass in a percentage for the iframe width in the Django template for oembed.

Cheers


