Comments (12)
Thanks to @vincentdavis for his experiments with images overnight on his fork https://github.com/vincentdavis/peterjc.github.io
Looking at an example from http://biopython.org/wiki/Logo mapped to https://github.com/peterjc/peterjc.github.io/blob/master/wiki/Logo.mediawiki
[[Image:biopython.jpg]]
which became https://github.com/peterjc/peterjc.github.io/blob/master/wiki/Logo.md
![](biopython.jpg "biopython.jpg")
Over at the test Jekyll rendering http://peterjc.github.io/wiki/Logo.html this became:
<p><img src="biopython.jpg" alt="" title="biopython.jpg"></p>
i.e. a simple relative link. I have tested this by manually committing the logo image to the master branch.
This works on wiki/Logo.html (via Jekyll on github.io) and wiki/Logo.md (rendered on github.com), provided the image is uploaded as wiki/biopython.jpg (lower case). Note the image as hosted on the original wiki was at http://biopython.org/w/images/5/5c/Biopython.jpg (somewhat cryptic URL with Biopython with a capital B), with associated wiki page http://biopython.org/wiki/File:Biopython.jpg (note Biopython with a capital B).
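The "cryptic" part of that URL looks like MediaWiki's default hashed upload layout, where the two folder names come from the MD5 digest of the stored file name. A minimal sketch, assuming biopython.org uses that default layout and /w/images as the upload folder (hashed_upload_path is a made-up helper name, not part of convert.py):

```python
import hashlib


def hashed_upload_path(filename, base="/w/images"):
    """Guess the upload path under MediaWiki's default hashed layout.

    Assumes filename is already in the stored form (first letter
    capitalised), e.g. Biopython.jpg rather than biopython.jpg.
    """
    # MediaWiki stores names with underscores instead of spaces.
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Folder 1 is the first hex digit, folder 2 the first two hex digits.
    return "%s/%s/%s/%s" % (base, digest[0], digest[:2], name)


print(hashed_upload_path("Biopython.jpg"))
```

If this holds for the Biopython wiki, the converter could build image URLs directly instead of scraping them, but that should be checked against a few known images first.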
Strangely the wiki/Logo.mediawiki rendering on github.com does not show the image, but I don't mind. This may be a GitHub bug?
Note that I can edit the original MediaWiki site to use [[Image:Biopython.jpg]] instead of [[Image:biopython.jpg]], so the case is somewhat flexible within MediaWiki.
The converter will need to look at File:... page revisions, in this case http://biopython.org/wiki/File:Biopython.jpg aka File:Biopython.jpg, and fetch the matching image (upper case B).
One question is, do we save it as wiki/biopython.jpg (lower case b) or wiki/Biopython.jpg (upper case B), given MediaWiki allows both [[Image:Biopython.jpg]] and [[Image:biopython.jpg]]? My slight preference is to preserve the capitalisation in the filename, and pre-process lower case image links.
from mediawiki_to_git_md.
For a test case of a file with a revision, see http://biopython.org/wiki/File:TorusDBN.png
from mediawiki_to_git_md.
On the subject of filenames, quoting http://www.mediawiki.org/wiki/Manual:ImportImages.php:
Note: The "canonical database form" required by "--from" is obtained from the file name by capitalizing the first letter, replacing all spaces with underscores, and then replacing multiple consecutive underscores with one underscore. For example, to start with the file someFile with __weird_ spaces.png, the correct argument would be --from=SomeFile_with_weird_spaces.png
(update: See notes on issue #2, there are more rules for special character encoding)
So, I think we should transform all [[Image:XXX]] entries to use this "canonical database form" (prior to calling pandoc) and ensure we save the images using the "canonical database form".
e.g. Replace [[Image:biopython.jpg]] with [[Image:Biopython.jpg]] and save the associated file as wiki/Biopython.jpg.
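The pre-processing step could look roughly like this. It is only a sketch of the three rules quoted above (it ignores the extra special character rules mentioned in issue #2), and the function names canonical_image_name and canonicalise_image_links are made up for illustration rather than taken from convert.py:

```python
import re


def canonical_image_name(name):
    """Apply the three quoted rules to a MediaWiki file name."""
    name = name.strip()
    # 1. Capitalise the first letter only (leave the rest untouched).
    name = name[:1].upper() + name[1:]
    # 2. Replace all spaces with underscores.
    name = name.replace(" ", "_")
    # 3. Collapse runs of underscores into a single underscore.
    return re.sub(r"_+", "_", name)


# Matches the start of [[Image:...]] or [[File:...]] up to the name's end.
_IMAGE_LINK = re.compile(r"\[\[(Image|File):([^|\]]+)", re.IGNORECASE)


def canonicalise_image_links(text):
    """Rewrite image links in MediaWiki markup to the canonical name."""
    def fix(match):
        return "[[%s:%s" % (match.group(1), canonical_image_name(match.group(2)))
    return _IMAGE_LINK.sub(fix, text)


print(canonical_image_name("someFile with __weird_ spaces.png"))
print(canonicalise_image_links("[[Image:biopython.jpg]]"))
```

The same canonical_image_name call would then be used when choosing the filename under wiki/, so the link and the saved file always agree.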
from mediawiki_to_git_md.
Some images are scaled, for example on
http://biopython.org/wiki/Phylo
The same page on GitHub:
https://github.com/peterjc/peterjc.github.io/edit/master/wiki/Phylo.mediawiki
[[File:phylo-draw-apaf1.png|256px|thumb|right|Rooted phylogram, via Phylo.draw]]
which becomes
![Rooted phylogram, via Phylo.draw](phylo-draw-apaf1.png "fig:Rooted phylogram, via Phylo.draw")
On the actual wiki the URL is:
http://biopython.org/w/images/thumb/0/04/Phylo-draw-apaf1.png/256px-Phylo-draw-apaf1.png
If you want to see the full size image:
http://biopython.org/w/images/0/04/Phylo-draw-apaf1.png
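Going by these two URLs, the full size address can be derived from the thumbnail one by dropping the final NNNpx- file name and the /thumb/ folder. A small sketch of that mapping, assuming all thumbnails on the wiki follow the same pattern (full_size_url is a hypothetical helper name):

```python
def full_size_url(url):
    """Map a MediaWiki thumbnail URL to the full size image URL.

    URLs without a /thumb/ folder are assumed to be full size already
    and are returned unchanged.
    """
    if "/thumb/" not in url:
        return url
    # Drop the trailing "256px-Name.png" style component...
    without_thumb_name = url.rsplit("/", 1)[0]
    # ...then remove the thumb folder itself.
    return without_thumb_name.replace("/thumb/", "/", 1)


thumb = ("http://biopython.org/w/images/thumb/0/04/"
         "Phylo-draw-apaf1.png/256px-Phylo-draw-apaf1.png")
print(full_size_url(thumb))
```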
I found a document that suggests code like this can be used for scaling images.
[[ http://url.to/image.png | height = 100px ]]
Assuming you don't want to use FTP to download the images folder, I will work on a script for downloading the images.
from mediawiki_to_git_md.
A possibly useful tool:
"extract all image names from the XML dump which it may reference, then generate a series of BASH"
https://meta.wikimedia.org/wiki/Wikix
and also
https://github.com/benjaoming/python-mwdump-tools
from mediawiki_to_git_md.
We can probably get the convert.py script to download the (latest) version of each image (e.g. Biopython.jpg) by scraping the URL from the associated wiki page when it sees a revision for File:Biopython.jpg, save this under wiki/ and then use git add wiki/Biopython.jpg and commit it. This would be functional, but would not fully capture image revisions.
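As a rough illustration of the save, git add and commit steps, here is a sketch that keeps building the git commands separate from running them. The function names, the commit message and the idea of passing the revision's author and date through are all assumptions for illustration; convert.py may well do this differently:

```python
import os
import subprocess


def git_commands_for_image(path, message, author=None, date=None):
    """Build the git add and git commit command lines for one image."""
    commit = ["git", "commit", "-m", message]
    if author:
        # git expects the form "Name <email>" here.
        commit.append("--author=%s" % author)
    if date:
        commit.append("--date=%s" % date)
    # Limit the commit to this one file.
    commit += ["--", path]
    return [["git", "add", path], commit]


def save_and_commit_image(data, filename, message, folder="wiki", **kwargs):
    """Write the image bytes under folder and commit the new file."""
    path = os.path.join(folder, filename)
    with open(path, "wb") as handle:
        handle.write(data)
    for cmd in git_commands_for_image(path, message, **kwargs):
        subprocess.check_call(cmd)
    return path


for cmd in git_commands_for_image("wiki/Biopython.jpg", "Upload Biopython.jpg"):
    print(" ".join(cmd))
```

Only save_and_commit_image touches the disk and the repository, so the command builder can be checked without a git checkout.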
from mediawiki_to_git_md.
The lack of image scaling may be worth reporting to pandoc as a feature enhancement request, but the good news is for Biopython's wiki there are so few images we can check them all manually.
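To make the manual check easier, something like the following could list the image links that carry a size option. It is a sketch only: the regular expressions cover the common NNNpx, xNNNpx and NNNxNNNpx forms and will not cope with captions that contain nested [[...]] links, and scaled_images is a made-up name:

```python
import re

# [[File:name|option|option|...]] or [[Image:...]], without nested links.
_IMAGE_LINK = re.compile(
    r"\[\[(?:Image|File):([^\]|]+)((?:\|[^\]]*)?)\]\]", re.IGNORECASE)
# Size options such as 256px, x100px or 200x100px.
_SIZE = re.compile(r"^(\d+)?(?:x(\d+))?px$")


def scaled_images(text):
    """Return (file name, size option) pairs for scaled image links."""
    found = []
    for match in _IMAGE_LINK.finditer(text):
        name = match.group(1).strip()
        # group(2) is "" or "|opt1|opt2...", so skip the empty first item.
        options = [opt.strip() for opt in match.group(2).split("|")[1:]]
        for opt in options:
            if opt != "px" and _SIZE.match(opt):
                found.append((name, opt))
                break
    return found


example = ("[[File:phylo-draw-apaf1.png|256px|thumb|right|"
           "Rooted phylogram, via Phylo.draw]]")
print(scaled_images(example))
```

Running this over each wiki/*.mediawiki file would give the short list of pages whose images need their size restoring by hand after conversion.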
from mediawiki_to_git_md.
OK, this is a hack.
It scans the MediaWiki files in /wiki/ and, if it finds an image (File:), it then scans the corresponding biopython.org page for all images.
It then gets the full size images.
import os
import re
import urllib.request

scan_path = 'peterjc.github.io/wiki/'  # look here for files to scan
save_path = ''  # prefix for saved images ('' means the current directory)
##############
files = os.listdir(scan_path)
mfile = re.compile(r'File:.+')
mlink = re.compile(r'class="image"><img alt="" src="[^"]+')
for file in files:
    namesplit = file.rsplit('.', maxsplit=1)
    if namesplit[-1] == 'mediawiki':
        # Read the MediaWiki source from scan_path
        with open(scan_path + file, 'r') as handle:
            ofile = handle.read()
        result = mfile.findall(ofile)
        if len(result) > 0:  # Found a file to download
            # for r in result:
            #     print(r)
            print(namesplit[0])
            response = urllib.request.urlopen('http://biopython.org/wiki/' + namesplit[0])
            html = response.read().decode('utf-8', errors='replace')
            for m in mlink.findall(html):
                pre_url = 'http://biopython.org'
                src = m.split('src="')[1]
                if '/thumb/' in src:
                    # Thumbnail: drop the NNNpx- file name and the /thumb/ folder
                    src = src.rsplit('/', maxsplit=1)[0].replace('/thumb/', '/')
                img_url = pre_url + src
                print(img_url)
                img = urllib.request.urlopen(img_url)
                filename = img_url.rsplit('/', maxsplit=1)[-1]
                print(filename)
                with open(save_path + filename, 'wb') as localFile:
                    localFile.write(img.read())
from mediawiki_to_git_md.
Even if the API links are not enabled, we can if need be crawl the Special:ListFiles page to find all the currently uploaded files, e.g. http://biopython.org/wiki/Special:ListFiles
However, simply processing the XML revisions tells us when a new image was uploaded, or an old image updated. We may be able to use this revision date stamp to cross reference with the image page in order to get the appropriate version of the image. e.g. parsing the file history table on http://biopython.org/wiki/File:TorusDBN.png - based on the snippet from @vincentdavis
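The cross-referencing could be sketched as below: given the timestamp of a File: page revision from the XML dump and a list of (upload timestamp, URL) pairs scraped from the file history table, pick the most recent upload that is not later than the revision. This assumes both sides use the XML dump's timestamp format and that the scraped table has already been turned into such pairs; the names and the example.org URLs are made up for illustration:

```python
from datetime import datetime


def parse_timestamp(stamp):
    """Parse a timestamp in the MediaWiki XML dump style."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def image_version_for(revision_stamp, history):
    """Pick the upload matching a File: page revision.

    history is a list of (timestamp, url) pairs in any order. Returns
    the URL of the latest upload made at or before the revision, or
    None if every upload is later than the revision.
    """
    revision_time = parse_timestamp(revision_stamp)
    best = None
    for stamp, url in history:
        uploaded = parse_timestamp(stamp)
        if uploaded <= revision_time and (best is None or uploaded > best[0]):
            best = (uploaded, url)
    return best[1] if best else None


# Hypothetical history for one image with two uploads.
history = [
    ("2010-05-01T09:00:00Z", "http://example.org/archive/old.png"),
    ("2011-02-10T15:30:00Z", "http://example.org/current.png"),
]
print(image_version_for("2010-06-01T00:00:00Z", history))
```

If the page revision and the upload are logged a second or two apart, a small tolerance on the comparison may be needed.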
from mediawiki_to_git_md.
Added a get_images function which uses the existing XML parsing in convert.py.
Need to see if I can use the existing commit functions for the images.
https://github.com/vincentdavis/mediawiki_to_git_md/commit/63bec72ebf78912d2e2ffe1223850e1cde366245#diff-fc95a3840033cc06854352f25fb6822f
from mediawiki_to_git_md.
Excellent - I should be able to add the git part and test this over the weekend :)
from mediawiki_to_git_md.
Marking as fixed; will open a new issue for image scaling...
from mediawiki_to_git_md.