bbolli / tumblr-utils Goto Github PK

View Code? Open in Web Editor NEW

667.0 38.0 124.0 446 KB

Utilities for dealing with Tumblr blogs, Tumblr backup

License: GNU General Public License v3.0

Python 100.00%

tumblr-backup python tumblr

tumblr-utils's Introduction

tumblr-utils

This is a collection of utilities dealing with Tumblr blogs.

tumble.py creates new posts from RSS or Atom feeds
tumblr_backup.py makes a local backup of posts and images
mail_export.py mails tagged links to a recipient list

These scripts are or have been useful to me over the years.

More documentation can be found in each script's docstring or in tumblr_backup.md.

The utilities run under Python 2.7.

Notice

On 2015-06-04, I made the v2 API the default on the master branch. The former master branch using the v1 API is still available on Github as api-v1, but will no longer be updated. The one feature that's only available with the old API is the option to backup password-protected blogs. There's no way to pass a password in Tumblr's v2 API.

License

GPL3.

tumblr-utils's People

Contributors

Stargazers

Watchers

Forkers

pmclanahan satonaoya juliansutherland phaufe erizocosmico ohhdemgirls seabirdshanty imclab wyohknott vxbinaca changbiao gorodok11 wrenth04 landsurveyorsunited azasypkin vbs d6-9b xlxjh dplepage nightpool wjt guarnacciaa unforeseenocean theunmutual carywoods suzumiyang mitar awesome-archive vaginessa chiseld covorsorin lisamarie1993 leavein0831 heatherbmayer troyxmccall hxysh ysj123688 kstenerud rohit-venkat dulusno deflated-criossant reguess cayabyabsrose theassyrian shauryachawla keepyoursecrets xiaoyang18 canwe 9999years derrierebender aspensmonster l3112 felina-lain cclauss rstorj skier0072 adevyish myonetaps zafai senorsnrub eridun nidahafeez mreldritch qu3stbaby nonomal ritu1337 cherryband gromgit miloxing aldunbar rswail jeffhoelter rodh rsdesoto thesimplemango suffocatedfuckhole kdbruin agcarter cebtenzzre plct contentmint greatlemer jessicp0249 xenostalgic quivop natalieanguyen thisismycontributionaccount bastianmax changephonenumberintumblr onyxr misterlisbon hrbynum rsmrlvlck rand82 33f1a3 emilycoleman gterezis allegrabottlik grukz tu-p

tumblr-utils's Issues

Support the v2 Tumblr API

I'm not sure whether and how the theme can be backed up using the new API. The docs are silent on the issue.

Proposal: use meaningful image filenames

Hello,

Tumblr image filenames are quite meaningless if one where to look at the saved images locally in a file manager: there is no easy way to find back the post related to a specific picture.
I propose to use a different pattern to save the images: account_postID.ext
This way, one could find back the post attached to the picture they were looking at locally.

Proposed patch:

--- tumblr_backup.orig.py   2014-04-21 12:55:40.198076300 +0200
+++ tumblr_backup.py    2014-04-21 13:54:23.970624600 +0200
@@ -127,11 +127,14 @@
         return None
     return doc if doc._name == 'tumblr' else None

-def save_image(image_url):
+def save_image(image_url, post, offset):
     """saves an image if not saved yet, returns the local file name"""
     def _url(fn):
         return u'../%s/%s' % (image_dir, fn)
-    image_filename = image_url.split('/')[-1]
+    if offset is None:
+        image_filename = '{0}_{1}'.format(account,post.ident)
+    else:
+        image_filename = '{0}_{1}_{2}'.format(account,post.ident,offset)
     glob_filter = '' if '.' in image_filename else '.*'
     # check if a file with this name already exists
     image_glob = glob(join(image_folder, image_filename + glob_filter))
@@ -422,7 +425,10 @@
             url = escape(get_try('photo-link-url'))
             for p in post.photoset['photo':] if hasattr(post, 'photoset') else [post]:
                 src = unicode(p['photo-url'])
-                append(escape(self.get_image_url(src)), u'<img alt="" src="%s">')
+                if p._name == 'photo' and p('offset'):
+                    append(escape(self.get_image_url(src, p('offset'))), u'<img alt="" src="%s">')
+                else:
+                    append(escape(self.get_image_url(src)), u'<img alt="" src="%s">')
                 if url:
                     content[-1] = '<a href="%s">%s</a>' % (url, content[-1])
                 content[-1] = '<p>' + content[-1] + '</p>'
@@ -482,8 +488,8 @@
         for p in ('<p>(<(%s)>)', '(</(%s)>)</p>'):
             self.content = re.sub(p % 'p|ol|iframe[^>]*', r'\1', self.content)

-    def get_image_url(self, url):
-        return save_image(url)
+    def get_image_url(self, url, offset=None):
+        return save_image(url, self, offset)

     def get_post(self):
         """returns this post in HTML"""

Thank you,
— Clare

tumblr_backup.py syntax error in python 2.7.6

Multiple exceptions on a single line syntax has changed in Python. See http://stackoverflow.com/questions/6470428/catch-multiple-exceptions-in-one-line-except-block
Here is a fix, tested on OS X Yosemite:

681c681
<         except IOError, OSError, ValueError:

---
>         except (IOError, OSError, ValueError):

Feature Request: Import blog list

I'd like a --list option in tumblr_backup.py to import blogs from a text file so I can easily download 1000's of blogs with the same options without having to list them in the command.

Output option: can't write to folder with non Latin characters

Hello,

I'm trying to backup to a folder containing Cyrillic characters and it does not work as expected.
For example:

tumblr_backup.py   -O "..\Алан" bboli

outputs

WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect

But Windows Explorer does handle those characters. Is there a workaround to it?

Thanks.

Some remarks about the APIv2 branch

I've seen that you didn't include any oAuth identification. It would have allowed an user to backup his own private blog if full auth was provided. But it's a lot of extra code for few user cases I guess.

    params = {'api_key': API_KEY, 'num': count}

The param num has been changed to limit in apiv2.

Retrieving new information:

        self.note_count = post['note_count']
        try:            
            self.source_title = post['source_title']
            self.source_url = post['source_url']
        except KeyError:
            self.source_title = ''
            self.source_url = ''

The apiv2 returns the following post types: text, quote, link, answer, video, audio, photo, chat, so this isn't needed anymore:

    def type_callback(option, opt, value, parser):
        value = value.replace('text', 'regular')
        value = value.replace('chat', 'conversation')
        value = value.replace('photoset', 'photo')
        csv_callback(option, opt, value, parser)

I've changed it to sanitize the input, not sure if it's needed.

Full patch:

--- tumblr_backup_apiv2.py.orig 2014-10-06 14:07:00.964349600 +0200
+++ tumblr_backup_apiv2.py  2014-10-06 14:06:31.007685400 +0200
@@ -155,7 +155,7 @@


 def apiparse(base, count, start=0):
-    params = {'api_key': API_KEY, 'num': count}
+    params = {'api_key': API_KEY, 'limit': count}
     if start > 0:
         params['offset'] = start
     url = base + '?' + urllib.urlencode(params)
@@ -450,6 +450,13 @@
         self.tm = time.localtime(self.date)
         self.title = ''
         self.tags = post['tags']
+        self.note_count = post['note_count']
+        try:            
+            self.source_title = post['source_title']
+            self.source_url = post['source_url']
+        except KeyError:
+            self.source_title = ''
+            self.source_url = ''
         if options.tags:
             self.tags_lower = set(t.lower() for t in self.tags)
         self.file_name = join(self.ident, dir_index) if options.dirs else self.ident + post_ext
@@ -692,10 +699,9 @@
         csv_callback(option, opt, value.lower(), parser)

     def type_callback(option, opt, value, parser):
-        value = value.replace('text', 'regular')
-        value = value.replace('chat', 'conversation')
-        value = value.replace('photoset', 'photo')
-        csv_callback(option, opt, value, parser)
+        values_list = value.lower().split(',')
+        set(values_list).intersection(['text','quote','link', 'answer','video','audio','photo','chat'])
+        setattr(parser.values, option.dest, values_list)

     parser = optparse.OptionParser("Usage: %prog [options] blog-name ...",
         description="Makes a local backup of Tumblr blogs."
@@ -748,7 +754,9 @@
         " case-insensitive)"
     )
     parser.add_option('-T', '--type', type='string', action='callback',
-        callback=type_callback, help="save only posts of type TYPE (comma-separated values)"
+        callback=type_callback, help="save only posts of type TYPE (comma-separated values)."
+        " Type can be:  text, quote, link, answer, video, audio, photo, chat"
+        
     )
     parser.add_option('-I', '--image-names', type='choice', choices=('o', 'i', 'bi'),
         default='o', metavar='FMT',

403 error on image download

For some reason one of the images in my blog was returning a 403 error, and this caused the script to terminate.

I got it working by putting in a try block in tumblr_backup.py around line 130

# download the image data                                                   
try:
    image_response = urllib2.urlopen(image_url)
except urllib2.HTTPError, e:
    return ''

KeyError: 'link-text'

For some reason my blog was causing KeyError because 'link-text' was missing for the post variable.

I fixed it by modifying the code in tumblr_backup.py around line 400 to:

        if 'link-text' in post:
            self.title = u'<a href="%s">%s</a>' % (url, post['link-text'])
        else :
            self.title = ''

Enhancement: inline images download

Just a small patch to retrieve inline images in any kind of posts:

--- tumblr_backup_apiv2.py.orig 2014-10-08 12:21:28.322480700 +0200
+++ tumblr_backup_apiv2_inlinesupport.py    2014-10-08 14:01:19.565999100 +0200
@@ -162,7 +162,7 @@


 def apiparse(base, count, start=0):
-    params = {'api_key': API_KEY, 'num': count}
+    params = {'api_key': API_KEY, 'limit': count}
     if start > 0:
         params['offset'] = start
     url = base + '?' + urllib.urlencode(params)
@@ -530,9 +530,11 @@
         def get_try(elt):
             return post.get(elt)

+
         def append_try(elt, fmt=u'%s'):
             elt = get_try(elt)
             if elt:
+                elt = re.sub(r'(<img [^\>]*src\s*=\s*["\'])(.*?)(["\'][^\>]*>)', self.get_inline_url, elt, flags=re.I)
                 append(elt, fmt)

         if self.typ == 'text':
@@ -576,7 +578,7 @@

         elif self.typ == 'answer':
             self.title = post['question']
-            append(post['answer'])
+            append_try('answer')

         elif self.typ == 'chat':
             self.title = get_try('title')
@@ -599,6 +601,44 @@

         self.save_post()

+    def get_inline_url(self, match):
+        """Saves an inline image if not saved yet. Returns the new URL or
+        the original URL in case of download errors."""
+        
+        self.image_dir = join(post_dir, self.ident) if options.dirs else image_dir
+        self.image_folder = path_to(self.image_dir)
+        image_url = match.group(2)
+
+        def _url(fn):
+            return match.group(1) + u'%s%s/%s' % (save_dir, self.image_dir, fn) + match.group(3)
+
+        image_filename = image_url.split('/')[-1]
+        # check if a file with this name already exists
+        known_extension = '.' in image_filename[-5:]
+        image_glob = glob(join(self.image_folder, image_filename +
+            ('' if known_extension else '.*')
+        ))
+        if image_glob:
+            return _url(split(image_glob[0])[1])
+        # download the image data
+        try:
+            image_response = urllib2.urlopen(image_url, timeout=HTTP_TIMEOUT)
+            image_data = image_response.read()
+            image_response.close()
+        except:
+            # return the original URL
+            return match.group(0)
+        # determine the file type if it's unknown
+        if not known_extension:
+            image_type = imghdr.what(None, image_data[:32])
+            if image_type:
+                image_filename += '.' + image_type.replace('jpeg', 'jpg')
+        # save the image
+        with open_image(self.image_dir, image_filename) as image_file:
+            image_file.write(image_data)
+        return _url(image_filename)
+
+
     def get_image_url(self, image_url, offset):
         """Saves an image if not saved yet. Returns the new URL or
         the original URL in case of download errors."""

Unicode issue

100% reproducible for one particular Tumblr feed:

$ ./tumblr_backup.py unsplash.com
Traceback (most recent call last): 158 of 159
File "./tumblr_backup.py", line 598, in
tb.backup(account)
File "./tumblr_backup.py", line 366, in backup
get_style()
File "./tumblr_backup.py", line 218, in get_style
f.write(css + '\n')
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 347: ordinal not in range(128)

I don't know Python well enough to assist but it looks like some unicode issues were addressed in the past. Possibly a spot was missed?

Environment:
Mac OSX 10.9 with latest tumblr-utils code.

Images not ending in _500, _inline (?) or similar are not backed up

When running the following command:
$ python2 tumblr_backup.py -D ask-twilight-and-trixie (ponies blog)
the images of all posts until (and including) November 10, 2012 are not backed up.
Their posts/somenumber/index.html contains
<p><img src="http://media.tumblr.com/tumblr_md9da7womz1qixfad.png"/></p>
whereas newer posts have
<p><img alt="" src="../../posts/3545644837/tumblr_mdazquJasP1rp7k8so1_1280.png"></p>

I have not encountered _400 or _inline images yet, so can't say much about those.
Post from November 11, 2012 - goes correct
Post from November 10, 2012 - goes incorrect

Tumblr Backup: Video Support (download)

Could you please add video download support.

Bug: photoset caption issue

The photo caption inside a photoset is under "photos", not associated with each photo size.

--- tumblr_backup_apiv2.py.orig 2014-10-08 12:21:28.322480700 +0200
+++ tumblr_backup_apiv2_fixphotosetcaption.py   2014-10-08 14:34:34.414669700 +0200
@@ -554,7 +554,7 @@
                     content[-1] = u'<a href="%s">%s</a>' % (escape(url), content[-1])
                 content[-1] = '<p>' + content[-1] + '</p>'
                 if p['caption']:
-                    append(o['caption'], u'<p>%s</p>')
+                    append(p['caption'], u'<p>%s</p>')
             append_try('caption')

         elif self.typ == 'link':

Enhancement: audio/video support

Optional dependency to youtube_dl for external embedded videos.

--- tumblr_backup_apiv2.py.orig 2014-10-07 11:29:56.562755600 +0200
+++ tumblr_backup_mediasupport.py   2014-10-07 17:37:38.295584200 +0200
@@ -22,6 +22,7 @@
 import time
 import urllib
 import urllib2
+import socket
 from xml.sax.saxutils import escape

 # extra optional packages
@@ -29,6 +30,10 @@
     import pyexiv2
 except ImportError:
     pyexiv2 = None
+try:
+    import youtube_dl
+except ImportError:
+    youtube_dl = None

 # default blog name(s)
 DEFAULT_BLOGS = ['bbolli']
@@ -52,12 +57,14 @@
 # variable directory names, will be set in TumblrBackup.backup()
 save_folder = ''
 image_folder = ''
+media_folder = ''

 # constant names
 root_folder = os.getcwdu()
 post_dir = 'posts'
 json_dir = 'json'
 image_dir = 'images'
+media_dir = 'medias'
 archive_dir = 'archive'
 theme_dir = 'theme'
 save_dir = '../'
@@ -130,6 +137,10 @@
 def open_image(*parts):
     return open_file(lambda f: open(f, 'wb'), parts)

+    
+def open_media(*parts):
+    return open_file(lambda f: open(f, 'ab'), parts)
+

 def strftime(format, t=None):
     if t is None:
@@ -162,7 +173,7 @@


 def apiparse(base, count, start=0):
-    params = {'api_key': API_KEY, 'num': count}
+    params = {'api_key': API_KEY, 'limit': count}
     if start > 0:
         params['offset'] = start
     url = base + '?' + urllib.urlencode(params)
@@ -334,7 +345,7 @@
         base = get_api_url(account)

         # make sure there are folders to save in
-        global save_folder, image_folder, post_ext, post_dir, save_dir, have_custom_css
+        global save_folder, image_folder, media_folder, post_ext, post_dir, save_dir, have_custom_css
         if options.blosxom:
             save_folder = root_folder
             post_ext = '.txt'
@@ -343,6 +354,7 @@
         else:
             save_folder = join(root_folder, options.outdir or account)
             image_folder = path_to(image_dir)
+            media_folder = path_to(media_dir)
             if options.dirs:
                 post_ext = ''
                 save_dir = '../../'
@@ -465,6 +477,16 @@
         self.tags = post['tags']
         if options.tags:
             self.tags_lower = set(t.lower() for t in self.tags)
+        try:
+            self.note_count = post['note_count']
+        except KeyError:
+            self.note_count = ''
+        try:
+            self.source_title = post['source_title']
+            self.source_url = post['source_url']
+        except KeyError:
+            self.source_title = ''
+            self.source_url = ''
         self.file_name = join(self.ident, dir_index) if options.dirs else self.ident + post_ext
         self.llink = self.ident if options.dirs else self.file_name

@@ -516,11 +538,47 @@
             append_try('source', u'<p>%s</p>')

         elif self.typ == 'video':
-            append(post['player'][-1]['embed_code'])
+            src = ''
+            if not options.skip_images:
+                self.media_dir = join(post_dir, self.ident) if options.dirs else media_dir
+                self.media_folder = path_to(self.media_dir)
+                if post['video_type'] == 'tumblr':
+                    src = self.get_media_url(post['video_url'], '.mp4')
+                elif youtube_dl:
+                    if post['html5_capable']:
+                        try:
+                            src = self.get_youtube_url(post['permalink_url'])
+                        except:
+                            sys.stdout.write(u'Unknown video type in post #%s%-50s\n' % (self.ident, ' '))
+                    else:
+                        try:
+                            src = self.get_youtube_url(post['source_url'])
+                        except:
+                            sys.stdout.write(u'Unknown video type in post #%s%-50s\n' % (self.ident, ' '))
+            if src:
+                append(u'<p><video controls><source src="%s" type="video/mp4">Your browser does not support the video element.<br /><a href="%s" >Video file</a></video></p>' % (src, src))
+            else:
+                append(post['player'][-1]['embed_code'])
             append_try('caption')

         elif self.typ == 'audio':
-            append(post['player'])
+            src = ''
+            if not options.skip_images:
+                self.media_dir = join(post_dir, self.ident) if options.dirs else media_dir
+                self.media_folder = path_to(self.media_dir)
+                if post['audio_type'] == 'tumblr':
+                    audio_url = post['audio_url']
+                    if audio_url.startswith('http://a.tumblr.com/'):
+                        src = self.get_media_url(audio_url, '.mp3')
+                    elif audio_url.startswith('https://www.tumblr.com/audio_file/'):
+                        audio_url = u'http://a.tumblr.com/%so1.mp3' % audio_url.split('/')[-1]
+                        src = self.get_media_url(audio_url, '.mp3')
+                elif  post['audio_type'] == 'soundcloud':
+                    src = self.get_media_url(post['audio_url'], '.mp3')
+            if src:
+                append(u'<p><audio controls><source src="%s" type="audio/mpeg">Your browser does not support the audio element.<br /><a href="%s" >Audio file</a></audio></p>' % (src, src))
+            else:
+                append(post['player'])
             append_try('caption')

         elif self.typ == 'answer':
@@ -548,6 +606,79 @@

         self.save_post()

+
+    def get_youtube_url(self, youtube_url):
+        # determine the media file name
+        ydl = youtube_dl.YoutubeDL({'outtmpl': join(self.media_folder, u'%(id)s_%(uploader_id)s_%(title)s.%(ext)s'), 'quiet': True, 'restrictfilenames': True, 'noplaylist': True})
+        ydl.add_default_info_extractors()
+        try:
+            result = ydl.extract_info(youtube_url, download=False)
+            media_filename = ydl.prepare_filename(result)
+        except:
+            return ''
+
+        # check if a file with this name already exists
+        media_glob = glob(media_filename)
+        if media_glob:
+            return u'%s%s/%s' % (save_dir, self.media_dir, split(media_glob[0])[1])
+
+        try:
+            result = ydl.extract_info(youtube_url, download=True)
+        except:
+            return ''
+        return u'%s%s/%s' % (save_dir, self.media_dir,os.path.split(media_filename)[1])
+
+
+    def get_media_url(self, media_url, extension=''):
+        def _url(fn):
+            return u'%s%s/%s' % (save_dir, self.media_dir, fn)
+
+        # determine the media file name
+        if options.image_names == 'i':
+            media_filename = self.ident
+        elif options.image_names == 'bi':
+            media_filename = account + '_' + self.ident
+        else:
+            media_filename = media_url.split('/')[-1]
+
+        if '.' in media_filename[-4]:
+            media_filename = media_filename[:-4]
+        media_filename += extension
+
+        # check if a file with this name already exists
+        media_glob = glob(join(self.media_folder, media_filename))
+        if media_glob:
+            return _url(split(media_glob[0])[1])
+
+        # download the media data
+        media_part_glob = glob(join(self.media_folder, media_filename + '.part'))
+        if media_part_glob:
+            try:
+                os.remove(media_part_glob[0])
+            except Exception, e:
+                sys.stderr.write('Error deleting the temporary file: %s' % e)
+                return ''
+
+        try:
+            media_response = urllib2.urlopen(media_url)
+            while True:
+                media_data = media_response.read(1024*1024)
+                if not media_data:
+                    break
+                # save the media
+                with open_media(self.media_dir, media_filename + '.part') as media_file:
+                    media_file.write(media_data)
+            try:
+                os.rename(join(self.media_folder, media_filename + '.part'), join(self.media_folder, media_filename))
+            except Exception, e:
+                sys.stderr.write('Error writing the media file: %s' % e)
+                return ''
+            media_response.close()
+        except (urllib2.URLError, urllib2.HTTPError, socket.error):
+            return ''
+        return _url(media_filename)
+
+
     def get_image_url(self, image_url, offset):
         """Saves an image if not saved yet. Returns the new URL or
         the original URL in case of download errors."""
@@ -567,7 +698,7 @@
             image_filename = account + '_' + self.ident + offset
         else:
             image_filename = image_url.split('/')[-1]
-        glob_filter = '' if '.' in image_filename else '.*'
+        glob_filter = '' if '.' in image_filename[-4] else '.*'
         # check if a file with this name already exists
         image_glob = glob(join(self.image_folder, image_filename + glob_filter))
         if image_glob:
@@ -582,7 +713,7 @@
             # return the original URL
             return image_url
         # determine the file type if it's unknown
-        if '.' not in image_filename:
+        if '.' not in image_filename[-4]:
             image_type = imghdr.what(None, image_data[:32])
             if image_type:
                 image_filename += '.' + image_type.replace('jpeg', 'jpg')

Tumblr Backup: unknown url type

$ python get.py -t self teddyadair

Exception in thread Thread-5: to 749 of 834                    
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "get.py", line 802, in handler
work()
File "get.py", line 574, in save_content
append_try('photo-caption')
File "get.py", line 550, in append_try
self.get_inline_image, elt
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "get.py", line 661, in get_inline_image
saved_name = self.download_image(image_url, image_filename)
File "get.py", line 678, in download_image
image_response = urllib2.urlopen(image_url, timeout=HTTP_TIMEOUT)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 393, in open
protocol = req.get_type()
File "/usr/lib/python2.7/urllib2.py", line 255, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: teddyadair.tumblr.com

Then hangs on teddyadair: 0 remaining posts to save

[Enhancement] Provide pagination for index pages

It would allow easier navigation when months pages contain many picture posts.
Default value is set at 50 posts per page. It can be overridden with the -N option.
It's been tested in directory mode too.

--- tumblr_backup.py.orig   2014-10-05 11:55:41.019549600 +0200
+++ tumblr_backup_pagination.py 2014-10-06 10:29:59.294856000 +0200
@@ -271,6 +271,7 @@
             for year in sorted(self.index.keys(), reverse=options.reverse_index):
                 self.save_year(idx, year)
             idx.write(u'<p>Generated on %s.</p>\n' % strftime('%x %X'))
+            idx.write(self.footer())

     def save_year(self, idx, year):
         idx.write('<h3>%s</h3>\n<ul>\n' % year)
@@ -283,23 +284,66 @@
             ))
         idx.write('</ul>\n\n')

+    def get_page(self, direction, year, month):
+        list_years = sorted(self.index.keys(), reverse=direction)
+        list_years = list_years[list_years.index(year):]
+        for it_year in list_years:
+            if it_year == year:
+                list_months = sorted(self.index[it_year].keys(), reverse=direction) # false for next page
+                list_months = list_months[list_months.index(month)+1:]
+            else:
+                list_months = sorted(self.index[it_year].keys(), reverse=direction)
+            for it_month in list_months:
+                if len(self.index[it_year][it_month]):
+                    return it_year, it_month
+        return False, False
+
     def save_month(self, year, month, tm):
-        file_name = '%d-%02d' % (year, month)
-        if options.dirs:
-            arch = open_text(archive_dir, file_name, dir_index)
-            file_name += '/'
-        else:
-            file_name += '.html'
-            arch = open_text(archive_dir, file_name)
-        with arch:
-            arch.write('\n\n'.join([
-                self.header(strftime('%B %Y', tm), body_class='archive'),
-                '\n'.join(p.get_post() for p in sorted(
-                    self.index[year][month], key=lambda x: x.date, reverse=options.reverse_month
-                )),
-                '<p><a href=%s rel=contents>Index</a></p>\n' % save_dir
-            ]))
-        return file_name
+        postsperpage = options.number_page if options.number_page >= 1 else len(self.index[year][month])
+        file_stream = '\n\n'.join([self.header(strftime('%B %Y', tm), body_class='archive')])
+        for n, p in enumerate(sorted(self.index[year][month], key=lambda x: x.date, reverse=options.reverse_month), start=1):
+            file_stream += '\n'.join([p.get_post()])
+            if ((n % postsperpage == 0) or (n == len(self.index[year][month]))):
+                page = n / postsperpage if n % postsperpage == 0 else n / postsperpage + 1
+                file_name = '%d-%02d-p%s' % (year, month, page)
+                if options.dirs:
+                    arch = open_text(archive_dir, file_name, dir_index)
+                    file_name += '/%s' % dir_index
+
+                    if page > 1:
+                        pp = '%s%s/%d-%02d-p%s/%s' % (save_dir, archive_dir, year, month, page - 1, dir_index)
+                    else:
+                        previous_year, previous_month = self.get_page(False if options.reverse_month else True, year, month)
+                        pp = '%s%s/%d-%02d-p%s/%s' % (save_dir, archive_dir, previous_year, previous_month, len(self.index[previous_year][previous_month]) / postsperpage + 1, dir_index)  if previous_year else ''
+                    if len(self.index[year][month]) - n > 0:
+                        np = '%s%s/%d-%02d-p%s/%s' % (save_dir, archive_dir, year, month, page + 1, dir_index)
+                    else:
+                        next_year, next_month = self.get_page(True if options.reverse_month else False, year, month)
+                        np = '%s%s/%d-%02d-p%s/%s' % (save_dir, archive_dir, next_year, next_month, 1, dir_index) if next_year else ''
+
+                else:
+                    file_name += '.html'
+                    arch = open_text(archive_dir, file_name)
+
+                    if page > 1:
+                        pp = '%d-%02d-p%s.html' % (year, month, page - 1)
+                    else:
+                        previous_year, previous_month = self.get_page(False if options.reverse_month else True, year, month)
+                        pp = '%d-%02d-p%s.html' % (previous_year, previous_month, len(self.index[previous_year][previous_month]) / postsperpage + 1)  if previous_year else ''
+                    if len(self.index[year][month]) - n > 0:
+                        np = '%d-%02d-p%s.html' % (year, month, page + 1)
+                    else:
+                        next_year, next_month = self.get_page(True if options.reverse_month else False, year, month)
+                        np = '%d-%02d-p%s.html' % (next_year, next_month, 1) if next_year else ''
+
+                file_stream += '\n'.join([self.footer(previous_page=pp,next_page=np)])
+                with arch:
+                    arch.write(file_stream)
+                file_stream = '\n\n'.join([self.header(strftime('%B %Y', tm), body_class='archive')])
+                if page == 1:
+                    first_file = file_name
+
+        return first_file

     def header(self, title='', body_class='', subtitle='', avatar=''):
         root_rel = '' if body_class == 'index' else save_dir
@@ -323,6 +367,19 @@
             h += u'<p class=subtitle>%s</p>\n' % subtitle
         return h

+    def footer(self, previous_page='', next_page=''):
+        f = u'<footer id="footer">'
+        f += u'<div id="pagination">\n'
+        f += u'<a href="%s%s" class="index">Index</a>\n' % (save_dir, dir_index)
+        if ((previous_page) or (next_page)):
+            if (previous_page):
+                f += u'<a href="%s" class="previous">Previous</a>\n' % previous_page
+            if (next_page):
+                f += u'<a href="%s" class="next">Next</a>\n'% next_page
+        f += u'</div>\n'
+        f += u'</footer>\n</body>\n</html>'
+        return f
+
     def backup(self, account):
         """makes single files and an index for every post on a public Tumblr blog account"""

@@ -378,6 +435,7 @@

         # use the meta information to create a HTML header
         TumblrPost.post_header = self.header(body_class='post')
+        TumblrPost.post_footer = self.footer()

         # find the post number limit to back up
         last_post = options.count + options.skip if options.count else int(soup.posts('total'))
@@ -442,6 +500,7 @@
 class TumblrPost:

     post_header = ''    # set by TumblrBackup.backup()
+    post_footer = ''

     def __init__(self, post):
         self.content = ''
@@ -604,13 +663,14 @@
         post = self.post_header + u'<article class=%s id=p-%s>\n' % (self.typ, self.ident)
         post += u'<p class=meta><span class=date>%s</span>\n' % strftime('%x %X', self.tm)
         post += u'<a class=llink href=%s%s/%s>¶</a>\n' % (save_dir, post_dir, self.llink)
-        post += u'<a href=%s rel=canonical>●</a></p>\n' % self.url
+        post += u'<a href="%s">●</a></p>\n' % self.url
         if self.title:
             post += u'<h2>%s</h2>\n' % self.title
         post += self.content
         if self.tags:
             post += u'\n<p class=tags>%s</p>' % u''.join(self.tag_link(t) for t in self.tags)
         post += '\n</article>\n'
+        post += self.post_footer
         return post

     @staticmethod
@@ -760,6 +820,9 @@
     parser.add_option('-p', '--period', help="limit the backup to PERIOD"
         " ('y', 'm', 'd' or YYYY[MM[DD]])"
     )
+    parser.add_option('-N', '--number-page', type='int', default=50,
+        help="set NUMBER_PAGE posts per month page"
+    )
     parser.add_option('-P', '--private', help="password for a private tumblr",
         metavar='PASSWORD'
     )

[Enhancement] Threading

We could improve the backup speed by disconnecting the post fetching from the images saving.
Tell me what you think about it?

TODO: add an option to control the number of threads?
TODO: fix the output messages

--- tumblr_backup.py.orig   2014-10-01 09:57:32.142436800 +0200
+++ tumblr_backup_threading.py  2014-10-01 16:51:47.713677400 +0200
@@ -5,6 +5,8 @@
 from __future__ import with_statement
 import os
 from os.path import join, split, splitext
+import threading
+from threading import Thread, Event
 import sys
 import urllib
 import urllib2
@@ -68,6 +70,7 @@
 have_custom_css = False

 MAX_POSTS = 50
+imagePool = {}

 # ensure the right date/time format
 try:
@@ -199,6 +202,57 @@
         sys.stderr.write('Writing metadata failed for tags: %s in: %s\n' % (tags, image_name))


+def get_image_url(self, image_url, offset):
+   """Saves an image if not saved yet. Returns the new URL or
+   the original URL in case of download errors."""
+
+   def _url(fn):
+       return u'%s%s/%s' % (save_dir, image_dir, fn)
+
+   def _addexif(fn):
+       if options.exif and fn.endswith('.jpg'):
+           add_exif(fn, set(self.tags))
+
+   # determine the image file name
+   offset = '_' + offset if offset else ''
+   if options.image_names == 'i':
+       image_filename = self.ident + offset
+   elif options.image_names == 'bi':
+       image_filename = account + '_' + self.ident + offset
+   else:
+       image_filename = image_url.split('/')[-1]
+   glob_filter = '' if '.' in image_filename else '.*'
+   # check if a file with this name already exists
+   image_glob = glob(join(image_folder, image_filename + glob_filter))
+   if image_glob:
+       _addexif(image_glob[0])
+   # download the image data
+   try:
+       image_response = urllib2.urlopen(image_url)
+   except urllib2.HTTPError:
+       # return the original URL
+       return image_url
+   try:
+       image_data = image_response.read()
+   except urllib2.HTTPError:
+       # return the original URL
+       return image_url
+   image_response.close()
+   # determine the file type if it's unknown
+   if '.' not in image_filename:
+       image_type = imghdr.what(None, image_data[:32])
+       if image_type:
+           image_filename += '.' + image_type.replace('jpeg', 'jpg')
+   # save the image
+   with open_image(image_dir, image_filename) as image_file:
+       image_file.write(image_data)
+   _addexif(join(image_folder, image_filename))
+
+
+def add_to_pool(self, image_url, offset):
+    imagePool[image_url] = [self, offset];
+
+
 def save_style():
     with open_text(backup_css) as css:
         css.write('''\
@@ -396,7 +450,16 @@
                 self.post_count += 1
             return True

-        # Get the XML entries from the API, which we can only do for max 50 posts at once.
+        poolThreads = []
+        quitEvent = Event()
+        for t in range(0,7):
+            ts = SavePool(quitEvent)
+            ts.daemon = True
+            poolThreads.append(ts)
+        for j in poolThreads:
+            j.start()
+
+       # Get the XML entries from the API, which we can only do for max 50 posts at once.
         # Posts "arrive" in reverse chronological order. Post #0 is the most recent one.
         i = options.skip
         while i < last_post:
@@ -426,6 +489,9 @@

         log(account, "%d posts backed up\n" % self.post_count)
         self.total_count += self.post_count
+        quitEvent.set()
+        while (threading.activeCount() > 1):
+            time.sleep(1)


 class TumblrPost:
@@ -481,7 +547,8 @@
             url = escape(get_try('photo-link-url'))
             for p in post.photoset['photo':] if hasattr(post, 'photoset') else [post]:
                 src = unicode(p['photo-url'])
-                append(escape(self.get_image_url(src, p().get('offset'))), u'<img alt="" src="%s">')
+                add_to_pool(self, src, p().get('offset'))
+                append(escape(self.get_image_filename(src, p().get('offset'))), u'<img alt="" src="%s">')
                 if url:
                     content[-1] = u'<a href="%s">%s</a>' % (url, content[-1])
                 content[-1] = '<p>' + content[-1] + '</p>'
@@ -541,18 +608,7 @@
         for p in ('<p>(<(%s)>)', '(</(%s)>)</p>'):
             self.content = re.sub(p % 'p|ol|iframe[^>]*', r'\1', self.content)

-    def get_image_url(self, image_url, offset):
-        """Saves an image if not saved yet. Returns the new URL or
-        the original URL in case of download errors."""
-
-        def _url(fn):
-            return u'%s%s/%s' % (save_dir, image_dir, fn)
-
-        def _addexif(fn):
-            if options.exif and fn.endswith('.jpg'):
-                add_exif(fn, set(self.tags))
-
-        # determine the image file name
+    def get_image_filename(self, image_url, offset):
         offset = '_' + offset if offset else ''
         if options.image_names == 'i':
             image_filename = self.ident + offset
@@ -560,30 +616,8 @@
             image_filename = account + '_' + self.ident + offset
         else:
             image_filename = image_url.split('/')[-1]
-        glob_filter = '' if '.' in image_filename else '.*'
-        # check if a file with this name already exists
-        image_glob = glob(join(image_folder, image_filename + glob_filter))
-        if image_glob:
-            _addexif(image_glob[0])
-            return _url(split(image_glob[0])[1])
-        # download the image data
-        try:
-            image_response = urllib2.urlopen(image_url)
-        except urllib2.HTTPError:
-            # return the original URL
-            return image_url
-        image_data = image_response.read()
-        image_response.close()
-        # determine the file type if it's unknown
-        if '.' not in image_filename:
-            image_type = imghdr.what(None, image_data[:32])
-            if image_type:
-                image_filename += '.' + image_type.replace('jpeg', 'jpg')
-        # save the image
-        with open_image(image_dir, image_filename) as image_file:
-            image_file.write(image_data)
-        _addexif(join(image_folder, image_filename))
-        return _url(image_filename)
+        return u'%s%s/%s' % (save_dir, image_dir, image_filename + image_url[-4:])
+

     def get_post(self):
         """returns this post in HTML"""
@@ -621,6 +655,21 @@
                 f.write(self.xml_content)


+class SavePool(threading.Thread):
+    def __init__(self, quit):
+        threading.Thread.__init__(self)
+        self.quit = quit
+    def run(self):
+        imagecounter = 0
+        while not self.quit.isSet() or imagePool:
+            if imagePool:
+                key, value = imagePool.popitem()
+                get_image_url(value[0],key,value[1])
+                log(account, "%d images remaining to save\r" % (len(imagePool)))
+                imagecounter += 1 
+        log(account, "%d images backed up\n" % imagecounter)
+
+
 class BlosxomPost(TumblrPost):

     def get_image_url(self, image_url, offset):

tumblr_backup: option -p doesn't skip image download

When using -p to backup a past period, images more recent than the end of the period are also downloaded.

The problem is that get_image_url() is called on post creation already. It should be called later when we know whether the post should be saved.

Suggest deps in tumblr_backup.md for expanded usage or requirement for contribs to diagnose problems

Example:

pip
youtube-dl from pip, as ubuntus is stale
python-pyexiv2 from OS repository unless you want to pull that from pip as well.
Suggest a pip upgrade for all packages (pip doesn't make this easy but perhaps a way can be suggested).

I'd pull an update with the howto but figured it be better you write it. Your thoughts would be much appreciated on this.

Tumblr Backup: InvalidURL: nonnumeric port: ''

When doing python get.py -x -t me the-ghoulnextdoor (NSFW) hangs with..

Exception in thread Thread-1:sts 350 to 399 of 430
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "get.py", line 802, in handler
    work()
  File "get.py", line 560, in save_content
    append_try('regular-body')
  File "get.py", line 550, in append_try
    self.get_inline_image, elt
  File "/usr/lib/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
  File "get.py", line 661, in get_inline_image
    saved_name = self.download_image(image_url, image_filename)
  File "get.py", line 678, in download_image
    image_response = urllib2.urlopen(image_url, timeout=HTTP_TIMEOUT)
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1172, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1118, in do_open
    h = http_class(host, timeout=req.timeout) # will parse host:port
  File "/usr/lib/python2.6/httplib.py", line 657, in __init__
    self._set_hostport(host, port)
  File "/usr/lib/python2.6/httplib.py", line 682, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
InvalidURL: nonnumeric port: ''

the-ghoulnextdoor: 0 remaining posts to save

Proposal: add exif tags to pictures

Hello,

The following proposal is a very specific use case of mine, so you probably won't accept it, but maybe someone else will find it useful.
The proposed patch adds two functions:

it includes the post tags as IPTC Keywords inside the post' images
it adds a command line option, "--exif", to add global keywords to all images

The goal is to keep the content creator(s) credits attached to the image.
It depends on pyexiv2: http://tilloy.net/dev/pyexiv2/overview.html

--- tumblr_backup.orig.py   2014-04-21 12:55:40.198076300 +0200
+++ tumblr_backup.py    2014-04-21 20:42:50.986345700 +0200
@@ -19,6 +19,7 @@

 # extra required packages
 import xmltramp
+import pyexiv2

 join = os.path.join

@@ -127,15 +128,39 @@
         return None
     return doc if doc._name == 'tumblr' else None

-def save_image(image_url):
+def add_exif(image_url, post):
+    try:
+        metadata = pyexiv2.ImageMetadata(image_url)
+        metadata.read()
+    except:
+        sys.stdout.write('Error reading metadata for image %s' % image_url)
+        return
+    try:
+        previous_tags = metadata['Iptc.Application2.Keywords'].value
+    except:
+        previous_tags = []
+    tags = post.tags + previous_tags
+    if options.exif: tags += options.exif
+    tags = list(set([item.lower() for item in tags]))
+    metadata['Iptc.Application2.Keywords'] = pyexiv2.IptcTag('Iptc.Application2.Keywords', tags)
+    try:
+        metadata.write()
+    except:
+        sys.stdout.write('Metadata failed for tags: %s' % tags)
+
+def save_image(image_url, post):
     """saves an image if not saved yet, returns the local file name"""
     def _url(fn):
         return u'../%s/%s' % (image_dir, fn)
+    def _addexif(fn):
+       imagepath = glob(join(image_folder, image_filename))[0]
+       if imghdr.what(imagepath) == 'jpeg': add_exif(imagepath, post)
     image_filename = image_url.split('/')[-1]
     glob_filter = '' if '.' in image_filename else '.*'
     # check if a file with this name already exists
     image_glob = glob(join(image_folder, image_filename + glob_filter))
     if image_glob:
+        _addexif(image_filename + glob_filter)
         return _url(os.path.split(image_glob[0])[1])
     # download the image data
     try:
@@ -153,6 +178,7 @@
     # save the image
     with open_image(image_dir, image_filename) as image_file:
         image_file.write(image_data)
+    _addexif(image_filename)
     return _url(image_filename)

 def save_style():
@@ -483,7 +509,7 @@
             self.content = re.sub(p % 'p|ol|iframe[^>]*', r'\1', self.content)

     def get_image_url(self, url):
-        return save_image(url)
+        return save_image(url, self)

     def get_post(self):
         """returns this post in HTML"""
@@ -552,6 +578,9 @@
         value = value.replace('text', 'regular').replace('chat', 'conversation').replace('photoset', 'photo')
         setattr(parser.values, option.dest, value.split(','))

+    def exif_callback(option, opt, value, parser):
+        setattr(parser.values, option.dest, value.split(','))
+
     parser = optparse.OptionParser("Usage: %prog [options] blog-name ...",
         description="Makes a local backup of Tumblr blogs."
     )
@@ -596,6 +625,9 @@
     parser.add_option('-T', '--type', type='string', action='callback',
         callback=type_callback, help="save only posts of type TYPE (comma-separated values)"
     )
+    parser.add_option('-e', '--exif', type='string', action='callback',
+        callback=exif_callback, help="add EXIF data to each picture (comma-separated values)"
+    )
     options, args = parser.parse_args()

     if options.auto is not None:

Best regards,
— Clare

Error when saving images with long query string

Might make sense to cut off filenames before a question mark.

Traceback of error when backing up workstream-piccolbo.tumblr.com:

Exception in thread Thread-10:osts 200 to 232 of 233                    
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/noah/Desktop/.backup_tumblr_temp/tumblr_backup", line 812, in handler
    work()
  File "/Users/noah/Desktop/.backup_tumblr_temp/tumblr_backup", line 567, in save_content
    append_try('regular-body')
  File "/Users/noah/Desktop/.backup_tumblr_temp/tumblr_backup", line 557, in append_try
    self.get_inline_image, elt
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/Users/noah/Desktop/.backup_tumblr_temp/tumblr_backup", line 670, in get_inline_image
    saved_name = self.download_image(image_url, image_filename)
  File "/Users/noah/Desktop/.backup_tumblr_temp/tumblr_backup", line 698, in download_image
    with open_image(self.image_dir, image_filename) as image_file:
  File "/Users/noah/Desktop/.backup_tumblr_temp/tumblr_backup", line 130, in open_image
    return open_file(lambda f: open(f, 'wb'), parts)
  File "/Users/noah/Desktop/.backup_tumblr_temp/tumblr_backup", line 120, in open_file
    return open_fn(path_to(*parts))
  File "/Users/noah/Desktop/.backup_tumblr_temp/tumblr_backup", line 130, in <lambda>
    return open_file(lambda f: open(f, 'wb'), parts)
IOError: [Errno 63] File name too long: u'/Users/noah/Desktop/workstream-piccolbo_2015-04-14/images/__utm.gif?utmwv=1&amp;utmn=932389005&amp;utmdt=Showing%20results%201%20through%2014%20%28of%2014%20total%29%20for%20&amp;utmhn=page2rss.com&amp;utmp=%2Fbb32164fa751392ded0a0cdc62b5f368%2F5648248%5F5703022%2Fshowing%2Dresults%2Dthrough%2Dof%2Dtotal%2Dfor%2D&amp;utmr=-&amp;utmac=UA-516402-1&amp;utmcc=__utma%3D155599162.1714606622.1320251460.1320251460.1316894.23B%2B__utmb%3D155599162%3B%2B__utmc%3D155599162%3B%2B__utmz%3D155599162.1320251460.1.1.utmccn%3D(feed)%7Cutmcsr%3Dhttp%3A%2F%2Farxiv%2Eorg%2Ffind%2Fgrp%5Fcs%2F1%2Fall%3A%2Bhadoop%2F0%2F1%2F0%2Fall%2F0%2F1%7Cutmcmd%3Drss%3B%2B.gif'

workstream-piccolbo: 233 posts backed up

Pagination issue

There's a problem with the current pagination algorithm.

In reverse mode, everything is fine.

In default order (reverse chronological):

on the first page, the latest posts of the month are displayed
- clicking next goes earlier in the month
- clicking previous goes to the latest day of the previous month, but it should logically go the first day of next month (i.e. the last page of the next month, or nothing if it's the most recent month).
on the last page, the earliest posts of the month are displayed
- clicking previous goes later in the month
- clicking next goes to latest day of the next month, whereas it should go to the last day of the previous month (i.e. the first page of the previous month, or nothing if it's the oldest month available).

Otherwise it breaks the "time continuum".

Exception in thread errors

Errors while doing python tumblr_backup.py -x -t nude fletchingboo (nsfw)

Exception in thread Thread-9:00 to 249 of 916
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(_self.__args, *_self.__kwargs)
File "tumblr_backup.py", line 802, in handler
work()
File "tumblr_backup.py", line 574, in save_content
append_try('photo-caption')
File "tumblr_backup.py", line 550, in append_try
self.get_inline_image, elt
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "tumblr_backup.py", line 661, in get_inline_image
saved_name = self.download_image(image_url, image_filename)
File "tumblr_backup.py", line 678, in download_image
image_response = urllib2.urlopen(image_url, timeout=HTTP_TIMEOUT)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 393, in open
protocol = req.get_type()
File "/usr/lib/python2.7/urllib2.py", line 255, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: www.fletchingboo.tumblr.com

Exception in thread Thread-20:0 to 299 of 916
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(_self.__args, *_self.__kwargs)
File "tumblr_backup.py", line 802, in handler
work()
File "tumblr_backup.py", line 574, in save_content
append_try('photo-caption')
File "tumblr_backup.py", line 550, in append_try
self.get_inline_image, elt
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "tumblr_backup.py", line 661, in get_inline_image
saved_name = self.download_image(image_url, image_filename)
File "tumblr_backup.py", line 689, in download_image
with open_image(self.image_dir, image_filename) as image_file:
File "tumblr_backup.py", line 128, in open_image
return open_file(lambda f: open(f, 'wb'), parts)
File "tumblr_backup.py", line 118, in open_file
return open_fn(path_to(*parts))
File "tumblr_backup.py", line 128, in
return open_file(lambda f: open(f, 'wb'), parts)
IOError: [Errno 21] Is a directory: u'/fletchingboo/images/'

[BUG] Extension checking

If one user backups a blog using a domain name (sample.net), while using the option 'bi'=_, then the condition testing for a file extension will always be true:

--- tumblr_backup.py.orig   2014-10-07 08:59:34.413053500 +0200
+++ tumblr_backup_fixextension.py   2014-10-07 10:03:36.039336300 +0200
@@ -587,7 +587,7 @@
             image_filename = account + '_' + self.ident + offset
         else:
             image_filename = image_url.split('/')[-1]
-        glob_filter = '' if '.' in image_filename else '.*'
+        glob_filter = '' if '.' in image_filename[-4] else '.*'
         # check if a file with this name already exists
         image_glob = glob(join(self.image_folder, image_filename + glob_filter))
         if image_glob:
@@ -602,7 +602,7 @@
             # return the original URL
             return image_url
         # determine the file type if it's unknown
-        if '.' not in image_filename:
+        if '.' not in image_filename[-4]:
             image_type = imghdr.what(None, image_data[:32])
             if image_type:
                 image_filename += '.' + image_type.replace('jpeg', 'jpg')

Add option for tumblr_backup to save to subdirectories, based on year/month of post?

As a user backing up a big blog, that goes back a few years, I find that I end up with a directory that chokes ls, Finder, and Windows Explorer. I'd like to be able to easily browse the backup with a directory structure that is aware of the year/month of the post.

(complication: compatibility with the -D flag)

Option -s causes unexpected hang

On a blog that has 145 posts, the following command creates 45 folders, then hangs saying Downloading posts 145 to 194 of 145:
$ python2 tumblr_backup.py -D -s 100 ask-twilight-and-trixie (ponies blog)

This does not happen when the -s 100 is left out (it creates 145 folders, then exits normally). Ctrl^C'ing gives the following traceback:

Traceback (most recent call last):
  File "tumblr_backup.py", line 739, in <module>
    tb.backup(account)
  File "tumblr_backup.py", line 390, in backup
    soup = xmlparse(base, j - i, i)
  File "tumblr_backup.py", line 147, in xmlparse
    resp = urllib2.urlopen(url)
[... similar lines from urllib2.py, httplib.py, socket.py ...]
KeyboardInterrupt

Crash

./tumblr_backup.py --no-reblog --save-video -I o -t ME,SELF,US -O seattle255 seattle255

Exception in thread Thread-5:to 99 of 715
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-18:to 149 of 715
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-16:to 249 of 715
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-8:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-1: to 299 of 715
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-7: to 399 of 715
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-19:to 449 of 715
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-9:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-12:to 499 of 715
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-11:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-10:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-20:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-13:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-15:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-4: to 549 of 715
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-6:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-14:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

Exception in thread Thread-17:to 599 of 715
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "./tumblr_backup.py", line 892, in handler
work()
File "./tumblr_backup.py", line 609, in save_content
url = post['permalink_url'] if post['html5_capable'] else post['source_url']
KeyError: 'source_url'

seattle255: Getting posts 700 to 714 of 715

Tumblr Backup: Dashes in tags

Doing python tumblr_backup.py -t muh-face ixnay-on-the-oddk

Doesn't get images/posts tagged muh-face I guess it's because of the included dash?

NSFW: http://ixnay-on-the-oddk.tumblr.com/tagged/muh-face

missing URLs that really are there (mixed work or 404) in api-v2 (NSFW links)

Am I being throttled? Because,

http://41.media.tumblr.com/0ffc03585544d18d5751844373432b45/tumblr_nmbeb2BSx71rtx758o1_1280.png

Works, some of the others do too, but most don't.

Tumblr backup: unicode error

Hello,

Just a small bug:

Traceback (most recent call last):of 4300
  File "tumblr_backup.py", line 740, in <module>
    tb.backup(account)
  File "tumblr_backup.py", line 395, in backup
    if not _backup(posts):
  File "tumblr_backup.py", line 377, in _backup
    post.save_post()
  File "tumblr_backup.py", line 598, in save_post
    f.write(self.get_post())
  File "tumblr_backup.py", line 577, in get_post
    post += '<h2>%s</h2>\n' % self.title
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 16: ordinal
 not in range(128)

I haven't got time to make a patch right now, but I fixed it with:

            post += u'<h2>%s</h2>\n' % self.title

Provide a way to limit the pool

The workers are a lot slower than what feeds the queue. On a decent sized blog of dozens of thousand of posts, the queue will use several hundred MB of memory.
Limiting the queue to 1000 will pause the process of getting posts from Tumblr.

--- tumblr_backup.py.orig   2014-10-05 11:55:41.019549600 +0200
+++ tumblr_backup_poollimit.py  2014-10-06 10:37:31.151061500 +0200
@@ -676,7 +676,7 @@
 class ThreadPool:

     def __init__(self, count=20):
-        self.queue = Queue.Queue()
+        self.queue = Queue.Queue(1000)
         self.quit = threading.Event()
         self.threads = [threading.Thread(target=self.handler) for _ in range(count)]
         for t in self.threads:
@@ -698,6 +698,8 @@
                 if self.quit.is_set():
                     break
             else:
+                if (self.quit.is_set()):
+                    log(account, "%d remaining posts to save\r" % self.queue.qsize())
                 work()
                 self.queue.task_done()

Passworded blogs no longer work

./tumblr_backup.py --no-reblog --save-video -I o -t me,Me,ME -P REDACTED -O REDACTED REDACTED

tumblr_backup.py: error: no such option: -P

Enhancement: support per type and per tags backup

Hello,

I'd like to be able to use Tumblr backup to save posts with a specific type or specific tags.

I've tried to do it myself but my knowledge in coding is kind of non-existent.

--- tumblr-utils-master\tumblr_backup.py    Mon Nov 04 18:44:34 2013
+++ tumblr_backup.py    Mon Dec 23 13:29:58 2013
@@ -218,6 +218,13 @@
             f.write(css + '\n')


+def tags_callback(option, opt, value, parser):
+   setattr(parser.values, option.dest, value.split(','))
+   
+def type_callback(option, opt, value, parser):
+   value = value.replace('text','regular').replace('chat','conversation').replace('photoset','photo')
+   setattr(parser.values, option.dest, value.split(','))
+           
 class TumblrBackup:

     def __init__(self):
@@ -339,6 +348,12 @@
                         continue
                     if post.date < p_start:
                         return False
+                if options.type:
+                    if not post.typ in options.type:
+                       continue
+                if options.tags:
+                    if not set(options.tags).intersection(post.tags):
+                       continue
                 post.generate_content()
                 if post.error:
                     sys.stderr.write('%s%s\n' % (post.error, 50 * ' '))
@@ -386,7 +401,7 @@
         self.date = int(post('unix-timestamp'))
         self.tm = time.localtime(self.date)
         self.title = ''
-        self.tags = []
+        self.tags = ['%s' % t for t in post['tag':]]
         self.file_name = self.ident + post_ext
         self.error = None

@@ -456,7 +471,7 @@

         elif self.typ == 'answer':
             self.title = post.question
-            append(post.answer)
+            append_try('post.answer')

         elif self.typ == 'conversation':
             self.title = get_try('conversation-title')
@@ -564,6 +579,12 @@
         help="do a full backup at HOUR hours, otherwise do an incremental backup"
         " (useful for cron jobs)"
     )
+    parser.add_option('-t', '--tags', type='string', action='callback',
+        callback=tags_callback, help="save only posts tagged TAGS (comma-separated values)"
+   )
+    parser.add_option('-T', '--type', type='string', action='callback',
+        callback=type_callback, help="save only posts of type TYPE (comma-separated values)"
+   )
     parser.add_option('-n', '--count', type='int', help="save only COUNT posts")
     parser.add_option('-s', '--skip', type='int', default=0,
         help="skip the first SKIP posts"

Enhancement: "complex requests"

In order to fetch specific tags per post types:

--- tumblr_backup_apiv2.py.orig 2014-10-08 16:05:11.940811700 +0200
+++ tumblr_backup_apiv2_request.py  2014-10-08 19:06:12.277448300 +0200
@@ -74,6 +74,7 @@
     'text', 'quote', 'link', 'answer', 'video', 'audio', 'photo', 'chat'
 )
 POST_TYPES_SET = frozenset(POST_TYPES)
+POST_ANY_TYPES_SET = frozenset(POST_TYPES +('any',))

 MAX_POSTS = 50

@@ -454,6 +455,20 @@
                         continue
                     if post.date < options.p_start:
                         return False
+                if options.request:
+                    if ((post.typ in options.request) or ('any' in options.request)):
+                        if post.typ in options.request:
+                            if ((len(options.request[post.typ])) and (not set(options.request[post.typ]) & post.tags_lower)):
+                                if 'any' in options.request:
+                                    if ((len(options.request['any'])) and (not set(options.request['any']) & post.tags_lower)):
+                                        continue
+                                else:
+                                    continue
+                        else:
+                            if ((len(options.request['any'])) and (not set(options.request['any']) & post.tags_lower)):
+                                continue
+                    else:
+                        continue
                 if options.tags and not options.tags & post.tags_lower:
                     continue
                 if options.type and post.typ not in options.type:
@@ -514,7 +529,7 @@
         self.tm = time.localtime(self.date)
         self.title = ''
         self.tags = post['tags']
-        if options.tags:
+        if options.tags or options.request:
             self.tags_lower = set(t.lower() for t in self.tags)
         self.file_name = join(self.ident, dir_index) if options.dirs else self.ident + post_ext
         self.llink = self.ident if options.dirs else self.file_name
@@ -776,7 +791,17 @@
         if not types <= POST_TYPES_SET:
             parser.error("--type: invalid post types")
         setattr(parser.values, option.dest, types)
-
+    def request_callback(option, opt, value, parser):
+        raw_request = value.lower().split(';')
+        request = {}
+        for elt in raw_request:
+            if ':' in elt:
+                request.setdefault(elt.split(':')[0], elt.split(':')[1].split(','))
+            else:
+                request.setdefault(elt, '')
+        if not set(request.keys()) <= POST_ANY_TYPES_SET:
+            parser.error("--request: invalid post types")
+        setattr(parser.values, option.dest, request)
     parser = optparse.OptionParser("Usage: %prog [options] blog-name ...",
         description="Makes a local backup of Tumblr blogs."
     )
@@ -825,6 +850,10 @@
     )
     parser.add_option('-P', '--private', help="password for a private tumblr",
         metavar='PASSWORD'
+    )    
+    parser.add_option('-Q', '--request', type='string', action='callback',
+        callback=request_callback, help="Complex backup request. TYPE:TAGS;TYPE2:TAG,TAG;TYPE"
+        " case-insensitive)"
     )
     parser.add_option('-t', '--tags', type='string', action='callback',
         callback=tags_callback, help="save only posts tagged TAGS (comma-separated values;"

Tumblr Backup: Source dumped to screen

Running python get.py -x -t me eggkid is get https://i.imgur.com/pCDUuJy.png

Seems to complete okay, just not seen it do this before.

Bug: SAXParseException('no element found',)

I archive thousands of blogs and I'm always running into SAXParseException('no element found',) a fix for this would be great much appreciated.

Should be.. reproducible with python tumblr_backup.py -t me getsmewett (nsfw blog)

text/xml 'OK'etting posts 1200 to 1249 of 19915                    
SAXParseException('no element found',)

Tumblr Backup: Dashes in blog-name

Example python tumblr_backup.py -t me -skynet where -skynet is the blog-name.

Throws..

Usage: tumblr_backup.py [options] blog-name ...

tumblr_backup.py: error: option -s: invalid integer value: 'kynet'

Bug: backup of specified tags should be case insensitive

Currently, the function used to check for a tag is case sensitive:

                if options.tags and not options.tags.intersection(post.tags):
                    continue

Instead, the command line tags should be lowered and compared to lowered post.tags:

--- tumblr_backup.orig.py   2014-04-25 13:11:04.077440200 +0200
+++ tumblr_backup.py    2014-04-25 13:12:03.753853500 +0200
@@ -381,7 +381,7 @@
                         continue
                     if post.date < options.p_start:
                         return False
-                if options.tags and not options.tags.intersection(post.tags):
+                if options.tags and not options.tags.intersection((x.lower() for x in post.tags)):
                     continue
                 if options.type and not post.typ in options.type:
                     continue
@@ -606,7 +606,7 @@
     import optparse

     def tags_callback(option, opt, value, parser):
-        setattr(parser.values, option.dest, set(value.split(',')))
+        setattr(parser.values, option.dest, set(value.lower().split(',')))

     def type_callback(option, opt, value, parser):
         value = value.replace('text', 'regular').replace('chat', 'conversation').replace('photoset', 'photo')

Blog archives missing video data locally (NSFW links in issue)

./tumblr_backup.py -I o -t me,Me,ME -O naked-yogi naked-yogi

This will grab all the selected posts, it will grab posts where there was a video but not the tumblr-posted (or any, really) video file making it an incomplete archive.

It misses the video in this: http://naked-yogi.tumblr.com/post/120266402218/naked-yogi-1-2-3-4-hearts-lost

I don't know how you want me to demonstrate that the script doesn't pull the video, but, it doesn't.

A possible solution:

Detect youtube-dl installation, if found use that to do the backend work of doing the video ripping saving them to .../$blog-name/videos, and if not found print to STDOUT or log at least, the URL(s) of pages with a tumblr-sourced video so that we can go back later.

This way all you would have to do is have youtube-dl do the work of downloading the video, save it to .../$blog-name/videos and then simply re-link the videos source in the file to .../$blog-name/videos/file.mp4 in the particular post.

So the command to get that video file would be:

youtube-dl http://naked-yogi.tumblr.com/post/120266402218/naked-yogi-1-2-3-4-hearts-lost

all that needs to happen is for tumblr_backup.py to recognize when theres a video post on that blog, and act accordingly using some of the suggestions above.

Blog names with dashses in them causing seizures in tumblr_backup.py also possible DNS problems

`<urlopen error [Errno -2] Name or service not known> getting http://api.tumblr.com/v2/blog/p-e-n-e-l-o-p-e-m-a-c-h-i-n-e.tumblr.com/posts?api_key=8YUsKJvcJxo2MDwmWMDiXZGuMuIbeCwuQGP5ZHSEA4jBJPMnJT&limit=1&offset=8``

snip

HTTP Error 408: Request Time-out getting http://api.tumblr.com/v2/blog/p-e-n-e-l-o-p-e-m-a-c-h-i-n-e.tumblr.com/posts?api_key=8YUsKJvcJxo2MDwmWMDiXZGuMuIbeCwuQGP5ZHSEA4jBJPMnJT&limit=1&offset=8

snip

p-e-n-e-l-o-p-e-m-a-c-h-i-n-e: Getting posts 8 to 8 of 9

I'm seeing name resolution errors. I run NSCD and use OpenDNS on my router for resolution.

ValueError: unknown url type: //i.imgflip.com/8kv0m.gif

Got this error while doing python tumblr_backup.py -t me nikkiediamond

Traceback (most recent call last):
 File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
 self.run()
 File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
 File "tumblr_backup.py", line 799, in handler
work()
File "tumblr_backup.py", line 559, in save_content
append_try('regular-body')
File "tumblr_backup.py", line 549, in append_try
self.get_inline_image, elt
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "tumblr_backup.py", line 658, in get_inline_image
saved_name = self.download_image(image_url, image_filename)
File "tumblr_backup.py", line 675, in download_image
image_response = urllib2.urlopen(image_url, timeout=HTTP_TIMEOUT)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 393, in open
protocol = req.get_type()
File "/usr/lib/python2.7/urllib2.py", line 255, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: //i.imgflip.com/8kv0m.gif

[BUG] If user asks to retrieve more posts than the blog contains

If options.count is superior to blog['posts'], the program hangs:

        # find the post number limit to back up
        last_post = options.count + options.skip if options.count else blog['posts']

It should be corrected to:

        # find the post number limit to back up
        last_post = options.count + options.skip if (options.count and options.count  + options.skip < blog['posts']) else blog['posts']

(this is the code for APIv2)

[BUG] Threading and Keyboard interruption

Hello,

The new Pooling system works fantastically, but the quitting the Main loop with a Keyboard interrupt leaves the child threads alive. I used to launch them as Daemon in my previous patch, i.e. they die if the main process ends.

Side note: I've converted the script to use the Tumblr API v2 because I needed to access some tags not available on v1 (i.e. source of the posts and video file url mainly). I'm not sure if such modifications interest you or not. Let me know if that's the case.

[Enhancement] Provide an option to skip images download

Image downloading is space and time consuming; since Tumblr always keeps pictures on its servers even after a blog's deletion, bypassing the image saving process would allow users to quickly keep the skeleton of their blogs. The backup would then load the missing images directly from Tumblr servers.

--- tumblr_backup.py.orig   2014-10-01 09:57:32.142436800 +0200
+++ tumblr_backup_skip.py   2014-10-01 10:24:32.788979900 +0200
@@ -481,7 +481,10 @@
             url = escape(get_try('photo-link-url'))
             for p in post.photoset['photo':] if hasattr(post, 'photoset') else [post]:
                 src = unicode(p['photo-url'])
-                append(escape(self.get_image_url(src, p().get('offset'))), u'<img alt="" src="%s">')
+                if options.skip_images:
+                    append(u'<img alt="" src="%s">' % (escape(src)))
+                else:
+                    append(escape(self.get_image_url(src, p().get('offset'))), u'<img alt="" src="%s">')
                 if url:
                     content[-1] = u'<a href="%s">%s</a>' % (url, content[-1])
                 content[-1] = '<p>' + content[-1] + '</p>'
@@ -689,6 +692,9 @@
     parser.add_option('-i', '--incremental', action='store_true',
         help="incremental backup mode"
     )
+    parser.add_option('-k', '--skip-images', action='store_true',
+        help="Do not save images"
+    )
     parser.add_option('-x', '--xml', action='store_true',
         help="save the original XML source"
     )

Bug: number of posts saved does not match what is requested

Hello,

Since the merge of gh-13, I have found a new bug:

tumblr_backup.py -n 49 account
account: 49 posts backed up

tumblr_backup.py -n 50 account
account: 20 posts backed up

tumblr_backup.py -n 80 account
account: 80 posts backed up

tumblr_backup.py -n 100 account
account: 40 posts backed up

It seems each time the script hits a multiple of 50, it only saves 20 posts.

I thought it was from the enhancements I've proposed, but the culprit seems to be located in this commit ae72b60 :

Filtering out reblogs in a target blog from being archived

Sometimes I only want what the target blog posted, not what they share from others. This would make a dandy flag in the backup script.

Handling socket.error

Sometimes xml = resp.read() or fetching the image data exit with a socket.error which is not currently caught.

--- tumblr_backup.py.orig   2014-10-05 11:55:41.019549600 +0200
+++ tumblr_backup_errorfix.py   2014-10-06 10:52:58.267506700 +0200
@@ -18,6 +18,7 @@
 import time
 import urllib
 import urllib2
+import socket
 from xml.sax import SAXException
 from xml.sax.saxutils import escape

@@ -165,15 +166,19 @@
     url = base + '?' + urllib.urlencode(params)
     for _ in range(10):
         try:
-            resp = urllib2.urlopen(url)
-        except (urllib2.URLError, urllib2.HTTPError) as e:
+            resp = urllib2.urlopen(url, None, 30)
+        except (urllib2.URLError, urllib2.HTTPError, socket.error) as e:
             sys.stderr.write('%s getting %s\n' % (e, url))
             continue
         if resp.info().gettype() == 'text/xml':
             break
     else:
         return None
-    xml = resp.read()
+    try:
+        xml = resp.read()
+    except (urllib2.URLError, urllib2.HTTPError, socket.error) as e:
+        sys.stderr.write('%s getting %s\n' % (e, url))
+        return None
     try:
         doc = xmltramp.parse(xml)
     except SAXException as e:
@@ -409,11 +414,15 @@
         while i < last_post:
             # find the upper bound
             j = min(i + MAX_POSTS, last_post)
-            log(account, "Getting posts %d to %d of %d\r" % (i, j - 1, last_post))
-
-            soup = xmlparse(base, j - i, i)
+            for e in range(3):
+                log(account, "Getting posts %d to %d of %d%s\r" % (i, j - 1, last_post, '' if not e else ', retry ' + str(e)))
+                soup = xmlparse(base, j - i, i)
+                if soup is None:
+                    continue
+                else:
+                    break
             if soup is None:
-                i += 50         # try the next batch
+                i += MAX_POSTS         # try the next batch
                 continue

             posts = soup.posts['post':]
@@ -582,10 +591,10 @@
             return _url(split(image_glob[0])[1])
         # download the image data
         try:
-            image_response = urllib2.urlopen(image_url)
+            image_response = urllib2.urlopen(image_url, None, 30)
             image_data = image_response.read()
             image_response.close()
-        except urllib2.HTTPError:
+        except (urllib2.URLError, urllib2.HTTPError, socket.error):
             # return the original URL
             return image_url
         # determine the file type if it's unknown

Only new content from previous blog rips should be fetched (git-like) on subsiquent runs.

Example, I run this today for the first time:

./tumblr_backup.py -O justgirlythings justgirlythings

In 5 days, I run it again, and it gets all posts again, instead of journaling in /justgirlythings/progress.json (or something like that) when the last time I ran the command so by default it will skip older posts and focus only on whats been missed. This saves on archiving time and traffic generated.

This would be a very handy feature, if you don't find it ideal to make this the default action then something like -n for new posts only (assuming the commands been run before) would suffice. Think of it like resuming a paused download, or more accurately git where only new changes are fetched.

Rip videos, fetch new posts only by default, filter reblogs option

At least download the video file into /videos. Save extension as mp4 or mov, what the extension is is near where the video data link is in the code for video posts.
Only new content from previous rips should be fetched.
Filter reblogs or allow a flag to pull posts from that speciic blog only

bbolli / tumblr-utils Goto Github PK

tumblr-utils's Introduction

tumblr-utils

Notice

License

tumblr-utils's People

Contributors

Stargazers

Watchers

Forkers

tumblr-utils's Issues

Recommend Projects

Recommend Topics

Recommend Org