
Comments (3)

GoogleCodeExporter commented on June 7, 2024
Some thoughts on these ideas:

It makes sense that (a) the initial --listBlocks operation and (b) maintaining 
a zero block bitmap be separate functions (with the former implying the latter).

However, maintaining a zero block bitmap must be optional, because (in the current 
implementation) it costs 1 bit per block, and that could be too big for memory 
in some combinations of small block size and large filesystem. So I think the 
right answer is to add a --zeroTrack flag (or whatever) which is implicitly 
enabled by --listBlocks.
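
For example (just a back-of-the-envelope illustration of the bitmap cost, not code 
from s3backer):

    #include <stdio.h>
    #include <stdint.h>

    /* Illustration only: the zero block bitmap costs 1 bit per block. */
    int
    main(void)
    {
        const uint64_t fs_size = 1ULL << 40;                      /* 1 TiB filesystem */
        const uint64_t block_sizes[] = { 4096, 65536, 1048576 };  /* 4 KiB, 64 KiB, 1 MiB */

        for (int i = 0; i < 3; i++) {
            const uint64_t num_blocks = fs_size / block_sizes[i];
            const uint64_t bitmap_bytes = (num_blocks + 7) / 8;
            printf("block size %8ju -> %ju blocks -> bitmap of %ju bytes\n",
              (uintmax_t)block_sizes[i], (uintmax_t)num_blocks, (uintmax_t)bitmap_bytes);
        }
        return 0;
    }

With 4 KiB blocks a 1 TiB filesystem already needs a 32 MiB bitmap, and the cost 
grows linearly with filesystem size.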

Performing --listBlocks in the background is a good improvement; however, there 
is an inherent race condition which must be handled, namely a previously zero 
block being written right at the same time the --listBlocks response returns 
saying the block is zero. Unless the cache is enabled (which we can't assume), 
you need to somehow track this scenario. E.g., a simple way to do this is to 
track the highest block number written.
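
For instance (a rough sketch of that watermark idea, not s3backer's actual code; 
all of the names are made up):

    #include <pthread.h>
    #include <stdint.h>

    /* Sketch: ignore stale "block is zero" results from a background
     * --listBlocks scan for any block at or below the highest block
     * number written since the scan started. */
    static pthread_mutex_t watermark_mutex = PTHREAD_MUTEX_INITIALIZER;
    static uint64_t highest_block_written;      /* highest block written so far */
    static int any_block_written;               /* zero until the first write */

    /* Called on every block write. */
    void
    note_block_written(uint64_t block_num)
    {
        pthread_mutex_lock(&watermark_mutex);
        if (!any_block_written || block_num > highest_block_written)
            highest_block_written = block_num;
        any_block_written = 1;
        pthread_mutex_unlock(&watermark_mutex);
    }

    /* Called when the background listing reports block_num as zero.
     * Returns 1 only if it is safe to mark the block zero in the bitmap. */
    int
    safe_to_mark_zero(uint64_t block_num)
    {
        int safe;

        pthread_mutex_lock(&watermark_mutex);
        safe = !any_block_written || block_num > highest_block_written;
        pthread_mutex_unlock(&watermark_mutex);
        return safe;
    }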

Regarding changes #3 through #5, this is more troubling to me. You are changing 
the fundamental assumption that S3 is authoritative and s3backer's job is only 
to access and cache the data from S3. Having said that, fortunately there is 
already an easy way to do what you want, i.e., assert the local cache as 
authoritative, which is simply to always specify the `--blockCacheNoVerify` 
flag. It seems that this flag would give you the same net effect you're looking 
for.


Original comment by [email protected] on 24 Oct 2010 at 8:04

  • Changed state: Accepted
  • Added labels: Type-Enhancement
  • Removed labels: Type-Defect


GoogleCodeExporter commented on June 7, 2024
Hi Archie,

I hadn't thought of the memory problem with tracking zero blocks on really 
large filesystems. You're obviously correct about that. I've attached a new 
patch with the --zeroTrack flag you suggested.

On the subject of the race condition with doing --listBlocks in the background, 
my thinking was to grab the HTTP I/O mutex while fetching and parsing each page 
of the list, and to yield to the other block I/O threads between pages. Do you 
think that's sufficient to avoid a race condition?
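
To spell out what I mean (just a sketch; the function names here are made up and 
don't correspond to the real source):

    #include <pthread.h>
    #include <sched.h>

    /* Sketch: hold the shared HTTP I/O mutex while fetching and parsing one
     * page of the S3 object listing, then release it and yield so the other
     * block I/O threads can run before the next page is fetched. */
    extern pthread_mutex_t http_io_mutex;           /* shared with block I/O threads */
    extern int fetch_and_parse_next_page(void);     /* returns 0 when no more pages */

    void
    list_blocks_in_background(void)
    {
        int more_pages;

        do {
            pthread_mutex_lock(&http_io_mutex);
            more_pages = fetch_and_parse_next_page();
            pthread_mutex_unlock(&http_io_mutex);
            sched_yield();                          /* let waiting threads grab the mutex */
        } while (more_pages);
    }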

On the subject of refreshing lost blocks, I think that "the fundamental 
assumption that S3 is authoritative" is one model for using s3backer, but it 
doesn't need to be the only model that makes sense for the tool. If someone is 
only using an s3backer filesystem from a single host, then it is perfectly 
reasonable for the cache to be considered authoritative for that filesystem on 
that host. I wouldn't make this the default behavior, but I think it's a good 
and useful option.

Let me explain a bit more about how I am using s3backer, which should show why 
this makes sense to me (and I think it will to others as well). I already keep 
a complete backup of all of my data on a separate SATA disk that's used for 
nothing but backups. The S3 backup is therefore intended to be used only for 
disaster recovery, e.g., if, God forbid, my house burns down and both my 
primary and backup SATA drives are lost.

Therefore, if I ever actually do need to use the S3 backup to restore data, 
it'll be because the s3backer cache on my local computer is no longer 
accessible, and all of the data in the S3 filesystem *will* need to be correct 
and complete.

The problem is that, with the current architecture, even with 
--blockCacheNoVerify set, there's no guarantee that that will be the case. S3's 
SLA explicitly states that files can be lost, especially in RRS (Reduced 
Redundancy Storage, which is very attractive since it's 33% cheaper than 
standard storage). Once I've written a backup file to the S3 filesystem, 
there's no reason why I would ever look at that file's blocks again 
until/unless I have to restore it. So there is a significant risk of blocks 
disappearing from the S3 filesystem without my knowing about it until I 
actually have to use the data to do a restore.

--blockCacheNoVerify doesn't solve the problem, because it only "papers over" 
missing blocks that happen to be available in the local cache. It doesn't 
repair the damage (i.e., it doesn't put a missing block back onto S3 when it 
finds one). That means that if I ever have to remove my cache, the data is lost 
forever; and if I ever have to use my backup for what it's actually intended 
for, i.e., disaster recovery when my primary computer is completely toasted, it 
won't help, because there won't be a local cache to get the missing blocks from.
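
What I'd want instead is something along these lines (purely a hypothetical 
sketch of the behavior I'm describing; none of these functions exist in 
s3backer as-is):

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical "cache is authoritative" repair: if S3 says a block is
     * missing but the local cache still has it, serve it from the cache and
     * write it back to S3 instead of just papering over the miss. */
    extern int  s3_read_block(uint64_t block_num, void *buf, size_t size);
    extern int  s3_write_block(uint64_t block_num, const void *buf, size_t size);
    extern int  cache_contains(uint64_t block_num);
    extern void cache_read_block(uint64_t block_num, void *buf, size_t size);

    int
    read_block_with_repair(uint64_t block_num, void *buf, size_t block_size)
    {
        int r = s3_read_block(block_num, buf, block_size);

        if (r == ENOENT && cache_contains(block_num)) {
            cache_read_block(block_num, buf, block_size);     /* serve from cache */
            (void)s3_write_block(block_num, buf, block_size); /* repair the S3 copy */
            r = 0;
        }
        return r;
    }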

In short, fixing missing blocks from the cache doesn't make sense when an S3 
filesystem might be used from multiple locations, but it makes a lot of sense 
when it's only intended to be used from one place at a time.

Please let me know your thoughts.

(Would it be better to have this discussion on the s3backer-devel list rather 
than here?)

Original comment by [email protected] on 25 Oct 2010 at 12:15

Attachments:


archiecobbs commented on June 7, 2024

Fixed in 034fff3.

