Comments (6)
This is a really nice write-up. Here are a few different thoughts that are not necessarily related to one another.
- The metadata sweep and additional index on the manta table do seem a little worrying given the number of indexes already on that table and the key role it plays in servicing manta requests. We could instead create an entirely new table for the audit information that references the manta table via a foreign key reference to avoid more write/update pressure on the manta table.
- It might be valuable to complement the periodic sweep auditing with some opportunistic auditing. If we relate this to some of the ideas from the Dynamo paper, I would map the sweep auditing to the active anti-entropy protocol they describe, but we could also implement an audit version of read repair to opportunistically audit an individual object asynchronously when it is read and if it has been sufficiently long since the object's last audit. This would also be good because we really care about detecting integrity issues as soon as possible for objects that are being accessed. It's less urgent for archival objects that are never read, and the periodic auditing will suffice for those. This also spreads out the auditing work, so ideally the metadata sweeps produce fewer objects to be audited all at once.
- Could we do reads for the audits from either the sync or async to avoid putting more pressure on the manatee primaries for housekeeping tasks? Perhaps we could default to reading from the sync and fall back to the primary if a certain amount of lag is detected.
- It seems like the auditing would be more efficient if we covered an entire storage zone at once rather than on a per-object basis. What if we determined a mapping of storage zone to (objectId, digest) pairs for all shards as a first step and then transferred the relevant data to each storage zone? The storage zone could use something like md5sum --check (or something similar in Node code) to make a comparison of all the objects in one pass. Alternatively it could be done on a per-shard basis to avoid the step of combining all the results. We already have the metadata about the storage zones to use to compile the mapping, but one problem is that getting at that information in a text column in an SQL query is probably too expensive. We could move to a jsonb column (eventually), or maybe break that information out into a separate table of three columns: objectId, digest, storage zone. Or we just pull the raw_value data along with the objectId and digest information and transform the data in code.
Thanks a lot for taking a close look!
The metadata sweep and additional index on the manta table do seem a
little worrying given the number of indexes already on that table
and the key role it plays in servicing manta requests. We could
instead create an entirely new table for the audit information that
references the manta table via a foreign key reference to avoid more
write/update pressure on the manta table.
Yes, that's an option. There are a few challenges with that:
- How do you maintain the separate table in such a way that it's either guaranteed to be correct (the way storing the audit metadata with the object metadata is) or trivially verifiable? The obvious ways I can think of rely on Muskie correctly identifying certain cases that it doesn't handle today. To the extent possible, I'd prefer that the auditor not rely on the correct functioning of the system, since its whole purpose is to verify that.
- If the implementation relies on a true foreign key, we'd likely need to either modify Moray to support that or else bypass Moray for this subsystem.
- How do you tell whether the auditor is working? (With in-table metadata, we can at least trivially enumerate objects that have never been audited or not recently audited.)
My hope is that we can tune the audit interval so that the cost of the sweeps won't be so significant. Still, the concern is valid. Maybe one of the early steps should be to better quantify the marginal cost.
It might be valuable to complement the periodic sweep auditing with
some opportunistic auditing. If we relate this to some of the ideas
from the Dynamo paper I would map the sweep auditing to the active
anti-entropy protocol they describe, but we could also implement an
audit version of read repair to opportunistically audit an
individual object asynchronously when it is read and if it has been
sufficiently long since the object's last audit.
When an object is read today, if it's missing from one shark, Muskie will at least try one of the other sharks on which it's stored. (It may not log this, though.) As we stream the data out, if the contents don't match the md5 sum that we stored when the object was initially saved, we'll note that as well -- at least in the log. Clients are encouraged to compute the md5 sum as they read the data, and with that, it should be impossible for clients to actually read the wrong data. That's not quite the same as auditing on every GET, though.
Automatic repair may be a useful future extension, though it's worth emphasizing that instances of such corruption should be exceedingly rare, should require multiple failures within a short period, and shouldn't be normalized.
Could we do reads for the audits from either the sync or async to
avoid putting more pressure on the manatee primaries for
housekeeping tasks? Perhaps we could default to reading from the
sync and fallback to the primary if a certain amount of lag is
detected.
That's likely a possibility, though there are a number of minor complexities to deal with, including whether we bypass Moray or we add a read-from-sync option to Moray; we couldn't have multiple auditors working on a single shard (but that's probably fine); and we really want to make sure that doing so wouldn't exacerbate the known problems around lag.
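The sync-with-fallback policy could be sketched roughly like this. The `cluster` shape and the lag-in-bytes accounting are illustrative assumptions, not actual Manatee state, and pickAuditSource is a hypothetical helper:

```javascript
'use strict';

// Sketch: choose which peer the auditor reads from. Default to the sync,
// but fall back to the primary when replication lag exceeds a threshold.
function pickAuditSource(cluster, maxLagBytes) {
    const sync = cluster.peers.find((p) => p.role === 'sync');
    if (sync && sync.lagBytes <= maxLagBytes) {
        return sync;
    }
    return cluster.peers.find((p) => p.role === 'primary');
}
```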
Nice. I'm very excited to have the current audit pipeline behind us. I mostly have questions.
To find the next set of entries to audit, the auditor queries Moray for objects in the "manta" bucket sorted in increasing order of "timeAuditSweep". It would also separately query for objects with null values of "timeAuditSweep". This allows the auditor to balance auditing of least-recently-audited entries and newly-created entries according to whatever parameters it wants.
Does this mean that we would audit objects in chunks? For example, objects with timeAuditSweep between October 23rd and October 24th, then objects with timeAuditSweep between October 25th and October 26th? Would we further limit this by only taking a certain number of objects within that range per query? For example, audit the first 5000 objects with timeAuditSweep between October 23rd and October 24th, then the next 5000, and so on? Without limits like these, I think we would still run into the dead tuple problem you describe later.
Like the auditor, it will have some target period -- perhaps a day or a week -- and seek to scan the whole filesystem in that time. As we gain confidence in the cache's correctness, we can tune this period to be quite long.
Assuming this cache were populated initially from a full file scan, there are only a few operations that would need to modify it: ...
There are some other cases when the cache could get out of sync, which I'm sure we know about. Operator intervention (if we restore missing files, which hopefully never needs to happen, or accidentally/intentionally remove files directly from the sharks), the cruft job, and the rebalance job each would need to somehow update the cache. I think that we'll need some way of manually initiating a partial rescan of the filesystem. Maybe that would involve having the storage auditor take a list of file paths to re-scan.
Could we speed up a scan of the filesystem by having multiple filesystem scanners? From what I understand, upload_mako_ls.sh looks at the filesystem serially, beginning at the root /manta directory, which could be slow.
I would love to trust the storage auditor and have it sweep the filesystem only every week or so. Unfortunately, if we make an operational mistake and accidentally delete objects by running a buggy manta-oneach without realizing it, the audit cache will happily report the objects as present until the next filesystem sweep. Maybe what I'm trying to say is that we can run the high-level audit process continuously, but it will only be as accurate as the audit cache. I think that getting the time between filesystem sweeps as low as is reasonable should be a goal. I hope that I was able to communicate my worries properly. I'm guessing you've considered these issues, so I'm curious what your thoughts are.
It would be nice to hear what you think about the MPU case as well. MPU part files are concatenated into the final object, so the parts disappear. Will Muskie be another component that updates the storage auditor to notify it of removed upload parts?
I think one of the core problems in the audit pipeline is that we don't have easily accessible metadata about files from the makos. I think that a database on each mako is a good idea. Anecdotally, when I first thought of online audit, I thought it would be ideal if we could be directly in the file IO path and update the database before we complete a major file operation (create, modify, delete). I don't know if that's feasible, possible, or efficient.
Thanks for writing this up! I'm excited to read your reply.
Yes, that's an option. There are a few challenges with that:
- How do you maintain the separate table in such a way that it's either guaranteed to be correct (the way storing the audit metadata with the object metadata is) or trivially verifiable? The obvious ways I can think of rely on Muskie correctly identifying certain cases that it doesn't handle today. To the extent possible, I'd prefer that the auditor not rely on the correct functioning of the system, since its whole purpose is to verify that.
- If the implementation relies on a true foreign key, we'd likely need to either modify Moray to support that or else bypass Moray for this subsystem.
- How do you tell whether the auditor is working? (With in-table metadata, we can at least trivially enumerate objects that have never been audited or not recently audited.)
My hope is that we can tune the audit interval so that the cost of the sweeps won't be so significant. Still, the concern is valid. Maybe one of the early steps should be to better quantify the marginal cost.
And to be sure, it's just a theoretical concern at this point. Perhaps it would be premature optimization, and it's probably worth collecting data before going down this path. If we did go that route, I would envision it being baked into Moray rather than a bolted-on bypass. That would probably make it a lot more work, but if that's what we chose to do, then I think changing Moray is the right way to go about it. I am not sure I quite follow the concern about the correctness of the data in a separate table versus the metadata in a single table, but I feel like that's a problem that could be handled with appropriate use of CTEs and transactions to ensure all objects always have a row in the audit table.
When an object is read today, if it's missing from one shark, Muskie will at least try one of the other sharks on which it's stored. (It may not log this, though.) As we stream the data out, if the contents don't match the md5 sum that we stored when the object was initially saved, we'll note that as well -- at least in the log. Clients are encouraged to compute the md5 sum as they read the data, and with that, it should be impossible for clients to actually read the wrong data. That's not quite the same as auditing on every GET, though.
Automatic repair may be a useful future extension, though it's worth emphasizing that instances of such corruption should be exceedingly rare, should require multiple failures within a short period, and shouldn't be normalized.
To be clear, I'm not advocating for actual read-repair, just a similar mechanism for marking a last_audited time in the object metadata in cases where we may already have that information available. The goal is strictly to cut down on how much work the full audit sweeps have to do.
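That opportunistic marking might look something like the following sketch. maybeRecordAudit, the metadata shape, and the one-week staleness threshold are assumptions for illustration; the idea is that after a GET whose streamed contents already matched the stored md5, we stamp last_audited only when it's stale, so the periodic sweep can skip the object:

```javascript
'use strict';

// Sketch: after a GET whose contents matched the stored md5, stamp
// last_audited so the sweep can skip this object. Names and the metadata
// shape are assumptions, not actual Manta metadata fields.
const AUDIT_STALENESS_MS = 7 * 24 * 60 * 60 * 1000; // e.g., one week

function maybeRecordAudit(meta, nowMs) {
    // meta.last_audited: epoch ms of the last audit, or undefined.
    if (meta.last_audited !== undefined &&
        nowMs - meta.last_audited < AUDIT_STALENESS_MS) {
        return false; // audited recently enough; skip the metadata write
    }
    meta.last_audited = nowMs;
    return true; // caller would persist this asynchronously
}
```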
@kellymclaughlin Thanks again for taking a thoughtful look.
I am not sure I quite follow the concern about the correctness of the data in a separate table versus the metadata in a single table, but I feel like that's a problem that could be handled with appropriate use of CTEs and transactions to ensure all objects always have a row in the audit table.
Well, suppose we create a separate manta_audit table that's intended to have one record for each entry in the manta table. We need to ensure that consistency: the record needs to be created when a new object is created or a new link is created to an existing object, and the record needs to be removed when the link is removed. We can use transactions to eliminate races, but that's only one possible source of inconsistency. The number of different bugs related to Muskie's manipulation of metadata (e.g., MANTA-3175, MANTA-3097, MANTA-1852) and other issues with its handling of edge cases (e.g., MANTA-3156, MANTA-3199, MANTA-3200) reinforces my feeling that the audit system cannot assume this sort of metadata has been managed correctly. So I feel like we'd end up needing an auditor auditor that would look for entries in manta that aren't in manta_audit, and I'm not sure how to do that scalably.
By contrast, if the audit state is stored directly with the canonical object metadata (i.e., in the manta table), that structurally eliminates the possibility of an object being left outside the audit process. And if we have an index on a column like timeAudited, then we can very quickly identify whether there are any objects that are visible but not audited within some recent time interval. I'd have a lot higher confidence that we weren't missing objects.
To be clear, I'm not advocating for actual read-repair, just a similar mechanism for marking a last_audited time in the object metadata in cases where we already may have that information available. The goal strictly being to cut down on how much work the full audit sweeps may have to do.
Ah, I see. That sounds like a good idea.
@KodyKantor Thanks for the comments and questions!
Does this mean that we would audit objects in chunks?
Yes, I expect we'd audit in chunks of some fixed number of records.
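A sketch of that chunk selection logic follows. nextAuditChunk and the in-memory record shape are assumptions for illustration; the real implementation would be two separate indexed Moray queries (one for null timeAuditSweep, one sorted ascending), not an in-memory sort:

```javascript
'use strict';

// Sketch: each pass takes up to `limit` records, preferring never-audited
// entries (null timeAuditSweep) and then the least-recently-audited ones.
function nextAuditChunk(records, limit) {
    const never = records.filter((r) => r.timeAuditSweep === null);
    const audited = records
        .filter((r) => r.timeAuditSweep !== null)
        .sort((a, b) => a.timeAuditSweep - b.timeAuditSweep);
    return never.concat(audited).slice(0, limit);
}
```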
There are some other cases when the cache could get out of sync, which I'm sure we know about. Operator intervention (if we restore missing files, which hopefully never needs to happen, or accidentally/intentionally remove files directly from the sharks), the cruft job, and the rebalance job each would need to somehow update the cache. I think that we'll need some way of manually initiating a partial rescan of the filesystem.
Good point. I will update the RFD this afternoon to reflect those other cases.
In terms of partial rescans: I was envisioning that we'd have an API for requesting an update to a particular file (or files). That's how I imagine the updates from mako, GC, and the other jobs would work.
Could we speed up a scan of the filesystem by having multiple filesystem scanners? From what I understand, upload_mako_ls.sh looks at the filesystem serially, beginning at the root /manta directory, which could be slow.
I definitely think we can make it a lot faster by increasing the concurrency of the scanner. I've been prototyping this in Node, and I have a version that scans almost 500,000 objects in about 70 seconds. That includes the time required to invoke stat(2) for each one.
An incremental approach may yield dramatically better results. upload_mako_ls.sh rescans everything from scratch every day. By keeping a database of what we believe is present, we can avoid invoking stat(2) for the large fraction of objects that haven't changed. My prototype takes about 40 seconds to scan the same filesystem without invoking stat(2).
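The incremental idea can be sketched as follows, under the assumption (mine, not the prototype's actual design) that the cache stores the set of names seen on the previous pass: only names never seen before need a stat(2), and names that have vanished need a cache update.

```javascript
'use strict';

// Sketch: diff the current directory listing against the cached view so
// only new names get stat()'d. Data structures are assumptions.
function diffListing(cachedNames, currentNames) {
    const cached = new Set(cachedNames);
    const current = new Set(currentNames);
    return {
        added:   currentNames.filter((n) => !cached.has(n)), // need stat()
        removed: cachedNames.filter((n) => !current.has(n))  // prune cache
    };
}
```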
I would love to trust the storage auditor and have it sweep the filesystem only every week or so. Unfortunately if we make an operational mistake and accidentally delete objects by running a buggy manta-oneach without realizing it, the audit cache will happily report the objects as present until the next filesystem sweep. Maybe what I'm trying to say is that we can run the high-level audit process continuously, but it will only be as accurate as the audit cache. I think that getting the time between filesystem sweeps as low as is reasonable should be a goal. I hope that I was able to communicate my worries properly. I'm guessing you've considered these issues, so I'm curious what your thoughts are.
Yes, this will be an important consideration. Ultimately, we'll have to strike a balance between frequency of audit and resources dedicated to auditing. In the extreme, we could max out the drives and CPUs just doing auditing -- but with significant impact to the data path. And we may still be too far behind to quickly detect an accidental removal. (This could potentially be improved using the FEM API mentioned in the RFD to be notified about filesystem changes, though I'm not yet sure if this will scale well enough to work here.) The period will need to be tunable, and we'll have to find the right period through operational experience.
It would be nice to hear what you think about the MPU case as well. MPU part files are concatenated into the final object, so the parts disappear. Will Muskie be another component that updates the storage auditor to notify it of removed upload parts?
I believe it's mako (running mako-finalize) that removes the parts. I'll add that to the list of cases in the RFD. I think this is not fundamentally very different from the "file create" case, since they both happen in nginx, and we can look at the same options.
I think one of the core problems in the audit pipeline is that we don't have easily accessible metadata about files from the makos. I think that a database on each mako is a good idea. Anecdotally, when I first thought of online audit, I thought it would be ideal if we could be directly in the file IO path and update the database before we complete a major file operation (create, modify, delete). I don't know if that's feasible, possible, or efficient.
The RFD does describe options where we would trigger a database update from nginx on writes. Is that what you're referring to? I'd like to do this, but it needs to be done carefully, as I don't think we want to introduce data path failures when the storage auditor is down and we don't necessarily want to throttle writes at the speed that the auditing database can write.
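One way to decouple the data path from the auditor, sketched below under assumed names and interfaces: buffer write notifications in a bounded in-memory queue and drop the oldest on overflow, so a slow or down auditor never blocks or throttles writes, and the periodic sweep eventually catches anything dropped.

```javascript
'use strict';

// Sketch: fire-and-forget notification queue between the write path and
// the storage auditor. Bounded so writes are never blocked; dropped
// events are recovered later by the periodic filesystem sweep.
class AuditNotifyQueue {
    constructor(maxLength) {
        this.maxLength = maxLength;
        this.events = [];
        this.dropped = 0; // worth alarming on if it grows
    }

    push(event) {
        if (this.events.length >= this.maxLength) {
            this.events.shift(); // drop the oldest rather than block writes
            this.dropped++;
        }
        this.events.push(event);
    }

    drain() {
        const batch = this.events;
        this.events = [];
        return batch;
    }
}
```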
Again, FEM would be another way to detect these changes quickly, but I'm not yet sure how feasible it will turn out to be.