Parallelism

Currently, codearchiver is not well-suited for parallel execution. In particular, for the Git module, no two repos with overlapping root commits can run at the same time, else there might be duplication between bundles, which (besides wasting storage) could cause trouble later on. Since the root commit overlap is not known beforehand, this effectively means codearchiver cannot be run on the same data collection in parallel.

Parallelisation without sacrificing deduplication (or major overhauls in how the archival works) requires locking on root commits. Specifically, the Git module would have to lock all root commits of a clone before it begins searching bundles and release them only after writing bundle and metadata to storage. This must be a 'lock all or none' logic, i.e. if a process fails to acquire the lock on any of its root commits, it must release any locks it has already acquired. If it didn't, there would be a deadlock potential since two processes might attempt to lock the same root commits in different orders.

Allowing parallel archival of related repositories is a much trickier problem. Some thoughts:

* Don't write singular bundles per repo clone. Instead, process previously unknown commits in chunks. Each chunk would be deduped against already produced chunk bundles from parallel processes. However, chunking the commit lists is probably hard: the chunks need to follow commit ancestry boundaries, else the bundle exclusions get very complicated and could potentially even make restoring nigh on impossible due to circular dependencies.
* Keep singular bundles, but exchange commits between parallel processes. Process A arrives at the bundling step first; it locks on the root commits, calculates its new commits against all previous relevant bundles, then records these in the storage layer as temporarily marked commits. It can now release the lock and start bundling the commits. Process B arrives at the bundling step; it locks, then also considers those temporarily marked commits for its deduping, then continues like process A: marking its new commits, unlocking, bundling. At the end, if there is overlap with the temporarily marked commits, process B must wait until process A finishes successfully, lest it depend on data that was never actually bundled. Once A finishes and writes its bundle to storage, B can add the bundle to its dependencies. The temporarily marked commits would be cleared by process A upon writing the bundle.

It's entirely unclear to me at this time how any of this would apply to other VCS as I don't have anywhere near as much experience with them as with Git. But that's a bridge to cross when #7 and #9 get implemented.
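
As a rough illustration of the 'lock all or none' rule described above, here is a minimal sketch. It is not codearchiver code: `try_lock_all` and `release_all` are hypothetical names, and the in-process `threading.Lock` registry is only a stand-in for whatever cross-process locking the storage layer would actually need.

```python
import threading

# Hypothetical in-process registry: one lock per root commit hash.
# A real implementation would need locks shared between processes.
_registry_guard = threading.Lock()
_root_locks = {}


def _lock_for(root_commit):
    """Return the lock object for a root commit, creating it on first use."""
    with _registry_guard:
        return _root_locks.setdefault(root_commit, threading.Lock())


def try_lock_all(root_commits):
    """Try to lock every root commit; return the held locks, or None.

    If any single lock is unavailable, all locks acquired so far are
    released again, so a caller never waits while holding a partial set.
    """
    held = []
    # dict.fromkeys drops duplicate hashes while keeping their order.
    for commit in dict.fromkeys(root_commits):
        lock = _lock_for(commit)
        if lock.acquire(blocking=False):
            held.append(lock)
        else:
            release_all(held)
            return None
    return held


def release_all(held):
    """Release locks in reverse order of acquisition."""
    for lock in reversed(held):
        lock.release()
```

A caller that gets `None` back would sleep briefly and retry. Because it holds nothing in the meantime, two processes locking the same root commits in different orders cannot block each other indefinitely.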

JustAnotherArchivist added the enhancement label 1 year ago

JustAnotherArchivist referenced this issue from a commit 1 year ago

Support parallel runs against the same storage Closes #15

JustAnotherArchivist closed this issue 1 year ago

The commit above implements the last idea (exchanging commits between parallel processes), but with a simple global lock rather than per-root-commit locks. The commit diff calculation should be fast enough that holding the lock during it is not an issue, and a single lock simplifies the implementation significantly.
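
To make the shape of that approach concrete, here is a minimal in-memory sketch. It is not the code from the commit: `SharedIndex` and its methods are hypothetical, the storage layer is reduced to two dictionaries, and a failed process A is not handled at all.

```python
import threading


class SharedIndex:
    """Hypothetical stand-in for the storage layer's commit bookkeeping."""

    def __init__(self):
        self._lock = threading.Lock()  # the single global lock
        self.bundled = {}              # commit -> name of finished bundle
        self.pending = {}              # commit -> name of bundle in progress
        self.done = {}                 # bundle name -> Event set on completion

    def claim(self, bundle, commits):
        """Under the global lock, split commits into new work and dependencies."""
        with self._lock:
            new, finished_deps, pending_deps = [], set(), set()
            for commit in commits:
                if commit in self.bundled:
                    finished_deps.add(self.bundled[commit])
                elif commit in self.pending:
                    pending_deps.add(self.pending[commit])
                else:
                    new.append(commit)
                    self.pending[commit] = bundle  # temporary mark
            self.done[bundle] = threading.Event()
            return new, finished_deps, pending_deps

    def finish(self, bundle, new):
        """Clear the temporary marks once the bundle has been written."""
        with self._lock:
            for commit in new:
                del self.pending[commit]
                self.bundled[commit] = bundle
        self.done[bundle].set()

    def wait_for(self, bundles, timeout=None):
        """Block until every named bundle is finished; False on timeout."""
        return all(self.done[name].wait(timeout) for name in bundles)
```

The lock is held only while the commit lists are compared in `claim` and while the marks are moved in `finish`. The slow bundling step happens between those calls, outside the lock, which is why a single global lock can be acceptable here.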
