#13 Git submodule support

Open
opened 1 year ago by JustAnotherArchivist · 1 comment

Submodules are stored in the `.gitmodules` file. It contains a URL, which may be relative to the superproject repository. Cf. `gitmodules(5)` for more details.
The submodule’s commit ID is stored directly in the repo where it’s included; it’s a special file mode and contains something like `Subproject commit <commit-id>`.

Ideally, Git repo archives should recursively descend to all submodules that have ever been present in the repo. The commit IDs referenced in the repo should be used as `extraBranches` to ensure those are fetched even if they’re no longer reachable via a ref in the subproject repo.

JustAnotherArchivist added the enhancement label 1 year ago
JustAnotherArchivist added the module:git label 1 year ago

This turns out to be surprisingly tricky.

Firstly, a slight correction: the commit ID is stored in the tree object using just the commit ID (rather than a tree or blob object ID as for normal dirs/files). The `Subproject commit <hex-commit-id>` format is produced by `git diff`. Also, the special file mode is `160000`.
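
To make the tree entry concrete, here is a minimal Python sketch that picks such entries out of `git ls-tree -r -z` output, where a submodule shows up as a line with mode `160000` and object type `commit`. The function names `parse_gitlinks` and `gitlinks_at` are made up for illustration, and the sketch assumes a local clone to run against.

```python
import subprocess

GITLINK_MODE = '160000'  # file mode Git uses for submodule entries


def parse_gitlinks(ls_tree_output):
    """Return (path, commit ID) pairs for submodule entries.

    Expects the bytes printed by `git ls-tree -r -z <commit>`, where each
    NUL-terminated entry looks like b'<mode> <type> <object>\\t<path>'.
    """
    gitlinks = []
    for entry in ls_tree_output.split(b'\0'):
        if not entry:
            continue
        meta, path = entry.split(b'\t', 1)
        mode, obj_type, obj_id = meta.decode('ascii').split(' ')
        if mode == GITLINK_MODE and obj_type == 'commit':
            gitlinks.append((path.decode('utf-8', 'surrogateescape'), obj_id))
    return gitlinks


def gitlinks_at(repo_dir, commit_id):
    """List the submodule entries of one commit in a local clone (hypothetical helper)."""
    result = subprocess.run(
        ['git', 'ls-tree', '-r', '-z', commit_id],
        cwd=repo_dir, check=True, capture_output=True,
    )
    return parse_gitlinks(result.stdout)
```

Running this for every commit is exactly the "walk all trees" approach, so it only illustrates the data layout and is not meant as an efficient solution.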

Collecting the repositories themselves is easy enough. `git log --format=format:%H --diff-filter=d --all -- .gitmodules` returns all commits where the `.gitmodules` file was altered in some way (and not deleted), and then `git cat-file blob ${commitId}:.gitmodules` can be used to retrieve the contents and `git config --file - --get-regexp '\.url$'` to extract the URLs.
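
The three commands above could be chained roughly as follows. This is only a sketch: `parse_config_urls` and `collect_submodule_urls` are hypothetical names, and the `-z` flag on `git config` is an addition not in the commands above (it separates key and value with a newline and entries with NUL, which keeps submodule names containing spaces unambiguous).

```python
import subprocess


def parse_config_urls(config_output):
    """Parse the bytes printed by `git config -z --get-regexp` into {key: value}."""
    urls = {}
    for entry in config_output.split(b'\0'):
        if not entry:
            continue
        # With -z, the key and its value are separated by a newline.
        key, _, value = entry.partition(b'\n')
        urls[key.decode()] = value.decode()
    return urls


def collect_submodule_urls(repo_dir):
    """Collect every submodule URL that ever appeared in .gitmodules (hypothetical helper)."""
    log = subprocess.run(
        ['git', 'log', '--format=format:%H', '--diff-filter=d', '--all', '--', '.gitmodules'],
        cwd=repo_dir, check=True, capture_output=True,
    ).stdout
    urls = set()
    for commit_id in log.decode('ascii').split():
        blob = subprocess.run(
            ['git', 'cat-file', 'blob', f'{commit_id}:.gitmodules'],
            cwd=repo_dir, check=True, capture_output=True,
        ).stdout
        config = subprocess.run(
            ['git', 'config', '--file', '-', '-z', '--get-regexp', r'\.url$'],
            cwd=repo_dir, input=blob, capture_output=True,
        )
        # A non-zero exit (e.g. no matching keys or an unparsable file) is skipped here.
        if config.returncode == 0:
            urls.update(parse_config_urls(config.stdout).values())
    return urls
```

The returned URLs are raw values from `.gitmodules`, so relative ones still need to be resolved against the parent repository.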

Collecting the commits is a whole different beast. My first thought was to walk the trees of all commits, but that is very inefficient. It doesn’t appear to be possible to filter `git log` by file mode, and the file mode is only included in diff/patch output. Attempting to parse that is a horrible idea. One alternative would be an external diff tool (`GIT_EXTERNAL_DIFF=...` or `-c diff.external=...`) which only emits the relevant file mode details, but that still doesn’t fully solve the parsing problem (the commit message would still be shown), and it doesn’t scale to large repositories as it requires one process per modified file for each commit. It might be feasible to modify `git log`/`diff.c` to change its output based on the presence of an environment variable (i.e. omit the commit message and skip running the actual diff; the relevant code for the latter is the `builtin_diff` function).
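
For reference, Git's raw diff format (`git log --raw`) also carries the file modes and object IDs, one line per changed path, and an empty `--format=` suppresses the commit header. Nothing above evaluates this route, so the following Python sketch is only an assumption-laden illustration: `parse_raw_gitlinks` and `collect_gitlink_commits` are hypothetical names, and performance on large repositories is untested.

```python
import subprocess

GITLINK_MODE = '160000'  # file mode Git uses for submodule entries


def parse_raw_gitlinks(raw_output):
    """Extract submodule commit IDs from raw diff lines.

    Each relevant line looks like
    ':<old mode> <new mode> <old id> <new id> <status>\\t<path>'.
    """
    commit_ids = set()
    for line in raw_output.split('\n'):
        # Skip blank lines and combined-diff lines (which start with '::').
        if not line.startswith(':') or line.startswith('::'):
            continue
        fields = line[1:].split('\t', 1)[0].split(' ')
        if len(fields) != 5:
            continue
        _old_mode, new_mode, _old_id, new_id, _status = fields
        if new_mode == GITLINK_MODE:
            commit_ids.add(new_id)
    return commit_ids


def collect_gitlink_commits(repo_dir):
    """Collect all commit IDs ever recorded for submodules in a local clone."""
    result = subprocess.run(
        # -m: also diff merge commits against each parent
        # --no-abbrev: print full object IDs
        # --no-renames: keep one path per line
        ['git', 'log', '--all', '-m', '--raw', '--no-abbrev', '--no-renames', '--format='],
        cwd=repo_dir, check=True, capture_output=True,
    )
    return parse_raw_gitlinks(result.stdout.decode('utf-8', 'replace'))
```

Only the new side of each change is recorded, on the assumption that every commit ID a submodule ever pointed to appears as the new side of some change reachable from a ref.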

Also, submodule repository URLs can be relative, and they’re evaluated relative to *within* the parent repo (cf. `man git-submodule`).
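
A simplified Python sketch of that resolution, to show what "within" means in practice: `./sub` ends up below the parent repo's URL, while each `../` strips one trailing path component first. The function name `resolve_submodule_url` and the `example.org` URLs are made up, and the sketch assumes a URL-style parent remote; scp-like remotes (`host:path`) and local paths are not handled.

```python
def resolve_submodule_url(parent_url, submodule_url):
    """Resolve a possibly relative .gitmodules URL against the parent repo URL."""
    # Only URLs starting with './' or '../' are treated as relative.
    if not submodule_url.startswith(('./', '../')):
        return submodule_url

    base = parent_url.rstrip('/')
    rest = submodule_url
    while True:
        if rest.startswith('./'):
            rest = rest[2:]
        elif rest.startswith('../'):
            rest = rest[3:]
            # Drop the last path component of the parent URL.
            base = base.rsplit('/', 1)[0]
        else:
            break

    resolved = f'{base}/{rest}'
    # Drop a single trailing slash, if any.
    return resolved[:-1] if resolved.endswith('/') else resolved
```

For example, with a parent at `https://example.org/org/parent.git`, the URL `../child.git` resolves to a sibling repository, whereas `./sub` resolves to `https://example.org/org/parent.git/sub`.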

Random examples from the wild:

* Relative URLs: https://github.com/gitextensions/gitextensions
* Nested submodules (e.g. qemu also uses submodules): https://github.com/riscv-collab/riscv-gnu-toolchain
JustAnotherArchivist referenced this issue from a commit 1 year ago
JustAnotherArchivist changed title from Support for Git submodules to Git submodule support 1 year ago