OpenStack Swift as backend for Git – Part 2

A few days ago we published a blog post called OpenStack Swift as backend for Git – part 1, in which we explained the potential advantages of using Swift as a backend for Git and gave some details about what happens in Git (server side) when a client pushes or fetches objects.

In this second blog post, we first give a quick introduction to Dulwich, the project we use to tackle our challenge, and describe how we handle Swift as a backend to store repositories. We then finish by giving you the resources you need to try Dulwich with Swift as its backend.

Quick overview of Dulwich
To tackle our challenge we decided to use Dulwich, a Python library developed by Jelmer Vernooij that provides an interface to local and remote Git repositories. The library handles a lot, including:

  • Create, read, manage loose objects (blob, tree, commit, tag)
  • Create, read, manage pack files
  • Create, read, manage reference files
  • Manage staging area
  • Manage a local copy
  • Implement the Git smart protocol through git-upload-pack and git-receive-pack
  • Implement the Git, HTTP, SSH listeners to start Dulwich as a Git server
  • Implement some client-side commands like pull, fetch, clone, … and some other porcelain commands

The Dulwich library has all the base elements we need to tackle our challenge, in particular a full Python implementation of git-upload-pack and git-receive-pack. The most interesting parts for us are its server capabilities, its Git smart protocol implementation and its repository interface.

How we handle Swift as a backend with Dulwich
We added a new repository implementation, SwiftRepo, alongside the traditional Repo (file system backend) and MemoryRepo. The full implementation is located in the dulwich/swift.py module. Below are some explanations of how the Swift backend is implemented in Dulwich:

The repository layout in a Swift account
The SwiftRepo implementation authenticates against Swift and manages repositories at the container level of an account. The repository’s container includes the following objects:

  • info/refs
  • objects/pack/[pack-sha-1.pack, pack-sha-1.idx, pack-sha-1.info]*

These are the minimal requirements for a working repository. The container also includes other objects such as description, config and info/exclude, which can be ignored for now.

  • The info/refs object stores the reference names and the sha-1 of the object each one points to
  • The pack objects store the Git objects

The info/refs object
Instead of the standard layout, which stores one file per reference, we preferred to store all references in a single object. With the standard layout the number of Swift objects grows with the number of branches and tags, and discovering all the references requires many Swift GET requests. This is why our Swift backend uses info/refs: a single GET request loads all the references.
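To make the single-object idea concrete, here is a minimal sketch of reading and writing such an object. It assumes the one-line-per-reference "sha-1, tab, reference name" layout that Git itself uses for info/refs; the function names are hypothetical and are not taken from dulwich/swift.py.

```python
def parse_info_refs(data):
    """Parse 'sha<TAB>refname' lines into a {refname: sha} dict."""
    refs = {}
    for line in data.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        sha, name = line.split(b"\t", 1)
        refs[name] = sha
    return refs


def serialize_info_refs(refs):
    """Write {refname: sha} back as one line per reference, sorted by name."""
    return b"".join(
        sha + b"\t" + name + b"\n" for name, sha in sorted(refs.items())
    )


# Placeholder sha-1 values, for illustration only.
sample = (
    b"1" * 40 + b"\trefs/heads/master\n"
    + b"2" * 40 + b"\trefs/tags/v1.0\n"
)
refs = parse_info_refs(sample)
```

Whatever the number of branches and tags, the whole mapping travels in one object, so one GET is enough to load it and one PUT to update it.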

The pack files
The C Git implementation of git-receive-pack can explode a pack file received from a client into a bunch of loose Git objects (tree, blob, commit, tag); by default it does so for packs containing fewer objects than the receive.unpackLimit threshold. Dulwich instead keeps the pack format to store the objects. We kept the Dulwich way of storing objects in order to reduce the number of Swift objects we need to store in a single container. In our experience, Swift does not deal efficiently with containers that hold a huge number of objects.

A pack file can contain a huge number of objects. The advantage of the pack format over storing each loose object in its own file is that an object can be stored as a delta against a base object. Storing a delta instead of the full object content can significantly reduce the size of a repository. For a better understanding of what a pack file is, have a look at the pack format documentation.

On a file system, retrieving an object from a pack file requires seeking into it (to a known offset) and loading a known number of bytes. The pack index (.idx) contains the offsets of all the objects included in the corresponding .pack file. In our Swift backend implementation for Dulwich, we use the Range header of the GET request to read only the parts of a pack needed to retrieve objects by their sha-1.
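As an illustration, the seek-and-read on a file system maps onto an HTTP byte range as sketched below. The helper names are hypothetical, and serve_range only simulates what the server does with an in-memory blob; the real backend sends the header to Swift.

```python
def range_header(offset, length):
    """Build the HTTP Range header for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive on both ends, hence the "- 1".
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length > 0")
    return {"Range": "bytes={}-{}".format(offset, offset + length - 1)}


def serve_range(blob, headers):
    """Simulate the server side of a single-range GET on an in-memory blob."""
    spec = headers["Range"].split("=", 1)[1]
    first, last = (int(part) for part in spec.split("-"))
    return blob[first:last + 1]


# Stand-in for a pack object: 256 bytes whose value equals their offset.
pack = bytes(range(256))
chunk = serve_range(pack, range_header(16, 4))
```

The offset would come from the .idx lookup for the requested sha-1; only the requested slice of the pack crosses the network.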

Concurrency
To improve performance and reduce latency when seeking over stored packs, or when creating or verifying a pack, adding concurrency to object retrieval was the obvious choice. Dulwich itself does not rely on any sort of concurrency when walking over a pack, since local disk I/O is generally pretty fast.

Our Swift backend implementation is able to perform HTTP requests to Swift concurrently. This is particularly efficient when, for instance, we need to build a custom pack for a client: once we know the sha-1 of every object that must go into the pack, we can issue the requests to Swift concurrently.

In addition, we use a controlled pool of HTTP connections that can be reused, thanks to the geventhttpclient library and the minimal Swift client integrated in dulwich/swift.py.
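The pattern of bounded, concurrent fetches can be sketched as follows. Note that this sketch swaps in the standard library's thread pool in place of gevent and geventhttpclient, and that fake_fetch and FAKE_STORE are hypothetical stand-ins for a ranged GET against Swift.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, shas, concurrency=10):
    """Call fetch(sha) for every sha, with at most `concurrency` calls in flight.

    pool.map yields results in input order, so the returned dict keeps the
    order of `shas` regardless of which request finishes first.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return dict(zip(shas, pool.map(fetch, shas)))


# Stand-in for the object store; keys are shortened placeholder sha-1 values.
FAKE_STORE = {"a1": b"blob one", "b2": b"tree two", "c3": b"commit three"}


def fake_fetch(sha):
    """Pretend to GET one object from Swift."""
    return FAKE_STORE[sha]


objects = fetch_all(fake_fetch, ["c3", "a1", "b2"], concurrency=2)
```

The concurrency argument plays the role of the configurable concurrency limit: it caps how many requests hit the cluster at the same time.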

The pack.info objects
In a traditional Git repository a pack is always accompanied by an index file. We decided to add a third file, the .info object, which is like an index that carries more information than the .idx object. For instance, it lists the parent commits of each commit in the pack. With this content we can quickly build the chain of parent commits for a given reference simply by reading this file.

This .info file is created automatically when a pack is pushed by a client and stored in the Swift backend. Without it we would need to walk over all the commit objects one by one (synchronously), issuing GET requests against the pack file, in order to build the chain of parent commits from a given reference. This can be slow for projects with a large number of commits. The .info object also contains other useful information to speed up object discovery.
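A rough sketch of why a parent listing helps: once the listing is in memory, the commit chain is a plain graph walk with no further requests. The dictionary below is a hypothetical in-memory shape, not the actual format of the .info object, and the commit names are shortened placeholders.

```python
from collections import deque


def commit_chain(parents_by_commit, tip):
    """Return every commit reachable from `tip`, breadth-first, each once."""
    seen = {tip}
    order = []
    queue = deque([tip])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents_by_commit.get(commit, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


# Hypothetical parent listing for one pack.
PACK_INFO = {
    "d4": ["b2", "c3"],  # merge commit with two parents
    "c3": ["a1"],
    "b2": ["a1"],
    "a1": [],            # root commit
}
chain = commit_chain(PACK_INFO, "d4")
```

Without such a listing, each step of the walk would need a ranged GET on the pack to read the commit object and find its parents.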

Configuration
A configuration file is needed by the Swift repo implementation. The configuration file lets you specify the user credentials to perform the requests against Swift (tenant, user and password) together with the authentication method (v1 or v2). You can also configure the concurrency limit and the size of the HTTP connections pool. Please have a look at the configuration template.
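For illustration only, a file carrying the settings described above could be read as sketched here. The section name and every key name are hypothetical; the real names are the ones in the configuration template.

```python
import configparser

# Hypothetical configuration; check the configuration template for real keys.
SAMPLE = """
[swift]
auth_url = http://127.0.0.1:5000/v2.0
auth_ver = 2
tenant = demo
user = demo
password = secret
concurrency = 10
http_pool_length = 10
"""


def load_swift_settings(text):
    """Parse the [swift] section into a plain dict with typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["swift"]
    return {
        "auth_url": section["auth_url"],
        "auth_version": section.getint("auth_ver"),
        "tenant": section["tenant"],
        "user": section["user"],
        "password": section["password"],
        "concurrency": section.getint("concurrency"),
        "http_pool_length": section.getint("http_pool_length"),
    }


settings = load_swift_settings(SAMPLE)
```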

How to retrieve and use
The Swift repository implementation for Dulwich is currently usable in the eNovance fork of Dulwich. The installation and usage instructions can be found in the README.swift. There is currently a pull request for this feature on the official Dulwich repository.


    • Hello,
      I’ve done a couple of benchmarks that you can find here:
      https://docs.google.com/document/d/1IRPYwmSWzsAt9X66Rw9nnAZeTKBi17TMTczsjkZkiTg/edit?usp=sharing

      I think there is no limitation on the number of objects stored in the Git repo with the Swift backend, nor any slow-down, as Swift performs the same even when a huge number of objects is stored in it.
      Currently this implementation doesn’t handle the large object support of Swift, which means that any pack file greater than 5GB will be rejected.
      There is no difference between the regular FS backend and the Swift backend in Dulwich regarding binary files, so binary files are handled the same way.
      Fetch and push performance will really depend on the capabilities of the Swift cluster. By default the Swift backend implementation fetches data from Git packs in Swift in ranges of 12KB. If your Swift cluster is able to handle more than 1000 GET/s for this size you can expect quite good performance (have a look at the ssbench tool to benchmark a Swift cluster).
      Thanks for sharing this link.

  1. What are the implications of eventual consistency on using Swift as a back end for source control?

    • Hello Paul,

      I think the implications are limited. Below are the cases where eventual consistency can have some drawbacks:

      – when references are retrieved. We store all references in one object, “info/refs”, which is updated often. If a client wants to fetch a reference and Dulwich retrieves an old version of “info/refs” due to eventual consistency, then Dulwich will get an outdated sha for the reference the client wants to fetch. The fetch will complete, but the client’s local copy will be a bit outdated. A subsequent fetch will probably fix that, provided the “info/refs” replica read by Dulwich is the most recent one this time. I assume we can improve that by using the X-Newest header when Dulwich requests “info/refs”.

      – when packs are discovered. If a container replica is not up to date and a pack is missing from the container listing, then Dulwich will fail when it wants to create a custom pack for the client if one of the requested objects is contained in the missing pack. The Git client will then fail, reporting that it is unable to fetch the requested reference. A subsequent fetch will probably work as expected if the container listing is up to date this time.

      Loose objects and packs in a Git repository are immutable: once created, there is no reason for the objects or packs to change. So I think eventual consistency is not a problem when Dulwich retrieves objects to build a custom pack for a client. Note that in the Dulwich Swift backend we only rely on pack objects.
