Cloudzip – Mount a remote zip and access its files without downloading everything


Let’s say you have a huge zip archive stored somewhere in the cloud, say in an S3 bucket, and you need a few specific files inside it. What do you do? Well, like everyone else, you download the whole 32 GB and unzip the lot, all to retrieve three measly files…

Well, I found a really nice little tool that will make your life easier: Cloudzip! It lets you mount your remote zip archive directly on your machine, like an external hard drive, so you can access the files you need, copy them, and use them, all without downloading the entire archive.

For example, to list the contents of a remote archive:

cz ls s3://example-bucket/path/to/archive.zip

Pretty cool, right?

The way Cloudzip works is quite ingenious. It is based on two simple but devilishly effective principles:

  1. Zip files allow random read access. They have a “central directory” stored at the end of the archive that describes all the files contained, with their offsets. No need to read the entire archive to find a file.
  2. Most HTTP servers and cloud storage services (S3, Google Cloud Storage, Azure Blob Storage, etc.) support HTTP requests with “range” headers. Basically, this allows you to retrieve only a portion of a remote file.
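
To make the first principle concrete, here is a minimal Python sketch (my own illustration, not Cloudzip's code) that builds a tiny archive in memory and lists its entries by parsing only the tail of the file. The helper name read_central_directory and the 4096-byte tail size are assumptions made for the example, and ZIP64 archives are not handled.

```python
import io
import struct
import zipfile

# Build a tiny archive in memory so the sketch runs without any network access.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "hello " * 100)
    zf.writestr("dir/b.txt", "world " * 100)
data = buf.getvalue()

EOCD_SIG = b"PK\x05\x06"  # "end of central directory" record signature
CDH_SIG = b"PK\x01\x02"   # central directory file header signature


def read_central_directory(tail, tail_offset):
    """Return (name, compressed_size, local_header_offset) for each entry.

    `tail` holds the last bytes of the archive and `tail_offset` is the
    absolute position of tail[0] in the full file. Hypothetical helper,
    no ZIP64 support.
    """
    eocd = tail.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("end-of-central-directory record not found")
    # Fixed 22-byte record: signature, four 16-bit counters,
    # directory size, directory offset, comment length.
    fields = struct.unpack_from("<I4H2IH", tail, eocd)
    total_entries, cd_offset = fields[4], fields[6]
    pos = cd_offset - tail_offset
    if pos < 0:
        raise ValueError("tail too short: fetch more bytes from the end")
    entries = []
    for _ in range(total_entries):
        if tail[pos:pos + 4] != CDH_SIG:
            raise ValueError("corrupt central directory")
        # Fixed 46-byte header, followed by name, extra field and comment.
        f = struct.unpack_from("<I6H3I5H2I", tail, pos)
        name_len, extra_len, comment_len = f[10], f[11], f[12]
        name = tail[pos + 46:pos + 46 + name_len].decode("utf-8")
        entries.append((name, f[8], f[16]))
        pos += 46 + name_len + extra_len + comment_len
    return entries


# Only the last 4096 bytes (at most) are handed to the parser.
tail_offset = max(0, len(data) - 4096)
entries = read_central_directory(data[tail_offset:], tail_offset)
for name, size, offset in entries:
    print(f"{name}: {size} compressed bytes, local header at offset {offset}")
```

With a real remote archive, the tail would come from a single ranged request for the last few KB instead of a slice of an in-memory buffer.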

By combining these two principles, Cloudzip is able to retrieve just the central directory of your zip archive (usually tiny compared to the archive itself) to get the list of files, and then download only the bits of files you need at the time you access them!
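
Here is a rough Python sketch of that combination, again my own illustration rather than how Cloudzip is actually implemented. The RangeReader class and the fetch_range function are hypothetical names: fetch_range stands in for an HTTP GET carrying a "Range: bytes=start-end" header, and here it simply slices an in-memory blob so the example runs offline. Python's standard zipfile module then does the rest, since it only seeks and reads where it needs to.

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object that gets its bytes from fetch_range."""

    def __init__(self, fetch_range, size):
        self._fetch = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        if base + offset < 0:
            raise OSError("negative seek position")  # mimic a real file
        self._pos = base + offset
        return self._pos

    def readinto(self, b):
        n = min(len(b), self._size - self._pos)
        if n <= 0:
            return 0  # end of file
        # Range ends are inclusive, as in "Range: bytes=0-499".
        chunk = self._fetch(self._pos, self._pos + n - 1)
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def make_fake_remote(blob):
    """Simulate a remote object; a real version would issue HTTP range requests."""
    stats = {"requests": 0, "bytes": 0}

    def fetch_range(start, end):
        stats["requests"] += 1
        chunk = blob[start:end + 1]
        stats["bytes"] += len(chunk)
        return chunk

    return fetch_range, stats


# A roughly 1 MB archive: 50 stored (uncompressed) files of 20,000 bytes each.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    for i in range(50):
        zf.writestr(f"file_{i:02d}.bin", bytes([i]) * 20_000)
archive = buf.getvalue()

fetch_range, stats = make_fake_remote(archive)
with zipfile.ZipFile(RangeReader(fetch_range, len(archive))) as remote_zip:
    picked = remote_zip.read("file_07.bin")

print(f"archive size: {len(archive)} bytes")
print(f"fetched: {stats['bytes']} bytes in {stats['requests']} range requests")
```

Reading one file only touches the end-of-archive record, the central directory, and that file's own bytes, so the fetched total stays a small fraction of the archive. A real client would also want to cache or batch the requests, since every small read here becomes its own round trip.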

To install it:

git clone 
cd cloudzip
go build -o cz main.go

Then copy the cz binary to a location accessible via your $PATH:

cp cz /usr/local/bin/

And where it gets even crazier (oops sorry, I meant “interesting”) is that with the mount command, Cloudzip can actually mount your remote zip archive as a local directory. Under the hood, it starts a small NFS server locally and mounts that NFS share in the folder of your choice.

Another example:

cz mount s3://example-bucket/path/to/archive.zip some_dir/

This way, you have access to all your files as if they were local, you can open them directly in your applications, process them, and all this without ever having to download the entire archive.

And the best part is that Cloudzip works with just about any remote storage you can imagine. Sure, there’s S3, but also HTTP, HTTPS, GCS, Azure, and even… drum roll… Kaggle!

Ah Kaggle, that Data Scientists’ den where datasets are bigger than a Bitcoin miner’s electricity meter… Cloudzip is able to use the Kaggle API to directly retrieve the zip of a dataset without having to download it. So you can literally mount a Kaggle dataset locally and start working on it in a second. And if you ever need a particular file to test something, no worries, it will be downloaded on demand.

Now of course, it’s not perfect. NFS mounting, for example, is only available on Linux and macOS for now. And don’t expect crazy performance either: we’re still talking about downloading bits of files over the network. But for all those cases where you need to access a few files in a huge zip archive, it’s perfect!

And what’s more, it’s open-source (you didn’t think I was going to recommend something proprietary, did you?). You can find the project on GitHub.


