I have > 100 million image files (book covers) as a flat list of files under a single "directory":

/images/000000093e7d1825b346e9fc01387c7e449e1ed7
/images/000000574c67d7b8c5726f7cfd7bb1c5b3ae2ddf
/images/0000005ae12097d69208f6548bf600bd7d270a6f
...

A long time ago, these were stored on Amazon S3, and are now on Backblaze B2 (which is S3-compatible).

So far, this worked fine:

  • storing a new file is very quick;
  • retrieving an existing file is very quick.

I'm in the process of migrating once again, to iDrive E2 (S3-compatible as well).

I'm experimenting with moving them using rclone, but after 30 min of waiting for rclone copy to start, I realized that rclone does not start transferring files until it has received the whole file list.

The problem is:

  • a quick benchmark of rclone ls on the /images/ directory tells me that transferring the whole file list would take almost 10 hours
  • any problem during transfer (which will take many days) would restart from zero, forcing rclone to download the whole file list again
  • listing files costs money with B2

I tried configuring rclone to copy only a batch of files:

  • rclone copy "backblaze:/images/0000*", with or without *, does not find any file
  • rclone copy "backblaze:/images/" --include "/0000*" seems to download the whole file list as well, and filter on the client

Strangely, it looks like rclone has no problem retrieving from the server a list of files that are under a given "directory", for example /images/, but cannot do the same with a prefix, such as /images/0000.

I thought that S3, and by extension all S3-compatible storages, stored file paths as a flat structure, and that / was just a character like any other, and that you could easily list files under any prefix, ending or not with a /.

Am I mistaken?

In my next storage (E2), should I store files under sub-directories, such as images/0/0/0/0/, images/0/0/0/1/, etc., just like we did in the good old days of storing files in a traditional filesystem?

1 Answer

I realized that rclone does not start transferring files until it has received the whole file list.

This tells me that your problem lies less with the storage providers and more with rclone itself. A tool that streams the file listing and transfers files in chunks as entries arrive would suit this job better than one that needs the entire file list before it starts.
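To illustrate what "chunking files as they arrive" could look like, here is a minimal Python sketch. It is not how rclone works internally. `fake_listing` is a hypothetical stand-in for a paginated bucket listing, and `batched` is an illustrative helper name; a real tool would hand each batch to a copy worker and record progress so an interrupted run could resume.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of up to `size` keys, pulling from `keys` lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(keys)
    while True:
        # islice only consumes `size` items, so the full listing is never
        # held in memory and work can begin after the first page arrives.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_listing(count: int) -> Iterator[str]:
    """Hypothetical stand-in for a paginated bucket listing."""
    for i in range(count):
        yield f"images/{i:040x}"


if __name__ == "__main__":
    # Each batch could be copied as soon as it is produced.
    for batch in batched(fake_listing(5), size=2):
        print(len(batch), batch[0])
```

Because the listing is consumed lazily, the first batch is available long before a 100-million-entry listing would finish.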

I thought that S3, and by extension all S3-compatible storages, stored file paths as a flat structure,

That's definitely how S3 does it, which broke my file-server admin brain when I first ran into it. Given that the issues here seem to be related to metadata (listing) rather than file layout, the layout likely doesn't matter.
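For what it's worth, the S3 ListObjectsV2 operation takes a Prefix parameter that is matched as a plain string and does not have to end in a /. Assuming B2 and E2 honour this the same way, the key space could be split into small, independently restartable batches by enumerating hex prefixes. The sketch below only models that idea locally: `hex_prefixes` and `keys_under` are hypothetical helper names, and `keys_under` mimics the server-side prefix match with `str.startswith` rather than calling any storage API.

```python
from itertools import product
from typing import Iterator, List

HEX_DIGITS = "0123456789abcdef"


def hex_prefixes(base: str, length: int) -> Iterator[str]:
    """Yield base + every hex string of `length` digits, in sorted order."""
    for digits in product(HEX_DIGITS, repeat=length):
        yield base + "".join(digits)


def keys_under(prefix: str, keys: List[str]) -> List[str]:
    """Local model of a prefix listing: a plain string match on the key."""
    return [key for key in keys if key.startswith(prefix)]


if __name__ == "__main__":
    sample = [
        "images/000000093e7d1825b346e9fc01387c7e449e1ed7",
        "images/000000574c67d7b8c5726f7cfd7bb1c5b3ae2ddf",
        "images/0000005ae12097d69208f6548bf600bd7d270a6f",
    ]
    # 16**4 = 65536 prefixes; each one is a separate, resumable unit of work.
    prefixes = list(hex_prefixes("images/", 4))
    print(len(prefixes), prefixes[0], len(keys_under(prefixes[0], sample)))
```

With 4 hex digits and roughly 100 million evenly distributed hashes, each prefix would cover on the order of 1,500 objects, so a failure would only cost one small listing.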

  • I finally decided to store files in 2 levels of virtual directories, with S3 keys such as 12/34/12345678.... Even though storing everything in the same "directory" should work with all S3-compatible storages and tools, it looks like it may still save me from a few migration headaches from time to time!
    – BenMorel
    May 3 at 22:00
  • @BenMorel "Should don't make is" -- true way more often than I like.
    – sysadmin1138
    May 3 at 23:23
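The two-level layout mentioned in the comments can be derived from the file name alone. A minimal sketch, assuming each level is two characters taken from the start of the name (as in 12/34/12345678...); `sharded_key` is a hypothetical helper name, not part of any tool discussed here.

```python
def sharded_key(name: str, levels: int = 2, width: int = 2) -> str:
    """Build a key like '12/34/12345678' from the leading characters of `name`."""
    needed = levels * width
    if len(name) < needed:
        raise ValueError(f"name must have at least {needed} characters")
    # Take `levels` consecutive slices of `width` characters as virtual directories.
    parts = [name[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [name])


if __name__ == "__main__":
    print(sharded_key("000000093e7d1825b346e9fc01387c7e449e1ed7"))
```

Keeping the full name as the last path component means the original flat name can always be recovered from the key.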
