
Storing and organizing thousands (or millions) of files/images.

Question: "How does Facebook do it?" How can Facebook store all these gazillions of images and get away with it?

More so, what happens if I decide to upload, say, 20 photos of relatives and every single one of them has the same name, say "family.jpg" (and I want to put them in the same album)?

Enter the world of Hashing.

Though explaining hashing is not the goal of this article, it turns out to be pretty handy for what we want to accomplish.

First, storing, say, a trillion files/images on our server requires that we address the following constraints:

  1. Not all files can live in the same directory (that would be overkill, and most OSes have problems with such an approach).
  2. Must be able to accommodate files with the same "name".
  3. Must be portable: if the hard drive is full, we must be able to easily start using another one (or even a remote server).
  4. Adding the image payload to a database as a blob is not a good idea.

So, how do we do it?

Enter Mighty Python.

Obviously this can be accomplished in any programming language, but since Python is the badass master, let's see how we can start fixing things using this awesome language:

First, we need to come up with some solutions. Let's tackle the "splitting" of files into multiple folders. We should not store all files under a single "user" folder (e.g. ./Users/john/). Why? Because the time will come when you cannot add more files to it (a disk-full condition, for example).

A solution for this is splitting. Let's take the file's first and second letters, create a directory structure from them, and store some files, like so:

  • miranda.jpg - ./images/m/i/miranda.jpg
  • nebula.jpg - ./images/n/e/nebula.jpg
  • dderidex.jpg - ./images/d/d/dderidex.jpg
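As a quick sketch, deriving that two-level path in Python could look like the following (the `letter_path` helper and the `./images` root are just illustrative assumptions, not part of the final program):

```python
import os

def letter_path(fname, root='./images'):
    """Build a two-level directory path from the file's first two letters."""
    return os.path.join(root, fname[0], fname[1], fname)

print(letter_path('miranda.jpg'))   # ./images/m/i/miranda.jpg (on POSIX systems)
```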

So far so good, but this does not fix the "same file name" issue; imagine trying to add another "nebula.jpg" image.

Salting:

Let's add some salt: say we prepend a "random" set of letters to our filenames, so now we can accomplish this:

  • miranda.jpg --> aqdemiranda.jpg --> ./images/a/q/aqdemiranda.jpg
  • nebula.jpg --> srccnebula.jpg --> ./images/s/r/srccnebula.jpg
  • dderidex.jpg --> draxdderidex.jpg --> ./images/d/r/draxdderidex.jpg

Note how we can now have files with the same "names". The trick now is to make it even more "random", since naming collisions can still occur:
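A toy version of this manual salting could look like the sketch below (the `salt_filename` helper and the four-letter salt length are assumptions for illustration; as noted above, short random prefixes can still collide):

```python
import random
import string

def salt_filename(fname, length=4):
    """Prefix the filename with `length` random lowercase letters."""
    salt = ''.join(random.choices(string.ascii_lowercase, k=length))
    return salt + fname

salted = salt_filename('nebula.jpg')
# The first two letters of the salted name pick the sub-folders:
print(salted, '->', './images/{}/{}/{}'.format(salted[0], salted[1], salted))
```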

Python provides several standard libraries that allow us to handle this in a more automated way. Let's use Python's hashlib library to generate hashes of filenames (or any string, for that matter):

julio@dukat-local:CodeRepo/techfuel-website ‹master›$ python
Python 3.9.5 (default, May 11 2021, 08:20:37) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> hashlib.md5(b'miranda.jpg').hexdigest()
'c8382372556a88b8ba930a327719caff'
>>> hashlib.md5(b'miranda.jpg').hexdigest()
'c8382372556a88b8ba930a327719caff'
>>> 

Note how the generated hash is the same for the same filename. So now, let's salt it with the help of the uuid library:

julio@dukat-local:CodeRepo/techfuel-website ‹master›$ python
Python 3.9.5 (default, May 11 2021, 08:20:37) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uuid
>>> uuid.uuid4()
UUID('168cacc8-f039-441f-9442-50facd274922')
>>> uuid.uuid4()
UUID('c8316575-561a-467b-853e-16183849e26e')
>>> 

Python's uuid4 function generates a random UUID, excellent, different salt values for our project!

Getting Closer:

julio@dukat-local:CodeRepo/techfuel-website ‹master›$ python
Python 3.9.5 (default, May 11 2021, 08:20:37) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import uuid
>>> print(hashlib.md5(str(uuid.uuid4()).encode('utf-8') + 'miranda.jpg'.encode('utf-8')).hexdigest())
999c4e914d20132272712fc2899d7e9f
>>> print(hashlib.md5(str(uuid.uuid4()).encode('utf-8') + 'miranda.jpg'.encode('utf-8')).hexdigest())
edc766844dda34846ec622c29c81a6f7
>>> print(hashlib.md5(str(uuid.uuid4()).encode('utf-8') + 'miranda.jpg'.encode('utf-8')).hexdigest())
1ad68040bce4832566439a78db9cc446
>>> 

Nice!, different "hashes" for the same file!

Now the final step is to put it all together. Splitting these files into, say, three levels, and using hex notation for the sub-folder names, allows for 256³ (roughly 16.7 million) possible combinations (about 4.3 billion if you go four levels). So the "hash" 999c4e914d20132272712fc2899d7e9f can be subdivided as /99/9c/4e/999c4e914d20132272712fc2899d7e9f. Most likely we'd want to store all this metadata in our database, along with the original file name and maybe a content type indicating what kind of image we're talking about (jpg, png, gif, etc.), among other information that might be pertinent for your uses, such as size, resolution, etc.

The following small program does just that: from a filename, it creates the basic information needed to store into a database. It will not check for the existence of a real image or data stream (i.e. an uploaded file), but it will take care of the hashing and entropy of the filename itself:

# -*- coding: utf8 -*-
# gen_hashes.py - filename hash generator.

import hashlib
import uuid

# Use imghdr (imghdr.what(fname[, stream])) to find out the image type
# (note: imghdr was deprecated in Python 3.11 and removed in 3.13)

def generate_hash_from_filename(fname):
    return hashlib.md5(str(uuid.uuid4()).encode('utf-8') + fname.encode('utf-8')).hexdigest()

def get_path(fname):
    hashed = generate_hash_from_filename(fname)
    path_info = (fname, hashed[:2], hashed[2:4], hashed[4:6], hashed,)
    return path_info

if __name__ == '__main__':
    print(get_path('miranda.jpg'))
    print(get_path('miranda.jpg'))
    print(get_path('nebula.png'))
    print(get_path('nebula.jpg'))
    print(get_path('averylongfilename.jpg'))

Generating the following output:

julio@dukat-local:~ $ python gen_hashes.py 
('miranda.jpg', '46', '55', '28', '4655281dd89a2036479605e4f8cd71f1')
('miranda.jpg', '0f', '38', '77', '0f3877af6087757236e768ce1c4d6479')
('nebula.png', 'cf', '25', '4f', 'cf254fe952c91d4a1091b79403b749d2')
('nebula.jpg', 'ac', '71', '32', 'ac7132d417e3f8ccd71ec3aa11d444ad')
('averylongfilename.jpg', 'be', '42', '62', 'be4262597da9e0d6f555c57b8eaad45b')

Note how, even with files sharing the same name, the hash/salt combination creates unique identifiers for them.
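From here, actually writing a payload to disk is a matter of creating the three-level tree and saving the file under its hash. A minimal sketch (the `store_file` helper is an assumption, not part of gen_hashes.py; the tuple below is shaped like get_path()'s output, and tempfile stands in for your real storage root):

```python
import os
import tempfile

def store_file(path_info, data, root):
    """Create the three-level directory tree and write the payload under its hash."""
    fname, l1, l2, l3, hashed = path_info
    directory = os.path.join(root, l1, l2, l3)
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, hashed)
    with open(target, 'wb') as fh:
        fh.write(data)
    return target

# Example with a tuple shaped like get_path()'s output:
info = ('miranda.jpg', '46', '55', '28', '4655281dd89a2036479605e4f8cd71f1')
root = tempfile.mkdtemp()
path = store_file(info, b'fake image bytes', root)
print(path)
```

The original filename in the tuple would go into the database row alongside the hash, so the user-facing name and the on-disk name stay decoupled.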

And before we finish, let's add a unit test: one that verifies that 100,000 hashes for the same file are all unique:

# -*- coding: utf8 -*-
# test_gen_hashes.py - Tester for gen_hashes.py

import unittest

from gen_hashes import generate_hash_from_filename

class TestHashFiles(unittest.TestCase):

    def test_uniqueness(self):
        """ Generates a sample of ten thousand hashes for the
        same file and fails if uniqueness is false
        """
        self.fname = 'test.jpg'
        self.fnames = []
        for index in range(100000):
            self.fnames.append(generate_hash_from_filename(self.fname))
        self.assertEqual(len(self.fnames), len(set(self.fnames)))

if __name__ == '__main__':
    unittest.main()

Resulting in:

julio@dukat-local:~ $ python test_gen_hashes.py
.
----------------------------------------------------------------------
Ran 1 test in 0.395s

OK

So there you have it, in case you want to work on the next Facebook :)
