What's more, what happens if I decide to upload 20 images of my family, and every single one of them has the same name, "family.jpg" (and I want to put them in the same album)?
Enter the world of hashing.
Though explaining what hashing is is not the goal of this article, it comes in pretty handy for what we want to accomplish.
First, storing one trillion files/images on our server requires that we address the following constraints:
- Not all files can be in the same filesystem directory (overkill, and most OSes have problems with this).
- Must be able to accommodate files with the same "name".
- Must be portable: if the hard drive is full, it should be easy to start using another one (or even a remote server).
- Adding the image payload in a database (blob) is not only a terrible idea but an anti-pattern in itself.
So how do we do it?
Enter Mighty Python.
Obviously this can be accomplished in every single programming language, but since Python is the badass master, I'll show how we can start fixing things using this awesome language:
First, we need to come up with some solutions. Let's tackle the "splitting" of files into multiple folders. We should not think about storing all files under the "user" folder (e.g. /Users/john/*). Why? Because the time will come when you cannot add more files to it (disk full, for example).
The solution for this is splitting. Let's take the file's first and second letters, for example, create a directory structure from them, and store some files, like so:
- defiant.jpg — /Images/d/e/defiant.jpg
- ourparty.jpg — /Images/o/u/ourparty.jpg
- sister.jpg — /Images/s/i/sister.jpg
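The splitting shown above can be sketched as a small function. This is only an illustration: the name `prefix_path` and the default root `/Images` are assumptions borrowed from the examples, and it assumes filenames with at least two characters. It is written for Python 3.

```python
import posixpath


def prefix_path(filename, root='/Images'):
    """Build a storage path from the first two characters of the filename.

    Hypothetical helper: assumes the filename has at least two characters.
    """
    if len(filename) < 2:
        raise ValueError('filename needs at least two characters')
    first, second = filename[0].lower(), filename[1].lower()
    # posixpath keeps forward slashes regardless of the host OS.
    return posixpath.join(root, first, second, filename)


for name in ('defiant.jpg', 'ourparty.jpg', 'sister.jpg'):
    print(prefix_path(name))
```

Two files with the same name still map to exactly the same path here, which is the problem tackled next.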
So far so good, but this does not fix the "same file name" issue: imagine trying to add another "sister.jpg" image.
Let's say we add a "random" set of letters to our filenames so we can accomplish this:
- defiant.jpg — aqerdefiant.jpg — /Images/a/q/aqerdefiant.jpg
- ourparty.jpg — srtcourparty.jpg — /Images/s/r/srtcourparty.jpg
- sister.jpg — rybssister.jpg — /Images/r/y/rybssister.jpg
- defiant.jpg — trlodefiant.jpg — /Images/t/r/trlodefiant.jpg
Note how we can now have files with the same "names". The trick now is to make it even more "random", since collisions can still occur.
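To get a feel for how likely those collisions are, here is a rough Python 3 sketch. The helper names are hypothetical, the four-letter lowercase prefix is an assumption taken from the examples above, and the probability uses the usual birthday-bound approximation rather than an exact count.

```python
import math
import random
import string


def random_prefix_name(filename, length=4):
    """Prepend `length` random lowercase letters (hypothetical helper)."""
    prefix = ''.join(random.choice(string.ascii_lowercase) for _ in range(length))
    return prefix + filename


def collision_probability(n_files, length=4):
    """Approximate chance that at least two of n_files share a prefix."""
    space = 26 ** length  # number of distinct prefixes
    return 1.0 - math.exp(-n_files * (n_files - 1) / (2.0 * space))


print(random_prefix_name('sister.jpg'))
print(26 ** 4)                       # size of the prefix space
print(collision_probability(1000))   # 1,000 uploads with the same original name
```

Four letters give only 456,976 possible prefixes, so under this approximation 1,000 uploads sharing one original name already have roughly a two-in-three chance of at least one clash.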
Python provides several standard libraries that will allow us to handle this in a more automated way. Let's use Python's hashlib library to generate hashes of filenames (or any string):
Rotarran:~ julio$ python
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> hashlib.md5('family.jpg').hexdigest()
'ccafdb2280972b6e8d05e1a5bc40f979'
>>> hashlib.md5('family.jpg').hexdigest()
'ccafdb2280972b6e8d05e1a5bc40f979'
>>>
Note how the hash is the same for the same filename. Let's salt it with the help of the uuid library:
Rotarran:~ julio$ python
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import uuid
>>> uuid.uuid4()
UUID('8574e336-57cd-45ba-9b52-ddbff2f6ebc9')
>>> uuid.uuid4()
UUID('c64b2298-64c3-4035-9c4b-cc07c4d0322d')
>>>
Nice, different salt values for our project!
Putting it all together:
Rotarran:~ julio$ python
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import uuid
>>> print(hashlib.md5(str(uuid.uuid4()) + 'family.jpg').hexdigest())
402beab8d58219c5b28918c3368c6fa4
>>> print(hashlib.md5(str(uuid.uuid4()) + 'family.jpg').hexdigest())
093b1272053e3fdd96ac2689f2d3a2a3
>>> print(hashlib.md5(str(uuid.uuid4()) + 'family.jpg').hexdigest())
820a49e57b868fad6dd57dfb383aba23
>>>
Nice! Different "hashes" for the same filename!
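The sessions above run on Python 2.7, where `hashlib.md5` accepts a plain `str`. On Python 3 it wants bytes, so the salted string has to be encoded first. A minimal sketch of the same idea for Python 3 follows; the function name `salted_name_hash` is an assumption, not something from the original sessions.

```python
import hashlib
import uuid


def salted_name_hash(filename):
    """Return an MD5 hex digest of a random UUID4 salt plus the filename."""
    salted = str(uuid.uuid4()) + filename
    # Python 3: hashlib works on bytes, so encode the text first.
    return hashlib.md5(salted.encode('utf-8')).hexdigest()


first = salted_name_hash('family.jpg')
second = salted_name_hash('family.jpg')
print(first)
print(second)
print(first != second)
```

MD5 is used here only to spread names around evenly, not for anything security related.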
Now the final step is to put it all together. Each pair of hexadecimal characters can take 256 values, so splitting these files into, say, three levels allows for 256^3, or about 16.7 million, possible directory combinations. So the "hash" 820a49e57b868fad6dd57dfb383aba23 can be subdivided into /82/0a/49/820a49e57b868fad6dd57dfb383aba23. Most likely we'd like to store all this metadata in our database, along with the original filename and maybe a content type indicating what kind of image we're talking about (jpg, png, gif, etc.), among other information that might be pertinent for your uses, such as size, resolution, etc.
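As a quick illustration of that subdivision, here is a small Python 3 sketch. The function name `shard_path` and its `levels` and `width` parameters are assumptions made for the example.

```python
def shard_path(hashed, levels=3, width=2):
    """Split the leading hex characters of a digest into directory levels."""
    parts = [hashed[i * width:(i + 1) * width] for i in range(levels)]
    return '/' + '/'.join(parts + [hashed])


digest = '820a49e57b868fad6dd57dfb383aba23'
print(shard_path(digest))  # /82/0a/49/820a49e57b868fad6dd57dfb383aba23

# Two hex characters give 256 values per level; three levels give 256 ** 3.
print(256 ** 3)  # 16777216
```

Using fewer levels or narrower slices trades the number of directories against the number of files that end up in each one.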
The following small program does just that: from a filename, it creates the basic information needed to store in a database. It will *not* check for the existence of a real image or data stream (i.e. an uploaded file), but it will take care of the hashing and entropy of the filename itself:
# -*- coding: utf8 -*-
# hashfiles.py - filename hash generator.
import hashlib
import uuid

# imghdr (imghdr.what(fname[,stream])) to find out image type


def generate_hash_from_filename(fname):
    return hashlib.md5(str(uuid.uuid4()) + fname).hexdigest()


def get_path(fname):
    hashed = generate_hash_from_filename(fname)
    path_info = (fname, hashed[:2], hashed[2:4], hashed[4:6], hashed,)
    return path_info


if __name__ == '__main__':
    print(get_path('hello.jpg'))
    print(get_path('averylongname.png'))
    print(get_path('hello.jpg'))
    print(get_path('thisisaverylongimagename.jpg'))
Generating the following output:
Rotarran:Projects julio$ python hashfiles.py
('hello.jpg', 'bd', '2a', '9a', 'bd2a9a1cd81852ed2d8db03971d81000')
('averylongname.png', '44', '11', '9b', '44119b1b6f78eaaa99a969e62006eab6')
('hello.jpg', '20', 'dc', 'e2', '20dce2209c67d2752fdee9b9973b1a90')
('thisisaverylongimagename.jpg', '27', '49', '5d', '27495db67281b29f11555f2f9aaf7b0f')
Rotarran:Projects julio$
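One way such a tuple could become a database-ready record is sketched below in Python 3. Everything here is an assumption for illustration: the field names, the helpers `storage_record` and `resolve`, and the example roots `/Images` and `/mnt/disk2/Images`. The idea is to store only a relative path, so that moving to another drive (the portability constraint from earlier) means changing the root and nothing else.

```python
import posixpath


def storage_record(path_info, content_type):
    """Turn a (name, level1, level2, level3, hash) tuple into a dict record."""
    original, level1, level2, level3, hashed = path_info
    return {
        'original_name': original,
        'hash': hashed,
        'content_type': content_type,
        # Relative on purpose: the storage root is decided at lookup time.
        'relative_path': posixpath.join(level1, level2, level3, hashed),
    }


def resolve(record, root):
    """Join a storage root (hypothetical mount point) with the stored path."""
    return posixpath.join(root, record['relative_path'])


info = ('hello.jpg', 'bd', '2a', '9a', 'bd2a9a1cd81852ed2d8db03971d81000')
record = storage_record(info, 'image/jpeg')
print(resolve(record, '/Images'))
print(resolve(record, '/mnt/disk2/Images'))
```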
And before we finish, here is the corresponding unit test module for hashfiles.py, which checks that 10,000 hashes generated for the same filename are all unique:
# -*- coding: utf8 -*-
import unittest

from hashfiles import generate_hash_from_filename


class TestHashFiles(unittest.TestCase):

    def test_uniqueness(self):
        """
        Generates a sample of ten thousand hashes for the same file
        and fails if uniqueness is false
        """
        self.fname = 'test.jpg'
        self.fnames = []
        for index in xrange(10000):
            self.fnames.append(generate_hash_from_filename(self.fname))
        self.assertEqual(len(self.fnames), len(set(self.fnames)))


if __name__ == '__main__':
    unittest.main()
So there you have it, in case you want to work on the next Facebook :)