.. Yes, blogs still do exist ..
What is TechFuel? - The original idea of TechFuel.net is fairly simple: publish an eclectic, mostly personal, mix of funny and not-so-funny topics with a sprinkle of technical information for my loyal cybernauts. As an avid advocate of Open Source and a die-hard Python developer, many entries will follow this line.

Storing thousands/millions of files and organizing them

Added 2 months ago, Modified 1 month ago, Under Programming
Ever asked yourself "How does Facebook do it?" How can Facebook store all those gazillions of images and get away with it?

More so, what happens if I decide to upload 20 images of my family, and every single one of them is called the same: "family.jpg" (and I want to put them in the same album)?

Enter the world of hashing.

Though explaining what hashing is is not the goal of this article, it comes in pretty handy for what we want to accomplish.

First, storing one trillion files/images on our server requires that we address the following constraints:

  1. Not all files can be in the same filesystem directory (overkill, and most OSes have problems with this).
  2. Must be able to accommodate files with the same "name".
  3. Must be portable: if the hard drive is full, it should be easy to start using another one (or even a remote server).
  4. Adding the image payload in a database (blob) is not only a terrible idea but an anti-pattern in itself.

So how do we do it?

Enter Mighty Python.

Obviously this can be accomplished in every single programming language, but since Python is the badass master, I'll show how we can start fixing things using this awesome language:

First, we need to come up with some solutions. Let's tackle the "splitting" of files into multiple folders: we should not think about storing all files under the "user" folder (e.g. /Users/john/*). Why? Because the time will come when you cannot add more files to it (disk full, for example).

The solution for this is splitting. Let's take the file's first and second letters, for example, create a directory structure from them, and store some files, like so:

  • defiant.jpg — /Images/d/e/defiant.jpg
  • ourparty.jpg — /Images/o/u/ourparty.jpg
  • sister.jpg — /Images/s/i/sister.jpg
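The split shown in this list can be sketched in a few lines. This is only an illustration: `split_path` is a hypothetical helper name, and the `/Images` root is taken from the examples above.

```python
import posixpath

def split_path(fname, root='/Images'):
    """Build a two-level path from the first two characters of fname.

    split_path is a hypothetical helper, named here only for illustration.
    """
    return posixpath.join(root, fname[0], fname[1], fname)

print(split_path('defiant.jpg'))   # /Images/d/e/defiant.jpg
print(split_path('sister.jpg'))    # /Images/s/i/sister.jpg
```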

So far so good, but this does not fix the "same file name" issue: imagine trying to add another "sister.jpg" image.

Salting.

Let's say we add a "random" set of letters to our filenames so we can accomplish this:

  • defiant.jpg — aqerdefiant.jpg — /Images/a/q/aqerdefiant.jpg
  • ourparty.jpg — srtcourparty.jpg — /Images/s/r/srtcourparty.jpg
  • sister.jpg — rybssister.jpg — /Images/r/y/rybssister.jpg
  • defiant.jpg — trlodefiant.jpg — /Images/t/r/trlodefiant.jpg
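A rough sketch of how such salted names could be produced, assuming a hypothetical `salt_filename` helper and a four-letter lowercase salt like the ones in the list:

```python
import random
import string

def salt_filename(fname, salt_length=4):
    """Prefix fname with random lowercase letters (hypothetical helper)."""
    salt = ''.join(random.choice(string.ascii_lowercase)
                   for _ in range(salt_length))
    return salt + fname

salted = salt_filename('sister.jpg')
print(salted)      # something like rybssister.jpg, different on each run

# Only 26 ** 4 distinct four-letter salts exist, so two uploads with the
# same name can still end up with the same salted name.
print(26 ** 4)     # 456976
```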

Note how we can now have files with the same "names". The trick now is to make it even more "random" since collisions can still occur:

Python provides several standard libraries that will allow us to handle this in a more automated way. Let's use Python's hashlib library to generate hashes of filenames (or any string):

Rotarran:~ julio$ python
Python 2.7.6 (default, Sep  9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> hashlib.md5('family.jpg').hexdigest()
'ccafdb2280972b6e8d05e1a5bc40f979'
>>> hashlib.md5('family.jpg').hexdigest()
'ccafdb2280972b6e8d05e1a5bc40f979'
>>>

Note how the hash is the same for the same filename. Let's salt it with the help of the uuid library:

Rotarran:~ julio$ python
Python 2.7.6 (default, Sep  9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import uuid
>>> uuid.uuid4()
UUID('8574e336-57cd-45ba-9b52-ddbff2f6ebc9')
>>> uuid.uuid4()
UUID('c64b2298-64c3-4035-9c4b-cc07c4d0322d')
>>>

Nice, different salt values for our project!

Putting it all together:

Rotarran:~ julio$ python
Python 2.7.6 (default, Sep  9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import uuid
>>> print(hashlib.md5(str(uuid.uuid4()) + 'family.jpg').hexdigest())
402beab8d58219c5b28918c3368c6fa4
>>> print(hashlib.md5(str(uuid.uuid4()) + 'family.jpg').hexdigest())
093b1272053e3fdd96ac2689f2d3a2a3
>>> print(hashlib.md5(str(uuid.uuid4()) + 'family.jpg').hexdigest())
820a49e57b868fad6dd57dfb383aba23
>>>

Nice! Different "hashes" for the same file!

Now the final step is to put it all together. Splitting these files into, say, three levels allows for 256^3, or about 16.7 million, possible directory combinations (each two-character hex level has 256 possible values). So the "hash" 820a49e57b868fad6dd57dfb383aba23 can be subdivided into /82/0a/49/820a49e57b868fad6dd57dfb383aba23. Most likely we'd like to store all this metadata in our database, along with the original file name and maybe a content type indicating what kind of image we're talking about (jpg, png, gif, etc.), among other information that might be pertinent for your uses, such as size, resolution, etc.
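The arithmetic and the subdivision can be checked with a short sketch. `hash_to_path` is a hypothetical helper, and its `levels` and `width` parameters are assumptions matching the three two-character levels described here. The per-directory average assumes the hashes spread evenly across directories.

```python
import posixpath

def hash_to_path(hashed, levels=3, width=2):
    """Split the leading characters of a hex digest into directory levels.

    hash_to_path is a hypothetical helper; levels and width are assumptions
    matching the three two-character levels described in the text.
    """
    parts = [hashed[i * width:(i + 1) * width] for i in range(levels)]
    return '/' + posixpath.join(*(parts + [hashed]))

print(256 ** 3)              # 16777216 possible leaf directories
print(10 ** 12 // 256 ** 3)  # 59604 files per directory for one trillion files
print(hash_to_path('820a49e57b868fad6dd57dfb383aba23'))
# /82/0a/49/820a49e57b868fad6dd57dfb383aba23
```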

The following small program does just that: from a filename, it creates the basic information needed to store in a database. It will *not* check for the existence of a real image or data stream (i.e. an uploaded file), but it will take care of the hashing and entropy of the filename itself:

# -*- coding: utf8 -*-
# hashfiles.py - filename hash generator.

import hashlib
import uuid

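# imghdr.what(fname) could be used to find out the image type
# (note: imghdr was removed from the standard library in Python 3.13).

def generate_hash_from_filename(fname):
    """Return the MD5 hex digest of a random UUID (the salt) plus fname."""
    salted = str(uuid.uuid4()) + fname
    # Encode first: hashlib requires bytes on Python 3
    # (this also works on Python 2.7 for ASCII names).
    return hashlib.md5(salted.encode('utf-8')).hexdigest()

def get_path(fname):
    """Return (original name, three 2-char directory levels, full hash)."""
    hashed = generate_hash_from_filename(fname)
    return (fname, hashed[:2], hashed[2:4], hashed[4:6], hashed)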

if __name__ == '__main__':
    print(get_path('hello.jpg'))
    print(get_path('averylongname.png'))
    print(get_path('hello.jpg'))
    print(get_path('thisisaverylongimagename.jpg'))

Generating the following output:

Rotarran:Projects julio$ python hashfiles.py
('hello.jpg', 'bd', '2a', '9a', 'bd2a9a1cd81852ed2d8db03971d81000')
('averylongname.png', '44', '11', '9b', '44119b1b6f78eaaa99a969e62006eab6')
('hello.jpg', '20', 'dc', 'e2', '20dce2209c67d2752fdee9b9973b1a90')
('thisisaverylongimagename.jpg', '27', '49', '5d', '27495db67281b29f11555f2f9aaf7b0f')
Rotarran:Projects julio$
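To go from a tuple like the ones above to a file on disk, something like the following sketch could be used. It is written for Python 3 and rests on assumptions: `store_file`, the layout of the returned dict, and the caller-chosen storage root are all hypothetical and not part of hashfiles.py. The hashing line simply mirrors generate_hash_from_filename.

```python
import hashlib
import os
import shutil
import tempfile
import uuid

def store_file(src_path, storage_root):
    """Copy src_path to storage_root/xx/yy/zz/<hash> and return metadata.

    store_file and the returned dict layout are assumptions made for
    illustration; they are not part of hashfiles.py.
    """
    original_name = os.path.basename(src_path)
    salted = str(uuid.uuid4()) + original_name
    hashed = hashlib.md5(salted.encode('utf-8')).hexdigest()
    target_dir = os.path.join(storage_root,
                              hashed[:2], hashed[2:4], hashed[4:6])
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, hashed)
    shutil.copyfile(src_path, target_path)
    # This dict is what would go into the database instead of the payload.
    return {
        'original_name': original_name,
        'hash': hashed,
        'path': target_path,
        'size': os.path.getsize(target_path),
    }

# Demo with a throwaway file and a temporary storage root.
root = tempfile.mkdtemp()
src = os.path.join(root, 'family.jpg')
with open(src, 'wb') as handle:
    handle.write(b'not really a jpeg')

first = store_file(src, os.path.join(root, 'store'))
second = store_file(src, os.path.join(root, 'store'))
print(first['original_name'], second['original_name'])
print(first['path'])
print(second['path'])
```

Both records keep the original name "family.jpg" while pointing at different paths, which is the "same name, same album" case from the beginning of the article. Moving to another disk or server would then only mean changing the storage root.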

And before we finish, here is the corresponding unit test module for hashfiles.py, which tests whether 10,000 hashes generated for the same file name are all unique:

# -*- coding: utf8 -*-

import unittest

from hashfiles import generate_hash_from_filename

class TestHashFiles(unittest.TestCase):
    
    def test_uniqueness(self):
        """Generate ten thousand hashes for the same file name
        and fail if any two of them are equal.
        """
        fname = 'test.jpg'
        # range works on both Python 2 and 3 (xrange is Python 2 only).
        hashes = [generate_hash_from_filename(fname) for _ in range(10000)]
        self.assertEqual(len(hashes), len(set(hashes)))

if __name__ == '__main__':
    unittest.main()

So there you have it, in case you want to work on the next Facebook :)

My Projects in BitBucket

Added 5 months ago, Modified 5 months ago, Under Programming
If you are interested, you can find my projects on BitBucket at:

https://bitbucket.org/speedbird

QA-Stack (a Stack Overflow-inspired Python web app in web2py) is located at:

https://bitbucket.org/speedbird/qastack
Live Website: http://www.qa-stack.com

ITrack (an issue tracking system Python web app in web2py) is located at:

https://bitbucket.org/speedbird/i-track
Live Website: http://www.i-track.org

pyForum (a message board Python web app in web2py) is located at:

https://bitbucket.org/speedbird/pyforum
Live Website: http://www.pyforum.org

(All non-https links should open in their own new window)

Web services for work

Added 5 months ago, Modified 5 months ago, Under Web Applications

Imagine your company has 2 or 3 applications that "kind of" do the same thing. Say both get information from your clients: one gives you a client's basic contact information, and the other gives you more detailed contact information.

Then you decide to add a new piece of information to your client records, DOB for instance, and are required to return the client's "age" based on the DOB.

Your developers will now have to not only modify two applications to accommodate the new field, but also add logic to each application to calculate this value.

Enter Web Services

You grab your developers and ask them to create a "centralized" way of getting that information so you don't have to modify the logic of your programs every time there is a change of this nature. They in turn create this "service", which, when queried, would return anything you want both your applications to process.

Your calling applications become leaner, because you offload all the fuzzy business logic from your programs and stick it into one service that will always provide the latest, up-to-date information of any client you choose.

Internally, your applications will connect to your central web service and query it. Without going into details, this central web service will return the needed information (with the age calculated) in a predefined format that can be consumed by all the applications in your firm.
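As a very rough sketch of such a central service, here is a tiny WSGI application in Python 3 that returns a client as JSON with the age already calculated. Everything in it is an assumption made for illustration: the in-memory `CLIENTS` store, the sample client, the `/clients/<id>` URL layout, and the names `calculate_age` and `client_service`. A real service would sit in front of your actual client database.

```python
import json
from datetime import date

# Hypothetical in-memory client store; a real service would query a database.
CLIENTS = {
    '42': {'name': 'Jane Doe', 'dob': date(1980, 5, 17)},
}

def calculate_age(dob, today=None):
    """Return the number of whole years between dob and today."""
    today = today or date.today()
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (dob.month, dob.day)
    return today.year - dob.year - int(before_birthday)

def client_service(environ, start_response):
    """WSGI app: /clients/<id> returns the client as JSON, including age."""
    client_id = environ.get('PATH_INFO', '').rsplit('/', 1)[-1]
    client = CLIENTS.get(client_id)
    headers = [('Content-Type', 'application/json')]
    if client is None:
        start_response('404 Not Found', headers)
        return [json.dumps({'error': 'unknown client'}).encode('utf-8')]
    payload = {
        'name': client['name'],
        'dob': client['dob'].isoformat(),
        'age': calculate_age(client['dob']),  # the logic lives in one place
    }
    start_response('200 OK', headers)
    return [json.dumps(payload).encode('utf-8')]

# Call the app directly, the way a WSGI server would.
def show_status(status, headers):
    print(status)

body = client_service({'PATH_INFO': '/clients/42'}, show_status)
print(body[0].decode('utf-8'))

# To serve it over HTTP (not run here):
#   from wsgiref.simple_server import make_server
#   make_server('localhost', 8000, client_service).serve_forever()
```

Both calling applications would then request the same URL and read the `age` field, so a change to the age rule only has to be made inside `calculate_age`.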

Web services can be extremely complex, but once in place, they can be consumed by a wide range of devices: your phone, a web page, even real speech (Siri?).