.. Yes, blogs still do exist ..
What is TechFuel? - The original idea of Techfuel.net is fairly simple: Publish an eclectic, mostly personal, mix of funny and not-so-funny topics with a sprinkle of technical information for my loyal cybernauts. As an avid advocate of Open Source and a Die-Hard python developer, many entries will follow this line.

Trouble with Arrays

Added 1 month ago, Modified 1 month ago, Under Web Applications
I spent some time with a friend trying to optimize a simple algorithm: Given an array of positive integers, extract two elements (values "a", and "b") for which its sum equals a control number "s". This looks much like an interview question, and my understanding is that this exercise is rather old, but here we go:

Problem: having the following array: [1,3,5,8,9,15], find out which two elements sum 17, Results should be: { a: 8, b: 9 }

Language: JavaScript

Part I, the un-optimized, brute force, and ugly code:

/* So.. s = a + b (keep this in mind) */
function sum2(array, s) {
    var arrLength = array.length,
        arrIdx = 0,
        a = 0,
        b = 0,
        controlArray = [0],
        ctrlArrIdx = 0;

    /* Now Loop through the main array */
    for (arrIdx = 0; arrIdx < arrLength; arrIdx++) {
        /* Our "a" value will be the element of the
           array, which we are currently in..
        */
        a = array[arrIdx];
        /* Loop through our "control" array */
        for (ctrlArrIdx = 0;
             ctrlArrIdx < controlArray.length;
             ctrlArrIdx++) {
            if (arrIdx === 0) {
                b = -1;
            } else {
                /* Our "b" value is the value
                   of our "control" array..
                */
                b = controlArray[ctrlArrIdx];
            }
            /* If the formula is true... Success */
            if (b >= 0 && a + b === s) {
                return { a: a, b: b };
            } else {
                /* No match yet, add the number
                   to our control array */
                controlArray.push(num);
            }
        }
    }
    /* If all else fail */
    return { a: "Not Found", b: "Not Found" };
}
console.log(sum2([1,3,5,8,9,15], 17));

Now the cool, thought out one..

// s = a + b
// (therefore) a = s - b
function sum2(array, s) {
    var ctrlArray =[],
        a = 0,
        b = 0,
        results = {a: 'Not Found', b: 'Not Found'};
        
    for (var i = 0; i < array.length; i++) {
        /* Get out "known" "b" value from
           the main array
        */
        b = array[i];
        /* Apply formula, now we now what value
           "a" we need!
        */
        var a = s - b;
        /* if value of "a" is found, then we're
           set!, otherwise temporarily store so
           we can scan it on our next interaction
        */
        if (ctrlArray.indexOf(a) >= 0) {
            results = {a: a, b: b};
            break;
        } else {
            ctrlArray.push(b);
        }
    }
    return results;
}

console.log(sum2([1,3,5,8,9,15], 17));

It took me a bit of time to figure out why a second loop is not necessary needed, and also why sorting the array is also not necessarily needed, much cleaner code, and easier to understand. Hope it helps you folks.

Storing thousands/millons of files and organizing them

Added 2 years ago, Modified 2 years ago, Under Programming
Questioning "How Facebook does it?", how can facebook store all these gazillion number of images and get away with it?

More so, what happens if I decide to upload 20 images of my family, and every single one of them is called the same: "family.jpg" (and I want to put them in the same album)?

Enter the word of Hashing.

Though explaining what hashing is is not the goal of the article, it becomes pretty handful to do what we want to accomplish.

First, in order to store one trillion of files/images in our server requires that we address the following constraints:

  1. Not all files can be in the same filesystem directory (overkill, and most OSes have problems with this).
  2. Must be able to accommodate files with the same "name".
  3. Must be portable, if the hard drive is full, to easily start using another one (or even a remote server).
  4. Adding the image payload in a database (blob) is not only a terrible idea but an anti-pattern in itself.

So how do we do it?

Enter Mighty Python.

Obviously this can be accomplished in every single programming language, but since Python it the badass master, I'll show how can we start fixing things using this awesome language:

First, we need to come up with some solutions, let's tackle the "splitting" of files into multiple folders: We should not think about storing all files under the "user" folder (e.g. /Users/john/*), why? - Because the time will come in which you cannot add more files to it (disk full for example).

Solution for this is splitting. let's take the file's first and second letters for example and create a directory structure, and store some files, like so:

  • defiant.jpg — /Images/d/e/defiant.jpg
  • ourparty.jpg — /Images/o/u/ourparty.jpg
  • sister.jpg — /Images/s/i/sister.jpg

So far so good, but this does not fix the "same file name" issue, imagine trying to add another "sister.jpg" image.

Salting.

Let's say we add a "random" set of letters to our filenames so we can accomplish this:

  • defiant.jpgaqerdefiant.jpg — /Images/a/q/aqerdefiant.jpg
  • ourparty.jpgsrtcourparty.jpg — /Images/s/r/srtcourparty.jpg
  • sister.jpgrybssister.jpg — /Images/r/y/tybssister.jpg
  • defiant.jpg — trlodefiant.jpg — /Images/t/r/trloaqerdefiant.jpg

Note how we can now have files with the same "names". The trick now is to make it even more "random" since collisions can still occur:

Python provides several standard libraries that will allow us to handle this in a more automated way, let's use Python's hashlib library to generate hashes of filenames (or any string):

Rotarran:~ julio$ python
Python 2.7.6 (default, Sep  9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> hashlib.md5('family.jpg').hexdigest()
'ccafdb2280972b6e8d05e1a5bc40f979'
>>> hashlib.md5('family.jpg').hexdigest()
'ccafdb2280972b6e8d05e1a5bc40f979'
>>>

Note how the hash is the same for the same filename. Let's salt it with the help of the uuid library:

Rotarran:~ julio$ python
Python 2.7.6 (default, Sep  9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import uuid
>>> uuid.uuid4()
UUID('8574e336-57cd-45ba-9b52-ddbff2f6ebc9')
>>> uuid.uuid4()
UUID('c64b2298-64c3-4035-9c4b-cc07c4d0322d')
>>>

Nice, different salt values for our project!

Putting it all together:

Rotarran:~ julio$ python
Python 2.7.6 (default, Sep  9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import uuid
>>> print(hashlib.md5(str(uuid.uuid4()) + 'family.jpg').hexdigest())
402beab8d58219c5b28918c3368c6fa4
>>> print(hashlib.md5(str(uuid.uuid4()) + 'family.jpg').hexdigest())
093b1272053e3fdd96ac2689f2d3a2a3
>>> print(hashlib.md5(str(uuid.uuid4()) + 'family.jpg').hexdigest())
820a49e57b868fad6dd57dfb383aba23
>>>

Nice!, different "hashes" for the same file!

Now the final step is to put all together, splitting these files in, say, three levels, will allow for combinations of 2563 , or 16 million combinations possible, so: the "hash" 820a49e57b868fad6dd57dfb383aba23 can be subdivided in /82/0a/49/820a49e57b868fad6dd57dfb383aba23. Most likely we'd like to store all this metadata information in our database, along with the original file and maybe a content type indicating what kind if image we're talking about (jpg, png, gif, etc), among other information that might be pertinent for your uses, such as size, resolution, etc.

The following small program will do just that, from a filename, creating the basic necessary information to store into a database, it will *not* check for the existence of a real image or data stream (i.e. an uploaded file), but it will take care of the hashing and entropy of the filename itself:

# -*- coding: utf8 -*-
# hashfiles.py - filename hash generator.

import hashlib
import uuid

# imghdr (imghdr.what(fname[,stream])) to find out image type

def generate_hash_from_filename(fname):
    return hashlib.md5(str(uuid.uuid4()) + fname).hexdigest()

def get_path(fname):
    hashed = generate_hash_from_filename(fname)
    path_info = (fname, hashed[:2], hashed[2:4], hashed[4:6], hashed,)
    return path_info

if __name__ == '__main__':
    print(get_path('hello.jpg'))
    print(get_path('averylongname.png'))
    print(get_path('hello.jpg'))
    print(get_path('thisisaverylongimagename.jpg'))

Generating the following output:

Rotarran:Projects julio$ python hashfiles.py
('hello.jpg', 'bd', '2a', '9a', 'bd2a9a1cd81852ed2d8db03971d81000')
('averylongname.png', '44', '11', '9b', '44119b1b6f78eaaa99a969e62006eab6')
('hello.jpg', '20', 'dc', 'e2', '20dce2209c67d2752fdee9b9973b1a90')
('thisisaverylongimagename.jpg', '27', '49', '5d', '27495db67281b29f11555f2f9aaf7b0f')
Rotarran:Projects julio$

And before we finish, its corresponding unit test module, basically testing whether 10,000 hashes generated for the same file are all unique:

# -*- coding: utf8 -*-

import unittest

from hashfiles import generate_hash_from_filename

class TestHashFiles(unittest.TestCase):
    
    def test_uniqueness(self):
        """ Generates a sample of ten thousand hashes for the
        same file and fails if uniqueness is false
        
        """
        self.fname = 'test.jpg'
        self.fnames = []
        for index in xrange(10000):
            self.fnames.append(generate_hash_from_filename(self.fname))
        self.assertEqual(len(self.fnames), len(set(self.fnames)))

if __name__ == '__main__':
    unittest.main()

So there you have it, in case you want to work on the next facebook :)

My Projects in BitBucket

Added 2 years ago, Modified 2 years ago, Under Programming
My Projects in BitBucket, if interested you may find them in:

https://bitbucket.org/speedbird

QA-Stack (A stack-overflow-inspired python web app in web2py) is located in:

https://bitbucket.org/speedbird/qastack
Live Website: http://www.qa-stack.com

ITrack (An issue tracking system python web app in web2py) is located in:

https://bitbucket.org/speedbird/i-track
Live Website: http://www.i-track.org

pyForum (A message board python web app in web2py) is located in:

https://bitbucket.org/speedbird/pyforum
Live Website: http://www.pyforum.org

(All non-https links should open in its own new window)