
Monday, November 19, 2012 

Data structures and algorithms.

A dictionary is a structure that stores key-value pairs: you store data by giving the dictionary a key and a value, and retrieve the value back by giving it the key.

Could be implemented using:
an AVL tree,
a red-black tree,
an ordered or unordered array,
a hash table.

Ordered array:
- In this we always store the data in sorted order.
Insertion/Deletion: Insertion takes O(log n) to figure out where the element should be placed, plus O(n) to actually insert, since the elements after the insertion point have to be shifted right. Deletion similarly takes O(n) for the shift.
Search: Search takes O(log n), since the elements are already sorted and we can use binary search.

Disadvantages: Growing is a problem, as we have to allocate a bigger array and copy everything over. Such dynamic arrays are also expensive for deletion.
Applications: Best for storing data with frequent retrieval by index, especially when the length is fixed. Also used as the internal data structure for building other structures.
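The two costs above can be seen in a small sketch (class and method names are my own, for illustration; growth is by doubling, as in the growth strategy below):

```java
import java.util.Arrays;

// Sketch of an ordered array: binary search for lookup (O(log n)),
// insertion shifts the tail right (O(n)). Illustrative, not production code.
class OrderedArray {
    int[] data;
    int size;

    OrderedArray(int capacity) { data = new int[capacity]; }

    // Index of key, or -1 if absent: classic binary search.
    int search(int key) {
        int lo = 0, hi = size - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (data[mid] == key) return mid;
            if (data[mid] < key) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    // Find the slot in O(log n), then shift the tail right in O(n).
    void insert(int key) {
        if (size == data.length)                  // growth strategy: double
            data = Arrays.copyOf(data, 2 * data.length);
        int pos = 0, hi = size;
        while (pos < hi) {                        // first index with data[pos] >= key
            int mid = (pos + hi) >>> 1;
            if (data[mid] < key) pos = mid + 1; else hi = mid;
        }
        System.arraycopy(data, pos, data, pos + 1, size - pos);
        data[pos] = key;
        size++;
    }
}
```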
Growing arrays:
Tight strategy:  new array length  f(n) = n+c
Growth strategy: new array length  f(n) = 2n

Unordered array:
In this we store the elements one after the other, in arbitrary order.
Insertion / deletion: Insertion takes O(1). Deletion takes O(n), since you first have to find the element; if you already know its location, it can be done in O(1).
Search: Takes O(n), as there is no alternative to comparing the element we are searching for against the others one by one.

Direct addressing:
Here we store the data in contiguous locations covering the entire key range. That is, if the keys are in the range 1-100, then even if there are only 10 elements, we allocate space for 100 and store each key at its corresponding location.
Insertion / deletion / search all take O(1) in this case.
However, the space cost is much higher.

Hash table:
For n elements, map them to a table of m slots using a hash function. Lookup can then be done in O(1) expected time.
Collision resolution - in case several keys map to the same location - can be done using:

Chaining: keep a linked list of the keys that map to the same location.
Given a hash table T with m slots storing n elements, the load factor is a = n/m.
Search time: In case of chaining it is O(1 + a) (because the average linked-list length is n/m).
Insertion time: In case of chaining it is O(1 + a).
If n is proportional to m, then search, insertion and deletion all take expected O(1) (deletion assuming a doubly linked list, so the node can be unlinked directly).
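A minimal chaining table might look like this (illustrative names, not production code; the built-in `hashCode` stands in for the hash functions discussed below):

```java
import java.util.LinkedList;

// Sketch of a hash table with chaining: each of the m slots holds
// a linked list of the entries whose keys hash to that slot.
class ChainedHashTable {
    static class Entry {
        final String key; int value;
        Entry(String k, int v) { key = k; value = v; }
    }

    final LinkedList<Entry>[] slots;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int m) {
        slots = new LinkedList[m];
        for (int i = 0; i < m; i++) slots[i] = new LinkedList<>();
    }

    int slot(String key) { return Math.floorMod(key.hashCode(), slots.length); }

    // Insertion: O(1) plus an O(1 + a) scan to replace an existing key.
    void put(String key, int value) {
        for (Entry e : slots[slot(key)])
            if (e.key.equals(key)) { e.value = value; return; }
        slots[slot(key)].addFirst(new Entry(key, value));
    }

    // Search: expected O(1 + a), where a = n/m is the load factor.
    Integer get(String key) {
        for (Entry e : slots[slot(key)])
            if (e.key.equals(key)) return e.value;
        return null;
    }
}
```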

Open addressing:
Here all the elements are stored in the hash table itself, so the number of elements n must be at most m. The hash function now takes a probe number along with the key: we keep probing until we find an empty location while storing, or until we find the element while retrieving. The probe sequence can be generated in the ways below.
Linear probing:
Here we simply increment the probe and go to the next location. So if for a given key the location at probe 0 is occupied, we move to the immediately next location, and so on until we find an empty one.
Slower than chaining.
When we delete an element, we place a tombstone in its slot. When tombstones accumulate, we rehash.

Double hashing:
Use one hash function h1(k) for the initial probe and a second, h2(k), as the offset when a location is occupied, instead of always stepping to the immediately next location. Note that in this case the table length m should be a prime number (so the probe sequence can cover the whole table).
Search time in case of open addressing: unsuccessful - O(1/(1-a)); successful - O((1/a) * log(1/(1-a))).

Hash functions:
Hash functions have two parts:
Hashcode map - which maps the key (e.g. a string) to an integer.
Compression map - which maps that integer into the range [0, m-1].

Integer casting - works for characters and integers. (It will not work well for longs, doubles, or strings, as there is a high chance different values map to the same int after chopping off the high bits.)

Component sum: For longs and doubles, we can add the two 32-bit halves.
static int hashCode(long i) { return (int) (i >> 32) + (int) i; }
This would not work well for strings, because sums collide: e.g. "ad" and "bc" hash the same, since a + d = b + c.

Polynomial accumulation:
Take each character of the string and weight it by a power of a constant a based on its position, evaluated efficiently using Horner's form:
x_(k-1) + a(x_(k-2) + a(x_(k-3) + ... + a(x_2 + a(x_1 + a*x_0)) ... ))
Tested on 50,000 English words as a function of the constant a, the values a = 33, 37, and 41 were found to give fewer than 7 collisions. Java uses a = 31, which I suspect also gives low collision rates.
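As a sketch, Horner's form is a single left-to-right pass; with a = 31 this matches the documented behavior of java.lang.String.hashCode (integer overflow simply wraps):

```java
// Polynomial accumulation in Horner form: each step multiplies the
// running hash by the constant a and adds the next character code.
class PolyHash {
    static int hash(String s, int a) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = a * h + s.charAt(i);   // h = x0*a^(k-1) + ... + x_(k-1)
        return h;
    }
}
```

Note that unlike the component sum, this fixes the "ad" vs "bc" collision, since position now matters.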

Compression maps:
After the hashcode map produces an integer k, compression can be done using:
h(k) = k mod m
If m is a power of 2 (more generally m = b^e), this is bad: h(k) then depends only on the last e digits (bits) of k, so all keys sharing those last digits go to the same location.
If m is a prime not close to a power of 2, it works well.

h(k) = floor(m * (k*A mod 1))
where A is a constant with 0 < A < 1. k*A mod 1 gives the fractional part of k*A; we then multiply that fractional part by m and take the integer part.
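A sketch of this multiplication method (names illustrative; assumes nonnegative keys, and uses A = (sqrt(5) - 1) / 2, a commonly suggested choice):

```java
// Multiplication method: h(k) = floor(m * (k*A mod 1)), 0 < A < 1.
class MultiplicationHash {
    static final double A = (Math.sqrt(5) - 1) / 2;  // ~0.618

    static int compress(int k, int m) {
        double frac = (k * A) % 1.0;   // fractional part of k*A
        return (int) (m * frac);       // scale to [0, m-1]
    }
}
```

A nice property is that the choice of m is not critical here, unlike with k mod m.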

You can have a collection H of hash functions, using different ones for different data sets at different times. The collection is called universal if, for a randomly chosen f from H and any two distinct keys k and l, Pr{f(k) = f(l)} <= 1/m.

Ordered dictionary:
In addition to the lookup operations of a dictionary, an ordered dictionary supports order-based methods such as min, max, successor and predecessor.

A binary tree is a tree where every node has at most two children. A binary search tree is a binary tree with the search property: given a node, all the elements in its left subtree are less than or equal to it, and all the elements in its right subtree are greater than or equal to it.

Binary tree:
Ordered tree: one in which the children of each node are ordered / arranged by some criterion.
Binary tree: an ordered tree in which each node has at most 2 children.

Complete binary tree: Is the tree in which:
1. Level i has 2^i nodes.
2. In a tree of height h, all leaves are at level h.
3. Number of leaves = 2^h.
4. Number of internal nodes = 1 + 2 + 2^2 + ... + 2^(h-1) = 2^h - 1.
5. So the number of internal nodes = number of leaves - 1.
6. Total number of nodes = internal nodes + leaves = (2^h - 1) + 2^h = 2^(h+1) - 1.
7. Height = log2(n+1) - 1 (basically a function of log n).
8. Of the n nodes, roughly half are leaves: (n+1)/2.

Balanced binary tree: a tree in which, for every internal node, the heights of its two children differ by at most 1.
Height = O(log n ) similar to that of a complete binary tree.

Binary tree:
1. Minimum height = height of a complete binary tree = log2(n+1) - 1.
2. Maximum height = n - 1.
3. Number of leaves <= 1 + number of internal nodes.

Data structures for representing binary trees:
Generally a binary tree is built from a custom node structure with left, right and parent references.
You can also simulate this using an array. In this compact arrangement, if a node has index i, its children are found at indices 2i + 1 (left child) and 2i + 2 (right child), while its parent (if any) is found at index floor((i - 1)/2) (assuming the root has index zero). This method gives more compact storage and better locality of reference, particularly during a preorder traversal. However, it is expensive to grow, and for non-complete trees it wastes space proportional to 2^h - n for a tree of height h with n nodes.
For complete binary trees no space is wasted, since nodes are added in breadth-first order; this is why the method is often used for binary heaps.
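The index arithmetic (root at index 0) can be captured in a few one-liners:

```java
// Index arithmetic for a binary tree stored in an array, root at
// index 0 (matches the 2i+1 / 2i+2 scheme described above).
class ArrayTree {
    static int left(int i)   { return 2 * i + 1; }
    static int right(int i)  { return 2 * i + 2; }
    static int parent(int i) { return (i - 1) / 2; }  // root (i = 0) has no parent
}
```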

Binary trees for expression evaluation.
Binary trees for sorting.

Binary search tree:
-          A binary search tree is a representation of an ordered dictionary / sorted series of numbers. (An ordered array generally takes O(n) time for insertion, deletion and update, so we use binary search trees, whose average time is O(log n) and is generally proportional to the height of the tree.)
-          Worst-case height of a binary tree is n - 1.
-          To find the min, go left from the root until you find a node without a left child.
-          To find the max, go right from the root until you find a node without a right child.
-          Successor - of a node is the smallest node that is greater than the given node.
o   Case 1: If the current node x has a right child/subtree, the successor is the leftmost node of the right subtree.
o   Case 2: If the right subtree is empty, the successor is the closest ancestor of x whose left child is also an ancestor of x.
§  Basically, find the first ancestor of the current node whose left child is also an ancestor of (or is) the current node.
-          Deletion:
o   Case 1: If x has no children, just remove it by setting its parent's reference to it to null.
o   Case 2: If x has exactly one child, simply remove x and link the child to x's parent.
o   Case 3: If x has two children:
§  Find the successor / predecessor of x. Note that the successor / predecessor has at most one child.
§  Remove the successor / predecessor using the cases above.
§  Replace x with the successor / predecessor.
-          Creating a binary search tree of n numbers takes:
o   Worst case O(n^2).
o   Best / average case O(n log n).
-          As every operation on a binary search tree takes O(h) time, we want to keep h optimal. The optimal value for h is log n. Red-black trees / AVL trees / 2-4 trees all ensure this optimal height - they are called balanced binary search trees.
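A sketch of the basic operations (unbalanced tree, illustrative names; duplicates go left, matching the "less than or equal" convention above):

```java
// Unbalanced binary search tree with insert, search and min;
// each operation is O(h), where h is the tree's height.
class BST {
    static class Node {
        int key; Node left, right;
        Node(int k) { key = k; }
    }

    Node root;

    void insert(int key) { root = insert(root, key); }

    private Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key <= n.key) n.left = insert(n.left, key);
        else n.right = insert(n.right, key);
        return n;
    }

    boolean contains(int key) {
        Node n = root;
        while (n != null) {
            if (key == n.key) return true;
            n = key < n.key ? n.left : n.right;
        }
        return false;
    }

    // Minimum: keep going left until a node has no left child.
    int min() {
        Node n = root;
        while (n.left != null) n = n.left;
        return n.key;
    }
}
```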

AVL trees
-          The height difference between the children of every node in the tree cannot be more than 1 (height-balanced binary tree).
-          In an AVL tree, if there is a leaf at level k, then the height of the tree cannot exceed 2k - 1.
-          An AVL tree of height h has between roughly 2^((h-1)/2) and 2^h nodes.

AVL trees and red-black trees are balanced binary search trees whose height is proportional to log n, and insert / delete / search all take O(log n) time.

Priority queues:
They provide an interface to retrieve the min / max of a sequence quickly.

Unordered sequence:
minimum - operation would take O(n)
insertion - insertion would take O(1)
deletemin- would take O(n)
Ordered sequence:
minimum - would take O(1) time.
insertion would take O(n) time.
deletemin - would take O(1).

Binary heap:
A binary heap is a complete binary tree except possibly the last level, which is filled left to right (i.e., if there is a gap, it is towards the right).
Also, the value of any node is less than or equal to that of its children (min-heap).
Implementing the heap with an array (root at index 0):
parent of the element at i is at floor((i - 1)/2).
left child of the element at i is at 2i + 1.
right child of the element at i is at 2i + 2.

Insertion in a heap: Just add the new element at the bottom right of the heap (the next free array slot), and keep swapping it with its parent until it reaches the right position. O(log n), as we might have to do up to h comparisons & swaps.
Heapify: If the heap property is violated at node i, but its left and right subtrees are heaps, we can run heapify to fix node i. The process: take the smaller of its two children and swap it with this element; once swapped, check whether the subtree rooted at the swapped position is still a heap, and if not, run heapify on it as well.
In heapify we move elements down the heap; since we can move at most the height of the heap, it takes O(log n).
Minimum: To find the min of a heap, we just do the O(1) operation of taking the element from the top.
Delete-min: Delete the min at the top, then take the last element (the rightmost element of the last level), move it to the top of the tree, and do a heapify. Hence delete-min takes O(log n).

Creating a heap: can be done using repeated insertions, which takes O(n log n). Instead, we can build the heap bottom-up: first place the elements in the array arbitrarily, then run heapify from the bottom up - first on the lowest internal nodes (the leaves are trivially heaps), then on the next level, and so on. This makes it O(n).

Sorting with a heap: Build a heap, then do repeated delete-mins. After each delete, move the deleted element to the end of the heap's array region. The heap is sorted in place and you get a sorted sequence. O(n) + (log n + log(n-1) + log(n-2) + ...) = O(n) + log(n!) = O(n) + O(n log n) = O(n log n).
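The notes above use a min-heap and delete-min; for an in-place ascending sort it is more convenient to flip to a max-heap and repeatedly move the max to the end. A sketch of that variant (names illustrative):

```java
// In-place heapsort: bottom-up build-heap (O(n)), then repeated
// delete-max, moving each maximum to the end of the array (O(n log n)).
class HeapSort {
    static void sort(int[] a) {
        int n = a.length;
        for (int i = n / 2 - 1; i >= 0; i--)  // heapify internal nodes, bottom up
            heapify(a, n, i);
        for (int end = n - 1; end > 0; end--) {
            swap(a, 0, end);                  // move current max to the end
            heapify(a, end, 0);               // restore the heap on the prefix
        }
    }

    // Sift a[i] down within a[0..size-1], assuming both subtrees are heaps.
    static void heapify(int[] a, int size, int i) {
        int largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && a[l] > a[largest]) largest = l;
        if (r < size && a[r] > a[largest]) largest = r;
        if (largest != i) {
            swap(a, i, largest);
            heapify(a, size, largest);
        }
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}
```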

Searching for patterns (the Knuth-Morris-Pratt approach) takes O(n + m), where m is the length of the pattern and n is the length of the text. We preprocess the pattern and compute a failure function that tells us, for each position of the pattern, how far to shift so that matching can continue after a mismatch.
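The failure function can be computed in O(m); a sketch (the matching loop that then scans the text with it is omitted):

```java
// KMP failure function: fail[i] is the length of the longest proper
// prefix of the pattern that is also a suffix of pattern[0..i].
class FailureFunction {
    static int[] compute(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int k = 0;                                 // length of the current border
        for (int i = 1; i < m; i++) {
            while (k > 0 && pattern.charAt(k) != pattern.charAt(i))
                k = fail[k - 1];                   // fall back to a shorter border
            if (pattern.charAt(k) == pattern.charAt(i)) k++;
            fail[i] = k;
        }
        return fail;
    }
}
```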

If we preprocess the text and build a trie (basically a tree with branching factor up to the alphabet size in the worst case at each level; the actual shape depends on the words in the text), search can take much less time. Traversing from the root to a leaf of the trie spells out each word in the text; at each leaf we keep the indexes of the locations where that word can be found.
Here search takes O(dm) (d - alphabet size, m - length of the word) and does not depend on n (with children implemented as linked lists).
Insertion / deletion - O(dm).
Space O(w), where w is the total size of the unique strings in the text.
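A minimal trie sketch (illustrative; children are kept in a map per node here, though the linked-list-of-children layout the notes mention works too):

```java
import java.util.HashMap;
import java.util.Map;

// Trie with O(dm)-style insert/search: one node per character on the path.
class Trie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;   // marks the end of a stored word
    }

    final Node root = new Node();

    void insert(String word) {
        Node n = root;
        for (char c : word.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new Node());
        n.isWord = true;
    }

    boolean contains(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return false;
        }
        return n.isWord;
    }
}
```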

Compressed trie: obtained by compressing chains of single-child nodes into one node, each compressed node referencing a substring of an original word; this reduces the space complexity to being proportional to the number of unique words w rather than their total length.

Suffix trie: a trie of all the suffixes of a text.
From the root we have edges for the characters occurring in the text, spelling out every suffix (again, these are compressed); the leaves hold the starting location of their suffix.
Creating a suffix trie this way takes O(n^2).

Huffman trie: instead of fixed-length encoding, we encode characters that repeat more often with shorter codes - basically variable-length encoding.
With variable-length encoding, problems can arise while decoding: we might not be able to identify the characters uniquely. To avoid this, all the characters must follow the prefix rule: no character's encoding is a prefix of another character's encoding.
An encoding trie can be built using 0/1 edge labels instead of letters, and then used to encode / decode any text. To compress the text well, the letters with the highest repetition counts should sit closest to the root of the encoding trie.
We construct the trie using Huffman's algorithm, based on the weights (frequencies): repeatedly take the two trees with the least total weight and merge them under a new parent node, until a single tree remains. Once the trie is formed, we label one edge of each node 0 and the other 1; for each character, reading out the digits along its path gives its code.
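The greedy merge can be sketched with a priority queue (illustrative names; assumes at least two distinct characters in the input):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Huffman coding sketch: repeatedly merge the two lowest-weight trees,
// then label edges 0/1 to read off each character's variable-length code.
class Huffman {
    static class Node {
        char ch; int weight; Node left, right;
        Node(char c, int w) { ch = c; weight = w; }
        Node(Node l, Node r) { weight = l.weight + r.weight; left = l; right = r; }
    }

    static Map<Character, String> codes(Map<Character, Integer> freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>((x, y) -> x.weight - y.weight);
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue()));
        while (pq.size() > 1)
            pq.add(new Node(pq.poll(), pq.poll()));   // merge two lightest trees
        Map<Character, String> codes = new HashMap<>();
        walk(pq.poll(), "", codes);
        return codes;
    }

    // Depth-first walk: 0 for left edges, 1 for right edges.
    static void walk(Node n, String prefix, Map<Character, String> codes) {
        if (n.left == null && n.right == null) { codes.put(n.ch, prefix); return; }
        walk(n.left, prefix + "0", codes);
        walk(n.right, prefix + "1", codes);
    }
}
```

Note how the most frequent character ends up with the shortest code.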

Vertices - the points/nodes of a graph.
Edges - the connections between the nodes.
Degree of a vertex: the number of edges incident on the vertex. The sum of the degrees of the vertices is twice the number of edges in the graph.
Adjacent vertices: vertices connected by an edge.
Simple path: a path between two vertices in which no vertex is repeated.
Cycle: a path whose last vertex is the same as its first vertex.
Connected graph: there is a path between every pair of vertices.
Subgraph: a graph formed by taking some of the vertices and edges of a graph.
Connected component: a maximal connected subgraph of a graph.

Tree: a connected graph without cycles.
Complete graph: there is an edge between every pair of vertices.
A tree of n vertices has n - 1 edges.

Spanning tree of a graph: a connected subgraph containing all the vertices of the graph that is a tree.

Euler tour: visiting all the nodes of the graph along its edges, traversing every edge exactly once (never crossing an edge twice).
This is possible only if every vertex of the graph has even degree, or all vertices except exactly 2.
Graph data structures:
Edge list: stores the edges and vertices in two unordered sequences, with pointers from edges to their endpoint vertices.
Adjacency list: an edge list where, in addition, each vertex has pointers to the list of its adjacent vertices.
Adjacency matrix: the rows and columns are indexed by the vertices, and the edges are stored inside the matrix.

Graph searching algorithms:
For a graph we can determine a spanning tree rooted at a vertex: for each node we determine its predecessor in the tree and its distance from the root vertex.
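Breadth-first search computes exactly this predecessor/distance information; a sketch over an adjacency list (illustrative names; only the distances are returned here, the pred[] array holds the spanning-tree edges):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// BFS over an adjacency list: for every vertex, compute its distance
// from the root and its predecessor in the BFS spanning tree.
class Bfs {
    static int[] distances(List<List<Integer>> adj, int root) {
        int n = adj.size();
        int[] dist = new int[n];
        int[] pred = new int[n];
        Arrays.fill(dist, -1);          // -1 marks unreached vertices
        Arrays.fill(pred, -1);
        Queue<Integer> queue = new ArrayDeque<>();
        dist[root] = 0;
        queue.add(root);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int v : adj.get(u)) {
                if (dist[v] == -1) {    // first time we reach v
                    dist[v] = dist[u] + 1;
                    pred[v] = u;        // tree edge u -> v in the spanning tree
                    queue.add(v);
                }
            }
        }
        return dist;
    }
}
```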