-----------------------------------
zylum
Fri Aug 12, 2005 11:16 pm

interesting problem
-----------------------------------
You are part of a team that is working on a piece of software to handle text compression. Your team has designed the compression algorithm as follows:

To compress a given string of text, two strings, each 3 characters in length, will be chosen as compression keys. The strings may contain any combination of capital letters and/or spaces. Then, a compressed string will be generated from the original such that replacing "*1" in the compressed string with the first string and replacing "*2" with the second string will recreate the original text.

For example, if the two compression keys are "FOO" and "BAR", then the compressed string "*2X*1" would decompress to "BARXFOO", and "*1*1 *2" would decompress to "FOOFOO BAR".

You have been tasked with writing a proof of concept implementation to test the effectiveness of this compression scheme. You will be given a String original. Your goal is to optimally select the two compression keys, and generate the compressed text to be as short as possible. You are to return the length of the shortest possible text.


Examples:

0)
    
"BARXFOO"
Returns: 5
The first example from the problem statement.


1)
    
"FOOFOO BAR"
Returns: 7
The second example from the problem statement.


2)
    
"ABCDEFGHIJKLMNOPQRSTUVWXYABCDEFGHIJKLMNOPQRSTUVWXY"
Returns: 46
It's a long string, but no 3-character substring appears more than twice.


3)
    
"QWERTYUIOPASDFGHJKLZXCVBNM"
Returns: 24
Here, no substring repeats itself at all. The best we can do is to pick any two substrings to replace.


4)
    
"BBAAAABBBBAAABABABBBABABAAABABABAAABBAABABBBABAAAB"
Returns: 40



hint: there are 26^7 possible combinations of two keys which is too many to test.


note: this problem if from topcoder.com

-----------------------------------
bugzpodder
Sat Aug 13, 2005 4:53 pm


-----------------------------------
hint to hint: it would be stupid to do all 26^7 combinations, (i dont even know where the 26^7 comes from, isnt it 26^6)... all you need to do is pick two keys that exists at least once in the string.  so thats only about 200^2 (if the length of string is 200).  and 40000 is reasonable.  so all that is left to do is grunt work... what is the limit on the string length anyways?

-----------------------------------
wtd
Sat Aug 13, 2005 9:38 pm


-----------------------------------
This seems to work reasonably well.

module Compression where

import List
import Char

allPossibleKeys len str 
  | length str < len = []
  | valid candidateKey = candidateKey : allPossibleKeys len (drop 3 str)
  | otherwise = allPossibleKeys len (tail str)
  where
    valid = all validChar
    validChar ch = isUpper ch || isSpace ch
    candidateKey = take len str
    
firstKey len = head . allPossibleKeys len

count _ [] = 0
count v (x:xs)
  | v == x = 1 + countRest
  | otherwise = countRest
  where
    countRest = count v xs
    
rankedKeys len str =
  sortBy (\k1 k2 -> snd k2 `compare` snd k1) $ map (\k -> (k, count k keys)) $ nub keys
  where
    keys = allPossibleKeys len str

strReplace _ _ "" = ""
strReplace "" _ str = str
strReplace s1 s2 str@(x:xs) 
  | lenS1 > length str = str
  | s1 `isPrefixOf` str = s2 ++ strReplace s1 s2 (drop lenS1 str)
  | otherwise = x : strReplace s1 s2 xs
  where
    lenS1 = length s1
    lenS2 = length s2
  
compress len str =
  (key1, key2, secondTransform)
  where
    (key1, _) = head $ rankedKeys len str
    firstTransform = strReplace key1 "*1" str
    (key2, _) = head $ rankedKeys len firstTransform
    secondTransform = strReplace key2 "*2" firstTransform
    
decompress key1 key2 compressedStr =
  strReplace "*2" key2 $ strReplace "*1" key1 compressedStr

-----------------------------------
zylum
Sat Aug 13, 2005 11:14 pm


-----------------------------------
hint to hint: it would be stupid to do all 26^7 combinations, (i dont even know where the 26^7 comes from, isnt it 26^6)... all you need to do is pick two keys that exists at least once in the string.  so thats only about 200^2 (if the length of string is 200).  and 40000 is reasonable.  so all that is left to do is grunt work... what is the limit on the string length anyways?

strings are 1-50 characters long and consist of upper case letters and space characters...

wtd >> if you give a solution in java / c++ / c# / vb.net then i could varify your answer... 

btw it's not as easy as it seems...

hint2: even if you do have the correct keys, you still have to figure out what the shortest possible string the original string can be compressed into using these keys... the examples do not have any cases which will display this...

-----------------------------------
wtd
Sat Aug 13, 2005 11:50 pm


-----------------------------------
C:\> ghci Compression
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base-1.0 ... linking ... done.
Compiling Compression      ( Compression.hs, interpreted )
Ok, modules loaded: Compression.
*Compression> let s = "BBAAAABBBBAAABABABBBABABAAABABABAAABBAABABBBABAAAB"
*Compression> length s
50
*Compression> let ((key1,_),(key2,_),str) = compress 3 s
Loading package haskell98-1.0 ... linking ... done.
*Compression> str
"BB*2ABBBB*2*1ABB*1AB*2*1AB*2BBAA*1B*1*2B"
*Compression> length str
40
*Compression> let dstr = decompress key1 key2 str
*Compression> dstr
"BBAAAABBBBAAABABABBBABABAAABABABAAABBAABABBBABAAAB"
*Compression> length dstr
50
*Compression> s == dstr
True
*Compression> key1
"BAB"
*Compression> key2
"AAA"
*Compression>

-----------------------------------
wtd
Sat Aug 13, 2005 11:56 pm


-----------------------------------
wtd >> if you give a solution in java / c++ / c# / vb.net then i could varify your answer... 

Yes, but it's so much easier to write it in Haskell.  :)

-----------------------------------
zylum
Sun Aug 14, 2005 12:35 am


-----------------------------------
i suppose... but i don't even understand the code you wrote so even i couldnt tell you whether it might pass all test cases. maybe you could explain the algorithm you used?

-----------------------------------
wtd
Sun Aug 14, 2005 1:38 am


-----------------------------------
Create the Compression module.

module Compression where

Import functions from the Char and List modules.

import List
import Char

The allPossibleKeys function finds, well... all possible keys.

allPossibleKeys len str 
  | length str < len = []
  | valid candidateKey = candidateKey : allPossibleKeys len (drop 3 str)
  | otherwise = allPossibleKeys len (tail str)
  where
    valid = all validChar
    validChar ch = isUpper ch || isSpace ch
    candidateKey = take len str
    
This function takes two arguments: the length of the keys, and the original string.

It uses guard conditions, local functions, and recursion.

Obviously, if the string isn't long enough to represent a key, there are no valid keys.  Return an empty list.

Otherwise, if the candidate key is valid, return that candidate, tacked onto the result of evaluating the function on the rest of the string.

The candidate key is the first three characters of the input string.

The validChar function tests to see if a character is an uppercase letter or space.

The valid function tests to see if all characters in a string conform to the rules set in the validChar function.  This uses partial function application and foregoes an explicit function parameter.

Otherwise, if there are enough characters, but they don't form a valid key, drop a single character from the beginning of the input string and search it for keys.

The firstKey function is an example of function composition.  The string as the second parameter is inferred.  It gets the first key from the list of all possible keys using the standard "head" function.  The "." is the Haskell function composition operator. 

firstKey len = head . allPossibleKeys len

The count function is a straightforward recursive function.

count _ [] = 0
    
Regardless of what you're counting, there aren't any in an empty list.

count v (x:xs)
  | v == x = 1 + countRest
  | otherwise = countRest
  where
    countRest = count v xs
    
For a non-empty list, if the value you're looking for equals the first element in the list, add one to the result of counting the rest of the list.  Otherwise, just count the rest of the list.

The "rankedKeys" function first finds all possible keys.

rankedKeys len str =
  sortBy (\k1 k2 -> snd k2 `compare` snd k1) $ map (\k -> (k, count k keys)) $ nub keys
  where
    keys = allPossibleKeys len str
    
Then it uses the standard "nub" function to get unique keys, and for each of those maps its count in the original list of keys.

It then uses these counts to sort the list, such that keys which appear most frequently are first.

The "strReplace" function is a pretty straightforward way to replace one string with another in a larger string.

Now, since we can get a list of possible keys, sorted by the number of times they occur in a string, we can find the most common key just by taking the first one using the "head" function.

compress len str =
  (key1, key2, secondTransform)
  where
    (key1, _) = head $ rankedKeys len str
    firstTransform = strReplace key1 "*1" str
    (key2, _) = head $ rankedKeys len firstTransform
    secondTransform = strReplace key2 "*2" firstTransform
    
That's exactly what I've done with:
    
    (key1, _) = head $ rankedKeys len str
    firstTransform = strReplace key1 "*1" str
    
Now, since I have to assume that I've changed the string, I search the string again, and find the new best candidate, then do another string replacement to get the final compression.  The function returns both keys and the compressed string in a tuple.

-----------------------------------
zylum
Sun Aug 14, 2005 2:26 am


-----------------------------------
wow, that was really confusing...

by explaining your algorithm i meant something more like:

"First we generate all possible 3 letter keys. Then try every combination of two keys that exist in the string and see which combination reduces the original string the most"

anyhoo, what does your solution return for 'AABBAABBBAABBA' ?

-----------------------------------
wtd
Sun Aug 14, 2005 4:48 am


-----------------------------------
It actually breaks my solution, since my solution doesn't test to make sure there are two valid keys.  The first key it goes for is "AAB".  Once that's replaced with "*1", though, there are no further 3 character keys left.

-----------------------------------
wtd
Sun Aug 14, 2005 5:24 am


-----------------------------------
And about my explanation... well, I thought that if the code itself were explained a bit, the algorithm would be fairly apparent.

-----------------------------------
bugzpodder
Sun Aug 14, 2005 9:04 am


-----------------------------------
hey zylum, why not try a more interesting problem?  like those three in the IOI practice set!  They look fantastically good!

-----------------------------------
zylum
Sun Aug 14, 2005 9:35 pm


-----------------------------------
hey zylum, why not try a more interesting problem?  like those three in the IOI practice set!  They look fantastically good!

sounds good! can you post a link?

btw here's my solution to who ever is interested..

import java.util.*;
public class CompressionText {
	public int shortestLength(String original) {
		int s = original.length();
		Vector keys = new Vector();
		for (int i = 0; i