Computer Science Canada

interesting problem

Author:  zylum [ Fri Aug 12, 2005 11:16 pm ]
Post subject:  interesting problem

You are part of a team that is working on a piece of software to handle text compression. Your team has designed the compression algorithm as follows:

To compress a given string of text, two strings, each 3 characters in length, will be chosen as compression keys. The strings may contain any combination of capital letters and/or spaces. Then, a compressed string will be generated from the original such that replacing "*1" in the compressed string with the first string and replacing "*2" with the second string will recreate the original text.

For example, if the two compression keys are "FOO" and "BAR", then the compressed string "*2X*1" would decompress to "BARXFOO", and "*1*1 *2" would decompress to "FOOFOO BAR".

You have been tasked with writing a proof of concept implementation to test the effectiveness of this compression scheme. You will be given a String original. Your goal is to optimally select the two compression keys, and generate the compressed text to be as short as possible. You are to return the length of the shortest possible text.


Examples:

0)

"BARXFOO"
Returns: 5
The first example from the problem statement.


1)

"FOOFOO BAR"
Returns: 7
The second example from the problem statement.


2)

"ABCDEFGHIJKLMNOPQRSTUVWXYABCDEFGHIJKLMNOPQRSTUVWXY"
Returns: 46
It's a long string, but no 3-character substring appears more than twice.


3)

"QWERTYUIOPASDFGHJKLZXCVBNM"
Returns: 24
Here, no substring repeats itself at all. The best we can do is to pick any two substrings to replace.


4)

"BBAAAABBBBAAABABABBBABABAAABABABAAABBAABABBBABAAAB"
Returns: 40



hint: there are 26^7 possible combinations of two keys which is too many to test.


note: this problem if from topcoder.com

Author:  bugzpodder [ Sat Aug 13, 2005 4:53 pm ]
Post subject: 

hint to hint: it would be stupid to do all 26^7 combinations, (i dont even know where the 26^7 comes from, isnt it 26^6)... all you need to do is pick two keys that exists at least once in the string. so thats only about 200^2 (if the length of string is 200). and 40000 is reasonable. so all that is left to do is grunt work... what is the limit on the string length anyways?

Author:  wtd [ Sat Aug 13, 2005 9:38 pm ]
Post subject: 

This seems to work reasonably well.

code:
module Compression where

import List
import Char

allPossibleKeys len str
  | length str < len = []
  | valid candidateKey = candidateKey : allPossibleKeys len (drop 3 str)
  | otherwise = allPossibleKeys len (tail str)
  where
    valid = all validChar
    validChar ch = isUpper ch || isSpace ch
    candidateKey = take len str
   
firstKey len = head . allPossibleKeys len

count _ [] = 0
count v (x:xs)
  | v == x = 1 + countRest
  | otherwise = countRest
  where
    countRest = count v xs
   
rankedKeys len str =
  sortBy (\k1 k2 -> snd k2 `compare` snd k1) $ map (\k -> (k, count k keys)) $ nub keys
  where
    keys = allPossibleKeys len str

strReplace _ _ "" = ""
strReplace "" _ str = str
strReplace s1 s2 str@(x:xs)
  | lenS1 > length str = str
  | s1 `isPrefixOf` str = s2 ++ strReplace s1 s2 (drop lenS1 str)
  | otherwise = x : strReplace s1 s2 xs
  where
    lenS1 = length s1
    lenS2 = length s2
 
compress len str =
  (key1, key2, secondTransform)
  where
    (key1, _) = head $ rankedKeys len str
    firstTransform = strReplace key1 "*1" str
    (key2, _) = head $ rankedKeys len firstTransform
    secondTransform = strReplace key2 "*2" firstTransform
   
decompress key1 key2 compressedStr =
  strReplace "*2" key2 $ strReplace "*1" key1 compressedStr

Author:  zylum [ Sat Aug 13, 2005 11:14 pm ]
Post subject: 

bugzpodder wrote:
hint to hint: it would be stupid to do all 26^7 combinations, (i dont even know where the 26^7 comes from, isnt it 26^6)... all you need to do is pick two keys that exists at least once in the string. so thats only about 200^2 (if the length of string is 200). and 40000 is reasonable. so all that is left to do is grunt work... what is the limit on the string length anyways?


strings are 1-50 characters long and consist of upper case letters and space characters...

wtd >> if you give a solution in java / c++ / c# / vb.net then i could varify your answer...

btw it's not as easy as it seems...

hint2: even if you do have the correct keys, you still have to figure out what the shortest possible string the original string can be compressed into using these keys... the examples do not have any cases which will display this...

Author:  wtd [ Sat Aug 13, 2005 11:50 pm ]
Post subject: 

code:
C:\> ghci Compression
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base-1.0 ... linking ... done.
Compiling Compression      ( Compression.hs, interpreted )
Ok, modules loaded: Compression.
*Compression> let s = "BBAAAABBBBAAABABABBBABABAAABABABAAABBAABABBBABAAAB"
*Compression> length s
50
*Compression> let ((key1,_),(key2,_),str) = compress 3 s
Loading package haskell98-1.0 ... linking ... done.
*Compression> str
"BB*2ABBBB*2*1ABB*1AB*2*1AB*2BBAA*1B*1*2B"
*Compression> length str
40
*Compression> let dstr = decompress key1 key2 str
*Compression> dstr
"BBAAAABBBBAAABABABBBABABAAABABABAAABBAABABBBABAAAB"
*Compression> length dstr
50
*Compression> s == dstr
True
*Compression> key1
"BAB"
*Compression> key2
"AAA"
*Compression>

Author:  wtd [ Sat Aug 13, 2005 11:56 pm ]
Post subject: 

zylum wrote:
wtd >> if you give a solution in java / c++ / c# / vb.net then i could varify your answer...


Yes, but it's so much easier to write it in Haskell. Smile

Author:  zylum [ Sun Aug 14, 2005 12:35 am ]
Post subject: 

i suppose... but i don't even understand the code you wrote so even i couldnt tell you whether it might pass all test cases. maybe you could explain the algorithm you used?

Author:  wtd [ Sun Aug 14, 2005 1:38 am ]
Post subject: 

Create the Compression module.

code:
module Compression where


Import functions from the Char and List modules.

code:
import List
import Char


The allPossibleKeys function finds, well... all possible keys.

code:
allPossibleKeys len str
  | length str < len = []
  | valid candidateKey = candidateKey : allPossibleKeys len (drop 3 str)
  | otherwise = allPossibleKeys len (tail str)
  where
    valid = all validChar
    validChar ch = isUpper ch || isSpace ch
    candidateKey = take len str


This function takes two arguments: the length of the keys, and the original string.

It uses guard conditions, local functions, and recursion.

Obviously, if the string isn't long enough to represent a key, there are no valid keys. Return an empty list.

Otherwise, if the candidate key is valid, return that candidate, tacked onto the result of evaluating the function on the rest of the string.

The candidate key is the first three characters of the input string.

The validChar function tests to see if a character is an uppercase letter or space.

The valid function tests to see if all characters in a string conform to the rules set in the validChar function. This uses partial function application and foregoes an explicit function parameter.

Otherwise, if there are enough characters, but they don't form a valid key, drop a single character from the beginning of the input string and search it for keys.

The firstKey function is an example of function composition. The string as the second parameter is inferred. It gets the first key from the list of all possible keys using the standard "head" function. The "." is the Haskell function composition operator.

code:
firstKey len = head . allPossibleKeys len


The count function is a straightforward recursive function.

code:
count _ [] = 0


Regardless of what you're counting, there aren't any in an empty list.

code:
count v (x:xs)
  | v == x = 1 + countRest
  | otherwise = countRest
  where
    countRest = count v xs


For a non-empty list, if the value you're looking for equals the first element in the list, add one to the result of counting the rest of the list. Otherwise, just count the rest of the list.

The "rankedKeys" function first finds all possible keys.

code:
rankedKeys len str =
  sortBy (\k1 k2 -> snd k2 `compare` snd k1) $ map (\k -> (k, count k keys)) $ nub keys
  where
    keys = allPossibleKeys len str


Then it uses the standard "nub" function to get unique keys, and for each of those maps its count in the original list of keys.

It then uses these counts to sort the list, such that keys which appear most frequently are first.

The "strReplace" function is a pretty straightforward way to replace one string with another in a larger string.

Now, since we can get a list of possible keys, sorted by the number of times they occur in a string, we can find the most common key just by taking the first one using the "head" function.

code:
compress len str =
  (key1, key2, secondTransform)
  where
    (key1, _) = head $ rankedKeys len str
    firstTransform = strReplace key1 "*1" str
    (key2, _) = head $ rankedKeys len firstTransform
    secondTransform = strReplace key2 "*2" firstTransform


That's exactly what I've done with:

code:
    (key1, _) = head $ rankedKeys len str
    firstTransform = strReplace key1 "*1" str


Now, since I have to assume that I've changed the string, I search the string again, and find the new best candidate, then do another string replacement to get the final compression. The function returns both keys and the compressed string in a tuple.

Author:  zylum [ Sun Aug 14, 2005 2:26 am ]
Post subject: 

wow, that was really confusing...

by explaining your algorithm i meant something more like:

"First we generate all possible 3 letter keys. Then try every combination of two keys that exist in the string and see which combination reduces the original string the most"

anyhoo, what does your solution return for 'AABBAABBBAABBA' ?

Author:  wtd [ Sun Aug 14, 2005 4:48 am ]
Post subject: 

It actually breaks my solution, since my solution doesn't test to make sure there are two valid keys. The first key it goes for is "AAB". Once that's replaced with "*1", though, there are no further 3 character keys left.

Author:  wtd [ Sun Aug 14, 2005 5:24 am ]
Post subject: 

And about my explanation... well, I thought that if the code itself were explained a bit, the algorithm would be fairly apparent.

Author:  bugzpodder [ Sun Aug 14, 2005 9:04 am ]
Post subject: 

hey zylum, why not try a more interesting problem? like those three in the IOI practice set! They look fantastically good!

Author:  zylum [ Sun Aug 14, 2005 9:35 pm ]
Post subject: 

bugzpodder wrote:
hey zylum, why not try a more interesting problem? like those three in the IOI practice set! They look fantastically good!


sounds good! can you post a link?

btw here's my solution to who ever is interested..

code:
import java.util.*;
public class CompressionText {
        public int shortestLength(String original) {
                int s = original.length();
                Vector keys = new Vector();
                for (int i = 0; i <= original.length()-3; i++)
                        if (!keys.contains(original.substring(i, i + 3)))
                                keys.add(original.substring(i, i + 3));
                for (int i = 0; i < keys.size(); i++)
                        for (int j = 0; j < keys.size(); j++)
                                s = Math.min(s, Search(original, (String) keys.get(i), (String) keys.get(j)));   
                return s;
        }
        int Search (String s, String k1, String k2) {
                return Math.min(s.indexOf(k1) < 0 ? s.length() : Search(s.replaceFirst(k1, "xx"), k1, k2),
                                s.indexOf(k2) < 0 ? s.length() : Search(s.replaceFirst(k2, "xx"), k1, k2));
        }
}


first i generate all keys by spliting the string into 3 letter substrings... i add each new key to a key vector making sure i dont add repeating keys. then i loop through all two key combinations. for each combination, i recursively replace the keys from the original string try all the ways i can replace the keys (some orders of replacing are better than others). the recursive function returns the length of the sortest possible string that can be generated by the two keys.

Author:  bugzpodder [ Mon Aug 15, 2005 10:45 am ]
Post subject: 

zylum, see the link posted by prof cormack (gvc) in the contest forum

Author:  MysticVegeta [ Tue Aug 16, 2005 1:24 pm ]
Post subject: 

I dont get the question... please use simpler language??

Author:  [Gandalf] [ Tue Aug 16, 2005 2:58 pm ]
Post subject: 

It's actually a pretty straightforward problem, just look at the output for the examples. The result is the length of the compressed string. So instead of having BARXFOO you would have *2X*1 because *2 is BAR and *1 is FOO. "*2X*1" is 5 characters so that is what you would get as a result (as opposed to 7 characters without the compression).

Author:  wtd [ Tue Aug 16, 2005 3:37 pm ]
Post subject: 

Of course, in reality, you have to add 6 characters to communicate what the keys are. Smile

Author:  [Gandalf] [ Tue Aug 16, 2005 3:44 pm ]
Post subject: 

Yes, so it's only really a compression technique if there are more than one of the 'keys' in the string. You could also add detection for that, so that you don't make some file bigger instead of smaller Smile.

Author:  bugzpodder [ Tue Aug 16, 2005 6:17 pm ]
Post subject: 

[Gandalf] wrote:
It's actually a pretty straightforward problem, just look at the output for the examples. The result is the length of the compressed string. So instead of having BARXFOO you would have *2X*1 because *2 is BAR and *1 is FOO. "*2X*1" is 5 characters so that is what you would get as a result (as opposed to 7 characters without the compression).


the more interesting case is when keys overlap.
such as for keys AAA and AAB

Author:  wtd [ Tue Aug 16, 2005 7:46 pm ]
Post subject: 

bugzpodder wrote:
[Gandalf] wrote:
It's actually a pretty straightforward problem, just look at the output for the examples. The result is the length of the compressed string. So instead of having BARXFOO you would have *2X*1 because *2 is BAR and *1 is FOO. "*2X*1" is 5 characters so that is what you would get as a result (as opposed to 7 characters without the compression).


the more interesting case is when keys overlap.
such as for keys AAA and AAB


Yes, since "compressing" based on the first key inherently changes the result of "compressing" using the second key.

Author:  [Gandalf] [ Tue Aug 16, 2005 10:10 pm ]
Post subject: 

Hmm, ya I see. What I would do though is just replace the first key, and then you have nothing to worry about since the overlap doesn't exist anymore Very Happy. Yes, me and my lazy ways...


: