Computer Science Canada Word-count |
Author: | Icharus [ Tue Jul 28, 2009 2:27 pm ] |
Post subject: | Word-count |
I read a word-count program in this forum already, but I have a slightly different request. Also, I'd appreciate it if nobody would answer with code or snippets, I'm just looking for pointers in the right direction ![]() I'm wanting to get a word count on a large document. Then, once the count is done, I want to be able to list the number of ocurrences of each word (something like a histogram). Once I have that list, I'd like to be able to output the words that are unique (used only once). At the moment, I'm thinking of using file i/o for inputting and outputting the information, and considering the use of the unix wc tool. I'd like to know of any better ways, or any ideas on how to do the specific counts. I'm pretty new, so I appreciate all pointers, but I'd really appreciate if nobody would just "give" me the code.[/i] |
Author: | Clever Error [ Tue Jul 28, 2009 3:37 pm ] |
Post subject: | RE:Word-count |
I would use a dictionary object that would map each word to a count of the occurrences of it. I'm not sure if C has that or if you'll need to make your own. After that you'll just need to read the file one word at a time and increment your word count and add/increment the occurrence count of the word. Hope that helps getting you started. |
Author: | Icharus [ Tue Jul 28, 2009 5:11 pm ] |
Post subject: | Re: Word-count |
Epic help indeed! Thanks for the idea on the dictionary object. Any other ideas or is that really the ideal way to go? |
Author: | md [ Tue Jul 28, 2009 6:24 pm ] |
Post subject: | RE:Word-count |
If the document is really really large you might want to hash the words and use some sort of dynamic structure to track totals. |
Author: | saltpro15 [ Tue Jul 28, 2009 7:29 pm ] |
Post subject: | RE:Word-count |
how large is large? like 2MB? or bigger? |
Author: | DtY [ Tue Jul 28, 2009 9:14 pm ] |
Post subject: | Re: RE:Word-count |
md @ Tue Jul 28, 2009 6:24 pm wrote: If the document is really really large you might want to hash the words and use some sort of dynamic structure to track totals.
That's what Clever Error suggested, just different wording. If you're looking for a library, google Hash Table library for C (Or if someone here knows of a good one/a standard one..) |
Author: | md [ Wed Jul 29, 2009 10:14 pm ] |
Post subject: | RE:Word-count |
No, he suggested some ideas without mentioning any real data structures ![]() ![]() For C I do not believe there is a standard library hash function. And usually (from what I remember of Data Structures) you get the best results when you tailor your hash function to your data; so googling for one that is similar to your needs is probably the best place to start. |
Author: | rizzix [ Wed Jul 29, 2009 10:38 pm ] |
Post subject: | RE:Word-count |
A dictionary is a hashtable. |
Author: | DemonWasp [ Wed Jul 29, 2009 11:42 pm ] |
Post subject: | RE:Word-count |
On "large": 2MB is probably nothing now, though it has been in the past. You could do a simple word-count program with a list if you wanted, it would just be O ( N ) to count each word, where N is the number of words you already have. If you have M words in the file, of which there are N distinct words, it would take this simple implementation O ( M * N ) time. It would be much smarter to do this with a dictionary (also called a map or table), which can offer approximately O ( 1 ) access, assuming it's well-behaved. This would decrease the runtime of the program to O ( M ). What may end up being even faster (or, more likely, smaller) is a trie. This would reduce the amount of space required to store strings substantially, but would increase the access time for each string to O ( log ( N ) ) - and a total runtime of O ( M * log ( N ) ). Plus, they're a pretty cool data structure. |
Author: | OneOffDriveByPoster [ Thu Jul 30, 2009 5:37 pm ] |
Post subject: | Re: RE:Word-count |
rizzix @ Wed Jul 29, 2009 10:38 pm wrote: A dictionary is a hashtable. A dictionary is a "map", which may be a hashtable. It could be an associative array for all we know. I agree with the person suggesting a trie, but I am not sure where the log(N) came from. The access time for an entry in a trie is O(the length of the key). |
Author: | md [ Thu Jul 30, 2009 11:37 pm ] |
Post subject: | RE:Word-count |
A dictionary is a java interface which does not exist in C and may (or may not) be implemented as a hash table. tries might work as well, but whether the space/speed trade-off is worth it would be highly dependent on the average size of the file being analyzed. |
Author: | rizzix [ Fri Jul 31, 2009 5:39 pm ] |
Post subject: | RE:Word-count |
Actually "Dictionary" is commonly associated with Hashtables in quite a few programming languages including Perl and Python. However, yes it could be generalised to Maps. |
Author: | btiffin [ Sat Aug 01, 2009 2:12 pm ] |
Post subject: | RE:Word-count |
Old guy aside; a term you can use to describe this type of word counting is concordance Cheers |
Author: | Icharus [ Sun Aug 02, 2009 11:54 am ] |
Post subject: | Re: Word-count |
Wow. Gmail waited until today to notify me this thread was still active. I apologize for the delay in reply. It looks like I have a lot to learn about. Thanks to you all for your ideas and suggestions. As to the size of the file, it's looking like 5.8 megs. I'm interested by DemonWasp's proposal. I'm afraid I have to admit I'm a beginning coder with a background in history rather than mathematics. Although I want to go back to school to brush up on some math and physics, your O, M, N statement went right over my head. I think I gather the basic premise, but what is the O? I'll be researching both hashtables and Tries. Cheers again, mates! Keep any ideas coming, this is immensely helpful! |
Author: | saltpro15 [ Sun Aug 02, 2009 12:56 pm ] |
Post subject: | RE:Word-count |
O is order. In other words, how many operations(steps) must your program use in order to run? For a simple count, your runtime would be O(M * N) assuming the file contains 5 000 000 words, and we'll say 200 000 distinct words, your worst case run time is O(1 000 000 000 000) Using a hashtable, your runtime becomes O(M) and your worst case run time becomes O(5 000 000) you can see which will run faster ![]() I'll leave a better explanation to the more experienced people, but this should help |
Author: | Analysis Mode [ Sun Aug 02, 2009 7:32 pm ] |
Post subject: | Re: Word-count |
Another method is to use a set (available in the C++ STL). Run through the document and insert every word into the set. All elements in a set are distinct and every insertion will take log N time, where N is the number of elements currently in the set (and is also the maximum number of distinct words in the document), the total run time will be O(M log N), where M is the total number of words. To handle the count part, you could have a set of pairs, where the first member of the pair is the word and the second member the number of occurrences and have the set sort by the first element. |
Author: | Icharus [ Tue Aug 04, 2009 1:42 pm ] |
Post subject: | RE:Word-count |
No worries on being an "old guy"...I dig appropriate words. Concordance is much shorter than my previous description. Also, thanks a ton on the clarification of the O M N variables--that's a nifty little formula. Analysis: I think that's a great way to do it, but I'm limited to using C with GCC and YACC. Part of the goal is to make this code as elegant, portable, and modular as possible. I'll keep the C++ in mind though for future applications. It never hurts to know more! |
Author: | gianni [ Tue Aug 04, 2009 6:04 pm ] |
Post subject: | Re: RE:Word-count |
Icharus @ Tue Aug 04, 2009 1:42 pm wrote: Part of the goal is to make this code as elegant, portable, and modular as possible.
Perhaps you might consider using a scripting language (e.g. ruby, python, perl, &c...), as they generally excel in these areas? |
Author: | Icharus [ Thu Aug 06, 2009 2:04 pm ] |
Post subject: | Re: Word-count |
Gianni, If you're not careful, you'll have people saying I'm just difficult ![]() I'm trying to really get familiar with the inner workings of the computer, simultaneously, I'm working on developing a habit of elegant and reusable code. It can be done in C. The friend who gave me this assignment showed me his library. It's awe-inspiring. |
Author: | gianni [ Thu Aug 06, 2009 2:30 pm ] |
Post subject: | Re: Word-count |
Icharus @ Thu Aug 06, 2009 2:04 pm wrote: Gianni,
If you're not careful, you'll have people saying I'm just difficult ![]() I'm trying to really get familiar with the inner workings of the computer, simultaneously, I'm working on developing a habit of elegant and reusable code. It can be done in C. The friend who gave me this assignment showed me his library. It's awe-inspiring. Do it up! Also, I'll have to show you some Ruby code, you might just poop yourself because it's so elegant! |
Author: | wtd [ Fri Aug 07, 2009 1:29 am ] |
Post subject: | RE:Word-count |
And my friend, ruby.h, would like to point out that C and high-level scripting languages are not mutually exclusive. |