Posted: Tue Jul 28, 2009 2:27 pm Post subject: Word-count
I read a word-count program in this forum already, but I have a slightly different request. Also, I'd appreciate it if nobody would answer with code or snippets, I'm just looking for pointers in the right direction .
I'm wanting to get a word count on a large document. Then, once the count is done, I want to be able to list the number of ocurrences of each word (something like a histogram). Once I have that list, I'd like to be able to output the words that are unique (used only once).
At the moment, I'm thinking of using file i/o for inputting and outputting the information, and considering the use of the unix wc tool. I'd like to know of any better ways, or any ideas on how to do the specific counts.
I'm pretty new, so I appreciate all pointers, but I'd really appreciate if nobody would just "give" me the code.[/i]
Sponsor Sponsor
Clever Error
Posted: Tue Jul 28, 2009 3:37 pm Post subject: RE:Word-count
I would use a dictionary object that would map each word to a count of the occurrences of it. I'm not sure if C has that or if you'll need to make your own.
After that you'll just need to read the file one word at a time and increment your word count and add/increment the occurrence count of the word.
If the document is really really large you might want to hash the words and use some sort of dynamic structure to track totals.
That's what Clever Error suggested, just different wording.
If you're looking for a library, google Hash Table library for C (Or if someone here knows of a good one/a standard one..)
md
Posted: Wed Jul 29, 2009 10:14 pm Post subject: RE:Word-count
No, he suggested some ideas without mentioning any real data structures But yes it does amount to the same thing
For C I do not believe there is a standard library hash function. And usually (from what I remember of Data Structures) you get the best results when you tailor your hash function to your data; so googling for one that is similar to your needs is probably the best place to start.
rizzix
Posted: Wed Jul 29, 2009 10:38 pm Post subject: RE:Word-count
A dictionary is a hashtable.
Sponsor Sponsor
DemonWasp
Posted: Wed Jul 29, 2009 11:42 pm Post subject: RE:Word-count
On "large": 2MB is probably nothing now, though it has been in the past. You could do a simple word-count program with a list if you wanted, it would just be O ( N ) to count each word, where N is the number of words you already have. If you have M words in the file, of which there are N distinct words, it would take this simple implementation O ( M * N ) time.
It would be much smarter to do this with a dictionary (also called a map or table), which can offer approximately O ( 1 ) access, assuming it's well-behaved. This would decrease the runtime of the program to O ( M ).
What may end up being even faster (or, more likely, smaller) is a trie. This would reduce the amount of space required to store strings substantially, but would increase the access time for each string to O ( log ( N ) ) - and a total runtime of O ( M * log ( N ) ). Plus, they're a pretty cool data structure.
A dictionary is a "map", which may be a hashtable. It could be an associative array for all we know. I agree with the person suggesting a trie, but I am not sure where the log(N) came from. The access time for an entry in a trie is O(the length of the key).
md
Posted: Thu Jul 30, 2009 11:37 pm Post subject: RE:Word-count
A dictionary is a java interface which does not exist in C and may (or may not) be implemented as a hash table.
tries might work as well, but whether the space/speed trade-off is worth it would be highly dependent on the average size of the file being analyzed.
rizzix
Posted: Fri Jul 31, 2009 5:39 pm Post subject: RE:Word-count
Actually "Dictionary" is commonly associated with Hashtables in quite a few programming languages including Perl and Python. However, yes it could be generalised to Maps.
btiffin
Posted: Sat Aug 01, 2009 2:12 pm Post subject: RE:Word-count
Old guy aside; a term you can use to describe this type of word counting is concordance
Cheers
Icharus
Posted: Sun Aug 02, 2009 11:54 am Post subject: Re: Word-count
Wow. Gmail waited until today to notify me this thread was still active. I apologize for the delay in reply. It looks like I have a lot to learn about. Thanks to you all for your ideas and suggestions. As to the size of the file, it's looking like 5.8 megs. I'm interested by DemonWasp's proposal. I'm afraid I have to admit I'm a beginning coder with a background in history rather than mathematics. Although I want to go back to school to brush up on some math and physics, your O, M, N statement went right over my head. I think I gather the basic premise, but what is the O?
I'll be researching both hashtables and Tries. Cheers again, mates! Keep any ideas coming, this is immensely helpful!
saltpro15
Posted: Sun Aug 02, 2009 12:56 pm Post subject: RE:Word-count
O is order. In other words, how many operations(steps) must your program use in order to run?
For a simple count, your runtime would be O(M * N)
assuming the file contains 5 000 000 words, and we'll say 200 000 distinct words, your worst case run time is O(1 000 000 000 000)
Using a hashtable, your runtime becomes O(M) and your worst case run time becomes O(5 000 000)
you can see which will run faster
I'll leave a better explanation to the more experienced people, but this should help