
-----------------------------------
Icharus
Tue Jul 28, 2009 2:27 pm

Word-count
-----------------------------------
I read a word-count program in this forum already, but I have a slightly different request.  Also, I'd appreciate it if nobody would answer with code or snippets, I'm just looking for pointers in the right direction :).
I'm wanting to get a word count on a large document.  Then, once the count is done, I want to be able to list the number of ocurrences of each word (something like a histogram).  Once I have that list, I'd like to be able to output the words that are unique (used only once).
At the moment, I'm thinking of using file i/o for inputting and outputting the information, and considering the use of the unix wc tool.  I'd like to know of any better ways, or any ideas on how to do the specific counts.
I'm pretty new, so I appreciate all pointers, but I'd really appreciate if nobody would just "give" me the code.[/i]

-----------------------------------
Clever Error
Tue Jul 28, 2009 3:37 pm

RE:Word-count
-----------------------------------
I would use a dictionary object that would map each word to a count of the occurrences of it. I'm not sure if C has that or if you'll need to make your own.

After that you'll just need to read the file one word at a time and increment your word count and add/increment the occurrence count of the word.

Hope that helps getting you started.

-----------------------------------
Icharus
Tue Jul 28, 2009 5:11 pm

Re: Word-count
-----------------------------------
Epic help indeed!
Thanks for the idea on the dictionary object.  Any other ideas or is that really the ideal way to go?

-----------------------------------
md
Tue Jul 28, 2009 6:24 pm

RE:Word-count
-----------------------------------
If the document is really really large you might want to hash the words and use some sort of dynamic structure to track totals.

-----------------------------------
saltpro15
Tue Jul 28, 2009 7:29 pm

RE:Word-count
-----------------------------------
how large is large? like 2MB? or bigger?

-----------------------------------
DtY
Tue Jul 28, 2009 9:14 pm

Re: RE:Word-count
-----------------------------------
If the document is really really large you might want to hash the words and use some sort of dynamic structure to track totals.
That's what Clever Error suggested, just different wording.
If you're looking for a library, google Hash Table library for C (Or if someone here knows of a good one/a standard one..)

-----------------------------------
md
Wed Jul 29, 2009 10:14 pm

RE:Word-count
-----------------------------------
No, he suggested some ideas without mentioning any real data structures ;-) But yes it does amount to the same thing :P

For C I do not believe there is a standard library hash function. And usually (from what I remember of Data Structures) you get the best results when you tailor your hash function to your data; so googling for one that is similar to your needs is probably the best place to start.

-----------------------------------
rizzix
Wed Jul 29, 2009 10:38 pm

RE:Word-count
-----------------------------------
A dictionary is a hashtable.

-----------------------------------
DemonWasp
Wed Jul 29, 2009 11:42 pm

RE:Word-count
-----------------------------------
On "large": 2MB is probably nothing now, though it has been in the past. You could do a simple word-count program with a list if you wanted, it would just be O ( N ) to count each word, where N is the number of words you already have. If you have M words in the file, of which there are N distinct words, it would take this simple implementation O ( M * N ) time.

It would be much smarter to do this with a dictionary (also called a map or table), which can offer approximately O ( 1 ) access, assuming it's well-behaved. This would decrease the runtime of the program to O ( M ).

What may end up being even faster (or, more likely, smaller) is a [url=http://en.wikipedia.org/wiki/Trie]trie. This would reduce the amount of space required to store strings substantially, but would increase the access time for each string to O ( log ( N ) ) - and a total runtime of O ( M * log ( N ) ). Plus, they're a pretty cool data structure.

-----------------------------------
OneOffDriveByPoster
Thu Jul 30, 2009 5:37 pm

Re: RE:Word-count
-----------------------------------
A dictionary is a hashtable.A dictionary is a "map", which may be a hashtable.  It could be an associative array for all we know.  I agree with the person suggesting a trie, but I am not sure where the log(N) came from.  The access time for an entry in a trie is O(the length of the key).

-----------------------------------
md
Thu Jul 30, 2009 11:37 pm

RE:Word-count
-----------------------------------
A dictionary is a java interface which does not exist in C and may (or may not) be implemented as a hash table.

tries might work as well, but whether the space/speed trade-off is worth it would be highly dependent on the average size of the file being analyzed.

-----------------------------------
rizzix
Fri Jul 31, 2009 5:39 pm

RE:Word-count
-----------------------------------
Actually "Dictionary" is commonly associated with Hashtables in quite a few programming languages including Perl and Python. However, yes it could be generalised to Maps.

-----------------------------------
btiffin
Sat Aug 01, 2009 2:12 pm

RE:Word-count
-----------------------------------
Old guy aside; a term you can use to describe this type of word counting is concordance

Cheers

-----------------------------------
Icharus
Sun Aug 02, 2009 11:54 am

Re: Word-count
-----------------------------------
Wow.  Gmail waited until today to notify me this thread was still active.  I apologize for the delay in reply.  It looks like I have a lot to learn about.  Thanks to you all for your ideas and suggestions.  As to the size of the file, it's looking like 5.8 megs. I'm interested by DemonWasp's proposal.  I'm afraid I have to admit I'm a beginning coder with a background in history rather than mathematics.  Although I want to go back to school to brush up on some math and physics, your O, M, N statement went right over my head.  I think I gather the basic premise, but what is the O?
I'll be researching both hashtables and Tries.  Cheers again, mates!  Keep any ideas coming, this is immensely helpful!

-----------------------------------
saltpro15
Sun Aug 02, 2009 12:56 pm

RE:Word-count
-----------------------------------
O is order.  In other words, how many operations(steps) must your program use in order to run?

For a simple count, your runtime would be O(M * N)
assuming the file contains 5 000 000 words, and we'll say 200 000 distinct words, your worst case run time is O(1 000 000 000 000)

Using a hashtable, your runtime becomes O(M) and your worst case run time becomes O(5 000 000)

you can see which will run faster :D

I'll leave a better explanation to the more experienced people, but [url=http://en.wikipedia.org/wiki/Time_complexity#Big_O_notation]this should help

-----------------------------------
Analysis Mode
Sun Aug 02, 2009 7:32 pm

Re: Word-count
-----------------------------------
Another method is to use a set (available in the C++ STL).  Run through the document and insert every word into the set.  All elements in a set are distinct and every insertion will take log N time, where N is the number of elements currently in the set (and is also the maximum number of distinct words in the document), the total run time will be O(M log N), where M is the total number of words.

To handle the count part, you could have a set of pairs, where the first member of the pair is the word and the second member the number of occurrences and have the set sort by the first element.

-----------------------------------
Icharus
Tue Aug 04, 2009 1:42 pm

RE:Word-count
-----------------------------------
No worries on being an "old guy"...I dig appropriate words.  Concordance is much shorter than my previous description.
Also, thanks a ton on the clarification of the O M N variables--that's a nifty little formula.
Analysis: I think that's a great way to do it, but I'm limited to using C with GCC and YACC.  Part of the goal is to make this code as elegant, portable, and modular as possible.
I'll keep the C++ in mind though for future applications.  It never hurts to know more!

-----------------------------------
gianni
Tue Aug 04, 2009 6:04 pm

Re: RE:Word-count
-----------------------------------
Part of the goal is to make this code as elegant, portable, and modular as possible.

Perhaps you might consider using a scripting language (e.g. ruby, python, perl, &c...), as they generally excel in these areas?

-----------------------------------
Icharus
Thu Aug 06, 2009 2:04 pm

Re: Word-count
-----------------------------------
Gianni,
If you're not careful, you'll have people saying I'm just difficult  :) .  I am working on learning C, and only C at the moment--I have some experience with scripting languages, but they're not on my list to improve at the moment.  The main reason why I'm learning C is to be as close to the machine code as possible.
I'm trying to really get familiar with the inner workings of the computer, simultaneously, I'm working on developing a habit of elegant and reusable code.  It can be done in C.  The friend who gave me this assignment showed me his library.  It's awe-inspiring.

-----------------------------------
gianni
Thu Aug 06, 2009 2:30 pm

Re: Word-count
-----------------------------------
Gianni,
If you're not careful, you'll have people saying I'm just difficult  :) .  I am working on learning C, and only C at the moment--I have some experience with scripting languages, but they're not on my list to improve at the moment.  The main reason why I'm learning C is to be as close to the machine code as possible.
I'm trying to really get familiar with the inner workings of the computer, simultaneously, I'm working on developing a habit of elegant and reusable code.  It can be done in C.  The friend who gave me this assignment showed me his library.  It's awe-inspiring.

Do it up! Also, I'll have to show you some Ruby code, you might just poop yourself because it's so elegant!

-----------------------------------
wtd
Fri Aug 07, 2009 1:29 am

RE:Word-count
-----------------------------------
And my friend, ruby.h, would like to point out that C and high-level scripting languages are not mutually exclusive.
