Programming C, C++, Java, PHP, Ruby, Turing, VB
Computer Science Canada 
Programming C, C++, Java, PHP, Ruby, Turing, VB  

Username:   Password: 
 RegisterRegister   
 Word-count
Index -> Programming, C -> C Help
Goto page 1, 2  Next
View previous topic Printable versionDownload TopicSubscribe to this topicPrivate MessagesRefresh page View next topic
Author Message
Icharus




PostPosted: Tue Jul 28, 2009 2:27 pm   Post subject: Word-count

I read a word-count program in this forum already, but I have a slightly different request. Also, I'd appreciate it if nobody would answer with code or snippets, I'm just looking for pointers in the right direction Smile.
I'm wanting to get a word count on a large document. Then, once the count is done, I want to be able to list the number of ocurrences of each word (something like a histogram). Once I have that list, I'd like to be able to output the words that are unique (used only once).
At the moment, I'm thinking of using file i/o for inputting and outputting the information, and considering the use of the unix wc tool. I'd like to know of any better ways, or any ideas on how to do the specific counts.
I'm pretty new, so I appreciate all pointers, but I'd really appreciate if nobody would just "give" me the code.[/i]
Sponsor
Sponsor
Sponsor
sponsor
Clever Error




PostPosted: Tue Jul 28, 2009 3:37 pm   Post subject: RE:Word-count

I would use a dictionary object that would map each word to a count of the occurrences of it. I'm not sure if C has that or if you'll need to make your own.

After that you'll just need to read the file one word at a time and increment your word count and add/increment the occurrence count of the word.

Hope that helps getting you started.
Icharus




PostPosted: Tue Jul 28, 2009 5:11 pm   Post subject: Re: Word-count

Epic help indeed!
Thanks for the idea on the dictionary object. Any other ideas or is that really the ideal way to go?
md




PostPosted: Tue Jul 28, 2009 6:24 pm   Post subject: RE:Word-count

If the document is really really large you might want to hash the words and use some sort of dynamic structure to track totals.
saltpro15




PostPosted: Tue Jul 28, 2009 7:29 pm   Post subject: RE:Word-count

how large is large? like 2MB? or bigger?
DtY




PostPosted: Tue Jul 28, 2009 9:14 pm   Post subject: Re: RE:Word-count

md @ Tue Jul 28, 2009 6:24 pm wrote:
If the document is really really large you might want to hash the words and use some sort of dynamic structure to track totals.

That's what Clever Error suggested, just different wording.
If you're looking for a library, google Hash Table library for C (Or if someone here knows of a good one/a standard one..)
md




PostPosted: Wed Jul 29, 2009 10:14 pm   Post subject: RE:Word-count

No, he suggested some ideas without mentioning any real data structures Wink But yes it does amount to the same thing Razz

For C I do not believe there is a standard library hash function. And usually (from what I remember of Data Structures) you get the best results when you tailor your hash function to your data; so googling for one that is similar to your needs is probably the best place to start.
rizzix




PostPosted: Wed Jul 29, 2009 10:38 pm   Post subject: RE:Word-count

A dictionary is a hashtable.
Sponsor
Sponsor
Sponsor
sponsor
DemonWasp




PostPosted: Wed Jul 29, 2009 11:42 pm   Post subject: RE:Word-count

On "large": 2MB is probably nothing now, though it has been in the past. You could do a simple word-count program with a list if you wanted, it would just be O ( N ) to count each word, where N is the number of words you already have. If you have M words in the file, of which there are N distinct words, it would take this simple implementation O ( M * N ) time.

It would be much smarter to do this with a dictionary (also called a map or table), which can offer approximately O ( 1 ) access, assuming it's well-behaved. This would decrease the runtime of the program to O ( M ).

What may end up being even faster (or, more likely, smaller) is a trie. This would reduce the amount of space required to store strings substantially, but would increase the access time for each string to O ( log ( N ) ) - and a total runtime of O ( M * log ( N ) ). Plus, they're a pretty cool data structure.
OneOffDriveByPoster




PostPosted: Thu Jul 30, 2009 5:37 pm   Post subject: Re: RE:Word-count

rizzix @ Wed Jul 29, 2009 10:38 pm wrote:
A dictionary is a hashtable.
A dictionary is a "map", which may be a hashtable. It could be an associative array for all we know. I agree with the person suggesting a trie, but I am not sure where the log(N) came from. The access time for an entry in a trie is O(the length of the key).
md




PostPosted: Thu Jul 30, 2009 11:37 pm   Post subject: RE:Word-count

A dictionary is a java interface which does not exist in C and may (or may not) be implemented as a hash table.

tries might work as well, but whether the space/speed trade-off is worth it would be highly dependent on the average size of the file being analyzed.
rizzix




PostPosted: Fri Jul 31, 2009 5:39 pm   Post subject: RE:Word-count

Actually "Dictionary" is commonly associated with Hashtables in quite a few programming languages including Perl and Python. However, yes it could be generalised to Maps.
btiffin




PostPosted: Sat Aug 01, 2009 2:12 pm   Post subject: RE:Word-count

Old guy aside; a term you can use to describe this type of word counting is concordance

Cheers
Icharus




PostPosted: Sun Aug 02, 2009 11:54 am   Post subject: Re: Word-count

Wow. Gmail waited until today to notify me this thread was still active. I apologize for the delay in reply. It looks like I have a lot to learn about. Thanks to you all for your ideas and suggestions. As to the size of the file, it's looking like 5.8 megs. I'm interested by DemonWasp's proposal. I'm afraid I have to admit I'm a beginning coder with a background in history rather than mathematics. Although I want to go back to school to brush up on some math and physics, your O, M, N statement went right over my head. I think I gather the basic premise, but what is the O?
I'll be researching both hashtables and Tries. Cheers again, mates! Keep any ideas coming, this is immensely helpful!
saltpro15




PostPosted: Sun Aug 02, 2009 12:56 pm   Post subject: RE:Word-count

O is order. In other words, how many operations(steps) must your program use in order to run?

For a simple count, your runtime would be O(M * N)
assuming the file contains 5 000 000 words, and we'll say 200 000 distinct words, your worst case run time is O(1 000 000 000 000)

Using a hashtable, your runtime becomes O(M) and your worst case run time becomes O(5 000 000)

you can see which will run faster Very Happy

I'll leave a better explanation to the more experienced people, but this should help
Display posts from previous:   
   Index -> Programming, C -> C Help
View previous topic Tell A FriendPrintable versionDownload TopicSubscribe to this topicPrivate MessagesRefresh page View next topic

Page 1 of 2  [ 21 Posts ]
Goto page 1, 2  Next
Jump to:   


Style:  
Search: