Welcome to my Word Counter CLI application! This app does one simple thing: it finds the "n" most frequently used words in a folder.
To try this app out, you need to perform the following steps:
- Check out the repo.
- Install the dependencies:

      npm i

- Build the app:

      npm run build

- Run the built version:

      node dist/index.js <folder> <wordCount>

If any issues pop up during the steps above, you may have a version incompatibility. Here are the Node.js and npm versions I used:
- Node.js: 20.10.0
- npm: 10.2.5
An easy test case is to run the following command:

    node dist/index.js ./src/services/__tests__/data/1 5

This should produce the following result:

    Done! Here are the top 5 most frequently used words in ./src/services/__tests__/data/1:
    Top 5 words in ./src/services/__tests__/data/1:
    1. man: 6
    2. truth: 6
    3. acknowledged: 3
    4. considered: 3
    5. daughters: 3

As you can see, paths relative to the directory you run the command from work just fine. Happy testing!
This application uses Node.js with TypeScript and the Commander.js framework for building the CLI app. I've also sprinkled in a bit of Figlet (just for the title).
I embraced a test-driven development approach (not strictly by the book; sometimes I wrote a simple chunk of the function before the test). For testing, I chose Vitest, my favorite test framework: its API is almost identical to Jest's, but it is faster, more modern, and compatible with other cool stuff!
I tried to implement the requirements in a way that scales, especially in terms of memory usage. That's why, instead of reading an entire file at once, I opted for reading it line by line and merging (summing) the partial results after every line.
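
To illustrate the line-by-line idea, here is a minimal sketch built on Node's `readline` module. It is not the code from this repo: the names `countWordsInStream` and `countWordsInFile`, the use of a `Map`, and the tokenization rule (lowercased ASCII letters with inner apostrophes) are all assumptions made for the example.

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";
import { Readable } from "node:stream";

type FrequencyMap = Map<string, number>;

// Hypothetical helper: consumes a stream one line at a time, so only the
// current line and the running totals are held in memory.
export async function countWordsInStream(input: Readable): Promise<FrequencyMap> {
  const totals: FrequencyMap = new Map();
  const lines = createInterface({
    input,
    crlfDelay: Infinity, // treat "\r\n" as a single line break
  });

  for await (const line of lines) {
    // Assumed tokenization: lowercase ASCII words, apostrophes allowed inside.
    const words = line.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) ?? [];
    for (const word of words) {
      totals.set(word, (totals.get(word) ?? 0) + 1);
    }
  }
  return totals;
}

// Hypothetical wrapper: streams a file from disk instead of loading it whole.
export function countWordsInFile(filePath: string): Promise<FrequencyMap> {
  return countWordsInStream(createReadStream(filePath, { encoding: "utf8" }));
}
```

Calling `countWordsInFile("some/file.txt")` (a placeholder path) would resolve to a map of word counts for that file, without the whole file ever being in memory at once.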
The implementation goes through the following steps:
- Read the passed directory: load the list of files and call the file parsing function on them (I do this in parallel with `Promise.all()`, as the results will be summed in the end).
- Read the lines of each file: parsing a file means going through it line by line and calling `getWordFrequencyMap` on each line, then sum-merging the partial results into an accumulator object.
- Parse the line with a pure function that takes a string as an argument and performs the actual logic.
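
The steps above could be sketched roughly as follows. Only the name `getWordFrequencyMap` comes from this README; its body, the helpers `mergeInto` and `sumOverFiles`, and the tokenization rule are assumptions for illustration. The sketch also uses a `Map` as the accumulator rather than a plain object, so that words like "constructor" cannot collide with inherited object properties.

```typescript
type FrequencyMap = Map<string, number>;

// Pure function: one line of text in, word counts for that line out.
// The tokenization rule here is an assumption, not the repo's actual logic.
export function getWordFrequencyMap(line: string): FrequencyMap {
  const counts: FrequencyMap = new Map();
  for (const word of line.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical helper: sum-merges `partial` into `accumulator` in place.
export function mergeInto(accumulator: FrequencyMap, partial: FrequencyMap): FrequencyMap {
  for (const [word, count] of partial) {
    accumulator.set(word, (accumulator.get(word) ?? 0) + count);
  }
  return accumulator;
}

// Hypothetical helper: parses all files in parallel with Promise.all(),
// then sums the per-file results into a single map.
export async function sumOverFiles(
  files: string[],
  parseFile: (file: string) => Promise<FrequencyMap>,
): Promise<FrequencyMap> {
  const perFile = await Promise.all(files.map((file) => parseFile(file)));
  return perFile.reduce(mergeInto, new Map<string, number>());
}
```

Because addition is commutative, the order in which the parallel file reads finish does not affect the final totals, which is what makes the `Promise.all()` step safe.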
This approach is efficient because it reads files in parallel, and it keeps memory usage low thanks to the line-by-line parsing. One weak point is that a large file can also consist of a single long line, which would make it less effective, since that whole line has to be held in memory. Chunk-by-chunk reading could solve this; in that case, we have to make sure that words are not split at chunk boundaries.
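
One possible way to handle chunk boundaries is to hold back the trailing, possibly incomplete word of each chunk and prepend it to the next one. The sketch below shows that idea only; it is not part of the current implementation, and `countWordsInChunks`, `countWordsInLargeFile`, the 64 KiB chunk size, and the tokenization rule are all assumptions.

```typescript
import { createReadStream } from "node:fs";

type FrequencyMap = Map<string, number>;

function addWords(text: string, totals: FrequencyMap): void {
  for (const word of text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) ?? []) {
    totals.set(word, (totals.get(word) ?? 0) + 1);
  }
}

// Hypothetical helper: counts words across chunks without splitting a word
// that straddles two chunks.
export async function countWordsInChunks(
  chunks: AsyncIterable<string> | Iterable<string>,
): Promise<FrequencyMap> {
  const totals: FrequencyMap = new Map();
  let carry = ""; // trailing text of the previous chunk that may be an unfinished word

  for await (const chunk of chunks) {
    const text = carry + chunk;
    // Walk back to the last whitespace: everything before it is complete.
    let cut = text.length;
    while (cut > 0 && !/\s/.test(text.charAt(cut - 1))) cut--;
    addWords(text.slice(0, cut), totals);
    carry = text.slice(cut);
  }

  addWords(carry, totals); // flush whatever is left after the last chunk
  return totals;
}

// Hypothetical wrapper: reads a file in fixed-size chunks (64 KiB is an arbitrary choice).
export function countWordsInLargeFile(filePath: string): Promise<FrequencyMap> {
  return countWordsInChunks(
    createReadStream(filePath, { encoding: "utf8", highWaterMark: 64 * 1024 }),
  );
}
```

Note that the carried-over text still grows without bound if the input contains no whitespace at all, so this only bounds memory for text with reasonably sized words.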
Note: the current implementation doesn't go into subfolders recursively. It could be added to `getWordFrequencyInFolder` quite easily, though.
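
As a rough idea of what recursive traversal might look like, here is a small sketch that collects every file under a folder, including subfolders. It is not taken from the repo: `listFilesRecursively`, the `Entry` and `ListDir` types, and the injectable `listDir` parameter (there only to make the function easy to test) are hypothetical. How the result would be wired into `getWordFrequencyInFolder` is left open.

```typescript
import { readdir } from "node:fs/promises";
import { join } from "node:path";

// Minimal shape of a directory entry; fs.Dirent satisfies it.
interface Entry {
  name: string;
  isDirectory(): boolean;
  isFile(): boolean;
}
type ListDir = (dir: string) => Promise<Entry[]>;

// Default directory reader backed by the real file system.
const listDirFromDisk: ListDir = (dir) => readdir(dir, { withFileTypes: true });

// Hypothetical helper: returns the paths of all regular files under `dir`,
// descending into subfolders in parallel.
export async function listFilesRecursively(
  dir: string,
  listDir: ListDir = listDirFromDisk,
): Promise<string[]> {
  const entries = await listDir(dir);
  const nested = await Promise.all(
    entries.map(async (entry): Promise<string[]> => {
      const fullPath = join(dir, entry.name);
      if (entry.isDirectory()) return listFilesRecursively(fullPath, listDir);
      return entry.isFile() ? [fullPath] : []; // skip sockets, symlinks, etc.
    }),
  );
  return ([] as string[]).concat(...nested);
}
```

The resulting flat list of paths could then be handed to the same per-file parsing step that is used for a single folder today.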