C++11 Project on Word Indexing

A documented project for beginner and experienced C++ programmers who want to learn C++11 features together with Makefiles, std::map, std::multimap and STL algorithms. The project also uses the Valgrind tool to find memory leaks. It was built with a Makefile on Fedora 12 Linux; to build it with another compiler, add all the source files to your project and compile with C++11 build options. If you need help with compilation, reach out via Admin@cppbuzz.com

Word-Indexing
High Level Design
By https://www.cppbuzz.com, Jan 2015
last modified: 12 Jan, 2015


High Level Design
  • 1 Introduction
  • 2 Architecture
  • 3 Class Diagram
  • 4 Development
  • 5 Test Cases

1. Introduction

1.1 Problem Statement
Create a multi-threaded text file indexing command line application in C++ that works as follows:
1. Accept as input a file path (e.g. /myfiles) on the command line
2. Have one thread that is responsible for searching the file path, including any sub-directories, for text files (ending in .txt)
3. When a text file is found, it should be handed off to a worker thread for processing, and the search thread should continue searching.
4. There should be a fixed number (N) of worker threads (say, N=3) that handle text file processing.
5. When a worker thread receives a text file to process, it opens the file and reads the contents one word at a time. Words are delimited by any character other than A-Z or 0-9.
6. A master table in memory, shared between all threads, keeps track of all unique words encountered and the number of times each was encountered. Each time a word is encountered, its count is incremented (or the word is added to the table if not present). Words should be matched case-insensitively and without any punctuation.
7. Once the file search is complete and all text files finish processing, the program prints out the top 10 words and their counts.
Basically we just want to find the top 10 words across a directory tree of text files.
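
The word-splitting and counting rules in steps 5 and 6 can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `WordCounts` and `count_words` are hypothetical, and the sketch assumes that lower-case letters count as word characters too, since matching is case-insensitive.

```cpp
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical alias standing in for the shared word table.
using WordCounts = std::map<std::string, unsigned long>;

// Reads the stream one character at a time. Letters and digits extend the
// current word (lower-cased so that matching is case-insensitive); any other
// character ends the word and bumps its count.
void count_words(std::istream& in, WordCounts& counts) {
    std::string word;
    char c;
    while (in.get(c)) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalnum(uc)) {
            word.push_back(static_cast<char>(std::tolower(uc)));
        } else if (!word.empty()) {
            ++counts[word];  // operator[] inserts the word with count 0 if absent
            word.clear();
        }
    }
    if (!word.empty()) {
        ++counts[word];  // last word when the input has no trailing delimiter
    }
}
```

With this rule, an input such as "world's" yields two words, "world" and "s", because the apostrophe is a delimiter.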

2. Architecture

2.1 Architecture Diagram

2.2 Modules
There are three modules: SearchThread, SyncQueue and WorkerThread.

2.2.1 SearchThread
This module searches for .txt files in the path specified as the command line argument and sends each file it finds to the SyncQueue module. SearchThread stops working once the search is over.
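
A recursive search of this kind might look like the sketch below. It is an assumption-laden illustration rather than the project's implementation: the function names `has_txt_extension` and `find_text_files` are hypothetical, the walk uses the POSIX `opendir`/`readdir`/`lstat` calls (which suits a Linux build), and it collects paths into a vector instead of handing them to a queue.

```cpp
#include <dirent.h>
#include <sys/stat.h>

#include <string>
#include <vector>

// True when the file name ends in ".txt" (exact, case-sensitive comparison).
bool has_txt_extension(const std::string& name) {
    const std::string ext = ".txt";
    return name.size() >= ext.size() &&
           name.compare(name.size() - ext.size(), ext.size(), ext) == 0;
}

// Walks `dir` and all of its sub-directories, appending every regular
// *.txt file to `out`. Directories that cannot be opened are skipped.
void find_text_files(const std::string& dir, std::vector<std::string>& out) {
    DIR* handle = opendir(dir.c_str());
    if (handle == nullptr) {
        return;
    }
    while (dirent* entry = readdir(handle)) {
        const std::string name = entry->d_name;
        if (name == "." || name == "..") {
            continue;
        }
        const std::string path = dir + "/" + name;
        struct stat info;
        // lstat does not follow symbolic links, which avoids link cycles.
        if (lstat(path.c_str(), &info) != 0) {
            continue;
        }
        if (S_ISDIR(info.st_mode)) {
            find_text_files(path, out);
        } else if (S_ISREG(info.st_mode) && has_txt_extension(name)) {
            out.push_back(path);
        }
    }
    closedir(handle);
}
```

In the threaded design, the `out.push_back(path)` line is where the search thread would instead pass the path to the queue and carry on searching.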

2.2.2 SyncQueue
This module holds the files in a synchronized queue and provides them to the WorkerThread module for processing. SyncQueue gives only one WorkerThread at a time access to its queue.
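
One common way to build such a queue in C++11 is a `std::queue` guarded by a `std::mutex` and a `std::condition_variable`. The sketch below borrows the class name SyncQueue from this document, but its interface (`push`, `close`, `pop`) is an assumption made for illustration and may differ from the project's source.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

class SyncQueue {
public:
    // Called by the search thread for each file it finds.
    void push(std::string path) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            paths_.push(std::move(path));
        }
        ready_.notify_one();  // wake one waiting worker
    }

    // Called once by the search thread when the search is over.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();  // wake every worker so each can finish
    }

    // Blocks until a path is available or the queue is closed and empty.
    // Returns false when no more work will arrive.
    bool pop(std::string& path) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !paths_.empty(); });
        if (paths_.empty()) {
            return false;
        }
        path = std::move(paths_.front());
        paths_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> paths_;
    bool closed_ = false;
};
```

The mutex is what ensures that only one worker touches the underlying queue at a time; the `close` call lets workers tell "empty for now" apart from "no more files".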

2.2.3 WorkerThread
This module has three worker threads, and each thread gets the file to process from the SyncQueue module.
After getting a file, each worker thread reads it and extracts the words to save in a data structure called MTable. MTable contains the unique words with their frequencies.
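
The sketch below illustrates how several worker threads might fill one shared table safely. It is a simplified stand-in, not the project's code: the `MTable` class shown here and the `run_workers` function are hypothetical, each "file" is represented by an already-split list of words, and an atomic index replaces the queue so that the example stays self-contained.

```cpp
#include <atomic>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical shared table: a word-to-count map protected by a mutex.
class MTable {
public:
    void add(const std::string& word) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++counts_[word];
    }

    // Returns a copy so that callers never read the map while it is locked
    // by a writer.
    std::map<std::string, unsigned long> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return counts_;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, unsigned long> counts_;
};

// Starts `workers` threads. Each thread repeatedly claims the next
// unprocessed "file" and adds its words to the shared table.
void run_workers(const std::vector<std::vector<std::string>>& files,
                 MTable& table, unsigned workers = 3) {
    std::atomic<std::size_t> next(0);
    auto work = [&] {
        for (std::size_t i = next.fetch_add(1); i < files.size();
             i = next.fetch_add(1)) {
            for (const std::string& word : files[i]) {
                table.add(word);
            }
        }
    };
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back(work);
    }
    for (std::thread& t : pool) {
        t.join();  // wait until every file has been processed
    }
}
```

Locking once per word is simple but can cause contention; a variation is for each worker to count into a private map and merge it into the shared table once per file.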

3. Class Diagram

This program has been divided into three classes:
1. SearchThread
2. SyncQueue
3. WorkerThread

Class Diagram of Word Indexing Project in C++

4. Development

Development was done on Fedora 12 using C++11.

4.1 Directory Structure:
SearchFiles
 |-- src
 |    |-- SearchThread.cpp
 |    |-- SearchThread.h
 |    |-- WorkerThread.cpp
 |    |-- WorkerThread.h
 |    |-- main.cpp
 |-- wordindex.out
 |-- Makefile

4.2 Output of Program:
Output of Word Indexing Project in C++
4.3 Debugging
GDB is used for debugging.
C++ word indexing project debugging using GDB
4.4 Memory Leaks
The Valgrind tool is used to find memory leaks.
Finding memory leaks in C++ project
4.5 Known Issues
-A multimap is created to sort the contents of the map, which requires more memory; the multimap could be removed.
-The WorkerThread module returns the multimap, and to print it the main function creates one extra multimap to hold the copy returned by the WorkerThread module.
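
One possible way to drop the multimap is to copy the map entries into a vector and order only the first N of them with `std::partial_sort`. The sketch below is a suggestion under stated assumptions, not the project's code: `Entry` and `top_words` are hypothetical names, and the alphabetical tie-break between equally frequent words is an arbitrary choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, unsigned long>;

// Returns up to `n` entries with the highest counts, most frequent first.
std::vector<Entry> top_words(const std::map<std::string, unsigned long>& counts,
                             std::size_t n = 10) {
    std::vector<Entry> entries(counts.begin(), counts.end());
    const std::size_t keep = std::min(n, entries.size());
    // Only the first `keep` positions end up sorted; the rest stay unordered.
    std::partial_sort(entries.begin(), entries.begin() + keep, entries.end(),
                      [](const Entry& a, const Entry& b) -> bool {
                          if (a.second != b.second) {
                              return a.second > b.second;  // higher count first
                          }
                          return a.first < b.first;  // assumed tie-break
                      });
    entries.resize(keep);
    return entries;
}
```

Returning the vector by value lets main print the result directly, without building a second container to hold it.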

4.6 Glossary
-MTable is a data structure that contains words with their frequencies.
-WorkerThread1, WorkerThread2 and WorkerThread3 are the three worker threads that are part of WorkerThread and are responsible for filling MTable with words.
-Queue is the synchronized queue that contains the files.

5. Test Cases

C++ project's test cases output
