Development of an auto-summarization tool

 

Title of the project

 

Development of an auto-summarization tool

 

Abstract of the project

 

Auto-summarization is a technique for generating summaries of electronic documents. Applications include summarizing search-engine results and providing briefs of large documents that have no abstract. There are two categories of summarizers: linguistic and statistical. Linguistic summarizers use knowledge about the language (syntax, semantics, usage and so on) to summarize a document. Statistical summarizers find the important sentences using statistical measures, such as the frequency of a particular word, and normally do not use any linguistic information.

 

In this project, an auto-summarization tool is developed using statistical techniques. The techniques involve finding the frequency of words, scoring the sentences and ranking the sentences. The summary is obtained by selecting a number of sentences, specified by the user when invoking the tool, from the top of the ranked list. The tool operates on a single document and provides a summary of it (it can be made to work on multiple documents by choosing proper algorithms for integration). Pre-processing interfaces handle the following document types: plain text, HTML and Word documents.

 

Keywords

 

Generic Technology keywords

 

Algorithm, Programming

 

Specific Technology keywords

 

C, C++, Java, C-Sharp

 

Project type keywords

 

Statistics, User Interface

 

Functional components of the project

 

Following is a list of the functional components of the tool.

 

  1. Text pre-processor. This converts HTML and Word documents to plain text for processing by the rest of the system.
  2. Sentence separator. This goes through the document and separates the sentences based on rules (for example, a dot followed by a space marks the end of a sentence). Other appropriate criteria may be added to separate the sentences.
  3. Word separator. This separates the words based on criteria such as a space marking the end of a word.
  4. Stop-words eliminator. This removes common English words such as ‘a’, ‘an’, ‘the’, ‘of’ and ‘from’ before further processing. These words are known as ‘stop-words’. Lists of applicable stop-words for English are available on the Internet.
  5. Word-frequency calculator. This calculates the number of times each word appears in the document (stop-words have already been eliminated and do not figure in this calculation) and also the number of sentences in which the word appears. For example, the word ‘Unix’ may appear 100 times in a document but in only 80 sentences, because some sentences contain more than one occurrence of the word. Minimum and maximum thresholds can be set for the frequencies, with the values determined by trial and error.
  6. Scoring algorithm. This determines the score of each sentence. Several possibilities exist. The score can be made proportional to the sum of the frequencies of the words in the sentence (i.e., if a sentence has 3 words A, B and C, then the score is proportional to the sum of how many times A, B and C occur in the document). The score can also be made inversely proportional to the number of sentences in which the words of the sentence appear. Many such heuristic rules can be applied to score the sentences.
  7. Ranking. The sentences are ranked according to their scores. Other criteria, such as the position of a sentence in the document, can be used to control the ranking. For example, two consecutive sentences would not both be placed in the summary even if both have high scores.
  8. Summarizing. Based on the user input on the size of the summary, sentences are picked from the ranked list and concatenated. The resulting summary file could be stored with a name like <originalfilename>_summary.txt.
  9. User Interface. The tool could use a GUI or a plain command-line interface. In either case, it should offer easy and intuitive ways of getting the input from the user (the document, the size of the summary needed and so on).
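
The components from the sentence separator through the summarizing step can be sketched in a few lines. The following is only an illustrative sketch, written in Python for brevity rather than in one of the languages listed under the keywords. All names (STOP_WORDS, split_sentences, split_words, word_statistics, score_sentence, summarize) are hypothetical, the stop-word list is a tiny placeholder, and the scoring rule shown is just the simplest of the possibilities mentioned above (sum of word frequencies). Returning the chosen sentences in document order is an assumption of this sketch, not a requirement of the project.

```python
import re
from collections import Counter

# Tiny placeholder list (assumption); a real stop-word list is much longer.
STOP_WORDS = {"a", "an", "the", "of", "from", "is", "in", "and", "to", "it", "on"}

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (a deliberately simple rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Lower-case the sentence and return its words with stop-words removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

def word_statistics(sentences):
    """Return (occurrences of each word, number of sentences containing each word)."""
    term_freq, sentence_freq = Counter(), Counter()
    for sentence in sentences:
        words = split_words(sentence)
        term_freq.update(words)            # counts every occurrence
        sentence_freq.update(set(words))   # counts each sentence once per word
    return term_freq, sentence_freq

def score_sentence(sentence, term_freq):
    """Score = sum of the document-wide frequencies of the distinct words in the sentence."""
    return sum(term_freq[w] for w in set(split_words(sentence)))

def summarize(text, num_sentences):
    """Pick the num_sentences highest-scoring sentences and join them."""
    sentences = split_sentences(text)
    term_freq, _ = word_statistics(sentences)
    # Rank sentence indices by descending score; the stable sort keeps
    # document order among sentences with equal scores.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -score_sentence(sentences[i], term_freq))
    chosen = sorted(ranked[:num_sentences])  # assumption: restore document order
    return " ".join(sentences[i] for i in chosen)
```

The sentence counts returned by word_statistics are not used by this scoring rule; they are what an inverse-sentence-frequency variant or the min-max thresholds would build on.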

 

 

Steps to start off the project

 

The following steps will be helpful to start off the project.

  1. Study auto-summarizing techniques (some references are given in the references section of this document), concentrating on summarizers based on statistical techniques.
  2. Collect a list of stop-words from an Internet site.
  3. Come up with algorithms for the different functional components listed in the previous section. Heuristic methods could be used to modify any existing algorithm.
  4. Implement the pre-processor, sentence separator, word separator and word-frequency calculator. These do not require much work on the algorithm side, and existing algorithms will do fine.
  5. Implement the scoring and ranking component.
  6. Test the tool with some documents and tune the algorithms if needed.
  7. Bench-mark your tool against tools available on the Internet (such as the one at www.copernic.com).
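
As a hedged illustration of the pre-processor step, where existing algorithms are said to suffice, an HTML document can be reduced to plain text with a stock HTML parser. The sketch below uses Python's standard html.parser module purely as an example; the class and function names (TextExtractor, html_to_text) are hypothetical, and skipping the contents of script and style elements is an assumption about what the rest of the tool should not see. Word documents would need a separate converter, which is not shown.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse the whitespace left behind by the markup.
        return " ".join(" ".join(self._chunks).split())

def html_to_text(html):
    """Return the visible text of an HTML string as a single line of plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

Because the text pieces are joined with spaces, a word that is split by inline markup would come out as two words; a more careful pre-processor would treat inline and block elements differently.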

 

 

 

 

Requirements

 

Hardware requirements

 

| Number | Description | Alternatives (if available) |
|---|---|---|
| 1 | PC with 2 GB hard-disk and 256 MB RAM | Not applicable |

Software requirements

 

| Number | Description | Alternatives (if available) |
|---|---|---|
| 1 | Windows 95/98/XP with MS Office | Not applicable |

Manpower requirements

 

Two to three students can complete this in 4 to 6 months if they work on it full-time.

 

Milestones and Timelines

 

| Number | Milestone Name | Milestone Description | Timeline (week no. from the start of the project) | Remarks |
|---|---|---|---|---|
| 1 | Subject Familiarization | Reading the literature on auto-summarization techniques in general and statistical techniques in particular. | 3-4 | An attempt should be made to understand the subject in a broad sense, and the statistical techniques of auto-summarization in some depth. |
| 2 | Pre-processors implementation | Implement the pre-processors (Word to text and HTML to text), sentence separators, word separators and word-frequency calculators. | 5-6 | These are implemented first so that some metrics for further algorithms can be calculated. |
| 3 | Scoring and Ranking algorithm | Come up with an algorithm for scoring and ranking the sentences. Any existing algorithm can be taken as a base and modified, or a new algorithm can be devised. | 10-11 | The algorithms should be tested and tuned based on the test results, i.e., new criteria and new controls can be incorporated into the algorithms based on the current and expected behaviour of the system. |
| 4 | Integrating the system | After the algorithms are finalized, the system is integrated so that it can be tested using a GUI or a command-line interface. | 12-13 | Care should be taken to make the user interface simple and easily understandable. |
| 5 | Integration testing and re-tuning | The tool should be tested with documents of different sizes and content. | 14-15 | Some re-tuning may also be performed at this stage. |
| 7 | Bench-marking | The tool should be compared with a similar tool available on the Internet. Summaries of different sizes for a particular document should be produced by both tools and then compared. | 16-17 | Care should be taken to select tools that use statistical techniques only, so that the comparison is meaningful. |
| 8 | Final Review | The tool is ready for the final review. | 18-19 | During the final review of the project, several new documents can be used to evaluate the system. |
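
The bench-marking milestone does not say how the two summaries should be compared. One simple possibility, offered here only as an assumption, is to measure how many sentences the two summaries share. The Python sketch below computes a Jaccard overlap over normalised sentences; the function names (sentence_set, sentence_overlap) are hypothetical, and the measure says nothing about which summary is better, only how similar the two selections are.

```python
import re

def sentence_set(summary):
    """Normalise a summary into a set of lower-cased, whitespace-collapsed sentences."""
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return {" ".join(p.lower().split()) for p in parts if p}

def sentence_overlap(summary_a, summary_b):
    """Jaccard overlap: shared sentences divided by all distinct sentences."""
    a, b = sentence_set(summary_a), sentence_set(summary_b)
    if not a and not b:
        return 1.0  # two empty summaries are treated as identical
    return len(a & b) / len(a | b)
```

Running this for each summary size on the same document gives one number per size, which makes it easier to see where the two tools start to diverge.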

 

 

Guidelines and References

 

  • Summarization resources website maintained by Stephan Wan:
    http://www.ics.mq.edu.au/~swan/summarization/
    http://www.ics.mq.edu.au/~swan/readingroom/summarisation/index.htm
  • Summarization projects at Columbia University:
    http://www1.cs.columbia.edu/~hjing/sumDemo/
  • Online text summarization tool:
    http://complingone.georgetown.edu/~linguist/summarizer.html
  • Commercially available auto-summarization product:
    http://www.copernic.com/en/products/summarizer/index.html

 


