In a multi-lingual country like India, which has many languages with their own distinctive scripts and rich literary traditions, it is particularly important to develop computer systems that allow users to interact with them in Indian languages.

Due to the peculiarities of Indian scripts and languages, solutions that work well for languages such as English are not applicable, in their entirety, to Indian languages.

In fact, relatively little research has been reported on Indian-language character recognition. Most existing work concerns Devanagari and Bangla characters, the two most widely used scripts in India. Some studies have been reported on the recognition of other scripts such as Tamil, Telugu, Oriya, Kannada, Panjabi, Gujrathi, etc.

Tree classifiers based on structural and topological features, and neural network classifiers, are mainly used for the recognition of Indian scripts. Further, in the Indian context, many documents contain text in more than one script (for example, English, Hindi and the local language), and hence the recognition and segmentation of different scripts in a multi-lingual document is also an important problem.

At present, several organizations have started working on optical character recognition (OCR) for Indian languages. The components of an OCR system include preprocessing, segmentation, feature extraction and classification.

After the image acquisition stage, which is usually carried out with a digital scanner, one of the first operations performed is preprocessing.

It includes binarization and noise removal. The digitized image is binarized using a histogram-based thresholding approach, in which the threshold value is chosen as the midpoint between the two histogram peaks. Median filtering is then used to remove noise from the binarized image.
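The preprocessing steps described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the implementation used in any particular OCR system: the function names, the `min_separation` parameter used to keep the two peaks apart, and the 3x3 window size are all assumptions made for the example. The image is assumed to be a list of rows of 8-bit grey levels, with dark ink on a light background.

```python
def histogram(image, levels=256):
    """Count how many pixels take each grey level."""
    counts = [0] * levels
    for row in image:
        for value in row:
            counts[value] += 1
    return counts


def two_peak_threshold(image, levels=256, min_separation=32):
    """Midpoint between the two dominant histogram peaks.

    min_separation (an assumed parameter) keeps the second peak from being
    picked right next to the first one.
    """
    counts = histogram(image, levels)
    first = max(range(levels), key=lambda g: counts[g])
    candidates = [g for g in range(levels) if abs(g - first) >= min_separation]
    second = max(candidates, key=lambda g: counts[g])
    return (first + second) // 2


def binarize(image, threshold):
    """Map dark pixels (ink) to 1 and light pixels (background) to 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in image]


def median_filter(binary, size=3):
    """Replace each pixel by the median of its size x size neighbourhood.

    Windows are clipped at the image border.
    """
    height, width = len(binary), len(binary[0])
    r = size // 2
    out = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            window = sorted(
                binary[yy][xx]
                for yy in range(max(0, y - r), min(height, y + r + 1))
                for xx in range(max(0, x - r), min(width, x + r + 1))
            )
            out[y][x] = window[len(window) // 2]
    return out
```

On a clean scan the histogram has one peak for the paper and one for the ink, so the midpoint separates them; the median filter then removes isolated specks that survive thresholding.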

It is evident from the literature that there are two particularly important and at the same time complicated components of the character recognition process.

The first is the segmentation, or separation, of characters, and the second is feature extraction. In document analysis, the word "segmentation" may refer to line, word or character segmentation.

Character segmentation is fundamental to character recognition approaches which rely on isolated characters. It is a critical step because incorrectly segmented characters are not likely to be correctly recognized.

Segmentation in the Malayalam and Kannada OCRs uses the projection profile technique and a zoning algorithm, along with connected component analysis, to segment the characters. The binarized image is split into lines and words using appropriate horizontal and vertical projection profiles.
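As an illustration of the projection profile idea, the sketch below splits a binarized page into text lines using the horizontal profile and then splits each line into words using the vertical profile. It assumes a binary image stored as a list of rows with 1 for ink and 0 for background; the helper names and the `word_gap` parameter (the minimum number of blank columns treated as a word boundary) are hypothetical choices for this example.

```python
def horizontal_profile(binary):
    """Number of ink pixels in each row."""
    return [sum(row) for row in binary]


def vertical_profile(binary):
    """Number of ink pixels in each column."""
    return [sum(column) for column in zip(*binary)]


def nonzero_runs(profile, min_gap=1):
    """Return (start, end) spans, end exclusive, where the profile is non-zero.

    Spans separated by fewer than min_gap empty positions are merged.
    """
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def segment_lines_and_words(binary, word_gap=2):
    """Return a list of ((top, bottom), [(left, right), ...]) per text line."""
    result = []
    for top, bottom in nonzero_runs(horizontal_profile(binary)):
        line = binary[top:bottom]
        words = nonzero_runs(vertical_profile(line), min_gap=word_gap)
        result.append(((top, bottom), words))
    return result
```

Rows with no ink mark the gaps between lines, and sufficiently wide runs of empty columns within a line mark the gaps between words.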

For Malayalam and Kannada characters, the projection profile approach alone will not give the desired output, as individual characters are composed of combinations of left, right, top and bottom modifiers with the base consonants.

Overlapping of modifiers with the base consonant in forming a valid character causes additional difficulty in segmentation. Hence, a two-stage segmentation approach is followed: zone-level features (reference points) are extracted in the first stage, and in the second stage these reference points are used in connected component analysis to segment the characters.
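The text does not specify how the reference points are computed, so the following is only a rough sketch of the second stage under stated assumptions. It labels 8-connected components in a binary word image and then assigns each component to a zone relative to two hypothetical reference rows, `top_ref` and `bottom_ref`, standing in for the zone-level reference points. The zone rule itself is an assumption made for illustration.

```python
from collections import deque


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink regions.

    Boxes are inclusive and listed in raster-scan order of their first pixel.
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def zone_of(box, top_ref, bottom_ref):
    """Classify a component by its position relative to two reference rows.

    top_ref and bottom_ref are hypothetical stand-ins for the reference points.
    """
    top, _, bottom, _ = box
    if bottom < top_ref:
        return "top"
    if top > bottom_ref:
        return "bottom"
    return "middle"
```

Components that fall entirely above or below the middle zone are candidates for top or bottom modifiers, while components in the middle zone are candidates for base consonants and left or right modifiers.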

It is suggested that the key to high performance lies in the ability to select and utilize the distinctive features of characters. Feature extraction can be defined as the process of extracting distinctive information from the matrices of digitized characters. There are two main categories of features: Malayalam OCR uses simple global feature of the character, i.

In Kannada OCR, different structural and topological features are used, such as the presence of a shirorekha; the presence of holes, along with their number, position and size relative to the character size; the number of connected components; the number of zero crossings; etc.
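Two of these features can be sketched as follows for a binary glyph image (1 for ink, 0 for background). The hole count treats a hole as a background region that is not connected to the outside of the glyph. The second function counts ink/background transitions along one row, which is one common reading of "zero crossings" on a binary image; that interpretation, like the function names, is an assumption rather than something the text specifies.

```python
def count_holes(glyph):
    """Count background regions fully enclosed by ink (4-connected background)."""
    width = len(glyph[0])
    # Pad with a one-pixel background border so all outer background is one region.
    padded = ([[0] * (width + 2)]
              + [[0] + list(row) + [0] for row in glyph]
              + [[0] * (width + 2)])
    height_p, width_p = len(padded), len(padded[0])
    seen = [[False] * width_p for _ in range(height_p)]

    def flood(start_y, start_x):
        stack = [(start_y, start_x)]
        seen[start_y][start_x] = True
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < height_p and 0 <= nx < width_p
                        and not padded[ny][nx] and not seen[ny][nx]):
                    seen[ny][nx] = True
                    stack.append((ny, nx))

    flood(0, 0)  # mark the outer background
    holes = 0
    for y in range(height_p):
        for x in range(width_p):
            if not padded[y][x] and not seen[y][x]:
                holes += 1
                flood(y, x)
    return holes


def row_transitions(glyph, y):
    """Number of ink/background changes along row y."""
    row = glyph[y]
    return sum(1 for a, b in zip(row, row[1:]) if a != b)
```

Features like these are cheap to compute and can serve as tests at the nodes of a tree classifier, for example by first splitting glyphs on whether they contain a hole.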

The direction code frequency is used in the second stage.

Techniques for script identification generally require large areas of text to operate on, so that sufficient information is available. This assumption does not hold in the Indian context, as words of two different scripts are interspersed in most documents.


