Monday, August 5, 2019
Optical Character Recognition (OCR)
INTRODUCTION

1.1. Optical Character Recognition: Optical Character Recognition (OCR) is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner or tablet) into machine-editable text. OCR is a field of research in pattern recognition, artificial intelligence and machine vision. An OCR system lets you take a book or a magazine article, feed it directly into an electronic computer file, and then edit the file using a word processor. All OCR systems include an optical scanner for reading text and sophisticated software for analyzing images. Most OCR systems use a combination of hardware (specialized circuit boards) and software to recognize characters, although some inexpensive systems do it entirely through software. Advanced OCR systems for Roman script can read text in a large variety of fonts, but they still have difficulty with handwritten text.

1.2. History Of Optical Character Recognition: To understand the phenomena described in the above section, we have to look at the history of OCR [3, 4, 6], its development, recognition methods, computer technologies, and the differences between humans and machines [1, 2, 5, 7, 8]. It is always fascinating to find ways of enabling a computer to mimic human functions, such as the ability to read, to write, to see things, and so on. OCR research and development can be traced back to the early 1950s, when scientists tried to capture the images of characters and texts, first by mechanical and optical means of rotating disks and photomultipliers, then by flying spot scanners with a cathode ray tube lens, followed by photocells and arrays of them. At first, the scanning operation was slow, and only one line of characters could be digitized at a time by moving the scanner or the paper medium. Subsequently, drum and flatbed scanners were invented, which extended scanning to the full page.
Then, advances in digital integrated circuits brought photo arrays with higher density, faster transports for documents, and higher speed in scanning and digital conversion. These important improvements greatly accelerated the speed of character recognition, reduced the cost, and opened up the possibility of processing a great variety of forms and documents. Throughout the 1960s and 1970s, new OCR applications sprang up in retail businesses, banks, hospitals, post offices; insurance, railroad, and aircraft companies; newspaper publishers, and many other industries [3, 4]. In parallel with these advances in hardware development, intensive research on character recognition was taking place in the research laboratories of both academic and industrial sectors [6, 7]. Because neither recognition techniques nor computers were very powerful in the early days (1960s), OCR machines tended to make many errors when the print quality was poor, caused either by wide variations in type fonts and roughness of the surface of the paper or by the cotton ribbons of the typewriters [5]. To make OCR work efficiently and economically, there was a big push from OCR manufacturers and suppliers toward the standardization of print fonts, paper, and ink qualities for OCR applications. New fonts such as OCR-A and OCR-B were designed in the 1970s by the American National Standards Institute (ANSI) and the European Computer Manufacturers Association (ECMA), respectively. These special fonts were quickly adopted by the International Standards Organization (ISO) to facilitate the recognition process [3, 4, 6, 7]. As a result, very high recognition rates became achievable at high speed and at reasonable cost. Such accomplishments also brought better printing quality of data and paper for practical applications.
In fact, they completely revolutionized the data input industry [6] and eliminated the jobs of thousands of keypunch operators who were doing the really mundane work of keying data into the computer.

1.3. Common Steps Of OCR Processing: The process of converting documents into electronic form, usually referred to as digitization, is carried out in several steps. The process of scanning a document and representing the scanned image for further processing is called the pre-processing or imaging stage. The process of manipulating the scanned image of a document to produce searchable text is called the OCR processing stage.

1.3.1. The Imaging Stage: The imaging process involves scanning the document and storing it as an image. The most popular image format used for this purpose is the Tagged Image File Format (TIFF). The resolution (number of dots per inch, dpi) affects the accuracy rate of the OCR process.

1.3.2. The OCR Process: The major steps of the OCR processing stage are described below.

1.3.3. Distinguishing Between Text And Images (Segmentation): In this step, the text and image blocks of the scanned image are identified. The boundaries of each image block are analyzed in order to identify the text.

1.3.4. Character Recognition (Feature Extraction): This step involves recognizing a character using a process known as feature extraction. OCR tools store rules about the characters of a given script using a method known as the learning process. A character is then identified by analyzing its shape and comparing its features against a set of rules stored in the OCR engine that distinguishes each character.

1.3.5. Recognition Of Character: After the individual characters have been identified, the recognized string of characters is compared against an existing dictionary of words. Additional processes such as spell-checking are performed in this step.

1.3.6.
Output Formatting: The final step involves storing the output in one of the industry-standard formats such as RTF, PDF, Word and plain Unicode text.

1.4. Pattern Recognition: Pattern recognition (also known as classification or pattern classification) is a field within the area of artificial intelligence and can be defined as the act of taking in raw data and taking an action based on the category of the data. It uses methods from statistics, machine learning and other areas. Typical applications of pattern recognition are: automatic speech recognition; classification of text into several categories (e.g. spam/non-spam email messages); the automatic recognition of handwritten postal codes on postal envelopes; and the automatic recognition of images of human faces. The last two examples form the subtopic image analysis of pattern recognition, which deals with digital images as input to pattern recognition systems. Some popular techniques for pattern recognition include Neural Networks (NN), Hidden Markov Models (HMM) and Bayesian Networks (BN). The application domains of pattern recognition include computer vision, machine vision, medical image analysis, optical character recognition and credit scoring.

1.5. Applications Of The Pattern Recognition: Pattern recognition has many useful applications. Some of them are outlined below. As a telecommunication aid for the deaf, in airline reservation, in postal departments for postal address reading (both handwritten and printed postal codes/addresses), and for medical diagnosis. For use in customer billing, as in telephone exchange billing systems, in order data logging and automatic fingerprint identification, and as an automatic inspection system. In automated cartography, metallurgical industries, computer-assisted forensic linguistic systems, electronic mail, information units and libraries, and for facsimile.
For direct processing of documents: as a multipurpose document reader for large-scale data processing, as a microfilm reader data input system, for high-speed data entry, for converting text/graphics into a computer-readable form, and as an electronic page reader to handle large volumes of mail.

1.6. Scope Of This Work: The project is designed to classify and recognize a scanned image containing Arabic characters using a two-step approach. In the first step the Arabic text image is preprocessed, and in the second step its features are extracted. Throughout this work it is assumed that there is no noise in the image and that the image is perfectly scanned with no deviation from its original angle (no skew).

1.7. Objectives And Applications Of This Work: Arabic Optical Character Recognition can open a new way of realizing the dream of a natural mode of communication between man and machine in this part of the world. It will extend already available knowledge to new horizons. Centuries-old rare scripts in Arabic, Urdu and Persian will become available to the common man. The ultimate goal of character recognition is to simulate human reading capabilities. Character recognition systems can contribute immensely to the advancement of the automation process and can improve the interaction between man and machine in many applications, including office automation, check verification and a large variety of banking, business and data entry applications, library archives, document identification, e-book production, invoice and shipping receipt processing, subscription collection, questionnaire processing, exam paper processing and many other applications [9], besides online address and signboard reading.

1.8. Thesis Organization: The remaining part of this thesis is divided into four chapters. Chapter 2 reviews the literature. Chapter 3 describes Arabic script, its peculiarities and problems.
Chapter 4 covers the development of Arabic character recognition, and Chapter 5 presents conclusions and future directions.

Chapter 2 REVIEW OF LITERATURE

2.1. Optical Character Recognition: Since the beginning of writing as a form of communication, paper has prevailed as the medium for writing. Electronic media are replacing paper over time. Because they save space and are fast to access, electronic media are constantly gaining popularity. The convenience of paper, its widespread use for communication and archiving, and the amount of information already on paper press for quick and accurate methods to automatically read that information and convert it into electronic form [Albadr95]. The potential application areas of automatic reading machines are numerous. One of the earliest, and most successful, applications is sorting checks in banks, as the volume of checks that circulates daily has proven to be too large for manual entry. Other applications are detailed in Section 2.3 [Govindan90, Mantas86]. The machine simulation of human reading (i.e. optical character recognition) has been the subject of widespread research for more than five decades. Character recognition is a pattern recognition application with the ultimate aim of simulating the human reading capabilities of both machine-printed and handwritten cursive text. The currently available systems may read faster than humans, but cannot reliably read such a wide variety of text nor consider context. One can say that a great amount of further effort is required to, at least, narrow the gap between human and machine reading capabilities. The practical importance of OCR applications, as well as the interesting nature of the OCR problem, has led to great research interest and measurable advances in this field. Now, commercial OCR systems for Latin characters are commonly available on personal computers, achieving recognition rates above 99% [McClelland91, Welch93].
Further, systems on the market can now read a variety of writing styles (e.g., hand-written, printed omni-font) and character sets including Chinese, Japanese, Korean, Cyrillic, and Arabic. Since the 1950s, researchers have carried out extensive work and published many papers on character recognition. Nearly all of the published work on OCR has been on Latin, Japanese or Chinese characters. This started in the mid-1940s for Latin and the mid-1960s for Chinese and Japanese. The following are useful surveys and reviews on Latin character recognition. Reference may be made to [Mori92] for a historical review of OCR research and development. The survey of [Govindan90] includes surveys of other languages; [Mantas86] gives an overview of character recognition methodologies, [Impedovo91] covers commercial OCR systems, [Tian91] machine-printed OCR, and [Tappert90, Wakahara92] on-line handwriting recognition. [Suen80] surveys the automatic recognition of hand-printed characters (viz. numerals, alphanumerics, FORTRAN, and Katakana), while [Nouboud90] produced a review of the recognition of hand-printed (non-cursive) characters and conducted beta tests on a commercial system. [Bozinovic89, Simon92] surveyed off-line cursive word recognition, Jain et al. [Jain2000] reviewed statistical pattern recognition methods, and [Plamondon2000] gives a comprehensive survey of on-line and off-line handwriting recognition. Two bibliographies of the fields of OCR and document analysis appeared in [Jenkins93, Kasturi92]. [Stallings76, Mori84] produced surveys on the recognition of Chinese machine-printed and hand-printed characters, respectively, and Liu et al. [Liu2004] addressed the state of the art of on-line recognition of Chinese characters.

2.2.
General Review Of Arabic Character Recognition: Although almost one billion people worldwide, in several different languages, use Arabic characters for writing (Arabic, Persian, and Urdu are the most noted examples), Arabic character recognition has not been researched as thoroughly as Latin, Japanese, or Chinese. The first published work on Arabic character recognition may be traced back to 1975, to Nazif's master's thesis [Nazif75]. In his thesis a system for the recognition of printed Arabic characters was developed based on extracting strokes that he called radicals (20 radicals are used) and their positions. He used correlation between the templates of the radicals and the character image. A segmentation phase was included to segment the cursive text. Years later Badi and Shimura [Badi78, Badi80] and Nouh [Nouh80] worked on printed Arabic characters and Amin [Amin80] on hand-written Arabic characters. Surveys on Arabic optical text recognition (AOTR) may be found in [Amin85a, Amin98, Shoukry89, Jambi91, Albadr95, Nabawi2000, Ahmed94]. On-line systems are restricted to recognizing hand-written text. Some systems recognize isolated characters [Ali89, Amin80, Amin85b, Amin87, ElSheikh89, ElSheikh90b, ElWakil87, ElWakil89, Saadallah85] and hand-written mathematical formulas [ElSheikh90c, Amin91b], while others recognize cursive words [Badi78, Badi80, Badi82, Amin82a, Amin82b, Shaheen90, AlEmami90]. Since the segmentation problem in Arabic is non-trivial, the latter systems deal with a much harder problem. While several off-line systems use video cameras to digitize pages of text (e.g., [Abbas86, Goraine92, Amin86, HajHassan85, HajHassan90, Nouh80, Nouh87, Nouh89, Sarfraz2003, Sarfraz2004]), the trend now is to use scanners with resolutions ranging from 200 to 400 dots per inch (e.g., [AbdelAzim89c, AbdelAzim90a, AlYousefi88, Amin91a, Bouhlila89, ElDabi90, ElSheikh88a, Ramsis88, Sarfraz2003a, Sarfraz2003b, Zidouri2002, Zidouri2005]).
Scanners introduce less noise into an image, are less expensive, and are more convenient to use for character recognition, especially when coupled with automatic document feeders, automatic binarization, and image enhancement. Among the off-line systems that recognize hand-written isolated characters are [Abuhaiba90, AlYousefi90, AlTikriti85, ElDesouky92, Hyder88]. [Abbas86, AbdelAzim89b, Goneid92] recognize hand-written Arabic (Hindi) numerals, and [Badi80, Badi82, Goraine92, Jambi92, Zahour91] recognize hand-written words. The majority of off-line systems recognize typewritten cursive words [AbdelAzim89c, AbdelAzim90a, Bouhlila89, ElDabi90, Amin86, ElKhaly90, ElSheikh88b, Goraine89, Khella92, Margner92, Nazif75, Nouh87, Ramsis88, Tolba89, Tolba90, ElRamly89c, HajHassan90, HajHassan91], while [ElShiekh88a, Mahdi89, Mahmoud94, Nouh80, Nouh89, NurulUla88, Fayek92, Sarfraz2005d, Zidouri2005] recognize only typewritten isolated characters. The systems of [Abdelazim90b, AlBadr92, ElGowely90, Kurdy92, Fakir93] are intended to recognize typeset words. One of the systems [Abdelazim89a] recognizes bilingual (Arabic/Latin) typewritten words. Examples of systems for the recognition of other languages that use Arabic script are [Parhami81, Yalabik88, Hyder88], which are designed for the recognition of Persian, Ottoman (Old Turkish), and Urdu, respectively.

2.3. Applications Of Optical Character Recognition: Optical character recognition technology has many practical applications that are independent of the language treated. The following are some of these applications:
Financial Business Applications: For sorting bank checks, since the number of checks per day has been far too large for manual sorting.
Commercial Data Processing: For entering data into commercial data processing files, for example entering the names and addresses of mail order customers into a database. In addition, it can be used as a work sheet reader for payroll accounting.
In Postal Department: For postal address reading and sorting, and as a reader for handwritten and printed postal codes.
In Newspaper Industry: High-quality typescript may be read by recognition equipment into a computer typesetting system to avoid typing errors that would be introduced by keypunching the text on computer peripheral equipment.
Use By Blind: It is used as a reading aid using photo sensors and tactile simulators, and as a sensory aid with sound output. Additionally, it can be used for reading text sheets and reproducing Braille originals.
In Facsimile Transmission: This process involves the transmission of pictorial data over communication channels. In practice, the pictorial data is mainly text. Instead of transmitting characters in their pictorial representation, a character recognition system could be used to recognize each character and then transmit its text code.
Finally, it is worth noting that the major potential application for automatic character recognition is as a general data entry method for automating the work of an ordinary office typist.

2.4. Development Of New OCR Techniques: As OCR research and development advanced, demands on handwriting recognition also increased because a lot of data (such as addresses written on envelopes; amounts written on checks; names, addresses, identity numbers, and dollar values written on invoices and forms) were written by hand and had to be entered into the computer for processing. But early OCR techniques were based mostly on template matching, simple line and geometric features, stroke detection, and the extraction of their derivatives. Such techniques were not sophisticated enough for practical recognition of data handwritten on forms or documents. To cope with this, the Standards Committees in the United States, Canada, Japan, and some countries in Europe designed handprint models in the 1970s and 1980s for people to write in boxes [7].
Hence, characters written in such specified shapes did not vary too much in style, and they could be recognized more easily by OCR machines, especially when the data were entered by controlled groups of people; for example, employees of the same company were asked to write their data like the advocated models. Sometimes writers were asked to follow certain additional instructions to enhance the quality of their samples, for example, write big, close the loops, use simple shapes, do not link characters, and so on. With such constraints, OCR recognition of handprints was able to flourish for a number of years.

2.5. Recent Trends And Movements: As the years of intensive research and development went by, and with the birth of several new conferences and workshops such as IWFHR (International Workshop on Frontiers in Handwriting Recognition), ICDAR (International Conference on Document Analysis and Recognition), and others [13], recognition techniques advanced rapidly. Moreover, computers became much more powerful than before. People could write the way they normally did, characters no longer had to be written like specified models, and the subject of unconstrained handwriting recognition gained considerable momentum and grew quickly. By now, many new algorithms and techniques in pre-processing, feature extraction, and powerful classification methods have been developed [8, 9].

Chapter 3 ARABIC A CURSIVE SCRIPT

3.1. Arabic: Arabic is a Semitic language used as the principal language in many countries. Arabic is spoken by 234 million people [9] and is important in the culture of many more. While spoken Arabic varies across regions, written Arabic, sometimes called Modern Standard Arabic (MSA), is a uniform version used for official communication across the Arab world [9]. The characters of Arabic script and similar characters are used by a much higher percentage of the world's population to write languages such as Arabic, Farsi (Persian) and Urdu.
Thus the ability to automate the interpretation of written Arabic would have widespread benefits. Urdu is normally written in the calligraphic Nastaliq style, whereas Naskh is more commonly used for Arabic. Usually, bare transliterations of Arabic into Roman letters omit many phonemic elements that have no equivalent in English or other languages commonly written in the Roman alphabet. The National Language Authority of Pakistan has developed a number of systems with specific notations to signify non-English sounds, but these can only be properly read by someone already familiar with Urdu, Persian, or Arabic for letters such as ? ? ? ? or ? and Hindi for letters. Most Arabic characters, when joined, form an angle of about 45 degrees to the horizontal line, because of which reading Arabic script is faster than reading Roman script; on the other hand, this makes it harder for novice readers and machines to identify a word or segment one character from the rest. Unlike English script, there are no capital or small characters in Urdu, but the last character of a word can be regarded as a capital character, since in many cases it presents the full form of the character, while the characters at initial and middle positions are regarded as small. Every character has an independent (isolated) shape besides different joining forms, but some letters, like the characters making up the word Urdu (? ? ? ?) or those of a similar category, are not joinable. The Arabic alphabet uses consonant letters, vowels, diacritic marks, numerals, punctuation and a few superscript signs. The graphical representation of each letter has more than one form depending on its position and context in the word. In general each letter has four forms, namely beginning, middle, final and standalone, as shown in Table 3.1.

3.2. Arabic Letters: The Arabic alphabet contains 28 letters.
Each has between two and four shapes, and the choice of which shape to use depends on the position of the letter within its word or sub-word. The shapes correspond to the four positions: beginning of a (sub-)word, middle of a (sub-)word, end of a (sub-)word, and in isolation. Table 3.1 shows each shape for each letter. For letters without initial shapes, the isolated shape is used instead, and their medial shapes are their final shapes. Some letters have descenders or ascenders, which are portions that extend below the primary line on which the letters sit or above the height of most letters. There is no upper or lower case, only one case. Arabic script is written from right to left, and letters within a word are usually joined, even in machine print. Letter shapes and whether or not to connect depend on the letter and its neighbors. Letters are connected at the same relative height. The baseline is the line at the height at which letters are connected, and it is analogous to the line on which an English word sits. Letters are wholly above it except for descenders and some markings. There is no connection between separate words, so word boundaries are always represented by a space. Six letters, however, can be connected only on one side. When they occur in the middle of a word, the word is divided into multiple sub-words separated by space. A ligature is a character formed by combining two or more letters in an accepted manner. Arabic has numerous standard ligatures, which are exceptions to the above rules for joining letters. The most common is laam-alif, the combination of laam and alif; others include yaa-meem.

3.3. Problems Of Arabic Script: Despite a large character set, Arabic has a small set of basic shapes which are easily distinguishable from one another. The remaining characters differ from these shapes by dots or symbols above or below them [19]. Table 3.2 shows the groups of similar characters and their derived forms.
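As a rough illustration of this dot-based grouping, the sketch below (Python, standard library only) keeps a small hand-written table of isolated letter forms that share a base shape and differ only in their dots. The table, the shape labels (`BEH`, `HAH`, and so on) and the helper names are assumptions made for this example; it covers only a few well-known letter families and is not a reproduction of Table 3.2.

```python
import unicodedata
from collections import defaultdict

# Assumed illustration table: letter -> (base shape label, dot count, dot position).
# Only isolated forms of a few letter families are listed.
DOT_TABLE = {
    "\N{ARABIC LETTER BEH}": ("BEH", 1, "below"),
    "\N{ARABIC LETTER TEH}": ("BEH", 2, "above"),
    "\N{ARABIC LETTER THEH}": ("BEH", 3, "above"),
    "\N{ARABIC LETTER JEEM}": ("HAH", 1, "middle"),
    "\N{ARABIC LETTER HAH}": ("HAH", 0, None),
    "\N{ARABIC LETTER KHAH}": ("HAH", 1, "above"),
    "\N{ARABIC LETTER DAL}": ("DAL", 0, None),
    "\N{ARABIC LETTER THAL}": ("DAL", 1, "above"),
    "\N{ARABIC LETTER REH}": ("REH", 0, None),
    "\N{ARABIC LETTER ZAIN}": ("REH", 1, "above"),
    "\N{ARABIC LETTER SEEN}": ("SEEN", 0, None),
    "\N{ARABIC LETTER SHEEN}": ("SEEN", 3, "above"),
}


def group_by_base(table):
    """Collect the letters that share the same primary stroke."""
    groups = defaultdict(list)
    for letter, (base, _dots, _position) in table.items():
        groups[base].append(letter)
    return dict(groups)


def resolve(base, dots, position, table=DOT_TABLE):
    """Pick the letter once the base shape and its dots are both known."""
    for letter, features in table.items():
        if features == (base, dots, position):
            return letter
    return None  # no letter in the table has this combination


groups = group_by_base(DOT_TABLE)
for base, letters in groups.items():
    print(base, [unicodedata.name(ch) for ch in letters])

# The base shape alone is ambiguous; the dots decide between the candidates.
print(unicodedata.name(resolve("BEH", 2, "above")))
```

The point of the two-stage lookup is that a recognizer which classifies only the primary stroke narrows a letter down to its group, and a separate count of the dots is still needed to pick the letter within that group.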
As Table 3.2 shows, only 21 different groups exist in the 32-character set. This complicates the recognition phase for Arabic characters. Further study of the other forms (initial, middle and final) of these characters reveals that ein ( ) is similar to hamza (?), wow (?) might be confused with (?), ze (?) resembles noon ( ), and mem (?) can be confused with the middle form of ein ( ) and with the stand-alone goal-he (?). A key difference between Latin scripts and Arabic script is the fact that many letters differ only by one or more dots while the primary stroke is exactly the same [19].

3.4. Other Problems In Arabic OCR: All Muslims (almost ¼ of the people on the earth) can read Arabic because it is the language of Al-Quran, the holy book of Muslims. Even so, Arabic script recognition has not received enough interest from researchers. Little research progress has been achieved compared to that on Latin and Chinese. The solutions available in the market are still far from perfect [11, 14]. Several reasons led to this result: lack of financial support and platforms from any government (of countries where Arabic is the official language); lack of adequate support in terms of journals, books, etc., and lack of interaction between researchers in this field; lack of general supporting utilities like Arabic text databases, dictionaries, programming tools, and supporting staff; and the late start of Arabic text recognition (first publication in 1975 compared with the 1940s in the case of Latin character recognition). The research carried out on the Arabic language is mostly scattered and conducted outside the Arab world. No specialized conferences or symposia have been held so far. Algorithms developed for other scripts are not directly applicable to Arabic.

3.5. Characteristics Of Arabic Characters: The calligraphic nature of the Arabic script distinguishes it from other languages in several ways.
For example, Arabic text is written from right to left. No upper or lower case exists in Arabic, but sometimes the last character of a word is regarded as upper case because it always remains in its full form. Arabic has 28 basic characters, of which 16 have from one to three dots. Those dots differentiate between otherwise similar characters. Additionally, three characters can have a zigzag-like stroke. The dots are called secondaries and they are located above the character's primary part as in ALEF (?), or below as in BAA (?), or in the middle as in JEEM (?). Written Arabic text is cursive in both machine-printed and hand-written text. Within a word, some characters connect to the preceding and/or following characters, and some do not connect. The connectivity of characters results in a word having one or more connected components. We will refer to each connected piece of a word as a sub-word. The shape of an Arabic character depends on its position in the word; a character might have up to four different shapes depending on whether it is isolated, connected from the left (beginning form), connected from the right (ending form), or connected from both sides (middle form). A distinguishing feature of Arabic writing is the presence of a baseline. The baseline is a horizontal line that runs through the connected portions of text (i.e. where the character connection segments are located). The baseline has the highest number of text pixels (see figure 3.2). Characters in a word may overlap vertically (even without touching). Arabic characters do not have a fixed size (height and width). The character size varies according to its position in the word. Characters in a word can have diacritics. These diacritics are written as strokes, placed either on top of, or below, the characters. A different diacritic on a character may change the meaning of a word. Readers of Arabic are accustomed to reading un-diacritized text by deducing the meaning from context.
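The baseline property described above (the baseline is the row with the highest number of text pixels) can be sketched as a horizontal projection profile. The following is a minimal illustration in plain Python, assuming the text line has already been binarized into rows of 0 (background) and 1 (text); the function name `find_baseline` and the toy image are made up for the example.

```python
def find_baseline(image):
    """Return (row index, profile) for the row holding the most text pixels.

    `image` is a list of rows; each row is a list of 0 (background)
    and 1 (text) values, i.e. an already binarized line of text.
    """
    if not image:
        raise ValueError("image has no rows")
    # Horizontal projection profile: number of text pixels in each row.
    profile = [sum(row) for row in image]
    # max() keeps the first row that reaches the peak if several rows tie.
    baseline_row = max(range(len(profile)), key=profile.__getitem__)
    return baseline_row, profile


# Toy 6x8 "word": strokes rise above and dip below one connecting line.
toy_line = [
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]

row, profile = find_baseline(toy_line)
print(row, profile)
```

On real scans the peak is usually smoothed or searched for within a band of rows, since noise and skew can flatten the profile; this sketch relies on the noise-free, unskewed input assumed in Section 1.6.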
Several characters can combine vertically to form a ligature, especially in typeset and handwritten text. Arabic words may consist of one or more sub-words. Each sub-word may have one or more characters, because some Arabic characters cannot be joined to others from the left side. As an example, the word Ketab ( ) consists of two sub-words: Keta ( ), which consists of three characters, and BAA (?), which is a single character. There are only three characters that represent vowels, ? , ? or ? . However, there are other, shorter vowels represented by diacritics in the form of overscores or underscores, but the use of overscores and underscores in Arabic is less common. Dots may appear as two separated dots, touching dots, a hat or a stroke. Another style of Arabic handwriting is the artistic or decorative calligraphy, which is usually full of overlapping, making the recognition process even more difficult, for human beings as well as for computers.

3.6. Summary: The characteristics of Arabic script include its cursive nature, the right-to-left direction of writing, the change of form and shape when a character is placed at different positions in a word, loops, half-closed characters, and dots above or below a character. The National Language Authority defined a 32-character set, but it has 21 working characters besides numerals and diacritics.

Chapter 4 ARABIC CHARACTER RECOGNITION

4.1. Phases Of Arabic Character Recognition: In an off-line character recognition system, the user scans a particular script, runs the OCR and gets the document saved in a file format of his choice. The conversion of the text from the scanning phase to the final document involves a number of phases that are transparent to the user. The proposed system can be implemented in the following steps: Image Acquisition; Digitization; Preprocessing; Feature extraction; Recognition. Figure 4.1 shows the componen
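The five phases listed above can be sketched as a chain of small functions. This is only a toy illustration in plain Python, not the system developed in this work: the synthetic 5x5 "scan", the threshold of 128, the three features and the two reference shapes (`corner`, `tee`) are all assumptions chosen to keep the example self-contained.

```python
def acquire_image():
    """Image acquisition: stands in for a scanner; returns grey levels (0-255)."""
    return [
        [250, 250, 250, 250, 250],
        [250, 20, 250, 250, 250],
        [250, 30, 250, 250, 250],
        [250, 25, 35, 40, 250],
        [250, 250, 250, 250, 250],
    ]


def digitize(grey, threshold=128):
    """Digitization: dark pixels become 1 (ink), light pixels become 0 (paper)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in grey]


def preprocess(binary):
    """Preprocessing: crop the image to the bounding box of its ink pixels."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return []
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


def extract_features(glyph):
    """Feature extraction: ink density plus the ink share of the top and left halves."""
    ink = sum(map(sum, glyph))
    if ink == 0:
        return (0.0, 0.0, 0.0)
    height, width = len(glyph), len(glyph[0])
    top = sum(map(sum, glyph[:(height + 1) // 2]))
    left = sum(sum(row[:(width + 1) // 2]) for row in glyph)
    return (ink / (height * width), top / ink, left / ink)


def recognize(features, templates):
    """Recognition: return the label of the nearest stored feature vector."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, templates[label]))
    return min(templates, key=distance)


# Hypothetical reference shapes; a real system would learn these from training data.
TEMPLATES = {
    "corner": extract_features([[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
    "tee": extract_features([[1, 1, 1], [0, 1, 0], [0, 1, 0]]),
}

glyph = preprocess(digitize(acquire_image()))
label = recognize(extract_features(glyph), TEMPLATES)
print(label)
```

Keeping each phase as a separate function mirrors the list of steps: any one stage (for example the feature set or the classifier) can be replaced without touching the others.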