A System for Social Text Normalization

Social media, along with the Web as a whole, continues to grow. Text on social media follows its own grammatical conventions and its own ways of using words. For example, a social media user may write a sentence like “I like u w you smiiiile”. Text of this kind is hard to process with standard natural language processing (NLP) tools and hard to use in analyses. A normalization step is therefore essential for processing the huge volume of data generated by social media users all around the world.

Text normalization is the task of taking text that contains OOV (out-of-vocabulary) words and restoring those words to their normal, or canonical, form so that the text can be used with NLP tools. The canonical forms of the OOV words are called IV (in-vocabulary) words.
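
As a minimal illustration of the IV/OOV distinction, the sketch below splits the tokens of the example sentence using a toy vocabulary. The `VOCABULARY` set and the `split_iv_oov` helper are hypothetical names introduced here for illustration only; a real system would rely on a full lexicon and a proper tokenizer.

```python
# Toy lexicon: an assumption for illustration, not the system's actual dictionary.
VOCABULARY = {"i", "like", "you", "your", "with", "smile"}


def split_iv_oov(sentence):
    """Return (iv_tokens, oov_tokens) for a lowercased, whitespace-tokenized sentence."""
    iv, oov = [], []
    for token in sentence.lower().split():
        if token in VOCABULARY:
            iv.append(token)
        else:
            oov.append(token)
    return iv, oov


iv, oov = split_iv_oov("I like u w you smiiiile")
print(iv)   # ['i', 'like', 'you']
print(oov)  # ['u', 'w', 'smiiiile']
```

Here "u", "w" and "smiiiile" are the OOV tokens that a normalizer would try to map back to IV words.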

A further difficulty is detecting which OOV words need to be converted into IV forms, because not all OOV words are ill-formed. Some OOV words may be new words that are not yet included in dictionaries, and some may be special words or terms. One part of the normalization task is therefore detecting ill-formed words. Another issue is that the same OOV word may be normalized into different IV words depending on its context. We therefore take some context into account to make the proper choice.
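
The two issues above, deciding which OOV words are ill-formed and choosing among several IV candidates by context, can be sketched roughly as follows. This is not the system's actual method: the vocabulary, the `WELL_FORMED_OOV` list, the `CANDIDATES` table and the positional rule for "2" are all assumptions made up for this illustration.

```python
import re

# Toy resources: all of these are illustrative assumptions.
VOCABULARY = {"i", "like", "you", "your", "with", "smile", "see", "to", "too", "want"}
# OOV words that are well formed (new words, names, terms) and must be left alone.
WELL_FORMED_OOV = {"selfie", "hashtag"}
# One OOV form may map to several IV words.
CANDIDATES = {"u": ["you"], "w": ["with"], "2": ["to", "too"]}


def collapse_repeats(token):
    """Yield variants with runs of 3+ identical characters cut to two, then to one."""
    yield re.sub(r"(.)\1{2,}", r"\1\1", token)
    yield re.sub(r"(.)\1{2,}", r"\1", token)


def is_ill_formed(token):
    """An OOV token counts as ill-formed only if we can propose an IV form for it."""
    if token in VOCABULARY or token in WELL_FORMED_OOV:
        return False
    return token in CANDIDATES or any(v in VOCABULARY for v in collapse_repeats(token))


def normalize(tokens):
    """Replace ill-formed tokens; leave IV words and well-formed OOV words untouched."""
    out = []
    for i, token in enumerate(tokens):
        if not is_ill_formed(token):
            out.append(token)
            continue
        options = CANDIDATES.get(token) or [
            v for v in collapse_repeats(token) if v in VOCABULARY
        ]
        # Crude context rule (an assumption): with several candidates, take the
        # last one at the end of the sentence and the first one elsewhere.
        is_last = i == len(tokens) - 1
        out.append(options[-1] if len(options) > 1 and is_last else options[0])
    return out


print(normalize("i like u w you smiiiile".split()))
print(normalize("i want 2 see you 2".split()))
```

In this sketch "selfie" is left unchanged because it is listed as a well-formed OOV word, while "2" becomes "to" in the middle of a sentence and "too" at its end. A realistic system would replace the positional rule with a context model over neighbouring words.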