Sidra Shaikh
Social Media Information Extractor
8 min read · Jan 7, 2022


INTRODUCTION

The rapid growth of information technology over the past two decades has led to an enormous increase in the amount of information available on the World Wide Web. Social media platforms represent a new style of exchanging and sharing information. Social media refers to the ways in which people communicate, create, and share information and ideas in virtual communities and networks (such as Twitter and Facebook).

Essential Social Media Metrics

Social media, in many cases, provides more up-to-date information than conventional sources such as online news. To make use of this much information, it is necessary to extract structured information from this unstructured content. Information Extraction (IE) is a field of research that enables the use of such a large amount of information in an organized way.

Natural language analysis is used to improve the accuracy of extracting structured information from social networks. The main idea of monitoring is to analyze the noisy information in texts written by casual users: the natural language text is analyzed to extract information about different types of entities, relationships, or events.

CHALLENGES IN NLP

There are various challenges faced in extracting useful information from Social Media platforms. Some of them are listed below.

1. Informal language: Social network users post texts in an informal language that is noisy: posts often lack punctuation, contain misspellings, use non-standard abbreviations and capitalization, and do not form grammatically correct sentences. This degrades Part-Of-Speech tagging and makes Information Extraction from social media more challenging.

2. Short contexts: Social networks impose a maximum post length; Twitter is the classic example. Users therefore rely on abbreviations to pack more information into their posts. The shortness of the posts makes it difficult to disambiguate the mentioned entities and to resolve co-references among the feeds.

3. Noisy sparse contents: Users’ posts on a social network do not always contain useful information. To purify the input post stream, filtering is required as a pre-processing step.

4. Information about entities: People often use social media platforms to share information about their daily routines and local events, so many of the mentioned entities are not contained in the Knowledge Base. Disambiguation methods link the entities involved in the extracted information with the Knowledge Base, which means new disambiguation approaches suited to social media posts are needed.

5. Uncertain contents: Not all information on a social network is trustworthy. Information contained in users’ contributions can conflict with other sources and is sometimes simply false. The uncertainty involved in the extracted relations/facts is difficult to handle.

METHODS TO OVERCOME THESE CHALLENGES

To overcome the challenges faced in Information Extraction from social media, methods have been designed that target the issues mentioned above individually. Before discussing the proposed framework, we first describe some of its key components and aspects.

  1. Noisy Text Filtering: A huge amount of data is generated on social media each day. On a typical day, the number of tweets exceeds 140 million, sent by over 200 million users around the world, and these numbers are growing exponentially. In order to extract useful information, we need to filter out non-informative posts. Filtering can be based on domain, language, or other criteria, so that only relevant posts containing information about the domain of interest are processed (a minimal sketch of this stage, together with entity extraction, follows the figure caption below).
  2. Named Entity Extraction: Given the lack of a formal writing style, we need new approaches for NEE that do not rely heavily on syntactic features such as capitalization and Part-Of-Speech (POS) tags. Existing approaches for named entity recognition suffer from data sparsity problems when applied to short and informal texts, especially user-generated social media content. Semantic augmentation is a potential way to alleviate this problem: since rich semantic information is implicitly preserved in pre-trained word embeddings, they are ideal resources for semantic augmentation.
  3. Named Entity Disambiguation: This is one of the most interesting pieces of the information extraction puzzle. Named Entity Disambiguation is the task of mapping expressions of interest, such as names of people, locations, and companies, from an input text document to the corresponding unique entities in a target Knowledge Base. The target Knowledge Base depends on the application, but vast text data is available on Wikipedia. Named Entity Disambiguation systems usually do not employ Wikipedia directly; instead, they exploit databases that contain structured versions of it, such as DBpedia or Wikidata.
  4. Feedback Extraction: A feedback loop runs between the FE (fact extraction) module and the NED (Named Entity Disambiguation) module. This feedback helps to resolve errors made earlier in the disambiguation step.
Traditional IE framework versus our proposed IE framework.
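
To make these components concrete, here is a minimal sketch in Python of the first two stages: noisy text filtering followed by named entity extraction. The domain keywords and sample posts are hypothetical, and NLTK’s off-the-shelf ne_chunk stands in for the embedding-based NEE approaches described above.

```python
# A minimal sketch of the first two pipeline stages: keyword-based noisy
# text filtering, then named entity extraction with NLTK. The domain
# keywords and sample posts are made up for illustration.
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

# Hypothetical domain filter: keep only posts about company/product news.
DOMAIN_KEYWORDS = {"launch", "release", "acquisition"}

posts = [
    "gm everyone!! #coffee",                        # non-informative, filtered out
    "Google announced the acquisition of Fitbit.",  # informative, kept
]

def is_informative(post: str) -> bool:
    """Noisy text filtering: keep posts mentioning a domain keyword."""
    tokens = {t.lower() for t in nltk.word_tokenize(post)}
    return bool(tokens & DOMAIN_KEYWORDS)

for post in filter(is_informative, posts):
    # Named entity extraction on the surviving posts.
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(post)))
    entities = [(" ".join(tok for tok, _ in st.leaves()), st.label())
                for st in tree.subtrees()
                if st.label() in {"PERSON", "ORGANIZATION", "GPE"}]
    print(post, "->", entities)
```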

NLP MODELS:

NLP techniques map human language to machine language: they model how a user requests information and how a computer or software understands it. Simply searching for keywords, however, is not an adequate method for social network communication.

Natural Language Processing

NLP approaches that are essential for social media monitoring include:

Automatic Summarization: Automatic summarization is the process of shortening a text document with computer software in order to create a summary that retains the most significant points of the original document. The main insight behind summarization is to find a representative subset of the data that contains the information of the entire set. Generally, there are two approaches to automatic summarization: extraction and abstraction. Extraction refers to selecting a subset of existing words, phrases, or sentences from the original text to form the summary. In contrast, abstraction builds an internal semantic representation and then uses natural language generation techniques to create a summary that is closer to what a human might write. An automatic summarization system takes three fundamental steps, namely Analysis, Transformation, and Realization.

Process of Auto Summarization

In the Transformation step of auto summarization, an ordered text is generated by manipulating the internal representation produced during Analysis.
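
Putting the three steps together, here is a minimal sketch of the extractive approach using NLTK. The frequency-based scoring below is just one simple heuristic, not the only way to rank sentences.

```python
# A minimal extractive-summarization sketch: Analysis (word frequencies),
# Transformation (sentence ranking), Realization (emit top sentences in
# their original order). The scoring heuristic is deliberately simple.
from collections import Counter

import nltk

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def summarize(text: str, n_sentences: int = 2) -> str:
    stop = set(nltk.corpus.stopwords.words("english"))
    sentences = nltk.sent_tokenize(text)
    # Analysis: count how often each content word occurs.
    words = [w.lower() for w in nltk.word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    freq = Counter(words)
    # Transformation: score each sentence by its content-word frequencies.
    scores = {i: sum(freq.get(w.lower(), 0) for w in nltk.word_tokenize(s))
              for i, s in enumerate(sentences)}
    # Realization: output the highest-scoring sentences in document order.
    best = sorted(sorted(scores, key=scores.get, reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in best)
```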

Chunking: Chunking is a simple method used for entity extraction. Rather than treating a text as individual tokens separated by white space, chunking groups sequences of tokens from the source text into chunks such as noun phrases. It is sometimes easier to define what should be excluded from a chunk than what should be included. A chink is a sequence of tokens that is not included in a chunk, and removing such a sequence from a chunk is called chinking. If the matching sequence of tokens spans an entire chunk, the whole chunk is removed; if the sequence appears in the middle of a chunk, those tokens are removed, leaving two chunks where there was one before; and if the sequence is at the periphery of a chunk, the tokens are removed and a smaller chunk remains.
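
The following sketch shows chunking and chinking with NLTK’s RegexpParser on a pre-tagged sentence, so the output is deterministic; the grammar is illustrative.

```python
# A minimal chunking/chinking sketch with NLTK's RegexpParser.
import nltk

# First chunk everything into one NP, then "chink" out past-tense verbs
# (VBD) and prepositions (IN); the chink splits the chunk into smaller ones.
grammar = r"""
  NP:
    {<.*>+}        # chunk everything
    }<VBD|IN>+{    # chink sequences of VBD and IN
"""
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
parser = nltk.RegexpParser(grammar)
print(parser.parse(tagged))
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
```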

Process of Text Mining

Parts-of-Speech Tagging: Part-of-speech tagging is performed by a software program that reads text in some language and assigns a part of speech to every word: noun, verb, and adjective, to name a few. Typically, computer programs use fine-grained tags such as ‘noun-plural’. Dictionaries list the category, or categories, of a particular word, which means a word may belong to more than one part of speech. For example, ‘run’ is both a noun and a verb. Taggers use probabilistic information to resolve this ambiguity.
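
A minimal tagging sketch with NLTK’s default tagger, illustrating how the ambiguous word ‘run’ can receive different tags depending on context:

```python
# A minimal POS-tagging sketch with NLTK's default tagger.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["I went for a run", "Athletes run every day"]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# The tagger typically labels the first "run" as a noun (NN) and the
# second as a verb (VBP), using probabilistic context information.
```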

Word Sense Disambiguation: This is an open problem in NLP and ontology: identifying the proper sense of a word in a sentence when the word has multiple meanings. It is easy for a human to recognize the meaning of a word based on background knowledge of the subject, but identifying the intended sense is hard for a machine. This technique provides a mechanism to reduce the ambiguity of words within the text. For example, WordNet is a free lexical database for English that contains a large collection of words and their senses.
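
A minimal disambiguation sketch using the Lesk algorithm over WordNet, both available in NLTK:

```python
# A minimal word-sense disambiguation sketch: the Lesk algorithm picks
# the WordNet synset whose dictionary gloss overlaps most with the
# surrounding context words.
import nltk
from nltk.wsd import lesk

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

context = nltk.word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, "bank")
if sense is not None:
    print(sense.name(), "->", sense.definition())
```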

Fact/Relation Extraction: Once named entities have been identified in a text, we can extract the relations or facts that exist between specified types of named entities. The objective of fact extraction is to detect and classify the semantic relations between entities in the text and to fill a predefined template using those entities.
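
As a sketch of this template filling, the example below uses NLTK’s relation extractor over its bundled IEER news corpus to pull out ORGANIZATION-in-LOCATION facts; the regular expression defines the relation pattern.

```python
# A minimal pattern-based relation-extraction sketch on NLTK's IEER
# corpus: find ORG-in-LOC relations whose connecting text matches "in".
import re

import nltk

nltk.download("ieer", quiet=True)

IN = re.compile(r".*\bin\b(?!\b.+ing)")  # "in", but not e.g. "in winning"
for doc in nltk.corpus.ieer.parsed_docs("NYT_19980315"):
    for rel in nltk.sem.extract_rels("ORG", "LOC", doc, corpus="ieer", pattern=IN):
        # Each result fills the (subject, relation, object) template.
        print(nltk.sem.rtuple(rel))
```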

Sentiment Analysis: Sentiment analysis is an NLP technique that identifies, extracts, and quantifies the writer’s opinion expressed in a piece of text. A piece of text can reflect a range of sentiments: positive, negative, or neutral. Sentiment analysis is often used to analyze survey responses, online reviews, and social media streams. It returns the detected sentiment as a numerical value from -1.0 to 1.0, where 1.0 is the most positive and -1.0 the most negative. A practical application is on an e-commerce website: a popular or ‘highly rated’ product may accumulate hundreds of reviews, making it difficult to find the relevant ones that would help in making a decision. Marketers use sentiment analysis to gauge opinion, to check whether their campaigns are having the intended effect, and to weed out fraudulent opinions posted by reviewers.
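
A minimal sketch using NLTK’s VADER analyzer, whose ‘compound’ score lies in the [-1.0, 1.0] range described above; the sample reviews are made up:

```python
# A minimal sentiment-analysis sketch using NLTK's VADER analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
reviews = ["Absolutely love this phone, great battery!",
           "Terrible quality, a total waste of money."]
for review in reviews:
    # The "compound" score ranges from -1.0 (most negative) to 1.0 (most positive).
    print(review, "->", sia.polarity_scores(review)["compound"])
```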

OPEN SOURCE NLP LIBRARIES:

NLP libraries are the algorithmic building blocks of NLP in real-world applications. They provide free APIs, with no need to set up or provision servers and infrastructure.

Apache OpenNLP: It is an open source machine learning toolkit for processing natural language text. It provides services such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. It also offers a command line interface with predefined models, and new models can be trained and evaluated.

Natural Language Toolkit (NLTK): It is a leading Python library that provides modules for processing text: classifying, tokenizing, stemming, semantic reasoning, parsing, and more. It provides user-friendly interfaces to over 50 corpora and lexical resources such as WordNet.

Stanford NLP: It is a suite of NLP tools that provides part-of-speech tagging, named entity recognition, a coreference resolution system, sentiment analysis, and more. It offers statistical NLP, deep learning NLP, and rule-based NLP tools that are broadly used in industry, academia, and government.

MALLET: It is a Java package that provides Latent Dirichlet Allocation, document classification, clustering, topic modeling, information extraction, and more.

CONCLUSION:

In this article, we saw that social media has become one of the main ways for people to share their thoughts and exchange data, and that information extraction from social media is an emerging field. We discussed how this social media data can be used and how to extract it with different NLP models. While extracting data with NLP we face many challenges, such as co-references and unwanted data, which make it difficult to extract the exact information we need. To overcome these problems, a framework for extracting data with NLP was proposed. The framework performs noisy text filtering, which removes unwanted or non-informative data so that only the required data is processed. Its other components, discussed above, are named entity extraction, named entity disambiguation, and feedback extraction.

