Text Analytics through Machine Learning

A text analysis application which processes unstructured textual data for you

The Problem

Much of the data on the internet is unstructured, and the majority of it is in textual form. The amount of unstructured textual data is therefore overwhelming, and it needs to be processed and structured before it can be made useful.

Textual data mainly includes blogs, articles, interviews, social media post descriptions and even research papers published on the internet. Even this blog post is part of that unstructured textual data.

The current environment is highly competitive for every business and individual. Analyzing the resources available on the internet is essential for arriving at actionable insights and strategies.

Our aim was to build a system that can process large chunks of data into metadata that can be stored in databases and mapped as needed.

Our Approach

Brainstorming

Brainstorming mainly centered on deciding which functionality would best help us achieve our aim. After weighing multiple text analytics techniques, we landed on text classification, topic modeling, article generation and domain interaction. The second phase of brainstorming was shortlisting ML models for these tasks. The game plan was to test them out and compare their results with each other.

Innovation

One of the main challenges was developing an interface that is easy to use. We displayed the results in such a way that the final interface is dynamic without being overwhelming. The text analytics are presented through interactive graphs to improve the user experience. Furthermore, we fine-tuned and compared various state-of-the-art transformer models to classify and cluster the text, and used GPT-3-based text generation to mimic the style of the given unstructured text.

Iteration

Finally, we tested the shortlisted models both before and after integrating them into a single application to ensure a seamless experience. Multiple iterations of the final algorithm helped us pinpoint discrepancies in the application and improve its quality.

Our Solution

Our solution is a dynamic application that analyzes unstructured textual data. It classifies and generates text on the basis of specific keywords related to the domain.

Data is structured through text classification and topic modeling and is then represented through clusters and graphs. Information on the specified topics is extracted from the data, and the topics are clustered on the basis of the probability of their occurring together in the unstructured data.
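To make the idea of "probability of occurring together" concrete, here is a minimal sketch of one way to estimate it. It assumes every document has already been tagged with a set of topics; the function name `cooccurrence_probabilities` and the sample topic labels are hypothetical and are not taken from the application itself.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probabilities(doc_topics):
    """Estimate how likely each pair of topics is to share a document.

    doc_topics: one collection of topic labels per document.
    Returns {(topic_a, topic_b): fraction of documents containing both}.
    """
    n_docs = len(doc_topics)
    pair_counts = Counter()
    for topics in doc_topics:
        # sorted(set(...)) gives every unordered pair one canonical key
        for pair in combinations(sorted(set(topics)), 2):
            pair_counts[pair] += 1
    return {pair: count / n_docs for pair, count in pair_counts.items()}


# Hypothetical topic labels for four documents
docs = [
    {"finance", "technology"},
    {"finance", "policy"},
    {"finance", "technology", "policy"},
    {"health"},
]

probs = cooccurrence_probabilities(docs)
for pair, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(pair, round(p, 2))
```

Pairs with a high value are natural candidates for the same cluster, while a topic that never appears alongside another (here, "health") ends up on its own.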

The second aspect of the application is text generation, which produces context-driven text that is stylistically similar to the text in the unstructured data. The ML model writes articles for you, using the same stylistic tone and context as the text documents that make up the unstructured data.

Domain interaction is another aspect of the application. You enter any two keywords and, in return, get an analysis of how those two topics have been discussed in relation to each other in the unstructured data.
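A simplified way to picture domain interaction is to look for sentences that mention both keywords. The sketch below does this with plain keyword matching, which is an assumption made for illustration; the application relies on ML models rather than this rule-based approach. The names `domain_interaction`, `split_sentences` and `mentions`, and the sample documents, are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence, keyword):
    """Case-insensitive, whole-word match of keyword in sentence."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def domain_interaction(documents, keyword_a, keyword_b):
    """Collect sentences mentioning both keywords and a simple overlap score."""
    both = []
    count_a = count_b = 0
    for doc in documents:
        for sentence in split_sentences(doc):
            has_a = mentions(sentence, keyword_a)
            has_b = mentions(sentence, keyword_b)
            count_a += has_a
            count_b += has_b
            if has_a and has_b:
                both.append(sentence)
    # Jaccard overlap: sentences with both / sentences with at least one
    either = count_a + count_b - len(both)
    return {
        "sentences_with_both": both,
        "jaccard": len(both) / either if either else 0.0,
    }


# Hypothetical sample documents
docs = [
    "Banks are adopting AI quickly. Regulation of AI is still catching up.",
    "New regulation affects how banks report risk. AI was not discussed.",
]

report = domain_interaction(docs, "banks", "regulation")
print(report["sentences_with_both"])
print(round(report["jaccard"], 2))
```

The returned sentences show where the two topics meet, and the overlap score gives a rough sense of how closely they are tied together in the data.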

Given below is a visual representation of the text classification tool. It walks you through how the text is classified on the basis of keywords and how that classification is then represented through interactive graphs.

A zero-shot text classifier model that classifies textual data based on specific keywords.
Text classification results based on specific keywords. The percentage denotes how much the text talks about the specified keyword.
A bar graph depicting how often the keywords are mentioned in the unstructured data through the years.
Topic clustering of the unstructured data, showing how the topics are discussed in relation to each other in the text.
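
For readers who want to try keyword-based zero-shot classification themselves, here is a minimal sketch using the Hugging Face `transformers` zero-shot pipeline. This is an assumption about tooling, not a description of the exact model behind the screenshots: the model name `facebook/bart-large-mnli` and the helper names `classify` and `to_percentages` are illustrative choices.

```python
def to_percentages(labels, scores):
    """Map each label to its score expressed as a percentage."""
    return {label: round(score * 100, 1) for label, score in zip(labels, scores)}


def classify(text, keywords, model_name="facebook/bart-large-mnli"):
    """Score text against each keyword with a zero-shot pipeline.

    Assumes the `transformers` package and a backend such as PyTorch
    are installed. The first call downloads the model weights.
    """
    # Imported lazily because loading the library and model is slow
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=model_name)
    # multi_label=True scores every keyword independently,
    # so the percentages do not have to sum to 100
    result = classifier(text, candidate_labels=keywords, multi_label=True)
    return to_percentages(result["labels"], result["scores"])


# Example call (needs the model download, so it is left commented out):
# print(classify("The central bank raised interest rates again.",
#                ["finance", "sports", "technology"]))
```

Because the keywords are passed in at prediction time, no retraining is needed when the list of keywords changes, which is what makes the zero-shot setup a good fit for user-specified keywords.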

We wouldn't have been able to tackle this problem if it wasn't for all the amazing tech out there!