Transactions on Engineering and Computing Sciences - Vol. 11, No. 2

Publication Date: April 25, 2023

DOI:10.14738/tecs.112.14328.

Sargsyan, S., & Hovakimyan, A. (2023). Creating a Sentiment Analyzer for Text Messages. Transactions on Engineering and Computing Sciences, 11(2), 53-60.

Services for Science and Education – United Kingdom

Creating a Sentiment Analyzer for Text Messages

Siranush Sargsyan

Yerevan State University, Yerevan, Armenia

Anna Hovakimyan

Yerevan State University, Yerevan, Armenia

ABSTRACT

This work uses logistic regression and neural network methods to develop a methodology for creating and implementing an analyzer that determines the sentiment of a text in Armenian. Based on a selection of data from social networks, the analyzer determines the tone of a message entered by the user. Note that such work with Armenian texts is carried out for the first time. Sentiment analysis helps the reader form an opinion about the content of a message without reading the entire text, saving unnecessary work and time. A sentiment analyzer based on several machine-learning methods for text classification has been created. The paper presents the results of a comparative analysis of the models underlying this analyzer. The analyzer is used in processing the results of a social survey of students, both to determine the students' opinion about the relevance of teaching a given subject and to assess the quality of a teacher's teaching of a subject. The analyzer is also intended to be used to determine the psychological state of a student before an exam in order to adapt the exam to the student.

Keywords: sentiment analysis, machine learning, logistic regression, neural network,

classifier, social networks.

INTRODUCTION

With the spread and ubiquitous use of the Internet, the task of analyzing the sentiment of a text has become relevant and has attracted wide interest. This task is known as Sentiment Analysis [1]. The purpose of Sentiment Analysis is to identify sentiments expressed in textual content. It can be used, for example, to determine people's opinion of the quality of a particular product, to analyze the results of a public opinion poll, to determine the sentiment of a film or article review, or to assess the quality of teaching of a subject. Sentiment Analysis allows researchers to address a wide range of problems, such as forecasting the stock market and election results or determining reactions to certain events or news. Determining the tone of text messages makes it possible to identify and analyze the author's emotional assessment of the objects referred to in the text, and thus to learn about the author's attitude to a phenomenon, product, or service [1, 2].

Sentiment Analysis can be seen as an alternative to traditional surveys. Its methods make it possible not only to show the current attitude to the object of study, but also to predict changes in that attitude. For example, in the USA there is a project called Pulse of the Nation, whose

goal is to determine the mood of the country's citizens over the course of the day by analyzing their posts on the popular social network Twitter [2].

Since the beginning of the 2010s, the main trends in sentiment analysis have been its application to social media and multilingual tone analysis. Until now, research on texts in languages other than English has occupied only a small place in this work. The reason is most often the lack of available thesauri in national languages for tonal analysis of texts.

Until 2014, there were only 12 publicly available non-English lexicons for tone analysis. In 2014, an attempt was made to create dictionaries for other languages. The agreement of the results with the traditional approach was below 50%, and the proposed dictionaries are practically unused.

In 2016-2018 many works appeared that analyze texts in the main European languages. In general, however, the authors do not develop dictionaries for each language but use machine translation of negative and positive lexemes from available English thesauri. This, of course, reduces the quality of the analysis [3].

Despite its promise, this direction is not yet widely used in text information processing systems. The reasons are the difficulty of identifying emotional vocabulary in texts, the imperfection of existing text analyzers, and dependence on the subject area. Improving existing methods and developing new methods for sentiment analysis based on machine learning is therefore an urgent task [4].

The purpose of this work was to create a sentiment analyzer that classifies Armenian texts from social networks according to their emotional characteristics. Two text classification methods, Logistic Regression and Neural Networks, were applied to create a hybrid analyzer model. Various experiments were carried out to assess the quality of this analyzer.

We, the authors of this article, have been engaged in the educational process for many years. During exams, students are in different psychological states. Before an exam, a sentiment analyzer applied to special questionnaires can classify students into categories, which makes it possible to balance their condition and conduct the exam adaptively. Work is underway to create special extended questionnaires for classifying students according to their emotional state.

STATEMENT OF THE PROBLEM

Many effective methods exist for analyzing the sentiment of texts, each with its own characteristics. Solving these problems with information technology relies on Natural Language Processing (NLP) methods [1, 4].

For a long time, mainstream NLP methods relied on linear models such as support vector machines and logistic regression. Recently, other machine learning approaches have taken over. In particular, neural network models are successfully applied in natural language processing [4, 6].

This paper discusses the tools and methodologies that underlie logistic regression and recurrent neural networks [5, 6, 8]. The models under consideration are assumed to be able to learn how to classify texts. Many of them learn from examples presented together with the correct answers.

The classic method for determining the tone of a text is the dictionary method. It is based on

dictionaries of emotionally colored words. Each term from the dictionary is assigned its own

weight, either manually or with the help of third-party tools.

There are different methods for determining tonality using dictionaries. For example, each term in the dictionary is rated as carrying positive or negative sentiment. The positive coloring of a text is then determined by the weight of the term in the text that has the maximum positive coloring. The negative coloring is determined in the same way. A term that is absent from the dictionary is assigned a minimum positive weight [10, 11].
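The dictionary scheme described above can be illustrated with a minimal sketch. The lexicon entries, their weights, the default weight for unknown terms, and the function names are hypothetical placeholders chosen for illustration; they are not taken from the dictionaries built in this work, and whitespace tokenization is an assumption.

```python
# Hypothetical lexicon: positive weights mean positive coloring,
# negative weights mean negative coloring. Words and values are
# illustrative placeholders only.
LEXICON = {"good": 0.8, "excellent": 1.0, "bad": -0.7, "terrible": -1.0}

# Assumed "minimum positive weight" given to terms missing from the lexicon.
DEFAULT_WEIGHT = 0.01


def text_coloring(text):
    """Return the (positive, negative) coloring of a text.

    Positive coloring is the largest positive term weight in the text;
    negative coloring is the magnitude of the most negative term weight.
    """
    weights = [LEXICON.get(tok, DEFAULT_WEIGHT) for tok in text.lower().split()]
    positive = max((w for w in weights if w > 0), default=0.0)
    negative = max((-w for w in weights if w < 0), default=0.0)
    return positive, negative


def dictionary_sentiment(text):
    """Label a text by comparing its positive and negative coloring."""
    positive, negative = text_coloring(text)
    return "positive" if positive >= negative else "negative"
```

For example, `dictionary_sentiment("the film was good but the ending was terrible")` compares a positive coloring of 0.8 with a negative coloring of 1.0 and labels the text as negative. Breaking ties in favor of the positive class is a design choice of this sketch.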

The main drawback of dictionary methods is the procedure for compiling dictionaries of terms and specifying their weights, since the weight of a term almost always depends on the chosen subject area. These methods do not allow a universal dictionary of terms to be created, since the weight of a term can differ significantly between subject areas or even be opposite. The compilation of language corpora has its own characteristics and difficulties, particularly for languages that are relatively rarely used on the Internet (for example, Armenian).

For our implementation of the models, the compilation of the corpus of the Armenian language

was especially time-consuming. The source of information for the Armenian texts was the social

networks Twitter and Facebook. We have developed and implemented a scheme for compiling

dictionaries of terms and indicating their weights based on these texts.

At the next stage, features are selected, for which it is determined whether they are present in

the text or not. For text classification, words, pairs of words, whole phrases, etc., occurring in it,

are considered as features. In the constructed analyzer, words expressing emotions are used as

features.

DESCRIPTION OF THE DOCUMENT CLASSIFICATION PROBLEM

The classification of texts is one of the main tasks of a search engine; its purpose is to assign a text to a particular class based on its content. In this case, the classification is performed by the sentiment analyzer [4, 5].

Assume that we have a finite set of classes $C = \{c_1, \dots, c_{|C|}\}$, a finite set of documents $D = \{d_1, \dots, d_{|D|}\}$, and the desired function $\Phi: D \times C \to \{0, 1\}$, which for an arbitrary pair $\langle d, c \rangle$, where $d \in D$ and $c \in C$, determines whether document $d$ belongs to class $c$ or not.

The task of classification is to find a function $\Phi'$ that approximates $\Phi$. This function $\Phi'$ is the classifier.

At the initial stage of machine learning, a group of documents $Q = \{d_1, \dots, d_{|Q|}\} \subseteq D$ is allocated. The value of $\Phi$ is known for every pair $\langle d_i, c_j \rangle \in Q \times C$.

The subset $Q$ is divided into two disjoint sets, a training set and a testing set. The classifier $\Phi'$ is determined with the help of the training set $Tr = \{d_1, \dots, d_{|Tr|}\}$, and $Te = \{d_{|Tr|+1}, \dots, d_{|Q|}\}$ is the set of documents on which $\Phi'$ is tested.

Each text document in the testing set is given to the classifier as input, and the result $\Phi'(d_i, c_j)$ is compared with the value $\Phi(d_i, c_j)$. The more matches, the more efficient the classifier.
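The split of the labeled set into training and testing parts, and the counting of matches on the testing part, can be sketched as follows. This is a minimal illustration for a single fixed class (so the classifier maps a document to 0 or 1); the 80/20 split ratio, the random seed, and the function names are assumptions and are not specified in the text.

```python
import random


def split_documents(labeled_docs, train_fraction=0.8, seed=0):
    """Split the labeled set Q into two disjoint lists: training and testing.

    labeled_docs is a sequence of (document, label) pairs. The fraction and
    the seed are illustrative defaults, not values from the paper.
    """
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)  # shuffle a copy, reproducibly
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]


def match_rate(classifier, test_set):
    """Fraction of test documents where the classifier agrees with the known label."""
    if not test_set:
        raise ValueError("test set is empty")
    matches = sum(1 for doc, label in test_set if classifier(doc) == label)
    return matches / len(test_set)
```

A classifier built from the training list would then be passed to `match_rate` together with the testing list; a value of 1.0 means that every test document was classified as in the known labeling.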

There are two types of classification, depending on the values taken by the function $\Phi'$:

• Precise classification: $\Phi': D \times C \to \{0, 1\}$

• Probabilistic classification: $\Phi': D \times C \to [0, 1]$

Thus, in the case of precise classification, an arbitrary pair (d, c) is mapped to the value 0 or 1, that is, the document either belongs to the class or does not. In the case of probabilistic classification, an arbitrary pair (d, c) is mapped to a number from the interval [0, 1], which indicates the probability that document d belongs to class c.
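A probabilistic classifier can be turned into a precise one by thresholding its output. The sketch below assumes a threshold of 0.5, which is a common default but is not stated in the text; the function name is hypothetical.

```python
def to_precise(probability, threshold=0.5):
    """Map a probabilistic output in [0, 1] to a precise label in {0, 1}."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1 if probability >= threshold else 0
```

With this convention, a pair (d, c) scored at 0.73 is assigned to the class, while a pair scored at 0.2 is not.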

The main stages of document classification are indexing the documents, selecting and training a classifier, and evaluating the results of the classifier.

During indexing, documents are converted to a common format. Since the volume of documents is quite large, words that carry no emotional coloring must be removed from them [10, 11]. These can be auxiliary verbs, linking words, and so on. At the next stage, the documents are converted into vector format, that is, each document is represented as a vector of weights. The vectors are then normalized.
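The indexing steps can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list and the vocabulary are hypothetical English placeholders, tokens are split on whitespace, term counts serve as weights, and L2 normalization is used, since the text does not specify which weighting or norm was applied.

```python
import math
from collections import Counter

# Hypothetical stop-word list of words without emotional coloring.
STOP_WORDS = {"is", "the", "a", "and", "was"}


def index_document(text, vocabulary):
    """Convert a text into an L2-normalized vector of term counts.

    vocabulary is an ordered list of feature terms; position k of the
    result holds the normalized weight of vocabulary[k].
    """
    # Common format: lower-case, whitespace tokens, stop words removed.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    counts = Counter(tokens)
    vector = [float(counts[term]) for term in vocabulary]
    # Normalize to unit length; an all-zero vector is returned unchanged.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector
```

For the vocabulary `["good", "bad", "exam"]`, the text "The exam was good" yields a vector with equal non-zero weights at the positions of "good" and "exam" and zero at the position of "bad".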

In this work, the effectiveness of the implemented classifiers is evaluated with the widely used F1 metric, which combines two quantities: precision and recall. Both are easily obtained from the confusion matrix [12].
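For a binary problem with true positives TP, false positives FP, and false negatives FN taken from the confusion matrix, the standard definitions are precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). A minimal sketch follows; the function name and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def f1_from_confusion(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    When a denominator is zero, the corresponding value is reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is the harmonic mean of precision and recall, it is high only when both are high, which makes it more informative than plain accuracy when the classes are unbalanced.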

APPLICATION OF LOGISTIC REGRESSION

The analyzer is based on the method of logistic regression. The method was chosen for its high accuracy in the analysis of short messages in English and Russian, which is relevant for the task at hand. As is known, logistic regression is used to predict the probability of an event occurring based on the values of a set of features [4, 5].

In our case, the event is the presented text belonging to a given sentiment class, and the features are the words in the text. A dependent variable y is introduced, which takes one of two values (0: the event did not occur, 1: the event occurred), together with a set of independent variables (called features) $x_1, \dots, x_n$, from whose values the probability that y takes a given value must be calculated. The model assumes that the probability of the event y = 1 is

$$P\{y = 1 \mid x\} = f(z),$$

where $z = w^T x = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$, $x$ and $w$ are the vectors of values of the independent variables and of the regression parameters (weights), respectively, and $f(z)$ is the logistic function

$$f(z) = \frac{1}{1 + e^{-z}}.$$

Since y takes only the values 0 and 1, the probability of taking the value 0 is

$$P\{y = 0 \mid x\} = 1 - f(z) = 1 - f(w^T x).$$
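The two probabilities above can be computed directly from the weights and the features. The following sketch uses only the standard library; the function names are hypothetical, and the convention that `w[0]` holds the bias term $w_0$ (paired with an implicit $x_0 = 1$) is an assumption of this example. The logistic function is written in a numerically stable form so that large negative z does not overflow.

```python
import math


def logistic(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)), stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def prob_positive(w, x):
    """P{y = 1 | x} = f(w^T x), where w[0] is the bias and w[1:] pairs with x."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return logistic(z)


def prob_negative(w, x):
    """P{y = 0 | x} = 1 - f(w^T x)."""
    return 1.0 - prob_positive(w, x)
```

When z = 0 both classes are equally likely, and the two probabilities always sum to 1.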

To select the parameters $w_1, \dots, w_n$, a training set is defined, consisting of sets of values of the independent variables and the corresponding values of the dependent variable y. That is, it is a set of pairs $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$, where $x^{(i)} \in \mathbb{R}^n$ is the vector of values of the independent variables and $y^{(i)} \in \{0, 1\}$ is the corresponding value of y.

Usually, the maximum likelihood method is used, according to which the parameters $w_1, \dots, w_n$ are selected to maximize the value of the likelihood function on the training sample. In the model, the weight vector $w$ is chosen so that the probability of the observed $y^{(j)}$ given the input $x^{(j)}$ from the training sample is the largest:

$$w = \arg\max_w \log P(y^{(j)} \mid x^{(j)}).$$

For the entire training set, the weight vector is chosen as follows:

$$w = \arg\max_w \sum_j \log P(y^{(j)} \mid x^{(j)}).$$

The objective function $L$ that we are trying to maximize is the following [8]:

$$L(w) = \sum_j \log P(y^{(j)} \mid x^{(j)}).$$

Finding a vector w that maximizes this function is an optimization problem that can be solved by different methods, such as gradient-based methods [7]. Taking into account the weights of the features in the text, the class to which the text belongs with maximum probability is then determined.
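Since y takes only the values 0 and 1, each term of the objective can be written as $\log P(y \mid x) = y \log f(w^T x) + (1 - y) \log (1 - f(w^T x))$, and its gradient with respect to $w$ is $(y - f(w^T x))\,x$. The sketch below maximizes the objective with plain batch gradient ascent. This optimizer, the learning rate, the number of steps, and the function names are assumptions made for illustration and are not necessarily the method cited in the text; `w[0]` is treated as the bias term.

```python
import math


def logistic(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x):
    """P{y = 1 | x}, with w[0] as the bias and w[1:] paired with x."""
    return logistic(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def log_likelihood(w, xs, ys):
    """L(w): sum over the training set of log P(y | x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        # Clamp to keep log() finite when the model is very confident.
        p = min(max(predict_proba(w, x), 1e-12), 1.0 - 1e-12)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit(xs, ys, learning_rate=0.1, steps=500):
    """Maximize L(w) by batch gradient ascent, starting from w = 0."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = y - predict_proba(w, x)  # dL/dz for this example
            grad[0] += err
            for k, xk in enumerate(x):
                grad[k + 1] += err * xk
        w = [wk + learning_rate * gk for wk, gk in zip(w, grad)]
    return w
```

On a toy set in which the first feature counts positive words and the second counts negative words, `fit` raises the log-likelihood above its starting value and assigns a probability above 0.5 to texts dominated by the first feature.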

To implement the model, the LogisticRegression class of the sklearn library was used. First, the text data and their corresponding tones are loaded. The input data was converted to numeric form in two ways: bag of words and word2vec [10, 11]. Based on the input text data, a dictionary of features was built, and a vector representation of the texts was built with the CountVectorizer class. Then the model is trained by the analyzer, which receives the vector representation of the training data and the corresponding sentiment (positive, negative). After training the