Group Project: Financial Statement Sentiment Analysis
Foreword
With the Python course (FE) coming to an end, the teacher assigned us the final group assignment. Carrying a whole group on your own feels really bad. The deadline for the group assignment is December 18th. Because the course is designed for beginners, its content is not closely related to the final assignment, and it is extremely difficult to get the other members of my group to work on the project together. I have no choice but to do the group assignment by myself, which is a little difficult.
NLP Exploration
As a Python beginner, I should first get familiar with the concepts of NLP.
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. (Wikipedia)
Analysis of the Group Project
What we should do
Python for Finance Project: Financial Statement Sentiment Analysis
As the name suggests, this project is closely related to financial mathematics; its purpose is to link public opinion with financial trends. It lives up to the course name, Python Programming for Beginners (Financial Mathematics).
Sentiment analysis, also known as opinion mining, refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. (Wikipedia)
The original document also mentioned:
A simple solution is, every word is given a score based on its extent of positiveness, negativeness or neutral. The Sentiment Analysis is done by calculating the algorithmic score of each word, and returning with the combined score for the given set of text.
In other words, we need to compute a score for each individual word and then return the combined score for the whole sentence.
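To make the quoted idea concrete, here is a minimal sketch of summing word scores and mapping the total to a label. The dictionary `toy_dict`, the helper names, and the threshold of 0.1 are all made up for illustration; they are not part of the assignment.

```python
import re

# Hypothetical mini dictionary; real scores would come from senti_dict.csv
toy_dict = {"profit": 0.5, "loss": -0.6, "rose": 0.3}

def sentence_score(text, word_scores):
    """Sum the scores of all known words in the sentence."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    # Words missing from the dictionary contribute 0
    return sum(word_scores.get(w, 0.0) for w in words)

def label_from_score(score, threshold=0.1):
    """Map a combined score to a label; the threshold is an arbitrary choice."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

score = sentence_score("Operating profit rose sharply", toy_dict)
print(label_from_score(score))  # prints "positive"
```

Unknown words simply count as zero here, which matters because the provided dictionary is explicitly not guaranteed to be complete.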
What is already given
The difficulty of this project is:
Please noted, you are not allow to use packages that are from outside of this course
This turned out to be the biggest obstacle to completing the project.
So, I will start directly:
3. What you need to do in this project
3.1 What is already given
train.txt: contains 3877 rows with 3 columns “index”, “sentiment”, “text”, where “text” denotes the financial news sentence, “sentiment” denotes the human annotated sentiment result (3 categories “positive”, “negative”, “neutral”), and “index” denotes the ID of the news sentence.
test_x.txt: contains 970 rows with 2 columns “index” and “text” whose explanation is the same as “train.txt”.
senti_dict.csv: The sentiment word dictionary, where the value of each word indicates how positive of a word is. Negative score means the word tends to be negative. Note that this word sentiment dictionary is not guaranteed to be thorough and complete.
fyi folder: resources you may be interested in.
What we need to submit
3.2 What you need to submit
You need to submit a zipped package containing following files:
test_y.txt. You need to put your prediction results in a file named as “test_y.txt”. The file contains 970 rows with 2 columns, in the form of index,label. “index” denotes the ID of corresponding sentence in “text_x.txt”, and “label” denotes the sentiment you predict with respect to that sentence, which should be “positive”, “negative”, or “neutral”.
.ipynb file. You need to implement your solution in this jupyter notebook file, and add sufficient introduction to your thoughts and program (e.g. use highlight, bulletin, etc, just like the lecture slides), using the markdown language.
Important: your .ipynb file should be able to output (write) the “test_y.txt” to hard disk. If there is only the prediction result (test_y.txt) and your .ipynb file does not include the code which can output the prediction result, you will get zero marks. I will go into the .ipynb file and run each cell of your program. So, make sure your .ipynb is well organized so that I can easily find the part which can output your prediction result.
sentiment_words_dict.txt (optional). If you use your own sentiment dictionary in the solution, you need to submit the dictionary used in your project as well.
In all, you need to submit all the materials so that I can successfully run your code and obtain the prediction result.
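The required output format is simple enough to sketch. Below is a minimal example, assuming the predictions are held as a list of [index, label] pairs. The function name `write_predictions` and the file name `test_y_example.txt` are made up for illustration, and since the assignment does not say whether a header row is expected, none is written.

```python
import csv

def write_predictions(rows, path):
    """Write [index, label] rows as comma-separated lines."""
    allowed = {"positive", "negative", "neutral"}
    # Fail early if a label outside the three allowed categories slips in
    for index, label in rows:
        if label not in allowed:
            raise ValueError("unexpected label %r for index %r" % (label, index))
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

# Hypothetical predictions, for illustration only
write_predictions([[1, "positive"], [2, "neutral"]], "test_y_example.txt")
```

Checking the labels before writing is cheap insurance against a typo such as "postive" ending up in the submitted file.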
Get Started with the Project
CSV Format Document Processing
While processing the raw materials, I found that the contents of train.txt and test_x.txt roughly conform to the CSV format, so they are handled together:
import csv
import re  # used by getWords() further down
# Define global variables
senti_dict = {}
idf_dict = {}
test_x = []
test_y = []
train = []
custom_dict = {}
merged_dict = {}
accuCount = 0.0
class handleData:
    def __init__(self, senti_dict_path, test_x_path, test_y_path, train_path=None, custom_dict_path=None):
        # Handle data
        if not senti_dict_path:
            print('Please input senti_dict_path')
            quit()
        else:
            self.handle_senti_dict(senti_dict_path)
        if not test_x_path:
            print('Please input test_x_path')
            quit()
        else:
            self.handle_test_x(test_x_path)
        if not test_y_path:
            print('Please input test_y_path')
            quit()
        if train_path:
            self.handle_train(train_path)
        if custom_dict_path:
            self.handle_custom_dict(custom_dict_path)
        # senti_dict is applied last, so its scores win on duplicate words
        merged_dict.update(custom_dict)
        merged_dict.update(senti_dict)
    def handle_senti_dict(self, path):
        global senti_dict
        # Load senti_dict.csv into a dictionary: word -> score
        with open(path) as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                # Skip the header row
                if csv_reader.line_num == 1:
                    continue
                senti_dict[line[1]] = float(line[2])
        print('senti_dict library has %s line(s).' % len(senti_dict))

    def handle_test_x(self, path):
        global test_x
        # Load test_x.txt into a two-dimensional list: [id, sentence]
        with open(path) as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                line[0] = int(line[0])
                test_x.append(line)
        # test_x = sorted(test_x)
        print('test_x library has %s line(s).' % len(test_x))

    def handle_train(self, path):
        global train
        # Load train.txt into a two-dimensional list: [id, label, sentence]
        with open(path) as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                # Skip the header row
                if csv_reader.line_num == 1:
                    continue
                line[0] = int(line[0])
                train.append(line)
        # train = sorted(train)
        print('train library has %s line(s).' % len(train))

    def handle_custom_dict(self, path):
        global custom_dict
        # Load the custom dictionary: word -> score, plus word -> idf
        with open(path) as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                # Skip the header row
                if csv_reader.line_num == 1:
                    continue
                custom_dict[line[0]] = float(line[1])
                idf_dict[line[0]] = float(line[2])
        print('custom_dict library has %s line(s).' % len(custom_dict))
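def listToCSV(rows, path):
    # Write a two-dimensional list to disk as CSV
    with open(path, 'w', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(rows)
    print('Results saved in %s' % path)

# Run the program; main() must be defined earlier in the file than this call
main()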
To run the whole pipeline, just call it like this:
def main():
    print('Python for Finance Project: Financial Statement Sentiment Analysis')
    handleData('./senti_dict.csv', './test_x.txt', './test_y.txt', './train.txt', './sentiment_words_dict.txt')
    if train != []:
        # Training data loaded: predict on it (sentence in column 2) to measure accuracy
        listToCSV(sentimentAnalysis(train, 2), './test_train.txt')
    else:
        # No training data: predict on test_x (sentence in column 1)
        listToCSV(sentimentAnalysis(test_x, 1), './test_y.txt')
Here, './train.txt' and './sentiment_words_dict.txt' are two optional parameters.
sentimentAnalysis() Part
This was the most troublesome part for me. Because the course always emphasized the basics, it never explored algorithms in depth (only recursion was mentioned). So I could only feel my way forward on my own, and it nearly drove me crazy.
**And importing outside packages is not allowed!**
Interestingly, though, the TA provided a word dictionary that at first glance seems to contain very little:
Index | Words | Scores |
---|---|---|
0 | abil | 0.00012416859533089645 |
1 | actual | 0.0006961488543271923 |
2 | advertis | -0.005592582215163179 |
3 | agenc | 0.0006353814403691496 |
4 | aggreg | -0.0010330308303088183 |
5 | agreement | 7.96548813763754e-05 |
6 | allow | 0.002174006034169924 |
7 | although | -0.003593985395296731 |
So, what do the numbers in the third column represent?
(I still haven’t figured it out yet)
After some comparison, though, I found that the data comes from here:
https://github.com/nproellochs/SentimentDictionaries/blob/master/Dictionary8K.csv
The third column of that source file holds the IDF (inverse document frequency) of each word.
Words | Scores | Idf |
---|---|---|
abil | 0.013451985735806887 | 2.440750228642685 |
abl | -0.004787871091608642 | 1.9253479378492828 |
absolut | 0.003360788277744489 | 2.728432301094466 |
academi | 0.007129395655638781 | 2.9089206768067597 |
accent | -0.003550084101686155 | 2.894374965804381 |
accept | -0.010454735567790118 | 1.5262960445758316 |
accomplish | 0.004308459321682365 | 2.7162740966146566 |
act | -0.022561708229224143 | 0.89426188633043 |
actor | -0.011279482911515365 | 1.0184159310395977 |
actual | -0.022918668083275685 | 1.5116972451546788 |
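
For intuition about what an Idf column measures, here is a minimal sketch using one common textbook definition, idf = log(N / df), where N is the number of documents and df is the number of documents containing the word. This is an assumption for illustration: the source does not state which formula or logarithm base produced the values in the table above, and `compute_idf` and the toy documents are made up.

```python
import math

def compute_idf(documents):
    """Return {word: idf} with idf = log(N / df) (one common definition)."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() so a word is counted once per document
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

# Hypothetical toy corpus
docs = ["profit rose", "profit fell", "sales rose", "profit warning issued"]
idf = compute_idf(docs)
print(idf["profit"] < idf["warning"])  # prints True: the rarer word gets the higher IDF
```

Under this definition, a word that appears in almost every document gets an IDF close to zero, while a rare word gets a large one.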
This inspired me: Can we extract word frequencies to calculate the rarity of words in the text?
A simple idea is to find the words that appear most often. If a word is important, it should appear several times in the text. So we compute the term frequency (TF).
This suggests an approach: first extract the words of a given short sentence, then count the total number of words in it. The implementation is as follows:
# return a list
def getWords(text):
    # Use RegEx to replace everything that is not a letter with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase all words and turn them into a word list
    words = text.lower().split()
    return words

# return an integer
def countWords(text):
    return len(getWords(text))
Then, when comparing words, compute the TF-IDF of each one and pick the word with the highest TF-IDF value:
# Compute the TF-IDF weight of a word within a sentence
# Input: word and sentence as strings, idf as float
def tf_idf(word, sentence, idf):
    tf = getWords(sentence).count(word) / countWords(sentence)
    return tf * idf
def sentimentAnalysis(data, num):
    global accuCount
    # -= Extract by line =-
    # For test_x, line[0] is the id and line[1] is the sentence
    # For train, line[0] is the id, line[1] is the annotated label, line[2] is the sentence
    # main() passes num, the column that holds the sentence, for each data source
    for line in data:
        max_tf_idf = 0
        max_word = ''
        sentiType = 'neutral'
        # -= Judge word by word =-
        # Find the dictionary word with the highest TF-IDF in this sentence
        for word in getWords(line[num]):
            if word in merged_dict:
                # Words without an IDF entry get weight 0 and are never selected
                tf_idf_val = tf_idf(word, line[num], idf_dict.get(word, 0.0))
                if tf_idf_val > max_tf_idf:
                    max_tf_idf = tf_idf_val
                    max_word = word
        # The score of that single word decides the label
        if max_word != '':
            if merged_dict[max_word] > 0.01:
                sentiType = 'positive'
            elif merged_dict[max_word] < -0.01:
                sentiType = 'negative'
        test_y.append([line[0], sentiType])
        # Only train.txt carries an annotated label in line[1]
        if num == 2 and sentiType == line[1]:
            accuCount += 1
    # Calculate accuracy using train.txt
    if num == 2:
        print('Accuracy is %s.' % (accuCount / len(data)))
    return test_y
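
To see the selection rule on a single sentence, here is a small self-contained toy run. The scores and IDF values in `toy_scores` and `toy_idfs` are made up, and `pick_label` is a hypothetical condensed version of the loop above, not the code that produced my results.

```python
import re

def get_words(text):
    return re.sub("[^a-zA-Z]", " ", text).lower().split()

def pick_label(sentence, scores, idfs, threshold=0.01):
    """Label a sentence by the score of its highest TF-IDF dictionary word."""
    words = get_words(sentence)
    best_word, best_weight = "", 0.0
    for word in words:
        if word in scores and word in idfs:
            # TF (share of the sentence) times IDF (rarity)
            weight = words.count(word) / len(words) * idfs[word]
            if weight > best_weight:
                best_word, best_weight = word, weight
    if best_word == "":
        return "neutral", best_word
    if scores[best_word] > threshold:
        return "positive", best_word
    if scores[best_word] < -threshold:
        return "negative", best_word
    return "neutral", best_word

# Hypothetical scores and IDF values
toy_scores = {"profit": 0.02, "loss": -0.03, "company": 0.001}
toy_idfs = {"profit": 1.2, "loss": 2.0, "company": 0.4}

print(pick_label("The company reported a loss despite higher profit", toy_scores, toy_idfs))
# prints ('negative', 'loss')
```

Every word appears once here, so the IDF alone decides: "loss" wins with the highest IDF, and "profit" is ignored completely. That is the main limitation of letting a single word decide the label.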
At this point we can already get some results:
kirin@KirindeMacBook-Pro Senti % python3 main.py
Python for Finance Project: Financial Statement Sentiment Analysis
senti_dict library has 172 line(s).
test_x library has 970 line(s).
train library has 3876 line(s).
custom_dict library has 683 line(s).
Accuracy is 0.5092879256965944.
Results saved in ./test_train.txt
kirin@KirindeMacBook-Pro Senti %
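
A single overall accuracy does not show which of the three labels the rule gets right or wrong. As a possible next step, here is a minimal sketch of a per-label breakdown. The function `accuracy_by_label` and the toy labels are made up for illustration and are not results from my run.

```python
def accuracy_by_label(true_labels, predicted_labels):
    """Return overall accuracy and the accuracy within each true label."""
    totals, hits = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == guess:
            hits[truth] = hits.get(truth, 0) + 1
    overall = sum(hits.values()) / len(true_labels)
    per_label = {label: hits.get(label, 0) / count for label, count in totals.items()}
    return overall, per_label

# Hypothetical labels, for illustration only
truth = ["positive", "neutral", "neutral", "negative"]
guess = ["positive", "neutral", "positive", "neutral"]
print(accuracy_by_label(truth, guess))
# prints (0.5, {'positive': 1.0, 'neutral': 0.5, 'negative': 0.0})
```

With train.txt, the true labels would be column 1 of each row and the predictions would be the second element of each row returned by sentimentAnalysis().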