Stock Sentiment Analysis on Reddit: A NER Approach

Using the Reddit Investing NER Dataset, I’ve applied Named Entity Recognition (NER) and sentiment analysis to find the 10 most positively and the 10 most negatively discussed organizations on Reddit’s r/investing subreddit.

Reddit Investing NER Dataset

Dataset Description

The Reddit Investing NER Dataset contains 880 posts collected through the Reddit API from the new feed of the r/investing subreddit. It is intended for Named Entity Recognition experiments using Natural Language Processing (NLP) techniques.

Dataset Details

The dataset includes the following columns:

  • Name: Unique identifier for each post item.
  • Created_utc: Timestamp indicating when the post was created, in Coordinated Universal Time (UTC).
  • Subreddit: The subreddit from which the post originated (r/investing).
  • Title: Title of the post, providing a brief summary or topic description.
  • Selftext: Content of the post, typically containing additional details or information beyond the title.
  • Upvote_ratio: Ratio of upvotes to total votes received by the post.
  • Ups: Number of upvotes received by the post.
  • Score: Net score of the post as reported by Reddit (upvotes minus downvotes).

Purpose

The dataset is meant to support the development and evaluation of Named Entity Recognition models for investing-related discussions on Reddit. Researchers and data scientists can use it to train, test, and fine-tune NER models that extract relevant entities and information from unstructured text.

Potential Use Cases

  1. Entity Identification: Extract named entities related to investment topics mentioned in post titles and selftext.
  2. Sentiment Analysis: Explore sentiment trends within the investing community based on score, upvote ratio, and upvotes.
  3. Temporal Analysis: Investigate patterns and trends over time using the created_utc timestamp.
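
As a rough illustration of the temporal use case, the sketch below counts posts per UTC day. It assumes created_utc holds seconds since the Unix epoch (as the Reddit API returns it) and that each post is a plain dictionary; the function name posts_per_day and the sample timestamps are made up for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_day(posts):
    """Count posts per UTC calendar day."""
    counts = Counter()
    for post in posts:
        # Assumption: created_utc is an epoch timestamp in seconds.
        created = datetime.fromtimestamp(float(post["created_utc"]), tz=timezone.utc)
        counts[created.date().isoformat()] += 1
    # Sort by date so the result reads chronologically.
    return dict(sorted(counts.items()))


# Made-up timestamps: two on one day, one roughly a day later.
sample = [
    {"created_utc": 1700000000.0},
    {"created_utc": 1700003600.0},
    {"created_utc": 1700090000.0},
]
daily = posts_per_day(sample)
print(daily)
```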

Notebook Steps

Import spaCy:

Load spaCy's small English pipeline, en_core_web_sm.

Import Data:

Fetch the Reddit Investing NER Dataset.
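
A minimal sketch of the loading step, using only the standard library rather than whatever the notebook actually uses (most likely pandas). The lowercase column names and the in-memory sample row are assumptions for illustration; the real file name is not given in the source.

```python
import csv
import io

# Assumed column names, lowercased as in the Reddit API.
EXPECTED_COLUMNS = [
    "name", "created_utc", "subreddit", "title",
    "selftext", "upvote_ratio", "ups", "score",
]


def load_posts(handle):
    """Read posts from an open CSV file handle into a list of dicts."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# In-memory stand-in for the dataset file (invented row).
sample_csv = io.StringIO(
    "name,created_utc,subreddit,title,selftext,upvote_ratio,ups,score\n"
    "t3_abc,1700000000.0,investing,Is VOO enough?,Thinking about index funds.,0.9,12,12\n"
)
posts = load_posts(sample_csv)
print(len(posts), posts[0]["title"])
```

With the real file, the same function can be given a handle opened with open(path, newline="", encoding="utf-8").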

Get Specific Entities Handler:

Implement a handler that extracts entities of selected types and skips blacklisted words.
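
One way such a handler could look is sketched below. The names load_pipeline and get_entities are hypothetical, not taken from the notebook. The handler only relies on the pipeline being callable and returning an object with .ents, where each entity has .text and .label_, which is how spaCy documents behave; "ORG" is the label spaCy's English models use for organizations.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline (requires spaCy and the model to be installed)."""
    # Imported lazily so the handler below does not depend on spaCy at import time.
    import spacy
    return spacy.load(model_name)


def get_entities(text, nlp, labels=("ORG",), blacklist=()):
    """Return entity strings with one of the given labels, skipping blacklisted ones."""
    # Case-insensitive blacklist comparison (a design choice for this sketch).
    blocked = {word.lower() for word in blacklist}
    doc = nlp(text)
    return [
        ent.text
        for ent in doc.ents
        if ent.label_ in labels and ent.text.lower() not in blocked
    ]


# Example call, assuming the model is available:
#   nlp = load_pipeline()
#   get_entities("Tesla and Intel both fell today.", nlp, blacklist=["ETF"])
```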

Add Organizations Feature:

Enhance the dataset by adding a new feature for organizations.
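
A small sketch of this step, assuming posts are dictionaries with title and selftext keys. The new key name "organizations" and the helper add_organizations are hypothetical, and the extractor is passed in as a function so that any NER handler can be plugged in. The stand-in extractor here simply keeps all-uppercase words and is not a real NER model.

```python
def add_organizations(posts, extract):
    """Return copies of the posts with an added 'organizations' list."""
    enriched = []
    for post in posts:
        # Run extraction over title and body together.
        text = f"{post.get('title', '')} {post.get('selftext', '')}".strip()
        enriched.append({**post, "organizations": extract(text)})
    return enriched


def uppercase_words(text):
    """Toy stand-in for an NER handler: keep all-uppercase tokens."""
    return [word for word in text.split() if word.isupper()]


posts = [{"title": "VOO vs VGT", "selftext": "Which ETF is better?"}]
enriched = add_organizations(posts, uppercase_words)
print(enriched[0]["organizations"])
```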

Entity Frequency:

Count how often each entity appears, using a blacklist to filter out false positives (terms tagged as organizations that are not).
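
The counting itself can be sketched with collections.Counter. The function name entity_frequency, the sample entity lists, and the blacklist contents are invented for illustration.

```python
from collections import Counter


def entity_frequency(entity_lists, blacklist=()):
    """Count entity mentions across posts, ignoring blacklisted terms."""
    blocked = {word.lower() for word in blacklist}
    counts = Counter()
    for entities in entity_lists:
        counts.update(e for e in entities if e.lower() not in blocked)
    return counts


# One list of extracted organizations per post (made-up values).
orgs_per_post = [["VOO", "Fidelity"], ["VOO", "ETF"], ["Tesla", "VOO"]]
freq = entity_frequency(orgs_per_post, blacklist=["ETF"])
top = freq.most_common(2)
print(top)
```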

NER with Sentiment:

Perform Named Entity Recognition with sentiment analysis, identifying organizations and their sentiment scores.
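
The source does not say which sentiment model the notebook uses. The result columns (Positive Score, Negative Score, Frequency, Score) are consistent with summing a classifier's confidence per label for each entity and taking Score = (Positive - Negative) / Frequency, so the sketch below aggregates that way. The input format of (entity, label, confidence) tuples and the label strings are assumptions.

```python
from collections import defaultdict


def aggregate_sentiment(mentions):
    """Aggregate per-mention sentiment into per-entity totals.

    mentions: iterable of (entity, label, confidence) tuples, where label is
    "POSITIVE" or "NEGATIVE" (assumed classifier output format).
    """
    totals = defaultdict(lambda: {"positive": 0.0, "negative": 0.0, "frequency": 0})
    for entity, label, confidence in mentions:
        row = totals[entity]
        key = "positive" if label == "POSITIVE" else "negative"
        row[key] += confidence
        row["frequency"] += 1
    for row in totals.values():
        # Average signed sentiment per mention.
        row["score"] = (row["positive"] - row["negative"]) / row["frequency"]
    return dict(totals)


# Made-up mentions for illustration.
mentions = [
    ("VOO", "POSITIVE", 0.9),
    ("VOO", "NEGATIVE", 0.7),
    ("Chase", "NEGATIVE", 0.99),
]
result = aggregate_sentiment(mentions)
print(result["VOO"], result["Chase"])
```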

Top 10 Positives & Negatives:

Present the 10 organizations with the highest average sentiment scores and the 10 with the lowest.
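
A possible ranking step is sketched below. The min_frequency filter is an assumption: every entity in the result tables has a frequency of at least 4, which suggests some minimum-mention threshold, but the source does not state one. The sample rows reuse a few values from the result tables.

```python
def rank_entities(rows, n=10, min_frequency=1):
    """Return (highest-scoring, lowest-scoring) lists of up to n rows each."""
    eligible = [r for r in rows if r["frequency"] >= min_frequency]
    top = sorted(eligible, key=lambda r: r["score"], reverse=True)[:n]
    bottom = sorted(eligible, key=lambda r: r["score"])[:n]
    return top, bottom


rows = [
    {"entity": "Robinhood", "frequency": 4, "score": -0.011517},
    {"entity": "EV", "frequency": 4, "score": -0.038710},
    {"entity": "VOO", "frequency": 38, "score": -0.515064},
    {"entity": "Chase", "frequency": 5, "score": -0.999753},
    {"entity": "FDIC", "frequency": 4, "score": -0.996008},
]
top, bottom = rank_entities(rows, n=2, min_frequency=4)
print([r["entity"] for r in top], [r["entity"] for r in bottom])
```

Note that in the results, even the highest-ranked entities have negative average scores, so "top positive" effectively means "least negative".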

Top 10 Positive Entities

Entity      Positive Score   Negative Score   Frequency   Score
Robinhood   1.773186         1.819252         4           -0.011517
EV          1.684204         1.839043         4           -0.038710
VOO         8.384062         27.956484        38          -0.515064
AVUV        0.934128         2.998990         4           -0.516215
TQQQ        0.568200         2.911129         4           -0.585732
LLC         0.992150         3.992363         5           -0.600043
NASDAQ      0.965703         3.966129         5           -0.600085
TSP         0.795954         3.865972         5           -0.614004
NVDA        0.500673         2.968549         4           -0.616969
VGT         0.973436         4.819368         6           -0.640989

Top 10 Negative Entities

Entity      Positive Score   Negative Score   Frequency   Score
Chase       0.0              4.998766         5           -0.999753
FDIC        0.0              3.984031         4           -0.996008
Amazon      0.0              7.936815         8           -0.992102
ESPP        0.0              10.810159        11          -0.982742
FSKAX       0.0              4.857040         5           -0.971408
Tesla       0.0              5.811330         6           -0.968555
MSFT        0.0              10.564188        11          -0.960381
Intel       0.0              3.838785         4           -0.959696
Nvidia      0.0              3.784756         4           -0.946189
BOJ         0.0              3.437275         4           -0.859319