
ETL to QE, Update 18, Long Time No See

2023-12-20

Graph Database neo4j Representation of Discord Data Complete

Check out Graph Database to learn what they are.

Why did I dump the discord data into Neo4J?

So far I have only generated cool infographics from the queries in Discord Data Cypher Queries. To derive actual value from the Neo4j graph schema, I will need to read up on the graph algorithms Neo4j supports.

POC Dashboard Complete

If you go back and check out ETL to QE, Update 11, Posted Results on Discord, you can see I am generating a graph from the Discord data. My list of Questions for Discord Data has been getting too long and nuanced: queries scoped to a specific guild, author, or channel now require numerous inputs to generate a single graph, which is more easily done in a web user interface.

queries.py

I developed queries.py to be reusable. It can plug into the REST API used in the dashboard, plug into graphs.py to generate graphs, or be used independently in a pip package when I get there.
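Here is a minimal sketch of what a reusable query function in that spirit might look like. The function name, table, and columns are hypothetical, and the standard library's sqlite3 stands in for Postgres so the snippet runs on its own.

```python
import sqlite3


def messages_per_author(conn, guild_id, limit=10):
    """Return (author_id, message_count) rows for one guild, busiest first."""
    sql = (
        "SELECT author_id, COUNT(*) AS message_count "
        "FROM messages WHERE guild_id = ? "
        "GROUP BY author_id "
        "ORDER BY message_count DESC, author_id ASC "
        "LIMIT ?"
    )
    return conn.execute(sql, (guild_id, limit)).fetchall()


def demo_connection():
    """Build an in-memory database with a few made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, guild_id TEXT, author_id TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO messages (guild_id, author_id, content) VALUES (?, ?, ?)",
        [
            ("g1", "alice", "hi"),
            ("g1", "alice", "hello"),
            ("g1", "bob", "hey"),
            ("g2", "carol", "yo"),
        ],
    )
    return conn
```

Because the function only takes a connection and plain parameters, the same call could sit behind a REST endpoint, feed a graph function, or be imported from a package.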

graphs.py

I developed graphs.py to be modular. It uses the Plotly graphing library, which not only works in Python but also has a JavaScript library, allowing the same graph functions to be used in Jupyter notebooks as well as the web interface.
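One way this sharing can work, sketched under assumptions: a Plotly figure is ultimately a JSON-serializable structure with "data" and "layout" keys, so a graph function can return a plain dict. The function name and sample rows below are made up for illustration and are not taken from graphs.py.

```python
import json


def bar_figure(rows, title):
    """Turn (label, value) rows into a Plotly figure spec as a plain dict."""
    labels = [label for label, _ in rows]
    values = [value for _, value in rows]
    return {
        "data": [{"type": "bar", "x": labels, "y": values}],
        "layout": {"title": {"text": title}},
    }


# Made-up sample rows, purely for illustration.
fig = bar_figure([("alice", 2), ("bob", 1)], "Messages per author")
payload = json.dumps(fig)  # what a REST endpoint could send to the browser
```

In a notebook the dict should be usable with plotly.graph_objects.Figure(fig), and in the browser the parsed JSON should be usable with Plotly.newPlot, so one function can serve both.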

POC Dashboard Working

The POC had a simple set of features:

  • Populate and update Guild, Author, and Channel autocomplete fields that can be selected by the user
  • Populate a list of graphs that can be selected by the user
  • Be able to render the selected graph.

I only needed a simple REST API, but I chose Django because I will eventually need a user model to allow annotation on top of the Discord data. So far I am not using any Django-specific features.

The design allows any graph added to graphs.py to work immediately in the web application.
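A registry is one common way to get this "add it and it shows up" behavior. The sketch below is an assumption about how such a design could look, not the actual contents of graphs.py; every name in it is hypothetical.

```python
GRAPHS = {}  # maps a graph name to the function that builds it


def register_graph(name):
    """Decorator that adds a graph-building function to the GRAPHS registry."""
    def decorator(func):
        GRAPHS[name] = func
        return func
    return decorator


@register_graph("messages_per_author")
def messages_per_author_graph(rows):
    """Build a bar chart spec from (author, count) rows."""
    return {
        "data": [
            {"type": "bar", "x": [r[0] for r in rows], "y": [r[1] for r in rows]}
        ]
    }


def list_graphs():
    """What an endpoint could return to populate the graph picker."""
    return sorted(GRAPHS)


def render_graph(name, rows):
    """Look up the selected graph by name and build it."""
    if name not in GRAPHS:
        raise KeyError(f"unknown graph: {name}")
    return GRAPHS[name](rows)
```

With this shape the web layer never names individual graphs. It only calls list_graphs() and render_graph(), so a newly decorated function needs no changes elsewhere.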

Initial Feedback on POC Dashboard

I showed the dashboard to some friends and got some feedback. First, people wanted to know which message with a link got the most reactions. Second, they found the list of queries too overwhelming. I addressed both.

URL Metadata Extraction Complete

I updated both the Postgres and Neo4j schemas to support separate URL data. I then used the urlextract pip package to extract URLs from the content of Discord messages when transforming the data from JSON into the databases. With this URL metadata extracted, I was able to come up with Discord URL Specific Queries.
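To illustrate the extraction step without extra dependencies, here is a rough sketch that swaps urlextract for a deliberately simple regular expression plus urllib.parse from the standard library. The metadata fields chosen (scheme, domain, path) are an assumption about what "URL metadata" might include, not the actual schema.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern: http(s) URLs up to the next whitespace or bracket.
# This is a stand-in for urlextract, which handles far more cases.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_urls(content):
    """Return one metadata dict per URL found in a message's content."""
    results = []
    for match in URL_PATTERN.finditer(content):
        url = match.group(0).rstrip(".,!?;:")  # drop trailing punctuation
        parts = urlsplit(url)
        results.append(
            {
                "url": url,
                "scheme": parts.scheme,
                "domain": parts.netloc,
                "path": parts.path,
            }
        )
    return results
```

Storing the domain separately is what would make questions like "which sites get linked the most" a simple group-by rather than a string search over message content.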

Tagging Graph in the Web App

I updated each graph in graphs.py to have a series of tags. The web app now groups queries by tag, so the user can pick a tag and then select a query from a much smaller list than in the initial POC.
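The grouping itself can be sketched as inverting a graph-to-tags mapping into a tag-to-graphs mapping. The graph names and tags below are hypothetical examples, not the real catalogue.

```python
from collections import defaultdict

# Hypothetical catalogue: each graph lists the tags it should appear under.
GRAPH_TAGS = {
    "messages_per_author": ["author", "guild"],
    "most_reacted_link": ["url", "reaction"],
    "messages_per_channel": ["channel", "guild"],
}


def group_by_tag(graph_tags):
    """Invert {graph: [tags]} into {tag: [graphs]} with sorted graph names."""
    grouped = defaultdict(list)
    for graph, tags in graph_tags.items():
        for tag in tags:
            grouped[tag].append(graph)
    return {tag: sorted(graphs) for tag, graphs in grouped.items()}
```

A graph with several tags appears under each of them, so the user can reach it from whichever tag they think of first.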

The Importance of Interviewing Prospective Users

I have been messaging some people and even had some calls, but rather than talk about all that I am going to use this space to vent about my feelings on the topic because, to my knowledge, no one reads these anyway.

When doing a project it is usually a good idea to understand why one is doing it. Usually there is an assumption in there that involves other people either using what you produce or, god forbid, paying for it. So if you want your project to be real, you need to exit your little echo chamber.

I have personally found it very difficult to go find people and get feedback. I can usually be an extrovert, but the act of DMing people to demand a conversation hits some sort of emotional roadblock for me.

I think part of the problem is that I don't know exactly what I want from these people. I don't yet have the correct heuristics and emotional skills to go out and complete the poorly scoped task of customer research. Actually, I don't even know what the correct name for this task is. I know it is research, but I don't know what kind of research.

Part of the problem is that I have never managed customer research type relationships before. The situation should be simple:

  • Me: Hey I want to get your opinion on something
  • Interviewee: Here is my opinion
  • Me: Thanks for your opinion, do you mind if I come to you with follow up questions if I have any?
  • Interviewee then responds, yes or no

I guess it's really that simple when you get down to it.

Another part of the problem is that I don't know what I want their opinion on. I am implicitly looking for someone to tell me what to build, but the reality is you have to reverse engineer their solution; they do not know what they want or need.

I guess the lesson here is I need to get better at articulating what I want.

Consolidated all project Docs

Scatterbrained me has worked on a series of projects over the last couple of years that all share a theme. If you go to the Dentropy Daemon page, I now have links to them all.

Adding Questions for Discord Data

I created a list of notable queries that will get updated based on what people I interview actually find interesting and useful.

The list of queries got so long that I created separate files, one per data type. This also allows a query involving multiple data types to show up under each of them.

Committing the Planning Fallacy

A good two weeks ago I wrote my Backlog - Discord Binding document. I finally had a set of relatively clear features that could be implemented one by one. The problem is that, seeing the road in front of me, I became too overwhelmed to take the first step. It doesn't help that the correct step is not to build something but to cultivate relationships with people and discover what they actually care about and what problems they actually have.

Embeddings POC Complete

First, if you don't know what embeddings are, check out What are embeddings?

I tested out pgvector, a Postgres extension that adds a vector column type whose length you choose. All that was required to calculate the embeddings for each message was an updated Postgres container, some additional tables in the schema, and a script that loops through each message, calculates the embedding, and stores it in the database with a link back to the original message. You can check out the schema here, the updated Postgres container here, and the script that calculates and stores the embeddings here.
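The loop described above can be sketched roughly as follows. This is not the linked script: the table name, column names, and the tiny hash-based toy_embedding function are all hypothetical stand-ins (a real setup would call an embedding model), and the SQL strings are only defined here, not executed against a database.

```python
import hashlib

EMBEDDING_DIM = 8  # tiny for illustration; real models use hundreds of dimensions

# pgvector DDL: enable the extension, then store one vector per message.
# message_id would reference the existing messages table in a real schema.
SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS message_embeddings (
    message_id TEXT PRIMARY KEY,
    embedding vector({EMBEDDING_DIM})
);
"""

# %s placeholders follow the psycopg parameter style (an assumption about the driver).
INSERT_SQL = (
    "INSERT INTO message_embeddings (message_id, embedding) "
    "VALUES (%s, %s::vector)"
)


def toy_embedding(text):
    """Stand-in for a real embedding model: deterministic floats from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:EMBEDDING_DIM]]


def to_pgvector_literal(vector):
    """Format a list of floats as pgvector's text form, e.g. [0.1,0.2]."""
    return "[" + ",".join(str(x) for x in vector) + "]"


def embedding_rows(messages):
    """Yield (message_id, vector_literal) parameter tuples for INSERT_SQL."""
    for message_id, content in messages:
        yield message_id, to_pgvector_literal(toy_embedding(content))
```

Each yielded tuple could be passed to a cursor's execute call together with INSERT_SQL. Keeping the message id next to the vector is what links an embedding back to its original message.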

The Project Management Bullshit Test

Just go read the Project Management Bullshit Test.

Began work on Phase 2, Graph Based Annotation on Top of Discord Data

Remember The 4 Step plan for Question Engine, which initially launched this series of project update blog posts? The first phase was a simple dashboard that could be used to generate data visualizations for the Discord data I had indexed and transformed into the databases.