I have a lot of Twitter data.
I decided a few months ago to get my list of those I follow and those following me, and then find that same information about each of them.
This took quite some time, as Twitter stops you from getting too much at a time.
I found a few interesting things. One was, of all the people I follow and all following me, I had the highest number of bi-directional connections (I follow them, they follow me; let's say "friends") of any, at something like 150.
But I'm thinking about writing the next step, and I gotta say, I don't wanna.
In part, that data's dead. It's Spring me, and my Spring followers. It's Fall now, and I don't know that the data is valid anymore.
So I'm thinking about the next step.
If "Big Data" is more than a buzzword, it is working with data that never ends. I mean, my lab has nodes in a huge research cluster, that contain like a quarter-TB of RAM. That's still a mind-boggling thing when I think about it. And it's not like we let that sit empty; we use it. Well, I don't. My part of the workflow is when the users bring their data and, increasingly, when it's done and being sent on. It clearly is large data sets being handled, but it isn't "Big Data", because we finish it, package it, and give it to our users.
"Big Data", it seems to me, means you find out more now than you did before, that you're always adding to it, because it never ends. "The Torture Never Stops", as Zappa said.
In the case of Twitter, I can get every tweet in my feed and every tweet I favorite. I could write that thing and have it run regularly, starting now. Questions: What do I do? How do I do it?
Here's the Why:
There's tools out there that I don't understand, and they are powerful tools. I want to understand them. I want to write things with these things that interest me, because it's an uncommon day that I'm interested in what I code most of the time.
Then there's Twitter. You can say many many things about Twitter, and it's largely true. But, there are interesting people talking about the interesting things they do, and and other people retweeting it. With luck, I can draw from that, create connections, learn things and do interesting things.
So, while there is the "make a thing because a thing" part, I'm hoping to use this.
So, the What:
A first step is to take my Favorites data and use it as "ham" in a Bayesian way to tell what kind of thing I like. I might need to come up with a mechanism beyond unfollowing to indicate what I don't want to see. Maybe do a search on "GamerGate", "Football" and "Cats"?
Anyway, once that's going, let it loose on my feed and have it bubble up tweets that it says I might favorite, that "fit the profile", so to speak. I'm thinking my next step is installing a Bayesian library like Algorithm::Bayesian or Algorithm::NaiveBayes, running my year+ worth of Twitter favorites as ham, and have it tell me once a day the things I should've favorited but didn't. Once I have that, I'll reorient and go from there.
No comments:
Post a Comment