Posts in "Link"

Cranberry Bogs Use Spiders Instead Of Pesticides

You see, cranberry farms have been moving towards more organic farming methods which preclude the use of pesticides and so to keep the insect population down, the farmers encourage wolf spiders to live in the bogs.

I’m not particularly… comfortable around spiders, so to speak, but wolf spiders are one of my favorite species. They’re like little shy tarantulas. Still wouldn’t be caught dead on a cranberry farm but I’ll take spiders over pesticides.

Someone Made a Dataset of One Million Bluesky Posts for ‘Machine Learning Research’

The data isn’t anonymous. In the dataset, each post is listed alongside the users’ decentralized identifier, or DID…It’s also noteworthy that it’s a “snapshot” of time on Bluesky, meaning it could, and probably does, include since-deleted posts.

I mean, in a way, it’s similar to the Internet Archive. But they never really archived social media accounts, did they?

This dataset could be used for “training and testing language models on social media content, analyzing social media posting patterns, studying conversation structures and reply networks, research on social media content moderation, [and] natural language processing tasks using social media data,” the project page says.

This is literally made for training AI, right? I guess it’s just a matter of whether or not the resulting chatbot is publicly released, but then the actual data is already available anyway.

“A number of artists and creators have made their home on Bluesky, and we hear their concerns with other platforms training on their data. We do not use any of your content to train generative AI, and have no intention of doing so,” - Bluesky, official account

I don’t know. I can believe the devs aren’t personally training AI on it, but it’s definitely a thing you can feed to an LLM, regardless of who physically does it.

Personally? I don’t have a problem with AI scraping the dumb shit I post online. But I can understand why an artist or seasoned writer wouldn’t want generative AI learning off of their trademark style and then bottling it up for a monthly subscription fee.

I don’t think this is a Bluesky issue; it just got caught in the middle because it’s in the spotlight right now. I’m not a programmer or an AI whisperer so I could totally be wrong, but couldn’t anyone create a dataset of a public social network’s content using its API and then train whatever they like on it?

It’s shitty, but the cat’s already out of the bag. If you’re posting anything online, it can be, and probably already is being, used to train AI.


Update: Looks like they took down the data already:

I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.

— Daniel van Strien (@danielvanstrien.bsky.social) November 26, 2024 at 9:19 PM

Elon Musk Muses About Buying MSNBC: “How Much Does It Cost?”

The billionaire and buddy to the president-elect jokes about buying the liberal network. At least we think he’s joking.

His acquisition of Twitter started out as a joke too. I’d be surprised if he doesn’t at least make an offer. Fascist regimes need their state-run media, and he’s just preaching to the converted on Twitter now.

You deserve a better browser than Google Chrome

If you’ve been using Chrome because it came as the default browser on your phone, you might want to try something new. If you’ve been using Chrome for 15 years because it was so innovative when it was introduced, that’s no longer the case, and you should definitely try something new.

I’d also add that if you’re using Chrome on iOS, you’re just using a shittier version of Safari that spies on you.