

Basically a deer with a human face. Despite probably being some sort of magical nature spirit, his interests are primarily in technology and politics and science fiction.
Spent many years on Reddit before joining the Threadiverse as well.
Many people with positive sentiments towards AI also want that.
If you think death is the answer the polite thing is to not force everyone to go along with you.
So they’re still feeding LLMs their own slop, got it.
No, you don’t “got it.” You’re clinging hard to an inaccurate understanding of how LLM training works because you really want it to work that way, because you think it means that LLMs are “doomed” somehow.
It’s not the case. The curation and synthetic data generation steps don’t work the way you appear to think they work. Curation of training data has nothing to do with Yahoo’s directories. I have no idea why you would think that would be a bad thing even if it were like that, aside from the notion that “Yahoo failed, therefore if LLM trainers are doing something similar to Yahoo then they will also fail.”
I mean that they’re discontinuing search engines in favour of LLM generated slop.
No they’re not. Bing is discontinuing an API for their search engine, but Copilot still uses it under the hood. Go ahead and ask Copilot to tell you about something; it’ll have footnotes linking to other websites, showing the search results it’s summarizing. Similarly with Google: you said it yourself right here that their search results have AI summaries in them.
No there’s not, that’s not how LLMs work, you have to retrain the whole model to get any new patterns into it.
The problem with your understanding of this situation is that Google’s search summary is not solely from the LLM. What happens is Google does the search, finds the relevant pages, then puts the content of those pages into their LLM’s context and asks the LLM to create a summary of that information relevant to the search that was used to find it. So the LLM doesn’t actually need to have that information trained into it; it’s provided as part of the context of the prompt.
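If it helps, here’s a rough sketch in Python of that retrieve-then-summarize pattern. The function names, the prompt wording, and the stubbed-out search and LLM calls are placeholders I made up for illustration, not any vendor’s actual API:

```python
# Rough sketch of the retrieve-then-summarize pattern. web_search,
# fetch_page, and complete are stand-ins for whatever search backend
# and LLM endpoint a real system uses; they're stubbed out here.

def web_search(query: str) -> list[str]:
    # Stub: a real implementation would call a search API.
    return ["https://example.com/relevant-page"]

def fetch_page(url: str) -> str:
    # Stub: a real implementation would download and clean the page text.
    return "Page text about the topic..."

def complete(prompt: str) -> str:
    # Stub: a real implementation would call the LLM.
    return "Summary, with footnotes citing the sources..."

def search_summary(query: str) -> str:
    urls = web_search(query)
    # The page contents go into the prompt's context window, so the
    # model doesn't need any of this information in its training data.
    context = "\n\n".join(fetch_page(u) for u in urls[:5])
    prompt = (
        f"Using only the sources below, summarize what they say about "
        f"{query!r}, and note which source each claim came from.\n\n"
        f"{context}"
    )
    return complete(prompt)
```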
You can experiment a bit with this yourself if you want. Google has a service called NotebookLM, https://notebooklm.google.com/, where you can upload a document and then ask an LLM questions about the documents’ contents. Go ahead and upload something that hasn’t been in any LLM training sets and ask it some questions. Not only will it give you answers, it’ll include links that point to the sections of the source documents where it got those answers from.
No, it’s not “LLMs all the way down.” Synthetic data is still ultimately built on raw data, it just improves the form that data takes and includes lots of curation steps to filter it for quality.
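To give a concrete (if toy) picture of what I mean by curation, here’s the general shape of a filtering pass in Python. The heuristics and thresholds are invented for illustration; real pipelines use much more elaborate classifiers and scoring models:

```python
# Toy curation pass: keep only candidate training samples that clear
# some simple quality checks. Real pipelines are far more sophisticated.

def curate(samples: list[str]) -> list[str]:
    seen: set[str] = set()
    kept = []
    for text in samples:
        if len(text) < 200:
            continue  # too short to be a useful training sample
        key = " ".join(text.lower().split())
        if key in seen:
            continue  # duplicate of something already kept
        if sum(c.isalpha() for c in text) / len(text) < 0.6:
            continue  # mostly markup, numbers, or noise
        seen.add(key)
        kept.append(text)
    return kept
```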
I don’t know what you mean by “a replacement for search engines.” LLMs are commonly being used to summarize search engine results, but there’s still a search engine providing it with sources to generate that summary from.
Thanks for asking. My comment was off the top of my head, based on stuff I’ve read over the years, so I did a little fact-checking of myself first to make sure. There’s a lot of black magic still involved in training LLMs, so the exact mix of training data varies a lot depending on who you ask. In some cases raw data is still used for the initial training of LLMs, to get them to the point where they’re capable of responding coherently to prompts; synthetic data is more often used for the fine-tuning phase, where LLMs are trained to be good at responding to prompts in particular ways. But there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run; it’s just that well-curated, high-quality raw data is already available.
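To illustrate the difference in shape between those two kinds of data, here’s a schematic example. This isn’t any particular lab’s actual format, just the general idea:

```python
# Initial (pre)training typically consumes plain raw text:
pretraining_sample = (
    "The Rosetta Stone is a stele inscribed with three versions of a "
    "decree issued in Memphis, Egypt, in 196 BC..."
)

# Fine-tuning consumes structured prompt/response pairs, which is where
# synthetic data most often comes in:
finetuning_sample = {
    "prompt": "What is the Rosetta Stone and why does it matter?",
    "response": (
        "It's a stele carrying the same decree in three scripts, which "
        "gave scholars the key to deciphering Egyptian hieroglyphs."
    ),
}
```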
This article on how to use LLMs to generate synthetic data seems to be pretty comprehensive, starting with the basics and then going into detail about how to generate it with a system called DeepEval. In another comment in this thread I pointed to NVIDIA’s Nemotron-4 models as another example.
Raw source data is often used to produce synthetic data. For example, if you’re training an AI to be a conversational chatbot, you might produce synthetic data by giving a different AI a Wikipedia article on some subject as context and then tell the AI to generate questions and answers about the content of the article. That Q&A output is then used for training.
The resulting synthetic data does not contain any of the raw source, but it’s still based on that source. That’s one way to keep the AI’s knowledge well grounded.
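Here’s a rough sketch of that process in Python. The complete() call and the prompt wording are placeholders I made up; a real pipeline would call an actual “teacher” LLM and validate its output:

```python
import json

def complete(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return json.dumps([{"question": "...", "answer": "..."}])

def make_qa_pairs(article_text: str, n: int = 5) -> list[dict]:
    prompt = (
        f"Read the article below, then write {n} question-and-answer "
        f"pairs about its contents, strictly grounded in the article. "
        f"Output a JSON list of objects with 'question' and 'answer' "
        f"keys.\n\n{article_text}"
    )
    # The output contains none of the article verbatim, but every pair
    # is derived from it; these pairs become the training samples.
    return json.loads(complete(prompt))
```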
It’s a bit old at this point, but last year NVIDIA released a set of AI models specifically designed for performing this process called Nemotron-4. That page might help illustrate the process in a bit more detail.
Betteridge’s law of headlines.
Modern LLMs are trained using synthetic data, which is explicitly AI-generated. It’s done so that the data’s format and content can be tailored to optimize its value in the training process. Over the past few years it’s become clear that simply dumping raw data from the Internet into LLM training isn’t a very good approach. It sufficed to bootstrap AI development, but we’re past that point now.
Even if there were a problem with training new AIs, that would just mean they won’t get better until the problem is overcome. It doesn’t mean they’ll perform “increasingly poorly”, because the old models still exist; you can just keep using those.
But lots of people really don’t like AI and want to hear headlines saying it’s going to get worse or even go away, so this bait will get plenty of clicks and upvotes. I’ll give credit to the body of the article, though: if you read more than halfway down, you’ll see it raises these sorts of issues itself.
Interesting. As poorly as I think of X as an organization, I do hope they follow through with their open system prompt commitment. That’s something that other major AI companies should be doing too.
Could also be malicious compliance on the part of whatever engineer set this up, prompting Grok in such a way that it’s making it obvious what’s going on under the hood.
we hear about crime everywhere.
Worth noting that although concern about crime in the US has risen over time, the actual rate of violent crime has fallen dramatically over the past few decades: the overall violent crime rate fell 49% between 1993 and 2022.
I’m not telling you whether your level of concern is appropriate or not, that’s up to you and may vary with circumstances that I don’t know. But generally speaking I think it’s safe to say that levels of concern in the US don’t line up very well with the things that the concern is about. Might be worth investigating for yourself and perhaps calibrating your expectations a bit.
When your standard is “perfection” then nothing at all will ever meet it.
Your standard of acceptability is “perfect”, then?
Again, a properly built landfill doesn’t have that problem. I specified that right from the start. They’re designed to manage leachate.
Seems to be from some short stories and poems that have been published over the years.
Different rainbow bridge, with a different afterlife on the other end. I hope the signage is clear. I can only imagine a couple of very confused Vikings wandering around, perhaps disappointed that there are no big battles to participate in, but on the plus side there are tons of dogs that are happy to see them. And conversely, a couple of good bois running around Odin’s hall, having fun with all the commotion and feasting and whatnot going on there.
If the plastic is not degrading then it’s not releasing anything, be it methane or CO2.
Isn’t one of the big talking points against plastic the “it’ll be around for thousands of years” thing?
“Never intended” doesn’t mean it doesn’t work as one.
The point I’m making here is that if we already have a chunk of plastic, why not bury it? Your own comment that I originally responded to was about how the composting process for these bioplastics is difficult, so people rarely do it. Landfills are comparatively easy and common; we already have that process well established. So if you’ve got a chunk of carbon-rich plastic right there in your hand and you’re trying to decide what to do with it, which makes more sense: turning it into CO2 to vent into the atmosphere, or sequestering it effectively forever? There are carbon sequestration projects that go to much greater lengths to bury carbon underground than this.
CO2 is CO2; it doesn’t matter where the carbon came from. If you’re sequestering plastics that were made from plants, then you’re taking carbon out of the atmosphere, for a net benefit.
It absolutely baffles me how states are able to botch executions like they’re doing. I’ve had many dogs over my lifetime and sadly that means I’ve seen many of them off to the rainbow bridge at the ends of theirs, and there’s never been a botched euthanasia. I guess vets are just more professional and compassionate than these executioners.
I oppose the death penalty universally. But I’ve long argued that if you absolutely must execute someone, and you must avoid the messiness of exploding their brain (instant and reliable though that would be), then nitrogen gas asphyxiation is probably the best way to go: completely painless and incredibly hard to botch. Just flood the room with nitrogen gas, how hard is that? It’s a common industrial accident. And yet there was a case recently where a state tried nitrogen gas asphyxiation and the monsters somehow managed to botch even that.
So, AI will always be ~95 years behind the times?
Except the AIs produced by Disney et al, of course. And those produced by Chinese companies with the CCP stamp of approval. They’ll be up to date.