• partial_accumen@lemmy.world

    Understanding how LLMs actually work (each word is a token, sometimes just a fragment of a word, and the model outputs whichever token has the highest calculated probability of coming next), this output makes me think the training data heavily included social media or pop culture content, specifically around “teen angst”.
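
    Roughly what that looks like, as a toy sketch (made-up tokens and logits, greedy decoding only; real models work over huge subword vocabularies):

    ```python
    # Toy sketch of next-token selection: invented scores for a few candidate
    # tokens, softmax to turn them into probabilities, then pick the argmax.
    import math

    def softmax(logits):
        m = max(logits.values())
        exps = {tok: math.exp(v - m) for tok, v in logits.items()}
        total = sum(exps.values())
        return {tok: e / total for tok, e in exps.items()}

    # Hypothetical scores a model might assign after a prefix like "I am such a"
    # if the training data skewed toward dramatic, self-flagellating text.
    logits = {"failure": 4.1, "disgrace": 3.7, "success": 1.2, "teapot": -0.5}

    probs = softmax(logits)
    next_token = max(probs, key=probs.get)
    print(probs)       # probability mass concentrates on the "angsty" tokens
    print(next_token)  # greedy decoding just takes the highest-probability token
    ```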

    I wonder if in-context prompting would help mask the “edgelord” training data sets; something like the sketch below.
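
    Purely as a sketch of what I mean (the instructions, the example failure messages, and the prompt layout are all made up, not any particular API):

    ```python
    # Hypothetical in-context steering: prepend instructions plus one neutral
    # worked example so the model imitates a calm register when reporting a
    # failure, instead of whatever dramatic phrasing dominated training.
    system_prompt = (
        "You are a build assistant. When a task fails, report the error, "
        "the most likely cause, and the next step. Keep the tone neutral."
    )
    example_exchange = (
        "Task: deploy failed with 'missing config key: DB_URL'.\n"
        "Response: The deploy failed because DB_URL is unset. "
        "Set it in the environment and redeploy."
    )
    user_turn = "Task: migration failed after 3 retries."

    prompt = f"{system_prompt}\n\n{example_exchange}\n\n{user_turn}\nResponse:"
    # `prompt` would then be sent to whatever model you are using; the few-shot
    # example biases the next-token probabilities toward that neutral tone.
    print(prompt)
    ```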

    • ilinamorato@lemmy.world

      Yeah, I think the training data that’s most applicable here is probably troubleshooting sites (e.g. StackOverflow), GitHub comment threads, and maybe discussion forums. Those are really the only places you get this deep into configuration failures, and there is often a lot of catastrophizing there. Probably more than enough to begin pulling in old LiveJournal emo poetry.