deskr a minute ago

What an unfortunate choice of name. LoRa is already a big project.

pwillia7 8 hours ago

This tracks with my experience making and using Stable Diffusion LoRAs and fine-tunes. Still, given how fast LoRAs are to train and use, they have worked for me in most use cases, and it hasn't been worth fine-tuning the entire model.

  • K0balt 7 hours ago

    Yeah, it reflects the “feel” I get from LoRA as well, especially if I overdo it. The new data becomes the preferred output even for unrelated inputs. I always felt like it was bludgeoning the model to some extent vs fine-tuning.

    Also, LoRA-tuning an extensively tuned model occasionally provokes full-on delusional “insanity” or gibberish seizures.

    I have had really good luck, though, using a highly tuned model as the training basis for a LoRA and then applying that LoRA to the base version of that model. I’m not sure why that seems to work better than training the same LoRA directly on the base model.
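
    In rough sketch form (illustrative tensor names and the usual alpha/r scaling assumed, not a definitive recipe), the workflow is: train the adapter with the tuned checkpoint frozen as its base, then attach the learned delta to the original base weights instead.

        import torch

        def apply_lora_to_base(W_base: torch.Tensor, A: torch.Tensor,
                               B: torch.Tensor, alpha: float, r: int) -> torch.Tensor:
            # B and A were trained against the heavily tuned checkpoint (kept frozen);
            # here the resulting low-rank delta is applied to the base checkpoint.
            return W_base + (alpha / r) * (B @ A)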

    • cheald 3 hours ago

      I've done a lot of tinkering with the internals of LoRA training, specifically investigating why full fine-tuning and LoRA training produce such different results. I'm no academic, but I have found that there are definitely some issues with the SOTA, at least WRT Stable Diffusion.

      I've had significant success with alternate init mechanisms (the standard technique of init'ing B to zeros really does hurt gradient flow), training alpha as a separate parameter (especially if you bootstrap the process with alphas learned from a previous run), and altering the per-layer learning rates (because (lr * B) @ (lr * A) produces an update of a fundamentally different magnitude than the fine-tune update lr * (B @ A)).
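
      A minimal sketch (PyTorch; module names, parameter names, and init scales are illustrative, not taken from any particular training tool) of a LoRA linear layer with a non-zero init for B and a trainable alpha:

          import torch
          import torch.nn as nn

          class LoRALinear(nn.Module):
              """Frozen base weight plus a trainable low-rank update (alpha / r) * B @ A."""
              def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
                  super().__init__()
                  self.base = base
                  for p in self.base.parameters():
                      p.requires_grad = False          # pretrained weights stay frozen
                  self.A = nn.Parameter(torch.randn(r, base.in_features) / r ** 0.5)
                  # Non-standard: small random init for B instead of zeros, trading the
                  # exact "no-op at step 0" property for better early gradient flow.
                  self.B = nn.Parameter(torch.randn(base.out_features, r) * 1e-4)
                  # Non-standard: alpha is a learned scalar rather than a fixed constant.
                  self.alpha = nn.Parameter(torch.tensor(float(alpha)))
                  self.r = r

              def forward(self, x: torch.Tensor) -> torch.Tensor:
                  return self.base(x) + (self.alpha / self.r) * (x @ self.A.T @ self.B.T)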

      In the context of Stable Diffusion specifically, there's also some really pathological stuff that happens when training text encoders alongside the UNet; for SD-1.5, the norm of "good" embeddings settles right around 28.0, but the model learns that it can reduce loss by pushing the embeddings away from that value. However, this comes at the cost of de-generalizing your outputs! Adding a second loss term which penalizes the network for drifting away from the L1 norm of the untrained embeddings for a given text substantially reduces the "insanity" tendencies. There's a more complete writeup at https://github.com/kohya-ss/sd-scripts/discussions/294#discu...
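
      A rough sketch of that second loss term (tensor and weight names here are my own; the real implementation is in the linked writeup):

          import torch

          def embedding_norm_penalty(trained_emb: torch.Tensor,
                                     reference_emb: torch.Tensor) -> torch.Tensor:
              # Penalize the trained text encoder for drifting away from the L1 norm
              # of the untrained (frozen) encoder's embeddings for the same text.
              drift = trained_emb.norm(p=1, dim=-1) - reference_emb.norm(p=1, dim=-1)
              return drift.abs().mean()

          # loss = diffusion_loss + norm_weight * embedding_norm_penalty(emb, frozen_emb)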

      You also have the fact that the current SOTA training tools just straight up don't train some layers that fine-tunes do.

      I do think there's a huge amount of ground to be gained in diffusion LoRA training, but most of the existing techniques work well enough that people settle for "good enough".

      • doctorpangloss 2 hours ago

        Most people are using LoRAs as a solution for IP transfer.

        Thing is, Ideogram v2 has already achieved IP transfer without fine-tuning or adapters, so we know those aren't needed.

        Is Ideogram v2 an exotic architecture? No, I don't think so.

        Are there exotic architectures that will solve IP transfer and other tasks? Yes: Chameleon and OmniGen. Lots of expertise went into SD3 and Flux dataset prep, but the multimodal architectures are so much more flexible and expressive.

        Flow matching models are maybe the last we will see before multi-modal goes big.

        What to make of things in the community? How is it possible that random hyperparameters and 30-minute fine-tunes produce good results?

        (1) DreamBooth effect: if the subject is, say, a dog, you won't notice the flaws.

        (2) File drawer problem: nobody publishes the 99 things that didn't work.

        (3) SD models before SD3 struggled with IP transfer on image content that could not possibly have been in their datasets. But laypeople are not testing that; they don't have access to art content that Stability and BFL also don't have access to.

        (4) Faces: of course the SD family saw celebrity images; faces are over-represented in its datasets. So yeah, it's going to be good at IP transfer of photographic faces. Most are in-sample.

viktour19 3 hours ago

> LoRA and full fine-tuning, with equal performance on the fine-tuning task, can have solutions with very different generalization behaviors outside the fine-tuning task distribution.

The ability of neural nets to generalize is inherently tied to their trainable parameter count via mechanisms we don't fully understand, but we do know that parameter count is key. When you fine-tune with LoRA, you're updating maybe 5% of the parameters, so I really don't think there is an illusion of equivalence in the field.

  • kelseyfrog 30 minutes ago

    > When you finetune with lora, you're updating maybe 5% of the parameters

    I'm not sure I understand this comment. The LoRA paper[1] specifically says that all of the pretrained weights remain frozen.

    > keeping the pre-trained weights frozen

    Specifically, the LoRA paper differentiates itself from approaches that update only some of the parameters, stating:

    > Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks.

    1. https://arxiv.org/pdf/2106.09685
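
    To make the two statements concrete, here's a back-of-the-envelope sketch (layer size and rank made up for illustration): the pretrained W is never updated, and the newly added adapter is a small fraction of the layer's parameter count.

        import torch.nn as nn

        d, r = 4096, 16
        base = nn.Linear(d, d, bias=False)   # pretrained W: d*d params, kept frozen
        A = nn.Linear(d, r, bias=False)      # trainable adapter down-projection
        B = nn.Linear(r, d, bias=False)      # trainable adapter up-projection

        full_ft_params = d * d               # what full fine-tuning would update
        lora_params = 2 * d * r              # what LoRA actually trains
        print(f"LoRA trains {lora_params / full_ft_params:.2%} of this layer")  # 0.78%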

  • wrs 3 hours ago

    Well, I think it depends on who you talk to. I suspect quite a few practitioners (as opposed to researchers) regard LoRA as a valid shortcut without fully considering the difference.

sorenjan 5 hours ago

> We randomly initialize A such that it has singular values of 1, freeze it, and only train B. When we do this, we see a sharp reduction in high ranking intruder dimensions in comparison to those in normal LoRA

This sounds interesting, but I can't see that they do much with this result. Are they saving it for a follow-up paper? I would think that if their whole paper is about a big problem with LoRAs, and they then find what looks like an easy solution to that problem, it would warrant more than a paragraph just before the conclusion.
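
For reference, a minimal sketch (PyTorch; my reading of that paragraph, not the authors' code) of what "initialize A with singular values of 1, freeze it, and only train B" could look like:

    import torch
    import torch.nn as nn

    def make_lora_factors(in_features: int, out_features: int, r: int):
        # Orthogonal init gives A orthonormal rows, i.e. singular values of exactly 1.
        A = torch.empty(r, in_features)
        nn.init.orthogonal_(A)
        A.requires_grad_(False)                        # A stays fixed during training
        # Only B is trained; zero init keeps the adapter a no-op at step 0.
        B = nn.Parameter(torch.zeros(out_features, r))
        return A, B

    # The adapted layer adds the low-rank update B @ A on top of the frozen weight.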

It would also have been interesting if they had included the DoRA method; they reference it briefly, and that paper claims to resemble fine-tuning's learning behavior.

But perhaps this paper is focused on LoRA behavior, and a separate paper comparing various improvements is better.

  • liuliu 2 hours ago

    Yeah, honestly not too surprising. Happy someone ran the experiments, though.

    I think we know that NNs trained on limited data tend to overfit, so to train a LoRA you need stronger regularization mechanisms, including:

    * Fixing A as a projection matrix so it doesn't rotate to an "easier" orientation for B to learn.

    * Periodically merging AB into W_tuned to simulate full-model fine-tuning behavior.
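
    A rough sketch of that periodic merge step (my own, assuming PyTorch and the usual alpha/r scaling):

        import torch

        @torch.no_grad()
        def merge_and_reset(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                            alpha: float, r: int) -> None:
            # Fold the current low-rank update into the full weight, then restart
            # the adapter so later steps behave more like full-model fine-tuning.
            W += (alpha / r) * (B @ A)
            B.zero_()                              # adapter becomes a no-op again
            torch.nn.init.kaiming_uniform_(A)      # fresh A for the next round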

    I think LoRA is fundamentally sound, because the gradient matrix is low-rank by nature.

K0balt 8 hours ago

So, in layman’s terms, LoRA appears to “traumatize” the model to some degree, connecting the vector space with strong “jumpers” (intruder dimensions) to change its behavior, instead of subtly conforming the entire model into a shape that accommodates the new data.

These jumpers or shortcuts do create connections between the relevant new concepts in the model, but by directly connecting them instead of associating them through the existing network of concepts, nuance is lost and the bypassed areas become deemphasized, leading to forgetting of previously held associations.

Because of this, full fine-tuning generally produces better results than LoRA, especially when forgetting existing training is detrimental.

Or, to further oversimplify the issue in SE terms, LoRA == monkeypatching. (Is this a kind of intruder dimension?)

  • six_four_eight 5 hours ago

    I wonder how this compares to 'catastrophic forgetting', which can be a problem with full fine-tuning. Or at least that's what I've just been reading as a case _for_ using LoRA, since it supposedly isn't susceptible to that. I guess this paper shows LoRA causes forgetting in a different way.

    Are there good general principles yet for which fine-tuning method to use in which situations? It still seems quite difficult to know ahead of time what's going to happen.

  • ismailmaj 6 hours ago

    How does it compare to partially fine-tuning the model by freezing most of the network except the last few layers?

  • Mockapapella 7 hours ago

    Thank you for this layman explanation

Der_Einzige 2 hours ago

This paper seems dubious, because it flies in the face of what the ReFT/pyreft paper shows (you can train 0.0001% of the parameters for 100 epochs to personalize on a small dataset):

https://github.com/stanfordnlp/pyreft

https://arxiv.org/abs/2404.03592

Note that the OP paper is not peer-reviewed yet, and while the one I linked isn't either, it has Christopher Manning (yes, the one you know from YouTube), the head of AI at Stanford, as a co-author.

In general, I think that LoRA, and especially ReFT, should be more resistant to catastrophic forgetting, since they literally don't touch most of the model.

The Stable Diffusion community has literally tens of thousands of LoRAs that don't cripple the model at small rank.

Eisenstein 5 hours ago

Is this just making precise what has been known: that LoRAs skew heavily toward the new training data, are not 'more intelligent' but just 'more targeted', and become less intelligent the more they are targeted? Or is this proposing something else? I am having a difficult time understanding exactly what 'intruder dimensions' are.

bArray 9 hours ago

[flagged]

  • HPsquared 9 hours ago

    Nor with LORAN, a WW2 navigation system: https://en.m.wikipedia.org/wiki/LORAN

    • poizan42 5 hours ago

      Why is GP flagged? I was pretty confused by the title as well. This is the first time I've heard "LoRa" mean anything other than the wireless protocol, and that reading was reinforced by the talk of tuning. What is it with the AI crowd reusing long-established terminology and then getting mad at people who are confused by them usurping terms that have long meant something else? I could understand posting this title on a forum dedicated to AI, but Hacker News is not that, and LoRa has had another meaning that has been commonly known in hacker circles for a decade now.

      • Tomte 4 hours ago

        Because we have this discussion every single time LoRA comes up.

        Also, neural networks and niche radio technologies are far enough apart that name clashes are to be expected and no problem.

        • oytis 42 minutes ago

          Niche? When I google "lora", the AI stuff isn't even on the first page.

        • poizan42 3 hours ago

          And how should they know that? Do you expect people to read every single post on Hacker News? For me, this is the first time I've heard LoRa used to mean something else. Obviously new people are going to keep being confused for as long as posts with confusing titles keep getting posted.

          And yes, the name clash may not matter in either of those circles, but it does matter right here on Hacker News where those circles overlap.

          Also, LoRa devices exist in the consumer space, so it seems a bit disingenuous to call it a niche radio technology.

          • Tomte 3 hours ago

            You don't have to know it, and I will still flag your comment. Get a grip! Nobody is doing anything serious to you.

            • poizan42 3 hours ago

              Nobody has flagged my comment? Are you confusing me with someone else? No-one is doing anything to me at all?

AstroJetson 6 hours ago

[flagged]

  • Tenoke 5 hours ago

    If the issue is reusing names, then they also shouldn't have named the radio LoRa, the same name as my aunt.

  • sorenjan 6 hours ago

    > I was excited to click the link to see how fine tuning LoRA frequencies I was using on my Mesh network would work.

    You're thinking of LoRa radio, from Long Range. There's one of you in every LoRA comment section; I have a hard time believing it's an actual good-faith mistake anymore.

    • samuellavoie90 5 hours ago

      LoRa radio things are a lot more common than the newly popular AI stuff.

      • sorenjan 5 hours ago

        That depends on what your particular filter bubble contains. Just browsing HN should tell you that LoRA and fine-tuning are common terms in AI, and even if you genuinely thought the article was about the radio technique, there's really no reason to leave the same comment that's on every LoRA post.

        https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

        • poizan42 4 hours ago

          If every post has a similarly confusing title, then why shouldn't they all get the same comment? The obvious solution is for people to stop using confusing titles. Something as simple as putting (AI) at the end of the title would go a long way toward fixing the problem.

      • poizan42 5 hours ago

        Yes exactly. The AI crowd are the ones not arguing in good faith here. The title is very confusing as it stands right now.

  • mdp2021 5 hours ago

    'LoRa' ("Long Range") and 'LoRA' ("Low Rank Adaptation") have different capitalization.

    • AstroJetson 4 hours ago

      Most search engines are capitalization agnostic, even the one here on HN. I'll have a word with my lawyer, I want to call something iphone, he says that's taken. I'll see if he likes IpHoNe any better to fix the capitalization. :-)

  • rjsw 6 hours ago

    We could switch to just referring to everything as "the thing" or "the idea".

    • drweevil 5 hours ago

      Or the AI guys could respect the namespace and call it LRA, a la RAG. What's next, WiFI?