Dreambooth can be a tricky process, so be warned! You will need to be willing to try things possibly many times before you get a result you are fully satisfied with. It's best to approach with a curious and experimental mindset to make the best of the tools available to you. Don't be afraid to try things out or fail. Practice makes perfect.
A great resource is the Github page for the Dreambooth extension here:
https://github.com/d8ahazard/sd_dreambooth_extension
Goals:
The goal in training is to achieve "Convergence", the moment when everything clicks in your training and it all works. Getting there will require trial and error: different learning rates, different optimizers, and varying numbers of epochs. Not for the faint-hearted.
Our goal here at RunDiffusion is to provide you with relevant docs you can use and get value from. Keep in mind that this field is changing QUICKLY and some of the advice provided here is only a guideline. Something could come out tomorrow that changes the game completely!
Concepts
Concepts are datasets in a model, generally based around a specific person, object, or style. Each concept has a dataset path, may have its own class images, and will have its own prompt. In Dreambooth for Automatic1111 you can train 4 concepts into your model. They can be broad or very specific depending on your model's focus. More than 4 concepts can be trained by using a Concepts List; a sketch of one is shown below.
For Directories, use the prefix /mnt/private/ and then your folder(s). For example if you had a folder called "trainingimages" your directory would be /mnt/private/trainingimages/
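If you do go past 4 concepts, the Concepts List is a JSON file describing each concept. Below is a minimal sketch that writes one, expressed in Python; the field names (instance_prompt, class_prompt, instance_data_dir, class_data_dir) are recalled from the extension and may not match its current schema exactly, and the paths and the "zwx" token are placeholders, so double-check against the Github page above.

```python
import json

# Hypothetical concepts list - verify field names against the
# sd_dreambooth_extension docs before using.
concepts = [
    {
        "instance_prompt": "photo of zwx person",           # prompt with your unique token
        "class_prompt": "photo of a person",                 # generic class prompt
        "instance_data_dir": "/mnt/private/trainingimages/",
        "class_data_dir": "/mnt/private/classimages/",
    },
    # ...add one dictionary per additional concept; more than 4 is fine here.
]

with open("/mnt/private/concepts_list.json", "w") as f:
    json.dump(concepts, f, indent=2)
```

Point the extension's Concepts List field at the resulting file instead of filling in the individual concept tabs.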
Setting Descriptions:
Batch size:
- Number of images being processed at once by the training.
- If you have a small dataset, you will want to keep this low, because gradients are averaged across the batch rather than calculated for each image individually. This can result in overfitting and a lack of variety.
- The MD box can comfortably process batches up to 10, but with small datasets it should be kept quite low. The LG box can comfortably run up to 20. Note that this will vary wildly depending on your settings and VRAM usage.
- For training objects, particular people, or a small dataset of images, try a batch size of 2-4. This keeps more of each image's individual detail in the gradients.
- Gradients are averaged across the batch. Sounds fancy, but what does it mean?! It means that the images in a batch are basically... smushed together and processed as one update, which can make weird combinations of images! Think of it like this: larger batches, less individual accuracy on the images (see the sketch after this list).
- Increasing batch size will help process larger datasets faster, which is especially good when you don't need all the information from each and every dataset image.
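A minimal PyTorch sketch of why bigger batches mean less individual accuracy: the loss (and therefore the gradient) is averaged over everything in the batch, so a single weight update reflects a blend of all the images rather than any one of them. The model and tensors here are tiny placeholders, not the actual Dreambooth training loop.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)             # stand-in for the real UNet
batch = torch.randn(4, 16)            # batch_size = 4 "images"
target = torch.randn(4, 16)

per_image_loss = ((model(batch) - target) ** 2).mean(dim=1)  # one loss value per image
loss = per_image_loss.mean()          # averaged ("smushed") across the batch
loss.backward()                       # a single gradient for the whole batch
```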
Gradient Accumulation Steps:
- Related to batch size.
- Only use it if you need to; it simulates a larger effective batch without the extra VRAM overhead of actually raising the batch size.
- May work best for multiple subjects.
- It's suggested to use the same value as your batch size (see the sketch below).
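The practical effect is simple arithmetic: the optimizer only updates the weights after accumulating gradients over several batches, so the effective batch size is the batch size multiplied by the accumulation steps. A quick sketch (the numbers are just examples):

```python
batch_size = 2                   # images processed per forward/backward pass
gradient_accumulation_steps = 2  # batches accumulated before one optimizer step

effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)      # 4 images' worth of gradient per weight update
```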
Class Images:
- Images generated from the base model you are training on; they help preserve what the model already knows about the general class (the prior) while it learns your subject.
- One theory is to use a very general class prompt (e.g. "man", "person", or "animal")
- Another theory is to use class images close to what you want, so your subject shows up more readily in that area of prompting (unverified)
- May not be necessary for style training, but can greatly help with people/objects. Not necessary with fine tuning.
- Can reduce overfitting
- Can dilute the training
Learning Rate:
- How fast or slow the model learns from the images (the size of each weight update)
- For larger concepts like styles, overall details, and general vibes, use a higher learning rate. For fine details and specific things like objects and faces, use a lower learning rate. For example, when training a style, consider doing a high learning rate pass to learn the style, then a low learning rate pass to fill in the details!
- Learning rates are written in engineering/scientific notation: the first number is the coefficient, and the number after the "e" tells you how many places to move the decimal point to the left. E.g.: 3e-6 = 0.000003 and 3e-7 = 0.0000003 (see the sketch after this list).
- Example Learning Rates:
- Object training: 4e-6 for about 150-300 epochs or 1e-6 for about 600 epochs
- Suggested upper and lower bounds: 5e-7 (lower) and 5e-5 (upper)
- Can be constant or cosine
- Cosine: starts off fast and slows down as it gets closer to finishing
- Constant: same rate throughout training
- You may be able to refine results by using different learning rates on different passes, e.g. a higher rate (4e-6) for the first 100 epochs, then a lower rate (3e-6) for the remaining epochs to refine details.
- Nerdy ai stuff: https://www.baeldung.com/cs/learning-rate-batch-size
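A quick sketch of both points above: how the e-notation expands, and how a cosine schedule eases the rate down over the run while a constant schedule holds it steady. The cosine formula here is the generic textbook version, not necessarily the exact schedule the extension implements.

```python
import math

print(3e-6 == 0.000003)   # True - "e-6" moves the decimal point six places left
print(3e-7 == 0.0000003)  # True

def cosine_lr(base_lr, step, total_steps):
    # Generic cosine decay: starts at base_lr and eases down to 0 by the end.
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

def constant_lr(base_lr, step, total_steps):
    return base_lr        # same rate for the whole run

for step in (0, 500, 1000):
    print(step, f"{cosine_lr(4e-6, step, 1000):.1e}", f"{constant_lr(4e-6, step, 1000):.1e}")
```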
Epochs:
- The number of times training will run through your full dataset of images. Used in the steps calculation (see the sketch below).
- 100 epochs is a good starting point for a training test.
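As a rough sketch of the steps calculation: each epoch is roughly your image count divided by the batch size, and the total is that times the number of epochs. The extension's exact accounting may differ slightly (gradient accumulation, class images, etc.), but this is the ballpark math.

```python
import math

num_images = 20   # images in your dataset
epochs = 100
batch_size = 2

steps_per_epoch = math.ceil(num_images / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 1000 optimizer steps for this example
```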
Advanced Settings:
Mixed Precision:
- Training method for deep learning, heavy math stuff
- Multiple modes:
- FP16 - “half precision”, faster but less accurate. Lower VRAM requirement. 4x faster than using single precision.
- FP32 - “single precision”
- TF32 - “TensorFloat-32”, potentially 20x faster but with reduced accuracy compared to FP32. The mantissa is 13 bits shorter, so it's roughly equivalent to an imaginary FP19.
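For the curious, in plain PyTorch these modes correspond to a couple of well-known switches. The extension exposes them as settings, so you won't normally set these yourself; the sketch just shows what's happening underneath (requires a CUDA GPU to actually run).

```python
import torch

# TF32: lets Ampere-and-newer GPUs run FP32 matmuls in the faster TensorFloat-32 mode.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# FP16 mixed precision: run the forward pass in half precision where it is safe to do so.
model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(4, 16, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)
print(out.dtype)  # torch.float16
```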
Optimizer:
- LION - new optimizer discovered by Google Brain https://github.com/lucidrains/lion-pytorch
- Way faster; far fewer steps needed.
- Can use far lower learning rates than Adam.
- It is suggested LION can work with a wide variety of images in a data set.
- Recommended to try 10x less LR than Adam!
- 8-bit Adam
- Helps with VRAM usage, keeps training under 10GB
- Slightly lower accuracy
- Requires higher learning rates (and takes longer!)
- Adam Torch
- Brand new
Potential strategies: Try a slow Learning Rate like 1e-7 with LION until it converges, then use Adam to dial it in at a low rate!
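A hedged sketch of what swapping in LION looks like with the lion-pytorch package linked above. The learning rate just applies the 10x-lower-than-Adam rule of thumb, the tiny model is a placeholder, and the real training loop lives inside the extension.

```python
# pip install lion-pytorch
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(16, 16)   # stand-in for the UNet being trained

# Where you might run Adam at ~4e-6, LION is suggested at roughly 10x lower.
optimizer = Lion(model.parameters(), lr=4e-7, weight_decay=1e-2)

loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()
optimizer.step()                  # one LION update
optimizer.zero_grad()
```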
- Train UNet - Text Encoder training can vastly help a model. Some people find that not training it works too!
- Clip Skip - Reduces prompt variability and helps keep prompt concepts from bleeding into each other. Useful for text encoding.
- Offset Noise - Use VERY low settings, e.g. 0.1! Note that this will slow down learning and require additional training steps.
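For the curious, offset noise is usually implemented by adding a small amount of per-image, per-channel noise on top of the regular noise before the diffusion loss is computed; here's a sketch under that assumption, using the 0.1 strength mentioned above.

```python
import torch

latents = torch.randn(2, 4, 64, 64)   # stand-in latent batch (B, C, H, W)
offset_strength = 0.1

noise = torch.randn_like(latents)
# One extra noise value per image and channel, broadcast across H and W;
# this nudges the model toward learning overall brightness/darkness shifts.
noise = noise + offset_strength * torch.randn(latents.shape[0], latents.shape[1], 1, 1)
```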
Outputs
Samples, models, partially trained models, etc. are output via the WebUI, under the extensions folder or the models folder.
Overtraining and Undertraining
Finding convergence means finding the exact point between undertraining and overtraining that reproduces your trained dataset style while retaining flexibility in the underlying model.
DATA SETS
Preparing a Data Set
- Captioning
- Be as specific as you can! Get every single detail in the scene
- When labeling images with CLIP or Caption Buddy, label everything that you might want to be able to change in the image. If you like the background of a night scene, you could just label it "night background", but if you want flexibility with it (maybe a moon, or no moon) you would write "a night background with a full moon".
- Everything you caption helps train the model! Consider captioning emotions, textures, styles, vibes, makeup, clothes, features, and backgrounds.
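A common convention (and, as far as we know, what the extension's [filewords] option reads, though you should verify) is one .txt caption file per image, sharing the image's filename. A small sketch that stubs those files out so you can fill in the details; the folder, file extension, and "zwx" token are placeholders.

```python
from pathlib import Path

dataset_dir = Path("/mnt/private/trainingimages")  # your dataset folder

for image in sorted(dataset_dir.glob("*.jpg")):
    caption_file = image.with_suffix(".txt")
    if not caption_file.exists():
        # Stub caption - replace with every detail you want to be changeable:
        # emotions, textures, style, clothing, background, lighting, etc.
        caption_file.write_text("a photo of zwx person, night background with a full moon")
```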
NOTES
- Overtraining: can be caused by too much training, overly similar images, similar tokens, the learning rate, etc. The right way to avoid overtraining is to become familiar with what the settings do and to test on smaller datasets before committing to a big one.
- Convergence: The moment when your model reaches peak training and produces the best results - any further training will result in each epoch producing worse results via overtraining.
RESUMING A TRAINING SESSION
- Copy the exact path. Paste the training copypasta into Notepad and replace the incorrect model path with the path to the epoch folder.
LINKS
http://birme.net - cropping tool