Often the two concepts are confused
In machine learning papers, the terms “Out of Distribution” and “Out of Domain” both come up in discussions about generalization. Do they mean the same thing? As I understand it, they are different concepts.
In the context of machine learning, Out of Domain is a concept paired with Applicable Domain, and Out of Distribution is a concept paired with In Distribution.
- Applicable Domain and Out of Domain
— Applicable Domain
This term refers to the range of data for which the model performs adequately for its intended purpose. Broadly, it covers the characteristics and conditions of the data under which the model will function properly.
— Out of Domain
This term refers to data outside the original domain or range for which the model was designed or trained. This could include different characteristics or different types of data, not just different distributions.
- In Distribution and Out of Distribution
— In Distribution
This term refers to data drawn from the same distribution as the data set used to train the model. In other words, it is data whose types and characteristics the model is already familiar with.
— Out of Distribution
This term refers to data drawn from a distribution that differs from the distribution of the data the model saw during training. In other words, it is data that is unknown or anomalous to the model.
In short, Out of Distribution and Out of Domain have in common that they refer to unusual data for which the model may not function properly. However, when the unusual data is “data outside the domain of the training data,” it is Out of Domain, and when it is “data outside the distribution of the training data,” it is Out of Distribution.
Let me elaborate on this a bit more. Within the Applicable Domain, there may be Out of Distribution samples. The Applicable Domain is the range of data the model was designed or trained to handle. However, not all data within this domain will follow the same distribution as the model’s training data. Data in the domain that follow a different distribution are Out of Distribution. On the other hand, it is rare to find In Distribution data among Out of Domain data; most Out of Domain data will also be Out of Distribution.
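To make the relationship concrete, here is a minimal sketch with one-dimensional numeric data. Everything in it is an assumption for illustration: the domain bounds `DOMAIN_LOW` and `DOMAIN_HIGH`, the toy training values, and the "within 3 standard deviations of the training mean" rule are hypothetical stand-ins, not a standard definition of either concept.

```python
from statistics import mean, stdev

# Hypothetical applicable domain: the range of inputs the model is meant to handle.
DOMAIN_LOW, DOMAIN_HIGH = 0.0, 100.0

# Hypothetical training data, concentrated in a small part of that domain.
train = [45.0, 48.0, 50.0, 52.0, 55.0, 47.0, 53.0, 50.0]
mu, sigma = mean(train), stdev(train)


def in_domain(x: float) -> bool:
    """Domain check: is x inside the range the model was designed for?"""
    return DOMAIN_LOW <= x <= DOMAIN_HIGH


def in_distribution(x: float, k: float = 3.0) -> bool:
    """Crude distribution check (an assumption): x lies within k standard
    deviations of the training mean."""
    return abs(x - mu) <= k * sigma


def describe(x: float) -> tuple[str, str]:
    """Label a sample on both axes independently."""
    domain = "Applicable Domain" if in_domain(x) else "Out of Domain"
    dist = "In Distribution" if in_distribution(x) else "Out of Distribution"
    return domain, dist


for sample in (52.0, 90.0, 150.0):
    print(sample, describe(sample))
```

With these toy numbers, 52 is in the domain and in distribution, 90 is still in the domain but far from anything seen in training, and 150 fails both checks. The two checks are computed separately, which is the point: one asks what the model is for, the other asks what the model has seen.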
Consider the case of training a model on English text data. If the model is trained on an extremely large amount of English text, most English text will be in the Applicable Domain and In Distribution. However, if you forgot to include in the training data set contractions that are used only by a subset of the population (e.g., young people or gamers), those contractions will be in the Applicable Domain but Out of Distribution. Also, since the training data is English text only, Japanese text is Out of Domain. In addition, since the distributional characteristics of most Japanese text differ from those of English text, most Japanese text is also Out of Distribution.
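The English text example can be sketched the same way. This is a toy under loud assumptions: the tiny `TRAIN_VOCAB` stands in for a large training corpus, "all letters are ASCII" is a rough proxy for "the text is English", and the 0.5 vocabulary coverage threshold is arbitrary. Real systems would use far better checks for both.

```python
import re

# Hypothetical training vocabulary, standing in for a large English corpus.
TRAIN_VOCAB = {"i", "will", "be", "right", "back", "see", "you", "later", "good", "game"}


def tokenize(text: str) -> list[str]:
    """Split text into lowercase runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def in_domain(text: str) -> bool:
    """Rough proxy for 'is English' (an assumption): every letter is ASCII."""
    return all(ch.isascii() for ch in text if ch.isalpha())


def vocab_coverage(text: str) -> float:
    """Fraction of tokens that appear in the training vocabulary."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(tok in TRAIN_VOCAB for tok in tokens) / len(tokens)


def describe(text: str, threshold: float = 0.5) -> tuple[str, str]:
    """Label a text on the domain axis and the distribution axis."""
    domain = "Applicable Domain" if in_domain(text) else "Out of Domain"
    dist = "In Distribution" if vocab_coverage(text) >= threshold else "Out of Distribution"
    return domain, dist


for text in ("see you later", "brb gg", "また後で"):
    print(repr(text), describe(text))
```

Here "see you later" is ordinary English that the vocabulary covers, "brb gg" is English slang that was left out of training, and "また後で" is Japanese. The slang is the interesting case: it passes the domain check and fails the distribution check.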