[Google_Bootcamp_Day14]

ML strategy 3

Error Analysis

  • Assume that you have a 90% accuracy (10% error) cat classifier, and you want to know whether you should try to make the classifier do better on dogs or not
  • Error Analysis
    • ex. Get ~100 mislabeled dev set examples and count up the errors in each category (see the sketch after this list)
    • Overall dev set error: 10%
    • Errors due to incorrect labels: 6% of mistakes × 10% overall error = 0.6% (relatively small! If it were big, it would be worth fixing)
    • Errors due to other causes: 9.4%
    • Improve performance on the error categories with the biggest percentage (e.g., by correcting mislabeled examples)
  • DL algorithms are quite robust to random label errors in the training set (but you do need to care about the dev/test sets)
  • Apply the same process to your dev and test sets to make sure they continue to come from the same distribution
  • The training and dev/test data may now come from slightly different distributions (that is okay!)
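A minimal sketch of the counting step above, assuming a hypothetical list of mislabeled dev-set examples tagged with the mistake categories observed; the category names and counts are made up for illustration:

```python
from collections import Counter

# Hypothetical error-analysis spreadsheet: each entry is a mislabeled
# dev-set example tagged with the categories of mistake we observed.
mislabeled_examples = [
    {"id": 1, "categories": ["dog"]},
    {"id": 2, "categories": ["great_cat"]},        # lions, panthers, ...
    {"id": 3, "categories": ["blurry"]},
    {"id": 4, "categories": ["dog", "blurry"]},
    {"id": 5, "categories": ["incorrect_label"]},  # the label itself was wrong
    # ... in practice, roughly 100 examples
]

counts = Counter(c for ex in mislabeled_examples for c in ex["categories"])
total = len(mislabeled_examples)

overall_dev_error = 0.10  # 10% dev error, as in the example above

for category, n in counts.most_common():
    fraction = n / total
    # Ceiling on how much fixing this category alone could reduce dev error
    print(f"{category:16s} {fraction:5.1%} of errors "
          f"-> at most {fraction * overall_dev_error:.2%} absolute improvement")
```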

Build your first system quickly, then iterate

Depending on the area of application, the guidelines below will help you prioritize when building your system.

  • Guideline
    1. Set up development/test set and metrics
      • Set up a target
    2. Build an initial system quickly
      • Train on the training set quickly: Fit the parameters
      • Development set: Tune the parameters
      • Test set: Assess the performance
    3. Use bias/variance analysis & Error analysis to prioritize next steps
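For step 3, a minimal sketch of how the two headline gaps can be compared, assuming hypothetical human-level, training, and dev error rates:

```python
# Hypothetical error rates; replace with your own measurements.
human_level_error = 0.01   # proxy for Bayes error
training_error    = 0.08
dev_error         = 0.10

avoidable_bias = training_error - human_level_error
variance       = dev_error - training_error

if avoidable_bias > variance:
    print(f"Avoidable bias {avoidable_bias:.1%} dominates: "
          "try a bigger model, longer training, or a better architecture.")
else:
    print(f"Variance {variance:.1%} dominates: "
          "try more data, regularization, or data augmentation.")
```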

Training and testing on different distributions

Example: Cat vs. Non-cat

In this example, we want to create a mobile application that will classify and recognize pictures of cats taken and uploaded by users.
There are two sources of data used to develop the mobile app. The first distribution is small: 10,000 pictures uploaded from the mobile application. Since they come from amateur users, the pictures are not professionally shot, not well framed, and often blurry. The second source is the web, from which you downloaded 200,000 pictures in which the cats are professionally framed and in high resolution.

  • Option 1 (random shuffle: not recommended)
    • Pros: all of the data comes from the same distribution
    • Cons: most of the data in the dev set comes from the web, which means you are optimizing for the wrong target
  • Option 2 (let all of the dev/test set come from the mobile app: recommended; see the sketch after this list)
    • Pros: the target is well defined
    • Cons: the training distribution is different from the dev/test set distributions
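A minimal sketch of the Option 2 split, assuming hypothetical placeholder arrays `web_images` (200,000 examples) and `mobile_images` (10,000 examples); roughly half of the mobile data goes into training and the rest is split between dev and test:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder indices standing in for the actual image tensors.
web_images    = np.arange(200_000)   # high-quality web pictures
mobile_images = np.arange(10_000)    # blurry mobile-app uploads

rng.shuffle(mobile_images)

# Option 2: training = all web data + ~half the mobile data,
# dev/test = mobile data only, so the metric targets the real use case.
train = np.concatenate([web_images, mobile_images[:5_000]])
dev   = mobile_images[5_000:7_500]
test  = mobile_images[7_500:]

print(len(train), len(dev), len(test))  # 205000 2500 2500
```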

Bias and variance with mismatched data distribution

Example: Cat classifier with mismatched data distributions

When the training set comes from a different distribution than the dev/test sets, the method of analyzing bias and variance changes. The scenarios below walk through an example with specific training, training-development, and development error rates.

  • Scenario A
    • If the development data came from the same distribution as the training set, we would conclude that there is a large variance problem and that the algorithm is not generalizing well from the training set.
    • However, since the training data and the development data come from different distributions, this conclusion cannot be drawn. There isn't necessarily a variance problem; the development set might simply contain images that are more difficult to classify accurately.
    • When the training set and the dev/test set distributions are different, two things change at the same time.
    • First, the algorithm has seen the data in the training set but not the data in the development set.
    • Second, the distribution of the data in the development set is different.
    • It's difficult to know which of these two changes produces the 9% increase in error between the training set and the development set.
    • To resolve this issue, we define a new subset called the training-development set. This new subset has the same distribution as the training set, but it is not used for training the neural network.
  • Scenario B
    • The error between the training set and the training-development set is 8%.
    • In this case, since the training set and the training-development set come from the same distribution, the only difference between them is that the network was trained on the training data but never saw the training-development data.
    • The neural network is not generalizing well to unseen data drawn from the same distribution as the training set.
    • Therefore, we really have a variance problem.
  • Scenario C
    • In this case, we have a data mismatch problem, since the two data sets come from different distributions.
  • Scenario D
    • In this case, the avoidable bias is high, since the difference between Bayes error and training error is 10%.
  • Scenario E
    • In this case, there are two problems.
    • The first is that the avoidable bias is high, since the difference between Bayes error and training error is 10%.
    • The second is a data mismatch problem.
  • Scenario F
    • Development should never be done on the test set.
    • However, the difference between the development set and the test set gives the degree of overfitting to the development set.

General formulation

[Figure: general formulation of the error gaps — human-level error ↔ training error = avoidable bias; training error ↔ training-dev error = variance; training-dev error ↔ dev error = data mismatch; dev error ↔ test error = degree of overfitting to the dev set]
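A minimal sketch of this formulation, assuming hypothetical error rates for each set; each quantity is just the difference between adjacent error rates:

```python
# Hypothetical error rates; substitute your own measurements.
errors = {
    "human_level":  0.004,   # proxy for Bayes error
    "training":     0.07,
    "training_dev": 0.10,
    "dev":          0.12,
    "test":         0.12,
}

avoidable_bias = errors["training"]     - errors["human_level"]
variance       = errors["training_dev"] - errors["training"]
data_mismatch  = errors["dev"]          - errors["training_dev"]
overfit_dev    = errors["test"]         - errors["dev"]

print(f"Avoidable bias:         {avoidable_bias:.1%}")
print(f"Variance:               {variance:.1%}")
print(f"Data mismatch:          {data_mismatch:.1%}")
print(f"Overfitting to dev set: {overfit_dev:.1%}")
```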

Addressing data mismatch

  • Perform manual error analysis to understand the error differences between the training set and the development/test sets. Development should never be done on the test set, to avoid overfitting.
  • Make the training data more similar to the development/test sets, or collect more data similar to them. One technique for making the training data more similar to your development set is artificial data synthesis. Be aware, however, that you might accidentally end up simulating data from only a tiny subset of the space of all possible examples.
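As an illustration of artificial data synthesis for this cat-app example, a minimal sketch (assuming Pillow is installed and using hypothetical file names) that degrades a high-resolution web image so it looks more like a blurry, lower-resolution mobile upload:

```python
from PIL import Image, ImageFilter

def degrade_to_mobile_style(path_in: str, path_out: str) -> None:
    """Blur and downscale a high-quality web image so it resembles
    a blurrier, lower-resolution mobile-app upload (rough simulation)."""
    img = Image.open(path_in)
    img = img.filter(ImageFilter.GaussianBlur(radius=2))
    small = img.resize((img.width // 4, img.height // 4))
    small = small.resize((img.width, img.height))  # upscale back, losing detail
    small.save(path_out)

# Hypothetical file names, for illustration only.
degrade_to_mobile_style("web_cat.jpg", "synthetic_mobile_cat.jpg")
```

Note that a single fixed blur-and-downscale recipe covers only a narrow slice of real mobile photos, which is exactly the tiny-subset risk mentioned above.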

[Source] https://www.coursera.org/learn/machine-learning-projects
