Keywords: control chart, control limit, non-normal, nonnormal, percentile, statistical process control, Cpk, UCL, SPC, one-sided, skewed
Download
version 418
History and Background
I have been a data miner since 2003. Before that, I was a researcher in electronic devices, and I also worked as a statistician (2012 - 2013) at a silicon wafer company.
During my two years as a statistician, I checked and reviewed many control limits calculated by the company's own software. Whenever I had a question about a calculation, I double-checked it with Minitab. Usually, the difficult and complicated parameters were the non-normal ones.
Every silicon wafer company needs to deal with non-normally distributed parameters, for example metal contamination, anions, cations, and particles. These parameters are major factors in semiconductor yield.
Although Minitab's IDI (Individual Distribution Identification) is usually a reliable method, it often does not work when the distribution is heavily skewed. Sometimes the control limit calculated by IDI is not suitable for process control: when we applied such a control limit to historical data, more than 5% of the points were OOC (out of control), although the theoretical OOC rate is only 0.27%.
IDI does, however, work well for slightly skewed parameters.
You can learn more about IDI by clicking this link: http://blog.minitab.com/.
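As a quick illustration of this sanity check, here is a minimal Python sketch (not part of my macro): it simulates heavily skewed lognormal data and measures the OOC% produced by a naive normal-theory limit, which stands in for the reviewed limit. The observed rate lands far above the theoretical 0.27%.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # heavily skewed "historical" data

# Naive normal-theory upper limit, used here only as a stand-in for the reviewed limit
ucl = data.mean() + 3 * data.std(ddof=1)

ooc_pct = 100 * np.mean(data > ucl)
print(f"UCL = {ucl:.2f}, observed OOC = {ooc_pct:.2f}% (theoretical OOC rate: 0.27%)")
```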
You can use a free trial of Minitab for 30 days: http://www.minitab.com/ja-jp/products/minitab/free-trial/
The regular price for a single personal license is $1,595.
The student price is ¥3,329 for a 6-month rental.
Another statistical package is JMP (pronounced "jump").
You can use a free trial of JMP for 30 days: https://www.jmp.com/ja_jp/download-jmp-free-trial.html
The regular price for a single personal license is $1,785 (¥289,440 in Japan, including 8% tax).
The student price is ¥105,840, including 8% tax.
JMP handles non-normal parameters in the same way as Minitab: if the data are skewed, they are transformed toward a normal distribution by searching for a function that fits the measured data.
Why do the IDI of Minitab and the corresponding tool of JMP not work for heavily skewed distributions?
The best-fit function selected by Minitab's AD (Anderson-Darling) value and p-value changes when the data period or the tool number changes, even though those changes are trivial and the population should be the same. And when the best-fit function changes, both the control limit and the Cpk change. Therefore IDI is not usable for tracking process improvement; for example, we cannot tell whether the process has actually improved.
This is because IDI overfits and is not robust, so it is not usable for SPC when the distribution is heavily skewed. The best-fit function is identified mainly from the data around the average, and the data in the tails have little effect. But OOC% and Cpk are determined mainly by the data in the tails, so there is a large mismatch between the best-fit function and the OOC% and Cpk it produces.
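If you want to see this instability yourself, the sketch below is one way to do it. It is not Minitab's IDI: it uses SciPy fits and the Kolmogorov-Smirnov statistic as a stand-in for the AD-value/p-value ranking, draws several samples from one and the same skewed population, and prints which family wins and the resulting 99.73rd-percentile limit, which can jump from sample to sample.

```python
import numpy as np
from scipy import stats

candidates = {"lognormal": stats.lognorm, "gamma": stats.gamma, "Weibull": stats.weibull_min}

def pick_best_fit(x):
    """Pick the 'best' family by the Kolmogorov-Smirnov statistic
    (a stand-in for Minitab's AD value / p-value ranking)."""
    best_name, best_ks, best_ucl = None, np.inf, None
    for name, dist in candidates.items():
        params = dist.fit(x, floc=0)                     # fix location at 0 for stability
        ks = stats.kstest(x, dist.cdf, args=params).statistic
        if ks < best_ks:
            best_name, best_ks = name, ks
            best_ucl = dist.ppf(0.9973, *params)         # one-sided 99.73rd %ile as the UCL
    return best_name, best_ucl

rng = np.random.default_rng(2)
population = rng.lognormal(0.0, 0.8, size=5000)          # one fixed skewed population
for period in range(5):                                  # five "months" of data from it
    sample = rng.choice(population, size=200, replace=False)
    name, ucl = pick_best_fit(sample)
    print(f"period {period}: best fit = {name:9s}  UCL = {ucl:.2f}")
```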
Useful usage of IDI:
I think that once you define the best-fit function for a parameter, you should not change that function.
This means that the smallest AD value or the largest p-value is not always the best criterion for identifying the best-fit function.
If a few data points in a dataset are slightly unstable, the best-fit function may change, even though the underlying function should remain the same. This strategy is explained in JMP's guide.
In 2013, I noticed that the right tail looks linear when non-normal data are plotted on a probability plot. I did not understand why, but many heavily skewed parameters look like a straight line on a probability plot.
So my idea is not based on theory but on a lot of experience reviewing hundreds of parameters and thousands of charts. And my idea works better than IDI, even though its calculation is much easier.
One day, while searching for a better method than my idea, I found a patent by TOSHIBA by chance. Their idea is simple: for a non-normal parameter, the data larger than the median are approximated by a normal distribution. I thought that this is very similar to my idea.
Most statisticians would conclude that my idea was inspired by the TOSHIBA patent, so my idea may be covered by it. This is their paper about it: https://yahoo.jp/box/E7KZ0o .
Features of my software
Pros:
* Easy to find and remove outliers
* Normal distribution approximation: the only fitting function is the normal distribution
* Original solver for robust regression --> high accuracy and calculation speed
* Free
Cons:
* It needs a lot of data, because only 16% of the measured data are used for the control-limit calculation.
For example, the control limit changed greatly when we removed a single maximum value as an outlier.
It typically needs more than 100 data points; if possible, more than 300.
Countermeasure
- Multivariate methods for control limits or outlier detection, for example Taguchi MT or Hotelling T2 (a minimal sketch follows below).
This is my blog post about Hotelling T2. Sorry, it is written in Japanese.
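For readers who cannot read the Japanese post, here is a minimal Hotelling T2 sketch in Python (NumPy/SciPy). It is only an outline of the idea, not my macro, and it uses the simple chi-square cutoff instead of the exact F-based limit.

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, alpha=0.0027):
    """Flag multivariate outliers with Hotelling's T-squared.
    Rows of X are lots/wafers; columns are the parameters measured together."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    t2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # (x - m)' S^-1 (x - m) per row
    cutoff = stats.chi2.ppf(1 - alpha, df=X.shape[1])    # rough chi-square approximation
    return t2, t2 > cutoff

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(3), size=200)
t2, flagged = hotelling_t2(X)
print(flagged.sum(), "points flagged out of", len(X))
```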
* It doesn't show you the confidence interval of the control limit and Cpk.
Countermeasure
Bootstrap
- For example: re-sample 1,000 times from the raw data by Bernoulli trials
- Calculate 1,000 control limits
- Calculate the 2.5th and 97.5th percentiles of those 1,000 control limits; these are the confidence interval
- But the calculation above may take a long time, for example 10 minutes or 1 hour (a minimal code sketch follows below).
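Here is a minimal Python sketch of that bootstrap. The names bootstrap_ucl_ci and calc_ucl are mine, not from the macro, and I resample with replacement in the ordinary bootstrap way; plug in whatever control-limit calculation you actually use.

```python
import numpy as np

def bootstrap_ucl_ci(data, calc_ucl, n_boot=1000, seed=0):
    """95% bootstrap confidence interval for a control limit."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    ucls = [calc_ucl(rng.choice(data, size=len(data), replace=True))  # re-sample
            for _ in range(n_boot)]                                   # 1000 control limits
    return np.percentile(ucls, [2.5, 97.5])                           # the confidence interval

# Example with a naive normal UCL as a placeholder for the real calculation
naive_ucl = lambda x: x.mean() + 3 * x.std(ddof=1)
rng = np.random.default_rng(1)
data = rng.lognormal(0.0, 0.8, size=300)
print(bootstrap_ucl_ci(data, naive_ucl))
```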
* The impact of the number of data points on the control limit and Cpk is not evaluated.
In general, when there are more than 300 data points, the confidence intervals are narrow and negligible. In contrast, when there are fewer than 30, the confidence intervals are too wide, and sometimes a conclusion made by my macro is not usable.
Countermeasure
- A multivariate method may also help you here.
If you deal with several parameters at the same time, the number of data points effectively increases.
Features of the coding
* Convenient data entry
* The robust-regression engine is faster and more accurate than the Microsoft Excel Solver.
The number of data points should be over 100, or at least 30.
If there are fewer than 30, the control limit is sometimes not correct: it can be too high or too low.
The control limit can also be strongly affected by which points are removed as outliers.
Method to find the best-fit normal distribution
1. The measured data above median + σ are plotted on a probability plot.
2. Conduct a linear regression analysis on those points.
- The strength of the robustness can be changed or selected by you.
- The regression is automatically weighted.
3. The fitted straight line corresponds to the best-fit normal distribution.
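Below is a minimal Python sketch of these three steps under my own assumptions: the tail is taken as the points above roughly the 84th percentile (about median + σ, i.e. the upper ~16% of the data), the weighting is a simple Huber-style iteratively re-weighted fit rather than the macro's own solver, and the UCL is placed at Z = 2.78 to match the one-sided 99.73rd-percentile note near the end of this page.

```python
import numpy as np
from scipy import stats

def tail_fit_ucl(x, tail_q=0.84, z_ucl=2.78, iters=20, c=1.345):
    """Approximate the right tail by a normal distribution on a probability plot."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    p = (np.arange(1, n + 1) - 0.5) / n              # plotting positions
    z = stats.norm.ppf(p)                            # theoretical normal quantiles
    zt, xt = z[p >= tail_q], x[p >= tail_q]          # keep only the right tail (~16% of data)
    w = np.ones_like(xt)
    for _ in range(iters):                           # Huber-style re-weighted straight-line fit
        slope, intercept = np.polyfit(zt, xt, 1, w=np.sqrt(w))  # polyfit squares the weights
        r = xt - (intercept + slope * zt)
        s = np.median(np.abs(r)) / 0.6745 + 1e-12                # robust residual scale
        w = np.minimum(1.0, c / np.maximum(np.abs(r) / s, 1e-12))  # down-weight outliers
    mu_hat, sigma_hat = intercept, slope             # tail line: x = mu + sigma * z
    return mu_hat + z_ucl * sigma_hat, mu_hat, sigma_hat

rng = np.random.default_rng(0)
data = rng.lognormal(0.0, 0.8, size=300)             # skewed example data
ucl, mu, sigma = tail_fit_ucl(data)
print(f"UCL = {ucl:.3f} (mu = {mu:.3f}, sigma = {sigma:.3f})")
```

The slope and intercept of the fitted tail line are the σ and μ of the approximating normal distribution, so the whole calculation reduces to one weighted straight-line fit.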
Features:
- Normal-function approximation.
- The fitting range is the right tail, where the upper spec limit or upper control limit is located.
- The calculated control limit and Cpk are robust because they always come from a normal-function approximation.
- The control limit and Cpk provided by Minitab's IDI are not robust and are not usable as an indicator: IDI mainly uses the data around the average or median to find the best-fit function, but the spec and control limits are not located there.
IDI (Individual Distribution Identification): selects the best-fit distribution function.
I think that IDI overfits and does not fit well around the spec or control limit.
Therefore it is mathematically interesting but practically meaningless.
Z = 2.78 corresponds to the 99.73rd percentile of a one-sided parameter, as in the example above.
That is the same (1 - false alarm rate) as an Xbar ± 3 sigma limit on a two-sided distribution.
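You can check both numbers with two lines of Python:

```python
from scipy.stats import norm
print(norm.ppf(0.9973))            # ~2.782: one-sided Z for the 99.73rd percentile
print(norm.cdf(3) - norm.cdf(-3))  # ~0.9973: coverage of a two-sided +/- 3 sigma band
```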