
Sample spaces
The sample space is the set of all possible outcomes of a measurement. For example, if a feature column in a dataset records the number of days last month on which a respondent watched television, then the sample space is the set of integers {0, 1, 2, ..., 31}. If a manufacturing tool measures the absolute temperature difference before and after processing a widget, then the sample space is the continuous range [0, maxT], where maxT is the highest temperature that the tool can measure. Data outside the sample space can be a sign of misreporting or a systematic misunderstanding of the problem statement, and should trigger further investigation.
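As a minimal sketch of how such a check might look in practice, the following Python snippet flags values that fall outside the two sample spaces described above. The helper names (`out_of_sample_space`, `valid_tv_days`, `valid_temp_diff`), the sample data, and the `MAX_T` value of 500.0 are all hypothetical and chosen purely for illustration.

```python
def out_of_sample_space(values, is_valid):
    """Return (index, value) pairs for entries outside the sample space."""
    return [(i, v) for i, v in enumerate(values) if not is_valid(v)]


def valid_tv_days(v):
    # Discrete sample space: the integers 0 through 31.
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(v, int) and not isinstance(v, bool) and 0 <= v <= 31


MAX_T = 500.0  # assumed upper limit of the tool; not taken from the text


def valid_temp_diff(v):
    # Continuous sample space: the closed range [0, MAX_T].
    # NaN fails the comparison, so it is reported as out of range too.
    return (
        isinstance(v, (int, float))
        and not isinstance(v, bool)
        and 0 <= v <= MAX_T
    )


# Made-up example data
tv_days = [3, 0, 31, 45, -1, 12]
temp_diffs = [12.5, 0.0, 612.3, 87.1]

bad_tv = out_of_sample_space(tv_days, valid_tv_days)
bad_temp = out_of_sample_space(temp_diffs, valid_temp_diff)

print(bad_tv)    # [(3, 45), (4, -1)]
print(bad_temp)  # [(2, 612.3)]
```

The flagged entries are not necessarily errors to be dropped; they are the points that deserve a closer look before any modeling begins.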
The concept of a sample space seems trivial, but it's vital for good data mining practice. Not only does it help you to identify outliers and missing or wrong data points, it also helps to orient your mind to the task at hand and to understand what the data is meant to represent. Ask yourself this question before you get started: "What is my sample space?"