First steps
in Data Mining
with Weka
Łukasz Kobyliński & Radosław Szmit
Codepot 2015
What is Data Mining?
Data Mining is a process of discovering hidden information in data.
https://visualisingadvocacy.org/blog/disinformation-visualization-how-lie-datavis
Typical applications
Customer analysis
- which customers are likely to increase their purchases?
- which products are more likely to sell to my customers?
Typical applications
Text mining
- what is the category of this email we have received?
- is this product review positive or negative?
- what are they saying about me on Twitter?
Typical applications
Image mining
- which images in my collection contain cats?
- which of my contacts are visible in these photos?
- what is the sex and age of these people?
Data Mining Methods
- Regression analysis
- Classification
- Cluster analysis
- Association rule mining
- Sequence mining
- Anomaly detection
- ...
Task #1
Assign names to flowers
How do they differ?
iris-versicolor,
iris-setosa,
iris-virginica
iris-setosa has low values for all parameters except sepal-width (wide sepals).
iris-versicolor has medium values for all parameters.
iris-virginica has high values for all parameters except sepal-width.
Fisher's iris dataset
petals and sepals
Task #2
Answer the questions:
- iris-setosa has:
  - low values of
  - high values of
- iris-virginica has:
  - low values of
  - high values of
- the three classes are best separated by:
  - sepallength
  - sepalwidth
  - petalwidth
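One way to approach these questions outside the Weka GUI is to compare per-class averages of each attribute. The sketch below is only an illustration: it uses six rows of Fisher's iris dataset (two per class) rather than all 150, and the helper name `class_means` is made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Six rows of Fisher's iris dataset (two per class), in the order:
# sepallength, sepalwidth, petallength, petalwidth, class.
# This is a small subset chosen for illustration, not the full dataset.
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "iris-virginica"),
]
ATTRIBUTES = ["sepallength", "sepalwidth", "petallength", "petalwidth"]


def class_means(rows):
    """Return {class_name: {attribute: mean value}} (hypothetical helper)."""
    grouped = defaultdict(list)
    for *values, label in rows:
        grouped[label].append(values)
    return {
        label: {name: mean(col) for name, col in zip(ATTRIBUTES, zip(*vals))}
        for label, vals in grouped.items()
    }


means = class_means(ROWS)
for label, attrs in means.items():
    print(label, {name: round(value, 2) for name, value in attrs.items()})
```

Attributes whose class averages lie far apart are good candidates for separating the classes; on the full dataset the same comparison can be read off Weka's attribute histograms.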
Task #3
Use the rules.PART classifier on the iris dataset and answer the questions:
- which is better on this dataset: J48 or PART?
- how many examples of iris-versicolor have been classified as iris-setosa?
- how many examples of iris-virginica have been classified as iris-versicolor?
- what is the accuracy of the classifier on the training set?
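The questions above are answered by reading a confusion matrix, where rows are the actual classes and columns are the classes assigned by the classifier. The following minimal sketch shows how such a matrix and the accuracy are computed; the labels are made up for illustration and are not Weka results, and the function names are hypothetical.

```python
from collections import Counter

CLASSES = ["iris-setosa", "iris-versicolor", "iris-virginica"]


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; rows are actual classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in CLASSES] for a in CLASSES]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels for illustration only; these are NOT Weka output.
actual = ["iris-setosa", "iris-setosa", "iris-versicolor",
          "iris-versicolor", "iris-virginica", "iris-virginica"]
predicted = ["iris-setosa", "iris-setosa", "iris-versicolor",
             "iris-virginica", "iris-virginica", "iris-versicolor"]

matrix = confusion_matrix(actual, predicted)
for name, row in zip(CLASSES, matrix):
    print(row, "<- actual", name)
print("accuracy:", accuracy(matrix))
```

The cell in row "iris-virginica" and column "iris-versicolor" counts the iris-virginica examples classified as iris-versicolor, which is the kind of number the task asks for.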
Task #4
Use the J48 classifier on the iris dataset and answer the questions:
- use the tree visualization pane to manually perform classification of the following example:
  - sepallength=6.7, sepalwidth=3.0
  - petallength=5.0, petalwidth=1.7
- are all the attributes used in the classifier?
- which instances (by instance number) of iris-versicolor were misclassified as iris-virginica? (use the Visualize classifier errors panel)
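Manual classification with a decision tree means starting at the root, comparing one attribute with a threshold at each node, and following the matching branch until a leaf is reached. The sketch below illustrates that walk on a hand-written tree. The thresholds are invented for this example and are not the tree that J48 builds, so it does not answer the task.

```python
# Hypothetical tree for illustration: the attribute names come from the
# iris dataset, but the thresholds are invented and are NOT J48 output.
TREE = {
    "attribute": "petalwidth", "threshold": 0.8,
    "low": "iris-setosa",
    "high": {
        "attribute": "petallength", "threshold": 4.8,
        "low": "iris-versicolor",
        "high": "iris-virginica",
    },
}


def classify(node, example):
    """Follow 'low' when value <= threshold, else 'high', until a leaf."""
    path = []
    while isinstance(node, dict):
        value = example[node["attribute"]]
        branch = "low" if value <= node["threshold"] else "high"
        op = "<=" if branch == "low" else ">"
        path.append(f'{node["attribute"]} {op} {node["threshold"]}')
        node = node[branch]
    return node, path


example = {"sepallength": 5.1, "sepalwidth": 3.5,
           "petallength": 1.4, "petalwidth": 0.2}
label, path = classify(TREE, example)
print(label, "via", path)
```

Note that this illustrative tree never looks at sepallength or sepalwidth; checking which attributes appear in the nodes is how to answer whether a tree uses all of them.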
Data: the evolution
- Big Data
- Data Mining and Knowledge Discovery
- Data Warehousing
- Data Access
- Data Collection
Data: the evolution
- Big Data: "What’s likely to happen to online sales, considering 1M visits/day?"
- Data Mining and Knowledge Discovery: "What’s likely to happen to Boston unit sales next month? Why?"
- Data Warehousing: "What were unit sales in New England last March? Drill down to Boston."
- Data Access: "What were unit sales in New England last March?"
- Data Collection: "What was my total revenue in the last five years?"