TUTORIAL: DataMining with FuzzyLogic & an Implemented JAVA Tool


VIP Member

It's time again to write something 'sophisticated' for creative Java developers. DataMining (or DM) is one of the most confusing buzzwords nowadays. If you surf the web for a definition or explanation you could end up more confused than enlightened. Why? Some "DM pundits" dispute LITERALLY over the word MINING. Some say "it's Data Extraction". Others say "it's Data Collection and Processing".

(source: datasolut.com)

The problem with such verbal "brawls" is that there are too few truly creative IT developers. Creativeness here means self-development, not adeptness at working with a (software) tool. And (IT) companies contribute to this worsening trend with their "usage-knowledge" requirements. The concentration of knowledge lies in a few creative, elitist development groups which develop and shape the IT world. The rest is a crowd of IT gurus who work with the products. These IT gurus are, in a broad sense, NOT real IT developers, but merely excellent IT users. For example, some GUI developers design stunning GUI appearances, but in reality an IDE does the coding work based on their design. Such excellent IT gurus have plenty of time to "brawl" about words and definitions. Whatever the definition of DataMining is, I don't intend to join the verbal fray and will stay with the known term: DataMining.

The essence of DataMining is to find specific patterns among a heap of unformatted data. The heap could be one file, several files or millions of files: BigData. Again, the term BigData (BD) is a brawling point among BD pundits. For me, BD starts with unformatted data, regardless of its origin: from one file or from millions of files.

A creative (JAVA) developer is one who knows how to extract and process unformatted data without being dependent on any existing tool. And that is the literal meaning of "to extract". The reason is that a DataMining tool is usually universal, built for "all" cases: an overhead if you have to extract and process some specific data which is unique to your company. It's like buying a cannon to kill a mouse. Before I show you how to build your own DataMining tool, some words about DataMining technology and technique.

Unformatted data are raw data and don't follow any rule. For example: data from survey sheets, data from emails, etc. Extracting the most relevant data out of a heap of unformatted data requires a special tool which is usually NOT available and has to be developed. If you don't belong to the creative, elitist IT developer group you could click HERE to find the tool you think serves you best.

The most important essence of DataMining is to understand the data, to reckon their structure and to recognize their (inter)relationships. Without that understanding you could get lost in the maze of unstructured, unformatted data.

Before you start to mine the data you have to explore the data. To do that you need to set a goal and then work step by step towards that goal. For example: you have a heap of surveyed data. The data comprise shopping behavior, age, fashion (trendy or conservative) and taste. And your goal is to profile the consumers who are the most willing to spend. So you start with the shopping behavior and work down to their taste. Step by step.

From the "mined data" (the results of your step-by-step work) you could establish the relationships between the extracted data and categorize the consumers into several groups. The categorization could deliver a view of the groups (linear regression or clustering)



and the relationships could determine the shopping behavior of the different groups. For example, women older than 40 won't buy colorful clothing, while teenagers between 14-20 prefer "tattered" jeans and young women between 20-40 like fashionable apparel. Within the groups you could derive a linear regression of the shopping behavior or taste.

In general, DataMining is intensive analysis work: an ETO (Extract-Transform-Optimize). The ETO process extracts the data and transforms them into a model (modelling) so that the data can be optimized for further evaluation.
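The ETO process can be sketched as three small stages. The following is a hypothetical illustration only; all class and method names here are made up for this sketch and are not part of the tool built later:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical ETO sketch: Extract raw answer lines, Transform them
// into a question->answer model, Optimize by keeping only goal entries.
public class EtoSketch {
  // Extract: pull the answer lines out of an unformatted survey text.
  static List<String> extract(String raw) {
    return Arrays.stream(raw.split("\n"))
                 .filter(line -> line.contains("?"))
                 .collect(Collectors.toList());
  }
  // Transform: model each line as question -> answer (the residue).
  static Map<String,String> transform(List<String> lines) {
    Map<String,String> model = new LinkedHashMap<>();
    for (String line : lines) {
      int q = line.lastIndexOf('?');
      model.put(line.substring(0, q + 1).trim(), line.substring(q + 1).trim());
    }
    return model;
  }
  // Optimize: keep only the entries that matter for the evaluation goal.
  static Map<String,String> optimize(Map<String,String> model, Set<String> goal) {
    Map<String,String> out = new LinkedHashMap<>(model);
    out.keySet().retainAll(goal);
    return out;
  }
  public static void main(String[] args) {
    String raw = "How old are you? 38\nAre you male or female (male, female)? female";
    Map<String,String> m = optimize(transform(extract(raw)),
                                    Set.of("How old are you?"));
    System.out.println(m); // only the age entry remains
  }
}
```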

(next: DataMining with ETO process)


DataMining with ETO process

1) Data Classification
As mentioned, DataMining starts with a determined goal: you classify the data so that you can work successively down to that goal. For example, you want to know which customers in your surveyed data (or in your customer database) are solvent and female. The classification by creditworthiness and gender eliminates the superfluous data and facilitates the subsequent analysis.
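A minimal sketch of such a classification step; the Customer fields and names are invented for this illustration:

```java
import java.util.*;

// Hypothetical sketch of Data Classification: keep only the solvent,
// female customers and drop the superfluous records.
public class ClassifySketch {
  static class Customer {
    final String name, gender; final boolean solvent;
    Customer(String name, String gender, boolean solvent) {
      this.name = name; this.gender = gender; this.solvent = solvent;
    }
  }
  static List<Customer> classify(List<Customer> all) {
    List<Customer> kept = new ArrayList<>();
    for (Customer c : all)
      if (c.solvent && c.gender.equals("female")) kept.add(c); // the goal
    return kept;
  }
  public static void main(String[] args) {
    List<Customer> all = Arrays.asList(
        new Customer("Ann", "female", true),
        new Customer("Bob", "male",   true),
        new Customer("Eve", "female", false));
    System.out.println(classify(all).size()); // 1: only Ann remains
  }
}
```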

2) Interrelationship (or Association Rule Learning)
The detection of interrelationships within the data reveals the dependency or the complement of one datum to another. This detection is the most popular marketing algorithm, used to reveal the hidden potential value of the data, e.g. for product optimization or for enhancement of market share. For example, teenagers love wearing "tattered" jeans.
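The simplest building block of Association Rule Learning is the support count: how often two answers occur together in the same record. A hypothetical sketch (the data and names are made up):

```java
import java.util.*;

// Hypothetical sketch: count how often two answers appear together
// ("support" in Association Rule Learning).
public class SupportSketch {
  static int support(List<Set<String>> surveys, String a, String b) {
    int n = 0;
    for (Set<String> s : surveys)
      if (s.contains(a) && s.contains(b)) ++n; // both answers present
    return n;
  }
  public static void main(String[] args) {
    List<Set<String>> surveys = List.of(
      Set.of("teenager", "jeans", "tattered"),
      Set.of("teenager", "suit"),
      Set.of("teenager", "jeans"));
    // how many surveyees are teenagers AND bought jeans?
    System.out.println(support(surveys, "teenager", "jeans")); // 2
  }
}
```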

3) Data Visualization
With the classification of the data and their interrelationships, the data can be visualized in a coordinate system: either as linear regression or as clustering. The visualization of data allows you to detect the anomalies which could affect your decisions or products.

4) Anomaly (or Outlier) Detection
The anomalies in data are (early) warnings or hints about the (market or product) situation. For example, the shopping anomaly before Christmas or before (national, religious) holidays.

5) Forecasting
Data which have been collected over months or years could reveal some valuable information about a product or about the behavior of customers. For example, the data of a bike shop give information about which bike model sold the most. With this mined information the shop owner can stock more of this model and reduce the slow-selling models.
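A minimal sketch of such a "best seller" forecast; the bike-shop data and names are made up:

```java
import java.util.*;

// Hypothetical sketch of Forecasting: find the best-selling bike model
// in historical sales records.
public class ForecastSketch {
  static String bestSeller(List<String> sales) {
    Map<String,Integer> count = new HashMap<>();
    String best = null;
    for (String model : sales) {
      int n = count.merge(model, 1, Integer::sum); // tally this sale
      if (best == null || n > count.get(best)) best = model;
    }
    return best;
  }
  public static void main(String[] args) {
    System.out.println(bestSeller(List.of("MTB", "Racer", "MTB", "City", "MTB"))); // MTB
  }
}
```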

6) Decision tree
The decision tree varies from business to business. A catering business, for example a cafeteria with an outdoor garden, requires some attention to the weather to decide whether to put tables and chairs in the garden: rainy, sunny, cloudy or windy.
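The cafeteria decision tree can be sketched as a few nested decisions; this is a hypothetical illustration with made-up thresholds:

```java
// Hypothetical decision-tree sketch for the cafeteria example:
// decide whether to put tables and chairs in the garden.
public class GardenDecision {
  static boolean putTablesOutside(String weather, int windKmh) {
    switch (weather) {
      case "sunny":  return true;
      case "cloudy": return windKmh < 30; // acceptable if not too windy
      default:       return false;        // rainy, stormy, ...
    }
  }
  public static void main(String[] args) {
    System.out.println(putTablesOutside("sunny", 10));  // true
    System.out.println(putTablesOutside("cloudy", 40)); // false
    System.out.println(putTablesOutside("rainy", 0));   // false
  }
}
```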

Back to the term BigData. As you may have noticed, I talked about the data "in your surveyed data (or in your customer database)". Data are data regardless of where they come from. Data in a database are formatted and structured only in some sense; they still need some "DataMining" so that their hidden values can be revealed and meaningfully processed. So much about DataMining and BigData.

The brief exploration into the realm of DataMining and BigData gives you an overview of the complexity and diversity of the two fields. Therefore a universal DataMining tool can only cover certain aspects, and its application is in itself as complex as DataMining/BigData: it requires its users to be experienced data specialists. I will show you how to implement an API based on the well-known MapReduce Algorithm, or MRA for short. This link HERE is for those who want to learn more about MRA.
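Before the API itself, the Map-Reduce idea can be felt in a few lines of plain Java. A minimal word-count sketch (the classic MRA illustration; this is NOT the PatternMining implementation, and the names are made up):

```java
import java.util.*;
import java.util.stream.*;

// Minimal Map-Reduce illustration: map each word to single occurrences,
// then reduce by key into a frequency table -- in parallel.
public class MapReduceSketch {
  static Map<String,Long> wordCount(List<String> documents) {
    return documents.parallelStream()                                     // the "worker cluster"
        .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\W+")))   // Map step
        .filter(w -> !w.isEmpty())
        .collect(Collectors.groupingBy(w -> w, Collectors.counting()));   // Reduce step
  }
  public static void main(String[] args) {
    List<String> docs = List.of("jeans and jeans", "jeans or suit");
    System.out.println(wordCount(docs).get("jeans")); // 3
  }
}
```

The parallelStream() plays the role of the distributed workers; the collector merges the partial results, which is exactly the Reduce step of MRA.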

The mentioned Interrelationship or "Association Rule Learning" signifies something intelligent. Yes, it is. Artificial Intelligence and Machine Learning are the best preconditions for successful DataMining. I have shown you how FuzzyLogic (FL) can be applied to AI-ML with the FLDrone example (click HERE and HERE). I will show you later how to apply FuzzyLogic with MRA to DataMining.

Let's start with the Survey as an example. Your website starts a survey with the following "formula":


To simulate the "BigData" generated by the "survey action" we create the following Java App: CreateSurvey.java
import java.io.*;
import java.util.*;
import java.nio.file.*;
public class CreateSurvey {
  public static void main(String... a) throws Exception {
    Random ran = new Random( );
    int max = (a.length > 0)? Integer.parseInt(a[0]):100;
    File fi = new File("c:\\jfx\\datamining\\survey.txt");
    String[] lines = (new String(Files.readAllBytes(fi.toPath()))).replace("\r", "").split("\n");
    (new File("c:\\jfx\\datamining\\survey")).mkdirs(); // the "BigData" directory
    StringBuilder sb = new StringBuilder();
    for (int i = 1; i < max; ++i) { // create max-1 surveys (default: 99)
      sb.setLength(0);              // reset the buffer for this survey
      sb.append(lines[0]+"\n\n");   // the survey title
      // age: between 16 and 60
      int I = 0;
      while (I < 16 || I > 60) I = ran.nextInt(100);
      sb.append(lines[2]+" "+I+"\n");
      // male/female
      sb.append(lines[3]+((ran.nextInt(10) > 5)?" female":" male")+"\n");
      // Reason
      I = ran.nextInt(3);
      if (I == 0) sb.append(lines[4]+" shopping\n");
      else if (I == 1) sb.append(lines[4]+" by chance\n");
      else sb.append(lines[4]+" curiosity\n");
      // buy
      I = ran.nextInt(5);
      if (I == 0) sb.append(lines[5]+" nothing\n");
      else if (I == 1) sb.append(lines[5]+" shirt\n");
      else if (I == 2) sb.append(lines[5]+" jeans\n");
      else if (I == 3) sb.append(lines[5]+" skirt\n");
      else sb.append(lines[5]+" suit\n");
      // hue
      I = ran.nextInt(5);
      if (I == 0) sb.append(lines[6]+" light\n");
      else if (I == 1) sb.append(lines[6]+" bright\n");
      else if (I == 2) sb.append(lines[6]+" dark\n");
      else if (I == 3) sb.append(lines[6]+" colorful\n");
      else sb.append(lines[6]+" plain\n");
      // how much
      sb.append(lines[7]+" "+ran.nextInt(1000)+"\n");
      // other sites
      I = ran.nextInt(2);
      if (I == 0) sb.append(lines[8]+" yes"+"\n");
      else sb.append(lines[8]+" no"+"\n");
      // verdict
      I = ran.nextInt(3);
      if (I == 0) sb.append(lines[9]+" good"+"\n");
      else if (I == 1) sb.append(lines[9]+" bad"+"\n");
      else sb.append(lines[9]+" mediocre"+"\n");
      // appearance
      I = ran.nextInt(2);
      if (I == 0) sb.append(lines[10]+" yes\n");
      else sb.append(lines[10]+" no\n");
      // content
      I = ran.nextInt(4);
      if (I == 0) sb.append(lines[11]+" good\n");
      else if (I == 1) sb.append(lines[11]+" bad\n");
      else if (I == 2) sb.append(lines[11]+" confusing\n");
      else sb.append(lines[11]+" inadequate\n");
      // write out this survey
      FileOutputStream fout = 
          new FileOutputStream("c:\\jfx\\datamining\\survey\\survey_"+i+".txt", false);
      fout.write(sb.toString().getBytes());
      fout.close();
    }
  }
}
Then create the Survey template: survey.txt
A Customer survey

How old are you?
Are you male or female (male, female)?
Why you visit this site (shopping, by chance, curiosity)?
What have you bought (nothing, shirt, jeans, skirt, suit)?
What color hue you like (light, bright, dark, colorful, plain)?
How much you could pay for online shopping (US$)?
Have you visited other shopping sites (yes, no)?
Your verdict to our site (good, bad, mediocre)?
Is because of the appearance (yes, no)?
is the content (good, bad, confusing, inadequate)?
Compile and run the app as follows:
C:\JFX\DataMining>javac -g:none -d ./classes CreateSurvey.java
C:\JFX\DataMining>java CreateSurvey
CreateSurvey creates a new directory "survey" with (default) 99 "responded" surveys. One of the responded surveys looks as follows (e.g. survey_10.txt):
A Customer survey

How old are you? 38
Are you male or female (male, female)? female
Why you visit this site (shopping, by chance, curiosity)? shopping
What have you bought (nothing, shirt, jeans, skirt, suit)? skirt
What color hue you like (light, bright, dark, colorful, plain)? dark
How much you could pay for online shopping (US$)? 743
Have you visited other shopping sites (yes, no)? no
Your verdict to our site (good, bad, mediocre)? bad
Is because of the appearance (yes, no)? yes
is the content (good, bad, confusing, inadequate)? confusing
(Next: the API PatternMining)


PatternMining APIs

To build a DataMining tool we should start with the APIs as the foundation for any further development. As said, DataMining is only meaningful when the data are huge, confusing and unstructured or unformatted. The data could be a heap of files (usually text files), a single big file (e.g. a log file of activities) or a list of the most recent links to some business web pages. Knowing that, the DataMining API tools must be absolutely performant; otherwise it's useless to run a tool that takes hours or days to finish.

The MapReduce Algorithm, or MRA, is one of the search algorithms invented by Google. MRA is quick and easy to implement. What is it? The beginning of the Wikipedia article about "MapReduce" says:
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster...
Yes, PARALLELISM is the most important key to accelerating the "search" for the quickest result. With this feature in mind we should develop our DataMining APIs based on parallelism. I have introduced Parallel Programming to you in this forum (click HERE for more details). However, we should be careful to distinguish between Concurrency and Parallelism. At first glance they seem to be the same, but they aren't. Parallelism means the tasks are processed simultaneously by different processors, while concurrency means the tasks merely overlap in time: interleaved on one processor (pseudo-parallel) or spread over several processors (truly parallel). More about Concurrency and Parallelism: click HERE.

In geometry and trigonometry it is said that two parallel lines never cut each other, i.e. the angle between 2 parallel lines is 0 degrees (or 0 radians). In other words: the cosine of the angle between two parallel directions is ONE (cos(0°) = 1). Hence the nice term "Cosine Similarity", well known from document search.
Cosine Similarity lies between ZERO and ONE. ZERO means there is NO similarity at all, and ONE is parallelism (the SAME or SIMILAR). The transformation to MapReduce is as follows: a search pattern with a CosineSimilarity of ~1 is the most present in the heap of data. The smaller the CosineSimilarity, the less common the search pattern is in the data heap.
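A minimal sketch of the Cosine Similarity computation on two frequency vectors (illustrative only; the class name is made up, this is not the Group/PatternMining source):

```java
// Cosine similarity between two frequency vectors: ~1 means "parallel"
// (same direction), 0 means no similarity at all.
public class CosineSketch {
  static double cosine(double[] x, double[] y) {
    double dot = 0, nx = 0, ny = 0;
    for (int i = 0; i < x.length; ++i) {
      dot += x[i] * y[i];   // projection of x onto y
      nx  += x[i] * x[i];   // squared length of x
      ny  += y[i] * y[i];   // squared length of y
    }
    if (nx == 0 || ny == 0) return 0; // an empty vector has no direction
    return dot / (Math.sqrt(nx) * Math.sqrt(ny));
  }
  public static void main(String[] args) {
    System.out.println(cosine(new double[]{1,2,0}, new double[]{2,4,0})); // ~1.0 (parallel)
    System.out.println(cosine(new double[]{1,0},   new double[]{0,1}));   // 0.0 (no similarity)
  }
}
```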

Let's start with the APIs.
The DataMining package consists of 3 APIs:

  1. Group: the document property. Each document has a set of "grouped" properties (document name, document size, frequency of each search pattern, the possible extension and its Cosine Similarity). The possible extension is the "residue" of a search pattern, which can vary and is used for a later individual verification. For example: the question "Are you male or female?" is the search pattern and its residue is either "male" or "female".
  2. PatternMining: the main DataMining part. In this API the pattern searches are processed in parallel using the available JAVA technology: ForkJoinPool and ForkJoinTask in conjunction with Future< T > and Collections.synchronizedList.
  3. PatternQuery: the FuzzyLogic part of PatternMining (or DataMining). This API is a part of the FuzzyLogic package (see HERE). The sources of the FuzzyLogic package are available on demand: just inbox me your email and I'll email them to you.

The sources of Group.java, PatternMining.java and PatternQuery.java are included, together with 2 examples, in the CDJ.zip which can be downloaded from HERE (with the last session).

How does the PatternMining work? Let's examine a questionnaire:
A Customer survey

How old are you? 48
Are you male or female (male, female)? female
Why you visit this site (shopping, by chance, curiosity)? by chance
What have you bought (nothing, shirt, jeans, skirt, suit)? jeans
What color hue you like (light, bright, dark, colorful, plain)? plain
How much you could pay for online shopping (US$)? 756
Have you visited other shopping sites (yes, no)? yes
Your verdict to our site (good, bad, mediocre)? bad
Is because of the appearance (yes, no)? no
is the content (good, bad, confusing, inadequate)? bad
It's a plain text file. The search patterns are nothing other than the questions themselves (e.g. pattern 1 is "How old are you?" and the residue is "48"). The search starts at the first line and ends at the last line. The challenge is how to parallelize the search for all patterns at the same time. If you ponder the algorithm you can see that the search can be divided into n processes, where n is the number of search patterns. That works because the searches on a common data heap won't interfere with each other (e.g. no modification or lock). The JAVA parallelism implementation in PatternMining is as follows:
  private Group checkPattern(byte[] bb, String name) throws Exception {
    Group g  = new Group();
    g.css    = 0;
    g.doc    = name;
    g.size   = bb.length;
    g.fq     = new int[ba.length];
    g.values = new String[ba.length];
    List<Future<Integer>> Fut = new ArrayList<>();
    for (int a = 0; a < ba.length; ++a) {
      final int l = a; // search each pattern in its own parallel task
      Fut.add(ForkJoinPool.commonPool().submit(()-> {
        for (int j = 0, k = 0; k < bb.length; ++k) {
          if ((bb[k] | 0x20) != ba[l][j]) j = 0;  // case-insensitive compare
          else if (++j == ba[l].length) {         // the whole pattern matched
            ++g.fq[l]; // found
            if (delim != ' ') { // pick up the residue behind the pattern
              if (delim != '\n') {
                for (j = k; k < bb.length; ++k) if (bb[k] == (byte)delim) break;
              } else { // till end of the line
                for (j = ++k; k < bb.length; ++k) if (bb[k] == '\r' || bb[k] == '\n') break;
              }
              g.values[l] = (new String(bb, j, k-j)).trim();
            }
            if (once) return 0; // "Once" mode: the first hit is enough
            j = 0;              // "Multiple" mode: keep searching
          }
        }
        return 0;
      }));
    }
    for (Future<Integer> f : Fut) f.get(); // wait until all parallel searches are done
    return g;
  }
And an example that drives the API, SurveyTest.java:
import java.io.*;
import java.util.*;
import java.nio.file.*;
import datamining.PatternMining;
// Joe Nartca (C)
public class SurveyTest {
  public static void main(String... args) throws Exception {
    String[] a = args;
    if (a.length < 2) {
      a = new String[2];
      a[0] = "c:/jfx/datamining/survey";
      a[1] = "c:/jfx/datamining/text/surveylist.txt";
    }
    // set the min. and max. age
    int min = a.length > 2? Integer.parseInt(a[2]):20;
    int max = a.length > 3? Integer.parseInt(a[3]):50;
    File fi = new File(a[1]); // the Pattern List
    List<String> list = Arrays.asList((new String(Files.readAllBytes(fi.toPath()))).
                              replace("\r", "").split("\n"));
    // Start PatternMining
    PatternMining pm = new PatternMining(a[0], list, '\n', true);
    long time = System.currentTimeMillis();
    List<String> lst = pm.miningResults();
    System.out.printf("Time: %6.3f Sec. \n",((double)(System.currentTimeMillis()-time)/1000));
    if (a.length > 4) for (String s:lst) System.out.println(s);
    System.out.printf("HeapSize %6.3f KB, Number of Documents: %d\n",
                      pm.bigDataSize(), pm.documentCount());
    System.out.println("Frequency of 'Are you male or female (male, female)? male': "+
                       pm.patternFrequency("Are you male or female (male, female)?", "male"));
    System.out.printf("Frequency of 'How old are you?' between %d - %d: %d\n",min, max,
                       pm.patternFrequency("how old are you?", min, max));
  }
}
And the print-out:
C:\JFX\DataMining\examples>javac -g:none -d ./classes SurveyTest.java

C:\JFX\DataMining\examples>java SurveyTest
Time:  0,244 Sec.
HeapSize 53,107 KB, Number of Documents: 99
Frequency of 'Are you male or female (male, female)? male': 53
Frequency of 'How old are you?' between 20 - 50: 71

Before you run this example you have to create the "BigData" directory "survey" by compiling and then running the source CreateSurvey.java (see the previous session or the CDJ.zip file). And here is a JavaFX application: Survey.java (source is in the CDJ.zip)


The percentages are relative to the sum of all documents. They could overlap. For example: some of the 5% females could also appear in 2 other patterns, so the sum of all percentages could be more than 100%.

(Next: PatternQuery - PatternMining with FuzzyLogic)


PatternQuery - PatternMining with FuzzyLogic

I have shown you how to classify the data (the surveys), how to extract the relevant data from the surveys and how to structure the extracted data into groups so that they can be easily visualized (PieChart) using the derived percentages. In today's last session I show you how to establish the interrelationships between the data using Fuzzy Logic.

The API PatternMining produces the basic data for further processing. The procedure is done with the MapReduce Algorithm. The results:
  • Pattern frequency (with its individual appendix, the residue) within a document and in relation to the whole (all documents, or BigData).
  • The basic properties of each document (name, size, CosineSimilarity and the appendixes of all patterns found within the document).

There are two ways to determine the pattern frequency within a document:
  • Once: if the pattern is found, no further determination is made. In this case the pattern frequency is either ONE (found) or ZERO (not found), and the CosineSimilarity is also either 0 or 1.
  • Multiple: the opposite of Once. The pattern frequency within a document can be ZERO (not found) or N (found N times). N can be ONE (occurs only once) or GREATER than one (repetitive). The CosineSimilarity of this pattern then usually lies between 0 and 1.
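The Once/Multiple distinction can be sketched with a simple counting loop (an illustration only, not the PatternMining source; the names are made up):

```java
// Sketch of "Once" vs "Multiple" pattern counting: scan a document
// for a pattern and count its occurrences.
public class FrequencySketch {
  static int frequency(String doc, String pattern, boolean once) {
    int n = 0;
    for (int i = doc.indexOf(pattern); i >= 0; i = doc.indexOf(pattern, i + 1)) {
      ++n;
      if (once) break; // "Once": the result is 0 or 1, nothing more
    }
    return n;
  }
  public static void main(String[] args) {
    String doc = "jeans, suit, jeans, jeans";
    System.out.println(frequency(doc, "jeans", true));  // 1
    System.out.println(frequency(doc, "jeans", false)); // 3
  }
}
```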
The interrelationship between patterns requires some Artificial Intelligence rules. As I have shown you in the past, Machine Learning (and Deep Learning) works best with Fuzzy Logic. And there's NO exception with DataMining. With Fuzzy Logic the data can be better visualized and are more readable than with complicated mathematical rules. The basic operations of Fuzzy Logic are the set rules of Set Theory (Image 1):
  • AND (&&)
  • OR (||)
Image 1

With the two AND-OR operations and with the POLISH NOTATION we can build any interrelationship between the patterns. The combination of AND and OR allows us to work with even more complex expressions without losing the overview. Example:

Fuzzy Expression
1) [How old are you?]&&[Are you male or female?]
2) [How old are you?]&&[Are you male or female?]&&{[Why you visit this site (shopping, by chance, curiosity)?]||[What have you bought (nothing, shirt, jeans, skirt, suit)?]}

Fuzzy Data
1) 16-20, male
2) 16-20, male, shopping, shirt
With the Fuzzy Expressions and their possible Fuzzy Data, the interrelationships can be built from the surveyed data:
  • with an age between 16 and 20
  • male
The SET operator AND (&&) between the two patterns gives us the interrelationship of the two patterns and enables a clearer understanding of the data: males between 16 and 20 years old.

As usual, in reality we have to set some conventions for our Fuzzy Query. The API PatternQuery expects not only the Fuzzy Expression, but also its related Fuzzy Data. The comprehensive conventions are:
  • $ symbolizes a range of 2 numbers: from-to (min-max). If only ONE number is needed then min equals max. Example: min = 30, max = 30; the expected "range" is just 30. The syntax: $min-max
  • Fuzzy Data are the expected answers among several possibilities. Example: male and female are the data; the expected answer could be either male or female.
  • && is the set operator AND (as in Java)
  • || is the set operator OR (as in Java)
  • A pattern must be enclosed in square brackets: [pattern]
  • Combined patterns must be enclosed in curly brackets: { ... }
  • Unlimited combination of && and ||
  • Polish Notation for a complex expression. Example: [pattern_1]&&[pattern_2]&&{[pattern_3]||[pattern_4]}
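To make the conventions concrete, here is a tiny, hypothetical evaluator for the bracket notation (NOT the PatternQuery source; it only illustrates how [pattern], { ... }, && and || combine, evaluated left to right, against the set of patterns a document satisfies):

```java
import java.util.*;

// Illustrative evaluator for the bracket convention above:
// [pattern] is true when the document satisfies that pattern,
// { ... } groups, && and || combine left to right.
public class FuzzyExprSketch {
  private final String expr;
  private final Set<String> satisfied;
  private int pos;

  FuzzyExprSketch(String expr, Set<String> satisfied) {
    this.expr = expr;
    this.satisfied = satisfied;
  }

  boolean eval() { pos = 0; return expression(); }

  // expression := term (("&&" | "||") term)*  -- evaluated left to right
  private boolean expression() {
    boolean v = term();
    while (pos < expr.length() &&
           (expr.startsWith("&&", pos) || expr.startsWith("||", pos))) {
      boolean and = expr.charAt(pos) == '&';
      pos += 2;
      boolean r = term();
      v = and ? (v && r) : (v || r);
    }
    return v;
  }

  // term := "[" pattern "]" | "{" expression "}"
  private boolean term() {
    if (expr.charAt(pos) == '[') {
      int end = expr.indexOf(']', pos);
      boolean v = satisfied.contains(expr.substring(pos + 1, end));
      pos = end + 1;
      return v;
    }
    ++pos;                      // skip '{'
    boolean v = expression();
    ++pos;                      // skip '}'
    return v;
  }

  public static void main(String[] args) {
    Set<String> doc = Set.of("How old are you?", "male", "shopping");
    System.out.println(new FuzzyExprSketch(
        "[How old are you?]&&[male]&&{[shopping]||[jeans]}", doc).eval()); // true
    System.out.println(new FuzzyExprSketch(
        "[female]&&{[shopping]||[jeans]}", doc).eval()); // false
  }
}
```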
Example (see Image 2, 3, 4):
  • Pattern_1: How old are you?, expected data: 16 - 20, frequency: 13.13%
  • Pattern_2: Are you male or female (male, female)?, expected data: male, frequency: 8.08%
The Fuzzy query gives 8. Meaning: only 8 of the 13 surveyees between 16 and 20 years old are male.

The following JavaFX example shows you how PatternQuery works with PatternMining. The app is Query.java and the source is included in the CDJ.zip for download.

The Query Procedure (Fuzzy Expression):

Image 2

And the result of the Query:

Image 3

Image 4

If you look at the percentages (Image 4) you can see that the sum of the 2 queries for males/females (green and magenta) equals the query of all surveyees between 16 and 20. The app Query.java is an extension of Survey.java (see the previous session). If you just want to visualize the percentage of each query you only need to select the question and the expected answer; the SHOW button then presents the piechart of all selected questions. Because the percentage of each question relates to the SUM of all documents, and a document can appear in other percentages too, the sum of all percentages could "overrun" 100%. In such a case the percentages that come after 100% is reached are suppressed (see Image 5, where the last 3 questions are suppressed).

Image 5

With the 3 little APIs:
  • Group.java (a POJO)
  • PatternMining.java
  • PatternQuery.java
you are in a good position to build your own special DataMining tool without having to spend months studying a generic DataMining tool available on the web (with or without fee), which is usually awkward to fit to your specific data requirements. Further, you can enhance and adapt the sources according to your (new) data requirements at any time and anywhere. In case you work with patterns without "residues" (i.e. answers), the PatternQuery parameters of the method xQuery that query for the answers (aList or reply) must be set to null.

  1. @Nancru and @Thanhpv: for downloading, the CDJ.zip here is the stable version.
  2. It's free software and is distributed as-is. The author is not responsible for any problem caused by the software. If you detect some bugs it would be nice and courteous to inform me on this forum (publicly, as a thread) so that I can enhance or correct them for the sake of the community. Thanks in advance.
  3. The source Query.java is written with JavaFX. Herein you can find some tutorial coding techniques such as:

  • TextInputDialog with digit-only input using the JDK-8 feature UnaryOperator: methods getInput() and getNumFormat()
  • ComboBox with repetitive selection of the same item: method createCombo()

To get the JAVA-style documents run the javadoc command in a Windows CMD as follows (here into the subdirectory doc):
C:\JFX\DataMining>javadoc -d ./doc *.java
Loading source file Group.java...
Loading source file PatternMining.java...
Loading source file PatternQuery.java...
Constructing Javadoc information...
Standard Doclet version 9.0.4
Building tree for all the packages and classes...
Generating .\doc\datamining\PatternMining.html...
Generating .\doc\datamining\PatternQuery.html...
Generating .\doc\datamining\package-frame.html...
Generating .\doc\datamining\package-summary.html...
Generating .\doc\datamining\package-tree.html...
Generating .\doc\constant-values.html...
Building index for all the packages and classes...
Generating .\doc\overview-tree.html...
Generating .\doc\index-all.html...
Generating .\doc\deprecated-list.html...
Building index for all classes...
Generating .\doc\allclasses-frame.html...
Generating .\doc\allclasses-noframe.html...
Generating .\doc\index.html...
Generating .\doc\help-doc.html...


For download, CDJ.zip contains:
  1. text/survey.txt: the template
  2. text/surveyList.txt: the list
  3. text/answer.txt: the list of all possible answers
  4. examples/datamining.jar
  5. examples/CreateSurvey.java
  6. examples/Survey.java
  7. examples/Query.java
  8. Group.java
  9. PatternMining.java
  10. PatternQuery.java
The 3 last sources (Group.java, PatternMining.java and PatternQuery.java) build the core of the DataMining package (datamining.jar).




New Member
It was very helpful for me when I was learning data mining, in particular how it works in Java.


For those who are interested in BIG-DATA: QuantCube is a French startup set up by a French-Vietnamese, Huynh Thanh Long

and how AI is used for DataMining.