Toggle Theme Editor
Slate Blueberry Blackcurrant Watermelon Strawberry Orange Banana Apple Emerald Chocolate Charcoal

Fuzzy Search And Data Mining

Discussion in 'Java Update' started by Joe, 6/5/19.

  1. Joe

    Joe Moderator

    In the previous articles I've written about "FuzzyLogic" that could be extremely useful in DataMining. Some readers emailed me and asked for "more" explanation about this particular issue. Well, here is the FuzzySearch explanation before I scoot to vacation for 2 months.

    FuzzySearch
    What is FuzzySearch? It's about the similarity of different objects. The similarity or the alikeness is expressed in some commonesses between objects. Example: dogs have more similarity with foxes than cats or pigs.

    In a huge heap of data or BigData it's difficult to find things that share the same "similarity". For example: if you gather all the daily-online papers of today and try to look for the name "Dai Tuong Vo Nguyen Giap" or "Nguyen Phu Trong" you could get lost due to the sheer mass of data. The following results of FuzzySearch give you an impression how FuzzySearch works.

    The Java codes:
    PHP:
    import fuzzylogic.FuzzyPatterns;
    public class 
    urlSearch {
      public static 
    void main(String... argsthrows Exception {
        
    ArrayList<Stringurls = new ArrayList<String>();
        
    urls.add("https://www.bloomberg.com");
        
    urls.add("https://www.bloomberg.com/europe");
        
    urls.add("https://www.bloomberg.com/asia");
        
    urls.add("https://thediplomat.com");
        
    urls.add("https://www.bbc.com");
        
    urls.add("https://edition.cnn.com");
        
    urls.add("https://vietnamnews.vn");
        
    urls.add("https://thanhnien.vn");
        
    ArrayList<Stringlst = new ArrayList<String>();
        for (
    String s:argslst.add(s);
        
    FuzzyPatterns fp = new FuzzyPatterns(urlslst);
        
    lst fp.searchResults();
        
    System.out.printf("Time: %06.3f MilliSec. by: %06.3f MB \n",
                          
    fp.getTimeMilli(),
                          
    fp.sizeOfBigData()/1048576);
        
    int i 1;
        
    double[] cs fp.cosineSimilarity();
        if (
    lst.size() > 0) for (String s:lst) {
          
    System.out.printf("%5d. %s [%01.6f]\n"iscs[i-1]);
          ++
    i;
        } else 
    System.out.println("Found NOTHING.");
      }
    }
    and the search results:
    Code:
    C:\JFX\Fuzzy\Test>java urlSearch dai tuong vo nguyen giap
    Time: 2439,939 MilliSec. by: 01,498 MB
    	1. https://thanhnien.vn, found Pattern: 10x dai, 12x tuong, 62x vo, 13x nguyen [0,328783]
    	2. https://www.bbc.com, found Pattern: 2x dai, 23x vo [0,204733]
    	3. https://edition.cnn.com, found Pattern: 1x dai, 31x vo [0,204733]
    	4. https://vietnamnews.vn, found Pattern: 6x dai, 19x vo [0,204733]
     
    C:\JFX\Fuzzy\Test>java urlSearch nguyen phu trong
    Time: 2376,578 MilliSec. by: 01,501 MB
    	1. https://thanhnien.vn, found Pattern: 13x nguyen, 39x phu, 36x trong [0,138404]
    	2. https://vietnamnews.vn, found Pattern: 9x phu, 3x trong [0,105023]
    	3. https://www.bloomberg.com, found Pattern: 1x trong [0,096239]
    	4. https://www.bloomberg.com/europe, found Pattern: 1x trong [0,096239]
    	5. https://www.bloomberg.com/asia, found Pattern: 1x trong [0,096239]
    	6. https://thediplomat.com, found Pattern: 11x trong [0,096239]
    	7. https://edition.cnn.com, found Pattern: 33x trong [0,096239]
    	8. https://www.bloomberg.com, found Pattern: 1x trong [0,096239]
     
    C:\JFX\Fuzzy\Test>
    
    I am talking about the MRA or MapReduceAlgorithm with the CosineSimilarity. Google Search Engine bases on this MRA. As you know, two parallel lines won't "cut" each other because they share the same similarity parallelism. In the geometry it's so: 2 lines which have an angle of 0 degree to each other share the similarity of cosine(0) which is 1 and it's (the one) the absolute parallelism: the CosineSimilarity.

    The essence of CosineSimilarity is the similarness to the reference(s). The more they are similar, the more they look parallel. Or in other words: if the angle between "objects" approximates zero the cosine is nearing 1 (absolute the same). With this knowledge MRA-CosineSimilarity proves the best tool to extract the sameness of objects out of a huge heap of objects (BigData). The MRA or FuzzySearch results of "dai tuong vo nguyen giap" and "nguyen phu trong" show you how the CosineSimilarity looks like. To facilitate the searching process it's better to convert all the letters either to UPPER case or to LOWER case:

    5 patterns: dai, tuong, vo, nguyen and giap
    • thanhnien.vn scores the highest points. Namely all 5 words: Dai, Tuong, Vo, Nguyen and Giap. A CosineSimilarity of 0,439789 out of 8 daily online papers.
    • vietnamnews.vn is the second with 4 hits or a CosineSimilarity of 0,390070
    • edition.cnn.com and vietnamnews.vn show up with 2 hits: Dai and Vo
    • the rest missed.

    The search algorithm just looks for pattern (or term) by pattern, NOT the entire sentence (Dai Tuong Vo Nguyen Giap) and builds the similarness with CosineSimilarity. The conclusion whether the found results are correct and valid or not is back to the searcher as human who has to to adjudicate on them. And that is what Artificial Intelligence and Machine Learning with Supervisor really are. If an entire sentence is the pattern itself it must be assigned as an element of the ArrayList or String-Array. Example:
    PHP:
    ArrayList<String> list = new ArrayList<String>();
    list.
    add("Dai Tuong Vo Nguyen Giap"); // the whole sentence
    ...
    FuzzyPatterns fp = new FuzzyPatterns(urls, list);
    ...
    Here is the shortened FuzzyPatterns API of FuzzySearch from my FuzzyEngine package. It exploits the Parallelism Feature of JAVA: the ForkJoinPool, and it works relatively very fast (see the time in milliSec. of the urlSearch example.)
    PHP:
    package fuzzylogic;
    //
    import java.io.*;
    import java.util.*;
    import java.net.URL;
    import java.util.concurrent.*;
    /**
    FuzzyPatterns contains all patterns to be searched in a big heap of data (BigData).
    <br>The patterns are JAVA strings.
    @author Joe Nartca
    */
    public class FuzzyPatterns {
      
    /**
      Constructor, Set patterns and BigData then starts to search.
      @param uLst ArrayList of URLss where they may contain the pattern
      @param pLst String Arraylist of patterns
      @exception Exception if one of the URLs is invalid
      */
      
    public FuzzyPatterns(ArrayList<StringuLstArrayList<StringpLstthrows Exception {
        for (
    String u uLst) if (!isURL(u)) throw new Exception("Invalid "+u);
        
    this.uLst uLst;
        
    this.pLst pLst;
        
    init();
      }
      
    /**
      Constructor, Set patterns and BigData then starts to search.
      @param uLst ArrayList of URLss where they may contain the pattern
      @param pArray String Array of patterns
      @exception Exception if one of the URLs is invalid
      */
      
    public FuzzyPatterns(ArrayList<StringuLstString[] pArraythrows Exception {
        for (
    String u uLst) if (!isURL(u)) throw new Exception("Invalid "+u);
        
    pLst = new ArrayList<String>(pArray.length);
        for (
    int i 0pArray.length; ++ipLst.add(pArray[i]);
        
    this.uLst uLst;
        
    init();
      }
      
    /**
      add a pattern to searching list
      @param pattern String, the searching pattern to be searched
      */
      
    public void addPatterns(String pattern) {
        
    pLst.add(pattern.trim());
      }
      
    /**
      Remove a pattern from searching list
      @param pattern String, the searching pattern to be removed
      @return boolean true if pattern exists and is removed.
      */
      
    public boolean removePatterns(String pattern) {
        return 
    pLst.remove(pattern.trim());
      }
      
    /**
       getTimeSecond() returns the searching time in seconds
       @return double
      */
      
    public double getTimeSecond() {
        return 
    time;
      }
      
    /**
       getTimeMilli() returns the searching time in milliseconds
       @return double
      */
      
    public double getTimeMilli() {
        return 
    milli;
      }
      
    /**
      setBigData()
      @param uLst ArrayList of URLss where they may contain the pattern
      @exception Exception if one of the URLs is invalid
      */
      
    public void setBigData(ArrayList<StringuLstthrows Exception {
        for (
    String u uLst) if (!isURL(u)) throw new Exception("Invalid "+u);
        
    this.uLst uLst;
      }
      
    /**
      setPatterns()
      @param pLst String Arraylist of patterns
      */
      
    public void setPatterns(ArrayList<StringpLst) {
        
    this.pLst pLst;
      }
      
    /**
      setPatterns()
      @param pArray String Array of patterns
      */
      
    public void setPatterns(String[] pArray) {
        
    pLst = new ArrayList<String>(pArray.length);
        for (
    int i 0pArray.length; ++ipLst.add(pArray[i]);
      }
      
    /**
       sizeOfBigData() returns the amount of searched bytes. Only after searchResult() was invoked.
       @return double Zero if searchResult() wasn't invoked in advance.
      */
      
    public double sizeOfBigData() {
        return 
    size;
      }
      
    /**
      cosineSimilarity() return the list contains all documents with their cosinesimilarity
      <br>Document with 0 is the one that has NO match with any pattern
      @return double array that contains the CosineSimilarity values corresponding to the results
      */
      
    public double[] cosineSimilarity() {
        return 
    css;
      }
      
    /**
       search() restarts to search (only necessary after renewing Patterns or BigData)
      */
      
    public void search( ) {
        
    init();
      }
      
    /**
       fuzzy searchResults() returns an ArrayList of String with the following format:
       <br>DocumentName, foundPatterns: ax XXX, bx YYY, ...
       <br>Where a, b, ... are the frequencies of the found patterns XXX, YYY, ... in this document.
       <br>Note: searchResults() must be executed BEFORE any other Search Result can be accessed.
       @return ArrayList of Strings containing the documents which match the best to the patterns
      */
      
    public ArrayList<StringsearchResults( ) {
        if (
    group == nullcss cosSim();
        
    double[] cs = new double[css.length];
        
    int[] ix = new int[css.length];
        
    ArrayList<StringrList = new ArrayList<String>( );
        
    System.arraycopy(css0cs0css.length);
        for (
    int l0ix.length; ++i) {
          
    size += group[i].size;
          
    index( );
          
    css[l] = 0;
          
    ix[i] = l;
        }
        
    StringBuilder sb = new StringBuilder();
        for (
    int i 0ix.length; ++i) {
          
    int l ix[i];
          
    css[i] = cs[l];
          if (
    cs[l] > 0) {
            
    sb.setLength(0);
            
    boolean done false;
            for (
    int a 0pSize; ++a) if (group[l].fq[a] > 0) {
              if (
    donesb.append(", "+group[l].fq[a]+"x "+pLst.get(a));
              else 
    sb.append(group[l].fq[a]+"x "+pLst.get(a));
              
    done true;
            }
            
    rList.add(group[l].doc+", found Pattern: "+sb.toString());
          }
        }
        return 
    rList;
      }
      
    //
      
    private boolean isURL(String url) {
        try {
          (new 
    URL(url)).openStream().close();
          return 
    true;
        } catch (
    Exception ex) { }
        return 
    false;
      }
      private 
    double[] cosSim( ) {
        
    int l0;
        
    ArrayList<Grouplst = new ArrayList<Group>();
        
    time System.nanoTime(); // starting
        
    try { // now wait for the results
          
    for (Future f futurelst.add((Group)f.get());
        } catch (
    Exception ex) { }
        
    group lst.toArray(new Group[lst.size()]);
        
    int[][] df = new int[group.length][pSize];
        for (
    0group.length; ++i)
          for (
    0pSize; ++l)
            if (
    group[i].fq[l] > 0) {
              for (
    int k 0group.length; ++k)
                  if (
    group[k].fq[l] > 0) ++df[i][l];
            }
        
    double[][] idf = new double[group.length][pSize];
        for (
    0group.length; ++i)
          for (
    0pSize; ++l)
            if (
    df[i][l] == 0idf[i][l] = 0;
            else 
    idf[i][l] = Math.log(((double)(group.length)/df[i][l]));
        
    double[][] dbt = new double[group.length][pSize];
        for (
    0group.length; ++i)
          for (
    0pSize; ++l)
            
    dbt[i][l] = idf[i][l] * group[i].fq[l];
        
    //
        
    int mx[] = new int[pSize];
        for (
    0mx.length; ++imx[i] = 0;
        
    // compute the frequency of each pattern in all documents
        
    for (0group.length; ++i)
          for (
    0pSize; ++l)
            if (
    group[i].fq[l] > 0) ++mx[l];
        
    //
        
    int maxTF 0;
        for (
    0mx.length; ++i) if (maxTF mx[i]) maxTF mx[i];
        
    // compute the Term or QueryVector
        
    double qv 0;
        
    double tv[] = new double[pSize];  
        for (
    0pSize; ++i) {
          
    tv[i] = 0;
          for (
    0group.length; ++l) if (idf[l][i] > 0){
            
    tv[i] += idf[l][i]*(((double)group[l].fq[i])/maxTF);
          }
          
    qv += Math.pow(tv[i], 2);
        }
        
    qv Math.sqrt(qv);
        
    // DocumentVector
        
    double[] dv = new double[group.length];
        for (
    0dv.length; ++i) {
          
    dv[i] = 0;
          for (
    0pSize; ++l)
            
    dv[i] += Math.pow(idf[i][l], 2);
          
    dv[i] = Math.sqrt(dv[i]);
        }
        
    // CosineSimilarity(dv, tv) = sumIDFi/dvi*tvi
        
    double[] cs = new double[group.length];
        for (
    0cs.length; ++i) {
          
    double x dv[i] * qv;
          
    cs[i] = 0;
          if (
    0) {
            for (
    0pSize; ++lcs[i] += idf[i][l];
            
    cs[i] /= x;
          }
        }
        
    milli = (double)(System.nanoTime()-time)/1000000;
        
    time milli/1000;
        return 
    cs;
      }
      
    //----------------------------------------------------------------------------
      
    private void init() {
        
    group null;
        
    pSize pLst.size();
        
    byte[][] ba = new byte[pSize][];
        for (
    int i 0pSize; ++i) {
          
    String s pLst.get(i).trim();
          
    int le s.length();
          
    ba[i] = new byte[le]; // LowerCase only
          
    System.arraycopy(s.toLowerCase().getBytes(), 0ba[i], 0le);
        }
        
    future = new  ArrayList<Future<Group>>();
        for (
    String url uLstfuture.add(ForkJoinPool.commonPool().submit(()-> {
          
    ByteArrayOutputStream bao = new ByteArrayOutputStream(131072); // 128KB
          
    byte[] bb = new byte[32768]; // 32KB
          
    Group g = new Group();
          
    g.fq  = new int[ba.length];
          
    g.doc url;
          try { 
    // compute the Term Frequency of this URL
            
    InputStream is = (new URL(url)).openStream( );
            for (
    int i is.read(bb); 0is.read(bb)) bao.write(bb0i);
            
    bb bao.toByteArray();
            
    g.size bb.length;
            
    bao.close();
            
    is.close();
            
    // count the pattern-frequencies
            
    for (int l 0ba.length; ++l) {
              for (
    int j 00bb.length; ++k) {
                if ((
    byte)(bb[k] | 0x20) != ba[l][j])0;
                if (++
    == ba[l].length) { // match the pattern?
                  
    ++g.fq[l];
                  
    0;
                }
              }
            }
          } catch (
    Exception ex) { }
          return 
    g;
        }));
      }
      
    // sorted sequence of the most likely (or similar)
      
    private int index( ) {
        
    int idx 0;
        
    double d 0;
        for (
    int i 0css.length; ++i)
        if (
    css[i]) {
          
    css[i];
          
    idx i;
        }
        return 
    idx;
      }
      
    // ---------------------------------------------------------------------------
      
    private class Group {
        public 
    double size 0;
        public 
    String doc;
        public 
    int fq[]; 
      }
      
    //----------------------------------------------------------------------------
      
    private int pSize;
      private 
    double size;
      private 
    double[] css;
      private 
    Group group[];
      private 
    double timemilli;
      private 
    ArrayList<StringuLstpLst;
      private 
    ArrayList<Future<Group>> future;
    }
     
    Last edited: 11/5/19

Chia sẻ trang này

Loading...