Experience: Regular Expression or Not?



Today is a beautiful, sunny but very cold day. Sitting near the warm oven, I am thinking about Jamie Zawinski, a world-class hacker and co-author of the famous XEmacs. If you are a real programmer you have to know what Emacs and XEmacs are; otherwise you cannot appreciate the beauty of (X)Emacs. Do you know what Guru Zawinski thought about regular expressions, or regex?
Master Zawinski said:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Well, before you read on you should make yourself familiar with Java regex, so that you can understand what I am going to talk about :) First of all, a cryptic "regex" for you:
var whitelist =
You got it? Uh huh... =))

I myself rarely work with regex. Not because I cannot, but because regex is too slow and too voodoo for other readers... and with time it becomes voodoo to me as well :( That means that if I can solve a pattern problem without resorting to regex voodoo, I do. Only in very complicated, very tricky cases do I toil with regex. The second reason is the hard-to-detect expression bugs.

Too slow and too voodoo? Yes. The simple line
if (String.valueOf(text.charAt(beg)).matches("\\W"))
checks that the character (char) at position beg is not a word character. It looks nice, neat and clear if you know what \\W means. In a Java string literal the double backslash stands for ONE backslash. The lowercase \w stands for a Word character (a letter, a digit or the underscore), and the uppercase \W is its negation. So \W in human terms: anything that is not a word character, meaning all special characters (e.g. dot, hyphen, colon, etc.). Well, how can I do the same job without using regex and "\\W"? As an experienced developer you would see that the solution is:
char c = text.charAt(beg);
// true for every char outside [0-9A-Z_a-z], i.e. what \W matches
if (c < '0' || c > '9' && c < 'A' || c > 'Z' && c < 'a' && c != '_' || c > 'z')
Does it look more complicated than the regex statement? Maybe. However, it is certainly more understandable, or less voodoo, for readers than a \\W. Am I right?
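To check that the hand-written comparison really does the same job as \W, here is a minimal sketch. The class name NonWordDemo and the method isNonWord are made up for this illustration, and the check is limited to the ASCII range, assuming the default (non-Unicode) meaning of \W in java.util.regex.

```java
class NonWordDemo {
  // Hand-written counterpart of the regex \W in its default mode:
  // true for every char outside [0-9A-Z_a-z].
  static boolean isNonWord(char c) {
    return c < '0' || c > '9' && c < 'A' || c > 'Z' && c < 'a' && c != '_' || c > 'z';
  }

  public static void main(String[] args) {
    int mismatches = 0;
    // Compare both approaches for every ASCII character.
    for (char c = 0; c < 128; ++c) {
      boolean viaRegex = String.valueOf(c).matches("\\W");
      if (viaRegex != isNonWord(c)) ++mismatches;
    }
    System.out.println("mismatches: " + mismatches);
  }
}
```

If the two ever disagree, the usual suspects are the underscore and the characters just before 'A' and just after 'Z'.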

But you are still unsure about the slowness of regex, because the alternative requires 2 statements and 1 method call (=, if and charAt) plus a chain of greater-than and less-than comparisons. Well, at first glance you might be right. BUT at second glance the regex line is a combination of 3 method calls: String.valueOf, charAt and matches. Invoking a method is never free, and these calls are especially costly: String.valueOf creates a new String for every single character, and String.matches compiles the expression anew on every call before it can match anything. That is like two heavy lead balls chained to your legs. And that makes the regex slow. Very slow.
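A rough way to see where the time goes is to count the non-word characters of the same text in three ways: with String.matches per character, with a Pattern compiled once, and with plain comparisons. This is only a sketch; the class and method names are invented for it, the sample text is arbitrary, and the timings depend on machine and JVM (it is not a rigorous benchmark).

```java
import java.util.regex.Pattern;

class NonWordTiming {
  static final Pattern NON_WORD = Pattern.compile("\\W"); // compiled only once

  static int countWithStringMatches(String text) {
    int n = 0;
    for (int i = 0; i < text.length(); ++i)
      // new String per char, and the pattern is compiled on every call
      if (String.valueOf(text.charAt(i)).matches("\\W")) ++n;
    return n;
  }

  static int countWithCompiledPattern(String text) {
    int n = 0;
    for (int i = 0; i < text.length(); ++i)
      if (NON_WORD.matcher(String.valueOf(text.charAt(i))).matches()) ++n;
    return n;
  }

  static int countWithComparisons(String text) {
    int n = 0;
    for (int i = 0; i < text.length(); ++i) {
      char c = text.charAt(i);
      if (c < '0' || c > '9' && c < 'A' || c > 'Z' && c < 'a' && c != '_' || c > 'z') ++n;
    }
    return n;
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 20000; ++i) sb.append("public static void main(String... a); ");
    String text = sb.toString();

    long t0 = System.nanoTime();
    int a = countWithStringMatches(text);
    long t1 = System.nanoTime();
    int b = countWithCompiledPattern(text);
    long t2 = System.nanoTime();
    int c = countWithComparisons(text);
    long t3 = System.nanoTime();

    System.out.println("String.matches:   " + a + " hits, " + (t1 - t0) / 1_000_000 + " ms");
    System.out.println("compiled Pattern: " + b + " hits, " + (t2 - t1) / 1_000_000 + " ms");
    System.out.println("comparisons:      " + c + " hits, " + (t3 - t2) / 1_000_000 + " ms");
  }
}
```

All three variants must report the same number of hits; only the elapsed time should differ.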

Still unconvinced? The following little app gives you an undeniable answer. The app reads its own source and converts all Java keywords to upper case.
import java.awt.*;
import javax.swing.*;
import java.io.File;
import java.nio.file.*;
import java.awt.event.*;
//Joe Nartca (C)
public class RegEx extends JFrame implements ActionListener {
  public RegEx(String fName) {
    this.fName = fName;
    but1 = new JButton("RegEx");
    but2 = new JButton("HomeMade");
    but1.addActionListener(this);
    but2.addActionListener(this);
    JPanel jp = new JPanel();
    jp.add(but1);
    jp.add(but2);
    ta = new JTextArea();
    JScrollPane js = new JScrollPane(ta);
    add("North", jp);
    add("Center", js);
    setSize(550, 300);
    setDefaultCloseOperation(EXIT_ON_CLOSE);
    setVisible(true);
  }
  public void actionPerformed(ActionEvent e) {
    try {
      StringBuilder text = new StringBuilder(new String(Files.readAllBytes((new File(fName)).toPath())));
      int beg = 0, end = text.length();
      if (e.getSource() == but1) {
        // the regex solution with a long expression string
        long t0 = System.currentTimeMillis();
        for (int p = beg; beg < end; ++beg) {
          // a NON-word char ends the current chunk, which starts at p
          if (String.valueOf(text.charAt(beg)).matches("\\W")) {
            String pat = text.substring(p, beg);
            if (pat.matches(regex)) text.replace(p, beg, pat.toUpperCase());
            p = beg;
          }
        }
        t0 = System.currentTimeMillis()-t0;
        ta.setText(text.toString()+"\n>>>>>With RegEx:"+t0+" milliSec.<<<<<");
        return;
      }
      // the UN-regex solution with a string array
      char cb, ce;
      long t0 = System.currentTimeMillis();
      // p always trails beg: it is the position of the char before the word
      LOOP: for (int le, q, p = beg; beg < end; p = beg++) {
        cb = text.charAt(beg);
        if (cb >= 'A' && cb <= 'Z' || cb >= 'a' && cb <= 'z') {
          for (int i = 0; i < array.length; ++i) {
            le = array[i].length();
            q  = beg + le;
            if (q < end) {
              cb = text.charAt(p);
              ce = text.charAt(q);
              // keyword found AND surrounded by 2 NON-letters
              if (text.substring(beg, q).equals(array[i]) &&
                 (cb < 'A' || cb > 'Z' && cb < 'a' || cb > 'z') &&
                 (ce < 'A' || ce > 'Z' && ce < 'a' || ce > 'z')) {
                text.replace(beg, q, array[i].toUpperCase());
                beg = q;
                continue LOOP;
              }
            }
          }
          // not a keyword: skip the rest of this word
          for (le = beg; beg < end; ++beg) {
            cb = text.charAt(beg);
            if (cb < 'A' || cb > 'Z' && cb < 'a' || cb > 'z') break;
          }
        }
      }
      t0 = System.currentTimeMillis()-t0;
      ta.setText(text.toString()+"\n>>>>>With array:"+t0+" milliSec.<<<<<");
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
  public static void main(String... a) throws Exception {
    new RegEx(a.length == 0? "":a[0]);
  }
  private JTextArea ta;
  private String fName;
  private JButton but1, but2;
  private String regex = "\\W(private|public|protected|void|int|"+
  private String[] array = {"private", "public", "protected", "void", "int",
                            "float", "double", "static", "volatile", "class",
                            "extends","implements", "abstract", "import", "this",
                            "new", "byte", "char", "throws", "try",
                            "catch", "if", "for", "while", "do",
                            "case", "switch" };
}
The results speak for themselves


You can repeat it as many times as you want: the result is always in favor of "homemade" rather than "regex".

And what about the "hard-to-detect" bugs?
That is the million-dollar question. A word is a chain of letters surrounded by 2 NON-letters. Example: "joe" is a word, but joe" isn't, because the second one is NOT surrounded by 2 NON-letters: it starts with j and ends with a quote ". If you click both buttons you will see that the first line, import java.awt.*;, stays UNCHANGED, and that is the hard-to-detect bug. Reason: import is a Java keyword, but it sits at the very start of the text, so no NON-letter precedes it. The line should have been changed to IMPORT java.awt.*;
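The regex side of the bug can be reproduced in isolation. The sketch below uses a shortened, made-up stand-in for the keyword expression (only three keywords) and a hypothetical helper isKeywordChunk; it shows that an expression starting with \W cannot match a keyword that has nothing in front of it.

```java
class LeadingNonWordBug {
  // Shortened, hypothetical stand-in for the keyword expression.
  static final String REGEX = "\\W(import|public|class)";

  // A chunk is the text from the previous NON-word char up to the next one.
  static boolean isKeywordChunk(String chunk) {
    return chunk.matches(REGEX);
  }

  public static void main(String[] args) {
    System.out.println(isKeywordChunk(" import")); // true: the blank satisfies \W
    System.out.println(isKeywordChunk("import"));  // false: nothing precedes the keyword
  }
}
```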

Removing the bug is easy in the homemade case:

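// if (text.substring(beg, q).equals(array[i]) &&
//    (cb < 'A' || cb > 'Z' && cb < 'a' || cb > 'z') &&
if (text.substring(beg, q).equals(array[i]) &&
   (p == 0 ||     // take pos. 0 into account as a virtual NON-letter
    cb < 'A' || cb > 'Z' && cb < 'a' || cb > 'z') &&
   (ce < 'A' || ce > 'Z' && ce < 'a' || ce > 'z')) {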
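The same idea, treating the start (and here also the end) of the text as a virtual NON-letter, can be sketched as a small stand-alone method. This is a simplified variant, not the app's loop: the class HomeMadeKeywords, its helper isLetter and the three-keyword list are assumptions made for the example.

```java
class HomeMadeKeywords {
  // Shortened, hypothetical keyword list.
  static final String[] KEYWORDS = {"import", "public", "class"};

  static boolean isLetter(char c) {
    return c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z';
  }

  static String upperCaseKeywords(String src) {
    StringBuilder text = new StringBuilder(src);
    int end = text.length();
    int beg = 0;
    while (beg < end) {
      if (!isLetter(text.charAt(beg))) { ++beg; continue; }
      // beg is a word start: either position 0 or right after a NON-letter
      int q = beg;
      while (q < end && isLetter(text.charAt(q))) ++q;
      // q is the word end: either a NON-letter or the end of the text
      String word = text.substring(beg, q);
      for (String kw : KEYWORDS) {
        if (word.equals(kw)) {
          text.replace(beg, q, kw.toUpperCase()); // same length, so end stays valid
          break;
        }
      }
      beg = q;
    }
    return text.toString();
  }

  public static void main(String[] args) {
    System.out.println(upperCaseKeywords("import java.awt.*;\npublic class RegEx {"));
  }
}
```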

And what about the regex case? It's your turn, so that you can learn how hard it is to remove a bug from an existing regular expression.
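For readers who want to compare notes afterwards, one possible repair (certainly not the only one) is to drop the leading \W and use the word boundary \b, which also matches at the very start and end of the text. The sketch below is an assumption-laden illustration: the class RegexKeywordFix and the three-keyword expression are made up, and it scans the whole text with a Matcher instead of chunk by chunk.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexKeywordFix {
  // Shortened, hypothetical keyword list, compiled once.
  static final Pattern KEYWORD = Pattern.compile("\\b(import|public|class)\\b");

  static String upperCaseKeywords(String src) {
    Matcher m = KEYWORD.matcher(src);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      // Keywords contain only letters, so the replacement needs no escaping.
      m.appendReplacement(out, m.group().toUpperCase());
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(upperCaseKeywords("import java.awt.*;\npublic class RegEx {"));
  }
}
```

Note that \b counts digits and the underscore as word characters, so its idea of a "word" is slightly wider than the letters-only definition used in the homemade version.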