Paul Lynch's Pages : Search and Replace

This is more complex than it needs to be, for reasons that are not obvious. The problem, succinctly stated, is using regular expressions from Java.

For historical WebObjects reasons, I have not had easy access to a regexp library. Java people have used the open source JEdit libraries to implement regular expressions, but with WebObjects this was too much like hard work. Now, with Java 1.4, we get java.util.regex, and a couple of regex methods added to the String class. These support full regular expression patterns, of the kind I would expect to find in sed, awk and Perl.
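
As a rough illustration of those Java 1.4 additions, here is a minimal sketch that uses String.replaceAll, String.matches and the Pattern/Matcher pair from java.util.regex. The class name RegexBasics and the helper names squeeze and firstNumber are invented for this example and do not come from the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are assumptions made for illustration.
class RegexBasics {

    // Collapse every run of whitespace into a single space (String.replaceAll).
    static String squeeze(String text) {
        return text.replaceAll("\\s+", " ");
    }

    // Return the first run of digits in the text, or null if there is none.
    static String firstNumber(String text) {
        Matcher m = Pattern.compile("[0-9]+").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(squeeze("a  \t b\n\nc"));              // prints "a b c"
        System.out.println(firstNumber("order 1234 shipped"));    // prints "1234"
        // String.matches tests the whole string against the pattern.
        System.out.println("2004-05-17".matches("\\d{4}-\\d{2}-\\d{2}")); // prints "true"
    }
}
```

Pattern.compile is worth keeping in a static field when the same pattern is applied many times, since String.replaceAll compiles its pattern on every call.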

With WebObjects, I could use the array split and combine functions to implement a simple replace method:

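import com.webobjects.foundation.NSArray;

static public String replace(String a, String b, String text) {
	// Split the text wherever the search string a occurs,
	// then rejoin the pieces with the replacement b between them.
	NSArray bits = NSArray.componentsSeparatedByString(text, a);
	return bits.componentsJoinedByString(b);
}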
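
For readers without WebObjects, the same literal (non-regex) replacement can be sketched in plain Java with an indexOf loop. This is an assumed equivalent, not the author's code: the class name LiteralReplace is made up, the argument order simply mirrors the method above, and StringBuilder assumes Java 5 or later (StringBuffer would do on 1.4).

```java
// Hypothetical plain-Java counterpart to the split-and-join replace above.
class LiteralReplace {

    // Replace every occurrence of 'search' in 'text' with 'replacement',
    // treating both strings as plain text rather than as regex patterns.
    static String replace(String search, String replacement, String text) {
        if (search.length() == 0) {
            return text; // an empty search string has nothing sensible to match
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(search, from)) >= 0) {
            out.append(text.substring(from, hit)).append(replacement);
            from = hit + search.length(); // continue after the match
        }
        out.append(text.substring(from)); // copy the tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // Literal replacement: the dot only matches a real dot.
        System.out.println(replace("a.b", "X", "a.b axb"));   // prints "X axb"
        // replaceAll treats "a.b" as a regex, so the dot matches any character.
        System.out.println("a.b axb".replaceAll("a.b", "X")); // prints "X X"
    }
}
```

The difference matters as soon as the search string contains characters such as . or $, which replaceAll interprets as regex syntax. From Java 5 onward, String.replace(CharSequence, CharSequence) also performs a literal replacement.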

As I am always frustrated by the inflexibility of Java, I had to extend this to take a dictionary of parameters – keys being the search string, values the replacement:

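import java.util.Enumeration;
import com.webobjects.foundation.NSDictionary;

static public String replaceFromDict(String text, NSDictionary dict) {
	String result = text;
	if (dict == null) return result;
	// Apply each key/value pair in turn: key is the search string, value the replacement.
	Enumeration i = dict.keyEnumerator();
	while (i.hasMoreElements()) {
		String key = (String)i.nextElement();
		result = PLUtilities.replace(key, (String)dict.valueForKey(key), result);
	}
	return result;
}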
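
A plain-Java sketch of the same idea, assuming Java 5 or later and a java.util.Map in place of NSDictionary. The class name TemplateFill, the method name replaceFromMap and the %USER% placeholder style are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical Map-based counterpart to replaceFromDict.
class TemplateFill {

    // Apply every key -> value substitution in the map's iteration order.
    static String replaceFromMap(String text, Map<String, String> params) {
        if (text == null || params == null) {
            return text;
        }
        String result = text;
        for (Map.Entry<String, String> e : params.entrySet()) {
            // String.replace(CharSequence, CharSequence) is a literal replace.
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("%USER%", "paul");
        params.put("%SITE%", "example.org");
        System.out.println(replaceFromMap(
            "Hello %USER%, your password for %SITE% was reset.", params));
        // prints "Hello paul, your password for example.org was reset."
    }
}
```

Because the substitutions are chained, a replacement value that happens to contain a later key will itself be rewritten. A LinkedHashMap at least makes the order predictable; a hash-ordered dictionary does not.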

This was sufficient with Java 1.3. I used it to deal with form mails, replacing fixed parameters with user names and the like, mostly to implement a “forgotten password” routine for my stock login procedures.

Today I had a different problem. I was creating multipart emails using WebObjects components to generate the html. I already knew that the standard WOMailDelivery routine generates emails with just a single html part, and that these get caught by SpamAssassin. The fix is to convert the email into a multipart email and give it a plain text part. This is reasonably easy, although laborious. So I needed a plain text counterpart to a component's html; the simple solution is to write a de-html routine. It only works well if the component was written sensibly in the first place.
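
The original mail-building code is not shown here, so the following is only a hand-rolled sketch of what a multipart/alternative body looks like on the wire, following the MIME convention that parts are listed in increasing order of preference (plain text first, html last). The class name MultipartSketch, its methods and the boundary string are assumptions; the sketch ignores transfer encodings and line-length limits, and a real application would more likely use a mail library such as JavaMail.

```java
// Hypothetical sketch of a multipart/alternative message body.
class MultipartSketch {

    static final String CRLF = "\r\n";

    // Header line announcing the multipart type and the chosen boundary.
    // A real message also needs a "MIME-Version: 1.0" header.
    static String contentTypeHeader(String boundary) {
        return "Content-Type: multipart/alternative; boundary=\"" + boundary + "\"";
    }

    // Body with two alternatives: plain text first, html last (preferred).
    static String body(String boundary, String plainText, String html) {
        StringBuilder sb = new StringBuilder();
        sb.append("--").append(boundary).append(CRLF);
        sb.append("Content-Type: text/plain; charset=UTF-8").append(CRLF).append(CRLF);
        sb.append(plainText).append(CRLF);
        sb.append("--").append(boundary).append(CRLF);
        sb.append("Content-Type: text/html; charset=UTF-8").append(CRLF).append(CRLF);
        sb.append(html).append(CRLF);
        sb.append("--").append(boundary).append("--").append(CRLF); // closing delimiter
        return sb.toString();
    }

    public static void main(String[] args) {
        String boundary = "example-boundary-1"; // must not occur inside either part
        System.out.println(contentTypeHeader(boundary));
        System.out.println();
        System.out.print(body(boundary, "Your password was reset.",
                              "<p>Your password was reset.</p>"));
    }
}
```

The plain text part is exactly what a de-html routine has to supply from the component's html.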

static public String dehtml(String value) {
	String result = value;
	NSMutableArray patternArray = new NSMutableArray();
	NSMutableArray replaceArray = new NSMutableArray();

	// Replace the non-breaking space entity with an ordinary space.
	patternArray.addObject("&nbsp;"); replaceArray.addObject(" ");
	// Collapse every run of white space into a single space.
	patternArray.addObject("\\s+"); replaceArray.addObject(" ");
	// Turn the block-level tags into line breaks and tabs.
	patternArray.addObject("(?i)<p>"); replaceArray.addObject("\n\n");
	patternArray.addObject("(?i)<br>"); replaceArray.addObject("\n");
	patternArray.addObject("(?i)</tr>"); replaceArray.addObject("\n");
	patternArray.addObject("(?i)</td>"); replaceArray.addObject("\t");
	// Drop the whole head section, then comments, then any remaining tags.
	patternArray.addObject("(?i)<head>.+</head>"); replaceArray.addObject("");
	patternArray.addObject("<!--.+?-->"); replaceArray.addObject("");
	patternArray.addObject("<.+?>"); replaceArray.addObject("");

	// Apply the patterns in order; each pass works on the previous result.
	for (int i = 0; i < patternArray.count(); i++) {
		result = result.replaceAll((String)patternArray.objectAtIndex(i), (String)replaceArray.objectAtIndex(i));
	}
	return result;
}
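
Since replaceAll itself is plain Java, the routine can be tried without WebObjects. The sketch below keeps the same patterns in the same order but holds them in a String array; the class name PlainDehtml is invented, and the first pattern is assumed to be the &nbsp; entity.

```java
// Hypothetical plain-Java variant of dehtml; same rules, no NSMutableArray.
class PlainDehtml {

    // Each row is { regex pattern, replacement }, applied top to bottom.
    static final String[][] RULES = {
        { "&nbsp;", " " },              // assumed: non-breaking space entity
        { "\\s+", " " },                // collapse runs of white space
        { "(?i)<p>", "\n\n" },
        { "(?i)<br>", "\n" },
        { "(?i)</tr>", "\n" },
        { "(?i)</td>", "\t" },
        { "(?i)<head>.+</head>", "" },  // drop the head section
        { "<!--.+?-->", "" },           // drop comments
        { "<.+?>", "" }                 // drop every remaining tag
    };

    static String dehtml(String value) {
        String result = value;
        for (int i = 0; i < RULES.length; i++) {
            result = result.replaceAll(RULES[i][0], RULES[i][1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hi</title></head><body>"
                    + "<p>Dear&nbsp;Paul,<br>your   password\nwas reset."
                    + "<!-- note --></body></html>";
        // Prints two empty lines, then "Dear Paul," and "your password was reset."
        System.out.println(dehtml(html));
    }
}
```

The order matters: white space is collapsed before the paragraph and line-break tags are turned into newlines, so the newlines that survive are the intended ones.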

This version uses the String replaceAll method, which works just fine. The only trap in creating the patterns was the second string, which replaces white space with a single space: because the backslash is itself the escape character in a Java string literal, the compiler requires you to write \\s rather than \s. The final pattern, which removes all html tags, needs the “reluctant” form, with the ?, because the simple <.+> pattern would otherwise eat up the entire string from first tag to last.
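
Both traps are easy to see in a small experiment. This is a minimal sketch; the class name GreedyVsReluctant and its two helper methods are made up for the comparison.

```java
// Hypothetical demo of greedy versus reluctant quantifiers in java.util.regex.
class GreedyVsReluctant {

    // Greedy: .+ grabs as much as possible, so one match spans first '<' to last '>'.
    static String stripGreedy(String html) {
        return html.replaceAll("<.+>", "");
    }

    // Reluctant: .+? stops at the nearest '>', so each tag is matched separately.
    static String stripReluctant(String html) {
        return html.replaceAll("<.+?>", "");
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        System.out.println("[" + stripGreedy(html) + "]");    // prints "[]"
        System.out.println("[" + stripReluctant(html) + "]"); // prints "[bold and italic]"

        // The Java literal "\\s+" is the three-character regex \s+ at run time.
        System.out.println("\\s+".length());                  // prints "3"
    }
}
```

One further detail: by default the dot does not match line terminators, so a tag or comment broken across lines would slip past these patterns. The dehtml routine avoids that by collapsing white space first; the (?s) flag is the other option.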
