Everyone seems to getting in on Tim Bray's WideFinder challenge; there is even a C++ implementation.
I think it was originally supposed to be a competition between all those new and exciting dynamic languages, but they have not be overly impressive thus far. So I've decided to post up my various attempts at a solution in Java. Here is the basic WideFinder in Java (all import statements are omitted for brevity):
public class WideFinder {
public static void main(String[] args) throws IOException {
Map<String, Integer> counts = new HashMap<String, Integer>();
Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");
BufferedReader in = new BufferedReader(new InputStreamReader(
new FileInputStream(args[0]), "US-ASCII"));
String line = null;
while ((line = in.readLine()) != null) {
Matcher m = p.matcher(line);
if (m.find()) {
String key = m.group();
Integer currentCount = counts.get(key);
counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));
}
}
in.close();
List<Entry<String, Integer>> results = new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
Collections.sort(results, new Comparator<Entry<String, Integer>>() {
public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2)
{
return o2.getValue().compareTo(o1.getValue());
}
});
for(int i = 0; i < 10; i++) {
System.out.println(results.get(i));
}
}
}
The basic version is very similar to the Ruby one, if a little more verbose. Like all of the others, it is primarily IO bound. One important thing I found, was that specifying the encoding as being ASCII made about 30% improvement in the time taken. This is because Java uses unicode strings internally, and if you read text without specifying an encoding it will use the platform default. ASCII is a good bit quicker to convert than ISO-8859-1.
For a file with 1 million entries I got the following times:
real 0m3.208s
user 0m3.460s
sys 0m0.215s
Taking the regexp out:
real 0m2.267s
user 0m2.329s
sys 0m0.210s
So we can potentially save up to 1 second, almost 1/3 of the time. My first solution was to try to keep things as close to the original as possible. For each line, instead of matching it directly, I submit a new task to an ExecutorService. To make the map safe to access from multiple threads I use a ConcurrentHashMap, with AtomicIntegers as values.
public class WideFinderConcurrent {
public static void main(String[] args) throws IOException, InterruptedException {
int threads = Integer.parseInt(args[1]);
ExecutorService executorService = Executors.newFixedThreadPool(threads);
final ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<String, AtomicInteger>();
final Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]), "US-ASCII"));
String line = null;
while ((line = in.readLine()) != null) {
final String lineForTask = line;
executorService.execute(new Runnable() {
public void run() {
Matcher m = p.matcher(lineForTask);
if (m.find()) {
String key = m.group();
AtomicInteger currentCount = counts.get(key);
if (currentCount == null) {
currentCount = new AtomicInteger(0);
AtomicInteger old = counts.putIfAbsent(key, currentCount);
if (old != null) {
currentCount = old;
}
}
currentCount.incrementAndGet();
}
}
});
}
in.close();
executorService.shutdown();
executorService.awaitTermination(100000, TimeUnit.SECONDS);
List<Entry<String, AtomicInteger>> results = new ArrayList<Map.Entry<String, AtomicInteger>>(counts.entrySet());
Collections.sort(results, new Comparator<Entry<String, AtomicInteger>>() {
public int compare(Entry<String, AtomicInteger> o1, Entry<String, AtomicInteger> o2) {
int anotherVal = o1.getValue().get();
int thisVal = o2.getValue().get();
return (thisVal<anotherVal ? -1 : (thisVal==anotherVal ? 0 : 1));
}
});
for(int i = 0; i < 10; i++) {
System.out.println(results.get(i));
}
}
}
The performance for this version is pretty poor:
real 0m5.407s
user 0m11.921s
sys 0m4.023s
Most of the locking is probably uncontended but there is still significant overhead. It also is probably not a great idea to use lots and lots of AtomicIntegers. With more work being done in the tasks it might be more reasonable. The next step was to get my hands dirty with low level threading. Many bloggers will have you believe this is impossible to get right, but somehow I've managed to write a fair bit of multithreaded code with manual synchronization without causing horrible errors. For this version I just used two threads: the main thread doing the reading, and another thread doing the regexp match and map update. I didn't bother trying to parallelize the sort. The main thread will pass lines on a queue to the regexp thread. Once the file is finished a special object is put on the queue to indicate that. We then wait for the regexp thread to finish and get the map back. Obviously locking overhead will be a problem so instead of putting lines on the queue one by one batches of lines are used.
public class WideFinder2ThreadsBulk {
public static void main(String[] args) throws IOException, InterruptedException {
int batchSize = Integer.parseInt(args[1]);
Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");
LineCounter counter = new LineCounter(p);
Thread t = new Thread(counter);
t.start();
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]), "US-ASCII"));
String line = null;
List<String> batch = new ArrayList<String>(batchSize);
while ((line = in.readLine()) != null) {
batch.add(line);
if (batch.size() == batchSize) {
counter.addLines(batch);
batch = new ArrayList<String>(batchSize);
}
}
counter.addLines(batch);
counter.finished();
t.join();
in.close();
List<Entry<String, Integer>> results = new ArrayList<Map.Entry<String, Integer>>(counter.getCounts().entrySet());
Collections.sort(results, new Comparator<Entry<String, Integer>>() {
public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2)
{
return o2.getValue().compareTo(o1.getValue());
}
});
for(int i = 0; i < 10; i++) {
System.out.println(results.get(i));
}
}
static class LineCounter implements Runnable {
private static final List<String> FINISHED = new ArrayList<String>();
private final BlockingQueue<List<String>> queue = new ArrayBlockingQueue<List<String>>(100);
private final Pattern pattern;
private final Map<String, Integer> counts = new HashMap<String, Integer>();
LineCounter(Pattern pattern) {
this.pattern = pattern;
}
public void run() {
List<String> lines = null;
try {
while ((lines = queue.take()) != FINISHED) {
for (String line : lines) {
Matcher m = pattern.matcher(line);
if (m.find()) {
String key = m.group();
Integer currentCount = counts.get(key);
counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));
}
}
}
}
catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
void addLines(List<String> batch) throws InterruptedException {
queue.put(batch);
}
void finished() throws InterruptedException {
queue.put(FINISHED);
}
Map<String, Integer> getCounts() {
return counts;
}
}
}
Finally we get some speedup (for batches of 500 lines):
real 0m2.597s
user 0m4.137s
sys 0m0.294s
You could generalize this for multiple threads, but it likely would not be worth the effort. Really, this whole exercise probably isn't worth the effort. In fact, I feel like we have all failed one of Steve Yegge's interview questions. A commenter on Tim Bray's blog shows that Unix command line utilities are the way to go, while we have been writing big programs, and trying to optimize file IO. Another - probably easy - way to do this would be to suck the data into your preferred DB, and run a single (simple) SQL query. But that wouldn't be fun, would it?
8 comments:
Friend, is this code :
AtomicInteger currentCount = counts.get(key);
if (currentCount == null) {
currentCount = counts.putIfAbsent(key, new AtomicInteger(0));
} currentCount.incrementAndGet();
not equivalent to this :
AtomicInteger currentCount = counts.putIfAbsent(key, new AtomicInteger(0);
currentCount.incrementAndGet();
only, errr, slower?
It is equivalent, but it probably isn't faster.
AtomicIntegers aren't as cheap to create as regular Integers, so we probably don't want to create lots of them unnecessarily. You would expect most articles to have multiple entries in the log file, so you would be creating a lot of objects to throw away immediately. That probably isn't the best idea even for regular Integers. putIfAbsent probably holds onto a lock for ever so slightly longer than a get, so that might make a difference too.
Shoot me the code with the imports and I'll give it a whack on the big iron. NetBeans couldn't auto-figure-out the imports and my Java is getting rusty... tim dot bray at sun dot com
I sent the code on a couple of days ago. Hopefully it hasn't been eaten or something.
Ehm,
why wouldnt you show the imports in your post ? It would give us instant gratification ;)
T.
I found this site using [url=http://google.com]google.com[/url] And i want to thank you for your work. You have done really very good site. Great work, great site! Thank you!
Sorry for offtopic
Who knows where to download XRumer 5.0 Palladium?
Help, please. All recommend this program to effectively advertise on the Internet, this is the best program!
Online Casino Games tyuueooru
http://stonewalljacksoncarnival.org/ - No Download Casino
Therefore, the more the online casino, the more the casino options and offers as well.
[url=http://stonewalljacksoncarnival.org/]Virtual Casino[/url]
Therefore, the more the online casino, the more the casino options and offers as well.
Casino Money
You can enjoy online casino simply by getting connected to the Internet.
Post a Comment