When I used Apache Nutch to crawl Chinese websites, I ran into an annoying problem: the language-identifier plugin that Nutch provides can’t detect Chinese, so I had to find another way to identify the language a site uses. I ended up solving it with ‘langdetect’, an open source Java language detection library. The project is hosted at http://code.google.com/p/language-detection/.
LangDetect supports 53 languages, awesome! The list of supported languages is here: http://code.google.com/p/language-detection/wiki/LanguageList.
It is as simple to use as the sample on its project homepage shows:
import java.util.ArrayList;
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

class LangDetectSample {
    // Load the language profiles once, before creating any Detector.
    public void init(String profileDirectory) throws LangDetectException {
        DetectorFactory.loadProfile(profileDirectory);
    }

    // Return the code of the single most likely language for the text.
    public String detect(String text) throws LangDetectException {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        return detector.detect();
    }

    // Return the candidate languages together with their probabilities.
    public ArrayList<Language> detectLangs(String text) throws LangDetectException {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        return detector.getProbabilities();
    }
}
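If you want one answer out of the candidate list that detectLangs returns, something like the following might work. It is only a rough sketch and not part of langdetect: the Candidate class is a hypothetical stand-in for the library's Language class (assumed here to carry a language code and a probability) so that the example runs without the library, and the class name TopLanguagePicker, the pick method, the 0.5 threshold and the "unknown" fallback are all my own assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class TopLanguagePicker {
    // Hypothetical stand-in for langdetect's Language class, so this
    // sketch compiles and runs without the library on the classpath.
    static final class Candidate {
        final String lang;
        final double prob;

        Candidate(String lang, double prob) {
            this.lang = lang;
            this.prob = prob;
        }
    }

    // Return the code of the most probable candidate, or "unknown" when
    // the list is empty or the best probability is below minProb.
    static String pick(List<Candidate> candidates, double minProb) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || c.prob > best.prob) {
                best = c;
            }
        }
        if (best == null || best.prob < minProb) {
            return "unknown";
        }
        return best.lang;
    }

    public static void main(String[] args) {
        // Made-up probabilities, for illustration only.
        List<Candidate> candidates = new ArrayList<>();
        candidates.add(new Candidate("ja", 0.14));
        candidates.add(new Candidate("zh-cn", 0.86));
        System.out.println(pick(candidates, 0.5));
    }
}
```

With the real library you would pass in the values from the list that detectLangs returns instead of the made-up candidates above. The threshold is there so that pages with no clear winner get marked as unknown rather than mislabeled.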
When I tested it, I found that I needed to add jsonic-1.2.x.jar to the project’s build path. It was not included in the langdetect package I downloaded, so I had to download jsonic myself and add it to the build path. After that, everything was on track. Enjoy it!
By the way, langdetect also provides a build as a Nutch plugin, so we can integrate it with our cluster conveniently.