Tuesday, April 15, 2008

JDIC

Tired of having to go to Jim Breen's WWWJDIC, I decided to write my own search script for the edict file they provide. The format is simple: one entry per line, with the headword, its reading in square brackets, and then the slash-separated glosses. Here are a few lines:

% zcat edict.gz | iconv -f 'euc-jp' -t 'utf8' | tail -n400 | head -n30
餃子 [ぎょうざ] /(n) gyoza (Japanese crescent-shaped pan-fried dumplings stuffed with minced pork and vegetables)/(P)/
餡 [あん] /(n) (uk) red bean jam/anko/(P)/
餡かけ [あんかけ] /(n) (food) thick starchy sauce made of kuzu or katakuriko flour/
餡こ [あんこ] /(n) (uk) red bean jam/anko/(P)/
餡パン [あんパン] /(n) bread roll filled with anko/
餡掛け [あんかけ] /(n) (food) thick starchy sauce made of kuzu or katakuriko flour/
餡蜜 [あんみつ] /(n) (food) syrup-covered anko (bean jam) and fruit/mitsumame mixed with an(ko)/
餡餅 [あんも] /(n) (1) (fem) mochi rice cake with red bean jam filling/mochi rice cake covered in red bean jam/(2) mochi rice cake/
餡餅 [あんもち] /(n) (1) (fem) mochi rice cake with red bean jam filling/mochi rice cake covered in red bean jam/
餡饅 [あんまん] /(n) bun with anko filling/
餝り車 [かざりぐるま] /(n) type of carriage beautifully decorated with gold, silver, gems, etc., for use by Heian era nobles at festivals and like activities/
餞 [はなむけ] /(n) farewell gift/
餞別 [せんべつ] /(n) farewell gift/
餠 [あも] /(oK) (ok) (n) (uk) sticky rice cake/
餠 [かちん] /(oK) (ok) (n) (uk) sticky rice cake/
餠 [もち] /(oK) (n) (uk) sticky rice cake/
餠 [もちい] /(oK) (ok) (n) (uk) sticky rice cake/
餬口 [ここう] /(n) bare existence/living on others/
饂飩 [うどん] /(n) (uk) udon (thick Japanese wheat noodles)/(P)/
饂飩屋 [うどんや] /(n) noodle shop/
饂飩鋤 [うどんすき] /(n) (uk) seafood and vegetables cooked sukiyaki style and served with udon/
饂飩粉 [うどんこ] /(n) flour/
饂飩粉病 [うどんこびょう] /(n) powdery mildew/
饅頭 [まんじゅう] /(n) manjuu/steamed yeast bun with filling/
饐える [すえる] /(v1,vi) to go bad/to turn sour/
饋還 [きかん] /(n,vs) (1) (electrical) feedback/
饋電線 [きでんせん] /(n) (obs) feeder/
饑い [ひだるい] /(adj-i) (uk) hungry/
饑える [うえる] /(v1,vi) to starve/to thirst/to be hungry/
饑餓 [きが] /(n) hunger/starvation/

The file is actually stored in EUC-JP, so I needed a language that handles character encodings well. Unfortunately, string.find() in Python kept failing for me on 'unicode' objects. I decided on Perl, since I know it plays well with Unicode:

jsearch.pl

Searches for Japanese text

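#!/usr/bin/perl -w
use strict;
use warnings;
use utf8;
use PerlIO::gzip;
use Encode;

scalar @ARGV || die "Usage: $0 edict.gz [pattern]\n";
my $db = shift @ARGV;

# Optional search pattern (a regex); with none given, every entry is printed.
my $srch = '';
scalar @ARGV and ($srch = shift @ARGV);

open(DB, '<:gzip', $db) || die "$db: $!\n";
while(<DB>) {
    # Convert each line from EUC-JP to UTF-8 bytes, the same form the
    # command-line pattern arrives in on a UTF-8 terminal.
    $_ = encode('utf-8', decode('euc-jp', $_));
    my @parts = split(/\//, $_);
    my $word = shift @parts;
    my $sterm = $word;
    # Drop the trailing field that holds only the newline.
    pop @parts if(@parts && $parts[$#parts] !~ /\S/);
    # Pull the reading out of the brackets so anchored patterns can match it.
    ($word =~ /\[([^\]]+)\]/) and ($sterm = $1);
    next unless(length($srch) == 0 || $word =~ /$srch/ || $sterm =~ /$srch/);
    # Show the (P) marker next to the word instead of among the glosses.
    my $flag = (@parts && $parts[$#parts] eq '(P)') ? pop(@parts) : '';
    print "$word $flag\n\t" . join('; ', @parts) . "\n";
}
close(DB);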
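
To make the splitting step in jsearch.pl easier to follow, here is a minimal sketch that applies the same idea to a single hard-coded line. The helper name parse_entry and the hash it returns are assumptions made up for this illustration, not part of the script, and the sample line is assumed to be decoded already rather than read from the gzipped file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the sample line below is written in UTF-8
binmode(STDOUT, ':encoding(UTF-8)');   # write characters out as UTF-8

# parse_entry is a hypothetical helper, not part of jsearch.pl.
# It takes one decoded EDICT line and returns the pieces the script works with.
sub parse_entry {
    my ($line) = @_;
    chomp $line;
    my @parts = split(/\//, $line);    # headword field first, then the glosses
    my $word  = shift @parts;
    $word =~ s/\s+$//;                 # drop the space before the first slash
    # The reading sits in square brackets; fall back to the headword if absent.
    my $reading = ($word =~ /\[([^\]]+)\]/) ? $1 : $word;
    # A trailing "(P)" field is a marker, not a gloss, so split it off.
    my $common = (@parts && $parts[-1] eq '(P)') ? 1 : 0;
    pop @parts if $common;
    return { word => $word, reading => $reading,
             common => $common, glosses => \@parts };
}

my $entry = parse_entry("餡 [あん] /(n) (uk) red bean jam/anko/(P)/\n");
# Prints "餡 [あん] (P)"
print $entry->{word}, ($entry->{common} ? ' (P)' : ''), "\n";
# Prints a tab followed by "(n) (uk) red bean jam; anko"
print "\t", join('; ', @{ $entry->{glosses} }), "\n";
```

Keeping the reading separate from the headword is what lets an anchored pattern such as ^あん match the reading even though the line itself starts with the kanji.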

esearch.pl

Searches for English text

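#!/usr/bin/perl -w
use strict;
use warnings;
use utf8;
use PerlIO::gzip;
use Encode;

scalar @ARGV || die "Usage: $0 edict.gz [word]\n";
my $db = shift @ARGV;

# Optional English search term; with none given, every entry is printed.
my $srch = '';
scalar @ARGV and ($srch = shift @ARGV);

open(DB, '<:gzip', $db) || die "$db: $!\n";
while(<DB>) {
    # Convert each line from EUC-JP to UTF-8 bytes before printing.
    $_ = encode('utf-8', decode('euc-jp', $_));
    my @parts = split(/\//, $_);
    my $word = shift @parts;
    # Drop the trailing field that holds only the newline.
    pop @parts if(@parts && $parts[$#parts] !~ /\S/);
    # Match the term as a whole word, case-insensitively, in any gloss.
    next unless(length($srch) == 0 || scalar(grep(/\b$srch\b/i, @parts)) > 0);
    # Show the (P) marker next to the word instead of among the glosses.
    my $flag = (@parts && $parts[$#parts] eq '(P)') ? pop(@parts) : '';
    print "$word $flag\n\t" . join('; ', @parts) . "\n";
}
close(DB);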
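
The whole-word match that esearch.pl relies on can be tried in isolation. This is only a sketch: gloss_matches is a hypothetical helper name, and the \Q...\E quoting is an addition that esearch.pl itself does not do (the script treats the search term as a regex).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# gloss_matches is a hypothetical helper, not part of esearch.pl.
# It reports whether any gloss contains $term as a whole word, ignoring case.
# \Q...\E quotes regex metacharacters so the term is matched literally.
sub gloss_matches {
    my ($term, @glosses) = @_;
    return scalar(grep { /\b\Q$term\E\b/i } @glosses) > 0;
}

my @glosses = ('(n) noodle shop');
print gloss_matches('noodle', @glosses) ? "match\n" : "no match\n";   # match
print gloss_matches('NOODLE', @glosses) ? "match\n" : "no match\n";   # match
print gloss_matches('nood',   @glosses) ? "match\n" : "no match\n";   # no match
```

The \b anchors are what keep a search for "nood" from matching "noodle"; dropping them would turn the lookup into a plain substring search.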

Next up in this series will be a script that searches KanjiDic, which stores its data in XML.
