Spain is a big country with many languages. There is the official one for
the whole country, Spanish or Castilian (es-ES), and other official languages
spoken in specific regions, like Galician (gl-ES) in Galicia,
Basque (eu-ES) in the Basque Country, and Catalan (ca-ES)
in Catalonia.
But we have more languages that don't have the same official support:
Valencian, Aragonese, Asturian, and more. We also have some
dialects, like Andalusian.
All of these languages were discredited during the dictatorship, when the
only official language was Spanish and the others were treated as
vulgar or uneducated speech; in some cases the language was
prohibited.
When democracy arrived, some regions spent a lot of resources trying to
recover their language, and they still support it today with official institutions
and lessons in the official education system. But in other regions the stigma
continues to this day.
That's the case of Andalusian, which right now is not considered a language
but a dialect. In any case, this dialect is treated as uncultivated
speech, spoken by illiterates. We have a lot of Spanish movies and series where
the character who speaks Andalusian is the illiterate one, the one from the
countryside, or the family's servant.
Be proud of it
In Andalusia we have a lot of culture, literature and music made in Andalusian,
but we don't have a way to write it, and many people try to avoid the accent
and the local words to look more cultured. Because there's no writing
system, we write in Spanish, which makes it hard to write lyrics, poetry and other
kinds of literature. Andalusian is not only a way of talking: some words are
shorter, and we have contractions and extra vowel sounds, so if you write in Spanish
it's not the same as what is spoken. Some information is lost and, for example, in music
or poetry the meter doesn't match.
There's a movement trying to define a written Andalusian and to promote the
language, so that people feel proud of it and speak and write it without hang-ups.
Translator, here comes the code
And there's a group of developers working on tools that provide
direct translation from Spanish, and on other tools that make writing Andalusian easier.
I like to write code and I'm always happy to find new problems to solve, to learn
new languages and tools, and to spend some time trying to code something that I've not
done before. So I decided to write a translator from Spanish to Andaluh using Rust,
and I've created the andaluh-rs lib.
The translator is more or less easy: there are some rules that should be applied
from top to bottom, and each one basically replaces some groups of letters. There's an
implementation in Python that uses regular expressions for that. There are
a lot of regular expressions, so I thought that it could be easier to use a parser,
and I used the pest parser.
// suppress silent /h/
H = { ("h" | "H") }
initial_h = { H ~ letter }
CH = { C ~ H }
inner_ch = { CH ~ letter }
inner_h = { !inner_ch ~ H ~ letter }
hua = { H ~ ("ua" | "UA" | "Ua" | "uA") }
hue = { H ~ ("ue" | "UE" | "Ue" | "uE") }
noh = { !CH ~ !H ~ letter }
h = _{ ((sp|SOI)? ~ initial_h* ~ ((hua | hue | inner_ch | inner_h) | noh+)+)+ }
I've defined each rule in the pest format, so I have a parser for each rule,
and then I can swap each matched piece for the correct replacement.
pub fn h_rule(input: &str) -> Result<String, Error> {
    rule!(Rule::h, &input, Some(&defs::H_RULES_EXCEPT),
        // Silent h: drop it and pass its case on to the next letter
        Rule::initial_h | Rule::inner_h => |pair: Pair<Rule>| {
            let s = pair.as_str();
            let h = slice!(s, 0, 1);
            let next = slice!(s, 1);
            keep_case(&next, &h)
        },
        Rule::hue => |pair: Pair<Rule>| {
            keep_case("güe", &pair.as_str())
        },
        Rule::hua => |pair: Pair<Rule>| {
            keep_case("gua", &pair.as_str())
        })
}
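The keep_case helper is not shown in this post, so the following is only a guess at what such a helper could look like, using nothing but the standard library. The name keep_case_sketch and its exact casing policy (all caps, capitalized, or unchanged) are assumptions for illustration, not the code from andaluh-rs.

```rust
/// Hypothetical helper: return `replacement` cased like `reference`.
/// This is an assumed policy, not the real andaluh-rs implementation.
fn keep_case_sketch(replacement: &str, reference: &str) -> String {
    // Only letters decide the casing; digits and punctuation are ignored.
    let letters: Vec<char> = reference.chars().filter(|c| c.is_alphabetic()).collect();

    // Reference is all uppercase ("HUE", or a single "H"): shout the replacement.
    if !letters.is_empty() && letters.iter().all(|c| c.is_uppercase()) {
        return replacement.to_uppercase();
    }

    // Reference is capitalized ("Hue"): uppercase only the first char.
    if letters.first().map_or(false, |c| c.is_uppercase()) {
        let mut chars = replacement.chars();
        return match chars.next() {
            Some(first) => first.to_uppercase().chain(chars).collect(),
            None => String::new(),
        };
    }

    // Otherwise leave the replacement as it is.
    replacement.to_string()
}
```

With this policy, a word like "Huevo" keeps its capital letter after the "hue" part is replaced, and a dropped initial "H" passes its uppercase on to the following vowel.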
To simplify the code, I've defined the rule! macro, with the code shared by
all rules:
macro_rules! rule {
    // Variant without an exceptions map: forwards to the variant below
    ($rule: expr, $input: expr, $( $($t: pat)|* => $r: expr ),* ) => {{
        let map: Option<HashMap<&str, &str>> = None;
        rule!($rule, $input, map, $( $($t)|* => $r ),*)
    }};
    // Variant with an optional exceptions map
    ($rule: expr, $input: expr, $map: expr, $( $($t: pat)|* => $r: expr ),* ) => {{
        let (repl, input) = match $map {
            Some(ref m) => replace_exceptions($input, m),
            None => (vec![], $input.to_string())
        };
        let pairs = AndaluhParser::parse($rule, &input)?;
        let mut output: Vec<String> = vec![];
        for pair in pairs {
            let chunk = match pair.as_rule() {
                // Matched grammar rules go through their replacement closure
                $( $($t)|* => {
                    $r(pair)
                } ),*
                // Anything else is copied unchanged
                _ => {
                    String::from(pair.as_str())
                },
            };
            output.push(chunk);
        }
        let mut outstr = output.join("");
        if $map.is_some() {
            outstr = replace_exceptions_back(&outstr, repl);
        }
        Ok(outstr)
    }}
}
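The macro calls replace_exceptions and replace_exceptions_back, which are not listed in this post. One common way to get that behaviour is to swap each exception word for a placeholder before the rule runs and to put the final translation back afterwards. The sketch below assumes exactly that; the function names protect and restore, the NUL-delimited placeholder format and the sample exception entry are all made up for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical: replace every exception word with a numbered placeholder
/// and remember the translation that has to be restored later.
fn protect(input: &str, exceptions: &HashMap<&str, &str>) -> (Vec<String>, String) {
    let mut saved: Vec<String> = Vec::new();
    let words: Vec<String> = input
        .split(' ')
        .map(|word| match exceptions.get(word) {
            Some(translated) => {
                saved.push(translated.to_string());
                // NUL delimiters are assumed never to appear in normal text.
                format!("\u{0}{}\u{0}", saved.len() - 1)
            }
            None => word.to_string(),
        })
        .collect();
    (saved, words.join(" "))
}

/// Hypothetical: put the saved translations back in place of the placeholders.
fn restore(text: &str, saved: &[String]) -> String {
    let mut out = text.to_string();
    for (index, translated) in saved.iter().enumerate() {
        out = out.replace(&format!("\u{0}{}\u{0}", index), translated);
    }
    out
}
```

The point of the round trip is that the rule in between never sees the exception word, so it cannot apply the general replacement to it.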
And because Spanish and Andaluh both use non-ASCII characters, and Rust Strings
are indexed by bytes rather than by user-perceived characters, I've used the
unicode_segmentation crate and
defined some utility macros to get the real String length and to get slices of that
String.
// Iterate over extended grapheme clusters instead of bytes
macro_rules! chars {
    ($input: expr) => {
        UnicodeSegmentation::graphemes($input, true)
    }
}
// Slice by grapheme position: [start, end) or [start, ..)
macro_rules! slice {
    ($input: expr, $start: expr, $end: expr) => {
        chars!($input)
            .skip($start)
            .take($end - $start)
            .collect::<String>()
    };
    ($input: expr, $start: expr) => {
        chars!($input)
            .skip($start)
            .collect::<String>()
    }
}
// Length in graphemes, not in bytes
macro_rules! len {
    ($input: expr) => {
        chars!($input).count()
    }
}
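To see why byte positions are not enough, here is a small std-only sketch. It uses chars() (Unicode scalar values), which is an approximation of what the macros above do with grapheme clusters. The helper names char_len and char_slice are made up for this example.

```rust
/// Length in `char`s (Unicode scalar values), not in bytes.
fn char_len(input: &str) -> usize {
    input.chars().count()
}

/// Slice by `char` position, [start, end). Unlike `&input[start..end]`,
/// this never panics in the middle of a multi-byte character.
fn char_slice(input: &str, start: usize, end: usize) -> String {
    input
        .chars()
        .skip(start)
        .take(end.saturating_sub(start))
        .collect()
}
```

For "güe", str::len() reports 4 bytes because "ü" takes two bytes in UTF-8, while char_len reports 3. chars() still differs from graphemes when a letter is written with a combining mark: "u" followed by U+0308 is two chars but a single grapheme, and that is the case the unicode_segmentation crate covers.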
With all this done, we only have to apply all the rules, in the correct order,
to the input string to get the translated String as output.
pub fn epa(input: &str) -> Result<String, Error> {
    // TODO: escape links
    let rules = [
        h_rule,
        x_rule,
        ch_rule,
        gj_rule,
        v_rule,
        ll_rule,
        l_rule,
        psico_rule,
        vaf_rule,
        word_ending_rule,
        digraph_rule,
        exception_rule,
        word_interaction_rule,
    ];
    // Feed the output of each rule into the next one
    let mut output = input.to_string();
    for r in rules.iter() {
        output = r(&output)?;
    }
    Ok(output)
}
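The same pipeline idea can be tried without the parser. The sketch below uses two deliberately naive toy rules (plain string replacement, lowercase only) to show why the order matters; toy_ch_rule, toy_h_rule, apply_rules and the String error type are assumptions for this example, not part of andaluh-rs.

```rust
/// A rule takes the current text and returns the rewritten text or an error.
type RuleFn = fn(&str) -> Result<String, String>;

/// Toy rule: write "ch" as "x".
fn toy_ch_rule(input: &str) -> Result<String, String> {
    Ok(input.replace("ch", "x"))
}

/// Toy rule: drop every "h", without looking at its neighbours.
fn toy_h_rule(input: &str) -> Result<String, String> {
    Ok(input.replace('h', ""))
}

/// Apply the rules left to right, stopping at the first error.
fn apply_rules(input: &str, rules: &[RuleFn]) -> Result<String, String> {
    rules
        .iter()
        .try_fold(input.to_string(), |text, rule| rule(&text))
}
```

Running toy_ch_rule before toy_h_rule turns "chico" into "xico", while the opposite order gives "cico", because the "h" is already gone when the "ch" rule runs. The real h rule avoids this by skipping "ch" in its grammar, which is why it can safely go first.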
Performance
This code is not the best possible: I'm doing a lot of string operations with copies
and clones, and I'm sure that anyone with more experience with Rust can spot a lot
of places where this code could be optimized. At first I thought that the translation
could be done during the parsing, keeping a window to be able to look backward
and forward.
Maybe it's possible to read char by char, keeping a buffer, and detect whether
any of the rules applies to the content of the buffer, but I've based
this lib on the Python one, so for me it was easier to translate each regex to
a pest rule and then do the same replacements in the lib.
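As a rough idea of what the char-by-char approach could look like, here is a single-pass sketch of a much simplified silent h rule. It is an assumption-heavy toy: it handles lowercase input only, has no exception list, and the function name drop_silent_h is made up.

```rust
/// Simplified silent-h rule in one pass: "hue" -> "güe", "hua" -> "gua",
/// any other "h" is dropped unless it is part of "ch".
/// Lowercase only, no exceptions: a sketch, not the andaluh-rs rule.
fn drop_silent_h(input: &str) -> String {
    // Collect once so we can look one char back and two chars ahead.
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        let prev_is_c = i > 0 && chars[i - 1] == 'c';
        if c == 'h' && !prev_is_c {
            let next = chars.get(i + 1).copied();
            let after = chars.get(i + 2).copied();
            match (next, after) {
                (Some('u'), Some('e')) => {
                    out.push_str("güe");
                    i += 3;
                }
                (Some('u'), Some('a')) => {
                    out.push_str("gua");
                    i += 3;
                }
                // Plain silent h: emit nothing and move on.
                _ => i += 1,
            }
            continue;
        }
        out.push(c);
        i += 1;
    }
    out
}
```

This builds the output in one allocation and never re-parses the text, but each extra rule would need its own lookahead logic, which is the part that the pest grammar makes easier to write and read.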
I still think that there's a better solution for this problem, but sometimes
it's better to have something that just works than a perfect solution that
never gets finished.
During this process I've learned to use pest and I've played a lot with
regular expressions, so it was a fun project.