Tuesday, September 18, 2012

Google translation API emulation (Free API :P)


1, I miss the free API

I felt sad when Google started charging for its translation API. But after some research, I think Google is still open and friendly to technical otaku like me :P

2, Emulate the translation request

Here's the capture, showing the AJAX request sent when I translate the English word "china" to Chinese on the Google Translate page.

There are several parameters in this GET operation, including
"sl": source language,
"tl": target language,
"text": text to be translated to the target language,
etc...
You will get a 403 error if you just send the query parameters in the request; you should disguise your code as a browser to "cheat" Google.


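import urllib
import urllib2

# text, sl and tl hold the text to translate and the source/target language codes
queryArgs = {'hl': 'zh-CN',
        'ie': 'UTF-8', 'oe': 'UTF-8',
        'client': 't',
        'text': text, 'sl': sl, 'tl': tl,
        'multires': 1, 'ssel': 0, 'tsel': 0, 'sc': 1}

# Note: passing the encoded arguments as the second parameter makes urllib2 send them as a POST body
req = urllib2.Request("http://translate.google.com/translate_a/t", urllib.urlencode(queryArgs))
# Pretend to be a browser, otherwise Google answers with a 403
req.add_header("Referer", "http://translate.google.com/")  # the HTTP header is spelled "Referer"
req.add_header("Host", "translate.google.com")
req.add_header("Connection", "close")
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1")

response = urllib2.urlopen(req)
tres = response.read()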
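
For readers on Python 3, where urllib2 was merged into urllib.request and urllib.parse, the same request could be sketched roughly as below. It only builds the request and never sends it, so it says nothing about how the endpoint answers today. build_request is a hypothetical helper name, not something from the original code. The query string is appended to the URL so that the request really is a GET.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://translate.google.com/translate_a/t"
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 "
              "(KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1")


def build_request(text, sl, tl):
    """Build (but do not send) a browser-like GET request. Hypothetical helper."""
    query_args = {"hl": "zh-CN", "ie": "UTF-8", "oe": "UTF-8",
                  "client": "t", "text": text, "sl": sl, "tl": tl,
                  "multires": 1, "ssel": 0, "tsel": 0, "sc": 1}
    # Appending the encoded arguments to the URL keeps this a GET;
    # passing them as the data argument would turn it into a POST.
    req = Request(BASE_URL + "?" + urlencode(query_args))
    req.add_header("Referer", "http://translate.google.com/")
    req.add_header("User-Agent", USER_AGENT)
    return req


req = build_request("china", "en", "zh-CN")
print(req.get_method())   # GET
print(req.full_url)
```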


3, Understand the response

The response is a JavaScript array with 10 elements inside.
[[["中国","china","Zhōngguó",""]],[["noun",["中国","瓷器","华","中华","瓷"],[["中国",["China"]],["瓷器",["porcelain","china","chinaware"]],["华",["China","flower","flora"]],["中华",["China"]],["瓷",["porcelain","china","chinaware"]]]],["adjective",["中国的","瓷的"],[["中国的",["China","Chinese"]],["瓷的",["china"]]]]],"en",,[["中国",[5],0,0,1000,0,1,0]],[["china",4,,,""],["china",5,[["中国",1000,0,0],["瓷器",0,0,0]],[[0,5]],"china"]],,,[["en"]],4]


Let's check out the details:
index 0: [["中国","china","Zhōngguó",""]]
Summary in the target language. Position 2 in the picture.

index 1: [["noun",["中国","瓷器","华","中华","瓷"],[["中国",["China"]],["瓷器",["porcelain","china","chinaware"]],["华",["China","flower","flora"]],["中华",["China"]],["瓷",["porcelain","china","chinaware"]]]],["adjective",["中国的","瓷的"],[["中国的",["China","Chinese"]],["瓷的",["china"]]]]]
Detail in the target language. There are one or more items inside. In this sample there are 2 items, one explaining the noun and another explaining the adjective.

e.g. the noun explanation:
["noun",["中国","瓷器","华","中华","瓷"],[["中国",["China"]],["瓷器",["porcelain","china","chinaware"]],["华",["China","flower","flora"]],["中华",["China"]],["瓷",["porcelain","china","chinaware"]]]]
It has 3 parts:

  • part of speech: noun, adjective ... position 3 in the picture
  • explanations: position 4 in the picture
  • synonyms of each explanation: position 5 in the picture
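
To make the three parts concrete, here is a small sketch that unpacks the noun entry from the sample once it is available as a Python list. unpack_entry is a hypothetical helper, and the assumption that every entry has exactly these three parts comes only from this one sample.

```python
# One entry of index 1, copied from the sample response above.
noun_entry = ["noun",
              ["中国", "瓷器", "华", "中华", "瓷"],
              [["中国", ["China"]],
               ["瓷器", ["porcelain", "china", "chinaware"]],
               ["华", ["China", "flower", "flora"]],
               ["中华", ["China"]],
               ["瓷", ["porcelain", "china", "chinaware"]]]]


def unpack_entry(entry):
    """Split one detail entry into its three parts (hypothetical helper)."""
    part_of_speech, explanations, synonym_pairs = entry
    # Turn the [explanation, [synonyms...]] pairs into a lookup table.
    synonyms = {word: words for word, words in synonym_pairs}
    return part_of_speech, explanations, synonyms


pos, explanations, synonyms = unpack_entry(noun_entry)
print(pos)                # noun
print(len(explanations))  # 5
print(synonyms["瓷器"])    # ['porcelain', 'china', 'chinaware']
```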

index 2: "en"
source language (the sample translates from English)

index 3: None
unknown

index 4: [["中国",[5],0,0,1000,0,1,0]]
unknown

index 5: [["china",4,,,""],["china",5,[["中国",1000,0,0],["瓷器",0,0,0]],[[0,5]],"china"]]
unknown

index 6: None
unknown

index 7: None
unknown

index 8: [['en']]
source language

index 9: 4
unknown

Although some slices are still unknown, the information we're interested in is located in index 0 and index 1.

4, Write code to parse the response

4.1 By JavaScript parser

This was my first reaction. No doubt a JavaScript array can be parsed by a JavaScript parser. For a programming language as convenient as Python, there are definitely third-party libraries to do it.
First, I found WebKitGTK+, which is a GTK port of WebKit. Unfortunately, I gave it up after reading its documentation; it's too heavy to use. So is QtWebKit.
Then I found PyV8 (http://code.google.com/p/pyv8/), which is a Python binding to Google's V8 JavaScript engine. It's really easy to use.


import PyV8 as v8

res = tres.decode("utf8")  # tres is the raw response read in step 2
ctxt = v8.JSContext()
ctxt.enter()
x = ctxt.eval(res)  # let V8 evaluate the array literal

summaryJSObj = x[0]  # index 0: summary
detailJSObj = x[1]   # index 1: detail, may be empty

summary = summaryJSObj[0][0]
detail = ""

if detailJSObj:
    for dx in detailJSObj:
        detail += "%s." % dx[0]  # noun, adj...
        detail += "%s\n" % str(dx[1])

print summary, "\n", detail


4.2 Convert to a Python list

It's kind of tricky. Comparing the JavaScript array with the Python list format, you'll find the only difference is that JavaScript grammar allows empty elements such as "[,,,]", while a Python list only allows a single trailing comma after an element, like "[1,]".
So when we turn the JavaScript array string into a Python list, we have to get rid of the consecutive commas.
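
The difference is easy to check from Python itself. The sketch below uses ast.literal_eval as the parser, which is my choice for illustration; parses_as_python is a hypothetical helper name.

```python
import ast


def parses_as_python(source):
    """Return True if source is a valid Python literal (hypothetical helper)."""
    try:
        ast.literal_eval(source)
        return True
    except (SyntaxError, ValueError):
        return False


print(parses_as_python("[1,]"))     # True: one trailing comma is fine
print(parses_as_python("[1,,2]"))   # False: consecutive commas
print(parses_as_python("[,1]"))     # False: leading comma
```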

Let's replace them! Our goal, or rather our test cases:
JavaScript array -> Python list
[1,] -> [1,] or  [1,None]
[1,1] -> [1,1]
[1,,,] -> [1,None,None,] or [1,None,None,None]
[1,,,2,3] -> [1,None,None,2,3]
[,1] -> [None,1]
[,,1] -> [None,None,1]

A regular expression is the key to all the test cases above.
re.sub(r"(?<=,),|(?<=\[),", "None,", JS_ARRAY_STRING)
It takes care of 2 grammar cases:
(?<=,),
a comma after a comma, e.g. ",," -> ",None,"
(?<=\[),
a comma after a '[', e.g. "[," -> "[None,"
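
The test cases can be run against this regular expression directly. In the sketch below, js_array_to_list is a hypothetical helper name, and ast.literal_eval is used to turn the rewritten string into a list. A trailing comma is left in place by the substitution, so "[1,,,]" comes out as a three-element list.

```python
import ast
import re

EMPTY_SLOT = re.compile(r"(?<=,),|(?<=\[),")


def js_array_to_list(js_array_string):
    """Fill empty slots with None, then parse the result as a Python literal."""
    return ast.literal_eval(EMPTY_SLOT.sub("None,", js_array_string))


cases = {
    "[1,]": [1],
    "[1,1]": [1, 1],
    "[1,,,]": [1, None, None],
    "[1,,,2,3]": [1, None, None, 2, 3],
    "[,1]": [None, 1],
    "[,,1]": [None, None, 1],
}
for source, expected in cases.items():
    assert js_array_to_list(source) == expected, source
print("all cases pass")
```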


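import ast
import re

# Fill every empty slot with None so the string becomes a valid Python literal
res = re.sub(r"(?<=,),|(?<=\[),", "None,", tres)
# literal_eval only accepts literals, so the response cannot run arbitrary code as it could with eval
data = ast.literal_eval(res)

summaryJSObj = data[0]
detailJSObj = data[1]
summary = summaryJSObj[0][0]
detail = ""

if detailJSObj:
    for dx in detailJSObj:
        detail += "%s." % dx[0]  # noun, adj...
        for exp in dx[1]:
            detail += "%s," % str(exp)
        detail += "\n"
print summary, "\n", detail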
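
One caveat: the regular expression looks at the raw text, so a translated string that itself contains ",," or "[," would be rewritten too. That did not happen in the sample above, but a possible string-aware variant is sketched below. It is my own variation, not part of the original approach, and fill_empty_slots is a hypothetical name. It matches whole double-quoted strings first and copies them through unchanged.

```python
import ast
import re

# Either a complete double-quoted string literal, or a comma that opens an empty slot.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|(?<=[\[,]),')


def fill_empty_slots(js_array_string):
    def replace(match):
        token = match.group(0)
        # String literals are copied through untouched; only bare commas change.
        return token if token.startswith('"') else "None,"
    return TOKEN.sub(replace, js_array_string)


sample = '["a,,b",,1]'
naive = re.sub(r"(?<=,),|(?<=\[),", "None,", sample)
print(ast.literal_eval(naive))                     # ['a,None,b', None, 1]
print(ast.literal_eval(fill_empty_slots(sample)))  # ['a,,b', None, 1]
```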


over