Classical Chinese Poetry and Prose

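I am thinking of building a website of classical Chinese poetry and prose for children, similar to Three Hundred Tang Poems (《唐诗三百首》) and Guwen Guanzhi (《古文观止》), with most of the content drawn from those two anthologies. Its distinguishing features: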

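  • Content: the original text comes first. No translations, and almost no annotations.
  • Layout: content comes first, with every distraction removed. It should be clean and elegant.
  • Each famous classical author is a blog author, which is both intuitive and fun, and makes the content easy to organize.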

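Work to be done:

  1. Develop bulk insertion of authors (done).
  2. Develop bulk insertion of posts (done).
  3. Build the category hierarchy; this needs help from specialists.
  4. Develop a new data structure to store traditional timekeeping: dynasty, era name (年号), year within the era, Common Era year, season, seasonal period (时令), and shichen (时辰, the two-hour period of the day).
  5. Extend the post attributes, with the post's publication time using the new data structure.
  6. Extend the author attributes, with the author's birth and death dates using the new data structure.
  7. Modify the interface so that posts can display traditional dates (era name, Heavenly Stems and Earthly Branches).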
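
The time-related items in the list above (4 to 7) could be modeled along the following lines. This is only a sketch in Python, not the site's actual code: the class name `TraditionalDate`, its fields and the helper `sexagenary_year` are all hypothetical. The stem-branch year relies on the common rule that CE 4 was a 甲子 year, so it only holds for Common Era years.

```python
from dataclasses import dataclass
from typing import Optional

HEAVENLY_STEMS = "甲乙丙丁戊己庚辛壬癸"
EARTHLY_BRANCHES = "子丑寅卯辰巳午未申酉戌亥"


def sexagenary_year(ce_year: int) -> str:
    """Stem-branch name of a Common Era year (CE 4 was a 甲子 year)."""
    offset = (ce_year - 4) % 60
    return HEAVENLY_STEMS[offset % 10] + EARTHLY_BRANCHES[offset % 12]


@dataclass(frozen=True)
class TraditionalDate:
    """Hypothetical record; every field except the dynasty may be unknown."""
    dynasty: str                        # e.g. 唐
    era_name: Optional[str] = None      # 年号, e.g. 开元
    era_year: Optional[int] = None      # year within the era, starting at 1
    ce_year: Optional[int] = None       # Common Era year
    season: Optional[str] = None        # 春 / 夏 / 秋 / 冬
    seasonal_period: Optional[str] = None  # 时令
    shichen: Optional[str] = None       # 时辰, one of twelve two-hour periods

    def display(self) -> str:
        """Render dynasty, era year and stem-branch year, skipping unknowns."""
        parts = [self.dynasty]
        if self.era_name and self.era_year:
            parts.append(f"{self.era_name}{self.era_year}年")
        if self.ce_year is not None:
            parts.append(f"{sexagenary_year(self.ce_year)}年")
        return " ".join(parts)


example = TraditionalDate(dynasty="唐", era_name="开元", era_year=13, ce_year=725)
print(example.display())
```

Keeping every field optional reflects the fact that many classical texts can only be dated to a dynasty or an era, not to a day.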

The current state can be seen here: http://gushiwen.genglinxiao.com/

Blog move done

I have finally finished moving all my blog posts from the original Blogspot site to my own server.

All posts are here now and the original site:

http://retidigca.blogspot.com/

is deprecated now.

Many topics were started but never completed. I will try to put some of the existing drafts online soon.

building self-aware device – part 2

After about 15 months, here’s part 2. You can see part 1 here: building self-aware device – part 1

So now we define self-awareness as:

  1. Knows one’s own properties and boundary;
  2. Able to learn one’s own identity from self-initiated training.

Can we build a device that is self-aware? After some thought, we have to say that achieving item 1 in its most general sense is way beyond our reach. Animals learn their own properties and boundary (again) through learning. The learning process correlates the visual signal from the eyes, touch signals from sensors covering the whole body, signals from motor neurons, and maybe more. We might be able to build an artificial eye, but currently no technology comes close to a distributed sensing system like the skin and the fur, or a distributed control system like the muscles.

But if we limit our device to a rigid body, it becomes something we can handle, at least to a certain extent.

If the 3D model (the shape and the size) of the rigid body is known, then with a GPS receiver installed at a fixed point within the rigid body and a gyroscope to tell the orientation of the device, we basically have a device that knows its own properties and boundary. (In our simplified case, both the properties and the boundary are static. The properties are whatever is inside the rigid body; the boundary is the boundary of the rigid body.)
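
To make the rigid-body idea concrete, here is a minimal sketch reduced to two dimensions. It assumes the GPS fix has already been converted to local metres and that the gyroscope yields a single heading angle; the function names `boundary_in_world` and `contains` and the 4 m x 2 m outline are made up for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def boundary_in_world(outline: List[Point], position: Point, heading_rad: float) -> List[Point]:
    """Rotate the body-frame outline by the heading, then shift it to the position fix."""
    c, s = math.cos(heading_rad), math.sin(heading_rad)
    px, py = position
    return [(px + c * x - s * y, py + s * x + c * y) for x, y in outline]


def contains(polygon: List[Point], point: Point) -> bool:
    """Ray-casting test: does the point lie inside the polygon?"""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # this edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical body: a 4 m x 2 m rectangle with the receiver at its centre.
OUTLINE = [(-2.0, -1.0), (2.0, -1.0), (2.0, 1.0), (-2.0, 1.0)]
world = boundary_in_world(OUTLINE, position=(10.0, 5.0), heading_rad=math.pi / 2)

part_of_me = contains(world, (10.0, 6.5))
not_part_of_me = contains(world, (11.5, 5.0))
print(part_of_me, not_part_of_me)
```

With the body turned by 90 degrees, its long side runs along the y-axis, so a point 1.5 m "ahead" of the receiver is inside the boundary while a point 1.5 m to the side is not.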

Following this approach, we might be able to add some moving parts into this rigid body gradually.

<Here further expansion is needed>

Then let’s move to the next step. Let’s put it in front of a mirror. The device has to start some random movement, and then correlate the movement with the movement it sees in the mirror.

For that we need a neural network. Its input would be, on the one hand, the instructions for the random movement and, on the other hand, the actual movement the device sees in the mirror. So this becomes a supervised learning problem.
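
As a much simpler stand-in for the neural network, the core of the mirror experiment can be sketched with plain correlation: the device issues random movement commands, observes several moving objects, and picks the one whose motion tracks its own commands. Everything here is an assumption for illustration: the one-dimensional motion, the noise level, the sign flip in the mirror, and the names `pearson` and `find_self`.

```python
import random
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def find_self(commands: Sequence[float], tracks: List[Sequence[float]]) -> int:
    """Index of the observed track that best matches the issued commands."""
    scores = [abs(pearson(commands, track)) for track in tracks]
    return max(range(len(tracks)), key=scores.__getitem__)


rng = random.Random(0)
commands = [rng.uniform(-1.0, 1.0) for _ in range(200)]
# Assumed: the reflection moves with the opposite sign, plus some sensor noise.
reflection = [-c + rng.gauss(0.0, 0.1) for c in commands]
# Two unrelated objects that also happen to be moving in view.
others = [[rng.uniform(-1.0, 1.0) for _ in range(200)] for _ in range(2)]
tracks = [others[0], reflection, others[1]]

print(find_self(commands, tracks))
```

Using the absolute value of the correlation means the device recognizes itself whether or not the mirror flips the direction of motion.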

So we see that building a self-aware device still has a long way to go. However, I believe we should be able to experiment with it in controlled scenarios, like fully automated driving, and try to push the limit to see how far we can go.

Computer, another origin

When talking about the origin of the computer, people tend to think about chips, CPUs, World War II and ENIAC, and on some occasions go back to the slide rule and the suanpan. This is also how it is taught in schools and universities. The logic behind this story line is that the computer originated from the need for fast computing, especially arithmetic.

However, there’s another origin that is at least as important, has a longer history, and is, at least to some people, more fascinating: how the architecture of the modern computer came into being. Computers nowadays are so powerful that it sometimes seems inconceivable to link them to their ancient ancestors. On the other hand, you only have to look into some of the remarkable masterpieces of the past to see that the link is simply undeniable.

A BBC program I watched recently, “Mechanical Marvels – Clockwork Dreams”, showed one such masterpiece: a mechanical boy, built back in the 1770s, that can write up to 40 letters of text, depending on its configuration.

What makes it remarkable is that the text is customizable. That means the mechanical boy is programmable.

Of course, it’s still a finite-state machine, but it is programmable: add unbounded scratch memory to it and it becomes a complete Turing machine!

It’s actually from this origin that the techniques were developed which modern computer scientists use to deal with abstract topics like computational complexity and formal languages.

Some thoughts on Vehicle Telematics & Big Data application in CRM

From an OEM’s perspective, every new technology has to serve the ultimate goal: selling more cars.

Traditionally, just like other business departments, CRM could only base its decision making on sales-related data. In comparison, the data collected directly from the vehicle via telematics has the following advantages:

  • much better data – OEMs get data from customers directly instead of through third parties, which improves both the accuracy and the completeness of the data;
  • much higher frequency – instead of once in a while, we get updates from customers/vehicles constantly;
  • higher flexibility – OEMs are able to change the data collection policy and setup with minimal management overhead.

The same can be said for the other direction – delivering information from OEMs to customers. The advantages of precise targeting are so well established that I don’t even need to list them here.

The only side effect of doing CRM using vehicle telematics is that it accumulates a huge amount of data: Big Data.

Interestingly, for OEMs, the motivation to employ Big Data hasn’t been strong enough until recently. That’s because OEMs already have established practices in traditional business intelligence (BI), and the added value of changing to a new technology hasn’t been obvious or significant.

It’s true that Big Data, as a field of special expertise, stemmed from the real-world need to store and process an ever-increasing amount of data. However, a whole set of technologies has been developed to deal with the new challenges posed by the volume of data. Along with this development, the focus of Big Data has shifted from volume and velocity towards data mining, artificial intelligence and machine learning. It’s in these aspects that Big Data may provide unique value to CRM compared to traditional BI.

Take the sales pipeline for example:

  1. Campaign
  2. Leads
  3. Opportunities
  4. Sales
  5. Client
  6. Retention

Traditional wisdom dictates that the data in each stage has to be complete, reliable and continuous; otherwise the model won’t make much sense. However, that is actually a limitation of the tools and the model. Traditional BI is incapable of dealing with incomplete data; the model it is based upon cannot handle fuzziness.

However, in the real world, incomplete data is the norm, and fuzziness is simply the nature of humans as opposed to machines. Luckily, techniques have been developed in Big Data to deal with incomplete data and fuzziness. With these techniques, a CRM system will behave more like a human, making predictions based on incomplete data with probabilities in mind.

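Creating a DVD Video Disc

A friend needed to make a DVD and ran into many problems. I helped look into it and found that the relevant information is scattered across forums and mostly applies to a single scenario; there is no complete, unified guide (because the DVD Video standard is not public: you have to pay 5,000 US dollars for a license and also sign a non-disclosure agreement). So I have tried to collect the relevant information below.

First, DVD here means DVD Video. Its standard is set by the DVD Forum. The standard specifies not only the storage medium, the file system and the directory structure, but also the video encoding.

DVD Video uses a standard DVD disc as its storage medium: 12 cm in diameter, read with a 650 nm laser.

DVD Video uses the UDF Bridge file system, which is compatible with the ISO 9660 file system.

DVD Video has the following directory structure: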

[Figure: DVD Video file structure, from Wikipedia]

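The video files are stored in the \VIDEO_TS\ directory. The .VOB files contain the video; the other files are auxiliary files of various kinds, such as the DVD menu and chapter information.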
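
A quick way to sanity-check a disc image folder before burning is sketched below. It is not an implementation of the (non-public) standard: the function name `check_dvd_video_layout` is made up, and the file names it looks for (`VIDEO_TS.IFO`, and a `.BUP` backup next to each `.IFO`) come from the commonly documented layout rather than from the text above.

```python
import tempfile
from pathlib import Path
from typing import List


def check_dvd_video_layout(root: Path) -> List[str]:
    """Return a list of problems; an empty list means the basic layout looks right."""
    video_ts = root / "VIDEO_TS"
    if not video_ts.is_dir():
        return ["missing VIDEO_TS directory"]
    names = {p.name.upper() for p in video_ts.iterdir() if p.is_file()}
    problems = []
    if "VIDEO_TS.IFO" not in names:
        problems.append("missing VIDEO_TS.IFO")
    if not any(name.endswith(".VOB") for name in names):
        problems.append("no .VOB files")
    # Each .IFO control file is normally accompanied by a .BUP backup copy.
    for name in sorted(names):
        if name.endswith(".IFO") and name[:-4] + ".BUP" not in names:
            problems.append(f"no backup for {name}")
    return problems


# Demo on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    disc = Path(tmp)
    empty_report = check_dvd_video_layout(disc)
    (disc / "VIDEO_TS").mkdir()
    for name in ("VIDEO_TS.IFO", "VIDEO_TS.BUP", "VTS_01_0.IFO", "VTS_01_0.BUP", "VTS_01_1.VOB"):
        (disc / "VIDEO_TS" / name).touch()
    full_report = check_dvd_video_layout(disc)

print(empty_report, full_report)
```

The check only looks at names, so it will not catch a badly encoded .VOB file; it is meant to catch a wrong folder layout early.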

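Because of these requirements, DVD burning software usually has a separate option for creating a DVD Video disc. Selecting this option produces a standard DVD Video disc. Next, let's talk about encoding the video and creating subtitles.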


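Why shouldn't you try to invent your own encryption algorithm or protocol?

I happened to see a similar question on Zhihu, did not find a satisfying answer, and decided to try answering it myself.

Read any introductory book on modern cryptography and you will be warned: please don’t try to invent your own encryption algorithm or protocol.

But to a self-respecting engineer who knows little about modern cryptography, a sentence like that is practically a challenge. So here I try to give a complete explanation.

First of all, not everyone is forbidden from inventing encryption algorithms and protocols. If you hold two doctorates in mathematics, have been immersed in the cryptography community for decades, know the existing security systems inside out, and see room for improvement in an existing algorithm or protocol, both academia and industry will welcome your contribution. That is only a figure of speech, of course; someone with no title at all might become famous overnight by inventing a new algorithm. The point is that you first need to understand the state of the art and the latest developments; reinventing the wheel from scratch is unwise. And understanding the state of the art in cryptography is already a very high bar. If you have cleared that bar, you have most likely lost the original impulse, and you are no longer the target audience of introductory books.

Second, the warning applies to situations where information security has to be taken seriously. If you only want to keep something from your girlfriend, or to fool your boss, an algorithm of your own invention may well be enough.

So for someone who cannot clear the bar of modern cryptography but does need to take information security seriously (for example, because your software product or system stores a lot of user information), how can this impulse be restrained?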

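The first thing to understand is that the security of modern encryption algorithms and protocols does not depend on keeping the algorithm secret. Inventing your own algorithm does not improve security, and the public algorithms are very secure. Most people who plan to invent their own encryption algorithm are hazy on this point, and some have heard it but still harbor illusions.

Second, security is a matter of systems engineering: a tiny mistake anywhere affects the security of the whole system. The existing algorithms and protocols have been tempered over many years and contain a great many subtle details. The security of a home-made algorithm or protocol cannot be compared with that of the existing public ones. In other words, a home-made algorithm or protocol will only be less secure than the existing public ones.

Finally, modern computers are astonishingly powerful, and breaking a flawed encryption algorithm is very easy. So if you are serious about security and encryption, please do not use a home-made algorithm.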
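
To show how little effort a flawed scheme can take to break, here is a toy example of my own choosing, not taken from any real product: a "cipher" that XORs the data with a short repeating key. The function name, the key and the message are all made up. The attack assumes the attacker can guess the first few bytes of the plaintext and that the key is no longer than that guess.

```python
def homemade_encrypt(data: bytes, key: bytes) -> bytes:
    """A typical home-grown 'cipher': XOR with a short repeating key.

    XOR is its own inverse, so the same function also decrypts.
    """
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# What the defender does (key and message are hypothetical).
secret_key = b"k3y!"
plaintext = b"user=alice;password=correct horse battery staple"
ciphertext = homemade_encrypt(plaintext, secret_key)

# What the attacker does: guess that the record starts with "user",
# XOR the guess against the ciphertext, and read the key straight off.
known_prefix = b"user"
recovered_key = bytes(c ^ p for c, p in zip(ciphertext, known_prefix))
recovered = homemade_encrypt(ciphertext, recovered_key)

print(recovered_key, recovered)
```

No brute force is needed at all here; four guessed bytes are enough to recover the key and every message encrypted with it. Established ciphers are designed so that knowing some plaintext does not reveal the key.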

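A short history of cracking the China map coordinate (GCJ-02) offset algorithm

In 2006, Google began working with AutoNavi to use the maps of China that AutoNavi provided. This was probably the first time a foreign company ran into the problem.

From 2009 on, the coordinate offset of Chinese maps started to become known to the outside world. Garmin users found that GPS units bought in the United States were almost unusable in China, while Garmin products bought in China had no problem. Users of the Google Maps API found that points of interest could not be placed accurately on the map of China. More interestingly, some users repeatedly reported this to Google as a bug and never received any response. Similarly, Garmin said it had no solution and advised customers to buy GPS devices inside China if they needed them.

Meanwhile, people from all quarters began trying to crack the offset algorithm. Two lines of work are worth noting:

In January 2010, a user named wuyongzheng found: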

I accidentally found the Chinese version of Google Map ditu.google.com to be able to correlate satellite image with map, and it gives the amount of deviation for any location in China. This URL queries the deviation of 34.29273N,108.94695E (Xi’an): http://ditu.google.com/maps/vp?spn=0.001,0.001&t=h&z=18&vp=$34.29273,108.94695 (it seems it doesn’t work now)

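With enough data, wuyongzheng suggested using regression to approximate the offset algorithm: https://wuyongzheng.wordpress.com/2010/01/22/china-map-deviation-as-a-regression-problem/

Earlier attempts had been sporadic and aimed at individual cities. wuyongzheng's suggestion was the first step toward solving the problem comprehensively and systematically.

In May 2013, Maxime Guilbot followed this suggestion and obtained an approximation accurate to 4-5 meters: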

https://github.com/maxime/ChinaMapDeviation

In October 2013, wuyongzheng ran the regression himself and obtained the following result:

http://wuyongzheng.github.io/china-map-deviation/paper.html

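The regression results of Maxime Guilbot and wuyongzheng were about the best that could be achieved by groping in the dark, and so they received wide attention and use.

On the other line, in April 2010, the emq project added a file, Converter.java: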

http://emq.googlecode.com/svn/emq/src/Algorithm/Coords/Converter.java

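This code converts WGS-84 coordinates to GCJ-02 coordinates with very high precision.

In February 2013, the code was noticed by a user named coolypf, who tidied it up and used it in his own project:

https://on4wp7.codeplex.com/SourceControl/changeset/view/21483#353936

The key code is worth posting here: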

        const double pi = 3.14159265358979324;

        //
        // Krasovsky 1940
        //
        // a = 6378245.0, 1/f = 298.3
        // b = a * (1 - f)
        // ee = (a^2 - b^2) / a^2;
        const double a = 6378245.0;
        const double ee = 0.00669342162296594323;

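        //
        // World Geodetic System ==> Mars Geodetic System
        //
        // outOfChina, transformLat and transformLon are helpers defined
        // elsewhere in the original source; they are not part of this excerpt.
        public static void transform(double wgLat, double wgLon, out double mgLat, out double mgLon)
        {
            // Coordinates outside China are returned unchanged.
            if (outOfChina(wgLat, wgLon))
            {
                mgLat = wgLat;
                mgLon = wgLon;
                return;
            }
            // Raw offsets, evaluated relative to 105 E, 35 N.
            double dLat = transformLat(wgLon - 105.0, wgLat - 35.0);
            double dLon = transformLon(wgLon - 105.0, wgLat - 35.0);
            double radLat = wgLat / 180.0 * pi;
            // magic = 1 - e^2 * sin^2(lat), the usual ellipsoid term.
            double magic = Math.Sin(radLat);
            magic = 1 - ee * magic * magic;
            double sqrtMagic = Math.Sqrt(magic);
            // Convert the offsets from lengths on the ellipsoid to degrees,
            // using the meridional and prime-vertical radii of curvature.
            dLat = (dLat * 180.0) / ((a * (1 - ee)) / (magic * sqrtMagic) * pi);
            dLon = (dLon * 180.0) / (a / sqrtMagic * Math.Cos(radLat) * pi);
            mgLat = wgLat + dLat;
            mgLon = wgLon + dLon;
        }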

In March 2013, coolypf introduced this code on his own blog:

http://blog.csdn.net/coolypf/article/details/8686588

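In September 2014, wuyongzheng noticed coolypf's project. At that point the two lines converged, and the coordinate offset problem was essentially solved for good.

As the code above shows, compared with WGS-84, GCJ-02 on the one hand uses a different reference ellipsoid (SK-42, Krasovsky, presumably a legacy of Soviet influence) and on the other hand introduces a high-frequency nonlinear offset.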
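
The ellipsoid part of the code can be checked independently. The sketch below, in Python rather than the C# of the excerpt, recomputes `ee` from the two Krasovsky parameters given in the comments (a = 6378245.0, 1/f = 298.3) and evaluates the two denominators of `transform`, which are the lengths of one degree of latitude and longitude on that ellipsoid. The function name `metres_per_degree` is mine; it does not reproduce the nonlinear offset itself.

```python
import math

A = 6378245.0          # semi-major axis of the Krasovsky 1940 ellipsoid, metres
INV_F = 298.3          # inverse flattening

f = 1.0 / INV_F
ee = 2.0 * f - f * f   # equals (a^2 - b^2) / a^2 with b = a * (1 - f)


def metres_per_degree(lat_deg: float):
    """Length of one degree of latitude and of longitude at the given latitude."""
    rad = math.radians(lat_deg)
    magic = 1.0 - ee * math.sin(rad) ** 2
    meridional_radius = A * (1.0 - ee) / (magic * math.sqrt(magic))
    prime_vertical_radius = A / math.sqrt(magic)
    per_degree_lat = meridional_radius * math.pi / 180.0
    per_degree_lon = prime_vertical_radius * math.cos(rad) * math.pi / 180.0
    return per_degree_lat, per_degree_lon


print(ee)
print(metres_per_degree(35.0))
```

Dividing an offset expressed as a length by these two values gives the offset in degrees, which is what the last lines of `transform` do before adding it to the WGS-84 coordinates.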

RSA illustration with not-so-small numbers – part 2

Let’s have a closer look at the encryption. During the communication, what has been exposed is:

Alice’s public key (n=2627, e=13), and the encrypted message.

For anyone who comes to modern cryptography from the classical, single-key world, it’s tempting to try to decrypt the encrypted message using the encrypting key, the public key.

For these people, I have the chart below, which shows the mapping between the plain text and the encrypted data:

[Figure: encryption_mapping]

The x-axis is the plain-text data (sorted from 1 to 2627) and the y-axis is the encrypted data (from 0 to 2626). I did the calculation using this one-line script:

~$ for i in `seq 1 2627`; do echo "$i^13 %2627" | bc; done > /tmp/encryption.mapping

Below is part of this chart, zoomed in:

[Figure: encryption_mapping_part]

So you know the encrypted data, let’s say 2144, and you know the public key (n=2627, e=13). How do you find the number x such that x^13 % 2627 = 2144?

You cannot, unless you compute the result for every possible 1<x<2627 and then pick the correct one. That’s brute force. This is one of the basic assumptions behind the security of RSA: there’s no efficient way to find x. This is called the RSA problem (taking an e-th root modulo n); it is often confused with the discrete logarithm problem, which asks for the exponent rather than the base.

In real-world scenarios, the 2 prime numbers will be so large that brute force is simply impractical.

Then, to decrypt the message, one would need the private key. The private key is the modular inverse of e modulo phi(n). However, in order to get phi(n), one has to know the prime factors of n. And factoring a large number is believed to be computationally hard. That is the other assumption behind the security of RSA: there’s no efficient way to factor a large number.
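
With numbers this small, both attacks can be carried out in a few lines, which makes the two assumptions easy to see. The sketch below uses Python instead of bc; the variable names are mine, and the only inputs are the public key (n=2627, e=13) and the example ciphertext 2144 from above.

```python
N, E = 2627, 13        # Alice's public key
CIPHERTEXT = 2144      # the example ciphertext

# Attack 1, brute force: try every x and keep those with x^E mod N == CIPHERTEXT.
brute = [x for x in range(N) if pow(x, E, N) == CIPHERTEXT]

# Attack 2, factoring: trial division works only because N is tiny.
p = next(k for k in range(2, N) if N % k == 0)
q = N // p
phi = (p - 1) * (q - 1)

# The private exponent d is the modular inverse of E modulo phi(N).
d = pow(E, -1, phi)    # pow with exponent -1 needs Python 3.8 or later
plain = pow(CIPHERTEXT, d, N)

print(p, q, phi, d)
print(plain, brute)
```

Both routes end at the same plaintext. The first loop runs n times and the second at most about sqrt(n) times, which is trivial for n=2627 and hopeless with the simple methods shown here once n is hundreds of digits long.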

As you will see in other places, these 2 assumptions are cornerstones of modern public-key cryptography.