Some thoughts on Vehicle Telematics & Big Data application in CRM

From an OEM’s perspective, every new technology has to serve the ultimate goal: selling more cars.

Traditionally, like other business departments, CRM could base its decisions only on sales-related data. In comparison, the data collected directly from the vehicle via telematics has the following advantages:

  • better quality – OEMs get data from customers directly instead of through third parties, which improves both the accuracy and the completeness of the data;
  • much higher frequency – instead of once in a while, we get updates from customers/vehicles constantly;
  • higher flexibility – OEMs can change the data collection policy and setup with minimal management overhead.

The same can be said for the other direction – delivering information from OEMs to customers. The advantages of precise targeting are so well established that I don’t even need to list them here.

The only side effect of doing CRM with vehicle telematics is that it accumulates a huge amount of data – Big Data.

Interestingly, OEMs’ motivation to employ Big Data wasn’t strong until recently. That’s because OEMs already have established practices in traditional business intelligence (BI), and the added value of switching to a new technology hasn’t been obvious or significant.

It’s true that Big Data, as a field of special expertise, stemmed from the real-world need to store and process ever-increasing amounts of data. A whole set of technologies was developed to deal with the challenges posed by that volume. Along with this development, however, the focus of Big Data has shifted from volume and velocity towards data mining, artificial intelligence and machine learning. It’s in these aspects that Big Data may provide unique value to CRM compared to traditional BI.

Take the sales pipeline for example:

  1. Campaign
  2. Leads
  3. Opportunities
  4. Sales
  5. Client
  6. Retention

Traditional wisdom dictates that the data in each stage has to be complete, reliable and continuous; otherwise the model won’t make much sense. However, that is actually a limitation of the tools and the model. Traditional BI is incapable of dealing with incomplete data; the model it is based upon cannot handle fuzziness.

In the real world, however, incomplete data is the norm, and fuzziness is simply the nature of humans as opposed to machines. Luckily, techniques have been developed in Big Data to deal with incomplete data and fuzziness. With these techniques, a CRM system will behave more like a human, making predictions based on incomplete data with probabilities in mind.

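Creating a DVD Video disc

A friend needed to make a DVD and ran into a lot of problems. I helped look into it and found that the relevant information is scattered across various forums and mostly applies to a single scenario. There is no complete, unified guide (because the DVD Video standard is not public: obtaining a licence costs 5,000 US dollars and requires signing a non-disclosure agreement). So I have tried to organize the relevant information below.

First of all, the DVD discussed here means DVD Video. Its standard is defined by the DVD Forum. The standard specifies not only the storage medium, file system and directory structure, but also restricts the video encoding.

DVD Video uses a standard DVD disc as its storage medium: 12 cm in diameter, read with a 650 nm laser.

DVD Video uses the UDF Bridge file system, which is compatible with the ISO 9660 file system.

DVD Video has the following directory structure: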

[Figure: DVD Video file structure, from Wikipedia]

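The video files are stored under the \VIDEO_TS\ directory. The .VOB files contain the video itself; the others are auxiliary files holding, for example, the DVD menus and chapter information.

Because of these requirements, DVD burning software usually has a separate option for creating DVD Video. Selecting that option produces a standard DVD Video disc. Next, let’s talk about encoding the video and making subtitles.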


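Why shouldn’t you try to invent your own encryption algorithm or protocol?

I happened to come across a similar question on Zhihu, didn’t find a satisfying answer, and decided to try answering it myself.

Read any introductory book on modern cryptography and you will be warned: please don’t try to invent your own encryption algorithm or protocol.

However, to a proud engineer who doesn’t know much about modern cryptography, a sentence like that is practically a challenge. So here I try to give a complete explanation.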

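First, not everyone is forbidden from inventing encryption algorithms and protocols. If you hold two PhDs in mathematics, have been immersed in the cryptography community for decades, know the existing security systems inside out, and see something in an existing algorithm or protocol that can be improved, both academia and industry will welcome your contribution. This is just a figure of speech, of course; someone without any titles might also become famous overnight by inventing a new algorithm. The point is that you first need to understand the state of the art and the latest developments; reinventing the wheel from scratch is unwise. And understanding the state of the art in cryptography is already a very high bar. If you have cleared that bar, you have very likely lost the initial impulse, and you are no longer the target audience of an introductory book.

Second, the warning applies to situations where information security has to be taken seriously. If you only want to keep something from your girlfriend, or hide something from your boss, an algorithm you made up yourself may well be enough.

So for people who cannot clear the bar of modern cryptography but do need to take information security seriously (for example, because their software product or system stores a lot of user information), how can this impulse be kept in check?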

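The first thing to understand is that the security of modern encryption algorithms and protocols does not depend on keeping the algorithm secret. A home-made algorithm does not improve security, and the public algorithms are already very secure. Most people who plan to invent their own encryption algorithm are unclear about this point, and some have heard of it but still harbour illusions.

Second, security is a matter of systems engineering: a tiny mistake anywhere affects the security of the whole system. Existing algorithms and protocols have been hardened over many years and contain a great many subtle details. The security of a home-made algorithm or protocol cannot compare with that of the established public ones. In other words, a home-made algorithm or protocol will only be less secure than the existing public ones.

Finally, the computing power of modern computers is staggering, and breaking a flawed encryption algorithm is very easy. So if you are serious about security and encryption, please do not use a home-made algorithm.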
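
To give a feel for how little effort a flawed scheme can take to break, here is a minimal Python sketch. The "cipher" is a made-up example of a home-made design (XOR every byte with one secret byte), not anything from a real product, and the attacker is assumed to know one word that appears in the plain text. The names `xor_cipher` and `crack` are invented for this illustration.

```python
def xor_cipher(data, key):
    """A made-up 'home-made' cipher: XOR every byte with one secret key byte."""
    return bytes(b ^ key for b in data)

def crack(ciphertext, known_word):
    """Try all 256 possible keys; keep those whose output contains the known word."""
    return [k for k in range(256) if known_word in xor_cipher(ciphertext, k)]

# The defender encrypts with a secret key byte (0x5A is an arbitrary choice).
secret = xor_cipher(b"user=alice;password=hunter2", 0x5A)

# The attacker only sees `secret` and guesses that "password" occurs in it.
print(crack(secret, b"password"))   # [90], i.e. 0x5A
```

The whole key space is 256 values, so the search finishes instantly. Real home-made designs are usually less naive than this, but the failure mode is the same: the designer cannot see the shortcut that an attacker finds.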

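A brief history of cracking the China map coordinate (GCJ-02) offset algorithm

In 2006, Google began working with AutoNavi and using the China maps the latter provided. This was probably the first time a foreign company ran into the problem.

From 2009 on, the coordinate offset in Chinese maps started to become known to the outside world. Garmin users found that GPS units bought in the United States were almost unusable in China, while Garmin products bought in China had no such problem. Users of the Google Maps API found that points of interest could not be placed accurately on the map of China. More interestingly, some users repeatedly reported this to Google as a bug but never received any response. Similarly, Garmin stated that it had no solution and advised customers to buy their GPS devices inside China if needed.

Meanwhile, all sorts of people started trying to crack the offset algorithm. Two lines of work are worth noting:

In January 2010, a user named wuyongzheng found: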

I accidentally found the Chinese version of Google Map ditu.google.com to be able to correlate satellite image with map, and it gives the amount of deviation for any location in China. This URL queries the deviation of 34.29273N,108.94695E (Xi’an): http://ditu.google.com/maps/vp?spn=0.001,0.001&t=h&z=18&vp=$34.29273,108.94695 (seems it’ doesn’t work now)

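With enough data available, wuyongzheng suggested using regression to approximate the offset algorithm: https://wuyongzheng.wordpress.com/2010/01/22/china-map-deviation-as-a-regression-problem/

Earlier attempts had been sporadic and limited to individual cities. wuyongzheng’s suggestion was the first step towards solving the problem comprehensively and systematically.

In May 2013, Maxime Guilbot followed this suggestion and obtained an approximation accurate to 4-5 metres:

https://github.com/maxime/ChinaMapDeviation

In October 2013, wuyongzheng ran the regression himself and obtained the following result:

http://wuyongzheng.github.io/china-map-deviation/paper.html

The regression results of Maxime Guilbot and wuyongzheng were more or less the best that could be achieved by groping in the dark, and so they received wide attention and use.

On the other line of work, in April 2010 the emq project added a file, Converter.java:

http://emq.googlecode.com/svn/emq/src/Algorithm/Coords/Converter.java

This code converts WGS-84 coordinates to GCJ-02 coordinates with very high accuracy.

In February 2013, the code caught the attention of a user named coolypf, who cleaned it up and used it in his own project:

https://on4wp7.codeplex.com/SourceControl/changeset/view/21483#353936

The key part of the code is worth quoting here: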

        const double pi = 3.14159265358979324;

        //
        // Krasovsky 1940
        //
        // a = 6378245.0, 1/f = 298.3
        // b = a * (1 - f)
        // ee = (a^2 - b^2) / a^2;
        const double a = 6378245.0;
        const double ee = 0.00669342162296594323;

        //
        // World Geodetic System ==> Mars Geodetic System
        public static void transform(double wgLat, double wgLon, out double mgLat, out double mgLon)
        {
            if (outOfChina(wgLat, wgLon))
            {
                mgLat = wgLat;
                mgLon = wgLon;
                return;
            }
            double dLat = transformLat(wgLon - 105.0, wgLat - 35.0);
            double dLon = transformLon(wgLon - 105.0, wgLat - 35.0);
            double radLat = wgLat / 180.0 * pi;
            double magic = Math.Sin(radLat);
            magic = 1 - ee * magic * magic;
            double sqrtMagic = Math.Sqrt(magic);
            dLat = (dLat * 180.0) / ((a * (1 - ee)) / (magic * sqrtMagic) * pi);
            dLon = (dLon * 180.0) / (a / sqrtMagic * Math.Cos(radLat) * pi);
            mgLat = wgLat + dLat;
            mgLon = wgLon + dLon;
        }

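In March 2013, coolypf described this code on his own blog:

http://blog.csdn.net/coolypf/article/details/8686588

In September 2014, wuyongzheng noticed coolypf’s project. With that, the two lines of work converged, and the coordinate offset problem was essentially solved.

As the code above shows, compared with WGS-84, GCJ-02 on the one hand uses a different reference ellipsoid (SK-42, Krasovsky; probably a legacy of Soviet influence), and on the other hand introduces a high-frequency non-linear offset.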
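
As a cross-check on the constants and on the last step of `transform`, here is a short Python sketch. It rederives `ee` from the Krasovsky 1940 parameters given in the code comments, and reproduces the two scaling lines that turn the values returned by `transformLat` and `transformLon` into degrees. Reading those values as offsets in metres (north and east) is my interpretation of the code: the two divisors are the meridional and prime-vertical radii of curvature of the ellipsoid. The function name `offset_to_degrees` is made up for this sketch, and the offset functions themselves are not reproduced here.

```python
import math

A = 6378245.0        # Krasovsky 1940 semi-major axis, in metres
F = 1 / 298.3        # flattening
EE = F * (2 - F)     # (a^2 - b^2) / a^2 with b = a * (1 - f)

def offset_to_degrees(lat_deg, north_m, east_m):
    """Convert a local offset in metres into (dlat, dlon) in degrees."""
    rad_lat = math.radians(lat_deg)
    magic = 1 - EE * math.sin(rad_lat) ** 2
    sqrt_magic = math.sqrt(magic)
    meridional_radius = A * (1 - EE) / (magic * sqrt_magic)
    prime_vertical_radius = A / sqrt_magic
    dlat = math.degrees(north_m / meridional_radius)
    dlon = math.degrees(east_m / (prime_vertical_radius * math.cos(rad_lat)))
    return dlat, dlon

print(EE)                                     # compare with ee in the C# code
print(offset_to_degrees(35.0, 100.0, 100.0))  # a 100 m offset near the reference latitude
```

At these latitudes a degree is on the order of 100 km, so an offset of a few hundred metres corresponds to a few thousandths of a degree.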

RSA illustration with not-so-small numbers – part 2

Let’s take a closer look at the encryption. During the communication, what has been exposed is:

Alice’s public key (n=2627, e=13), and the encrypted message.

For anyone who comes to modern cryptography from older, symmetric ciphers, it’s tempting to try to decrypt the encrypted message using the encrypting key, i.e. the public key.

For these people, I have the chart below, which shows the mapping between the plain text and the encrypted data:

[Figure: encryption_mapping – encrypted data plotted against plain-text data]

The x-axis is the plain-text data (sorted from 1 to 2627) and the y-axis is the encrypted data (from 0 to 2626). I did the calculation using this one-line script:

~$ for i in `seq 1 2627`; do echo "$i^13 %2627" | bc; done > /tmp/encryption.mapping

Below is part of this chart, zoomed in:

[Figure: encryption_mapping_part – zoomed-in part of the chart above]

So you know the encrypted data, let’s say 2144, and you know the public key (n=2627, e=13). How do you find the number x such that x^13 % 2627 = 2144?

You cannot, unless you compute x^13 % 2627 for every possible x below 2627 and pick the one that matches. That’s brute force. This is one of the basic assumptions behind the security of RSA: without the private key, there’s no known efficient way to find x. This is called the RSA problem. (It is often confused with the discrete logarithm problem, which asks for an unknown exponent rather than an unknown base.)
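
The brute-force search can be sketched in a few lines of Python (used here instead of bc purely for illustration; the helper name `brute_force_roots` is made up for this sketch). It tries every candidate and keeps those whose 13th power modulo 2627 matches the encrypted data:

```python
def brute_force_roots(c, e, n):
    """Return every x in [0, n) with x**e % n == c, by trying them all."""
    return [x for x in range(n) if pow(x, e, n) == c]

# Toy public key from this post, and the example ciphertext 2144.
print(brute_force_roots(2144, 13, 2627))
```

With n=2627 this takes a fraction of a second. The cost grows in proportion to n, which is why the approach is hopeless once n has hundreds of digits.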

In real-world scenarios, the two prime numbers are so large that brute force is simply impractical.

Then, to decrypt the message, one would need the private key. The private exponent d is the modular inverse of e modulo phi(n). However, in order to get phi(n), one has to know the prime factors of n, and factoring a large number is computationally hard. That is the other assumption behind the security of RSA: there’s no known efficient way to factor a large number.
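
To make the role of factoring concrete, here is a sketch of what an attacker could do against the toy key: factor n by trial division, compute phi(n), and derive the private exponent as the modular inverse of e modulo phi(n). The helper name `factor_semiprime` is made up for this sketch, and `pow(e, -1, phi)` needs Python 3.8 or later.

```python
def factor_semiprime(n):
    """Trial division: return (p, q) with p * q == n, p being the smallest factor > 1."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p, n // p
        p += 1
    raise ValueError("no non-trivial factor found")

n, e = 2627, 13                 # the public key from this post
p, q = factor_semiprime(n)
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)             # modular inverse of e modulo phi(n)
print(p, q, phi, d)             # 37 71 2520 1357
```

This only works because 2627 is tiny. Trial division needs on the order of sqrt(n) steps, which is far out of reach for the key sizes used in practice.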

As you will see in other places, hardness assumptions like these two are the cornerstones of modern public-key cryptography.

RSA illustration with not-so-small numbers

Modern cryptography is difficult to understand without illustrations. One of the reasons is that modern cryptography involves very large numbers that easily exceed the capacity of a standard calculator, let alone human comprehension. There are some illustrations out there using small numbers. The problem is that the numbers are too small to be convincing. So I’d like to try some not-so-small numbers here. Most of the necessary calculations can be done with GNU bc, so you can try them yourself on just about any GNU/Linux distribution.

Let’s say Bob wants to send the number below to Alice (and make sure only Alice can decrypt the message):

520

Here’s what Alice will do first:

  1. Pick two distinct prime numbers. The numbers should be sufficiently large so that brute force is difficult. Here we choose p=37 and q=71.
  2. Calculate n=pq=37*71=2627.
  3. Calculate the totient of n: phi(n)=(p-1)*(q-1)=36*70=2520.
  4. Pick a number e between 1 and phi(n) that is co-prime with phi(n). Here we choose 13.
  5. Find the number d such that e*d mod phi(n) = 1. Here that gives d=1357 (check: 13*1357 = 17641 = 7*2520 + 1). bc has no built-in function for this step. Instead, you can try this online calculator: just put “modinv(13,2520)” in the text field and press “go” to get the result.

Now Alice has a public key (n=2627, e=13) and a private key (n=2627, d=1357). She can simply distribute her public key to everyone, including Bob.
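
If you would rather not depend on an online calculator for step 5, the modular inverse can be computed with the extended Euclidean algorithm. Below is a minimal Python sketch (Python instead of bc, purely for convenience; the function name `modinv` just mirrors the calculator’s):

```python
def modinv(a, m):
    """Modular multiplicative inverse of a modulo m (extended Euclidean algorithm)."""
    old_r, r = a, m
    old_s, s = 1, 0            # invariant: old_s * a is congruent to old_r modulo m
    while r != 0:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_s, s = s, old_s - quotient * s
    if old_r != 1:
        raise ValueError("a and m are not co-prime, so no inverse exists")
    return old_s % m

p, q, e = 37, 71, 13
n = p * q
phi = (p - 1) * (q - 1)
d = modinv(e, phi)
print((n, e), (n, d))          # (2627, 13) (2627, 1357)
```

The `ValueError` branch is also why step 4 insists on e being co-prime with phi(n): otherwise no d exists.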

Now, to send the message 520 to Alice, Bob encrypts it using Alice’s public key:

520^13 % 2627 = 2235

Alice now receives the number 2235 from Bob. To decrypt this message, she does the following calculation (using her private key):

2235^1357 % 2627 = 520
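
The same round trip can be reproduced with Python’s built-in three-argument `pow`, which does modular exponentiation without ever building the huge intermediate number. This is only a sketch of the textbook calculation above, not a usable encryption scheme:

```python
n, e, d = 2627, 13, 1357            # Alice's keys from above

message = 520
ciphertext = pow(message, e, n)     # Bob, using Alice's public key
recovered = pow(ciphertext, d, n)   # Alice, using her private key
print(ciphertext, recovered)        # 2235 520
```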

In fact, Bob can encrypt any non-negative number smaller than n=2627 in this way.

Bob:

1^13 % 2627 = 1

Alice:

1^1357 % 2627 = 1

Bob:

2^13 % 2627 = 311

Alice:

311^1357 % 2627 = 2

Bob:

3^13 % 2627 = 2361

Alice:

2361^1357 % 2627 = 3

Bob:

4^13 % 2627 = 2149

Alice:

2149^1357 % 2627 = 4

Bob:

137^13 % 2627 = 2431

Alice:

2431^1357 % 2627 = 137

If Bob’s message is larger than that, he has to split it into chunks that are smaller than n and encrypt them one by one.
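
One simple way to do the splitting, sketched in Python: treat the message as bytes, so that every chunk is a number below 256 and therefore below n. The chunking scheme and the function names `encrypt_bytes` and `decrypt_chunks` are my own choices for illustration. Encrypting byte by byte like this is easy to break (equal bytes always give equal ciphertexts), so take it as a demonstration of the arithmetic only.

```python
n, e, d = 2627, 13, 1357

def encrypt_bytes(data, e, n):
    """Encrypt each byte (always < 256 < n) as its own chunk."""
    return [pow(b, e, n) for b in data]

def decrypt_chunks(chunks, d, n):
    """Decrypt each chunk and reassemble the original bytes."""
    return bytes(pow(c, d, n) for c in chunks)

chunks = encrypt_bytes(b"520 is the message", e, n)
print(chunks)
print(decrypt_chunks(chunks, d, n))
```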

Note that this only illustrates how Bob can send secret messages to Alice. If Alice wants to send secret messages to Bob, then she has to have Bob do the same first:

  1. Pick two sufficiently large prime numbers;
  2. Compute the product of these two primes – this is part of both keys;
  3. Compute the totient of this product;
  4. Pick a number that is smaller than this totient and co-prime with it – this, combined with the product, is the public key;
  5. Find the modular multiplicative inverse of that number modulo the totient – this, combined with the product, is the private key.

Then Bob sends his public key to Alice and Alice can encrypt the messages using Bob’s public key. Upon receiving the messages, Bob can decrypt the messages using his private key.

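About the coordinate offset in Chinese maps

This question puzzled me for a long time: there is no authoritative material, the information from different sources is a mess, and people use different terms, which leads to many misunderstandings. Here I try to summarize and clear things up.

  • What is the map coordinate offset?

From the official point of view, the map offset is a matter of coordinate systems: the authorities require all maps in China to use the GCJ-02 coordinate system (widely known as “Mars coordinates”), and the algorithm for converting from other coordinate systems to GCJ-02 is kept secret.

However, GCJ-02 is not merely a choice of coordinate system. Overlaying satellite imagery on an official GCJ-02 map shows that the deviation between the two is non-linear (one cannot be made to coincide with the other by translation and scaling alone). There is therefore reason to believe that GCJ-02 maps have been deliberately offset in some way. (A straight line on the satellite image will not be a straight line on the GCJ-02 map.)

  • Are the maps from different domestic vendors consistent?

All maps provided by domestic vendors are GCJ-02 maps, so they can be made to coincide by translation and scaling. Different vendors may use different coordinate systems, but the differences between those systems and GCJ-02 are linear.

  • What about GPS devices?

GPS devices normally return WGS-84 coordinates, so plotting them directly on a GCJ-02 map is inaccurate. There is no evidence that the GPS signal or GPS chips have been modified. GPS devices made in China can return GCJ-02 coordinates, but it is unclear whether this conversion is done in hardware or can be done in software.

  • How can a map be offset without anyone noticing?

Judging from the available material, the offset occurs at large scales. So unless you compare against an external (non-GCJ-02) system, daily life is indeed unaffected. One article fitted a regression to the offset algorithm based on leaked data and published the results.

To sum up, the root cause is that the government controls the qualifications for surveying and publishing maps. GPS output has to be offset accordingly, otherwise it cannot be marked accurately on the map.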

How to dodge “the Great Cannon”

I don’t want to go into details and risk my own blog. Basically, a script that is very common among websites was targeted, and redirection code was injected into it.

Using Adblock, you can simply block this script:

http://platform.twitter.com/widgets.js

And then you won’t get redirected. It’s that simple. 🙂

There might be other scripts I haven’t encountered yet, but you should be able to use the same technique to block them as well.

Stereotyping and its costs

Recently I watched this

And this:

I’ve been watching TED videos for years now, but these were still an eye-opener.

People may say, “Oh come on, these are TED videos, right? They are meant to impress people.” I’m actually not that easily impressed. I’m not talking about the technology or the plasticity of the human brain. I’m talking about the very fact that a disabled person could become an MIT professor and lead a world-class research team, or could be so sharp, so articulate and appear so *normal*.

Despite all the pride of being Chinese, we have to admit, that would not happen in modern China.

If Mr. Hugh Herr had been born in China, he would probably, at best, have dropped out of school very early on and attended a special school or, even worse, simply stayed at home, completely isolated. If Mr. Daniel Kish were in China, he wouldn’t have had the chance to share his personal experience with others. Instead, with his outstanding ability, he would probably have ended up making a living by showing off his special ability in a circus (or in the Beijing subway, if circuses fall out of favor completely).

The reason behind the differences, I believe, lies primarily in everyone’s mind.

I happen to know the concept of “stereotype threat”. For those who don’t know it, according to Wikipedia it is “one of the most widely studied topics in the field of social psychology”, and it concerns the impact of stereotyping on performance. As it turns out, a lot of performance gaps between groups can be explained by stereotype threat. I personally believe that stereotype threat is the key reason behind the performance gap between people with disabilities in China and people with disabilities in the US.

Let’s face it: China is still a country full of biased stereotypes. It’s true that stereotyping is part of human nature and that stereotypes exist in every society. However, China stands out in allowing stereotypes to go unchecked in every corner of everyday life: TV programs, newspapers, magazines, even textbooks for children. As a consequence, people are so used to all sorts of stereotypes that no one even bothers to stand up against them, even though everyone has been a victim of one form of stereotype or another.

I have to admit that I only started to pay attention to this topic after my wife and I had a child. My wife and I are lucky: our daughter is normal in every respect. However, as new and inexperienced parents, at times when our daughter was sick we became scared and couldn’t help but think about all kinds of what-if scenarios.

Out of this kind of reasoning I became a person who is conscious of stereotypes. Bit by bit I recalled how I had struggled against all sorts of stereotypes about myself when I was young. I started to realize how I had stereotyped others and how destructive that could be. Everyone is a victim of this inescapable net of stereotyping.

So, on this special day, I propose one thing we could do to bring positive changes to China, without disturbing the government: reflect on ourselves and stop stereotyping.

To end this article, here’s a Stanford professor on this topic: