Heartbleed vulnerability

Just saw some friends sharing this in Wechat. Seems I will have a lot of work to do – patching my servers, replacing certificates, regenerating keys, etc.:(

It’s really unbelievable. This will be a big blow on the open source community. I already saw people saying “see? Open source is no securer”. Well, they are right.

To me, this has nothing to do with open source or not. M$ may have something even worse but you will need longer time to find out. I just searched the web and found that as a library that has been so widely used, OpenSSL has only one full time developer and receives on average $2,000 donation per year. What do you expect?

But that simply won’t justify such a disaster and it will cast a bad image for open source community in general. I can only hope leaders in this industry see this differently and start to support these great open source projects. They deserve better!

spf记录怎么用

曾经稀里糊涂地用过好几回,但是一直没有认真研究过。论坛上关于使用方法众说纷纭,可能分别适用于不同的场景,但是很少看到完整的介绍。刚刚看到DigitalOcean上这一篇文章说的比较清楚了。

How To use an SPF Record to Prevent Spoofing & Improve E-mail Reliability

作为一个text记录,spf字段的结构如下:

<spf preamble> <address list>

v=spf1 spf记录的起始标志。目前只有1号版本的spf,所以只有这一种写法。

后面是一个地址或者名称的列表,列表项用空格符隔开。

在每个地址列表项之前,可以有如下四种修饰符:

  1. + 表示该地址被明确允许发送来自该域名的邮件。该修饰符可省略。
  2. – 表示该地址被明确禁止发送来自该域名的邮件。
  3. ? 表示对该地址目前暂时没有策略。
  4. ~ 表示对该地址应该使用柔性策略,接收方应该接受,但是可以进行特殊标记。

地址列表项本身有如下几种格式:

  1. all 匹配任何地址;
  2. a 匹配该域中任何A记录地址;
  3. ip4: 匹配紧随其后的IPv4地址或者地址段;
  4. ip6: 匹配紧随其后的IPv6地址或者地址段;
  5. mx 匹配该域中的任何mx记录地址;

因此,假设一个域domainabc.com有spf记录如下:

v=spf1 ip4:123.23.4.5 -all

这个记录的意思是,除了123.23.4.5这个地址之外,其他任何地址发送来自domainabc.com的邮件,都应该被拒绝。

考虑到高可用性,一个域通常至少有两台smtp服务器在工作,因此这条记录更可能是这样的:

v=spf1 ip4:123.23.4.5 ip4:123.23.4.5 -all

如果该域之前已经有MX记录,指向就是123.23.4.5和123.23.4.6,那么该记录可以简化为:

v=spf1 mx -all

如果有一天,domainabc.com需要额外允许一个网段发送邮件,可以将记录改为:

v=spf1 mx ip4:107.8.9.0/24 -all

假设domainabc.com需要迁移自己的邮件服务器,mx记录已经切换,但是原本的邮件服务器需要逐步停止使用,那么可以首先这么修改:

v=spf1 mx ip4:107.8.9.0/24 ?123.23.4.5 -all

其中123.23.4.5是准备退休的邮件服务器。经过一段时间之后,可以再把记录改为:

v=spf1 mx ip4:107.8.9.0/24 ~123.23.4.5 -all

最后,确认所有内部应用都开始使用新的服务器了,再把原来的服务器地址防到禁止列表里去。

v=spf1 mx ip4:107.8.9.0/24 -all

注意最后的“-all”匹配任意地址。就是这样。

Linear regression in Excel – continued

Doing linear regression in Excel is simple. Now I’d like to show you how exactly this is done.

Let’s use the same data set:

x 9.42 2.78 5.87 5.49 6.48 4.91 7.82 2.70 2.65 8.42 5.85 0.93
y 52.24 4.15 13.72 7.00 5.28 6.98 42.00 7.15 3.41 31.00 12.59 1.09

Let’s say we want do a order-2 linear regression. That means, we want a function:

f(x)=c*x^2 + b*x+a

a, b and c should be chosen in a way such that if we put x into f(x), the difference/error between f(x) and y, will be minimized, in the least square sense.

So we have the difference R as a function of a,b,c:

R(a,b,c)=(c* 9.42^2 + b* 9.42 + a – 52.24)^2+(c*2.78^2+b*2.78+a-4.15)^2+…+(c*0.93^2+b*0.93+a)

R is smooth every where in space. So R is minimal where:

D(R)/D(a)=0; D(R)/D(b)=0; D(R)/D(c)=0

Now we’ll have to employ the compact form to avoid tedious typing:

equ1

In order for R to be minimal, we need:

equ2

Insert R into the equations and expand them, we get:

equ3

That is:

equ4

Or in matrix form:

equ5

Now this is a very simple equation, so we can simply solve it using Cramer’s rule:

equ6

It looks scary but actually not too complex. We need the sum of x, x^2, x^3, x^4, x*y and x^y. We can actually verify this in excel:

x y x^2 x^3 x^4 x*y x^2*y
9.42 52.24 88.69578 835.323 7866.942 491.974 4633.334
2.78 4.15 7.721688 21.45697 59.62447 11.53932 32.06537
5.87 13.72 34.43254 202.0476 1185.6 80.52797 472.5321
5.49 7.00 30.18894 165.8715 911.3719 38.47401 211.3934
6.48 5.28 41.99053 272.099 1763.204 34.19838 221.6059
4.91 6.98 24.08858 118.227 580.2595 34.23907 168.0457
7.82 42.00 61.22442 479.0568 3748.43 328.6334 2571.426
2.70 7.15 7.269068 19.59829 52.83935 19.27266 51.96143
2.65 3.41 7.003793 18.53532 49.05312 9.028243 23.89296
8.42 31.00 70.90802 597.0945 5027.948 261.0414 2198.149
5.85 12.59 34.27254 200.6409 1174.607 73.71933 431.5732
0.93 1.09 0.869064 0.810173 0.755272 1.014426 0.945685
63.32 186.61 408.665 2930.761 22420.63 1383.662 11016.92

So we have:

calculation_0

calculation_1calculation_2

calculation_3

So in the end we get:

cal_result

This is exactly what Excel displayed on the chart:

trend_line_res

I actually put all this in an Excel file so you can try it yourself. Please note that in the excel file we have a data set of 12 points. If your data set has a different size, you may have to adapt the formulas a little bit. 🙂

Linear regression in Excel

Excel has a very handy feature that does linear regression for you very intuitively.

Let’s say you have a data set like this:

x 9.42 2.78 5.87 5.49 6.48 4.91 7.82 2.70 2.65 8.42 5.85 0.93
y 52.24 4.15 13.72 7.00 5.28 6.98 42.00 7.15 3.41 31.00 12.59 1.09

First, you’ll have to insert a scatter chart:

insert_scatter_chart

You’ll get a chart like this (Excel is smart enough to figure out what should be the x-axis and what should be the y-axis):

scatter_chart_ex1

Now, right click on the dots and then you can select “Add Trendline…” from the pop-up menu.

add_trendline

Then you’ll be presented with a dialog like this:

trend_line_options

Note that Excel has different options for “Linear” and “Polynomial”. Here the option is about the relationship between the variables. All these different options are all Linear Regressions.

Let’s try a order 2 polynomial regression and check the option “Display Equation on chart” and “Display R-squared value on chart”. This is what you’ll get:

order_2_regression

Good thing about doing this in Excel is, you can always change the data set and your chart and equation will be updated automatically.

Depending on the nature of your data set, you may want to try different model to get a better fit, the procedure will be the same.

Windows mind set vs. Linux mind set

I wanted to talk about this for a long time, however, the key points were not clear until recently I ran into the pipeline concept in PowerShell.

When you do pipeline in PowerShell, you’re passing not a text stream from the left command to the right command, but rather, you’re passing a .net object.

When I read this from TechNet, I smiled to myself, yes, this is typical Microsoft!

Windows PowerShell provides a new architecture that is based on objects, rather than text. 
The cmdlet that receives an object can act directly on its properties and methods without 
any conversion or manipulation. Users can refer to properties and methods of the object by
 name, rather than calculating the position of the data in the output.

And I said to myself, now I know what I wanted to say.

One key difference between Windows and Linux is, Windows always tries to be smarter, where GNU Linux tries to stay plain and humble. Pipeline in scripting is just one recent example.

Linux has been using text configuration files. Windows came along and said, we need something better. That’s how Windows Registry came about.

Linux has been using pipeline for IPC. Windows came along and said, we need something better. That’s how COM came about.

Linux has been using lock files to prevent processes to start another instance. Windows came along and said, we need something better. So they use Kernel Object instead.

Linux has been using permissions as a basic security measure. Windows came along and said, we need something better. So they do everything using ACLs.

Linux has been using symbolic links. Windows came along and said, we need something better and introduced short cuts.

To be fair, not all of them are bad ideas.

Despite its complexity and awkward configuration, COM gained such popularity that it became the basic foundation of modern Windows. Open registry in any Windows that has been used for a while and chances are the biggest tree is HKLM\Classes\CLSIDS. (One of the key reasons why you Windows becomes slower as you install more and more software).

Kernel Object is much more reliable than lock files. ACLs indeed provide much flexibility in terms of access control. Linux is also doing it now.

However, we all know that Windows have record of being smart in cheap ways and then fail pathetically. (SilverLight is the one that came into my mind as I write this) To me, this idea of passing objects through pipeline looks like just another one that will fail.

Having said that, I have to admit that, this difference between Windows and Linux is also not surprising.  Linux is developed by community led by technical experts. Introducing new features always involves extensive discussions between these experts.That’s why Linux has a bad reputation of not listening to its users.

Where as for Windows, most likely, new features are proposed by requirement collection team and developer team, then the list of new features have to go through rounds of prioritization processes. If there are disputes, then there will be escalations and some manager will decide. Once decided, then features still in the list will be implemented. Period.

Now, it is actually surprising that Microsoft actually made some key decisions right, right? 🙂

Monty Hall Problem simulation in Excel

Monty Hall Problem is definitively the most confusing puzzle among educated people. After explaining this to different colleagues and friends (With a success rate below 50%), I found it’s easier simply presenting the fact. So here’s a simulation of it in excel.

monty_hall_excel

Let’s me explain a little bit about this excel:

Each row in the table is a test case, or an experiment.

In each such a experiment, a random number is generated to determine where do we put the car. A goat will be placed behind each of the rest doors. So in the table, we have “C” denotes a Car and “G” denotes a goat.

If the number is between 0-1/3, then the car will be placed behind Door1. If the number is between 1/3-2/3, then the car will be placed behind Door2. If the number is between 2/3-1, then the car will be placed behind Door3.

Then in the 6th column of the table, another random number is generated to determine the 1st choice.

Again, if the number is between 0-1/3, that means the guest chose Door1 in the first attempt. If the number is between 1/3-2/3, then the guest chose Door2 in his/her first attempt. If the number is between 2/3-1, then he/she chose Door3 in his/her first attempt.

Note that both of the 2 random numbers have a unify distribution between 0 and 1. And the 2 numbers are completely independent to each other.

Also in the table, the guest’s first choice is high-lighted in green.

The next column, the 7th column, shows what the guest will get if he sticks to his/her first attempt. And the 8th column shows whether he/she wins a car or not. (1 wins, 0 loss).

The 9th column shows what the guest will get if he/she switches to the other door when he/she is given another chance. And the next column shows whether he/she will win or not if he switches.

In the end of the table, after 400+ experiments, the outcome of the 2 strategy was summed up. Out of 432 experiments, you’ll get a car 142 times, if you stick to your original choice, you’ll get a car 290 times, if you switch.

The good thing about this excel is, you can simply press F9 and ask Excel to recalculated and see with your own eyes of the outcome.

What slows down your Windows?

Windows has long been complained of performing slower and slower as being used. This essay tries to explain how and why.

Before we go into details, let’s make it clear that we’re talking about the architecture design of Windows that makes it not performing in certain situation. So this is consistently measurable performance difference. We’re not talking about bad performance because of wrong configuration. Nor are we going to talk about performance of a specific program.  A specific instance of slow down, for example, your notepad will performance slower when you have a lot of other programs running, is not what we’re going to talk about.

First of all, there’s no reason Windows should perform worse (or better) if you just spend more time on it. If the installation keeps the same, the size of your computer keeps the same, Windows should perform the same.

However, if one of your running programs or a device driver has memory leak, then it will eat up more and more memory as time pass by. That will slow down your Windows. The unique thing about this type of performance issue is, after a fresh restart, Windows should perform well. In the early days, memory leak was a common issue on Windows. That attributed much to the common belief that you should restart your Windows once in a while to keep better performance. Now, most commonly used program is mature enough to be free of memory leak, Windows should perform just the same as time goes by.

However, as time goes by, you’ll probably keep installing new software on your Windows. This could indeed slow down your computer. The reason is, by installing any non-trivial software, you’re not only copying files to the disk, but also registering COM components to Windows. These registry key/values will be keep in memory. So the more software you install, the less memory your program will be able to get.

As an example, after you install a program that is able to open a new file format, chances are:

  • You’ll not be able to see a thumb nail that shows the content of the file in Explorer;
  • You’ll see this program listed in the pop-up menu of this file type;

They were made possible using COM technology that depends heavily on Windows Registry.

The problem is, this registry keys/values won’t get cleaned up when you uninstall the software. So Windows registry keeps growing and growing, your programs have lesser and lesser memory to use.

The other factor that would for sure slow down your computer is disk fragmentation. Disk fragmentation affect performance not only when your program does disk IO. If your page file is fragmented, paging operation will be slow. That will cause noticeable sluggish. Also remember that all executable files and dll files become memory map files, so if they are fragmented, your program will be slow not only during start up, but also in running phase.

Bloated registry and disk fragmentation are the two reason that your Windows slows down in the long run. In some of my Windows computers, I actually create separate partitions for page file, for outlook pst file and Windows tmp file. These techniques worked quite well.

泊松分布在容量管理中的应用

其实是泊松分布的经典应用:

已知某呼叫中心每小时会进ci个电话,每个电话平均时长为tp,则在任意给定时刻,并发电话的为x的概率服从泊松分布。此处假设不同的电话是完全独立事件。

并发电话为x的概率,等同于在tp时长内,进入x路电话的概率。在tp时长内,平均会进的电话数为tp*ci。因此,并发电话小于或等于x的概率为(excel公式):

POISSON(x,tp*ci,TRUE)

其中最后一个参数TRUE,表明这里计算的是并发电话小于或者等于x的概率;如果用FALSE,则只计算等于x的概率。

此公式可用于呼叫中心容量规划,比如:如果已知呼叫量和平均通话时长,则要确保容量在99%的情况下够用,则只需求x,使得POISSON(x,tp*ci,TRUE)>0.99。

在某呼叫中心,把历史数据应用了一下,符合度很高:

call_vol_verification

其中横轴是时间,黄色为实际并发量,蓝色为按照0.999计算出的并发量上限。

如果把横轴改成按照实际并发量排序,则得到:

call_vol_verification_2

可以看到实际并发量几乎总是在预计容量之下。几个实际并发量高于预计容量的情况,后来证实都是系统出了故障,导致客户竞相反复拨打,意味着不仅原来的参数不再适用,独立同分布的假设也不再适用。