python里去除极值的三种方法 (rolling_median)
2018-01-23 15:14
1581 查看
3 ways to remove outliers from your data
https://ocefpaf.github.io/python4oceanographers/blog/2015/03/16/outlier_detection/According to Google Analytics, my post "Dealing
with spiky data", is by far the most visited on the blog. I think that the reasons are: it is one of the oldest posts, and it is a real problem that people have to deal everyday.
Recently I found an amazing series of post writing by Bugra on how to perform outlier detection using FFT,
median filtering, Gaussian
processes, and MCMC
I will test out the low hanging fruit (FFT and median filtering) using the same data from my original
post.
In [2]:
from datetime import datetime from pandas import read_table fname = './data/spikey_v.dat' cols = ['j', 'u', 'v', 'temp', 'sal', 'y', 'mn', 'd', 'h', 'mi'] df = read_table(fname , delim_whitespace=True, names=cols) df.index = [datetime(*x) for x in zip(df['y'], df['mn'], df['d'], df['h'], df['mi'])] df = df.drop(['y', 'mn', 'd', 'h', 'mi'], axis=1) df.head()
Out[2]:
j | u | v | temp | sal | |
---|---|---|---|---|---|
1994-11-22 04:00:00 | 1056.1667 | -0.1 | 0.0 | 23.9 | 34.6 |
1994-11-22 05:00:00 | 1056.2083 | -0.2 | 0.7 | 23.9 | 34.6 |
1994-11-22 06:00:00 | 1056.2500 | -0.1 | 2.0 | 23.9 | 34.6 |
1994-11-22 07:00:00 | 1056.2917 | 0.0 | 3.1 | 23.9 | 34.6 |
1994-11-22 08:00:00 | 1056.3333 | -0.1 | 2.7 | 23.9 | 34.6 |
get_median_filtered()we
have:
In [3]:
import numpy as np def get_median_filtered(signal, threshold=3): signal = signal.copy() difference = np.abs(signal - np.median(signal)) median_difference = np.median(difference) if median_difference == 0: s = 0 else: s = difference / float(median_difference) mask = s > threshold signal[mask] = np.median(signal) return signal
In [4]:
import matplotlib.pyplot as plt figsize = (7, 2.75) kw = dict(marker='o', linestyle='none', color='r', alpha=0.3) df['u_medf'] = get_median_filtered(df['u'].values, threshold=3) outlier_idx = np.where(df['u_medf'].values != df['u'].values)[0] fig, ax = plt.subplots(figsize=figsize) df['u'].plot() df['u'][outlier_idx].plot(**kw) _ = ax.set_ylim(-50, 50)
Not bad. Still, it missed the two lower spikes. Let's go for the
detect_outlier_position_by_fft().
In [5]:
def detect_outlier_position_by_fft(signal, threshold_freq=0.1, frequency_amplitude=.001): signal = signal.copy() fft_of_signal = np.fft.fft(signal) outlier = np.max(signal) if abs(np.max(signal)) > abs(np.min(signal)) else np.min(signal) if np.any(np.abs(fft_of_signal[threshold_freq:]) > frequency_amplitude): index_of_outlier = np.where(signal == outlier) return index_of_outlier[0] else: return None
In [6]:
outlier_idx = [] y = df['u'].values opt = dict(threshold_freq=0.01, frequency_amplitude=0.001) win = 20 for k in range(win*2, y.size, win): idx = detect_outlier_position_by_fft(y[k-win:k+win], **opt) if idx is not None: outlier_idx.append(k + idx[0] - win) outlier_idx = list(set(outlier_idx)) fig, ax = plt.subplots(figsize=(7, 2.75)) df['u'].plot() df['u'][outlier_idx].plot(**kw) _ = ax.set_ylim(-50, 50)
Not sure if this method is the best here... Maybe if the signal was contaminated by high frequency noise this method would perform better.
Inspired by Bugra's median filter let's try a
rolling_medianfilter
using pandas.
In [7]:
from pandas import rolling_median threshold = 3 df['pandas'] = rolling_median(df['u'], window=3, center=True).fillna(method='bfill').fillna(method='ffill') difference = np.abs(df['u'] - df['pandas']) outlier_idx = difference > threshold fig, ax = plt.subplots(figsize=figsize) df['u'].plot() df['u'][outlier_idx].plot(**kw) _ = ax.set_ylim(-50, 50)
Update: A friend, that knows this data, challenged me to use the same
technique on $v$. Here it is:
In [8]:
df['pandas'] = rolling_median(df['v'], window=3, center=True).fillna(method='bfill').fillna(method='ffill') difference = np.abs(df['v'] - df['pandas']) outlier_2 = difference > 2 outlier_3 = difference > 3 fig, ax = plt.subplots(figsize=figsize) df['v'].plot() df['v'][outlier_2].plot(**kw) kw.update(color='g', marker='*', alpha=1) df['v'][outlier_3].plot(**kw) _ = ax.set_ylim(-50, 60)
Note that I had to reduce the threshold from 3 -> 2 to get them all.
相关文章推荐
- python爬虫插入MySQL数据库前去除重复数据的几种方法
- python实现字符串连接的三种方法及其效率、适用场景详解
- python 字典访问的三种方法
- Python操作MySQL数据库的三种方法
- Python_XML的三种解析方法
- JavaScript去除空格的三种方法
- Python 打印中文字符的三种方法
- python实现去除下载电影和电视剧文件名中的多余字符的方法
- 三种方法去除打开文件提示安全警告(转)
- python中try except处理程序异常的三种常用方法
- python统计方法median、var、mean、diff
- 基于VSCode环境的三种使用Python运行其他程序方法
- Python 求解斐波那切(三种方法)
- Python——调用shell命令的三种方法
- JavaScript去除空格的三种方法 (trim)
- python去除文件中空格、Tab及回车的方法
- Python 判断文件是否存在的三种方法
- python有三种导入模块的方法
- python 调用shell命令三种方法
- python list 增加元素的三种方法