Aerial Cactus Identification

Competition page: https://www.kaggle.com/c/aerial-cactus-identification

Description

To assess the impact of climate change on Earth’s flora and fauna, it is vital to quantify how human activities such as logging, mining, and agriculture are impacting our protected natural areas. Researchers in Mexico have created the VIGIA project, which aims to build a system for autonomous surveillance of protected areas. A first step in such an effort is the ability to recognize the vegetation inside the protected areas. In this competition, you are tasked with creation of an algorithm that can identify a specific type of cactus in aerial imagery.

Data preparation

Layout of the dataset on disk:

As you know, data should be processed into appropriately pre-processed floating point tensors before being fed to our network. The steps for getting it into the network are roughly:

  • Read the picture files
  • Decode JPEG content to RGB pixels
  • Convert these into floating point tensors
  • Rescale pixel values from the [0, 255] range to the [0, 1] interval
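The last two steps can be sketched with NumPy alone. This is a minimal illustration, assuming the JPEG has already been decoded into a uint8 array of shape (height, width, 3). The random array below is a stand-in for a real picture, not data from the competition.

```python
import numpy as np

# Stand-in for a decoded 32x32 RGB image (assumption: real code would read a JPEG).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Convert to floating point, then rescale from [0, 255] to [0, 1].
tensor = pixels.astype("float32") / 255.0

print(tensor.dtype, tensor.shape)
print(tensor.min() >= 0.0, tensor.max() <= 1.0)
```

Dividing by 255 is the same scaling that a rescale factor of 1./255 applies.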

We will use the ImageDataGenerator class available in Keras to do all of this preprocessing.

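from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values from [0, 255] to [0, 1]
datagen = ImageDataGenerator(rescale=1./255)
batch_size = 150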

flow_from_dataframe Method

This method is useful when all the images sit in a single folder, that is, images from different classes/labels are not separated into subfolders. With this kind of data, text files containing the class and other information are usually provided. In that case we use pandas to build a dataframe from those text files, with one column holding the file names (only the file names, not the paths) and other columns holding the classes used by the model. The arguments for this method are:

dataframe : A pandas DataFrame describing the images (a file name column and a class column are required). It contains, in a string column, the file paths of the images relative to directory (or absolute paths if directory is None). Depending on class_mode it should include further columns:

  • if class_mode is "categorical" (default value), it must include the y_col column with the class/es of each image. Values in the column can be string/list/tuple if a single class, or list/tuple if multiple classes.
  • if class_mode is "binary" or "sparse", it must include the given y_col column with class values as strings.
  • if class_mode is "other", it should contain the columns specified in y_col.
  • if class_mode is "input" or None, no extra column is needed.

directory : The path to the parent directory containing all images.

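DataFrame in pandas

A DataFrame is a two-dimensional labeled data structure whose columns can hold different types.

It has three main components: the data, the index, and the columns.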
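A small sketch of the three components (data, index, columns). The table is hypothetical: the file names and row labels are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Hypothetical two-row table; the ids are made up for illustration.
df = pd.DataFrame(
    {"id": ["a.jpg", "b.jpg"], "has_cactus": [1, 0]},
    index=["row0", "row1"],
)

print(df.values)    # the data, as a 2-D NumPy array
print(df.index)     # the row labels
print(df.columns)   # the column labels
print(df.dtypes)    # each column can have its own type
```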

Contents of the dataset's train.csv file:

import pandas as pd

# The leading part of the path is elided
X = pd.read_csv('.../train.csv')
print(X)

Output:

The header row (the column names) is not counted as part of the data matrix.

# Getting a basic idea of the data
train.head(5)
# Cast the labels to strings, as required for class_mode='binary'
train['has_cactus'] = train['has_cactus'].astype(str)
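def extract_features(sample_count, directory, df):
    # Pre-allocate arrays for the extracted features and their labels
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count))
    # Stream batches of the images listed in df from the given directory
    data = datagen.flow_from_dataframe(
        dataframe=df,
        directory=directory,
        x_col='id',
        y_col='has_cactus',
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='binary')
    ...  # the remainder of the function is not shown in these notes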

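In the code above, class_mode='binary' requires the has_cactus column passed as y_col to hold strings, which is why the column is cast with astype(str) beforehand. If that column is not in string format, class_mode should not be 'binary'.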

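Introduction to pathlib: a better way to handle paths than os.path

This part is based on the Zhihu article "pathlib介绍-比os.path更好的路径处理方式" (Introduction to pathlib: a better way to handle paths than os.path).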

# Wrong way: building the path by manual string concatenation
data_folder = "source_data/text_files/"
file_to_open = data_folder + "raw_data.txt"
f = open(file_to_open)
print(f.read())

Written this way, the code will not run smoothly on every operating system, and other programmers will look at you with suspicion.

# Old solution: the os.path module
import os.path
data_folder = os.path.join("source_data", "text_files")
file_to_open = os.path.join(data_folder, "raw_data.txt")
f = open(file_to_open)
print(f.read())

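This code runs on every platform, but calling os.path.join over and over is verbose. os.path has many features, but it is cumbersome, so although everyone knows about it, few people bother to use it.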

# Better solution: the pathlib module
from pathlib import Path
data_folder = Path("source_data/text_files/")
file_to_open = data_folder / "raw_data.txt"
f = open(file_to_open)
print(f.read())
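Beyond the / operator, Path objects also expose the parts of a path and can read and write files directly. The sketch below runs in a temporary directory so that it works anywhere. The folder and file names simply reuse the ones from the snippets above, and the file content is made up.

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as tmp:
    data_folder = Path(tmp) / "source_data" / "text_files"
    data_folder.mkdir(parents=True)

    file_to_open = data_folder / "raw_data.txt"
    file_to_open.write_text("hello", encoding="utf-8")

    # Path objects expose the pieces of the path as attributes.
    name = file_to_open.name           # file name with extension
    suffix = file_to_open.suffix       # extension only
    parent = file_to_open.parent.name  # name of the containing folder

    # read_text() opens, reads, and closes the file in one call.
    content = file_to_open.read_text(encoding="utf-8")

print(name, suffix, parent, content)
```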

Dogs vs. Cats

Competition page: https://www.kaggle.com/c/dogs-vs-cats

Analysis Training Data

import os
import pandas as pd

# The part of each file name before the first '.' is the label
filenames = os.listdir("../input/train/train")
categories = []
for filename in filenames:
    category = filename.split('.')[0]
    if category == 'dog':
        categories.append(1)
    else:
        categories.append(0)

df = pd.DataFrame({
    'filename': filenames,
    'category': categories
})

Creating a DataFrame:

  • Read from a file: pd.read_csv('...')
  • Build from a dict: pd.DataFrame({ })

Prepare Training data

Because we will use the image generator with class_mode="categorical", we need to convert the category column to strings. The image generator will then convert it to a one-hot encoding, which suits our classification task.

So we will convert 1 to dog and 0 to cat.

df["category"] = df["category"].replace({0: 'cat', 1: 'dog'})
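To make the effect of class_mode="categorical" concrete, here is a pure-Python sketch of one-hot encoding string labels. It is an illustration, not the Keras implementation. The assumption that class indices follow the sorted order of the label names should be checked against class_indices on the real generator.

```python
labels = ["dog", "cat", "cat", "dog"]

# Assumption: classes are indexed in sorted order of their names.
class_indices = {name: i for i, name in enumerate(sorted(set(labels)))}

def one_hot(label):
    """Return a list with 1.0 at the label's index and 0.0 elsewhere."""
    vec = [0.0] * len(class_indices)
    vec[class_indices[label]] = 1.0
    return vec

encoded = [one_hot(label) for label in labels]
print(class_indices)
print(encoded)
```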

Train_test_split

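from sklearn.model_selection import train_test_split
...
# Hold out 20% of the rows for validation; fix the seed for reproducibility
train_df, validate_df = train_test_split(df, test_size=0.20, random_state=42)
# Renumber the rows from 0 after the shuffle
train_df = train_df.reset_index(drop=True)
validate_df = validate_df.reset_index(drop=True)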

Softmax versus a single output unit

model.add(Dense(2, activation='softmax')) #2 because we have cat and dog classes

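In the book Deep Learning with Python the output layer is instead set to 1 unit, which returns a single probability value. Each split contains the same number of samples from both classes, so this is a balanced binary classification problem and classification accuracy is a suitable measure of success.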
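The two designs are closely related: for two classes, the softmax probability of one class equals a sigmoid applied to the difference of the two logits. A small numeric check, using made-up logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

z_cat, z_dog = 0.3, 1.7  # made-up logits for illustration

p_dog_softmax = softmax([z_cat, z_dog])[1]   # Dense(2, softmax) view
p_dog_sigmoid = sigmoid(z_dog - z_cat)       # Dense(1, sigmoid) view

print(p_dog_softmax, p_dog_sigmoid)
```

So a single sigmoid unit can express the same probabilities as a two-unit softmax. The choice mainly affects the label format and the loss function.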

Because the last layer is a single sigmoid unit, binary_crossentropy is used as the loss.

Because binary_crossentropy is used, the labels must be binary.
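For reference, binary cross-entropy for a label y in {0, 1} and a predicted probability p is -(y*log(p) + (1-y)*log(1-p)), averaged over the samples. The sketch below is a plain-Python illustration with made-up predictions, not the Keras implementation. The clipping constant eps is an assumption added to avoid log(0).

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1]        # binary labels (1 = dog, 0 = cat)
good = [0.9, 0.1, 0.8]    # mostly correct, confident predictions
bad = [0.2, 0.7, 0.4]     # mostly wrong predictions

print(binary_crossentropy(y_true, good))
print(binary_crossentropy(y_true, bad))
```

The loss is lower for the better predictions, and the formula only makes sense when every label is 0 or 1.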

Binary labels produced by the generator:

For flow_from_dataframe, the label column should be of string type:

Prepare data

Digit Recognizer

Competition page: https://www.kaggle.com/c/digit-recognizer/overview

Description

MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We’ve curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.

Data Analysis

Operate on the data directly; there is no need to worry about the first row and first column.

# Normalize the data
x_train = x_train / 255.0
test_df = test_df / 255.0

After reshaping, the data is no longer a DataFrame.

Also, the index must be 0; it cannot be ':'.
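A minimal sketch of the normalize-then-reshape step, assuming each row holds the 784 pixel values of one 28x28 grayscale image. The small DataFrame is synthetic and the target shape (samples, height, width, channels) is an assumption about what the network expects.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the pixel columns: 5 images, 784 pixels each.
x_train = pd.DataFrame(np.full((5, 784), 255, dtype=np.int64))

# Normalize; the result is still a DataFrame.
x_train = x_train / 255.0
print(type(x_train).__name__)

# Reshape via the underlying array: (samples, height, width, channels).
x_train = x_train.values.reshape(-1, 28, 28, 1)
print(type(x_train).__name__, x_train.shape)
```

The reshape goes through .values, so the result is a NumPy array rather than a DataFrame.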

  • target_size: tuple of integers (height, width), default (256, 256). Every image found is resized to these dimensions, so it should be set explicitly.

FLOW

For this method, the generator works on NumPy arrays. See: Keras Chinese documentation.

flow(x, y=None, batch_size=32, shuffle=True, sample_weight=None, seed=None, save_to_dir=None, save_prefix='', save_format='png', subset=None)

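Takes data and label arrays and generates batches of augmented data.

Arguments

  • x: Input data. A NumPy array of rank 4, or a tuple. If a tuple, the first element should contain the images and the second element another NumPy array or a list of NumPy arrays that are passed to the output without any modification. This can be used to feed the model miscellaneous data along with the images. For grayscale data, the channel axis of the image array should have value 1; for RGB data, 3; for RGBA data, 4.
  • y: Labels.
  • batch_size: Integer (default: 32).
  • shuffle: Boolean (default: True).
  • sample_weight: Sample weights.
  • seed: Integer (default: None).
  • save_to_dir: None or string (default: None). Lets you optionally specify a directory in which to save the augmented pictures being generated (useful for visualizing what you are doing).
  • save_prefix: String (default: ''). Prefix for the file names of saved pictures (only relevant if save_to_dir is set).
  • save_format: One of "png" or "jpeg" (only relevant if save_to_dir is set). Default: "png".
  • subset: Subset of data ("training" or "validation") if validation_split is set in ImageDataGenerator.

Returns

An Iterator yielding tuples of (x, y), where x is a NumPy array of image data (in the case of a single image input) or a list of NumPy arrays (in the case of additional inputs), and y is a NumPy array of the corresponding labels. If sample_weight is not None, the yielded tuples are of the form (x, y, sample_weight). If y is None, only the NumPy array x is returned.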
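The batching behaviour described above can be imitated in a few lines of NumPy. This sketch is not the Keras implementation: simple_flow is a hypothetical helper that only shuffles and batches, applies no augmentation, and stops after one pass over the data.

```python
import numpy as np

def simple_flow(x, y=None, batch_size=32, shuffle=True, seed=None):
    """Yield (x_batch, y_batch) tuples, or x_batch alone when y is None."""
    order = np.arange(len(x))
    if shuffle:
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        yield x[idx] if y is None else (x[idx], y[idx])

# Ten fake grayscale images: a rank-4 array with a channel axis of size 1.
x = np.zeros((10, 28, 28, 1), dtype="float32")
y = np.arange(10)

batches = list(simple_flow(x, y, batch_size=4, seed=0))
print([xb.shape for xb, yb in batches])
```

Each yielded x batch keeps the rank-4 layout (batch, height, width, channels), and the last batch is smaller when the sample count is not a multiple of batch_size.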

NOAA Fisheries Steller Sea Lion Population Count

Competition page: https://www.kaggle.com/c/noaa-fisheries-steller-sea-lion-population-count

Description

Steller sea lions in the western Aleutian Islands have declined 94 percent in the last 30 years. The endangered western population, found in the North Pacific, are the focus of conservation efforts which require annual population counts. Specially trained scientists at NOAA Fisheries Alaska Fisheries Science Center conduct these surveys using airplanes and unoccupied aircraft systems to collect aerial images. Having accurate population estimates enables us to better understand factors that may be contributing to lack of recovery of Stellers in this area.

Currently, it takes biologists up to four months to count sea lions from the thousands of images NOAA Fisheries collects each year. Once individual counts are conducted, the tallies must be reconciled to confirm their reliability. The results of these counts are time-sensitive.

In this competition, Kagglers are invited to develop algorithms which accurately count the number of sea lions in aerial photographs. Automating the annual population count will free up critical resources allowing NOAA Fisheries to focus on ensuring we hear the sea lion’s roar for many years to come. Plus, advancements in computer vision applied to aerial population counts may also greatly benefit other endangered species.

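Gaussian filtering: GaussianBlur()

Smoothing/blurring is one of the simplest and most common operations in image processing. One reason to use it is to reduce noise during image preprocessing. Smoothing tends to blur the edges and contours in the image.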
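The idea behind Gaussian smoothing can be shown in one dimension with plain Python. This is a simplified sketch, not what cv2.GaussianBlur does internally: the real function works on 2-D images and has its own border handling, while the helpers below (gaussian_kernel, blur_1d) are hypothetical and repeat the edge values at the borders.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a normalized 1-D Gaussian kernel of odd length `size`."""
    half = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(signal, kernel):
    """Convolve `signal` with `kernel`, repeating edge values at the borders."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - half, 0), len(signal) - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

kernel = gaussian_kernel(5, sigma=1.0)
noisy = [0, 0, 10, 0, 0, 0, 0]   # a single noise spike
smoothed = blur_1d(noisy, kernel)
print([round(v, 2) for v in smoothed])
```

The spike is spread over its neighbours and its peak is lowered, which is the noise-reducing effect described above. The same averaging is what softens edges and contours.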

Show image

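# 1. Using OpenCV
import cv2
import matplotlib.pyplot as plt
im = cv2.imread("../input/train/train/01e30c0ba6e91343a12d216fcafc0dd.jpg")
# OpenCV loads images as BGR; convert to RGB so the colours display correctly
plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB))
...
# 2. Using Keras
from keras.preprocessing.image import load_img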