Kaggle Beginner Competitions - Notes
Aerial Cactus Identification
To assess the impact of climate change on Earth’s flora and fauna, it is vital to quantify how human activities such as logging, mining, and agriculture are impacting our protected natural areas. Researchers in Mexico have created the VIGIA project, which aims to build a system for autonomous surveillance of protected areas. A first step in such an effort is the ability to recognize the vegetation inside the protected areas. In this competition, you are tasked with creation of an algorithm that can identify a specific type of cactus in aerial imagery.
Data preparation
As you know, data should be converted into appropriately pre-processed floating-point tensors before being fed to our network. The steps for getting it into the network are roughly:
- Read the picture files
- Decode JPEG content to RGB pixels
- Convert these into floating-point tensors
- Rescale pixel values (between 0 and 255) to the [0, 1] interval.
We will use the ImageDataGenerator class available in Keras to do all the preprocessing.
datagen = ImageDataGenerator(rescale=1./255)
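To make the rescaling step concrete, here is a minimal plain-Python sketch of what rescale=1./255 does to each pixel value. The tiny 2x2 `image` and the helper name `rescale_pixels` are illustrative assumptions and not part of Keras; the generator applies the same multiplication to whole image tensors.

```python
# Hypothetical 2x2 grayscale image with 8-bit pixel values (0..255).
image = [
    [0, 51],
    [102, 255],
]


def rescale_pixels(pixels, factor=1.0 / 255):
    """Multiply every pixel by `factor`, mapping 0..255 onto [0, 1]."""
    return [[value * factor for value in row] for row in pixels]


scaled = rescale_pixels(image)
print(scaled)
```

Small, similarly scaled inputs are generally easier for a network to train on than raw 0..255 integers, which is why this step comes last in the list above.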
The flow_from_dataframe method is useful when all images sit in a single folder; in other words, images from different classes/labels are not split into per-class subfolders. With this kind of data, text files describing the class and other attributes of each image are usually provided. In that case we use pandas and the provided text files to build a DataFrame whose columns hold the file name (only the file name, not the path) and the class labels used by the model. The arguments to use with this method are:
dataframe: A pandas DataFrame holding meaningful data (the file name and class columns are required). It contains the file paths of the images, relative to directory (or absolute paths if directory is None), in a string column. It should include other column(s) depending on class_mode:
- if class_mode is "categorical" (the default), it must include the y_col column with the class(es) of each image. Values in the column can be string/list/tuple if a single class, or list/tuple if multiple classes.
- if class_mode is "binary" or "sparse", it must include the given y_col column with class values as strings.
- if class_mode is "other", it should contain the columns specified in y_col.
- if class_mode is "input" or None, no extra column is needed.
directory: The path to the parent directory containing all images.
DataFrames in pandas
Contents of the dataset's train.csv file:
X = pd.read_csv('…/train.csv')
# Getting a basic idea
def extract_features(sample_count, dir, df):
In the code above, because of the "has_cactus" column passed as y_col, class_mode should not be "binary".
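This part is taken from: "Introduction to pathlib: a better way to handle paths than os.path" | Zhihu
# Wrong way: manual string concatenation
# Old solution: the os.path module
# Better solution: the pathlib module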
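The three approaches named in the comments above can be compared side by side. This is a minimal sketch using only the standard library; the directory and file names are hypothetical, and PurePosixPath is used here only so the printed result is the same on every operating system (in everyday code, Path is the usual choice).

```python
import os.path
from pathlib import PurePosixPath

# Hypothetical directory and file names, used only for illustration.
base_dir = "input"
file_name = "train.csv"

# Manual concatenation: the separator is hard-coded and easy to get wrong.
manual = base_dir + "/" + file_name

# os.path: the separator is chosen for the current operating system.
joined = os.path.join(base_dir, file_name)

# pathlib: the "/" operator joins components into a path object,
# which also exposes the pieces of the path as attributes.
path = PurePosixPath(base_dir) / file_name

print(manual)
print(joined)
print(path, path.name, path.suffix, path.parent)
```

The pathlib version reads most naturally, and the resulting object gives direct access to the file name, extension, and parent directory without extra string handling.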
Dogs vs. Cats
Analyzing the Training Data
filenames = os.listdir("../input/train/train")
- Read: pd.read_csv('…')
- Create: pd.DataFrame({ })
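As a rough illustration of how the file listing can be turned into the columns of a DataFrame, the sketch below derives a numeric label from each file name. The file names are made up, and the naming pattern (label, then id, then extension, separated by dots) is an assumption about the data set; the column names "filename" and "category" are likewise illustrative.

```python
# Hypothetical file names; with the real data they would come from
# os.listdir on the training folder.
filenames = ["cat.0.jpg", "dog.0.jpg", "dog.1.jpg", "cat.1.jpg"]

# Assumption: the label is the part of the name before the first dot.
# Dogs are encoded as 1 and cats as 0.
categories = [1 if name.split(".")[0] == "dog" else 0 for name in filenames]

# A dict of equal-length lists is the layout that pd.DataFrame({...}) accepts.
columns = {"filename": filenames, "category": categories}
print(columns)
```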
Prepare Training data
Because we will use the image generator with class_mode="categorical", we need to convert the category column to strings. The image generator will then convert it to a one-hot encoding, which suits our classification task.
So we will convert 1 to dog and 0 to cat.
df["category"] = df["category"].replace({0: 'cat', 1: 'dog'})
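To show what the one-hot encoding of these string labels looks like, here is a small plain-Python sketch. It is not the generator's own code: the helper `one_hot` is hypothetical, and assigning class indices in alphabetical order is an assumption made for the illustration.

```python
# Assumption: class indices follow the alphabetical order of the class names.
class_names = sorted({"dog", "cat"})
class_index = {name: i for i, name in enumerate(class_names)}


def one_hot(label):
    """Return a vector with 1.0 at the label's index and 0.0 elsewhere."""
    vector = [0.0] * len(class_names)
    vector[class_index[label]] = 1.0
    return vector


print(class_index)
print(one_hot("cat"), one_hot("dog"))
```

Each label becomes a vector with one entry per class, which matches a final layer that outputs one probability per class.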
from sklearn.model_selection import train_test_split
model.add(Dense(2, activation='softmax'))  # 2 because we have cat and dog classes
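The softmax activation on the two-unit output layer turns the layer's raw scores into probabilities that sum to 1. A minimal worked example in plain Python follows; the input scores are invented for illustration and do not come from a trained model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the two output units.
probs = softmax([1.0, 3.0])
predicted = probs.index(max(probs))
print(probs, predicted)
```

The predicted class is simply the index of the largest probability, which is why the layer needs exactly one unit per class.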
Prepare data
Digit Recognizer
MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.
In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We’ve curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.
Data Analysis
# Normalize the data
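Also, it must be 0; it cannot be ':'.
- target_size: tuple of integers (height, width), default (256, 256). All images found will be resized to these dimensions, so it must be specified.
Here the generator operates on NumPy arrays. See: the Keras Chinese documentation.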
flow(x, y=None, batch_size=32, shuffle=True, sample_weight=None, seed=None, save_to_dir=None, save_prefix='', save_format='png', subset=None)
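- x: Input data. A NumPy array of rank 4, or a tuple. If a tuple, the first element should contain the images and the second element another NumPy array or a list of NumPy arrays that is passed to the output without any modification. This can be used to feed the model miscellaneous data along with the images. For grayscale data, the channel axis of the image array should have value 1; for RGB data, 3; for RGBA data, 4.
- y: Labels.
- batch_size: Integer (default 32).
- shuffle: Boolean (default True).
- sample_weight: Sample weights.
- seed: Integer (default None).
- save_to_dir: None or a string (default None). This lets you optionally specify a directory in which to save the augmented pictures being generated (useful for visualizing what you are doing).
- save_prefix: String (default ''). Prefix for the file names of saved pictures (only relevant if save_to_dir is set).
- save_format: One of "png", "jpeg" (only relevant if save_to_dir is set). Default: "png".
- subset: Subset of data ("training" or "validation") if validation_split is set in ImageDataGenerator.
Returns: an Iterator yielding tuples (x, y), where x is a NumPy array of image data (in the case of a single image input) or a list of NumPy arrays (in the case of additional inputs), and y is a NumPy array of the corresponding labels. If sample_weight is not None, the yielded tuples are of the form (x, y, sample_weight). If y is None, only the NumPy array x is returned.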
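The shape of the values the iterator yields can be illustrated with a toy generator in plain Python. This is a simplified stand-in and not Keras code: the name `toy_flow` is hypothetical, it works on ordinary lists, makes a single pass, and does no shuffling or augmentation. It only mirrors the three return forms described above.

```python
def toy_flow(x, y=None, batch_size=32, sample_weight=None):
    """Yield batches shaped like the return values described above."""
    for start in range(0, len(x), batch_size):
        stop = start + batch_size
        batch = [x[start:stop]]
        if y is not None:
            batch.append(y[start:stop])
            if sample_weight is not None:
                batch.append(sample_weight[start:stop])
        # Without labels only x is yielded; otherwise a tuple is yielded.
        yield batch[0] if len(batch) == 1 else tuple(batch)


# Hypothetical stand-ins for image data and labels.
images = ["img0", "img1", "img2", "img3", "img4"]
labels = [0, 1, 1, 0, 1]

batches = list(toy_flow(images, labels, batch_size=2))
for batch_x, batch_y in batches:
    print(batch_x, batch_y)
```

Note that the last batch is smaller than batch_size when the number of samples is not a multiple of it.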
NOAA Fisheries Steller Sea Lion Population Count
Steller sea lions in the western Aleutian Islands have declined 94 percent in the last 30 years. The endangered western population, found in the North Pacific, are the focus of conservation efforts which require annual population counts. Specially trained scientists at NOAA Fisheries Alaska Fisheries Science Center conduct these surveys using airplanes and unoccupied aircraft systems to collect aerial images. Having accurate population estimates enables us to better understand factors that may be contributing to lack of recovery of Stellers in this area.
Currently, it takes biologists up to four months to count sea lions from the thousands of images NOAA Fisheries collects each year. Once individual counts are conducted, the tallies must be reconciled to confirm their reliability. The results of these counts are time-sensitive.
In this competition, Kagglers are invited to develop algorithms which accurately count the number of sea lions in aerial photographs. Automating the annual population count will free up critical resources allowing NOAA Fisheries to focus on ensuring we hear the sea lion’s roar for many years to come. Plus, advancements in computer vision applied to aerial population counts may also greatly benefit other endangered species.
Show image
#1. import cv2