Build Models: Advanced#
In this section, we discuss the flexibility of input format, the defaults of required information, and the use of filter information for building models.
Input format#
We can store image set information in an xlsx file, a csv file, or a DataFrame.
xlsx input#
In the :ref:`build model basics <build_basics_building_models>` section, we used an
xlsx (Excel) file as the carrier of the image set information:
| img_set_path | output_path | model_name | num_points | num_clusters | channel |
|---|---|---|---|---|---|
|  |  | control-ogd-30min | 50 | 5 | c1 |
We converted the xlsx file to a DataFrame that vp.model.fit_models() can take in:
>>> import pandas as pd
>>> import vampire as vp
>>> build_df = pd.read_excel(r'C:\vampire-ogd\build.xlsx')
>>> vp.model.fit_models(build_df, random_state=1)
csv input#
vampire-analysis also accepts a csv file for storing the information,
if that suits your workflow better:
# build.csv
img_set_path,output_path,model_name,num_points,num_clusters,channel
C:\vampire-ogd\both,C:\vampire-ogd,control-ogd-30min,50,5,c1
We can then convert the csv file to a DataFrame that vp.model.fit_models() can take in:
>>> build_df = pd.read_csv(r'C:\vampire-ogd\build.csv')
>>> vp.model.fit_models(build_df, random_state=1)
DataFrame input#
For pipelines fully automated by Python, direct use of DataFrame is encouraged:
>>> d = {'img_set_path': [r'C:\vampire-ogd\both'],
...      'output_path': [r'C:\vampire-ogd'],
...      'model_name': ['control-ogd-30min'],
...      'num_points': [50],
...      'num_clusters': [5],
...      'channel': ['c1']}
>>> build_df = pd.DataFrame(data=d)
>>> vp.model.fit_models(build_df, random_state=1)
Input file structure#
The input file for building models consists of required information in the first 5 columns and optional filter information in additional columns, if needed.
See also
:func:`vampire.model.build_models`
.. _build_advanced_required_info:
Defaults of required information#
Here, we discuss rules of the required information and their default values and provide some examples.
Rules#
The first 5 columns of the input DataFrame img_info_df must be the 5
required columns, in this order:
img_set_path : str
    Path to the image set(s) to be used to build model.
output_path : str, default
    Path of the directory used to output model and figures. Defaults to the path of the directory of each image set.
model_name : str, default
    Name of the model. Defaults to time of function call.
num_points : int, default
    Number of sample points of object contour. Defaults to 50.
num_clusters : int, default
    Number of clusters of K-means clustering. Defaults to 5. Recommended range [2, 10].
The default values are used in the default columns when
- the space is left blank in the csv or xlsx file before converting to DataFrame
- the space is None/np.NaN in the DataFrame
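The rules above can be illustrated with a minimal sketch. The `fill_defaults` helper below is hypothetical and written only for this illustration; it mirrors the documented defaults and is not vampire-analysis's actual implementation. The timestamp format is an assumption based on the example model names in this section, and the path is taken from the csv example above.

```python
import io
from datetime import datetime

import pandas as pd

# A csv in which only img_set_path and the channel filter are filled in.
csv_text = (
    "img_set_path,output_path,model_name,num_points,num_clusters,channel\n"
    r"C:\vampire-ogd\both,,,,,c1" "\n"
)
build_df = pd.read_csv(io.StringIO(csv_text))

# Blank cells in the four default columns are read as NaN.
blank = build_df.iloc[0, 1:5].isna()


def fill_defaults(row):
    """Hypothetical helper: apply the documented defaults by column position."""
    img_set_path, output_path, model_name, num_points, num_clusters = row.iloc[:5]
    return {
        "img_set_path": img_set_path,
        # output_path defaults to the image set path
        "output_path": img_set_path if pd.isna(output_path) else output_path,
        # model_name defaults to the time of the call (format assumed)
        "model_name": (datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
                       if pd.isna(model_name) else model_name),
        "num_points": 50 if pd.isna(num_points) else int(num_points),
        "num_clusters": 5 if pd.isna(num_clusters) else int(num_clusters),
    }


filled = fill_defaults(build_df.iloc[0])
```

Reading the helper top to bottom gives the same substitutions as the rules: the output path falls back to the image set path, the model name to a timestamp, and the two integers to 50 and 5.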
Example: default values#
For example, the csv or xlsx file with content
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | channel |
|---|---|---|---|---|---|
|  |  |  |  |  | c1 |
is equivalent to
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | channel |
|---|---|---|---|---|---|
|  |  | 2021-08-04_13-45-37 | 50 | 5 | c1 |
because
- the default of output_path is img_set_path
- the default of model_name is the model build time, which is, for example, 2021-08-04_13-45-37
- the default of num_points is 50
- the default of num_clusters is 5
Example: order matters#
The five required columns must appear in order.
For example, if we want to input the following information
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | channel |
|---|---|---|---|---|---|
|  |  | ogd-both |  |  | c1 |
do not shuffle the columns as follows:
Error
| model_name | img_set_path | num_clusters | output_path | num_points | channel |
|---|---|---|---|---|---|
| ogd-both |  |  |  |  | c1 |
This will NOT give the desired output, because vampire-analysis reads
the columns by position, as if the table were:
Error
| img_set_path | output_path | model_name | num_points | num_clusters | channel |
|---|---|---|---|---|---|
| ogd-both |  |  |  |  | c1 |
which makes no sense: the model name ogd-both is read as the image set path.
Example: column headings and default values#
Even when you have left the columns blank for default, the column heading has to appear as a placeholder.
For example, the table without required default column headings
Error
| img_set_path | channel |
|---|---|
|  | c1 |
will throw ValueError: Input DataFrame does not have enough number of columns. Instead, use column headings as placeholders:
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | channel |
|---|---|---|---|---|---|
|  |  |  |  |  | c1 |
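When the table is assembled in Python, both requirements (column order and placeholder headings) can be met in one step. The sketch below is only an illustration: `normalize_columns` is a hypothetical helper, not part of vampire-analysis, and the path is taken from the csv example earlier in this section. It relies on pandas `DataFrame.reindex`, which reorders columns and adds any missing ones filled with NaN.

```python
import pandas as pd

REQUIRED = ["img_set_path", "output_path", "model_name",
            "num_points", "num_clusters"]


def normalize_columns(df):
    """Hypothetical helper: required columns first, in order, then filters."""
    filter_cols = [c for c in df.columns if c not in REQUIRED]
    # reindex reorders existing columns and creates missing ones as NaN
    return df.reindex(columns=REQUIRED + filter_cols)


# Shuffled and incomplete input: only two of the six columns are present.
partial = pd.DataFrame({
    "channel": ["c1"],
    "img_set_path": [r"C:\vampire-ogd\both"],
})
build_df = normalize_columns(partial)
```

The resulting `build_df` has the five required columns in order, with NaN placeholders where defaults should apply, followed by the `channel` filter column.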
Example: multiple image sets and defaults#
You may specify multiple image sets, each used to build its own model, with flexible use of defaults:
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | channel |
|---|---|---|---|---|---|
|  |  | ogd-both | 40 |  | c1 |
|  |  |  | 80 | 10 | c1 |
|  |  |  |  |  | c1 |
|  |  | seven |  | 7 | c1 |
which is equivalent to
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | channel |
|---|---|---|---|---|---|
|  |  | ogd-both | 40 | 5 | c1 |
|  |  | 2021-08-04_13-45-37 | 80 | 10 | c1 |
|  |  | 2021-08-04_13-46-11 | 50 | 5 | c1 |
|  |  | seven | 50 | 7 | c1 |
Note that because the analysis takes some time, a model name that defaults to the model build time will differ between image sets.
Use of filter information#
Here, we discuss rules and example use of filter information in the optional columns.
Rules#
The input DataFrame img_info_df could contain any number (none
to many) of optional columns at the right of the required columns.
These optional columns serve as filters to the image filenames.
The images with filenames containing values of all filters are used
in analysis.
filter1 : str, optional
    Unique filter of image filenames to be analyzed. E.g. “c1” for channel 1.
filter2 : str, optional
    Unique filter of image filenames to be analyzed. E.g. “cortex” for sample region.
… : str, optional
    Unique filter of image filenames to be analyzed. E.g. “40x” for magnification.
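The filtering rule can be sketched in a few lines of plain Python. This is an illustration of the rule as described here, assuming simple case-sensitive substring matching; `select_images` is a hypothetical helper and not the library's own code.

```python
def select_images(filenames, filters):
    """Keep filenames that contain every filter value (AND logic)."""
    return [name for name in filenames
            if all(value in name for value in filters)]


images = [
    "4-50-1_40x_cortex_1_c1.tiff",
    "4-50-1_40x_cortex_1_c2.tiff",
    "4-50-1_40x_hippocampus_1_c1.tiff",
    "4-50-1_40x_hippocampus_1_c2.tiff",
]

no_filter = select_images(images, [])                       # all images
one_filter = select_images(images, ["c1"])                  # channel 1 only
two_filters = select_images(images, ["c1", "hippocampus"])  # both must match
```

With no filter values every image is kept, and each additional filter can only narrow the selection.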
Example: no filter#
Suppose we have images
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_hippocampus_1_c1.png
- 4-50-1_40x_hippocampus_1_c2.jpeg
We want to analyze all images in the image set folder.
We can simply not have any columns at the right of the required columns
to signify we are not using any filters. That is, all images, with supported
extensions '.tiff', '.tif', '.jpeg', '.jpg', '.png', '.bmp', '.gif',
will be used in building the model:
Tip
| img_set_path | output_path | model_name | num_points | num_clusters |
|---|---|---|---|---|
|  |  |  |  |  |
All the files are used to build the model:
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_hippocampus_1_c1.png
- 4-50-1_40x_hippocampus_1_c2.jpeg
Example: one filter#
Suppose we have images
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_cortex_1_c2.tiff
- 4-50-1_40x_cortex_1_c3.tiff
and only want to include channel 1 images, which contain c1 in their
filenames. We can use an optional column as a filter:
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | channel |
|---|---|---|---|---|---|
|  |  |  |  |  | c1 |
so that only the channel 1 image is used to build the model:
4-50-1_40x_cortex_1_c1.tiff
Example: multiple filters#
Suppose we have images
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_cortex_1_c2.tiff
- 4-50-1_40x_cortex_1_c3.tiff
- 4-50-1_40x_hippocampus_1_c1.tiff
- 4-50-1_40x_hippocampus_1_c2.tiff
- 4-50-1_40x_hippocampus_1_c3.tiff
and want to include only images that are in channel 1 AND in hippocampus,
that is, images whose filenames contain both c1 and hippocampus.
We can use multiple optional columns as an AND filter:
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | channel | region |
|---|---|---|---|---|---|---|
|  |  |  |  |  | c1 | hippocampus |
so that only the image whose filename contains both c1 and hippocampus is used:
4-50-1_40x_hippocampus_1_c1.tiff
Note
The headings of the optional columns do not affect the analysis. Use headings that are descriptive for your purposes.
Warning
The optional columns serve as an AND filter, which means only images that satisfy condition 1 AND condition 2 will be used. To illustrate this, see the next example.
Example: AND filter#
Suppose we have images
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_cortex_1_c2.tif
- 4-50-1_40x_cortex_1_c3.tiff
- 4-50-1_40x_hippocampus_1_c1.png
- 4-50-1_40x_hippocampus_1_c2.png
- 4-50-1_40x_hippocampus_1_c3.jpeg
If the image set contains images with different file extensions, and we
only want a particular file extension, say tiff, to be used in
building the model, we can use
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | extension |
|---|---|---|---|---|---|
|  |  |  |  |  | tiff |
so that only images whose filename contains tiff are used:
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_cortex_1_c3.tiff
However, we cannot use the optional columns to filter multiple extensions, such as
Error
| img_set_path | output_path | model_name | num_points | num_clusters | extension1 | extension2 | extension3 |
|---|---|---|---|---|---|---|---|
|  |  |  |  |  | tiff | tif | png |
because what we want is files with extension tiff OR tif OR png,
but vampire-analysis looks for files whose names contain tiff AND
tif AND png. No image satisfies that condition.
OR filtering is currently not supported.
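The difference between the two behaviours can be made concrete with a short sketch, assuming plain substring matching as in the rules above. The `any`-based variant is shown only for contrast; it is not something the optional columns provide.

```python
images = [
    "4-50-1_40x_cortex_1_c1.tiff",
    "4-50-1_40x_cortex_1_c2.tif",
    "4-50-1_40x_cortex_1_c3.tiff",
    "4-50-1_40x_hippocampus_1_c1.png",
    "4-50-1_40x_hippocampus_1_c2.png",
    "4-50-1_40x_hippocampus_1_c3.jpeg",
]
filters = ["tiff", "tif", "png"]

# AND: every filter value must appear in the filename (the documented rule).
and_matches = [name for name in images
               if all(value in name for value in filters)]

# OR: at least one filter value appears (NOT what the optional columns do).
or_matches = [name for name in images
              if any(value in name for value in filters)]
```

Under the AND rule the selection is empty, while the OR variant would have kept every file except the jpeg.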
Example: filter combinations#
Suppose we have images
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_cortex_1_c2.tiff
- 4-50-1_40x_cortex_2_c1.tiff
- 4-50-1_40x_cortex_2_c2.tiff
- 4-50-1_40x_hippocampus_1_c1.tiff
- 4-50-1_40x_hippocampus_1_c2.tiff
- 4-50-1_40x_hippocampus_2_c1.tiff
- 4-50-1_40x_hippocampus_2_c2.tiff
We want to build models from this image set using a combination of channels and regions, as well as the image set as a whole. We can accomplish this with:
Tip
| img_set_path | output_path | model_name | num_points | num_clusters | channel | region |
|---|---|---|---|---|---|---|
|  |  |  |  |  | c1 | cortex |
|  |  |  |  |  | c2 | hippocampus |
|  |  |  |  |  | c1 |  |
|  |  |  |  |  | hippocampus |  |
|  |  |  |  |  |  |  |
so that the 1st model is based on
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_cortex_2_c1.tiff
the 2nd model is based on
- 4-50-1_40x_hippocampus_1_c2.tiff
- 4-50-1_40x_hippocampus_2_c2.tiff
the 3rd model is based on
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_cortex_2_c1.tiff
- 4-50-1_40x_hippocampus_1_c1.tiff
- 4-50-1_40x_hippocampus_2_c1.tiff
the 4th model is based on
- 4-50-1_40x_hippocampus_1_c1.tiff
- 4-50-1_40x_hippocampus_1_c2.tiff
- 4-50-1_40x_hippocampus_2_c1.tiff
- 4-50-1_40x_hippocampus_2_c2.tiff
the 5th model is based on the whole image set
- 4-50-1_40x_cortex_1_c1.tiff
- 4-50-1_40x_cortex_1_c2.tiff
- 4-50-1_40x_cortex_2_c1.tiff
- 4-50-1_40x_cortex_2_c2.tiff
- 4-50-1_40x_hippocampus_1_c1.tiff
- 4-50-1_40x_hippocampus_1_c2.tiff
- 4-50-1_40x_hippocampus_2_c1.tiff
- 4-50-1_40x_hippocampus_2_c2.tiff
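The five selections can be checked with a small sketch in plain Python. It assumes that blank filter cells are simply ignored and that matching is by substring; `select_images` and `is_blank` are hypothetical helpers for illustration, not the library's implementation.

```python
import math

images = [
    "4-50-1_40x_cortex_1_c1.tiff",
    "4-50-1_40x_cortex_1_c2.tiff",
    "4-50-1_40x_cortex_2_c1.tiff",
    "4-50-1_40x_cortex_2_c2.tiff",
    "4-50-1_40x_hippocampus_1_c1.tiff",
    "4-50-1_40x_hippocampus_1_c2.tiff",
    "4-50-1_40x_hippocampus_2_c1.tiff",
    "4-50-1_40x_hippocampus_2_c2.tiff",
]

# Filter values of the five rows; None marks a blank cell.
rows = [
    ("c1", "cortex"),
    ("c2", "hippocampus"),
    ("c1", None),
    ("hippocampus", None),
    (None, None),
]


def is_blank(value):
    """True for None or NaN, the two forms a blank cell can take."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def select_images(filenames, filters):
    """Keep filenames containing every non-blank filter value."""
    active = [value for value in filters if not is_blank(value)]
    return [name for name in filenames
            if all(value in name for value in active)]


selections = [select_images(images, row) for row in rows]
counts = [len(selection) for selection in selections]
```

Each entry of `selections` lists the images behind the corresponding model, and `counts` gives a quick check of how many images each row picks up.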
Conclusion#
We have explored options to provide input information to build models using csv, xlsx, and DataFrame. We also looked at the requirements and examples of required and optional filtering information for building models.
Next, we will look at some advanced options when specifying image set information for applying models.