Build Models: Advanced#

In this section, we discuss the flexibility of input format, the defaults of required information, and the use of filter information for building models.

Input format#

We can store image set information using xlsx, csv, or DataFrame.

xlsx input#

In the :ref:build model basics <build_basics_building_models> section, we used an xlsx excel file as the carrier of the image set information:

img_set_path

output_path

model_name

num_points

num_clusters

channel

C:\vampire-ogd\both

C:\vampire-ogd

control-ogd-30min

50

5

c1

We converted the xlsx file to a DataFrame that vampire.model.build_model() can take in:

>>> import pandas as pd
>>> import vampire as vp
>>> build_df = pd.read_excel(r'C:\vampire-ogd\build.xlsx')
>>> vp.model.fit_models(build_df, random_state=1)

csv input#

vampire-analysis is also compatible with csv file to store the information if it matches the workflow:

# build.csv
img_set_path, output_path, model_name, num_points, num_clusters, channel
C:\vampire-ogd\both, C:\vampire-ogd, control-ogd-30min, 50, 5, c1

We can then convert the csv file to a DataFrame that vampire.model.build_model can take in:

>>> build_df = pd.read_csv(r'C:\vampire-ogd\build.csv')
>>> vp.model.fit_models(build_df, random_state=1)

DataFrame input#

For pipelines fully automated by Python, direct use of DataFrame is encouraged:

>>> d = {'img_set_path': [r'C:\vampire-ogd\both'],
...      'output_path': [r'C:\vampire-ogd']
...      'model_name': ['control-ogd-30min'],
...      'num_points': [50],
...      'num_clusters': [5],
...      'channel': ['c1']}
>>> build_df = pd.DataFrame(data=d)
>>> vp.model.fit_models(build_df, random_state=1)
...      'model_name': ['control-ogd-30min'],
...      'num_points': [50],
...      'num_clusters': [5],
...      'channel': ['c1']}
>>> build_df = pd.DataFrame(data=d)
>>> vp.model.fit_models(build_df, random_state=1)
...      'model_name': ['control-ogd-30min'],
...      'num_points': [50],
...      'num_clusters': [5],
...      'channel': ['c1']}
>>> build_df = pd.DataFrame(data=d)
>>> vp.model.build_models(build_df, random_state=1)

Input file structure#

The input file for building models consists of required information in the first 5 columns and optional filter information in additional columns, if needed.

See also

:func:`vampire.model.build_models`

.. _build_advanced_required_info:

Defaults of required information#

Here, we discuss rules of the required information and their default values and provide some examples.

Rules#

The input DataFrame img_info_df must contain, in order, the 5 required columns of

  • img_set_path : str

    • Path to the image set(s) to be used to build model.

  • output_path : str, default

    • Path of the directory used to output model and figures. Defaults to the path of the directory of each image set.

  • model_name : str, default

    • Name of the model. Defaults to time of function call.

  • num_points : int, default

    • Number of sample points of object contour. Defaults to 50.

  • num_clusters : int, default

    • Number of clusters of K-means clustering. Defaults to 5. Recommended range [2, 10].

in the first 5 columns. The default values are used in default columns when

  • the space is left blank in csv or xlsx file before converting to DataFrame

  • the space is None/np.NaN in the DataFrame

Example: default values#

For example, the csv or xlsx file with content

Tip

img_set_path

output_path

model_name

num_points

num_clusters

channel

C:\vampire-ogd\both

c1

is equivalent to

Tip

img_set_path

output_path

model_name

num_points

num_clusters

channel

C:\vampire-ogd\both

C:\vampire-ogd\both

2021-08-04_13-45-37

50

5

c1

because

  • the default of output_path is img_set_path

  • the default of model_name is the model build time, which is, for example, 2021-08-04_13-45-37

  • the default of num_points is 50

  • the default of num_clusters is 5

Example: order matters#

The five required columns must appear in order.

For example, if we want to input the following information

Tip

img_set_path

output_path

model_name

num_points

num_clusters

channel

C:\vampire-ogd\both

C:\vampire-ogd

ogd-both

c1

do not shuffle the columns

Error

model_name

img_set_path

num_clusters

output_path

num_points

channel

ogd-both

C:\vampire-ogd\both

C:\vampire-ogd

c1

it will NOT give the desired output, because vampire-analysis will read the table in ordered sequence as:

Error

img_set_path

output_path

model_name

num_points

num_clusters

channel

ogd-both

C:\vampire-ogd\both

C:\vampire-ogd

c1

which makes no sense.

Example: column headings and default values#

Even when you have left the columns blank for default, the column heading has to appear as a placeholder.

For example, the table without required default column headings

Error

img_set_path

channel

C:\vampire-ogd\both

c1

will throw ValueError: Input DataFrame does not have enough number of columns. Instead, use column headings as placeholders:

Tip

img_set_path

output_path

model_name

num_points

num_clusters

channel

C:\vampire-ogd\both

c1

Example: multiple image sets and defaults#

You may specify multiple image sets used to build model with flexible use of defaults:

Tip

img_set_path

output_path

model_name

num_points

num_clusters

channel

C:\vampire-ogd\both

ogd-both

40

c1

C:\vampire-ogd\both

C:\vampire-ogd

80

10

c1

C:\vampire-ogd\both

c1

C:\vampire-ogd\both

seven

7

c1

which is equivalent to

Tip

img_set_path

output_path

model_name

num_points

num_clusters

channel

C:\vampire-ogd\both

C:\vampire-ogd\both

ogd-both

40

5

c1

C:\vampire-ogd\both

C:\vampire-ogd

2021-08-04_13-45-37

80

10

c1

C:\vampire-ogd\both

C:\vampire-ogd\both

2021-08-04_13-46-11

50

5

c1

C:\vampire-ogd\both

C:\vampire-ogd\both

seven

50

7

c1

Note that because the analysis takes some time, the model name that defaults to the build model time will differ for different image sets.

Use of filter information#

Here, we discuss rules and example use of filter information in the optional columns.

Rules#

The input DataFrame img_info_df could contain any number (none to many) of optional columns at the right of the required columns. These optional columns serve as filters to the image filenames. The images with filenames containing values of all filters are used in analysis.

filter1 : str, optional Unique filter of image filenames to be analyzed. E.g. “c1” for channel 1. filter2 : str, optional Unique filter of image filenames to be analyzed. E.g. “cortex” for sample region. … : str, optional Unique filter of image filenames to be analyzed. E.g. “40x” for magnification.

Example: no filter#

Suppose we have images

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_hippocampus_1_c1.png

  • 4-50-1_40x_hippocampus_1_c2.jpeg

We want to analyze all images in the image set folder. We can simply not have any columns at the right of the required columns to signify we are not using any filters. That is, all images, with supported extensions '.tiff', '.tif', '.jpeg', '.jpg', '.png', '.bmp', '.gif', will be used in building the model:

Tip

img_set_path

output_path

model_name

num_points

num_clusters

C:\vampire-ogd\both

All the files are used to build model:

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_hippocampus_1_c1.png

  • 4-50-1_40x_hippocampus_1_c2.jpeg

Example: one filter#

Suppose we have images

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_cortex_1_c2.tiff

  • 4-50-1_40x_cortex_1_c3.tiff

and we only want to include channel 1 images, which contain c1 in their filenames, we can use an optional column as filter:

Tip

img_set_path

output_path

model_name

num_points

num_clusters

channel

C:\vampire-ogd\both

c1

so that only channel 1 image is used to build model:

  • 4-50-1_40x_cortex_1_c1.tiff

Example: multiple filters#

Suppose we have images

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_cortex_1_c2.tiff

  • 4-50-1_40x_cortex_1_c3.tiff

  • 4-50-1_40x_hippocampus_1_c1.tiff

  • 4-50-1_40x_hippocampus_1_c2.tiff

  • 4-50-1_40x_hippocampus_1_c3.tiff

and we want to include images that are in channel 1 AND in hippocampus, which contain c1 and hippocampus in their filenames, we can use an optional columns as an AND filter:

Tip

img_set_path

output_path

model_name

num_points

num_clusters

channel

region

C:\vampire-ogd\both

c1

hippocampus

so that image whose filename contains c1 and hippocampus is used:

  • 4-50-1_40x_hippocampus_1_c1.tiff

Note

The headings of the optional columns do not affect the analysis. Use headings that are descriptive for your purposes.

Warning

The optional columns serve as an AND filter, which means only images that satisfy condition 1 AND condition 2 will be used. To illustrate this, see the next example.

Example: AND filter#

Suppose we have images

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_cortex_1_c2.tif

  • 4-50-1_40x_cortex_1_c3.tiff

  • 4-50-1_40x_hippocampus_1_c1.png

  • 4-50-1_40x_hippocampus_1_c2.png

  • 4-50-1_40x_hippocampus_1_c3.jpeg

If the image set contains images with different file extensions, and we only want a particular file extension, say tiff, to be used in building the model, we can use

Tip

img_set_path

output_path

model_name

num_points

num_clusters

extension

C:\vampire-ogd\both

tiff

so that only images whose filename contains tiff are used:

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_cortex_1_c3.tiff

However, we cannot use the optional columns to filter multiple extensions, such as

Error

img_set_path

output_path

model_name

num_points

num_clusters

extension1

extension2

extension3

C:\vampire-ogd\both

tiff

tif

png

because what we wanted is files with extension tiff OR tif OR png, but vampire-analysis is looking for files that contains tiff AND tif AND png. None of the image satisfied such condition. OR filtering is currently not supported.

Example: filter combinations#

Suppose we have images

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_cortex_1_c2.tiff

  • 4-50-1_40x_cortex_2_c1.tiff

  • 4-50-1_40x_cortex_2_c2.tiff

  • 4-50-1_40x_hippocampus_1_c1.tiff

  • 4-50-1_40x_hippocampus_1_c2.tiff

  • 4-50-1_40x_hippocampus_2_c1.tiff

  • 4-50-1_40x_hippocampus_2_c2.tiff

We want to build models from this image set using a combination of channels and regions, as well as the image set as a whole. We can accomplish this with:

Tip

img_set_path

output_path

model_name

num_points

num_clusters

channel

region

C:\vampire-ogd\both

c1

cortex

C:\vampire-ogd\both

c2

hippocampus

C:\vampire-ogd\both

c1

C:\vampire-ogd\both

hippocampus

C:\vampire-ogd\both

so that the 1st model is based on

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_cortex_2_c1.tiff

the 2nd model is based on

  • 4-50-1_40x_hippocampus_1_c2.tiff

  • 4-50-1_40x_hippocampus_2_c2.tiff

the 3rd model is based on

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_cortex_2_c1.tiff

  • 4-50-1_40x_hippocampus_1_c1.tiff

  • 4-50-1_40x_hippocampus_2_c1.tiff

the 4th model is based on

  • 4-50-1_40x_hippocampus_1_c1.tiff

  • 4-50-1_40x_hippocampus_1_c2.tiff

  • 4-50-1_40x_hippocampus_2_c1.tiff

  • 4-50-1_40x_hippocampus_2_c2.tiff

the 5th model is based on the whole image set

  • 4-50-1_40x_cortex_1_c1.tiff

  • 4-50-1_40x_cortex_1_c2.tiff

  • 4-50-1_40x_cortex_2_c1.tiff

  • 4-50-1_40x_cortex_2_c2.tiff

  • 4-50-1_40x_hippocampus_1_c1.tiff

  • 4-50-1_40x_hippocampus_1_c2.tiff

  • 4-50-1_40x_hippocampus_2_c1.tiff

  • 4-50-1_40x_hippocampus_2_c2.tiff

Conclusion#

We have explored options to provide input information to build models using csv, xlsx, and DataFrame. We also looked at the requirements and examples of required and optional filtering information for building models.

Next, we will look at some advanced options when specifying image set information for applying models.