Several Matlab toolboxes have no Python counterparts.
Many Python modules/packages are in an immature state of development and/or are poorly documented and/or poorly supported. (The lack of good support is not surprising given that much of this is coming from volunteers who are donating their time). It is worth investigating whether a given module/package is actively maintained before developing a dependence on it; otherwise, one may find oneself in the position of having to devise workarounds and patches for that code!
For scientific and engineering work, NumPy, SciPy, and matplotlib are arguably the most important Python packages. NumPy and matplotlib are generally well documented, but SciPy documentation is often unclear or simply missing.
Here is a specific example. scipy.interpolate.LSQUnivariateSpline fits a smoothing spline to data. The documentation for the method get_coeffs() does not explain the meaning of the coefficients returned by the method. (Given that the method returns fewer coefficients than one would expect, the lack of documentation is problematic).
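For concreteness, here is a minimal sketch of the situation (the data, noise level, and knot locations are made up for illustration):

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

x = np.linspace(0.0, 10.0, 101)
y = np.sin(x) + 0.1 * np.random.randn(x.size)
t = [2.5, 5.0, 7.5]                      # interior knots (arbitrary choice)

spline = LSQUnivariateSpline(x, y, t)    # cubic (k=3) by default
c = spline.get_coeffs()
# get_coeffs() returns the coefficients of the underlying B-spline basis;
# there are far fewer of them than data points, and the documentation does
# not explain which basis functions they multiply.
print(x.size, len(c))
```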
matplotlib is a very capable package for non-interactive plotting, but there are some significant problems. Three themes are common to most of the issues with this package:
There is in general a lack of uniformity among the interfaces to the various functions and methods. Here's a concrete example: When one generates a text box using the pyplot.annotate function or the axes object's annotate method, the xycoords keyword can be used to specify whether the text location is specified in terms of data coordinates, axes fractional coordinates, or figure fractional coordinates. When using the pyplot.text function, on the other hand, there is no xycoords keyword, and the text location can only be specified in data coordinates, which is typically not what is desired.
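A short sketch of the asymmetry (the plotted data and label strings are made up):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# annotate() accepts an xycoords keyword, so the label position can be given
# in axes-fraction (or figure-fraction) coordinates...
ax.annotate('peak region', xy=(0.8, 0.9), xycoords='axes fraction')

# ...whereas text() has no xycoords keyword and places its string in data
# coordinates by default.
ax.text(1.0, 2.0, 'a note in data coordinates')

plt.show()
```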
Default behaviors are often not sensible, with the result that producing simple but professional-looking plots tends to require more fiddling than should be necessary. Here are three examples:
By default, the margin area of a figure has a dark gray background color, and titles, axes labels, and other text are black. Black against a dark gray background produces poor contrast. One can easily solve this problem by changing the margin background color ('facecolor') from dark gray to white, as shown in the following snippet of code, but this requires the additional facecolor argument:
fig = pyplot.figure(facecolor=[1, 1, 1])
Plot titles are by default jammed against the upper edge of the plot. Fixing this requires the additional y keyword argument shown in the following snippet:
axes.set_title('text', y=1.02)
When one produces a polar contour plot, the 'y' axis (circumferential axis) labels are by default displayed in degrees while the 'x' axis (radial axis) labels are displayed in radians! A sample plot illustrating this behavior appears below; it was generated via the script inconsistent_angle_units.py.
The interface puts too much burden on the programmer's memory (and on references). Here are a few examples:
Although Python and most Python packages follow the computer science convention of counting from zero, matplotlib counts subplots starting from one.
The functions contour and contourf can be called with up to four positional arguments, with the interpretation of the fourth argument depending on its type. This excessive use of positional arguments and type-dependent behaviors, which is reminiscent of Matlab, creates the potential for confusion. Use of alternative keyword arguments (with suitable error checks for the presence of conflicting keywords) would be a better design.
To annotate a figure containing multiple subplots, with the annotation location specified in figure fractional coordinates, one must choose a subplot at random and apply the annotation to it. I had expected that I would need to invoke an annotate method of the figure object, but there is no such method. I suspect that most programmers would find this behavior confusing.
There is no mechanism for defining a named constant in Python. (When one defines a constant in a language such as C++, this instructs the compiler that any attempt to change the value should be treated as an error).
There is no clean way to break out of two or more nested loops. One must do one of the following:
Set a flag before breaking out of the first loop, and then test that flag in each of the outer loops.
Use exception handling.
Move the nested loops to be broken out of into a function, and use the return statement to terminate the function (see the sketch following this list).
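As an illustration of the third approach, here is a minimal sketch (the function name and data are made up):

```python
def find_first_match(rows, target):
    """Return the (row, column) indices of the first cell equal to target."""
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value == target:
                return i, j          # terminates both loops at once
    return None

print(find_first_match([[3, 7], [9, 4]], 9))   # -> (1, 0)
```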
scipy.signal is of limited utility for engineering applications. I'm going to focus here on what I see as one of the key limitations: the filtering functionality. I'd like to see the following:
support for lowpass, bandpass, and bandstop FIR filter design, with the user specifying (a) the passband ripple, (b) the minimum stopband rejection, and (c) the design method, with this last being optional. Rather than forcing the user to specify the order of the filter, which requires many iterations to determine the minimum filter order that will do the job, the code should automatically determine the minimum filter order that can meet the user's specs.
support for fixed-binary-point arithmetic.
support for filtering and the design of filters that use fixed-binary-point arithmetic.
Such changes would be a big step in the direction of making Python+NumPy+SciPy a viable alternative to Matlab + the Matlab Signal Processing Toolbox.
As an aside, I'd like to comment on the documentation for `scipy.signal.kaiserord`, which says the following:
scipy.signal.kaiserord(ripple, width)
<snip>
ripple : float
    Positive number specifying maximum ripple in passband (dB) and minimum ripple in stopband.
When designing a lowpass digital filter, one normally specifies the maximum ripple in the passband and the minimum rejection in the stopband. With this function, there is no way to specify how much rejection one gets in the stopband, and the filter design code is apparently also trying to limit stopband ripple, which is something no engineer would care about. The issue cannot be merely one of wording in the documentation: if the stopband rejection could be specified independently, there would have to be another parameter for it.
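To make the point concrete, here is a sketch of a typical Kaiser-window lowpass design with the existing scipy.signal functions (the sampling rate, cutoff, and specs are made up). Note that kaiserord accepts a single ripple figure; there is no separate parameter for stopband rejection:

```python
import numpy as np
from scipy import signal

fs = 1000.0                      # sampling rate, Hz
cutoff = 100.0                   # desired cutoff frequency, Hz
width = 20.0 / (0.5 * fs)        # transition width as a fraction of Nyquist
ripple_db = 60.0                 # the single ripple/attenuation spec, in dB

numtaps, beta = signal.kaiserord(ripple_db, width)
taps = signal.firwin(numtaps, cutoff / (0.5 * fs), window=('kaiser', beta))

x = np.random.randn(2048)        # dummy input signal
y = signal.lfilter(taps, 1.0, x)
```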
Support for reading and writing of Microsoft Excel files is unsatisfactory. In particular:
Although a single package (xlrd) can read both the older 2000/2003 and the newer 2007+ Excel file formats, separate packages with different interfaces must be used to write Excel files in these two formats. The xlwt and openpyxl packages can be used to write Excel files in the older and newer formats, respectively. As of this writing, I know of no open source Python code that answers the need for a single interface that works transparently across both of these formats.
When working with Excel files containing column-organized data, one may wish to extract data from columns whose numbers are not a priori known, identifying the desired columns by strings appearing in the header row rather than by numbers. xlrd and openpyxl do not support this.
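Absent such support, one must code the lookup oneself. A sketch of the workaround using xlrd (the file name and header string are hypothetical):

```python
import xlrd

def column_by_header(sheet, header, header_row=0):
    """Return the values below the column whose header-row cell equals header."""
    headers = [cell.value for cell in sheet.row(header_row)]
    col = headers.index(header)               # raises ValueError if absent
    return sheet.col_values(col, start_rowx=header_row + 1)

book = xlrd.open_workbook('data.xls')         # hypothetical file
sheet = book.sheet_by_index(0)
flow_rates = column_by_header(sheet, 'Flow rate')
```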
As of this writing, xlrd and openpyxl do not permit one to read the formulas from a workbook or write a workbook containing formulas. (If one reads a workbook and writes it out again, all formulas in all worksheets are deleted).
When an Excel file is loaded into memory using xlrd or openpyxl, there is no mechanism for determining the memory occupied by the associated Python data structure. (Excel 2007/2010 files are compressed, and the memory footprint may thus be much greater than the size of the file on disk). The inability to determine the occupied memory is problematic for file caching systems, which almost always operate with a specified total memory limit, and thus must be able to determine the memory footprint of each item in the cache.
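In the absence of such a mechanism, the best one can do is a rough estimate. The following sketch recursively applies sys.getsizeof to Python containers; it is only an approximation and does not see memory held by C extensions:

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Rough, recursive estimate of the memory held by obj (bytes)."""
    if seen is None:
        seen = set()
    if id(obj) in seen:                 # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    elif hasattr(obj, '__dict__'):
        size += deep_getsizeof(vars(obj), seen)
    return size

# book = xlrd.open_workbook('data.xls')       # hypothetical workbook
# print(deep_getsizeof(book))
```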
xlrd and openpyxl tend to inundate the user with nuisance messages.
If an Excel 2007/2010 workbook contains chart sheets, openpyxl gets confused about the names and indices of the sheets, so that a request for data from one sheet may produce data from a different sheet.
It appears that the openpyxl package is no longer maintained.
Footnote: Someone I respect opined that "Using Excel is the problem here". For better or worse, Excel is ubiquitous in engineering organizations, and one cannot always dictate the format in which data to be analyzed is provided.
SciPy includes the scipy.optimize package for performing optimization. There are numerous issues with this package:
Several of the algorithms are rather dated. For example, it appears that SciPy implements the original, 1965 version of the Nelder-Mead algorithm. This version may fail to converge or converge to a non-solution. And even when it does converge to a valid solution, convergence may be very slow. An article by Saša Singer and John Nelder discusses some of these issues and proposes a more efficient version of the algorithm.
The package needs a uniform mechanism for specifying termination conditions (iteration limits and tolerances). The lack of this makes experimentation with alternative optimization algorithms more cumbersome.
scipy.optimize does not provide any capabilities for dividing work over multiple cores in a single computer, or over multiple nodes in a cluster of computers.
The scipy.optimize.brute solver implements a brute force grid search. This is useful, but there are situations in which one does not a priori know what grid spacing to use, and would like to let the solver continue to search until a solution of some minimal quality is found. Something like this can be implemented using subrandom numbers, which cover the domain of interest quickly and evenly without requiring advance specification of the number of points. (See, e.g., the Wikipedia article on subrandom numbers).
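A minimal sketch of generating such subrandom (Halton) points, which can be extended one point at a time without choosing a grid spacing in advance (the domain and point count are made up):

```python
import numpy as np

def halton(index, base):
    """Element 'index' (1-based) of the van der Corput sequence in 'base'."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def subrandom_points(n, lo, hi):
    """First n points of a 2-D Halton sequence, scaled to the box [lo, hi]^2."""
    pts = np.array([[halton(i, 2), halton(i, 3)] for i in range(1, n + 1)])
    return lo + (hi - lo) * pts

# Unlike a fixed grid, the search can be extended simply by asking for more
# points; the earlier points remain valid.
print(subrandom_points(5, -2.0, 2.0))
```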
When investigating a class of related optimization problems, it is sometimes important to be able to determine whether multiple local minima typically occur, and if so, how accurately the starting point of the search needs to be specified for a given optimization algorithm to be able to converge to the global minimum. Currently, scipy.optimize.brute cannot be used to study such problems because its finishing (stage-2) search is performed using only the best result from the initial (stage-1) search. An option to perform a finishing search for a specified fraction of the grid points, with output including statistics on the number of distinct local minima found, would make this function far more useful.
The addition of one or more genetic optimization algorithms would be welcome.
The scipy.optimize.brute solver permits one to combine brute force grid search with a second stage of optimization, but there is no mechanism for passing termination conditions or other options to the second-stage optimizer.
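The workaround is to wrap the second-stage optimizer so that the desired options are baked in. A sketch (the objective function and tolerances are made up; this is not a scipy-provided mechanism):

```python
from scipy import optimize

def rosen(x):
    """Rosenbrock function; global minimum at (1, 1)."""
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def finish(func, x0, args=(), full_output=1, disp=0):
    # Forward brute()'s standard call on to fmin with our own termination
    # options, since brute() itself provides no way to pass them through.
    return optimize.fmin(func, x0, args=args, xtol=1e-8, ftol=1e-8,
                         maxiter=500, full_output=full_output, disp=disp)

xmin = optimize.brute(rosen, ((-3.0, 3.0), (-3.0, 3.0)), Ns=25, finish=finish)
print(xmin)
```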
The mystic package, of which Mike McKerns is the primary developer, appears to offer many advantages over scipy.optimize, but is not currently well enough documented to represent a usable alternative.
Python's Standard Library provides a module called ConfigParser for extracting parameter values (e.g., simulation model parameters) from ini (configuration) files, but this module was not well designed and is not a practical tool for serious simulation projects. A few specific issues are the following:
The definition of model parameters and processing of the ini file have been conflated together. These are logically separate steps, i.e., there should be one call to define a specific parameter, and another call to parse the ini file and recover all parameter values appearing in a given section. The parameter definition call would specify the parameter name, the data type, and optionally a default value and help text to be displayed on user request.
To understand the benefit of the proposed design, suppose that there are m parameters in a given model and that one wants to recover the parameter values appearing in n sections of the ini file. With the present design, one needs m times n calls. With the proposed design, one would need m plus n calls. (There would be no need to repeat parameter definitions unless for some reason these change from one section of the ini file to the next, which would be strange). A sketch of such an interface appears after this list.
There is no mechanism for data validation, i.e., for specifying a condition that a given parameter must satisfy. If, for example, an integer parameter represents the number of customers waiting for service, one would like to be able to specify as part of the parameter definition that a positive value is required.
There is no complex number parameter type.
There is no list parameter type. (One would also need to be able to specify the allowed types of objects that a given list can contain; these could be viewed as sub-types).
There is no mechanism for defining help text to be displayed on user request.
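Here is a runnable sketch of the define-once/parse-per-section interface proposed above (the ParamSpec class, the parameter names, and the ini file name are all hypothetical, not part of the Standard Library):

```python
import configparser   # 'ConfigParser' in Python 2

class ParamSpec:
    """Hypothetical sketch: define each parameter once, then parse per section."""

    def __init__(self):
        self._defs = {}   # name -> (type, default, check, help_text)

    def define(self, name, type_, default=None, check=None, help_text=''):
        self._defs[name] = (type_, default, check, help_text)

    def parse(self, path, section):
        cp = configparser.ConfigParser()
        cp.read(path)
        values = {}
        for name, (type_, default, check, _help) in self._defs.items():
            raw = cp.get(section, name, fallback=None)
            value = default if raw is None else type_(raw)
            if check is not None and not check(value):
                raise ValueError('%s=%r fails validation' % (name, value))
            values[name] = value
        return values

# m define() calls plus n parse() calls, rather than m*n calls:
params = ParamSpec()
params.define('num_customers', int, default=1, check=lambda v: v > 0,
              help_text='Number of customers waiting for service.')
params.define('service_rate', float, default=1.0)
baseline = params.parse('model.ini', section='baseline')   # hypothetical file
```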
NumPy supports arrays of strings, but such arrays suffer from special disabilities. Because NumPy's .min() and .max() methods work for numeric arrays, and Python's min() and max() functions work for strings, one might reasonably expect NumPy's .min() and .max() methods to work for arrays of strings, but they don't, as the following IPython session demonstrates:
In [1]: array([['dd', 'de', 'cc'], ['ae', 'be', 'hf']]).max(axis=0)
TypeError: cannot perform reduce with flexible type
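A workaround sketch: np.sort does accept string arrays, so the column-wise maximum can be obtained by sorting along the axis and taking the last row:

```python
import numpy as np

a = np.array([['dd', 'de', 'cc'], ['ae', 'be', 'hf']])
col_max = np.sort(a, axis=0)[-1]
print(col_max)          # -> ['dd' 'de' 'hf']
```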
Python provides no true block comments. A consequence is that there is no mechanism that is both fast and bullet-proof for temporarily deactivating a large block of code—one must either laboriously comment out each line, or simply delete the block. Support for block comments should be added, with the implementation allowing a block comment to contain ordinary comments and/or nested block comments.
Footnote: Python supports strings that span multiple lines; these are delimited by triple quotes. It is sometimes claimed that triple-quoted strings can be used for block commenting, but this is problematic. To see why, suppose that one wishes to comment out lines 20 through 50, and that lines 30 through 40 are a triple-quoted string. Enclosing lines 20 through 50 in triple quotes will convert lines 20 through 30 and lines 40 through 50 into triple-quoted strings, but the contents of lines 30 through 40 will no longer be recognized as a triple-quoted string. What's needed is a pair of markers that unambiguously indicate the start and end of a block comment, e.g., #* and *#. (The Haskell programming language uses {- and -}).
Python silently accepts certain constructs that are almost certainly programming errors. Example #1: d = {'a': 1, 'a': 2} is equivalent to d = {'a': 2}.
Example #2: Suppose that the variable a has some scalar value. Aside from consuming a few CPU cycles, the statement a == 4 does absolutely nothing. Clearly the person who coded this meant either to assign a or to do something with the result of the comparison. In either case, the statement as coded should be treated as invalid.
Oct. 21, 2017: I've been griping to Enthought for a couple of years about the issues with interpreter and package configuration control (see item 1 in the Archive section below), and am very pleased to be able to report that they listened. I've been playing with the recently-released Enthought Deployment Manager (EDM), and my preliminary opinion is that it provides a powerful and easy-to-use toolset for managing multiple configurations of the interpreter and packages.
EDM allows one to create multiple environments, each with its own version of the interpreter and a completely independent set of package files. If one updates package X, and that update depends on updated versions of packages Y and Z, Y and Z are automatically upgraded as well. If one downgrades, the same thing works in the opposite direction.
[Canopy 1.X used Python's built-in virtual environment facility for user
Python environments. This model worked well for pure Python packages, but tended
to be fragile for extension packages (binary libraries), and also required a
high level of coordination between the maintainers of base and child
environments. For example, NumPy with MKL was fragile, depending on the sequence
and location of update events.]
This section is a repository for limitations that have been addressed.
Currently, there is only one item here, but I'm hopeful that there will be more over time.
All non-trivial Python applications depend on packages (or modules) that are not part of the top-level script. Although Python handles package imports in a much cleaner fashion than some other languages (e.g., Matlab), there are still issues that tend to frequently bedevil Python developers:
If application X depends on one version of a package, while application Y depends on a different version, there is currently no clean way to handle this. It would be great if there were some way to specify the version of a package that is to be imported, or an acceptable range of version numbers. I'm not sure in practice how this would be implemented; it could potentially use a scheme similar to that employed by the Windows OS for the registration of DLLs.
Because one can discover at most one missing dependency per execution, tracking down dependencies by running a script can be a time-consuming proposition. I would love to have a tool that automatically analyzes a Python script or package, determines all dependencies, and reports all missing packages. As far as I know, nothing like this exists.
[When a module is requested via import, the sequence of folders specified via the PYTHONPATH environment variable is searched for the requested module. One can override this path by modifying the contents of sys.path, which is initialized based on PYTHONPATH, but this is bad practice. One can also use my CustomImporter.py module, which allows one to import a module from a specific location, but this code was not created to solve the problem of version-specific dependencies and provides at best a clumsy workaround.]
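For reference, here is a sketch of loading a module from an explicit file path using only the standard library (Python 3); this is not CustomImporter.py, and the module name and path below are hypothetical placeholders:

```python
import importlib.util

def import_from_path(name, path):
    """Load and return the module found at 'path' under the name 'name'."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

mytools = import_from_path('mytools', '/opt/project_v2/mytools.py')
```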
Last update: Oct. 21, 2017