Please give feedback to or at the Software Sessions of the XII Turku PET Symposium
Discussions

Feel free to submit your responses to the Contouring Challenge results by e-mail to . We will do what we can to maintain a list of discussion threads on this page. Responses to specific opinions and claims on this page should be identified as such, using the tags 'THREAD #.#' in the section headers.

THREAD 1.1
After discussion with V. Gregoire, we were a bit puzzled by the conclusions, or at least by their wording. Do we really have to conclude that "manual" methods are the best? This could be misunderstood by the audience. Manual delineation cannot be accurate without complex guidelines. Automatic or semi-automatic delineation was precisely an attempt to implicitly (but perhaps also simplistically) hard-code those guidelines in software. Hence, we fear that the take-home message "manual is best" could be understood without the nuance that comes with it and could give the impression that PET delineation is not really a complex problem after all. A second fear, most probably with weak foundations, is that within the framework of a contest, manual delineation has an advantage because the observer can take all the necessary time to optimize and correct the contours, whereas automatic methods do not have this possibility.
- J. Lee, Center for Molecular Imaging and Experimental Radiotherapy, Universite Catholique de Louvain, Brussels, Belgium.
--------END OF THREAD 1.1----------

THREAD 1.2
Thanks for your response; this is the kind of discussion that the challenge hoped to provoke. The question of whether manual can be best is rather philosophical. One could argue that, with only images to work with (no invasive procedures), the human expert, with their experience of imaging a particular tumour type and access to all extra information such as treatment response and tumour stage (and perhaps the 'complex guidelines' you refer to), should give the most accurate delineations.
If we take this view, then the role of the software is to give exactly the same contours as the human. Equivalently, the software should provide an estimate of the 'best' contours for the human to edit, and the best algorithm is the one that needs the least amount of post-editing (if the software produces exactly the same contours that the human would have produced, then no editing would be necessary). Personally, I don't think that a physician or radiologist should be comfortable with an automatic segmentation that he or she is not able to edit, and I think it would be wrong to discourage them from editing auto-contours. The final decision in a treatment plan should come from the human expert. However, there is a paradox: the reasoning above implies that all humans give the same 'best' delineation, which is of course not true. The probabilistic accuracy metric encodes manual delineations from multiple experts, and would give a high score to contours that matched ANY of the expert manual delineations, without favouring one expert over another. Note that the phantom case is different, and in that case (semi-)automatic algorithms outperformed manual delineation. I agree that software should encode guidelines about PET segmentation, to reduce the work of the human. Similarly, I think that software should encode expert knowledge about the specific image being contoured, directly obtained from the expert. The discussion at the symposium concluded that 'machine learning' is sadly lacking in state-of-the-art (semi-)auto-contouring, where techniques like supervised classification and 'on-line learning' (where a supervised segmentation algorithm and its internal parameters adapt in response to the behaviour of the expert during an interactive segmentation) should be pursued.
- T. Shepherd, Turku PET Centre, Turku, Finland.
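The multi-expert scoring idea described above can be sketched in a few lines. This is a hypothetical illustration only, not the challenge's actual metric: the function names, the voxel-wise averaging of expert masks, and the Dice-style normalization are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of a probabilistic accuracy score: a candidate
# contour is rewarded for agreeing with ANY expert, by comparing it to
# a voxel-wise probability map built from several expert masks.
# (Illustrative only -- not the metric used in the challenge.)
import numpy as np

def probability_map(expert_masks):
    """Fraction of experts labelling each voxel as tumour."""
    return np.mean(np.stack(expert_masks).astype(float), axis=0)

def probabilistic_dice(candidate, expert_masks):
    """Dice-like overlap between a binary candidate mask and the map."""
    p = probability_map(expert_masks)
    c = candidate.astype(float)
    denom = np.sum(c) + np.sum(p)
    return 2.0 * np.sum(c * p) / denom if denom else 1.0
```

Under this kind of score, a contour that reproduces one expert exactly is not penalized to zero by a disagreeing expert; it simply scores proportionally to its overlap with the pooled opinion.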
--------END OF THREAD 1.2----------

THREAD 1.3
The debate around manual or automatic segmentation should clearly identify the various sources of information and the various sources of error. It is obvious that experts are needed. Experts are, IMHO, very good at determining whether a volume or a piece of a volume should be considered as tumor or not. However, I am also convinced that the same experts are not at all prepared to produce an accurate delineation of a complex object in a blurred image. The object edges are often distorted by resolution blur (or PVE) in unexpected ways. My personal view is that the expert should be involved at the end of the delineation process only, in order to select which (pieces of) volumes should be regarded as tumor, whereas the computer should produce these (pieces of) volumes. The issue with resolution blur is precisely that gradients are smoothed and edges are not sharp enough, which introduces uncertainty. I would never rely on somebody drawing the contours manually: I think their expertise will allow them to exclude/include the wrong/right pieces of the tumor, but at the same time these pieces will not be accurately delineated.
- J. Lee, Center for Molecular Imaging and Experimental Radiotherapy, Universite Catholique de Louvain, Brussels, Belgium.
--------END OF THREAD 1.3----------

THREAD 1.4
I agree with John that manual delineation is eventually not the way to go. Although manual delineation has now turned out to be best, there are a few things that we need to consider. We don't yet know the (long-term) reproducibility of the (manual) delineation; precision is as important as accuracy, and automated methods are likely much more reproducible. Another issue is observer variability, which is almost impossible to reduce to within acceptable levels. These and other points can be taken up in the discussion of a paper. There are many ways to perform manual delineation:
We can make almost endless combinations (YAPETSM :-). I think that, in general, the findings of the challenge support the use of gradients (considering that we (team 12) used a gradient image along with the normal image to manually define the 'tumor' contours).
- R. Boellaard, Department of Nuclear Medicine & PET Research, VU University Medical Center (VUMC), Netherlands.
--------END OF THREAD 1.4----------

THREAD 1.5
Is PET delineation really a complex problem? As pointed out by J. Lee (thread 1.1), the message "manual is best" could give the impression that PET delineation is not really a complex problem after all. However, it can also be argued that "not difficult" problems can easily be solved using an automatic method without any user interaction, and that manual interaction is mandatory for solving difficult problems on which the computer fails. Computer methods based on gradient information are now able to generate very good delineations when contours are sharp and well defined from a perceptual perspective. This seems to be the case for at least 85% of slices of the 4 volumes provided for the contest. For these contours, manual editing is likely not necessary. However, for lesions with an incomplete or ill-defined contour (at most 15% of slices), the delineation problem is more difficult and experts are much better than computers at interpolating and optimizing the correct contour. In conclusion, delineation of sharp and well-defined contours can easily be performed using an automatic method, while lesions with incomplete or ill-defined contours will need additional expert interaction for contour correction and optimization.

Time required for user interaction. As shown by the delineation results submitted to the contouring challenge, very good delineation scores have been achieved by experts after time-consuming interactive editing of contours.
For example, the best semi-automatic delineation method, which is based on manual editing with gradient support, requires from 15 minutes to one hour per volume (oral communication from the author). In comparison, some semi-automatic methods require less than 5 minutes of user interaction per PET volume. The time required for user interaction should be minimized without compromising delineation accuracy. To achieve this objective, fast semi-automatic methods could be used jointly with advanced interactive contour optimization in the frame of future experimental work.

Aim of computer-aided delineation methods. It is generally considered that the aim of computer-assisted delineation methods is to help the expert reduce the time necessary for correcting and optimizing contours. It would reduce the work of the human. But it would not discourage physicians from editing auto-contours: the expert would be involved at the end of the delineation process only (as wished by J. Lee in thread 1.3), and the final decision in a treatment plan would come from the human expert (as specified by T. Shepherd in thread 1.2). The objective of the comparison between time-consuming manual delineation methods and faster semi-automatic delineation methods should be to test whether computer methods are precise enough to efficiently support experts. These comparisons should answer the following question: is it possible to reduce the expert editing time from more than 15 minutes to less than 5 or 10 minutes by interactively improving the computer delineation after automatic delineation, without decreasing delineation accuracy? This question should be addressed in the discussion section of the planned joint publication.

Subjectivity & reproducibility of human delineation. Clinicians with various levels of experience may give different opinions on the same sample, while the same expert's opinion may vary depending on factors like fatigue.
This brings into play two factors: subjectivity and reproducibility. What seems like a "good" delineation to one person may seem like a "not so good" delineation to another. Thus, the results of human interpretation are subjective in nature, and the issue of reproducibility comes into play as an outcome of human interpretation. The goal should therefore be to provide a less subjective (if possible, objective) and reproducible result. Hence, development should go in the direction of automated delineation methods with the possibility to interactively update and optimize the contours at the end of the process.

Suggestions for the preparation of the joint paper:
Suggestions for future work:
- M. Bruynooghe, SenoCAD Research GmbH, Germany.
--------END OF THREAD 1.5----------
----------------------END OF THREAD 1----------------------

----------------------THREAD 2----------------------

THREAD 2.1
The discussion so far has been interesting regarding whether manual contouring is better than (semi-)automatic. Looking at the results, I see a number of other interesting areas that also merit discussion:

1.) Influence of initial contours: The "high"-interactivity methods, those with post-editing, could in the extreme be considered manual, since none of the original region may remain. It is interesting that these methods show such a discrepancy with the manual results. In this area the results of K & L are interesting. Here the manual method outperforms the edited RG in both the patient and phantom cases, yet presumably the same person is doing the post-editing. Are the experts influenced, when given an initial contour, into doing something different?

2.) Agreement between experts at the same institution or between institutions: For the methods we applied (S & T), it is interesting that the manual editing improved the results for the phantom, yet made things much worse in the patient case. I wonder if this is a result of disagreement between experts about what should be contoured. Would the experts we consulted disagree with those at Turku? I wonder what we might find if the same analysis were performed on a leave-one-out basis for all manual segmentations (Turku experts + G, K & X). (This was suggested in thread 1.5, but for the 3 Turku experts only.)

3.) Time taken and required accuracy: It has already been raised that the time taken hasn't been considered (thread 1.5, and implied in 1.2). Related to this, however, is the question of the accuracy required. At the end of the day, what impact does a small variation in the contour (say, an ACC score dropping from 0.8 to 0.6) have on the RT plan? Or on the actual dose the patient receives?
Perhaps this should have been the metric by which contours are assessed: similarity to the expert dose plan? If a semi-automated/post-edited result can reduce the contouring time from one hour to 5 minutes while maintaining the same effective treatment plan, then it is of value, regardless of the absolute precision of the contours themselves. I think this last point is the most significant unanswered question and would be well worth another paper in itself!
- M. Gooding, Mirada Medical, Oxford, UK.
--------END OF THREAD 2.1----------
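The leave-one-out analysis of the manual segmentations proposed in thread 2.1 (and, for the Turku experts, in thread 1.5) could be sketched as follows. This is a hypothetical outline under assumed choices: each expert is scored against a majority vote of the remaining experts, using the Dice coefficient; the function names and the consensus rule are illustrative, not taken from the challenge.

```python
# Hypothetical leave-one-out agreement analysis: score each expert's
# binary mask against the majority-vote consensus of the other experts.
# (Illustrative sketch -- consensus rule and metric are assumptions.)
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def leave_one_out_scores(expert_masks):
    """For each expert, Dice against the majority vote of the others."""
    scores = []
    for i, mask in enumerate(expert_masks):
        others = [m for j, m in enumerate(expert_masks) if j != i]
        consensus = np.mean(np.stack(others), axis=0) >= 0.5
        scores.append(dice(mask, consensus))
    return scores
```

An expert whose delineation sits far from the pooled opinion of the others would receive a low score, which would quantify the inter-expert disagreement raised in point 2 above.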