Friday, December 16, 2022

More Open Use of Data About Schools

A vital part of our mission, and something that distinguishes us from most university departments, is our emphasis on improving access to and use of research data outside academia.  This is underpinned by the recognition that research will only make a difference if it can be converted into initiatives that change practice.  With the current trend towards decentralization in education, schools and colleges are going to have greater freedom over all aspects of their practice - what to teach, which materials to use and, most importantly, what teaching methods to adopt.

It is in this context that we believe it is essential that professionals are provided with robust, reliable and accessible research-based evidence in order to make informed decisions about their practice. This includes:

- Accessible overviews of effective interventions - What works? How does it work?
- Trustworthy assessments of the evidence - What are the proven outcomes?
- Practical information on training, costs, materials, school links and so on, so that practitioners can apply an approach with confidence.

For data to be meaningful, they need to support reliable comparisons.  The most obvious are comparisons over time and across geography, but there are others - across industries and across subgroups of the population.  My right to data becomes much less valuable if I cannot draw worthwhile conclusions from the data I receive.

Suppose we have a nationally compiled dataset, obtained from an administrative or delivery process.  One would hope that it would yield satisfactory comparisons across areas, industries, or subgroups of the population.  But will it yield proper comparisons with other years?  The answer will often be that it cannot, because the process has been changed repeatedly over time.  These changes are often perfectly justifiable, and may be essential reactions to problems highlighted by earlier tranches of data.  But they are likely to destroy comparability.

There are two well-known examples.  Unemployment data from the benefits system became notoriously difficult to compare as the system changed repeatedly.  Recorded crime data are affected by changes in guidance on how to record crime, by changes in how well police forces follow that guidance and, of course, by changes in the likelihood that a crime is reported at all (which may itself reflect insurance-related considerations).

The general (though not universal) view is that these shortcomings render the administrative data unsuitable for analyzing trends over time, and that the survey-based statistical sources are far better for this purpose.

However, this gives rise to another problem.  There is, unsurprisingly, a lot of interest in seeing data for very small areas.  The statistical surveys cannot meet this requirement, as the sample sizes required would be unpopular with respondents and prohibitively expensive.  So the administrative sources have to meet this need, but comparisons of small-area data between different periods are unlikely to be reliable if the systems have changed.

There is, of course, a wider issue with data based on very small numbers.  Counts of cases can fluctuate considerably from year to year purely as a result of statistical chance, but users may read these movements as meaningful and as requiring an explanation from the service deliverer.
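
As a rough illustration of the point (not from the original article), the toy simulation below draws annual counts for a few hypothetical small areas whose true underlying rate never changes; the assumed rate of four cases a year is an invented figure:

```python
import numpy as np

# Toy simulation: five small areas, each with the same unchanged underlying
# rate of four cases a year. All movement in the printed series is pure
# statistical noise, not a change in the service being delivered.
rng = np.random.default_rng(seed=1)
TRUE_ANNUAL_RATE = 4  # assumed expected count for a small area

counts = rng.poisson(TRUE_ANNUAL_RATE, size=(5, 5))  # 5 areas x 5 years
for area, series in enumerate(counts, start=1):
    swings = np.diff(series)
    print(f"area {area}: yearly counts {series.tolist()}, "
          f"year-on-year changes {swings.tolist()}")
```

At counts this small, swings of 50% or more between consecutive years are routine even though nothing real has changed.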

Issues such as these can be flagged in metadata accompanying the original release, but experience suggests that these metadata tend to disappear from subsequent applications of the data.  This is not an argument against release, but there are issues here which will take some getting used to, and which in the worst case will generate large numbers of requests to explain what does not really require explanation.

Perhaps comparability will be easier to achieve if data experts such as statisticians are involved from the outset in determining what changes should be made, and how they should be implemented.  There could be other advantages in involving analysts.  I have seen too many examples of analysts being brought in - too late - to analyze datasets obtained without any thought that they might later be used for analysis.  An apparently trivial example would be failure to record a postcode, which probably has no implications for the process itself, but can cause problems subsequently if geographical analysis is required, or if there is a need to link the data to some other source.
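
A minimal sketch of the postcode example, using invented records: cases captured without a postcode are perfectly usable for the delivery process itself, but silently drop out of any later geographical analysis or linkage.

```python
import pandas as pd

# Invented records for illustration. The delivery process never needed the
# postcode, so some rows lack one - and those rows cannot be mapped to an
# area or linked to any other postcode-keyed source afterwards.
cases = pd.DataFrame({
    "case_id": [1, 2, 3, 4],
    "postcode": ["AB1 2CD", None, "EF3 4GH", None],  # hypothetical values
})
area_lookup = pd.DataFrame({
    "postcode": ["AB1 2CD", "EF3 4GH"],
    "local_authority": ["Northtown", "Southvale"],  # made-up areas
})

linked = cases.merge(area_lookup, on="postcode", how="left")
print(linked)
print(f"{cases['postcode'].isna().mean():.0%} of cases cannot be mapped or linked")
```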

As a final point on nationally generated datasets, there will inevitably be suspicion that changes of the kind described above are politically driven, intended to massage the figures.  There is a role for independent oversight of such changes, in order to reassure the public (and government itself) that they are indeed soundly based.

If we move from national to locally generated datasets, the position could be worse still.  The localism agenda appears to imply that each local authority should decide (informed by residents' views) what data it should collect, compile and release.  So my right to data may give me information, but will it give me comparable data?  There is a tension here between the understandable desire to get away from the excesses of the national indicators for local authorities and the need (in my view) for meaningful comparisons between areas.

If data are only available in more sophisticated formats, they may become inaccessible to much of the population.  I would argue that the simpler forms of presentation should continue to be available where practicable.  I would also like some attention to be given to removing barriers to the use of these simpler forms.

I can see a justification for requiring registration from anyone intending to access a data source programmatically, not least because of the traffic this can generate, but registration should not be necessary simply to download the files.  One of the principles of open data should be that what end-users do with their data is none of the provider's business, unless there are implications for the performance and stability of the data provider's website.

The words "factual", "data" and "structured" are used in the definition of "dataset" but are not themselves defined.  Does "structured" mean that each of the data items is in a standardized format, and "unstructured" that they could be free text?  Or does "structured" mean that the data are stored in, say, a spreadsheet or database?  Are user satisfaction data regarded as "factual"?  And is there a distinction to be drawn between raw data on individuals' interactions with a provider and aggregated data?  The examples given seem more likely to be the latter.

The word "information" is used in the definition of "dataset" but with a quite different meaning from that given in the definition of "information" itself.  The model I would suggest is that "data" on their own are the raw material which analysis turns into "information".  In those terms, what is defined here as "information" might better be labelled "analysis".

The obvious test is one of harm to the national interest or to individuals or businesses.  Although a value-for-money test would seem sensible, in some cases it is only after a dataset has been in the wild for a number of years that its value becomes apparent, so trying to value a dataset in advance is likely to result in too restrictive an approach.  I would, however, accept that there are some datasets where it is next to impossible to imagine a worthwhile use.  On the other hand, it would be helpful if some of those proposing access to datasets could manage more than a one-line explanation of the benefits.

There is also an issue as to who should make these decisions.  Should they be left to individual organizations?  Should there be a group within government driving forward a common approach?  And should there be ministerial input, or should any responsible body have a degree of detachment from ministers?

Leaving aside the difficulties of making this assessment, and of doing so consistently, there is a practical difficulty here.  What happens if second and third requestors apply after the first requestor has paid up?  Do they have to pay the same amount, or do they get the data for free because the costs have already been recovered?

The main costs are likely to be those of turning data that are fit for internal purposes (but were never intended to go further) into something which can be made available widely, and those of ensuring that nothing is revealed, directly or by implication, about individual persons or businesses.  Arguably, any government that is serious about this agenda should be prepared to bear the first of these costs as a one-off.  The second is trickier, as it will recur every time a dataset is updated.

Be realistic about the resource needs.  Governments of all hues tend to think that their own projects can be done at zero cost, while their opponents' projects were an exorbitant waste of resources.  The other encouragement would be to move some of the focus of the site back towards data analysis in its own right, rather than simply as a means of enabling "armchair auditors" to highlight the more absurd spending items.  While these data should certainly be available openly, the overemphasis on this aspect does sometimes give the impression that the very people expected to deliver on this agenda are finding themselves under siege as a result.

Presumably the word "publication" is meant to imply making available the raw datasets as well as what might be termed the results.  Simply making available the tables and analyses that government has chosen to produce would greatly reduce the scope for innovation by others.

I would be very reluctant to see cost used as a reason for refusing release.  I would draw a distinction between the one-off costs of changing IT systems to generate suitable formats and the ongoing costs of ensuring the confidentiality of personal data.  The former should be borne by any government that seriously believes in this agenda.  The latter are genuinely more difficult, as it is hard to reduce confidentiality checking to a series of mechanical steps; some human input will probably be required on an ongoing basis.

Some form of independent body is essential - not only to ensure that data holders are not seeking to circumvent the open data regime, but also to minimize suspicion that the government of the day is holding the door open for data that suit its purposes while obstructing data that do not.

Some features of such a body could be drawn from the governance of official statistics, and it is encouraging to see parallels being drawn between open data and government statistics.  It is important that such a body should not be dominated by data producers and/or the analytical community.  Open data do pose challenges in terms of personal privacy, and it is important that the public do not feel their interests are being disregarded by one or other of the special interest groups.  At a severely practical level, if people have doubts about how their data are being handled, they are less likely to participate in surveys or to be entirely frank when interacting with service delivery.

However, I would point out that the regime in place for statistical data is rigorous - probably more rigorous than most people outside the statistical arena would imagine.  Data for a single individual or business must not be released, but neither should data for multiple individuals or businesses which could reveal something to one of those "units" about another.  I also have doubts that anonymization is sufficient, particularly if carried out by those without experience in this tricky field.
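
A minimal sketch of the "units" problem, with invented figures: a published total covering exactly two businesses lets either firm recover the other's value by subtracting its own, which is why disclosure-control regimes suppress such cells. The three-contributor threshold below is an assumed rule for illustration; real regimes also apply dominance rules.

```python
# Invented figures. A total covering two businesses lets either firm
# recover the other's value by subtracting its own contribution, so cells
# with too few contributors are suppressed before release.
cells = {
    "sector A": [1200],               # one unit: obviously disclosive
    "sector B": [900, 400],           # two units: each can deduce the other
    "sector C": [300, 280, 150, 90],  # enough contributors to publish
}

MIN_UNITS = 3  # assumed threshold rule, purely illustrative

for cell, values in cells.items():
    if len(values) < MIN_UNITS:
        print(f"{cell}: suppressed ({len(values)} contributing unit(s))")
    else:
        print(f"{cell}: publish total {sum(values)} from {len(values)} units")
```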

The implications could be serious for any body which does not currently collate the potentially in-scope information into a consistent format, and for those faced with the ongoing burden of checking for confidentiality.  I assume that this is a cost-benefit question.  However, the benefit is not easy to assess in advance of making a dataset available; the private sector might put it to quite unpredictable uses with huge benefits to the economy.

There should be a role for the government (or an independent body) in establishing consistent standards and approaches to collecting a much wider range of data.  What use are data for my borough to me if they are not consistent with the data for its neighbors, or even with its own data for previous years?  Delegating decision-making is superficially attractive, but it cuts across the need to obtain maximum benefit from the data that are collected.

Hopefully, fear of exposure would act as a strong disincentive to obstruction by staff.  However, failure in these areas often has more complex causes.  The most obvious is a failure to accept the resource implications.  That could be a collective failing amongst a body's staff, but it could also stem from reluctance at ministerial level to make the funding follow the priorities.

Data should be consistent over time.  They should also be consistent over geography.  The latter is the simpler to discuss.  It should be relatively easy to ensure that data from a central government system are on the same basis for each local authority area, for example.  It is harder to ensure that data from each local authority's own systems are comparable.  Have the same definitions and criteria been used?  Have the data items been collected in the same way (even asking the same questions in a different order can produce different responses)?
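
As a minimal sketch of what such consistency checking might involve (the field and measure names below are invented for illustration), each authority's extract could be validated against a shared specification before any cross-area comparison is attempted:

```python
# Hypothetical shared specification: before local extracts are pooled,
# check that each one uses the agreed fields and measure names, so that
# "the same" data item really is the same in every authority's return.
SPEC_COLUMNS = {"area_code", "year", "measure", "value"}
SPEC_MEASURES = {"absence_rate", "exclusions"}  # invented agreed measures

def check_extract(name, rows):
    problems = []
    for row in rows:
        if set(row) != SPEC_COLUMNS:
            problems.append(f"unexpected fields {sorted(row)}")
        elif row["measure"] not in SPEC_MEASURES:
            problems.append(f"unrecognized measure {row['measure']!r}")
    print(f"{name}: {'; '.join(problems) if problems else 'consistent with spec'}")

check_extract("authority 1", [
    {"area_code": "E09000001", "year": 2022, "measure": "absence_rate", "value": 5.1},
])
check_extract("authority 2", [
    {"area_code": "E09000002", "year": 2022, "measure": "truancy", "value": 3.9},
])
```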

It would be wrong for a data provider to actively prioritize data for inclusion in an inventory, as this assumes the provider is well placed to assess the value of the data.  Far better to maximize the content of the inventory and then let the market decide what may or may not be of value.  I would accept that there are some datasets which have little conceivable value as open data, so there would be exceptions to this principle.

Clearly it may be necessary, even desirable, to stop collection where there is little prospect of the data ever becoming relevant to policy or to the wider public, or where financial circumstances dictate.  It is, however, important that any government considers fully and fairly the likely needs of a successor government, and does not simply drop sources because of ideological doubts about their basis.  The worst possible outcome would be a see-saw effect in which what one government does is continually reversed by its successor.  Much of the value of data, particularly for investigating economic and statistical relationships, lies in the availability of a long time series.

It should not matter to the user where data are held, as long as there are adequate routes to the data from anywhere the prospective user might expect to look.

I would give priority to detail, as it provides the building blocks from which users can construct the analyses they wish to see.  I would apply this principle to data organized on both geographical and industry bases.

We can see a number of promising developments in this area at present, including greater support from central government in creating the infrastructure needed to enable better access to and use of research information.  Although far from exhaustive, the following examples give an indication of where progress is being made:

- Funds are allocated to rigorous and independent evaluations of the funded projects, including the use of robust experimental trials. This information will then be made widely available to all schools across the sector.

- Reliable, unbiased evidence is provided on the effectiveness of educational programs, giving educators, policy makers and researchers fair and useful information about the strength of evidence supporting the variety of programs available for both primary and secondary pupils. Users can access full reports that analyze all existing research in a particular area, for example primary mathematics, or can refer to simple summaries.

- The interim findings of the review also recommend the formation of an independent program which would expand and improve practice by capturing and spreading data on effective child health and development programs.

The long-term aim is to present data in a form similar to what a consumer might consult when buying, say, a new television, washing machine or car - a magazine-style report with the level of detail necessary to make an informed decision about implementing a new approach.  This will include practitioner feedback in addition to research-based data.

- A very simple way in which we aim to open up access to research information is through a new magazine for practitioners. Aimed at educational leaders and policy makers, the magazine offers easy access to the latest developments in education research.

- The Media Center - a new initiative that aims to make education research more accessible to the media and policy makers and so improve policy development, practice and public understanding of education.

Jeff C. Palmer is a teacher, success coach, trainer, Certified Master of Web Copywriting and founder of https://Ebookschoice.com. Jeff is a prolific writer, Senior Research Associate and Infopreneur having written many eBooks, articles and special reports.

Source: https://ebookschoice.com/more-open-use-of-data-about-schools/