
T150106: Type collisions in log events causing indexing failures in ELK Elasticsearch
Closed, Resolved (Public)

Description

[2016-11-05 19:52:57,482][DEBUG][action.bulk              ] [logstash1004] [logstash-2016.11.05][0] failed to execute bulk item (index) index {[logstash-2016.11.05][mediawiki][AVg2DRcqpYcdBKMsbKZQ], source[{"message":"FormatJson::encode($events) failed: 5. Aborting send.","@version":1,"@timestamp":"2016-11-05T19:52:57.000Z","type":"mediawiki","host":"mw1167","level":"ERROR","tags":["syslog","es","es"],"channel":"EventBus","normalized_message":"FormatJson::encode($events) failed: 5. Aborting send.","url":"/rpc/RunJobs.php?wiki=commonswiki&type=refreshLinks&maxtime=30&maxmem=300M","ip":"127.0.0.1","http_method":"POST","server":"127.0.0.1","referrer":null,"wiki":"commonswiki","mwversion":"1.29.0-wmf.1","reqId":"7a74bdd65bc04eb4f1c99f62","backtrace":[{"file":"/srv/mediawiki/php-1.29.0-wmf.1/extensions/EventBus/EventBus.hooks.php","line":684,"function":"send","class":"EventBus","object":"[object] (EventBus: {\"logger\":{}})","type":"->","args":[[{"meta":{"uri":"https://commons.wikimedia.org/wiki/File%3AVoima_icebreaker.JPG","topic":"mediawiki.page-properties-change","request_id":"a3f7f3dc-1e6b-4f51-ab9a-599377ebc217","id":"72bab1de-a391-11e6-9d3b-90b11c27839b","dt":"2016-11-05T19:52:57+00:00","domain":"commons.wikimedia.org"},"database":"commonswiki","page_id":22417040,"page_title":"File:Voima_icebreaker.JPG","page_namespace":6,"page_is_redirect":false,"rev_id":185662873,"added_properties":{"kartographer":"\u001F‹\b\u0000\u0000\u0000\u0000\u0000\u0000\u0003EË\u000E‚0\u0010EÿeÖH\u0006)}ð\u0001®Ý\u001BbJ\u0019\f\u0011(\u0019ê\u0002IÿݲP·çÜÇ\u000E÷ŸSŠBaEo\u001CiAœ¢\u0002\u0005\u0019Ý*B-[£T¢¥ƒú¶CØ\u0016‚\u001A.dË\t2xŸ(ð\u0006õO^ý0‡€œ÷Ü\r³\rŽŠîYäF\u0015J—™ÄŒZ¢jb\u0006\u000Bû…8\fGf‡Éò“øŽnSëÇ4åìDlÓÖW\fïãaŽü ?u~ôœ0b…\u001DBŒMü\u0000Î>J¯Ø\u0000\u0000\u0000"}}]]},{"file":"/srv/mediawiki/php-1.29.0-wmf.1/includes/deferred/MWCallableUpdate.php","line":28,"function":"Closure$EventBusHooks::onLinksUpdateComplete","args":[]},{"file":"/srv/mediawiki/php-1.29.0-wmf.1/includes/deferred/DeferredUpdates.php","line":263,"function":"doUpdate","class":"MWCallableUpdate","object":"[object] (MWCallableUpdate: {})","type":"->","args":[]},{"file":"/srv/mediawiki/php-1.29.0-wmf.1/includes/deferred/DeferredUpdates.php","line":225,"function":"runUpdate","class":"DeferredUpdates","type":"::","args":["[object] (MWCallableUpdate: {})","[object] (LBFactoryMulti: {})",2]},{"file":"/srv/mediawiki/php-1.29.0-wmf.1/includes/deferred/DeferredUpdates.php","line":133,"function":"execute","class":"DeferredUpdates","type":"::","args":[[],"run",2]},{"file":"/srv/mediawiki/php-1.29.0-wmf.1/includes/jobqueue/JobRunner.php","line":279,"function":"doUpdates","class":"DeferredUpdates","type":"::","args":[]},{"file":"/srv/mediawiki/php-1.29.0-wmf.1/includes/jobqueue/JobRunner.php","line":185,"function":"executeJob","class":"JobRunner","object":"[object] (JobRunner: {})","type":"->","args":["[object] (RefreshLinksJob: {\"command\":\"refreshLinks\",\"params\":{\"pages\":{\"22417040\":[6,\"Voima_icebreaker.JPG\"]},\"rootJobSignature\":\"7f9ae1b7eba847fecb76332538ab98647d1ea3e2\",\"rootJobTimestamp\":\"20161030200205\",\"triggeredRecursive\":true,\"requestId\":\"7a74bdd65bc04eb4f1c99f62\"},\"metadata\":{\"uuid\":\"43110748a19a40d89c6889efc065f13d\",\"timestamp\":1478375570,\"QueuePartition\":\"rdb1-6381\"}})","[object] (LBFactoryMulti: {})","[object] (BufferingStatsdDataFactory: {\"prefix\":\"MediaWiki\"})",1478375576]},{"file":"/srv/mediawiki/rpc/RunJobs.php","line":47,"function":"run","class":"JobRunner","object":"[object] (JobRunner: 
{})","type":"->","args":[{"type":"refreshLinks","maxJobs":5000,"maxTime":"30"}]}],"events":[{"meta":{"uri":"https://commons.wikimedia.org/wiki/File%3AVoima_icebreaker.JPG","topic":"mediawiki.page-properties-change","request_id":"a3f7f3dc-1e6b-4f51-ab9a-599377ebc217","id":"72bab1de-a391-11e6-9d3b-90b11c27839b","dt":"2016-11-05T19:52:57+00:00","domain":"commons.wikimedia.org"},"database":"commonswiki","page_id":22417040,"page_title":"File:Voima_icebreaker.JPG","page_namespace":6,"page_is_redirect":false,"rev_id":185662873,"added_properties":{"kartographer":"\u001F‹\b\u0000\u0000\u0000\u0000\u0000\u0000\u0003EË\u000E‚0\u0010EÿeÖH\u0006)}ð\u0001®Ý\u001BbJ\u0019\f\u0011(\u0019ê\u0002IÿݲP·çÜÇ\u000E÷ŸSŠBaEo\u001CiAœ¢\u0002\u0005\u0019Ý*B-[£T¢¥ƒú¶CØ\u0016‚\u001A.dË\t2xŸ(ð\u0006õO^ý0‡€œ÷Ü\r³\rŽŠîYäF\u0015J—™ÄŒZ¢jb\u0006\u000Bû…8\fGf‡Éò“øŽnSëÇ4åìDlÓÖW\fïãaŽü ?u~ôœ0b…\u001DBŒMü\u0000Î>J¯Ø\u0000\u0000\u0000"}}],"json_last_error":5}]}
MapperParsingException[failed to parse [backtrace.args]]; nested: IllegalArgumentException[unknown property [meta]];
        at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:329)
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
        at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:326)
        at org.elasticsearch.index.mapper.DocumentParser.parseNonDynamicArray(DocumentParser.java:414)
        at org.elasticsearch.index.mapper.DocumentParser.parseArray(DocumentParser.java:379)
        at org.elasticsearch.index.mapper.DocumentParser.parseNonDynamicArray(DocumentParser.java:416)
        at org.elasticsearch.index.mapper.DocumentParser.parseArray(DocumentParser.java:379)
        at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:254)
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:306)
        at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:326)
        at org.elasticsearch.index.mapper.DocumentParser.parseNonDynamicArray(DocumentParser.java:414)
        at org.elasticsearch.index.mapper.DocumentParser.parseArray(DocumentParser.java:379)
        at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:254)
        at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
        at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
        at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
        at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
        at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:327)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:120)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: unknown property [meta]
        at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateFieldForString(StringFieldMapper.java:362)
        at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateField(StringFieldMapper.java:311)
        at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:321)
        ... 31 more

Event Timeline

bd808 triaged this task as Unbreak Now! priority. Nov 5 2016, 7:57 PM

I added Event-Platform here, but I don't think this is really their fault at all. The problem is fundamentally that the Elasticsearch cluster behind our ELK logging stack wants all log records of a given type to be normalized so that it can build strongly typed indexes. This was a feature thrust upon us by the Elasticsearch 2.x upgrade that @EBernhardson and the Discovery-ARCHIVED team helped out with.

The mapping for the mediawiki events (P4375) wants backtrace.args to be a string type but there is no guarantee that the Monolog exception formatter that we use to produce these events will produce only strings.

The Event-Platform change that is tripping over the Elasticsearch quirk of wanting normalized data in the log events is: https://gerrit.wikimedia.org/r/#/c/319646/

This change adds a "backtrace" log context variable generated with [[https://secure.php.net/debug_backtrace|debug_backtrace()]]. The "args" array that is generated will contain data of various types. In this case the Elasticsearch indexing errors were triggered because the current index first saw a log event (possibly not even from the EventBus channel) whose ["backtrace"]["args"] element was a string. Now a large number of events are arriving whose ["backtrace"]["args"] is an array, and each of them is being rejected on the Elasticsearch side.
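To make the collision concrete, here is a simplified, invented illustration of the two shapes the daily index ends up seeing for the same field (these are not real log events):

    First event of the day: "args" holds only strings, so Elasticsearch maps it as a string field:
        {"backtrace": [{"function": "send", "args": ["refreshLinks", "30"]}]}

    Later EventBus events carry objects inside "args", which no longer fit the string mapping, so the whole document is rejected:
        {"backtrace": [{"function": "send", "args": [{"meta": {"topic": "mediawiki.page-properties-change"}}]}]}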

Change 320016 had a related patch set uploaded (by BryanDavis):
logstash: Temporarily disable EventBus channel

https://gerrit.wikimedia.org/r/320016

Change 320016 merged by jenkins-bot:
logstash: Temporarily disable EventBus channel

https://gerrit.wikimedia.org/r/320016

Mentioned in SAL (#wikimedia-operations) [2016-11-05T21:36:04Z] <bd808@tin> Synchronized wmf-config/InitialiseSettings.php: logstash: Temporarily disable EventBus channel (T150106) (duration: 00m 50s)

bd808 lowered the priority of this task from Unbreak Now! to High. Nov 5 2016, 10:02 PM

Changing the logging config to stop trying to send the MediaWiki EventBus log channel to Logstash has stopped the UBN! problem of the / partitions filling up on the Logstash ingestion servers.

We still need to find an actual fix for the underlying problem. Here's the easiest formulation of the core problem that I can come up with:

  • Elasticsearch assumes that documents of a given type are homogeneous: any indexed field that appears in some subset of documents must have the same data type in all of them.
  • When presented with a document that has a data type that conflicts with the current typed mapping, Elasticsearch 1.x and 2.x have differing behavior:
    • Elasticsearch 1.x will refuse to index the conflicting field and return a success code
    • Elasticsearch 2.x will refuse to index the entire document and return an error code

For many use cases the ES 1.x behavior was scary and unexpected. It was, however, perfect for something like log event storage where there may not be any central control of field naming and typing. Consistently typed fields in the collection would be indexed for all documents; fields with conflicting types would be opportunistically indexed whenever a new document's field type happened to match the first type seen for that particular field.

We need to find a way to return to ES 1.x-like behavior under ES 2.x. This may mean that we need to create an explicit whitelist of safe fields that we will index and set the default mapping to ignore all other fields for the purpose of indexing. This would at least protect us from innocent collisions like this example that end up causing larger problems.

Ideally we would find a way to tell ES 2.x just to act like ES 1.x (ignore the field and don't whine about it) for our ELK stack use case. Some research needs to be done to see if this is possible with ES 2.x or any newer version of Elasticsearch.
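For illustration, a minimal sketch of what a whitelist-style default mapping could look like in an ES 2.x index template. This is not our actual elasticsearch-template.json and the field list is invented; the point is that with "dynamic": false unmapped fields stay in _source but are never indexed, so a conflicting backtrace.args can no longer fail the whole document (it just isn't searchable):

    {
      "template": "logstash-*",
      "mappings": {
        "_default_": {
          "dynamic": false,
          "properties": {
            "@timestamp": {"type": "date"},
            "type":       {"type": "string", "index": "not_analyzed"},
            "channel":    {"type": "string", "index": "not_analyzed"},
            "host":       {"type": "string", "index": "not_analyzed"},
            "level":      {"type": "string", "index": "not_analyzed"},
            "message":    {"type": "string"}
          }
        }
      }
    }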

bd808 renamed this task from Eventbus exception logs causing indexing failures in ELK Elasticsearch to MediaWiki logs on the EventBus channel causing indexing failures in ELK Elasticsearch. Nov 5 2016, 10:03 PM

My prior summary is almost correct. The heterogeneous typing issue in Elasticsearch 2.x is actually not just constrained to a single type (e.g. "mediawiki", "hhvm", "parsoid", etc in our environment). It is a constraint for all like-named fields across the entire index (e.g. each UTC day in our environment). "all fields with the same name in the same index must have the same mapping" -- https://www.elastic.co/blog/great-mapping-refactoring

I'm now seeing several thousand instances of this error on the Elasticsearch side and corresponding growth of the logstash error logs:

[2016-11-06 15:13:29,556][DEBUG][action.admin.indices.mapping.put] [logstash1004] failed to put mappings on indices [[logstash-2016.11.06]], type [changeprop]
java.lang.IllegalArgumentException: [event] is defined as a field in mapping [changeprop] but this name is already used for an object in other types
        at org.elasticsearch.index.mapper.MapperService.checkFieldUniqueness(MapperService.java:462)
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:335)
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:289)
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:303)
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:230)
        at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

This is really going to be a game of whack-a-mole. I'm actually pretty surprised that we have made it this far without having major issues.

Unfortunately, with dynamic mappings I can't think of a structure that will save us from having this issue again in the future.

There are some possibilities that could limit this undesirable behavior, though I don't know yet whether they are easily applicable in our context:

  • If conflicts happen mostly between fields in different types, an option could be to use a dedicated index per application.
  • In the same vein, prefixing all field names of a given type with the type name itself could help.

But it won't save us from having issues within the same application, i.e. when the app's logger is configured in such a way that the same field can be reused in different contexts: an "errcode" field is sent as an integer error code in some cases but as a string in a few others.

A flawless config would be to disable dynamic mapping and force developers to carefully support schema evolutions like we do for eventlogging/mw->hdfs. Sadly this does not seem to be doable in our context.

Maybe there are better options that would allow us to keep dynamic mapping in place and reduce the cases when this can happen:

  • use dynamic templates to have sane and same defaults for all types
  • use some naming convention to allow developers to force a particular mapping
  • disallow complex json properties: force a flat json structure everywhere
  • allow custom logstash filters to rename/munge fields when real conflicts happen, might be useful when the log event is generated by a component we do not control.

It sounds like it isn't the main issue in this task, so just in case: The kartographer binary blob in the stack trace reminded me that we did run into JSON encoding issues with binary data stashed into pageprops recently. Events with binary data are basically dropped when the JSON encoding fails. @Pchelolo knows all about that.

Edit: See T148251.

Yes, this problem appeared when we started collecting the logs of failed attempts to post messages to the EventBus HTTP proxy service. But I agree with the sentiment that this should be dealt with on the ES level, since this is not the first time ES has complained about field types, and it's probably not going to be the last time either.

> A flawless config would be to disable dynamic mapping and force developers to carefully support schema evolutions like we do for eventlogging/mw->hdfs. Sadly this does not seem to be doable in our context.

This seems to be the attitude of the upstream that has led us to this messy end. I fundamentally disagree that strongly typed schemas are desirable for the log aggregation use case. It's really only something that you could enforce in a world where you did not allow the applications to produce their own structured log content and instead relied on Logstash or some other intermediate system to parse flat log data and cherry-pick typed fields.

> Maybe there are better options that would allow us to keep dynamic mapping in place and reduce the cases when this can happen:
>   • use dynamic templates to have sane and same defaults for all types
>   • use some naming convention to allow developers to force a particular mapping
>   • disallow complex json properties: force a flat json structure everywhere
>   • allow custom logstash filters to rename/munge fields when real conflicts happen, might be useful when the log event is generated by a component we do not control.

We do currently use dynamic templates, and the template tries very hard to make collisions difficult by coercing numeric and boolean values into strings for indexing. The problem we hit is when an array or object is sent with the same name as something that was previously typed as a string in the same index (i.e. UTC day).
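For reference, a rough sketch of what that kind of coercion looks like as ES 2.x dynamic templates inside the _default_ mapping (illustrative only, not a copy of the real elasticsearch-template.json):

    "dynamic_templates": [
      {"longs_as_strings":    {"match_mapping_type": "long",    "mapping": {"type": "string", "index": "not_analyzed"}}},
      {"doubles_as_strings":  {"match_mapping_type": "double",  "mapping": {"type": "string", "index": "not_analyzed"}}},
      {"booleans_as_strings": {"match_mapping_type": "boolean", "mapping": {"type": "string", "index": "not_analyzed"}}}
    ]

Coercion like this only helps for scalar values; there is no equivalent trick once a field name that is already mapped as a string shows up carrying an object or an array of objects.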

Here's a related and even worse failure happening today:

[2016-11-07 16:31:11,915][DEBUG][action.bulk              ] [logstash1004] [logstash-2016.11.07][0] failed to execute bulk item (index) index {[logstash-2016.11.07][mediawiki][AVg_oRf2pYcdBKMs7kvR], source[{"message":"[0b2230e6] PHP Fatal Error: request has exceeded memory limit","@version":1,"@timestamp":"2016-11-07T16:31:11.000Z","type":"mediawiki","host":"mw1164","level":"ERROR","tags":["syslog","es","es"],"channel":"fatal","normalized_message":"[{exception_id}] PHP Fatal Error: request has exceeded memory limit","url":"/rpc/RunJobs.php?wiki=commonswiki&type=categoryMembershipChange&maxtime=30&maxmem=300M","ip":"127.0.0.1","http_method":"POST","server":"127.0.0.1","referrer":null,"wiki":"commonswiki","mwversion":"1.29.0-wmf.1","reqId":"WCCsHwpAAEsAAarKjI0AAAAW","exception":{"class":"ErrorException","message":"PHP Fatal Error: request has exceeded memory limit","code":16777217,"file":"/srv/mediawiki/php-1.29.0-wmf.1/includes/parser/LinkHolderArray.php","line":111,"trace":[{"function":"handleFatalError","class":"MWExceptionHandler","type":"::","args":[]}]},"exception_id":"0b2230e6"}]}
MapperParsingException[failed to parse [exception.trace]]; nested: IllegalArgumentException[unknown property [function]];

Some event source other than MediaWiki was the first of the day to add an event with an ["exception"]["trace"] value, and that value was a string rather than an array of objects. This means that all MediaWiki log events with an exception attached will be rejected from the index until 2016-11-08T00:00Z. At that point we will get to race again to see who gets their type into the index first.

Change 320231 had a related patch set uploaded (by BryanDavis):
MWExceptionHandler: Do not use 'exception' for custom log data

https://gerrit.wikimedia.org/r/320231

Change 320232 had a related patch set uploaded (by BryanDavis):
MWExceptionHandler: Do not use 'exception' for custom log data

https://gerrit.wikimedia.org/r/320232

Change 320231 merged by jenkins-bot:
MWExceptionHandler: Do not use 'exception' for custom log data

https://gerrit.wikimedia.org/r/320231

Change 320232 merged by jenkins-bot:
MWExceptionHandler: Do not use 'exception' for custom log data

https://gerrit.wikimedia.org/r/320232

Mentioned in SAL (#wikimedia-operations) [2016-11-07T17:56:26Z] <bd808@tin> Synchronized php-1.29.0-wmf.1/includes/exception/MWExceptionHandler.php: MWExceptionHandler: Do not use 'exception' for custom log data (T150106) (duration: 00m 47s)

bd808 renamed this task from MediaWiki logs on the EventBus channel causing indexing failures in ELK Elasticsearch to Type collisions in log events causing indexing failures in ELK Elasticsearch. Nov 7 2016, 5:57 PM
bd808 removed a project: Patch-For-Review.

The current main collision is: "java.lang.IllegalArgumentException: [event] is defined as a field in mapping [changeprop] but this name is already used for an object in other types".

This looks to be a collision between eventlogging sending ["event"] as an object and changeprop sending ["event"] as a string (a JSON object serialized as a string actually).
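Roughly, with invented payloads, the two documents look like this; once eventlogging has mapped "event" as an object in the daily index, the changeprop mapping that needs it to be a string field can no longer be created:

    eventlogging document, where "event" is an object:
        {"type": "eventlogging", "event": {"schema": "SomeSchema", "revision": 1}}
    changeprop document, where "event" is a string containing serialized JSON:
        {"type": "changeprop", "event": "{\"meta\":{\"topic\":\"some.topic\"}}"}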

Change 320441 had a related patch set uploaded (by BryanDavis):
logstash: dynamically rename object values

https://gerrit.wikimedia.org/r/320441

My patch at https://gerrit.wikimedia.org/r/320441 does not work as hoped, and here's why: the dynamic mapping rules are only used the first time a new field name is encountered. When a mapping for x is not found in the current index, the dynamic mapping rules are used to create a concrete mapping. Once that has been done, the dynamic rules are no longer consulted for fields named x. Effectively this means that, for the renaming to work, we would have to remap all dynamic field names with some kind of hungarian notation that distinguishes string/numeric/boolean values (which are all currently remapped to strings for indexing) from objects.

This should be possible using dynamic templates and isn't quite as restrictive as my prior proposal of only indexing whitelisted fields. I think a possible fix is to:

  • Add common properties that are heavily used in existing dashboards to the static properties in elasticsearch-template.json
  • Add dynamic rules for string, numeric, and boolean types that do not have a type specifier to disable the default mapping and add a copy_to that adds a prefix/suffix showing that this is a top level literal type.
  • Add a dynamic rule for objects that do not have a type specifier to disable the default mapping and add a copy_to that adds a prefix/suffix showing that this is a top level object type.
  • Update current dynamic string, numeric, and boolean type mappings to only match for fields that have a prefix/suffix that indicates that the field has been typed.

This should get us out of almost all collisions. I think we will still potentially have problems with any log value that can contain an array of mixed literal and object values. An example would be something like PHP logging [args => func_get_args()]. Elasticsearch is going to reject documents with mixed literal and object values in an array because it only looks at the type of the first array element to type the field.
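A rough sketch of the dynamic-template shape the proposal above implies. The rule names are invented, and whether an unindexed string mapping really tolerates a later object value, and whether the {name} placeholder works inside copy_to on our ES 2.x version, are exactly the details that would need testing:

    "dynamic_templates": [
      {"typed_strings": {
        "match": "*_str",
        "match_mapping_type": "string",
        "mapping": {"type": "string", "index": "not_analyzed"}
      }},
      {"untyped_literals": {
        "match_mapping_type": "string",
        "mapping": {"type": "string", "index": "no", "copy_to": "{name}_str"}
      }},
      {"untyped_objects": {
        "match_mapping_type": "object",
        "mapping": {"type": "object", "enabled": false}
      }}
    ]

The intent is that a bare field like channel is not indexed directly but copied into channel_str, which always carries a literal value, while bare objects are stored but left unmapped; only the suffixed names would ever get searchable typed mappings.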

@EBernhardson mentioned an idea on IRC that we could write a Logstash plugin to do the hungarian notation step. This might also allow us to come up with some hack to take care of the problem of an array field (e.g. args) that could contain mixed literal and object values.

The least disruptive thing for existing reports would probably be to leave string value keys unaltered and add a prefix/suffix for values that are integer, float, boolean, or object. So if a log message looked like {"foo":"...", "bar":1, "baz":2.345, "zyzzy":{"a":"..."}} we could munge the key names to actually index {"foo":"...", "bar_int":1, "baz_float":2.345, "zyzzy_obj":{"a":"..."}}.

Alternatively we could treat everything as an object with typed member values (e.g. {"foo":{"str":"..."}, "bar":{"str":"1", "int":1}, "baz":{"str":"2.345", "float":2.345}, "zyzzy":{"a":{"str":"..."}}}). I think this would fix the mixed args array problem, but at the same time it would require nearly every dashboard to be updated. We could whitelist a core set of fields (@timestamp, type, channel, message, ...) to reduce the number of dashboards that were broken by this.

Change 320441 abandoned by BryanDavis:
logstash: dynamically rename object values

https://gerrit.wikimedia.org/r/320441

Change 388038 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Ignore args when getting backtraces for logging.

https://gerrit.wikimedia.org/r/388038

Change 388079 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] [Logging] Enable logstash logging for EventBus

https://gerrit.wikimedia.org/r/388079

Change 388038 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Ignore args when getting backtraces for logging.

https://gerrit.wikimedia.org/r/388038

Change 388399 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@wmf/1.31.0-wmf.6] Ignore args when getting backtraces for logging.

https://gerrit.wikimedia.org/r/388399

Change 388399 merged by jenkins-bot:
[mediawiki/extensions/EventBus@wmf/1.31.0-wmf.6] Ignore args when getting backtraces for logging.

https://gerrit.wikimedia.org/r/388399

Mentioned in SAL (#wikimedia-operations) [2017-11-06T14:13:16Z] <mobrovac@tin> Synchronized php-1.31.0-wmf.6/extensions/EventBus/EventBus.php: EventBus - Improve logging for T150106 (duration: 00m 48s)

Change 388079 merged by jenkins-bot:
[operations/mediawiki-config@master] [Logging] Enable logstash logging for EventBus

https://gerrit.wikimedia.org/r/388079

Mentioned in SAL (#wikimedia-operations) [2017-11-06T14:18:42Z] <mobrovac@tin> Synchronized wmf-config/InitialiseSettings.php: EventBus: Enable logging - T150106 (duration: 00m 47s)

We're now alerting when Logstash index failures happen (T236343), so I'm boldly resolving this task; follow-up will happen on Wikimedia-Logstash as issues come up.

fgiunchedi claimed this task.