In this Remark, W seems to be addressing (implicitly) the scheme-content distinction, i.e., the idea that the world presents itself to an organism as an undifferentiated stream of sensory stimulation (AKA content or "intuitions") which - in order to be able to formulate a response - the organism needs to be able to segment according to some rules (AKA a conceptual scheme). A language constitutes such a conceptual scheme, one in which a segment is identified with a word of the language.
From that perspective, this Remark attributes to the Augustinian methodology the assumption that the pre-lingual child - like a visitor to a country in which a foreign language is spoken - already has a conceptual scheme in place (in the case of the visitor, a language) and needs only to learn segment-word associations (in the case of the visitor, replacing the words in existing associations with words from the foreign language). This assumption is a version of the Given. Sellars refutes it, arguing that a scheme must also be learned and is a "linguistic affair". (I interpret that phrase as meaning not that the child can have a conceptual scheme only after learning a language but that the learning of the two is intertwined.)
The early associations are between patterns of visual sensory input and aural sensory input (perhaps the "determinate sense repeatables" of section 26 of Sellars' "Empiricism and the Philosophy of Mind"). In the earliest stages of learning, segments of the visual stream will correspond (ideally) to simple objects and primary colors, and segments of the aural stream will be words uttered by a teacher - i.e., linguistic entities (of course, not yet recognized as such by the child). Later, the child will learn to respond to recognizable segments of the visual input stream by parroting the associated segments of the aural input stream. In time (apparently at roughly age four) the child will begin to exhibit an ability that can reasonably be described as "talking" as opposed to merely parroting learned responses, at which point primitive versions of both a language and an associated conceptual scheme will have emerged.
In the early stages of learning to associate repeatable segments of the visual and aural input streams, there is potential ambiguity if a visual input stream segment is due to viewing a composite rather than a simple - a problem W addresses in several other Remarks. Ambiguity is reduced by structuring the teaching environment so that it is as easy as possible for the child to associate an aural sensory input segment (i.e., an uttered word) with the visual sensory input segment intended by the teacher.